Vladimir Miheev
Alexei Vopilov
Ivan Shabalin
The MP13 method is best summarized as recognition based on voting decision
trees using "pipes" in potential space.
We did not have enough hardware capacity to work with all of the learning samples. Hence, we prepared our own learning dataset as a 10% sample of the given training database (about 400,000 entries). Because the task description noted that the training and testing datasets have different distributions, we randomly removed some of the DOS and "normal" connections from the full training database. The resulting reduced dataset was used for all further manipulations.
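The class-dependent reduction described above can be sketched as follows. This is a minimal illustration, not the procedure we actually ran: the function name downsample, the keep fractions, the toy class counts, and the class names other than DOS and "normal" are all assumptions made for the example.

```python
import random
from collections import Counter

def downsample(records, keep_fraction, seed=0):
    """Keep each record with a probability that depends on its class label.

    records: list of (features, label) pairs.
    keep_fraction: dict mapping label -> probability of keeping a record;
                   labels that are not listed are always kept.
    """
    rng = random.Random(seed)
    return [r for r in records if rng.random() < keep_fraction.get(r[1], 1.0)]

# Toy stand-in for the training database (hypothetical class counts).
records = ([((i,), "dos") for i in range(8000)]
           + [((i,), "normal") for i in range(1900)]
           + [((i,), "probe") for i in range(80)]
           + [((i,), "r2l") for i in range(15)]
           + [((i,), "u2r") for i in range(5)])

# Hypothetical keep fractions: thin out only the two dominant classes.
reduced = downsample(records, {"dos": 0.1, "normal": 0.3})
counts = Counter(label for _, label in reduced)
print(counts)
```

Only the two over-represented classes are thinned; the rare classes pass through unchanged, which shifts the class proportions of the learning set.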
Next, we proceeded with learning based on the "one against the rest" principle. Classes were separated from each other using five sets of decision trees. Each connection was assigned a vector of proximity to the five classes. Each vector component was calculated as a sum of weighted votes from one set of decision trees. This representation forms a so-called "potential space," while a multidimensional interval on proximity vectors is called a "pipe."
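To make the terms concrete, here is a small sketch of a proximity vector and a pipe membership test. The decision trees are replaced by hand-written one-line rules, and the weights, thresholds, pipe bounds, and class names other than DOS and "normal" are hypothetical values chosen only for illustration.

```python
CLASSES = ["normal", "probe", "dos", "u2r", "r2l"]

def proximity_vector(x, tree_sets):
    """Return one component per class: the sum of the weights of the trees
    in that class's set that vote "this class" for connection x.

    tree_sets: dict mapping class -> list of (weight, tree), where tree(x)
               returns True when the tree votes for the class.
    """
    return [sum(w for w, tree in tree_sets[c] if tree(x)) for c in CLASSES]

def in_pipe(vec, pipe):
    """A pipe is a multidimensional interval: one (low, high) pair per component."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(vec, pipe))

# Stand-ins for trained trees: simple rules on a 2-feature connection (hypothetical).
tree_sets = {
    "normal": [(0.5, lambda x: x[0] < 10), (0.5, lambda x: x[1] < 5)],
    "probe":  [(1.0, lambda x: x[1] >= 5)],
    "dos":    [(0.75, lambda x: x[0] >= 10), (0.25, lambda x: x[0] >= 100)],
    "u2r":    [(1.0, lambda x: False)],
    "r2l":    [(1.0, lambda x: False)],
}

x = (250, 0)                      # one connection in the original feature space
vec = proximity_vector(x, tree_sets)
print(vec)                        # -> [0.5, 0, 1.0, 0, 0]

# Hypothetical pipe for "confidently DOS": high DOS proximity, low elsewhere.
dos_pipe = [(0, 0.5), (0, 0.25), (0.75, 1.0), (0, 0.25), (0, 0.25)]
print(in_pipe(vec, dos_pipe))     # -> True
```

A connection whose proximity vector falls inside no pipe would count as unrecognized in this picture.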
Next, we converted the testing dataset into the potential space representation.
We obtained two types of connections in this space: well-recognized and
unrecognized. Some more work was done to improve recognition. First,
expert knowledge made it possible to construct some simple but quite stable
verbal rules. Also, a second echelon (see Section 1) of decision
trees was constructed to separate unrecognized pairs of classes.
For constructing a decision tree, training data is split into learning
and testing samples. The learning sample is used to find the
structure of a tree (with sufficient complexity) and to generate a
hierarchy of models on this tree. The testing sample is used to select
a subtree having optimal complexity.
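
The following sketch illustrates the idea on toy data: a deliberately deep tree is grown on the learning sample, and the testing sample then picks the best member of a nested family of subtrees. Truncating the tree at a depth limit is an assumption made here to obtain that family; the greedy misclassification split, the data, and all names are likewise illustrative rather than the actual algorithm.

```python
import random
from collections import Counter

def majority_count(rows):
    return Counter(y for _, y in rows).most_common(1)[0]

def grow(rows, depth=0, max_depth=6):
    """Grow a tree on rows = [(x, y), ...]; every node stores its majority label."""
    label, _ = majority_count(rows)
    node = {"label": label}
    if depth == max_depth or len({y for _, y in rows}) == 1:
        return node
    best = None
    for f in range(len(rows[0][0])):
        # Skipping the smallest value guarantees both sides are non-empty.
        for t in sorted({x[f] for x, _ in rows})[1:]:
            left = [r for r in rows if r[0][f] < t]
            right = [r for r in rows if r[0][f] >= t]
            # Errors made if each side predicts its own majority label.
            err = sum(len(s) - majority_count(s)[1] for s in (left, right))
            if best is None or err < best[0]:
                best = (err, f, t, left, right)
    if best is None:
        return node
    _, f, t, left, right = best
    node.update(feature=f, threshold=t,
                left=grow(left, depth + 1, max_depth),
                right=grow(right, depth + 1, max_depth))
    return node

def predict(node, x, depth_limit):
    """Descend at most depth_limit levels, then answer with that node's label."""
    while depth_limit > 0 and "feature" in node:
        node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
        depth_limit -= 1
    return node["label"]

def error(tree, rows, depth_limit):
    return sum(predict(tree, x, depth_limit) != y for x, y in rows) / len(rows)

# Toy data: the label depends on the first feature, with 15% label noise.
rng = random.Random(0)
data = []
for _ in range(300):
    x = (rng.random(), rng.random())
    y = int(x[0] > 0.5)
    if rng.random() < 0.15:
        y = 1 - y
    data.append((x, y))

learning, testing = data[:200], data[200:]
tree = grow(learning)                                  # structure from the learning sample
errors = {d: error(tree, testing, d) for d in range(7)}
best_depth = min(errors, key=errors.get)               # complexity chosen on the testing sample
print(best_depth, errors[best_depth])
```

Because the full tree also fits the label noise, the subtree selected on the testing sample is typically shallower than the full tree.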
Repeated application of the algorithm to various splits of the training
data into different subsamples allows us to generate a set of voting decision
trees. Some heuristic approaches are used to make the trees independent.
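
A rough sketch of generating voters from repeated splits is given below for a single "one against the rest" problem (label 1 for the target class, 0 for the rest). Several simplifications are assumptions of this example: one-threshold stumps stand in for full decision trees, a random feature subset per voter stands in for the unspecified independence heuristics, and each voter's weight is taken to be its accuracy on the held-out part of its split.

```python
import random

def fit_stump(rows, features):
    """Best one-threshold rule over the given features.
    The rule predicts `polarity` when x[f] >= t, and 1 - polarity otherwise."""
    best = None
    for f in features:
        for t in sorted({x[f] for x, _ in rows}):
            for polarity in (0, 1):
                err = sum((polarity if x[f] >= t else 1 - polarity) != y
                          for x, y in rows)
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best[1:]

def stump_predict(stump, x):
    f, t, polarity = stump
    return polarity if x[f] >= t else 1 - polarity

def build_ensemble(train, n_trees=15, seed=0):
    rng = random.Random(seed)
    n_features = len(train[0][0])
    ensemble = []
    for _ in range(n_trees):
        rows = train[:]
        rng.shuffle(rows)                          # a fresh split on every repetition
        cut = 2 * len(rows) // 3
        learning, testing = rows[:cut], rows[cut:]
        # Assumed heuristic: each voter sees only a random half of the features.
        features = rng.sample(range(n_features), k=max(1, n_features // 2))
        stump = fit_stump(learning, features)
        # Assumed weighting: accuracy on the held-out testing part of the split.
        weight = sum(stump_predict(stump, x) == y for x, y in testing) / len(testing)
        ensemble.append((weight, stump))
    return ensemble

def vote(ensemble, x):
    """Weighted vote: positive total score means the target class."""
    score = sum(w if stump_predict(s, x) == 1 else -w for w, s in ensemble)
    return int(score > 0)

# Toy data: 4 features, only the first two carry signal.
rng = random.Random(1)
train = []
for _ in range(150):
    x = tuple(rng.random() for _ in range(4))
    train.append((x, int(x[0] + x[1] > 1)))

ensemble = build_ensemble(train)
accuracy = sum(vote(ensemble, x) == y for x, y in train) / len(train)
print(len(ensemble), accuracy)
```

Summing the signed weights of one such ensemble yields one component of the proximity vector used in the potential space.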
Special thanks to V. Pereversev-Orlov, our longtime guru in KDD science
and the original inventor of the approach used.