` traffic classification based on machine learning

`

Traffic Classification based on Machine Learning using Flow-level Information

Jong Gun Lee ([email protected]), Advanced Networking Lab.

`

Table of Contents

• Motivation of this work

• Background about machine learning

• Our approach using machine learning

• Experiment (dataset and result)

• Conclusion

`

Motivation

• We cannot effectively classify the traffic of some newly emerging applications, such as online games and streaming applications, because they offer no application-identifying information, such as a port number or a common byte sequence in the payload

We propose a methodology to classify Internet traffic with supervised and unsupervised learning

`

Basic Terminologies of Machine Learning

• Classifier is a mapping from unlabeled instances to classes

• Instance is a single object of the world

• Attribute is a quantity describing an instance

• Feature is the specification of an attribute and its value

• Feature vector is a list of features describing an instance

`

Unsupervised and Supervised Learning

• Supervised learning (with answer/teacher)

– With a training set, a classifier learns the characteristics of each class. When a new instance arrives, the classifier predicts the class of that instance.

• Unsupervised learning (without answer/teacher)

– With only a set of data (feature vectors), a classifier builds a set of clusters.

`

K-Means

• One of the unsupervised learning methods

• K is the number of clusters; it is given as an initial parameter

• Procedure

– First, the classifier randomly chooses K points as the centers of K subspaces

– Second, it divides the overall vector space into K subspaces according to the centers

– Third, it picks K new centers, one for each subspace

– Then it repeats the second and third steps until no center changes, or until every center moves by less than a threshold value
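The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names, the Euclidean distance, the use of the cluster mean as the new center, and the `threshold`, `max_iter`, and `seed` parameters are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(vectors, k, threshold=1e-6, max_iter=100, seed=0):
    """Cluster `vectors` (a list of tuples) into k groups.

    Returns (centers, clusters); clusters[j] holds the vectors that were
    assigned to center j in the last assignment step.
    """
    rng = random.Random(seed)
    # Step 1: pick K distinct instances at random as the initial centers.
    centers = rng.sample(vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every vector to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: squared_distance(v, centers[j]))
            clusters[nearest].append(v)
        # Step 3: the new center of each subspace is the mean of its members
        # (an empty cluster keeps its old center; this is an assumption).
        new_centers = [
            mean_point(members) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        # Stop when no center moved farther than the threshold.
        moved = max(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= threshold ** 2:
            break
    return centers, clusters


# Eight illustrative instances (made-up values) forming two groups, K = 2.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (100, 0), (101, 0), (102, 0), (103, 0)]
centers, clusters = k_means(points, k=2)
```

Because the initial centers are random, K-means can in general settle on different clusterings from different starting points; the data above is chosen so that the two groups are unambiguous.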

`

Example of K-Means

• Number of instances: 8, K = 2

`

Overall Process of Our Method

[Diagram: N packets → Feature Extraction → N feature vectors → Unsupervised Learning → K clusters → Supervised Learning → Classifier (classification method)]

`

Flow-level Feature Information

• Protocol number: 6 (TCP) or 17 (UDP)

• Duration: seconds

• Number of packets per second (PPS)

• Mean size of all packets

• Mean size of non-ACK packets

• Rate of ACK packets

• Interaction Information
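As a rough illustration of how the scalar features in this list could be computed for one flow, consider the sketch below. The `Packet` record and the `flow_features` function are hypothetical names introduced for this example. It also assumes that "rate of ACK packets" means the fraction of packets that are ACKs, and it leaves out the Interaction Information feature.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Packet:
    """Hypothetical per-packet record; not a format defined in the slides."""
    timestamp: float  # capture time in seconds
    size: int         # packet size in bytes
    is_ack: bool      # True for an ACK packet


def flow_features(protocol: int, packets: List[Packet]) -> List[float]:
    """Return [protocol, duration, PPS, mean size, mean non-ACK size, ACK rate]."""
    if not packets:
        raise ValueError("a flow needs at least one packet")
    times = [p.timestamp for p in packets]
    duration = max(times) - min(times)
    count = len(packets)
    # PPS is undefined for a zero-length flow; reporting 0.0 is an assumption.
    pps = count / duration if duration > 0 else 0.0
    mean_size = sum(p.size for p in packets) / count
    non_ack_sizes = [p.size for p in packets if not p.is_ack]
    mean_non_ack = sum(non_ack_sizes) / len(non_ack_sizes) if non_ack_sizes else 0.0
    # Assumed meaning of "rate of ACK packets": ACK packets / all packets.
    ack_rate = sum(1 for p in packets if p.is_ack) / count
    return [float(protocol), duration, pps, mean_size, mean_non_ack, ack_rate]


# Made-up TCP flow (protocol number 6) with four packets.
example = [
    Packet(0.0, 40, True),
    Packet(1.0, 1500, False),
    Packet(2.0, 40, True),
    Packet(4.0, 500, False),
]
features = flow_features(6, example)
```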

`

Feature Extraction (Interaction Information)

• Interaction Information

– H: 2-dimensional histogram, 16x16

– p1, p2, p3, …, pn: the sequence of packet sizes of a flow and its partner flow, ordered by timestamp

For i = 1 : n-1
    H[p_i/100][p_(i+1)/100]++

A sequence of packet sizes: 40, 80, 1500, …, 40, 1500

Pair-wise representation: [40, 80], [80, 1500], …, [40, 1500]

Histogram indices (integer division): [40/100, 80/100], [80/100, 1500/100], …, [40/100, 1500/100] = [0, 0], [0, 15], …, [0, 15]
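The histogram loop above translates almost directly into Python. The sketch below is illustrative only: `interaction_histogram` is a hypothetical name, and clamping sizes of 1600 bytes or more into the last bin is an assumption, since the slides do not say how such packets are handled.

```python
def interaction_histogram(sizes, bins=16, bin_width=100):
    """Count consecutive packet-size pairs in a bins x bins histogram."""
    hist = [[0] * bins for _ in range(bins)]
    # Walk over consecutive pairs (p_i, p_(i+1)) of the size sequence.
    for current, following in zip(sizes, sizes[1:]):
        # Integer division maps a size to its bin; oversized packets are
        # clamped into the last bin (assumption).
        row = min(current // bin_width, bins - 1)
        col = min(following // bin_width, bins - 1)
        hist[row][col] += 1
    return hist


# A short made-up sequence; the slide's own sequence is elided in the middle.
hist = interaction_histogram([40, 80, 1500, 40, 1500])
```

Here the four consecutive pairs fall into bins [0, 0], [0, 15], [15, 0], and [0, 15], so `hist[0][15]` ends up as 2. Flattening the 16x16 matrix would give 256 values that can be appended to the other flow-level features.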

`

Guideline

[Diagram: the overall process (Packets → Feature Extraction → N feature vectors → Unsupervised Learning → K clusters → Supervised Learning → Classifier), with a yes/no branch for unknown traffic, annotated with open questions: Rx and Tx, Rx only, or Tx only; number of bins and bin size (dynamic or static); initial ?? packets; effective K estimation; efficient threshold; what kind of learning method; feature extraction]

`

Dataset

File              Full count   Subset count
bittorrent.arff         6412           1500
clubbox.arff            4913           1500
edonkey.arff          101355           1500
fileguri.arff          21060           1500
ftp.arff                 635              0
http.arff             200274           1500
https.arff              3611           1500
melon.arff                22              0
msnp.arff               4986           1500
nateon.arff             1565           1500
nntp.arff                169              0
pop3.arff                 63              0
sayclub.arff             224              0
smtp.arff              40556           1500
ssh.arff                  67              0
total                 385912          13500

`

Sum of Squared Error (SSE)

• How to get SSE

• Number of bins: 8x8

• Number of clusters: 1 to 20
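For reference, the usual definition of SSE for a K-means clustering is given below. The formula shown on the original slide is not recoverable, so this is the textbook form rather than a reproduction of it. The symbols are assumptions introduced here: C_j is the j-th cluster, mu_j is its center, and x ranges over the feature vectors assigned to C_j.

```latex
\mathrm{SSE}(K) = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

SSE never increases for the best clustering as K grows, which is why it is evaluated over a range of K rather than simply minimized.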

`

Fitting of SSE

Y = 1.446e4 * X^(-1.194) + 755.8

`

Estimation of SSE

`

Decrease Rate of SSE

0.1% decrease
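One way to read the 0.1% figure is as a stopping rule: stop increasing K once adding another cluster lowers the fitted SSE by less than 0.1%. The sketch below applies that reading to the fitted curve Y = 1.446e4 * X^(-1.194) + 755.8. The interpretation, the function names, and the `k_max` search limit are assumptions for illustration, not something the slides state.

```python
def fitted_sse(k, a=1.446e4, b=-1.194, c=755.8):
    """Fitted SSE curve, taking X as the number of clusters (assumption)."""
    return a * k ** b + c


def relative_decrease(k):
    """Fractional drop in fitted SSE when going from k - 1 to k clusters."""
    previous, current = fitted_sse(k - 1), fitted_sse(k)
    return (previous - current) / previous


def smallest_k_below(rate=0.001, k_max=1000):
    """Smallest k >= 2 whose relative decrease is below `rate`, or None."""
    for k in range(2, k_max + 1):
        if relative_decrease(k) < rate:
            return k
    return None


chosen_k = smallest_k_below()  # rate=0.001 corresponds to a 0.1% decrease
```

Since the fit was obtained from 1 to 20 clusters, any value of K found far outside that range is an extrapolation and should be treated with caution.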

`

To do list

• Direction

– Rx and Tx, Rx only, and Tx only

• Dynamic bin size

• Initial N packets or all the packets

• Different (un)supervised learning methods

• Different feature extraction methods
