
Post on 02-Dec-2014

Page 1

Traffic Classification based on Machine Learning using Flow-level Information

Jong Gun Lee ([email protected]), Advanced Networking Lab.

Page 2

Table of Contents

• Motivation of this work

• Background about machine learning

• Our approach using machine learning

• Experiment (dataset and result)

• Conclusion

Page 3

Motivation

• We cannot effectively classify the traffic of some newly emerging applications
– such as online games and streaming applications
– because they carry no identifying application information, such as a port number or a common byte sequence in the payload

We propose a methodology to classify Internet traffic with supervised and unsupervised learning.

Page 4

Basic Terminologies of Machine Learning

• A classifier is a mapping from unlabeled instances to classes
• An instance is a single object of the world
• An attribute is a single measurable property of an instance
• A feature is the specification of an attribute and its value
• A feature vector is a list of features describing an instance

Page 5

Unsupervised and Supervised Learning

• Supervised learning (with answer/teacher)
– With a training set, a classifier learns the characteristics of each class. When a new instance arrives, the classifier predicts the class of that instance.
• Unsupervised learning (without answer/teacher)
– With only a set of data (feature vectors), a classifier makes a set of clusters.
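The supervised case can be illustrated with a very small sketch. The nearest-centroid rule used here is only an assumed stand-in chosen for brevity; the slides do not say which supervised learner is used, and the function names are hypothetical.

```python
def train_nearest_centroid(vectors, labels):
    """Learn one mean feature vector (centroid) per class from a labeled training set."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))
    return min(centroids, key=sq_dist)


# Toy training set: two classes in a 2-dimensional feature space.
train_x = [(1, 1), (1, 2), (8, 8), (9, 9)]
train_y = ["a", "a", "b", "b"]
model = train_nearest_centroid(train_x, train_y)
print(predict(model, (2, 1)), predict(model, (7, 9)))
```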

Page 6

K-Means

• One of the unsupervised learning methods
• The value K is the number of clusters and is given as an initial parameter
• Procedure
– First, the classifier randomly chooses K points as the centers of K subspaces
– Second, it divides the overall vector space into K subspaces according to the centers
– Third, it picks a new center for each of the K subspaces
– Then it repeats the second and third steps until no center changes, or until every center moves by less than a threshold value
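The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the experiments: it assumes Euclidean distance, takes the new center of a subspace to be the mean of its points, and keeps the old center if a subspace becomes empty. The names `kmeans`, `threshold`, and `max_iter` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, threshold=1e-6, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly choose K of the points as the initial centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # assumption: keep an empty cluster's center
        # Stop once no center has moved by more than the threshold.
        moved = max(sq_dist(a, b) ** 0.5 for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= threshold:
            break
    return centers, labels


# Eight instances, K = 2 (toy data, not the data from the slides).
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, labels = kmeans(data, 2)
print(centers, labels)
```

The result depends on the random initial centers, so in practice the algorithm is often run several times with different seeds.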

Page 7

Example of K-Means

• Number of instances: 8, K = 2

Page 8

Overall Process of Our Method

[Diagram: N packets → Feature Extraction → N feature vectors → Unsupervised Learning → K clusters → Supervised Learning → Classifier (Classification Method)]

Page 9

Flow-level Feature Information

• Protocol number: 6 (TCP) or 17 (UDP)
• Duration: seconds
• Number of packets per second (PPS)
• Mean size of all packets
• Mean size of non-ACK packets
• Rate of ACK packets
• Interaction Information
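As a rough illustration, the scalar features in this list could be computed from the packets of one flow as follows. This is a sketch under stated assumptions: the `Packet` record and its fields are hypothetical, an "ACK packet" is assumed to mean a TCP ACK carrying no payload, and PPS is set to 0 for a flow of zero duration. The interaction information is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    timestamp: float   # capture time in seconds
    size: int          # packet size in bytes
    is_pure_ack: bool  # assumption: TCP ACK with no payload


def flow_features(protocol, packets):
    """Return the scalar flow-level features for a non-empty list of packets."""
    n = len(packets)
    times = [p.timestamp for p in packets]
    duration = max(times) - min(times)
    non_ack_sizes = [p.size for p in packets if not p.is_pure_ack]
    return {
        "protocol": protocol,  # 6 (TCP) or 17 (UDP)
        "duration": duration,
        "pps": n / duration if duration > 0 else 0.0,
        "mean_size": sum(p.size for p in packets) / n,
        "mean_non_ack_size": (sum(non_ack_sizes) / len(non_ack_sizes)
                              if non_ack_sizes else 0.0),
        "ack_rate": (n - len(non_ack_sizes)) / n,
    }


flow = [Packet(0.0, 40, True), Packet(1.0, 1500, False),
        Packet(1.5, 40, True), Packet(2.0, 1000, False)]
print(flow_features(6, flow))
```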

Page 10

Feature Extraction (Interaction Information)

• Interaction Information
– H: 2-dimensional histogram, 16x16
– p1, p2, p3, …, pn
• the sequence of packet sizes of a flow and its partner flow, ordered by timestamp

For i = 1 : n-1
    H[p(i)/100][p(i+1)/100]++    (integer division)

A sequence of packet sizes: 40, 80, 1500, …, 40, 1500

Pair-wise representation: [40, 80], [80, 1500], …, [40, 1500]

Histogram indices: [40/100, 80/100], [80/100, 1500/100], …, [40/100, 1500/100] = [0, 0], [0, 15], …, [0, 15]
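The histogram construction can be written out as a short runnable sketch. The bin width of 100 bytes and the 16x16 shape come from the slide; clamping sizes of 1600 bytes or more into the last bin is an added assumption, and the function name is hypothetical.

```python
def interaction_histogram(sizes, bins=16, bin_width=100):
    """Count consecutive packet-size pairs in a bins x bins histogram."""
    hist = [[0] * bins for _ in range(bins)]
    for current, following in zip(sizes, sizes[1:]):
        # Integer division maps a size to its bin; oversized packets are
        # clamped into the last bin (assumption, not stated on the slide).
        row = min(current // bin_width, bins - 1)
        col = min(following // bin_width, bins - 1)
        hist[row][col] += 1
    return hist


# Packet sizes of a flow and its partner flow, ordered by timestamp (toy values).
h = interaction_histogram([40, 80, 1500, 40, 1500])
print(h[0][0], h[0][15], h[15][0])
```

Flattening the 16x16 histogram row by row gives 256 values that can be appended to the other flow-level features to form the feature vector.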

Page 11

Guideline

[Diagram: Packets → Feature Extraction → N feature vectors → Unsupervised Learning → K clusters → Supervised Learning → Classifier, with a yes/no decision branch and an "Unknown Traffic" outcome]

Annotations on the diagram:
• Rx and Tx, Rx only, Tx only
• Number of bins, bin size, dynamic/static
• Initial ?? packets
• Effective K estimation
• Efficient threshold
• What kind of learning method
• Feature extraction

Page 12

Dataset

• 6412 bittorrent.arff
• 4913 clubbox.arff
• 101355 edonkey.arff
• 21060 fileguri.arff
• 635 ftp.arff
• 200274 http.arff
• 3611 https.arff
• 22 melon.arff
• 4986 msnp.arff
• 1565 nateon.arff
• 169 nntp.arff
• 63 pop3.arff
• 224 sayclub.arff
• 40556 smtp.arff
• 67 ssh.arff
• 385912 total

• 1500 bittorrent.arff
• 1500 clubbox.arff
• 1500 edonkey.arff
• 1500 fileguri.arff
• 0 ftp.arff
• 1500 http.arff
• 1500 https.arff
• 0 melon.arff
• 1500 msnp.arff
• 1500 nateon.arff
• 0 nntp.arff
• 0 pop3.arff
• 0 sayclub.arff
• 1500 smtp.arff
• 0 ssh.arff
• 13500 total


Page 15

Sum of Squared Error (SSE)

• How to get SSE

• Number of bins: 8*8
• Number of clusters: 1~20
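The formula behind "How to get SSE" is not preserved in this transcript. The sketch below assumes the usual K-means definition: the sum, over all instances, of the squared Euclidean distance between the instance and the center of the cluster it is assigned to. The function name and arguments are hypothetical.

```python
def sum_of_squared_error(points, centers, labels):
    """SSE = sum over instances of squared distance to the assigned cluster center."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


# Toy example: two clusters on a line, each point 1 unit from its center.
points = [(0, 0), (2, 0), (10, 0), (12, 0)]
centers = [(1, 0), (11, 0)]
labels = [0, 0, 1, 1]
print(sum_of_squared_error(points, centers, labels))
```

Computing this value for each number of clusters in the tested range gives the SSE-versus-K curve that the following slides fit.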

Page 16

Fitting of SSE

Y = 1.446e4 * X^(-1.194) + 755.8

Page 17

Estimation of SSE

Page 18

Decrease Rate of SSE

0.1% decrease
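One way to read this slide is as a stopping rule on the fitted SSE curve Y = 1.446e4 * X^(-1.194) + 755.8 from the earlier slide. The sketch below rests on assumptions that the transcript does not confirm: X is taken to be the number of clusters K, Y the SSE, and the "decrease rate" the relative drop in fitted SSE when going from K to K + 1 clusters. The function names and the `k_max` search limit are hypothetical.

```python
# Coefficients of the fitted curve Y = A * X^B + C.
A, B, C = 1.446e4, -1.194, 755.8


def fitted_sse(k):
    """Fitted SSE for k clusters."""
    return A * k ** B + C


def decrease_rate(k):
    """Relative drop in fitted SSE from k to k + 1 clusters (assumed definition)."""
    return (fitted_sse(k) - fitted_sse(k + 1)) / fitted_sse(k)


def first_k_below(threshold, k_max=1000):
    """Smallest k whose decrease rate falls below the threshold, or None."""
    for k in range(1, k_max + 1):
        if decrease_rate(k) < threshold:
            return k
    return None


print(first_k_below(0.001))  # 0.1% threshold
```

A different definition of the decrease rate, for example one relative to the SSE at K = 1, would give a different K, so the result should be treated as illustrative only.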

Page 19

To do list

• Direction
– Rx and Tx, Rx only, and Tx only
• Dynamic bin size
• Initial N packets or all the packets
• Different (un)supervised learning methods
• Different feature extraction methods