` traffic classification based on machine learning
`
Traffic Classification based on Machine Learning using Flow-level Information
Jong Gun Lee ([email protected]), Advanced Networking Lab.
`
Table of Contents
• Motivation of this work
• Background about machine learning
• Our approach using machine learning
• Experiment (dataset and result)
• Conclusion
`
Motivation
• We cannot effectively classify the traffic of some newly emerging applications, such as online games and streaming applications, because there is no application information to rely on, such as a port number or a common byte sequence in the payload
• We propose a methodology to classify Internet traffic with supervised and unsupervised learning
`
Basic Terminologies of Machine Learning
• Classifier is a mapping from unlabeled instances to classes
• Instance is a single object of the world
• Attribute is a quantity describing an instance
• Feature is the specification of an attribute and its value
• Feature vector is a list of features describing an instance
`
Unsupervised and Supervised Learning
• Supervised learning (with answer/teacher)
– With a training set, a classifier learns the characteristics of each class. When a new instance arrives, the classifier predicts the class of that instance.
• Unsupervised learning (without answer/teacher)
– Given only a set of data (feature vectors), a classifier groups the data into a set of clusters.
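As a small illustration of the supervised case, the sketch below uses a 1-nearest-neighbor rule. This is only a stand-in to show "learn from a training set, then predict the class of a new instance"; it is not necessarily the classifier used in this work. The feature values and the class names `bulk` and `interactive` are hypothetical.

```python
import math

def predict_1nn(training_set, instance):
    """Return the label of the training instance closest to `instance`."""
    _, nearest_label = min(
        training_set, key=lambda pair: math.dist(pair[0], instance)
    )
    return nearest_label

# Hypothetical training set: (feature vector, class label) pairs.
# The two features stand for mean packet size (bytes) and packets per second.
training_set = [
    ((1400.0, 50.0), "bulk"),
    ((1350.0, 60.0), "bulk"),
    ((90.0, 5.0), "interactive"),
    ((120.0, 8.0), "interactive"),
]

# A new, unlabeled instance is assigned the class of its nearest neighbor.
predicted = predict_1nn(training_set, (100.0, 6.0))
```

In the unsupervised case there are no labels in `training_set`; the learner only sees the feature vectors and has to group them itself.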
`
K-Means
• One of the unsupervised learning methods
• The K value is the number of clusters and is given as the initial parameter
• Procedure
– First, the classifier randomly chooses K points as the centers of K subspaces
– Second, it divides the overall vector space into K subspaces according to the centers
– Third, it picks K new centers, one for each subspace
– Then it repeats the second and third steps until no center changes, or until every center moves by less than the threshold value
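The procedure can be sketched in plain Python as follows. This is a minimal illustration, assuming Euclidean distance and the mean of each subspace as its new center; the function name `k_means`, the `seed` and `max_iterations` parameters, and the eight sample points are made up for the example.

```python
import math
import random

def k_means(points, k, threshold=1e-6, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly choose K points as the initial centers.
    centers = rng.sample(points, k)

    def nearest(point):
        # Index of the center closest to `point`.
        return min(range(k), key=lambda j: math.dist(point, centers[j]))

    for _ in range(max_iterations):
        # Step 2: divide the points into K subspaces by nearest center.
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point)].append(point)
        # Step 3: pick a new center (here, the mean) for each subspace.
        new_centers = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(centers[j])  # empty subspace: keep the old center
        # Stop once every center has moved by no more than the threshold.
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= threshold:
            break
    labels = [nearest(point) for point in points]
    return centers, labels

# Eight hypothetical 2-dimensional instances, clustered with K = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0), (9.0, 9.0)]
centers, labels = k_means(points, k=2)
```

Because the initial centers are random, different seeds can give different cluster numbering, and on less cleanly separated data they can give different clusters.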
`
Example of K-Means
• Number of instances: 8, K = 2
`
Overall Process of Our Method
[Diagram: N packets → Feature Extraction → N feature vectors → Unsupervised Learning → K clusters → Supervised Learning → Classifier (classification method)]
`
Flow-level Feature Information
• Protocol number: 6 (TCP) or 17 (UDP)
• Duration: seconds
• Number of packets per second (PPS)
• Mean size of all packets
• Mean size of non-ACK packets
• Rate of ACK packets
• Interaction information
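A rough sketch of how the scalar features could be computed for one flow is given below. The `Packet` record, the `is_ack` flag, and the dictionary keys are hypothetical. The slide does not define "rate of ACK packets"; the sketch assumes it means the fraction of packets in the flow that are ACKs.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    timestamp: float  # seconds
    size: int         # bytes
    is_ack: bool      # assumed flag marking an ACK packet

def flow_features(protocol, packets):
    """Compute the scalar flow-level features (interaction information excluded)."""
    n = len(packets)
    times = [p.timestamp for p in packets]
    duration = max(times) - min(times)
    non_ack_sizes = [p.size for p in packets if not p.is_ack]
    return {
        "protocol": protocol,  # 6 (TCP) or 17 (UDP)
        "duration": duration,
        # Assumption: a zero-duration flow reports its packet count as PPS.
        "pps": n / duration if duration > 0 else float(n),
        "mean_size": sum(p.size for p in packets) / n,
        "mean_non_ack_size": (
            sum(non_ack_sizes) / len(non_ack_sizes) if non_ack_sizes else 0.0
        ),
        # Assumption: ACK rate is the fraction of packets that are ACKs.
        "ack_rate": (n - len(non_ack_sizes)) / n,
    }

packets = [
    Packet(0.0, 40, True),
    Packet(0.5, 1500, False),
    Packet(1.0, 40, True),
    Packet(2.0, 1500, False),
]
features = flow_features(6, packets)
```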
`
Feature Extraction (Interaction Information)
• Interaction Information
– H: 2-dimensional histogram, 16x16
– p_1, p_2, p_3, …, p_n: the sequence of packet sizes of a flow and its partner flow, ordered by timestamp
– For i = 1 : n-1, H[p_i/100][p_(i+1)/100]++ (integer division)
• Example
– A sequence of packet sizes: 40, 80, 1500, …, 40, 1500
– Pair-wise representation: [40, 80], [80, 1500], …, [40, 1500]
– Histogram indices: [40/100, 80/100], [80/100, 1500/100], …, [40/100, 1500/100] = [0, 0], [0, 15], …, [0, 15]
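The histogram update can be written out as a short Python sketch. The function name `interaction_histogram` and its parameters are illustrative. Clamping sizes of 1600 bytes or more into the last bin is an assumption added here so that the index never leaves the 16x16 grid; the slide does not say how such sizes are handled.

```python
def interaction_histogram(sizes, bins=16, bin_width=100):
    """Count consecutive packet-size pairs in a bins x bins histogram."""
    histogram = [[0] * bins for _ in range(bins)]
    for current, following in zip(sizes, sizes[1:]):
        # Integer division maps a size in bytes to its bin index.
        # Assumption: oversized packets are clamped into the last bin.
        row = min(current // bin_width, bins - 1)
        col = min(following // bin_width, bins - 1)
        histogram[row][col] += 1
    return histogram

# Shortened version of the example sequence (the elided middle is left out).
sizes = [40, 80, 1500, 40, 1500]
H = interaction_histogram(sizes)
```

For this shortened sequence the pairs are [40, 80], [80, 1500], [1500, 40], and [40, 1500], so `H[0][0]` is 1, `H[0][15]` is 2, and `H[15][0]` is 1.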
`
Guideline
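[Diagram: the processing pipeline (Packets, Feature Extraction, N feature vectors, Unsupervised Learning, K clusters, Supervised Learning, Classifier) with a yes/no branch and an Unknown Traffic output, annotated with the design questions below]
• Direction: Rx and Tx, Rx only, or Tx only
• Number of bins and bin size: dynamic or static
• Initial ?? packets
• Effective K estimation
• Efficient threshold
• What kind of learning method and feature extraction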
`
Dataset
• 6412 bittorrent.arff
• 4913 clubbox.arff
• 101355 edonkey.arff
• 21060 fileguri.arff
• 635 ftp.arff
• 200274 http.arff
• 3611 https.arff
• 22 melon.arff
• 4986 msnp.arff
• 1565 nateon.arff
• 169 nntp.arff
• 63 pop3.arff
• 224 sayclub.arff
• 40556 smtp.arff
• 67 ssh.arff
• 385912 total

• 1500 bittorrent.arff
• 1500 clubbox.arff
• 1500 edonkey.arff
• 1500 fileguri.arff
• 0 ftp.arff
• 1500 http.arff
• 1500 https.arff
• 0 melon.arff
• 1500 msnp.arff
• 1500 nateon.arff
• 0 nntp.arff
• 0 pop3.arff
• 0 sayclub.arff
• 1500 smtp.arff
• 0 ssh.arff
• 13500 total
`
Sum of Squared Error (SSE)
• How to get SSE
• Number of bins: 8 x 8
• Number of clusters: 1 to 20
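The formula on this slide did not survive extraction. The sketch below uses the usual definition of SSE for a clustering, assumed here: the sum, over all instances, of the squared Euclidean distance between the instance and the center of its cluster. The function name and sample data are illustrative.

```python
import math

def sum_of_squared_error(points, centers, labels):
    """SSE: squared distance from each point to its assigned center, summed."""
    return sum(
        math.dist(point, centers[label]) ** 2
        for point, label in zip(points, labels)
    )

# Four hypothetical 2-dimensional points in two clusters.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centers = [(1.0, 0.0), (11.0, 0.0)]
labels = [0, 0, 1, 1]
sse = sum_of_squared_error(points, centers, labels)
```

Every point here lies at distance 1 from its center, so the SSE is 4. Computing this value for each number of clusters from 1 to 20 gives one curve of SSE against K.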
`
Fitting of SSE
Y = 1.446e4 * X^(-1.194) + 755.8
`
Estimation of SSE
`
Decrease Rate of SSE
0.1% decrease
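One way to read the 0.1% figure is as a stopping rule: increase K until adding one more cluster lowers the SSE by less than 0.1%. The sketch below applies that reading to the fitted curve from the "Fitting of SSE" slide. Both the relative definition of the decrease rate and its use as a stopping rule are assumptions, and the function names are illustrative.

```python
def fitted_sse(k):
    """Fitted curve from the slides: Y = 1.446e4 * X^(-1.194) + 755.8."""
    return 1.446e4 * k ** (-1.194) + 755.8

def decrease_rate(k):
    """Relative drop in fitted SSE when going from k to k + 1 clusters."""
    return (fitted_sse(k) - fitted_sse(k + 1)) / fitted_sse(k)

def smallest_k_below(threshold=0.001, max_k=10_000):
    """Smallest k whose decrease rate is under the threshold (0.1% by default)."""
    for k in range(1, max_k + 1):
        if decrease_rate(k) < threshold:
            return k
    return None  # threshold not reached within max_k

k_estimate = smallest_k_below()
```

Since the fitted curve flattens toward 755.8, the decrease rate shrinks as K grows, and `k_estimate` marks the point past which extra clusters buy very little under this reading.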
`
To do list
• Direction
– Rx and Tx, Rx only, and Tx only
• Dynamic bin size
• Initial N packets or all the packets
• Different (un)supervised learning methods
• Different feature extraction methods