IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, MANUSCRIPT ID 1
Internet Traffic Classification Using Constrained Clustering
Yu Wang*, Yang Xiang, Senior Member, IEEE, Jun Zhang, Wanlei Zhou, Senior Member, IEEE, Guiyi Wei, and Laurence T. Yang, Senior Member, IEEE
Abstract—Statistics-based Internet traffic classification using machine learning techniques has attracted extensive research interest lately because of the increasing ineffectiveness of traditional port-based and payload-based approaches. In particular, unsupervised learning, i.e., traffic clustering, is very important in real-life applications, where labelled training data are difficult to obtain and new patterns keep emerging. Although previous studies have applied classic clustering algorithms such as K-Means and EM to the task, the quality of the resulting traffic clusters was far from satisfactory. To improve the accuracy of traffic clustering, we propose a constrained clustering scheme that makes decisions by considering background information in addition to the observed traffic statistics. Specifically, we make use of equivalence set constraints indicating that particular sets of flows use the same application layer protocol, which can be efficiently inferred from packet headers according to background knowledge of TCP/IP networking. We model the observed data and constraints using a Gaussian mixture density and adapt an approximate algorithm for the maximum likelihood estimation of the model parameters. Moreover, we study the effects of unsupervised feature discretization on traffic clustering by using a fundamental binning method. A number of real-world Internet traffic traces have been used in our evaluation, and the results show that the proposed approach not only improves the quality of traffic clusters in terms of overall accuracy and per-class metrics, but also speeds up convergence.
Index Terms—Algorithms, Clustering, Machine learning, Traffic analysis, Network security.
1 INTRODUCTION
Digital Object Identifier 10.1109/TPDS.2013.307 1045-9219/13/$31.00 © 2013 IEEE
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
2 RELATED WORK
3 CONSTRAINED TRAFFIC CLUSTERING SCHEME
3.1 Flows, Features, and Classes
TABLE 1 FLOW FEATURE SET
Observation        Statistics                        Feature #
Packets            Number of packets                 2
Bytes              Volume of bytes                   2
Packet Size        Min., Max., Mean and Std. Dev.    8
Inter Packet Time  Min., Max., Mean and Std. Dev.    8
Total                                                20
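As an illustration of how a 20-dimensional vector like the one in Table 1 could be assembled, the following sketch computes the listed statistics once per direction of a bidirectional flow. The per-direction split is an assumption made here because it reproduces the feature counts (2 + 2 + 8 + 8 = 20); the function names and the packet representation (timestamp, size) are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev

def four_stats(values):
    """Return [min, max, mean, std. dev.]; zeros when there is nothing to measure."""
    if not values:
        return [0.0, 0.0, 0.0, 0.0]
    return [float(min(values)), float(max(values)),
            float(mean(values)), float(pstdev(values))]

def direction_features(packets):
    """packets: list of (timestamp_seconds, size_bytes) for ONE direction, in time order."""
    sizes = [size for _, size in packets]
    times = [ts for ts, _ in packets]
    # Inter-packet times are the gaps between consecutive timestamps.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return [float(len(packets)), float(sum(sizes))] + four_stats(sizes) + four_stats(gaps)

def flow_features(forward, backward):
    """Concatenate the 10 per-direction features into one 20-dimensional vector."""
    return direction_features(forward) + direction_features(backward)

# Toy flow: three packets one way, two packets the other way.
fwd = [(0.0, 60), (0.5, 1500), (1.5, 1500)]
bwd = [(0.1, 60), (0.6, 40)]
vector = flow_features(fwd, bwd)
```

The population standard deviation is used here; whether the original feature set uses the population or sample form is not stated in the table.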
3.2 Unsupervised Traffic Clustering
3.3 Background Information in Internet Traffic
3.4 Constrained Traffic Clustering
[Figure 1 diagram blocks: Packet Trace, Feature Extraction, Feature Processing, Unsupervised Learning, Classifier, Clusters, Cluster Labeling, Unlabelled Flow Samples, Cluster-Classifier Transformation, Flow Identification, Equivalence Sets, Constrained Clustering, Set-based Constraints]
Fig. 1. The framework of Internet traffic classification using unsupervised clustering methods and constrained clustering methods (in dotted boxes).
3.5 Feature Processing
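The abstract states that unsupervised feature discretization is studied with a fundamental binning method. A minimal sketch of one such method, equal-width binning, is given below. The choice of equal-width (rather than, say, equal-frequency) binning, the default number of bins, and the function names are assumptions for illustration only.

```python
def equal_width_bins(values, n_bins):
    """Map each value of one feature to a bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; put everything in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def discretize(samples, n_bins=10):
    """Discretize every feature column of a list of feature vectors independently."""
    columns = list(zip(*samples))
    binned = [equal_width_bins(col, n_bins) for col in columns]
    return [list(row) for row in zip(*binned)]

# Toy data: (mean packet size, mean inter-packet time) for three flows.
flows = [[40.0, 0.001], [1500.0, 0.5], [770.0, 2.0]]
discrete = discretize(flows, n_bins=4)
```

No class labels are used to choose the bin boundaries, which is what makes the discretization unsupervised.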
4 CONSTRAINED CLUSTERING ALGORITHM
4.1 Problem Definition
4.2 Constrained Gaussian Mixture Model
4.3 Maximum Likelihood Estimates
4.4 An Approximate Method
4.5 Convergence of SBCK
TABLE 2 SET-BASED CONSTRAINED K-MEANS ALGORITHM
SBCK ( )
begin
  preparation: rearrange into equivalence sets according to the given set-based constraints
  initialization: set the means with random samples
  do
    assignment: classify the samples into cluster where
    update: re-compute the means
  until no change happens
  return final means and cluster assignments
end
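The steps of Table 2 can be sketched in Python as follows. The assignment condition in the table did not survive extraction, so this sketch assumes a common set-based rule: each equivalence set is assigned as a whole to the cluster whose mean minimizes the summed squared Euclidean distance over the set's members. The function and parameter names (`sbck`, `init_means`, `max_iter`, `seed`) and the handling of empty clusters are hypothetical choices, not the authors' implementation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def sbck(equivalence_sets, k, init_means=None, max_iter=100, seed=0):
    """equivalence_sets: list of lists of vectors; each inner list must share one cluster."""
    # Initialization: use the given means, or draw k random samples.
    if init_means is not None:
        means = [list(m) for m in init_means]
    else:
        all_points = [p for s in equivalence_sets for p in s]
        means = [list(p) for p in random.Random(seed).sample(all_points, k)]
    assignment = [None] * len(equivalence_sets)
    for _ in range(max_iter):
        # Assignment (assumed rule): the whole set goes to the cheapest cluster.
        new_assignment = [
            min(range(k), key=lambda j: sum(sq_dist(p, means[j]) for p in s))
            for s in equivalence_sets
        ]
        if new_assignment == assignment:
            break  # "until no change happens"
        assignment = new_assignment
        # Update: re-compute each mean from all samples of the sets assigned to it.
        for j in range(k):
            members = [p for s, a in zip(equivalence_sets, assignment) if a == j for p in s]
            if members:  # an empty cluster keeps its previous mean
                means[j] = centroid(members)
    return means, assignment

# The first set contains an outlier (6, 6) that still follows the rest of its set.
sets = [[[0.0, 0.0], [0.2, 0.1], [6.0, 6.0]], [[0.1, 0.0]],
        [[9.0, 9.0], [9.2, 8.8]], [[8.9, 9.1]]]
means, assignment = sbck(sets, k=2, init_means=[[0.0, 0.0], [9.0, 9.0]])
```

When every equivalence set contains a single sample, this procedure reduces to ordinary K-Means.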
5 DATA SETS
5.1 Packet Traces
5.2 Background Information in Data
TABLE 3 TRAFFIC TRACE
Trace     Date / Length           Network / Link Type        Volume
keio      2006-08-06 / 30 mins    campus / edge              16.99 G
wide-08   2008-03-18 / 5 hours    organization / backbone    197.2 G
wide-09   2009-03-31 / 5 hours    organization / backbone    224.2 G
isp       2010-11-27 / 7 days     ISP / edge                 665.7 G
6 EVALUATIONS
6.1 Evaluating Methodology
20,…, 100, 200,…, 500.
6.2 Clustering Results
TABLE 4 STATISTICS OF BACKGROUND INFORMATION
Trace          Flow # (5-tuple)   Service # (3-tuple)   Service Size (Avg./Max.)   Linked Flow %
keio           170019             8241                  20.63/10461                98.1%
wide-08        2701771            47607                 56.75/89233                98.9%
wide-09        1772781            32539                 53.87/105572               98.9%
isp (2-hour)   165274             11321                 14.60/6373                 96.8%
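The kind of statistics reported in Table 4 could be derived along the following lines. This is a sketch under two assumptions that the surviving text does not confirm: that the service 3-tuple is (destination IP, destination port, transport protocol), and that a flow counts as linked when its equivalence set holds more than one flow. All function names and the example addresses are hypothetical.

```python
from collections import defaultdict

def build_equivalence_sets(flows):
    """Group flow indices by an assumed service key (dst_ip, dst_port, proto).

    flows: iterable of 5-tuples (src_ip, src_port, dst_ip, dst_port, proto).
    """
    sets = defaultdict(list)
    for index, (_src_ip, _src_port, dst_ip, dst_port, proto) in enumerate(flows):
        sets[(dst_ip, dst_port, proto)].append(index)
    return dict(sets)

def background_statistics(sets):
    """Summarize the equivalence sets in the style of Table 4."""
    sizes = [len(members) for members in sets.values()]
    total = sum(sizes)
    # Assumed definition: a flow is linked if it shares its set with another flow.
    linked = sum(size for size in sizes if size > 1)
    return {
        "flows": total,
        "services": len(sizes),
        "avg_size": total / len(sizes),
        "max_size": max(sizes),
        "linked_pct": 100.0 * linked / total,
    }

flows = [
    ("10.0.0.1", 50001, "192.0.2.10", 80, "tcp"),
    ("10.0.0.2", 50002, "192.0.2.10", 80, "tcp"),
    ("10.0.0.1", 50003, "192.0.2.10", 80, "tcp"),
    ("10.0.0.3", 40000, "192.0.2.20", 53, "udp"),
]
sets = build_equivalence_sets(flows)
stats = background_statistics(sets)
```

Only packet header fields are needed to build the sets, so no payload inspection or labelled data is involved.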
Fig. 2. Overall accuracy results of the constrained clustering algorithm SBCK in both continuous and discrete feature space in comparison with K-Means and EM.
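For readers who want to reproduce metrics of this kind, the sketch below computes overall accuracy and per-class F-measure for a clustering result. It assumes that each cluster is labelled with the majority class of its members, which is a common way to turn clusters into a classifier; the exact labelling rule used in the evaluation is not recoverable from the surviving text. For brevity the sketch scores the same flows that are used for labelling, and all names are hypothetical.

```python
from collections import Counter, defaultdict

def label_clusters(cluster_ids, true_labels):
    """Map each cluster to the most frequent true class among its members."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster[cluster][label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}

def evaluate(cluster_ids, true_labels):
    """Return (overall accuracy, {class: F-measure}) under majority labelling."""
    mapping = label_clusters(cluster_ids, true_labels)
    predicted = [mapping[c] for c in cluster_ids]
    pairs = list(zip(predicted, true_labels))
    accuracy = sum(p == y for p, y in pairs) / len(pairs)
    f_measure = {}
    for cls in set(true_labels):
        tp = sum(p == cls and y == cls for p, y in pairs)
        fp = sum(p == cls and y != cls for p, y in pairs)
        fn = sum(p != cls and y == cls for p, y in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F-measure is the harmonic mean of precision and recall.
        f_measure[cls] = (2 * precision * recall / (precision + recall)
                          if precision + recall else 0.0)
    return accuracy, f_measure

clusters = [0, 0, 0, 1, 1, 2]
labels = ["http", "http", "dns", "dns", "dns", "ssh"]
accuracy, f_measure = evaluate(clusters, labels)
```

In this toy case one dns flow falls into the http-dominated cluster, which lowers the precision of http and the recall of dns.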
6.3 Run-time Performance
Fig. 3. Per-class f-measure results obtained in the clustering experiments in isp and keio data sets, with the number of clusters being 100 and 500,
Fig. 4. Number of Iterations.
7 CONCLUSION
Yu Wang received his PhD in Computer Science from Deakin University, Australia. He is currently with the School of Computer Science and Technology, Huazhong University of Science and Technology, China and the School of Information Technology, Deakin University, Australia. His research interests include network traffic modeling and classification, mobile networks, and network security.
Yang Xiang received his PhD in Computer Science from Deakin University, Australia. He is currently with the School of Information Technology, Deakin University. His research interests include network and system security, distributed systems, and networking. In particular, he is currently leading a research group developing active defense systems against large-scale distributed network attacks. He has published more than 100 research papers in many international journals and conferences, such as IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Information Forensics and Security, and IEEE Journal on Selected Areas in Communications. He has published two books, Software Similarity and Classification (Springer) and Dynamic and Advanced Data Mining for Progressing Technological Development (IGI-Global). He has served as the Program/General Chair for many international conferences such as ICA3PP 12/11, IEEE/IFIP EUC 11, IEEE TrustCom 11, IEEE HPCC 10/09, IEEE ICPADS 08, NSS 11/10/09/08/07. He has been a PC member for more than 50 international conferences in distributed systems, networking, and security. He serves as an Associate Editor of IEEE Transactions on Parallel and Distributed Systems and an Editor of Journal of Network and Computer Applications. He is a member of the IEEE.
Jun Zhang received his PhD degree in 2011 from the University of Wollongong, Australia. He is currently with the School of Information Technology at Deakin University, Melbourne, Australia. His research interests include network and system security, pattern recognition, and multimedia processing. He has published more than 20 research papers in international journals and conferences, such as IEEE Transactions on Image Processing, The Computer Journal, and IEEE International Conference on Image Processing. Jun Zhang received the 2009 Chinese government award for outstanding self-financed students abroad.
Wanlei Zhou received his PhD degree in 1991 from the Australian National University, Canberra, Australia, and the DSc degree from Deakin University, Victoria, Australia, in 2002. He is currently the chair professor of Information Technology and the Head of School of Information Technology, Deakin University, Melbourne. His research interests include distributed and parallel systems, network security, mobile computing, bioinformatics, and e-learning. He has published more than 200 papers in refereed international journals and refereed international conference proceedings. Since 1997, he has been involved in more than 50 international conferences as the general chair, a steering chair, a PC chair, a session chair, a publication chair, and a PC member. He is a senior member of the IEEE.
Guiyi Wei received his Ph.D. in December 2006 from Zhejiang University. He has research interests in wireless networks, mobile computing, cloud computing, social networks and network security. He is a full professor of the School of Computer Science and Information Engineering at Zhejiang Gongshang University. He is also the director of the Networking and Distributed Computing Laboratory.
Laurence T. Yang received the BE degree in Computer Science and Technology from Tsinghua University, China and the PhD degree in Computer Science from University of Victoria, Canada. He is a professor in the School of Computer Science and Technology at Huazhong University of Science and Technology, China, and in the Department of Computer Science, St. Francis Xavier University, Canada. His research interests include parallel and distributed computing, embedded and ubiquitous/pervasive computing. His research has been supported by the Natural Sciences and Engineering Research Council, and the Canada Foundation for Innovation.