
Third International Workshop on Advanced Computational Intelligence, August 25-27, 2010, Suzhou, Jiangsu, China

Exploring a Swarm Intelligence Methodology to Identify Command and Control Flow

Yan Zhang, Y. Wang, and L. Qi

Abstract-Botnets pose a significant threat to the Internet today. Reactive techniques that try to detect an attack and throttle malicious traffic prevail but do not seem very effective. In this paper we present an approach to botnet detection based on swarm intelligence. Specifically, particle swarm optimization (PSO), a robust stochastic evolutionary algorithm modeled on the movement and intelligence of swarms, is applied to track remote control activity, namely command and control (C&C). A few papers in the literature use PSO for optimization problems, but none shows its effectiveness on this problem. PSO is therefore examined here for the identification of C&C flows. Compared with other classification techniques, PSO achieves high accuracy.

I. INTRODUCTION

Swarm intelligence optimization algorithms provide a foundation for finding complex distributed solutions to a problem in the absence of centralized control and without a global model of the situation. Particle Swarm Optimization (PSO) is a heuristic technique for searching for optimal solutions that is strongly based on the concept of swarm intelligence. In this paper, we propose the use of PSO as a new tool for detecting botnet command and control flows.

A. Background

Malicious code (or malware) has become one of the most pressing security problems on the Internet. This is particularly true of botnets [1], networks of compromised computers used for nefarious purposes. They are significant contributors to malicious and criminal activity on the Internet today and, more importantly, form an underground network whose size and scope are not fully known. The information security community does know that botnets are a major source of Internet-scale problems, including host scans, exploit attempts and attacks, spam [2], and theft of personal data such as mail accounts or bank credentials [3, 4]. Botnets are also a common tool for conducting distributed denial of service (DDoS) attacks [5] because of the immense aggregate bandwidth they command. Given the large number of home users on broadband Internet, a few hundred cable modems can easily flood even a medium-sized host off the Internet. With some botnets as large as tens of thousands of machines, few hosts could withstand a sustained assault without severe degradation of service.

Once infected, the victim host creates a communication and control channel to the attacker, which is used to direct upwards of ten thousand compromised computers. This channel is called the Command and Control (C&C) channel. The attacker controls bots through this channel to execute distributed attacks and other activities. A bot stays dormant in the compromised host until it receives a command from the attacker. This makes bots harder to detect than other malicious code, such as viruses and worms.

Manuscript received March 31, 2010. The authors are with the Electronic Engineering Institute.

978-1-4244-6337-4/10/$26.00 ©2010 IEEE

B. Related Work

The most prevalent botnets are IRC botnets [6], which have a centralized architecture. These botnets are usually very large and powerful, consisting of thousands of bots [7].

A number of different IRC botnet detection techniques have been proposed recently. Goebel and Holz [8] detect bot infected machines by matching IRC nickname similarity. This is an effective detection technique, but requires payload for analysis. Dewes et al. [9] propose a scheme for identifying chat traffic. Their approach relies on a combination of discriminating criteria, including service port number, packet size distribution, and packet content. Sen et al. [10] use a signature-based scheme to discern traffic. Their approach relies on identifying particular characteristics in the syntax of packet payloads exchanged as part of the operation of the particular applications. The recent trend toward using non-standard ports and encryption may reduce the effectiveness or, even, prevent the use of these techniques.

Others [11-14] have proposed approaches that use statistical techniques to characterize and classify traffic streams. Karasaridis et al. [11] use statistical properties of bot Command and Control (C&C) traffic to detect IRC botnet traffic and the C&C controller. Livadas et al. [12] apply data mining to detect C&C traffic passing through IRC channels. Roughan et al. [13] use traffic classification to identify the Class of Service (CoS) of traffic streams, which enables the on-the-fly provision of distinct levels of Quality of Service (QoS). The authors classify flows using a multitude of traffic statistics that pertain to packet, flow, connection, intra-flow, intra-connection, or multi-flow characteristics. They also investigate the effectiveness of using average packet size, RMS packet size, and average flow duration to discriminate among flows. Given these characteristics, simple classification schemes produced very accurate traffic flow classification.

In a similar approach, Moore and Zuev [14] apply variants of the naive Bayesian classification scheme to classify flows into 10 distinct application groups. They also search through the various traffic characteristics to identify those that are most effective at discriminating among the traffic flow classes. By identifying highly correlated traffic flow characteristics, this search also prunes the number of characteristics used for classification. Highly correlated characteristics provide comparable and often redundant information about the traffic flows. Thus, in many cases it suffices to use only one of the correlated characteristics to discriminate among traffic flows.

In this research, we explore an efficient approach to detecting computers compromised by bots. Because C&C is an activity common to all bots, we take it as the basis for detection and examine the ability of PSO to detect botnet traffic. To the best of our knowledge, no other botnet detection approach has yet applied swarm intelligence techniques.

II. THEORIES AND METHODS

A. Particle Swarm Optimization

PSO is based on a swarm of n individuals called particles, each representing a solution to a problem with N dimensions. The genotype of a particle consists of 2N parameters: the first N are the coordinates of the particle's position, and the remaining N are its velocity components in the N-dimensional problem space. From the evolutionary point of view, a particle moves with an adaptable velocity within the search space and retains in its own memory the best position it has ever reached. The parameters change from one iteration to the next as described below.

The velocity $\vec{v}_i(t+1)$ of the i-th particle at the next step t+1 is a linear combination of three terms: the current velocity $\vec{v}_i(t)$ of the i-th particle at time t; the difference between the best position $\vec{b}_i(t)$ found so far by the i-th particle and its current position $\vec{p}_i(t)$; and the difference between the best position ever found in the population, $\vec{b}_g(t)$, and the current position $\vec{p}_i(t)$ of the i-th particle:

$$\vec{v}_i(t+1) = w\,\vec{v}_i(t) + c_1\,U(0,1) \otimes \left(\vec{b}_i(t) - \vec{p}_i(t)\right) + c_2\,U(0,1) \otimes \left(\vec{b}_g(t) - \vec{p}_i(t)\right) \qquad (1)$$

where $\otimes$ denotes point-wise vector multiplication, $U(0,1)$ is a function that returns a vector whose components are drawn from a uniform distribution on [0, 1], $c_1$ is the cognitive parameter, $c_2$ is the social parameter, and $w$ is the inertia factor, whose range is [0.0, 1.0]. Velocity values must lie within a range defined by two parameters, $v_{min}$ and $v_{max}$.

An improvement on the original PSO is that $w$ is not kept constant during execution. Rather, starting from a maximal value $w_{max}$, it is linearly decremented as the number of iterations increases, down to a minimal value $w_{min}$, as follows [15]:

$$w(t) = w_{max} - (w_{max} - w_{min}) \cdot \frac{t}{T_{max}} \qquad (2)$$

where $t$ and $T_{max}$ are the current and the maximum allowed number of iterations, respectively.

The position of each particle at the next step is then evaluated as the sum of its current position and the velocity obtained by Eq. (1):

$$\vec{p}_i(t+1) = \vec{p}_i(t) + \vec{v}_i(t+1) \qquad (3)$$

These operations are repeated for a predefined number of iterations $T_{max}$ or until some other stopping criterion is met. The pseudocode of PSO is as follows:

for each particle do
    initialize particle position and velocity
end for
while stopping criteria are not fulfilled do
    for each particle do
        calculate fitness value
        if (fitness value is better than best fitness value b_i(t) in particle history) then
            take current particle as new b_i(t)
        end if
    end for
    choose as b_g(t) the particle with best fitness value among all particles in current iteration
    for each particle do
        calculate particle velocity based on Eq. (1)
        update particle position based on Eq. (3)
    end for
    update the inertia factor based on Eq. (2)
end while
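The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function name `pso_minimize`, the parameter defaults (c1 = c2 = 2.0, inertia from 0.9 down to 0.4, velocity clamp 0.2), the unit search box, and the fixed random seed are all assumptions, since the paper does not report its settings.

```python
import random


def pso_minimize(fitness, dim, n_particles=20, t_max=100, c1=2.0, c2=2.0,
                 w_max=0.9, w_min=0.4, p_min=0.0, p_max=1.0, v_max=0.2, seed=0):
    """Minimize `fitness` over the box [p_min, p_max]^dim.

    Returns (b_g, fitness(b_g)). All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Initialize particle positions and velocities at random.
    pos = [[rng.uniform(p_min, p_max) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[rng.uniform(-v_max, v_max) for _ in range(dim)] for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # b_i: best position of each particle
    best_fit = [float("inf")] * n_particles
    g_pos, g_fit = pos[0][:], float("inf")       # b_g: best position in the population
    w = w_max
    for t in range(t_max):
        # Evaluate every particle and update its personal best b_i.
        for i in range(n_particles):
            f = fitness(pos[i])
            if f < best_fit[i]:
                best_fit[i], best_pos[i] = f, pos[i][:]
        # Choose b_g as the best personal best found so far.
        g = min(range(n_particles), key=best_fit.__getitem__)
        g_pos, g_fit = best_pos[g][:], best_fit[g]
        for i in range(n_particles):
            for d in range(dim):
                # Eq. (1): inertia, cognitive and social terms, one random draw each.
                v = (w * vel[i][d]
                     + c1 * rng.random() * (best_pos[i][d] - pos[i][d])
                     + c2 * rng.random() * (g_pos[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))   # keep velocity in range
                # Eq. (3): move the particle, clipped to the search box.
                pos[i][d] = max(p_min, min(p_max, pos[i][d] + vel[i][d]))
        # Eq. (2): linearly decreasing inertia factor.
        w = w_max - (w_max - w_min) * (t + 1) / t_max
    return g_pos, g_fit
```

Calling `pso_minimize(lambda x: sum((xi - 0.3) ** 2 for xi in x), dim=3)` should return a point close to (0.3, 0.3, 0.3). Clipping positions to the box is a design choice made here for simplicity; the paper only requires velocities to be bounded.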

B. Adapting PSO to C&C Identification

The problem of C&C identification can be seen as a classification problem, i.e., determining whether or not a given flow is a C&C flow. Given a database with two classes ($C_1$, $C_2$) and N parameters, the problem can be translated into that of finding the optimal positions of the two centroids in an N-dimensional space, i.e., determining the N coordinates of each centroid. With these premises, the i-th individual of the population is encoded as follows:

$$\left(\vec{p}_i^{\,1}, \vec{p}_i^{\,2}, \vec{v}_i^{\,1}, \vec{v}_i^{\,2}\right) \qquad (4)$$

where the position of the j-th centroid consists of N real numbers representing its N coordinates in the problem space:

$$\vec{p}_i^{\,j} = \{p_{1,i}^{j}, \ldots, p_{N,i}^{j}\} \qquad (5)$$

and, similarly, the velocity of the j-th centroid consists of N real numbers representing its N velocity components in the problem space:

$$\vec{v}_i^{\,j} = \{v_{1,i}^{j}, \ldots, v_{N,i}^{j}\} \qquad (6)$$

Thus, every individual in the population consists of 4N components, each of which is a real value.

To evaluate the quality of solutions, we use the fitness function $\psi$, computed as the sum over all training set instances of the Euclidean distance in N-dimensional space between the generic instance $\vec{x}_j$ and the centroid of the class it belongs to according to the database, $\vec{p}_i^{\,CL_{known}(\vec{x}_j)}$. This sum is divided by $D_{Train}$, the number of instances in the training set. In symbols, the fitness of the i-th individual is given by:

$$\psi(i) = \frac{1}{D_{Train}} \sum_{j=1}^{D_{Train}} d\left(\vec{x}_j, \vec{p}_i^{\,CL_{known}(\vec{x}_j)}\right) \qquad (7)$$

When computing a distance, each of its components in the N-dimensional space is normalized with respect to the maximal range in that dimension, and the sum of distance components is divided by N. With this choice, any distance lies within [0.0, 1.0], and so does $\psi$. Given the chosen fitness function, the problem becomes a typical minimization problem.
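As an illustration of the centroid encoding and of Eq. (7), the following Python sketch computes the fitness of one individual. The function names are hypothetical, and two points are assumptions rather than statements of the paper: the normalized distance is read as the square root of the mean squared range-scaled difference (one way to keep it in [0, 1]), and a new flow is assigned to the class of its nearest centroid.

```python
import math


def decode_centroids(particle, n_features, n_classes=2):
    """Return the position part of an individual as a list of centroids.

    Only the first n_classes * n_features entries are read, so a full
    4N-component individual (positions followed by velocities) also works.
    """
    return [particle[j * n_features:(j + 1) * n_features] for j in range(n_classes)]


def normalized_distance(x, centroid, ranges):
    """Range-normalized Euclidean distance (assumed form, lies in [0, 1]
    when both points are inside the attribute ranges)."""
    n = len(ranges)
    total = sum(((a - b) / (r if r > 0 else 1.0)) ** 2
                for a, b, r in zip(x, centroid, ranges))
    return math.sqrt(total / n)


def fitness(particle, instances, labels, ranges):
    """Eq. (7): mean distance between each training instance and the
    centroid of its known class (labels are 0 or 1)."""
    centroids = decode_centroids(particle, len(ranges))
    total = sum(normalized_distance(x, centroids[c], ranges)
                for x, c in zip(instances, labels))
    return total / len(instances)


def classify(particle, x, ranges):
    """Assign x to the class of the nearest centroid (assumed decision rule)."""
    centroids = decode_centroids(particle, len(ranges))
    return min(range(len(centroids)),
               key=lambda j: normalized_distance(x, centroids[j], ranges))
```

With two training instances `[[0, 0], [1, 1]]` labeled `[0, 1]` and unit ranges, the individual `[0, 0, 1, 1]` places each centroid on its instance and has fitness 0, while `[1, 1, 0, 0]` swaps them and has the worst possible fitness of 1.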

III. PRELIMINARY

In this section, we first describe the malware dataset that we collected to analyze bot traffic. Then we determine the flow characteristics of the traffic, which vary in their ability to differentiate flows. Finally, we clean the flow data by removing irrelevant instances to facilitate the subsequent work.

A. Source Data

We used a malware dataset captured by our honeypot system. The honeypot system uses nepenthes [16] to collect bots from the Internet. It is operated by the bot analysis team of the Institute of Information Security. In this research, we used 2161 unique binary files captured by this system. The malware collection was scanned with the ClamAV [17] antivirus tool (version 0.88.2, signature file number 2416), and the results are shown in Table I. The antivirus program identified 1473 files as bots and left 483 files unidentified.

We captured all IP packets during the execution of each malware sample in a sandbox environment for traffic analysis. We executed each sample under Windows XP (with no service pack applied) in VMware [18] for 3 minutes. All packets to and from the sandbox were captured and stored as files in tcpdump format.

TABLE I
CONTENTS OF MALWARE SPECIOUS COLLECTION

malware type   sub type   num.   percentage
SDBot          123        598    27.67%
MyBot          197        539    24.94%
PoeBot         19         243    11.24%
IRCBot         18         93     4.30%
Others         84         205    9.49%
Unknown                   483    22.35%
Total                     2161   100.00%

Table II shows the results of malware execution. By manual analysis, we identified 957 active bots and 1229 C&C server sessions. These bots accessed 97 unique servers, some of which were running on the same machine: that machine had been compromised by attackers, and multiple C&C servers had been installed and configured on it. We used these 1229 sessions as bot C&C sessions for classification. More details of their analysis and of the classification are given in the following section.

TABLE II
DETECTION OF ACTIVE BOTS AND C&C SERVERS

Active bot programs (C&C session detected)            957
Unknown (not detected as malware by ClamAV)           67 (7.0%)
Total C&C sessions                                    1229
Unique servers (pair of IP address and port number)   97
Unique IP addresses                                   71

B. Flow Characteristics

Since IRC-based botnets use TCP, we retain TCP packets and discard all others (UDP, ICMP, etc.). We characterize flows using attributes based on TCP and IP packet headers. These can be interpreted even if the encapsulated payload is encrypted.

Table III summarizes the flow characteristics that we collected for each of the flows in the traffic traces used in our work. These include the cumulative application payload size, the IP protocol type (TCP), the IP source and destination addresses, the source and destination ports, and TCP flags.

TABLE III
TRAFFIC FLOW CHARACTERISTICS

start/end          Flow start/end times
IP-proto           IP protocol of flow
TCP flags          Summary of TCP SYN/FIN/ACK flags
pkts               Total packets exchanged in flow
Bytes              Total Bytes exchanged in flow
pushed pkts        Total packets pushed in flow
duration           Flow duration
maxwin             Maximum initial congestion window
Role               Whether client or server initiated flow
Bpp                Average Bytes-per-packet for flow
Bps                Average bits-per-second for flow
Pps                Average packets-per-second for flow
PctPktsPushed      Percentage of packets pushed in flow
PctBppHistBin0-7   Percent of packets in one of eight packet size bins; these variables collectively form a histogram of packet size for flow
varIAT             Variance of packet inter-arrival time for flow
varBpp             Variance of Bytes-per-packet for flow

Moreover, we record flow start and end times, packet counts, byte counts, variance statistics, the client/server role for the connection (as indicated by the initial TCP three-way handshake), and a histogram of application payload sizes. For experimental purposes, we also record the packet counts associated with TCP push and the maximum window size.
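To make the derived attributes of Table III concrete, the sketch below computes several of them from a time-ordered list of packets. It is an illustration only: the packet tuple layout, the function name `flow_features`, and in particular the bin edges of the eight-bin size histogram are assumptions, since the paper does not state the edges it used.

```python
from statistics import pvariance

# Hypothetical bin edges in bytes (7 edges give 8 bins); not taken from the paper.
BIN_EDGES = (0, 32, 64, 128, 256, 512, 1024)


def flow_features(packets):
    """Compute rate, percentage and variance attributes for one flow.

    packets: non-empty, time-ordered list of
             (timestamp_seconds, size_bytes, psh_flag_set).
    """
    times = [p[0] for p in packets]
    sizes = [p[1] for p in packets]
    n = len(packets)
    duration = times[-1] - times[0]
    total_bytes = sum(sizes)
    # Inter-arrival times between consecutive packets.
    iats = [b - a for a, b in zip(times, times[1:])]
    # Histogram of packet sizes: the bin index is the number of edges exceeded.
    hist = [0] * (len(BIN_EDGES) + 1)
    for s in sizes:
        hist[sum(s > edge for edge in BIN_EDGES)] += 1
    return {
        "duration": duration,
        "Bpp": total_bytes / n,
        "Bps": 8 * total_bytes / duration if duration > 0 else 0.0,  # bits per second
        "Pps": n / duration if duration > 0 else 0.0,
        "PctPktsPushed": 100.0 * sum(1 for p in packets if p[2]) / n,
        "PctBppHistBin": [100.0 * h / n for h in hist],
        "varIAT": pvariance(iats) if iats else 0.0,
        "varBpp": pvariance(sizes),
    }
```

Because every quantity is derived from header fields (timestamps, lengths, flags), the same computation applies whether or not the payload is encrypted.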

C. Noise Elimination

To reduce the total number of flows considered, we use a set of heuristics crafted to discard flows that are unlikely to be botnet flows. We eliminate all port-scanning activity from the data set: flows containing only TCP SYN or TCP RST packets indicate that communication was never established.

Moreover, we eliminate short-lived flows, that is, flows of only a few packets or a few seconds. These do not correspond to bots that are standing by "at the ready." These eliminations are more significant for the non-chat subset of flows and serve to focus the subsequent PSO techniques on the more important areas of overlap, either between IRC and non-IRC flows or between botnet and real IRC flows.
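The two heuristics can be expressed as a simple filter. The sketch below is one possible reading, not the filter used in the experiments: the flow record layout and the function names are hypothetical, and the thresholds `min_pkts` and `min_duration` are placeholders because the paper does not quantify "a few packets" or "a few seconds".

```python
def is_noise(flow, min_pkts=4, min_duration=5.0):
    """Return True if the flow should be discarded before classification.

    flow: dict with keys "flags" (set of TCP flag names seen in the flow),
          "pkts" (packet count) and "duration" (seconds).
    The threshold defaults are illustrative assumptions.
    """
    flags = set(flow["flags"])
    # Only SYN and/or RST observed: the connection was never established,
    # which is typical of port scanning.
    if flags and flags <= {"SYN", "RST"}:
        return True
    # Short-lived flows: only a few packets or only a few seconds.
    return flow["pkts"] < min_pkts or flow["duration"] < min_duration


def eliminate_noise(flows, **thresholds):
    """Keep only the flows that pass both heuristics."""
    return [f for f in flows if not is_noise(f, **thresholds)]
```

A flow with flags {SYN, ACK, PSH}, 50 packets and a duration of 120 seconds would be kept, while a lone SYN probe would be dropped.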

IV. EXPERIMENTAL STUDY AND RESULTS ANALYSIS

In this section, we present our work on using PSO-based classifiers to identify IRC-based botnet C&C flows. We first classify flows into IRC and non-IRC flows; then, among the flows identified as IRC flows, we distinguish between authentic IRC and botnet flows.

The False Negative Rate (FNR) and the False Positive Rate (FPR) were used to evaluate the performance of the classifiers considered. A low FNR guarantees that only a small fraction of the IRC/botnet flows will be discarded during our botnet identification process. A low FPR guarantees that the set of flows identified as IRC/botnet will not be contaminated by non-IRC/botnet flows. In the first stage of our approach, that of identifying IRC traffic, we expect traffic traces to involve a large number of non-IRC/botnet flows. A low FPR would thus be beneficial in cutting down the number of flows that need to be examined during the second stage, that of discerning botnet from chat flows. In Stage I, we evaluate the classifiers considered in terms of the FNR in identifying our botnet flows.
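For reference, the two rates follow the standard definitions: FNR is the fraction of positive (IRC/botnet) flows that the classifier misses, and FPR is the fraction of negative flows that it flags. A minimal Python sketch, with a hypothetical function name and label convention (1 for the positive class):

```python
def fnr_fpr(y_true, y_pred, positive=1):
    """Return (FNR, FPR) for paired lists of true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    # False negatives: positive flows predicted as negative.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # False positives: negative flows predicted as positive.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    n_pos = sum(1 for t in y_true if t == positive)
    n_neg = len(y_true) - n_pos
    fnr = fn / n_pos if n_pos else 0.0
    fpr = fp / n_neg if n_neg else 0.0
    return fnr, fpr
```

For example, missing 3 of 38 positive flows gives an FNR of 3/38, or about 7.89%.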

We explore the effectiveness of PSO-based classification in identifying IRC traffic by comparing three distinct classification techniques: J48, PSO, and Bayesian networks.

[Figure: scatter plot of FNR (%) versus FPR for IRC/non-IRC flows]

Fig. 1. FNR and FPR of J48, PSO, and Bayesian Net Classification Schemes for IRC/non-IRC Flows of the Trace

J48 is the WEKA [19] implementation of C4.5 decision trees [20]. The Bayesian networks technique uses a directed acyclic graph to capture the dependence among sample features.

We use the flow characteristics in the lower part of Table III as the initial set of flow attributes. We do not use the characteristics in the upper part of the table for classification because they are either inconsequential in classifying flows or correspond to accumulated quantities, which are indirectly captured by the corresponding rates or percentages and the flow duration.

Figure 1 depicts the FNR vs. FPR scatter plot for ten runs of J48, PSO, and Bayesian networks on the labeled trace. Each data point corresponds to a different subset of the initial flow attribute set. Figure 1 reveals clustering in the performance of each of the three classification techniques. PSO tends to have a low FNR but a higher FPR. The Bayesian networks technique tends to have a low FPR but a higher FNR. J48 appears to strike a balance between FNR and FPR.

Only the PSO classifiers were successful in achieving low FNR. Notably, the PSO classifiers accurately classified 35 out of the 38 background flows, thus achieving an FNR of 7.89%. In contrast, the J48 and the Bayesian networks classifiers, possibly tuned too tightly to the training set, performed very poorly. Since the PSO classifier is the only one that showed potential in accurately classifying IRC flows, it would be preferable to the J48 and Bayesian network classifiers.

We investigated which attribute sets provide the most benefit in differentiating botnet C&C traffic with PSO. First, we defined three kinds of vectors for session classification. Then, we examined the results of C&C session classification by PSO using each vector definition.

The three kinds of vectors are the session information vector, the packet sequence vector, and the packet histogram vector. The session information vector consists of the total number of received packets, the total size of received packet data, the total number of sent packets, the total size of sent packet data, and the session time. The packet sequence vector consists of the packet size and packet interval time of the first 16 packets after the session is established. The packet histogram vector is the histogram of packet payload size and packet interval time in the session.
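The three vector definitions can be sketched as follows. This is an illustrative reading only: the packet tuple layout and function names are hypothetical, the histogram bin edges are placeholders (the paper does not give them), and zero-padding sessions shorter than 16 packets is an assumption.

```python
from bisect import bisect_left

# Hypothetical bin edges; the paper does not specify the ones it used.
SIZE_EDGES = (64, 128, 256, 512, 1024)       # payload size in bytes
INTERVAL_EDGES = (0.01, 0.1, 1.0, 10.0)      # interval time in seconds


def intervals(session):
    """Interval before each packet (0.0 for the first packet)."""
    times = [p[0] for p in session]
    return [0.0] + [b - a for a, b in zip(times, times[1:])]


def session_information_vector(session):
    """[recv packets, recv bytes, send packets, send bytes, session time].

    session: time-ordered list of (timestamp, "send" or "recv", payload_size).
    """
    recv = [p[2] for p in session if p[1] == "recv"]
    send = [p[2] for p in session if p[1] == "send"]
    return [len(recv), sum(recv), len(send), sum(send),
            session[-1][0] - session[0][0]]


def packet_sequence_vector(session, n_first=16):
    """Size and interval of the first n_first packets, zero-padded (assumption)."""
    pairs = list(zip((p[2] for p in session), intervals(session)))[:n_first]
    pairs += [(0, 0.0)] * (n_first - len(pairs))
    return [value for pair in pairs for value in pair]


def packet_histogram_vector(session):
    """Packet counts per payload-size bin followed by counts per interval bin."""
    size_hist = [0] * (len(SIZE_EDGES) + 1)
    gap_hist = [0] * (len(INTERVAL_EDGES) + 1)
    for (_, _, size), gap in zip(session, intervals(session)):
        size_hist[bisect_left(SIZE_EDGES, size)] += 1
        gap_hist[bisect_left(INTERVAL_EDGES, gap)] += 1
    return size_hist + gap_hist
```

Any of the three vectors can serve as the N-dimensional instance fed to the centroid-based PSO classifier; the histogram form has a fixed length regardless of session length and ignores packet order.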

Figure 2 shows the results of C&C session classification with the different attribute vectors. With the session information vector, the detection rate is 82.68% on the training dataset and 80.85% on the testing dataset. This is a good result for classifying bot C&C sessions, given that such a simple vector represents the session characteristics. However, the FPR for IRC chat sessions is higher (9.8%): normal IRC chat sessions were misclassified as C&C sessions. With the packet sequence vector, all C&C sessions in the training dataset were correctly identified, but the FNR for C&C sessions in the testing dataset is 82.55%. The packet histogram vector performed better than the other two vector definitions, classifying C&C sessions well in both the training and testing datasets. Its FPR is 0.09% on the training dataset, with no false positives on the other data. Its FNR is 3.15% on the training dataset and 5.25% on the testing dataset.

[Figure: two bar-chart panels, "C&C Session Classification Results on Training Set" and "C&C Session Classification Results on Testing Set"; legend: Session Information Vector, Packet Sequence Vector, Packet Histogram Vector]

Fig. 2. Comparison of C&C Session Classification Results on Training and Testing Sets Using Different Attribute Vectors

V. CONCLUSIONS

In this paper, we use PSO techniques to identify the C&C traffic of IRC-based botnets. We split this task into two stages: (I) distinguishing between IRC and non-IRC traffic, and (II) distinguishing between botnet and real IRC traffic. In Stage I, only the PSO classifiers were successful in achieving a low FNR in classifying the IRC flows. In Stage II, the packet histogram vector proved superior to the other two attribute vectors in identifying bot C&C sessions, achieving both a high detection rate and a low FPR.

ACKNOWLEDGMENT

The authors thank the research team of the Tanaka Laboratory in the Institute of Information Security for collecting the malware dataset.

REFERENCES

[1] D. Dagon, G. Gu, C. Lee, and W. Lee. A Taxonomy of Botnet Structures. In Annual Computer Security Applications Conference (ACSAC), 2007.

[2] A. Ramachandran and N. Feamster. Understanding the Network-level Behavior of Spammers. In ACM SIGCOMM, 2006.

[3] T. Holz, M. Engelberth, and F. Freiling. Learning More About the Underground Economy: A Case-Study of Keyloggers and Dropzones. Reihe Informatik TR-2008-006, University of Mannheim, 2008.

[4] S. Saroiu, S. Gribble, and H. Levy. Measurement and Analysis of Spyware in a University Environment. In Networked Systems Design and Implementation (NSDI), 2004.

[5] D. Moore, G. Voelker, and S. Savage. Inferring Internet Denial of Service Activity. In USENIX Security Symposium, 2001.

[6] B. Saha and A. Gairola. Botnet: An overview. CERT-In White Paper CIWP-2005-05, 2005.

[7] M. A. Rajab, J. Zarfoss, F. Monrose, and A. Terzis. A multifaceted approach to understanding the botnet phenomenon. In Proc. of the 6th ACM SIGCOMM on Internet Measurement Conference (IMC), 2006.

[8] J. Goebel and T. Holz. Rishi: Identify bot contaminated hosts by IRC nickname evaluation. In USENIX HotBots '07 Workshop, 2007.

[9] C. Dewes, A. Wichmann, and A. Feldmann. An analysis of Internet chat systems. In IMC '03: Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, pages 51-64, New York, NY, USA, 2003. ACM Press.

[10] S. Sen, O. Spatscheck, and D. Wang. Accurate, scalable in-network identification of p2p traffic using application signatures. In WWW '04: Proceedings of the 13th international conference on World Wide Web, pages 512-521, New York, NY, USA, 2004. ACM Press.

[11] A. Karasaridis, B. Rexroad, and D. Hoeflin. Wide-scale botnet detection and characterization. In USENIX HotBots '07 Workshop, 2007.

[12] C. Livadas, B. Walsh, D. Lapsley, and T. Strayer. Using machine learning techniques to identify botnet traffic. In 2nd IEEE LCN Workshop on Network Security (WoNS'2006), November 2006.

[13] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield. Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification. In IMC '04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 135-148, New York, NY, USA, 2004. ACM Press.

[14] A. W. Moore and D. Zuev. Internet traffic classification using Bayesian analysis techniques. In SIGMETRICS '05: Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 50-60, New York, NY, USA, 2005. ACM Press.

[15] Y. Shi and R. C. Eberhart. A modified Particle Swarm Optimizer. In Proceedings of the IEEE International Conference on Evolutionary Computation, IEEE Press, Piscataway, NJ, 1998, pp. 69-71.

[16] Nepenthes Development Team. Nepenthes - Finest Collection. Available: http://nepenthes.mwcollect.org/

[17] ClamAV project. ClamAV. Available: http://www.clamav.net/

[18] VMware Inc. VMware Workstation. Software available: http://www.vmware.com/

[19] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques (2nd Edition). Morgan Kaufmann, San Francisco, CA, 2005.

[20] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2nd edition, 2001.
