BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection. Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. In Proceedings of the 17th USENIX Security Symposium (Security'08), San Jose, CA, 2008. Reporter: Fong-Ruei, Li.
2009/6/22 1
BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection
Reporter : Fong-Ruei , Li
Machine Learning and Bioinformatics Lab
Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. In Proceedings of the 17th USENIX Security Symposium (Security'08), San Jose, CA, 2008.
Outline
Introduction
BotMiner: Detection Framework
  Problem statement
  Architecture overview
Experiments
Conclusion
Introduction
Botnets are becoming one of the most serious threats to Internet security, used for attacks such as spam, DDoS, …
A botnet is a network of compromised machines (bots) under the influence of malware code, controlled by a botmaster
Introduction
Most of the current botnet detection approaches work only on a specific botnet command and control (C&C):
  Protocol, e.g., IRC
  Structure, e.g., centralized
Introduction
Almost all of these approaches are designed to detect botnets that use IRC- or HTTP-based C&C
  Rishi is designed to detect IRC botnets using known bot nickname patterns as signatures
  BotSniffer, another recent system, is designed to detect C&C activities with centralized servers
Introduction
We need to develop a next-generation botnet detection system that is independent of the C&C protocol and structure
Problem Statement
A botnet is characterized by:
  C&C communication channel
  Malicious activities
Botnet structure:
  Centralized
  P2P
Assumptions
We assume that bots within the same botnet will be characterized by similar malicious activities and similar C&C communications
Architecture overview
[Figure: BotMiner architecture. Callouts: Clustering similar malicious activities; Clustering similar communication; Cross-checking]
C-plane Monitor
The C-plane monitor captures network flows and records information on who is talking to whom
We limit our interest to TCP and UDP flows
Each flow record contains the following information:
  Time, duration
  IP address and port (source, destination)
  Number of packets
  Bytes transferred
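The flow record listed above could be modelled roughly as follows. This is only a sketch: the class name, the field names, and the example addresses are assumptions made for illustration, not the format used by the actual C-plane monitor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One C-plane flow record (hypothetical field names)."""
    start_time: float  # when the flow started, in seconds
    duration: float    # flow duration, in seconds
    protocol: str      # transport protocol, e.g. "TCP" or "UDP"
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    packets: int       # number of packets
    num_bytes: int     # bytes transferred


def is_monitored(flow):
    """Only TCP and UDP flows are of interest."""
    return flow.protocol in ("TCP", "UDP")


tcp_flow = FlowRecord(0.0, 2.5, "TCP", "10.0.0.5", 40312, "203.0.113.7", 6667, 12, 1800)
icmp_flow = FlowRecord(1.0, 0.1, "ICMP", "10.0.0.5", 0, "203.0.113.7", 0, 1, 64)
```

Here `is_monitored(tcp_flow)` is true and `is_monitored(icmp_flow)` is false.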
A-plane Monitor
The A-plane monitor logs information on who is doing what
It analyzes outbound traffic through the monitored network and detects several malicious activities that the internal hosts may perform
C-plane Clustering
Responsible for:
  Reading the logs generated by the C-plane monitor
  Finding clusters of machines that share similar communication patterns
C-plane Clustering-Flow Chart
[Figure: C-plane clustering flow chart. Callout: Filter out irrelevant traffic flows]
C-plane Clustering-Basic Filtering
Filter Rule 1 (F1): Ignore the flows that are not directed from internal hosts to external hosts
Filter Rule 2 (F2): Ignore the flows that contain only one-way traffic
C-plane Clustering-White Listing
Filter Rule 3 (F3): Ignore the flows whose destinations are well-known legitimate servers, e.g., Google, Yahoo!
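Filter rules F1 to F3 can be sketched as a single predicate. Everything specific here is an assumption for illustration: the `Flow` layout (in particular the per-direction packet counts used for F2), the `INTERNAL_NET` range, and the `WHITELIST` entry.

```python
import ipaddress
from collections import namedtuple

# Hypothetical flow layout; per-direction packet counts are an assumption.
Flow = namedtuple("Flow", "src_ip dst_ip pkts_out pkts_in")

INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")  # assumed monitored network
WHITELIST = {"198.51.100.10"}                       # assumed well-known servers


def is_internal(ip):
    return ipaddress.ip_address(ip) in INTERNAL_NET


def passes_filters(flow):
    # F1: keep only flows from an internal host to an external host.
    if not is_internal(flow.src_ip) or is_internal(flow.dst_ip):
        return False
    # F2: drop flows that carry traffic in one direction only.
    if flow.pkts_out == 0 or flow.pkts_in == 0:
        return False
    # F3: drop flows whose destination is whitelisted.
    if flow.dst_ip in WHITELIST:
        return False
    return True


flows = [
    Flow("10.0.0.5", "203.0.113.7", 10, 8),   # kept
    Flow("10.0.0.5", "10.0.0.9", 4, 4),       # F1: internal to internal
    Flow("10.0.0.6", "203.0.113.7", 3, 0),    # F2: one-way traffic
    Flow("10.0.0.7", "198.51.100.10", 5, 5),  # F3: whitelisted destination
]
kept = [f for f in flows if passes_filters(f)]
```

Only the first flow survives all three rules.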
C-plane Clustering-Aggregation (C-Flow)
Aggregate related flows into communication flows (C-flows)
Given a period (epoch) E, all m TCP/UDP flows that share the same protocol, source IP, destination IP and port are aggregated into the same C-flow $c_i = \{f_j\}_{j=1..m}$
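The aggregation step amounts to grouping flows by a key. The sketch below assumes that "port" means the destination port and uses a hypothetical `Flow` tuple; neither detail is confirmed by the slides.

```python
from collections import defaultdict, namedtuple

# Hypothetical flow layout for illustration.
Flow = namedtuple("Flow", "protocol src_ip dst_ip dst_port packets num_bytes")


def aggregate_c_flows(flows):
    """Group flows of one epoch by (protocol, source IP, destination IP, destination port)."""
    c_flows = defaultdict(list)
    for f in flows:
        c_flows[(f.protocol, f.src_ip, f.dst_ip, f.dst_port)].append(f)
    return dict(c_flows)


flows = [
    Flow("TCP", "10.0.0.5", "203.0.113.7", 6667, 10, 900),
    Flow("TCP", "10.0.0.5", "203.0.113.7", 6667, 12, 1100),
    Flow("UDP", "10.0.0.5", "203.0.113.7", 6667, 3, 200),
]
c_flows = aggregate_c_flows(flows)
```

The two TCP flows end up in one C-flow and the UDP flow in another, because the protocol is part of the key.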
C-plane Clustering-Vector representation
Extract a number of statistical features from each C-flow $c_i$
Translate them into d-dimensional pattern vectors: $\vec{p}_i \in \mathbb{R}^d$
For each C-flow $c_i$, compute the discrete sample distribution of four random variables:
1. The number of flows per hour (fph). fph is computed by counting the number of TCP/UDP flows in $c_i$ that are present for each hour of the epoch E.
2. The number of packets per flow (ppf). ppf is computed by summing the total number of packets sent within each TCP/UDP flow in $c_i$.
3. The average number of bytes per packet (bpp). For each TCP/UDP flow $f_j \in c_i$ we divide the overall number of bytes transferred within $f_j$ by the number of packets sent within $f_j$.
4. The average number of bytes per second (bps). bps is computed as the total number of bytes transferred within each $f_j \in c_i$ divided by the duration of $f_j$.
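The four sample sets could be computed per C-flow along these lines. This is a minimal sketch: the `Flow` layout is hypothetical, flow start times are assumed to be seconds since the start of the epoch, and each flow is counted in the hour in which it starts.

```python
from collections import namedtuple

# Hypothetical flow layout; `start` is seconds since the start of the epoch.
Flow = namedtuple("Flow", "start duration packets num_bytes")


def c_flow_samples(c_flow, epoch_hours=24):
    """Return the fph, ppf, bpp and bps samples for one C-flow."""
    fph = [0] * epoch_hours
    for f in c_flow:
        hour = int(f.start // 3600)
        if 0 <= hour < epoch_hours:
            fph[hour] += 1  # assumption: count a flow in the hour it starts
    ppf = [f.packets for f in c_flow]
    bpp = [f.num_bytes / f.packets for f in c_flow if f.packets > 0]
    bps = [f.num_bytes / f.duration for f in c_flow if f.duration > 0]
    return {"fph": fph, "ppf": ppf, "bpp": bpp, "bps": bps}


c_flow = [
    Flow(10.0, 2.0, 10, 1000),
    Flow(100.0, 4.0, 20, 4000),
    Flow(3700.0, 1.0, 5, 250),
]
samples = c_flow_samples(c_flow, epoch_hours=3)
```

With these three flows, two fall in the first hour and one in the second, so `samples["fph"]` is `[2, 1, 0]`.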
Each distribution is approximated by 13 intervals: $[0, k_1], (k_1, k_2], \ldots, (k_{12}, \infty)$
The boundaries $k_1, \ldots, k_{12}$ are the quantiles $q_{5\%}, q_{10\%}, q_{15\%}, q_{20\%}, q_{25\%}, q_{30\%}, q_{40\%}, q_{50\%}, q_{60\%}, q_{70\%}, q_{80\%}, q_{90\%}$
The quantile $q_{l\%}$ of a random variable X is the value q for which $P(X < q) = l\%$
Four variables with 13 intervals each give d = 52 features
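The binning into 13 intervals can be sketched as below. The quantile estimate is a simple nearest-rank approximation chosen for illustration, and the bins hold raw counts; whether the original system normalizes them is not stated in the slides. The function names are hypothetical.

```python
import bisect

LEVELS = [5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90]  # quantile levels in percent


def boundaries(all_values):
    """Estimate k1..k12 from the overall sample (nearest-rank approximation)."""
    s = sorted(all_values)
    return [s[min(len(s) - 1, (level * len(s)) // 100)] for level in LEVELS]


def bin_counts(values, ks):
    """Count values in [0, k1], (k1, k2], ..., (k12, inf)."""
    counts = [0] * (len(ks) + 1)
    for v in values:
        # bisect_left gives the number of boundaries strictly below v,
        # so a value equal to a boundary stays in the lower interval.
        counts[bisect.bisect_left(ks, v)] += 1
    return counts


def pattern_vector(samples, ks_by_var):
    """Concatenate the 13 bins of each of the four variables: 52 features."""
    vec = []
    for name in ("fph", "ppf", "bpp", "bps"):
        vec.extend(bin_counts(samples[name], ks_by_var[name]))
    return vec


ks = boundaries(range(100))
counts = bin_counts([0, 5, 6, 95], ks)
vec = pattern_vector(
    {"fph": [1, 2], "ppf": [10], "bpp": [50.0], "bps": [95.0]},
    {name: ks for name in ("fph", "ppf", "bpp", "bps")},
)
```

In this toy run the boundaries are 5, 10, ..., 90, so the values 0 and 5 land in the first interval, 6 in the second, and 95 in the last.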
C-plane Clustering-Two-step clustering
First step:
  Data set: $D = \{\vec{p}_i = F(c_i)\}_{i=1..n}$
  Coarse-grained clustering on a reduced feature space $\mathbb{R}^{d'}$, with $d' < d$: the d = 52 features are reduced to d' = 8 features
  X-means clustering algorithm
  The result is a set of clusters $\{C'_i\}_{i=1..r_1}$
Second step:
  We use all the d = 52 available features to represent the C-flows
  X-means clustering algorithm
  The result is a set of clusters $\{C''_i\}_{i=1..r_2}$
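A rough sketch of the two-step idea follows. It makes several assumptions that go beyond the slides: plain k-means with a fixed k stands in for X-means, the 8 reduced features are taken to be the mean and variance of each of the four variables, and the second step re-clusters each coarse cluster separately on the full vectors. All function names are hypothetical.

```python
import statistics


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic init (first k points); returns labels."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        for n, p in enumerate(points):
            labels[n] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def reduce_features(samples):
    """Assumed reduction: mean and variance of each variable, 8 features in total."""
    out = []
    for name in ("fph", "ppf", "bpp", "bps"):
        out += [statistics.mean(samples[name]), statistics.pvariance(samples[name])]
    return out


def two_step(reduced, full, k1, k2):
    """Coarse clustering on reduced vectors, then refinement on full vectors."""
    coarse = kmeans(reduced, k1)
    result = {}
    for c in sorted(set(coarse)):
        idx = [i for i, lab in enumerate(coarse) if lab == c]
        fine = kmeans([full[i] for i in idx], min(k2, len(idx)))
        for i, lab in zip(idx, fine):
            result[i] = (c, lab)  # (coarse cluster, fine cluster)
    return result


reduced = [[0, 0], [10, 10], [0.1, 0], [10, 10.1]]
full = [[0, 0, 1], [5, 5, 5], [0, 0, 9], [5, 5, 5.1]]
clusters = two_step(reduced, full, k1=2, k2=2)
eight = reduce_features(
    {"fph": [2, 1, 0], "ppf": [10, 20, 5], "bpp": [100.0, 200.0, 50.0], "bps": [500.0, 1000.0, 250.0]}
)
```

The point of the design is cost: the cheap first pass on 8 features splits the data into large groups, so the more expensive pass on all 52 features only runs within each group.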
A-plane Clustering
Cross-plane Correlation
The idea is to cross-check clusters in the two planes to find intersections that reinforce the evidence of a host being part of a botnet
To do this, we compute a botnet score s(h) for each host h
Cross-plane Correlation – botnet score
Let $H$ be the set of hosts reported in the output of the A-plane clustering, and $h \in H$
Let $A^{(h)} = \{A_i\}_{i=1..m_h}$ be the set of $m_h$ A-clusters that contain h
Let $C^{(h)} = \{C_i\}_{i=1..n_h}$ be the set of $n_h$ C-clusters that contain h
$$s(h) = \sum_{\substack{i,j \\ j>i \\ t(A_i) \neq t(A_j)}} w(A_i)\, w(A_j)\, \frac{|A_i \cap A_j|}{|A_i \cup A_j|} + \sum_{i,k} w(A_i)\, \frac{|A_i \cap C_k|}{|A_i \cup C_k|}$$
where $A_i, A_j \in A^{(h)}$ and $C_k \in C^{(h)}$
$t(A_i)$ is the type of activity cluster $A_i$ (e.g., scanning or spamming)
$w(A_i) \ge 1$ is an activity weight assigned to $A_i$
  $w(A_i)$ has a higher value for a strong activity (e.g., spam or exploit)
  $w(A_i)$ has a lower value for a weak activity (e.g., scanning or binary download)
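The botnet score can be sketched directly from its definition: a weighted overlap between pairs of A-clusters of different activity types, plus a weighted overlap between each A-cluster and each C-cluster. The data layout and the weight values below are assumptions chosen for illustration.

```python
from itertools import combinations


def overlap(a, b):
    """|a ∩ b| / |a ∪ b| for two sets of hosts."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def botnet_score(h, a_clusters, c_clusters, weights):
    """a_clusters: list of (activity_type, hosts); c_clusters: list of host sets."""
    a_h = [(t, hosts) for t, hosts in a_clusters if h in hosts]
    c_h = [hosts for hosts in c_clusters if h in hosts]
    score = 0.0
    # First sum: pairs of A-clusters (j > i) with different activity types.
    for (t_i, a_i), (t_j, a_j) in combinations(a_h, 2):
        if t_i != t_j:
            score += weights[t_i] * weights[t_j] * overlap(a_i, a_j)
    # Second sum: every A-cluster against every C-cluster containing h.
    for t_i, a_i in a_h:
        for c_k in c_h:
            score += weights[t_i] * overlap(a_i, c_k)
    return score


a_clusters = [("scan", {"h1", "h2"}), ("spam", {"h1", "h2", "h3"})]
c_clusters = [{"h1", "h2"}, {"h4"}]
weights = {"scan": 1.0, "spam": 2.0}  # hypothetical weights, each >= 1
s_h1 = botnet_score("h1", a_clusters, c_clusters, weights)
s_h4 = botnet_score("h4", a_clusters, c_clusters, weights)
```

A host such as `h4` that appears in no A-cluster scores zero, whichever C-clusters it belongs to.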
Cross-plane Correlation - similarity
Let $B \subseteq H$ be the set of detected bots
Let $A^{(B)} = \{A_i\}_{i=1..m_B}$ be the set of A-clusters that each contain at least one bot $h \in B$
Let $C^{(B)} = \{C_i\}_{i=1..n_B}$ be the set of C-clusters that each contain at least one bot $h \in B$
Let $K^{(B)} = A^{(B)} \cup C^{(B)} = \{K^{(B)}_i\}_{i=1..(m_B+n_B)}$ be an ordered union/set of A- and C-clusters
We then describe each bot $h \in B$ as a binary vector $b(h) \in \{0, 1\}^{|K^{(B)}|}$, whereby the i-th element $b_i = 1$ if $h \in K^{(B)}_i$, and $b_i = 0$ otherwise
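Building the binary vector b(h) is a membership test against the ordered list of clusters. In this sketch the function name and the toy clusters are hypothetical, and the order (A-clusters first, then C-clusters) is one possible choice of ordering.

```python
def bot_vectors(bots, a_clusters, c_clusters):
    """Map each detected bot to its binary membership vector b(h)."""
    a_b = [c for c in a_clusters if c & bots]  # A-clusters with at least one bot
    c_b = [c for c in c_clusters if c & bots]  # C-clusters with at least one bot
    k_b = a_b + c_b                            # ordered union K(B)
    return {h: [1 if h in k else 0 for k in k_b] for h in sorted(bots)}


bots = {"h1", "h2"}
a_clusters = [{"h1", "h2"}, {"h3"}]
c_clusters = [{"h1"}, {"h2", "h5"}]
vectors = bot_vectors(bots, a_clusters, c_clusters)
```

The A-cluster `{"h3"}` contains no detected bot, so it is dropped and each vector has three elements.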
Cross-plane Correlation - similarity
We define the following similarity between bots $h_i$ and $h_j$ as
[Equation omitted in transcript]
where:
  $b^{(i)} = b(h_i)$ and $b^{(j)} = b(h_j)$
  $I(X)$ is the indicator function, which equals one when X is true and zero when X is false
Setup and Collection
We set up traffic monitors on a router in the campus network of the College of Computing at Georgia Tech.
We ran the C-plane and A-plane monitors for a continuous 10-day period in late 2007.
Setup and Collection
[Table with callouts: generated by executing modified bot code; generated based on Web-based C&C communication; a real-world trace containing two P2P botnets]
Evaluation Results
[Slide graphic; callouts: Filtration, Aggregation]
[Slide graphic; callout: Two-step clustering]
Conclusion
We proposed a novel network anomaly-based botnet detection system that is independent of the protocol and structure used by botnets
Thank you for listening
The end