BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection

Reporter: Fong-Ruei Li, Machine Learning and Bioinformatics Lab, 2009/6/22

Paper: Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. In Proceedings of the 17th USENIX Security Symposium (Security'08), San Jose, CA, 2008.


Page 1: Reporter : Fong-Ruei , Li

2009/6/22 1

BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection

Reporter : Fong-Ruei , Li

Machine Learning and Bioinformatics Lab

Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. In Proceedings of the 17th USENIX Security Symposium (Security'08), San Jose, CA, 2008.

Page 2: Reporter : Fong-Ruei , Li

Outline

Introduction
BotMiner: Detection Framework
    Problem statement
    Architecture overview
Experiments
Conclusion

Page 3: Reporter : Fong-Ruei , Li

Introduction

Botnets are becoming one of the most serious threats to Internet security, being used for attacks such as spam, DDoS, …

A botnet is a network of compromised machines under the influence of malware code.

(Diagram labels: Bot, BotMaster)

Page 4: Reporter : Fong-Ruei , Li

Introduction

Most current botnet detection approaches work only on a specific:
    Botnet command and control (C&C) protocol, e.g., IRC
    Structure, e.g., centralized

Page 5: Reporter : Fong-Ruei , Li

Introduction

Almost all of these approaches are designed for detecting botnets that use IRC- or HTTP-based C&C:
    Rishi is designed to detect IRC botnets using known bot nickname patterns as signatures
    Another recent system, BotSniffer, is designed for detecting C&C activities with centralized servers

Page 6: Reporter : Fong-Ruei , Li

Introduction

We need to develop a next-generation botnet detection system that is independent of both the C&C protocol and the botnet structure.

Page 7: Reporter : Fong-Ruei , Li

Problem Statement

A botnet is characterized by:
    C&C communication channel
    Malicious activities

Botnet structure:
    Centralized
    P2P

Page 8: Reporter : Fong-Ruei , Li

Assumptions

We assume that bots within the same botnet will be characterized by similar malicious activities and similar C&C communications


Page 9: Reporter : Fong-Ruei , Li

Architecture overview

[Architecture diagram not preserved in the transcript. Callouts on the diagram:]
    Clustering similar malicious activities
    Clustering similar communication
    Cross-checking

Page 10: Reporter : Fong-Ruei , Li

C-plane Monitor

The C-plane monitor captures network flows and records information on who is talking to whom.

We limit our interest to TCP and UDP flows. Each flow record contains the following information:
    Time, duration
    IP address and port (source and destination)
    Number of packets
    Bytes transferred
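As an illustration, the flow record described above could be modeled as follows. This is only a sketch: the class name, the field names, and the explicit `protocol` field are assumptions made for this example, not the record layout used by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One C-plane flow record (hypothetical field names)."""
    start_time: float        # flow start time, in seconds
    duration: float          # flow duration, in seconds
    protocol: str            # assumed field: "TCP" or "UDP"
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    packets: int             # number of packets in the flow
    bytes_transferred: int   # bytes transferred in the flow


def is_monitored(flow: FlowRecord) -> bool:
    """The C-plane monitor limits its interest to TCP and UDP flows."""
    return flow.protocol in ("TCP", "UDP")


# Example record with made-up values (documentation address ranges).
example = FlowRecord(0.0, 2.5, "TCP", "10.0.0.5", 40000,
                     "203.0.113.7", 6667, 12, 3400)
```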

Page 11: Reporter : Fong-Ruei , Li

A-plane Monitor

The A-plane monitor logs information on who is doing what.

It analyzes outbound traffic through the monitored network and detects several malicious activities that the internal hosts may perform.

Page 12: Reporter : Fong-Ruei , Li

C-plane Clustering

Responsible for:
    Reading the logs generated by the C-plane monitor
    Finding clusters of machines that share similar communication patterns

Page 13: Reporter : Fong-Ruei , Li

C-plane Clustering-Flow Chart

[Flow chart not preserved in the transcript. Callout on the chart: filter out irrelevant traffic flows]

Page 14: Reporter : Fong-Ruei , Li

C-plane Clustering-Basic Filtering

Filter Rule 1 (F1): Ignore the flows that are not directly from internal hosts to external hosts

Filter Rule 2 (F2): Ignore the flows that only contain one-way traffic

Page 15: Reporter : Fong-Ruei , Li

C-plane Clustering-White Listing

Filter Rule 3 (F3): Ignore the flows whose destinations are well known as legitimate servers, e.g., Google, Yahoo!
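The three filter rules F1 to F3 can be sketched as a single predicate. Everything concrete here is an assumption for illustration: the dictionary keys, the per-direction packet counters used to detect one-way traffic, the internal network range, and the whitelist entry are all hypothetical.

```python
import ipaddress

# Hypothetical monitored (internal) network and whitelist of legitimate servers.
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")
WHITELIST = {"198.51.100.10"}


def is_internal(ip: str) -> bool:
    """True if the address belongs to the monitored network."""
    return ipaddress.ip_address(ip) in INTERNAL_NET


def passes_filters(flow: dict) -> bool:
    """Return True if the flow survives filter rules F1, F2 and F3."""
    # F1: keep only flows going from an internal host to an external host.
    if not is_internal(flow["src_ip"]) or is_internal(flow["dst_ip"]):
        return False
    # F2: drop flows that only contain one-way traffic.
    if flow["packets_out"] == 0 or flow["packets_in"] == 0:
        return False
    # F3: drop flows whose destination is a well-known legitimate server.
    if flow["dst_ip"] in WHITELIST:
        return False
    return True
```

For example, a flow from `10.0.0.5` to `203.0.113.7` with packets in both directions passes, while the same flow sent to the whitelisted address is dropped.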

Page 16: Reporter : Fong-Ruei , Li

C-plane Clustering-Aggregation (C-Flow)

Aggregate related flows into communication flows (C-flows).

Given a time period, all $m$ TCP/UDP flows that share the same protocol, source IP, destination IP and port are aggregated into the same C-flow:

$$c_i = \{f_j\}_{j=1..m}$$
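The aggregation step amounts to grouping flows by a key. The sketch below assumes that the input list already contains only the flows of one time period, that "port" means the destination port, and that flows are dictionaries with the hypothetical keys shown.

```python
from collections import defaultdict


def aggregate_c_flows(flows):
    """Group flows that share protocol, source IP, destination IP and
    destination port into C-flows. Returns {key: [flow, ...]}."""
    c_flows = defaultdict(list)
    for f in flows:
        key = (f["protocol"], f["src_ip"], f["dst_ip"], f["dst_port"])
        c_flows[key].append(f)
    return dict(c_flows)
```

Each value of the returned dictionary plays the role of one C-flow $c_i$, and its length is the $m$ of the definition above.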

Page 17: Reporter : Fong-Ruei , Li

C-plane Clustering-Vector representation

Extract a number of statistical features from each C-flow $c_i$.

Translate them into $d$-dimensional pattern vectors: $\vec{p}_i \in \mathbb{R}^d$

Page 18: Reporter : Fong-Ruei , Li

C-plane Clustering-Vector representation

Discrete sample distributions of four random variables:

1. The number of flows per hour (fph).
   fph is computed by counting the number of TCP/UDP flows in $c_i$ that are present for each hour of the epoch E.

2. The number of packets per flow (ppf).
   ppf is computed by summing the total number of packets sent within each TCP/UDP flow in $c_i$.

Page 19: Reporter : Fong-Ruei , Li

C-plane Clustering-Vector representation

3. The average number of bytes per packet (bpp).
   For each TCP/UDP flow $f_j \in c_i$, we divide the overall number of bytes transferred within $f_j$ by the number of packets sent within $f_j$.

4. The average number of bytes per second (bps).
   bps is computed as the total number of bytes transferred within each $f_j \in c_i$ divided by the duration of $f_j$.
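The four sample distributions (fph, ppf, bpp, bps) can be computed from one C-flow roughly as follows. This is a sketch under stated assumptions: flows are dictionaries with hypothetical keys, the epoch length of 24 hours is a parameter chosen for the example, and a flow is counted only in the hour in which it starts, which simplifies "present for each hour".

```python
def c_flow_samples(c_flow, epoch_start, epoch_hours=24):
    """Return the four discrete sample distributions of one C-flow."""
    # fph: number of flows per hour of the epoch (counted by start hour).
    fph = [0] * epoch_hours
    for f in c_flow:
        hour = int((f["start_time"] - epoch_start) // 3600)
        if 0 <= hour < epoch_hours:
            fph[hour] += 1
    # ppf: number of packets in each flow.
    ppf = [f["packets"] for f in c_flow]
    # bpp: bytes divided by packets, per flow (skip empty flows).
    bpp = [f["bytes"] / f["packets"] for f in c_flow if f["packets"] > 0]
    # bps: bytes divided by duration, per flow (skip zero-length flows).
    bps = [f["bytes"] / f["duration"] for f in c_flow if f["duration"] > 0]
    return {"fph": fph, "ppf": ppf, "bpp": bpp, "bps": bps}
```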

Page 20: Reporter : Fong-Ruei , Li

C-plane Clustering-Vector representation

13 intervals: $[0, k_1], (k_1, k_2], \ldots, (k_{12}, \infty)$

Quantiles: $q_{5\%}, q_{10\%}, q_{15\%}, q_{20\%}, q_{25\%}, q_{30\%}, q_{40\%}, q_{50\%}, q_{60\%}, q_{70\%}, q_{80\%}, q_{90\%}$

The quantile $q_{l\%}$ of a random variable $X$ is the value $q$ for which $P(X < q) = l\%$.
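One way to read the binning above: the 12 quantiles serve as the interval boundaries $k_1, \ldots, k_{12}$, each of the four distributions becomes a 13-bin histogram, and concatenating the four histograms gives a 4 × 13 = 52-dimensional vector. The sketch below follows that reading. The nearest-rank quantile estimate, the use of raw (unnormalized) counts, and all function names are assumptions of this example.

```python
from bisect import bisect_left

QUANTILE_LEVELS = [5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90]
FEATURES = ("fph", "ppf", "bpp", "bps")


def bin_edges(all_values):
    """Estimate k_1..k_12 as nearest-rank quantiles of the overall sample."""
    s = sorted(all_values)
    n = len(s)
    edges = []
    for level in QUANTILE_LEVELS:
        rank = max(1, (level * n + 99) // 100)  # integer ceiling of level% * n
        edges.append(s[rank - 1])
    return edges


def histogram(sample, edges):
    """Count values in [0, k1], (k1, k2], ..., (k12, inf): 13 bins."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        counts[bisect_left(edges, x)] += 1  # index = number of edges below x
    return counts


def pattern_vector(samples, edges_by_feature):
    """Concatenate the four 13-bin histograms into one 52-dimensional vector."""
    vec = []
    for name in FEATURES:
        vec.extend(histogram(samples[name], edges_by_feature[name]))
    return vec
```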

Page 21: Reporter : Fong-Ruei , Li

C-plane Clustering-Two-step clustering

[Slide image not preserved in the transcript]

Page 22: Reporter : Fong-Ruei , Li

C-plane Clustering-Two-step clustering

First step:
    Data set: $D = \{\vec{p}_i = F(c_i)\}_{i=1..n}$
    Coarse-grained clustering on a reduced feature space $\mathbb{R}^{d'}$, with $d' < d$: the $d = 52$ features are reduced to $d' = 8$ features
    X-means clustering algorithm
    The result is a set of clusters $\{C'_i\}_{i=1..\gamma_1}$

Page 23: Reporter : Fong-Ruei , Li

C-plane Clustering-Two-step clustering

Second step:
    We use all the $d = 52$ available features to represent the C-flows
    X-means clustering algorithm
    The result is a set of clusters $\{C''_i\}_{i=1..\gamma_2}$
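The control flow of the two steps can be sketched as below. Several things here are assumptions rather than the authors' method: a simple "leader" clustering stands in for X-means (which is not implemented here), the 8 reduced features are taken to be the mean and variance of each of the four distributions, the second step is applied separately inside each coarse cluster, and the dictionary keys `reduced` and `vector` are hypothetical.

```python
import math
from statistics import mean, pvariance

FEATURES = ("fph", "ppf", "bpp", "bps")


def reduced_features(samples):
    """Assumed 8-dimensional reduction: mean and variance per feature."""
    out = []
    for name in FEATURES:
        out += [mean(samples[name]), pvariance(samples[name])]
    return out


def leader_cluster(items, key, radius):
    """Stand-in for X-means: each item joins the first cluster whose
    leader vector is within `radius`, otherwise it starts a new cluster."""
    clusters = []  # list of (leader_vector, members)
    for item in items:
        v = key(item)
        for leader, members in clusters:
            if math.dist(leader, v) <= radius:
                members.append(item)
                break
        else:
            clusters.append((v, [item]))
    return [members for _, members in clusters]


def two_step(c_flows, coarse_radius, fine_radius):
    """Step 1: coarse clustering on reduced features.
    Step 2: refine each coarse cluster using the full feature vectors."""
    fine = []
    for coarse in leader_cluster(c_flows, lambda c: c["reduced"], coarse_radius):
        fine.extend(leader_cluster(coarse, lambda c: c["vector"], fine_radius))
    return fine
```

The point of the split is cost: the cheap low-dimensional pass partitions the data so that the more expensive full-dimensional pass only runs on smaller groups.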

Page 24: Reporter : Fong-Ruei , Li

A-plane Clustering


Page 25: Reporter : Fong-Ruei , Li

Cross-plane Correlation

The idea is to cross-check clusters in the two planes to find intersections that indicate a host is part of a botnet.

To do this, we compute a botnet score $s(h)$ for each host $h$.

Page 26: Reporter : Fong-Ruei , Li

Cross-plane Correlation – botnet score

Let $H$ be the set of hosts reported in the output of the A-plane clustering, and $h \in H$.

Let $A^{(h)} = \{A_i\}_{i=1..m_h}$ be the set of $m_h$ A-clusters that contain $h$.

Let $C^{(h)} = \{C_i\}_{i=1..n_h}$ be the set of $n_h$ C-clusters that contain $h$.

$$s(h) = \sum_{\substack{i,j \\ j>i \\ t(A_i) \neq t(A_j)}} w(A_i)\,w(A_j)\,\frac{|A_i \cap A_j|}{|A_i \cup A_j|} + \sum_{i,k} w(A_i)\,\frac{|A_i \cap C_k|}{|A_i \cup C_k|}$$

where $A_i, A_j \in A^{(h)}$ and $C_k \in C^{(h)}$;
$t(A_i)$ is the type of activity cluster $A_i$ (e.g., scanning or spamming);
$w(A_i) \geq 1$ is an activity weight assigned to $A_i$:
    higher value of $w(A_i)$: strong activity (e.g., spam or exploit)
    lower value of $w(A_i)$: weak activity (e.g., scanning or binary download)
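The botnet score can be computed directly from the cluster memberships. In the sketch below, the data layout (A-clusters as `(activity_type, set_of_hosts)` pairs, C-clusters as sets of hosts) and the default weight of 1 for every activity type are assumptions made for this example.

```python
def overlap(a, b):
    """|a ∩ b| / |a ∪ b| for two sets of hosts."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def botnet_score(h, a_clusters, c_clusters, weight=lambda activity_type: 1.0):
    """Compute s(h) from the A-clusters and C-clusters that contain h."""
    A_h = [(t, hosts) for t, hosts in a_clusters if h in hosts]
    C_h = [hosts for hosts in c_clusters if h in hosts]
    score = 0.0
    for i, (t_i, A_i) in enumerate(A_h):
        # First sum: pairs of A-clusters of different activity types, j > i.
        for t_j, A_j in A_h[i + 1:]:
            if t_i != t_j:
                score += weight(t_i) * weight(t_j) * overlap(A_i, A_j)
        # Second sum: every A-cluster against every C-cluster containing h.
        for C_k in C_h:
            score += weight(t_i) * overlap(A_i, C_k)
    return score
```

As a worked case with unit weights: if `h1` is in a scan cluster `{h1, h2}`, a spam cluster `{h1, h2, h3}` and a C-cluster `{h1, h2}`, the score is 2/3 + 1 + 2/3 = 7/3.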

Page 27: Reporter : Fong-Ruei , Li

Cross-plane Correlation - similarity

Let $B \subseteq H$ be the set of detected bots.

Let $A^{(B)} = \{A_i\}_{i=1..m_B}$ be the set of A-clusters that each contain at least one bot $h \in B$.

Let $C^{(B)} = \{C_i\}_{i=1..n_B}$ be the set of C-clusters that each contain at least one bot $h \in B$.

Let $K^{(B)} = A^{(B)} \cup C^{(B)} = \{K^{(B)}_i\}_{i=1..(m_B+n_B)}$ be an ordered union/set of A- and C-clusters.

We then describe each bot $h \in B$ as a binary vector $b(h) \in \{0, 1\}^{|K^{(B)}|}$, whereby the $i$-th element $b_i = 1$ if $h \in K^{(B)}_i$, and $b_i = 0$ otherwise.
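The binary vector $b(h)$ can be built as a membership test against the ordered list of clusters. This sketch assumes clusters are plain sets of host names and that A-clusters are listed before C-clusters; the function name is hypothetical.

```python
def bot_vectors(bots, a_clusters, c_clusters):
    """Return {bot: b(h)}, where b(h)[i] is 1 if the bot is in the i-th
    cluster of the ordered list K = A-clusters followed by C-clusters,
    keeping only clusters that contain at least one detected bot."""
    A_B = [cluster for cluster in a_clusters if cluster & bots]
    C_B = [cluster for cluster in c_clusters if cluster & bots]
    K_B = A_B + C_B
    return {h: [1 if h in K else 0 for K in K_B] for h in sorted(bots)}
```

For instance, with bots `{h1, h2}`, A-clusters `[{h1, h2}, {h3}]` and C-clusters `[{h1}, {h2, h4}]`, the cluster `{h3}` is dropped and the vectors have length 3.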

Page 28: Reporter : Fong-Ruei , Li

Cross-plane Correlation - similarity

We define the following similarity between bots $h_i$ and $h_j$:

[Equation not preserved in the transcript]

where $b^{(i)} = b(h_i)$ and $b^{(j)} = b(h_j)$, and $I(X)$ is the indicator function, which equals one when $X$ is true and zero when $X$ is false.

Page 29: Reporter : Fong-Ruei , Li

Cross-plane Correlation - similarity


Page 30: Reporter : Fong-Ruei , Li

Setup and Collection

We set up traffic monitors on a router in the campus network of the College of Computing at Georgia Tech.

We ran the C-plane and A-plane monitors for a continuous 10-day period in late 2007.

Page 31: Reporter : Fong-Ruei , Li

Setup and Collection

[Slide image not preserved in the transcript. Callouts on the image:]
    Generated by executing modified bot code
    Generated based on Web-based C&C communication
    A real-world trace containing two P2P botnets

Page 32: Reporter : Fong-Ruei , Li

Evaluation Results

[Slide image not preserved in the transcript. Callouts on the image: Filtration, Aggregation]

Page 33: Reporter : Fong-Ruei , Li

Evaluation Results

[Slide image not preserved in the transcript. Callout on the image: Two-step clustering]

Page 34: Reporter : Fong-Ruei , Li

Evaluation Results


Page 35: Reporter : Fong-Ruei , Li

Conclusion

We proposed a novel network anomaly-based botnet detection system that is independent of the protocol and structure used by the botnet.

Page 36: Reporter : Fong-Ruei , Li

Thank you for listening

The end