EMPIRICAL METHODS CS691Y SLIDES BY: RAYMOND BORGES Profiling Creepy Crawlers on the World Wide Web

Upload: rayborg

Post on 29-Nov-2014

DESCRIPTION

Empirical Methods final presentation

TRANSCRIPT

Page 1: Borges rprojectcs691y

EMPIRICAL METHODS CS691Y

SLIDES BY: RAYMOND BORGES

Profiling Creepy Crawlers on the World Wide Web

Page 2: Borges rprojectcs691y

Outline

I. Introduction

II. Related Work

III. Research Problem

IV. Methodology

V. Results

VI. Conclusion

VII. Future Work

VIII. References

Page 3: Borges rprojectcs691y

Introduction

Explore the web attack space and profile hacker/bot strategies

To better design intrusion detection systems

To better design our applications and networks

Page 4: Borges rprojectcs691y

Introduction

Threat Space

Countries whose web resources are seeded with malware

◦ Spam Po

http://www.securelist.com/en/analysis/204792231/IT_Threat_Evolution_Q1_2012

Page 5: Borges rprojectcs691y

Introduction

Botnets

Brute-forcing services on remote machines

Worms to recruit more bots

Wiki/Blog spam posting?

Page 6: Borges rprojectcs691y

Related Work

Machine learning for intrusion detection has been explored

Still actively explored but mostly signature based

Characterization of crawlers has been explored

Needed

Better solutions for automated attacks and malicious bots

Page 7: Borges rprojectcs691y

Related Work

http://www.sicherheitstacho.eu/

Page 8: Borges rprojectcs691y

Research Questions

RQ1: Identify patterns in logs without manually labeling data?

Find anomalies in the data

Anomalies: things that don’t follow the normal trend

RQ2: What patterns can we identify that may be significant?

Find what feature(s) are good predictors for those anomalies

For example, a specific character or set of characters, or request lengths

Page 9: Borges rprojectcs691y

Methodology

1. Gather Data

2. Pre-processing

3. Data Mining

4. Result Analysis

5. Result Validation

6. Conclusions

Page 10: Borges rprojectcs691y

Methodology

Gathering Data

2 Sensors

Apache Logs

Page 11: Borges rprojectcs691y

Methodology

Pre-processing Log Data

1. Gather Data (concatenate apache logs)

2. Remove local WVU traffic

(Nagios and local IPs: 157.182. addresses, 10.10.150.4)

3. Add session number identifier per IP

(30 minutes threshold)

4. Clean data (deal with missing values, errors in fields)

5. Extract features

6. Format for data mining (csv, arff)
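Steps 2 and 3 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scripts used for the study: it assumes log lines arrive in chronological order, that the timestamp follows the usual Apache layout (`10/Oct/2012:13:55:36 -0400`), and that a new session starts when the gap since the same IP's previous request exceeds 30 minutes. The names `assign_sessions`, `is_local`, `LOCAL_PREFIX`, and `LOCAL_HOSTS` are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Prefix of an Apache access-log line: client IP, identity, user, [timestamp]
LOG_PREFIX = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')
SESSION_GAP = timedelta(minutes=30)  # threshold stated on the slide

LOCAL_PREFIX = "157.182."      # local prefix as written on the slide
LOCAL_HOSTS = {"10.10.150.4"}  # single local address named on the slide


def is_local(ip):
    """True for traffic that step 2 removes (assumed matching rule)."""
    return ip.startswith(LOCAL_PREFIX) or ip in LOCAL_HOSTS


def assign_sessions(lines):
    """Return (ip, session_number, line) for each parseable, non-local line."""
    last_seen = {}   # ip -> timestamp of that IP's previous request
    session_no = {}  # ip -> current session number, starting at 1
    out = []
    for line in lines:
        m = LOG_PREFIX.match(line)
        if m is None:
            continue  # malformed line: dropped, as in the cleaning step
        ip = m.group(1)
        if is_local(ip):
            continue
        ts = datetime.strptime(m.group(2), "%d/%b/%Y:%H:%M:%S %z")
        if ip not in last_seen:
            session_no[ip] = 1
        elif ts - last_seen[ip] > SESSION_GAP:
            session_no[ip] += 1  # gap longer than 30 minutes: new session
        last_seen[ip] = ts
        out.append((ip, session_no[ip], line))
    return out
```

A gap of exactly 30 minutes stays in the same session here; whether the original pre-processing treated the boundary that way is not stated in the slides.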

Page 12: Borges rprojectcs691y

Methodology

Feature Selection

Attempting to use non-conventional features

Low-level features obtained directly (faster)

Assuming Learners will discover underlying pattern(s)

Page 13: Borges rprojectcs691y

Methodology

List of features extracted (69 total):

1. Individual counts for every letter [a-z, A-Z], case-insensitive (26)

◦ For example for the string “aaaAA”, a = 5

2. Individual counts for every number [0-9] (10)

◦ For example for the string “33”, numberThree = 2

3. Individual counts for every symbol (28)

[:;,”!@#$%^&*()-_+={}[]\/?.~`]

◦ For example for “$$$”, moneySymbol = 3
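The three groups of character counts above can be computed directly from a request string. The sketch below is a minimal illustration, not the original extraction script: the feature names (`letter_a`, `digit_3`, `symbol_$`) are hypothetical stand-ins for names such as `numberThree` and `moneySymbol`, and the curly quote in the slide's symbol list is assumed to stand for a plain double quote.

```python
import string

LETTERS = string.ascii_lowercase            # 26 case-insensitive letter counts
DIGITS = string.digits                      # 10 digit counts
SYMBOLS = ":;,\"!@#$%^&*()-_+={}[]\\/?.~`"  # the 28 symbols listed on the slide


def char_count_features(request):
    """Map one request string to 64 character-count features."""
    features = {}
    lowered = request.lower()  # "aaaAA" counts as a = 5
    for ch in LETTERS:
        features["letter_" + ch] = lowered.count(ch)
    for ch in DIGITS:
        features["digit_" + ch] = request.count(ch)
    for ch in SYMBOLS:
        features["symbol_" + ch] = request.count(ch)
    return features
```

For instance, `char_count_features("aaaAA")["letter_a"]` is 5, matching the slide's example. The 64 counts plus the five additional features bring the total to 69.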

Page 14: Borges rprojectcs691y

Methodology

More features extracted:

1. HTTP request length in characters (1)

2. IP address (1), nominal

3. HTTP server response code (1), nominal

4. Bytes returned from server (1)

5. User agent used (1), nominal
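These five features map onto fields of an Apache access-log line. The sketch below assumes the Combined Log Format (`%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"`) and measures request length over the whole quoted request line; the slides do not say which log format or which span of the request was used. The function name and dictionary keys are hypothetical.

```python
import re

# Combined Log Format: ip ident user [time] "request" status bytes "referer" "agent"
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def basic_features(line):
    """Extract the five non-count features, or None if the line does not parse."""
    m = COMBINED.match(line)
    if m is None:
        return None
    size = m.group("bytes")
    return {
        "httpRequestLength": len(m.group("request")),     # numeric
        "ip": m.group("ip"),                              # nominal
        "responseCode": m.group("status"),                # nominal, kept as text
        "bytesReturned": 0 if size == "-" else int(size), # "-" means no body
        "userAgent": m.group("agent"),                    # nominal
    }
```

The pattern does not handle requests that contain escaped double quotes; such lines return `None` and would need separate treatment in the cleaning step.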

Page 15: Borges rprojectcs691y

Methodology

Some machine learning and visualization in Weka

1. Observe visualizations looking for correlations

2. Run OneR learner (various class variables)

3. Discretize Numeric attributes

4. Repeat above

5. Reach conclusions
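For readers unfamiliar with OneR, the idea behind steps 2 and 3 can be sketched in Python. OneR builds one rule per attribute (each attribute value predicts its most frequent class) and keeps the attribute whose rule makes the fewest errors. The code below is a simplified illustration rather than Weka's implementation: it handles nominal attributes only, reports accuracy on the training data, and pairs with a plain equal-width discretizer whose bin count of 10 is an assumption. The names `one_r` and `equal_width_bins` are hypothetical.

```python
from collections import Counter, defaultdict


def one_r(rows, class_attr):
    """Return (best_attribute, rule, training_accuracy) for nominal data.

    rows is a list of dicts; rule maps each attribute value to a predicted class.
    """
    best = None
    for attr in rows[0]:
        if attr == class_attr:
            continue
        # Count the class labels seen with each value of this attribute.
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[class_attr]] += 1
        # One rule per value: predict the majority class for that value.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        correct = sum(c.most_common(1)[0][1] for c in by_value.values())
        accuracy = correct / len(rows)
        if best is None or accuracy > best[2]:
            best = (attr, rule, accuracy)
    return best


def equal_width_bins(values, bins=10):
    """Discretize numeric values into equal-width bin indices 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    return [min(int((v - lo) / width), bins - 1) for v in values]
```

Running `one_r` once per class variable (response code, user agent, IP address) yields one chosen attribute and one accuracy per class variable.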

Page 16: Borges rprojectcs691y

Methodology

Figure: Percent Symbols vs. HttpRequestLength

Page 17: Borges rprojectcs691y

Methodology

Figure: BackSlash vs. HttpRequestLength

Page 18: Borges rprojectcs691y

Results

OneR Learner Results

Datasets             Response Code                   User Agent                      IP Address
                     Accuracy  Attribute             Accuracy  Attribute             Accuracy  Attribute
Advertised Server    99.1%     Http Request Length   64%       IP                    29.2%     User Agent
Unadvertised Server  99.3%     Http Request Length   82.75%    Http Request Length   13%       User Agent

Page 19: Borges rprojectcs691y

Results

Log Characteristics Web Advertised Web Unadvertised

Http Requests 23055 14234

Unique IPs 3394 1554

Unique IP/Sessions 11341 12134

Unique Http requests 2630 620

Unique referrer link with UA 1864 788

Unique Referred links 410 132

Unique User Agents 768 374

Unique IP/UserAgents 5245 2009

IPs that do not identify themselves as bots 77.7% (2636) 89% (1382)

IPs that do identify themselves as bots 22.3% (758) 11% (172)

Unique IPs present in both logs 8.6% (428)
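The percentages in the last three rows can be re-derived from the counts in the table. The short check below assumes the 8.6% overlap figure uses the sum of both logs' unique IP counts (3394 + 1554) as its denominator; the slide does not state this, but the rounded result agrees under that assumption. The helper `pct` is hypothetical.

```python
def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


advertised_ips, unadvertised_ips = 3394, 1554  # unique IPs per log, from the table

undeclared_adv = pct(2636, advertised_ips)       # IPs not identifying as bots
declared_adv = pct(758, advertised_ips)          # IPs identifying as bots
undeclared_unadv = pct(1382, unadvertised_ips)
declared_unadv = pct(172, unadvertised_ips)

# Assumed denominator: sum of both unique-IP counts.
overlap = pct(428, advertised_ips + unadvertised_ips)
```

The unadvertised-server values come out as 88.9% and 11.1%, which the slide rounds to 89% and 11%.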

Page 20: Borges rprojectcs691y

Results

Log Composition Statistics   Apache Web Advertised   Apache Web Unadvertised
String “/Wiki” in requests   65.3%                   7.7%
String “/Blog” in requests   2.2%                    79.3%
Total                        67.5%                   87%

Page 21: Borges rprojectcs691y

Results

Rank Advertised Web Server Unadvertised Web Server

1 United States China

2 China United States

3 Malaysia Russian Federation

4 France Saudi Arabia

5 Germany France

6 Pakistan Hong Kong

7 Philippines Brazil

8 United Kingdom Thailand

9 Poland Ukraine

10 Sweden Malaysia

Top Countries visiting sensors

Page 22: Borges rprojectcs691y

Results

Figure: Advertised Web Server Traffic Distribution by IP

Page 23: Borges rprojectcs691y

Results

Figure: Unadvertised Web Server Traffic Distribution by IP

Page 24: Borges rprojectcs691y

Conclusion

RQ1: Identify patterns in logs without manually labeling data?

Common patterns found

Need validation with labeled attack data

Page 25: Borges rprojectcs691y

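Conclusion

RQ2: What patterns can we identify that may be significant?

Specific symbols such as “=” and “%”, e.g. an encoded request string that translates to:

RAPPELZ」大型アップデートEPIC6第2章「黄金の君主」の情報が公開

(English: information released on the major “RAPPELZ” update, EPIC 6 Chapter 2, “The Golden Monarch”)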
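One plausible link between the “%” symbol and text like the Japanese string above is URL percent-encoding, where every non-ASCII byte becomes a `%XX` triplet. The encoded request from the slide is not preserved in this transcript, so the sketch below builds its own example from a fragment of the decoded text; it illustrates the mechanism and is not the logged request.

```python
from urllib.parse import quote, unquote

# Fragment of the decoded text shown on the slide ("The Golden Monarch").
title = "黄金の君主"

encoded = quote(title)      # percent-encode the UTF-8 bytes of each character
decoded = unquote(encoded)  # reverse the encoding

# Each of the 5 characters is 3 bytes in UTF-8, so it becomes three %XX triplets.
percent_count = encoded.count("%")
```

Five characters of Japanese text become 45 characters containing 15 percent signs, so spam of this kind raises both the “%” count and the request length at the same time.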

Page 26: Borges rprojectcs691y

Future Work

Compare low-level and high-level feature prediction models

Compare results with production servers

Find good predictors for features with anomalies

Page 27: Borges rprojectcs691y

References

NCSA log format

http://publib.boulder.ibm.com/iseries/v5r2/ic2924/info/rzaie/rzaielogformat.htm

Identifying Agents

http://www.jafsoft.com/searchengines/spider_hunting.html
