borges rprojectcs691y

EMPIRICAL METHODS CS691Y

SLIDES BY: RAYMOND BORGES

Profiling Creepy Crawlers on the World Wide Web

Outline

I. Introduction

II. Related Work

III. Research Problem

IV. Methodology

V. Results

VI. Conclusion

VII. Future Work

VIII. References

Introduction

Explore the web attack space and profile hacker/bot strategies

To better design intrusion detection systems

To better design our applications and networks

Introduction

Threat Space

Countries whose web resources are seeded with malware

◦ Spam Po


http://www.securelist.com/en/analysis/204792231/IT_Threat_Evolution_Q1_2012

Introduction

Botnets

Brute-forcing services on remote machines

Worms to recruit more bots

Wiki/Blog spam posting?

Related Work

Machine learning for intrusion detection has been explored

Still actively explored but mostly signature based

Characterization of crawlers has been explored

Needed: better solutions for automated attacks and malicious bots


Related Work

http://www.sicherheitstacho.eu/

Research Questions

RQ1: Identify patterns in logs without manually labeling data?

Find anomalies in the data

Anomalies: things that don’t follow the normal trend

RQ2: What patterns can we identify that may be significant?

Find which feature(s) are good predictors of those anomalies

For example, a specific character, a set of characters, or request length

Methodology

1. Gather Data

2. Pre-processing

3. Data Mining

4. Result Analysis

5. Result Validation

6. Conclusions

Methodology

Gathering Data

2 Sensors

Apache Logs

Methodology

Pre-processing Log Data

1. Gather data (concatenate Apache logs)

2. Remove local WVU traffic

(Nagios and local IPs: 157.182.*, 10.10.150.4)

3. Add a session number identifier per IP

(30-minute threshold)

4. Clean data (deal with missing values, errors in fields)

5. Extract features

6. Format for data mining (csv, arff)
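Steps 2 and 3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in the project: it assumes log records are already parsed into (ip, timestamp) pairs, that the 30-minute threshold is an inactivity gap between consecutive requests from the same IP, and that local traffic is identified by the `157.182.` prefix and the address `10.10.150.4`. The names `is_local` and `assign_sessions` are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # threshold from step 3
LOCAL_PREFIXES = ("157.182.",)       # assumed reading of the local WVU range
LOCAL_ADDRESSES = {"10.10.150.4"}


def is_local(ip):
    """Return True for traffic that step 2 would discard."""
    return ip in LOCAL_ADDRESSES or ip.startswith(LOCAL_PREFIXES)


def assign_sessions(records):
    """Label each (ip, timestamp) record with a per-IP session number.

    A new session starts when the same IP has been silent for more than
    SESSION_GAP. Returns (ip, timestamp, session) tuples in time order.
    """
    last_seen = {}   # ip -> timestamp of its previous request
    session_no = {}  # ip -> current session number, starting at 1
    labeled = []
    for ip, ts in sorted(records, key=lambda r: r[1]):
        if is_local(ip):
            continue
        if ip not in last_seen:
            session_no[ip] = 1
        elif ts - last_seen[ip] > SESSION_GAP:
            session_no[ip] += 1
        last_seen[ip] = ts
        labeled.append((ip, ts, session_no[ip]))
    return labeled
```

Measuring the gap from the previous request rather than from the start of the session is a design choice of this sketch; the slides do not say which definition was used.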

Methodology

Feature Selection

Attempting to use non-conventional features

Low-level features obtained directly (faster)

Assuming Learners will discover underlying pattern(s)

Methodology

List of features extracted (69 total):

1. Individual counts for every letter, case-insensitive [a-z, A-Z] (26)

◦ For example, for the string “aaaAA”, a = 5

2. Individual counts for every digit [0-9] (10)

◦ For example, for the string “33”, numberThree = 2

3. Individual counts for every symbol (28)

[:;,”!@#$%^&*()-_+={}[]\/?.~`]

◦ For example, for “$$$”, moneySymbol = 3
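The character-count features above can be sketched in a few lines. This is an illustrative sketch, not the extraction code used in the project: the function name `char_count_features` and the feature key names are hypothetical, and the straight double quote in `SYMBOLS` is an assumed stand-in for the curly quote that appears in the slide text.

```python
import string
from collections import Counter

# The 28 symbols listed on the slide. The straight double quote replaces
# the curly quote in the slide text (an assumption about the original list).
SYMBOLS = ":;,\"!@#$%^&*()-_+={}[]\\/?.~`"


def char_count_features(request):
    """Return 64 counts: 26 letters (case-insensitive), 10 digits, 28 symbols."""
    counts = Counter(request)
    features = {}
    for ch in string.ascii_lowercase:
        # Upper and lower case are merged, so "aaaAA" gives a = 5.
        features["letter_" + ch] = counts[ch] + counts[ch.upper()]
    for ch in string.digits:
        features["digit_" + ch] = counts[ch]
    for ch in SYMBOLS:
        features["symbol_" + ch] = counts[ch]
    return features
```

For instance, `char_count_features("aaaAA")["letter_a"]` reproduces the a = 5 example from the list above.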

Methodology

More features extracted:

1. HTTP request length in characters (1)

2. IP address (1), nominal

3. HTTP server response code (1), nominal

4. Bytes returned from server (1)

5. User agent (1), nominal
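As a rough illustration, these five features could be pulled from a single log line as below. The sketch assumes the Apache "combined" log format (the slides only say Apache logs and cite the NCSA format), measures request length over the whole quoted request line, and does not handle escaped quotes inside fields. `LOG_PATTERN`, `request_features`, and the dictionary keys are hypothetical names.

```python
import re

# Apache "combined" log format (assumed); the referrer and user agent
# fields are optional so that "common" format lines also match.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def request_features(line):
    """Return the five per-request features, or None for a malformed line."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # left for the data-cleaning step
    size = m.group("bytes")
    return {
        "HttpRequestLength": len(m.group("request")),
        "IP": m.group("ip"),                  # nominal
        "ResponseCode": m.group("status"),    # nominal, kept as a string
        "BytesReturned": 0 if size == "-" else int(size),
        "UserAgent": m.group("agent") or "",  # nominal
    }
```

Keeping the response code as a string rather than an integer reflects its nominal role: 404 is a category, not a quantity.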

Methodology

Some machine learning and visualization in Weka

1. Observe visualizations looking for correlations

2. Run OneR learner (various class variables)

3. Discretize Numeric attributes

4. Repeat above

5. Reach conclusions
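To make step 2 concrete, the idea behind a OneR learner can be sketched as follows: for each attribute, build a rule that maps every attribute value to its majority class, then keep the attribute whose rule makes the fewest errors. This is a minimal sketch of that idea, not Weka's implementation; it assumes nominal or already discretized attributes, reports accuracy on the training data only, and uses the hypothetical name `one_r`.

```python
from collections import Counter, defaultdict


def one_r(rows, class_name):
    """Pick the single attribute whose one-level rule best predicts class_name.

    rows is a list of dicts with nominal (or already discretized) values.
    Returns (attribute, rule, training_accuracy).
    """
    best = None
    attributes = [a for a in rows[0] if a != class_name]
    for attr in attributes:
        # Count class labels for each value of this attribute.
        table = defaultdict(Counter)
        for row in rows:
            table[row[attr]][row[class_name]] += 1
        # Rule: each attribute value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in table.items()}
        correct = sum(c.most_common(1)[0][1] for c in table.values())
        accuracy = correct / len(rows)
        if best is None or accuracy > best[2]:
            best = (attr, rule, accuracy)
    return best
```

Running it once per class variable (response code, user agent, IP address) mirrors the "various class variables" in step 2. Weka's own OneR differs in details such as its handling of numeric attributes, which is one reason the discretization in step 3 matters for a sketch like this.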

Methodology

[Figure: Percent Symbols vs. HttpRequestLength]

Methodology

[Figure: BackSlash vs. HttpRequestLength]

Results

OneR Learner Results (columns grouped by class variable)

                      Response Code                   User Agent                      IP Address
Dataset               Accuracy  Attribute             Accuracy  Attribute             Accuracy  Attribute
Advertised Server     99.1%     Http Request Length   64%       IP                    29.2%     User Agent
Unadvertised Server   99.3%     Http Request Length   82.75%    Http Request Length   13%       User Agent

Results

Log Characteristics                    Web Advertised   Web Unadvertised
Http Requests                          23055            14234
Unique IPs                             3394             1554
Unique IP/Sessions                     11341            12134
Unique Http requests                   2630             620
Unique referrer link with UA           1864             788
Unique Referred links                  410              132
Unique User Agents                     768              374
Unique IP/UserAgents                   5245             2009
IPs that don’t report themselves       77.7% (2636)     89% (1382)
IPs that do state themselves as bots   22.3% (758)      11% (172)
Unique IPs present in both logs        8.6% (428)

Results

Log Composition Statistics   Apache Web Advertised   Apache Web Unadvertised
String “/Wiki” in requests   65.3%                   7.7%
String “/Blog” in requests   2.2%                    79.3%
Total                        67.5%                   87%

Results

Top Countries visiting sensors

Rank   Advertised Web Server   Unadvertised Web Server
1      United States           China
2      China                   United States
3      Malaysia                Russian Federation
4      France                  Saudi Arabia
5      Germany                 France
6      Pakistan                Hong Kong
7      Philippines             Brazil
8      United Kingdom          Thailand
9      Poland                  Ukraine
10     Sweden                  Malaysia

Results

[Figure: Advertised Web Server Traffic Distribution by IP]

Results

[Figure: Unadvertised Web Server Traffic Distribution by IP]

Conclusion

RQ1: Identify patterns in logs without manually labeling data?

Common patterns found

Need validation with labeled attack data

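Conclusion

RQ2: What patterns can we identify that may be significant?

Specific symbols such as “=” and “%”, e.g. an encoded request that translates to:

RAPPELZ」大型アップデートEPIC6第2章「黄金の君主」の情報が公開

(In English: “Information released on the RAPPELZ major update EPIC6, Chapter 2, ‘The Golden Monarch’”)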
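One way to see why “%” can stand out as a feature: when non-ASCII text such as the Japanese string on this slide is percent-encoded for a URL, every byte of every multibyte character becomes a `%XX` triplet, so the count of “%” grows with the length of the request. The sketch below uses Python's standard `urllib.parse.quote` and `unquote`; whether the logged requests were encoded exactly this way is an assumption, and `percent_profile` is a hypothetical helper name.

```python
from urllib.parse import quote, unquote


def percent_profile(text):
    """Percent-encode text and report how many '%' symbols result.

    Returns (encoded_text, percent_count).
    """
    encoded = quote(text)  # UTF-8 bytes of non-ASCII characters become %XX
    return encoded, encoded.count("%")


slide_text = "RAPPELZ」大型アップデートEPIC6第2章「黄金の君主」の情報が公開"
encoded, percent_count = percent_profile(slide_text)

# Decoding the request recovers the readable text again.
assert unquote(encoded) == slide_text
print(len(slide_text), len(encoded), percent_count)
```

Plain ASCII such as "EPIC6" passes through unchanged, so a request dominated by “%” is a hint that it carries encoded non-ASCII payload.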

Future Work

Compare low-level and high-level feature prediction models

Compare results with production servers

Find good predictors for features with anomalies

References

NCSA log format: http://publib.boulder.ibm.com/iseries/v5r2/ic2924/info/rzaie/rzaielogformat.htm

Identifying Agents: http://www.jafsoft.com/searchengines/spider_hunting.html
