Empirical Methods final presentation
EMPIRICAL METHODS CS691Y
SLIDES BY: RAYMOND BORGES
Profiling Creepy Crawlers on the World Wide Web
Outline
I. Introduction
II. Related Work
III. Research Problem
IV. Methodology
V. Results
VI. Conclusion
VII. Future Work
VIII. References
Introduction
Explore the web attack space and profile hacker/bot strategies
To better design intrusion detection systems
To better design our applications and networks
Introduction
Threat Space
Countries whose web resources are seeded with malware
◦ Spam Po
http://www.securelist.com/en/analysis/204792231/IT_Threat_Evolution_Q1_2012
Introduction
Botnets
Brute-forcing remote machines' services
Worms to recruit more bots
Wiki/Blog spam posting?
Related Work
Machine learning for intrusion detection has been explored
Still actively explored, but mostly signature-based
Characterization of crawlers has been explored
Needed:
Better solutions for automated attacks and malicious bots
Research Questions
RQ1: Can we identify patterns in logs without manually labeling data?
Find anomalies in the data
Anomalies: things that don’t follow the normal trend
RQ2: What patterns can we identify that may be significant?
Find which feature(s) are good predictors of those anomalies
A specific character or set of characters, or request lengths
Methodology
1. Gather Data
2. Pre-processing
3. Data Mining
4. Result Analysis
5. Result Validation
6. Conclusions
Methodology
Gathering Data
2 Sensors
Apache Logs
Methodology
Pre-processing Log Data
1. Gather Data (concatenate Apache logs)
2. Remove local WVU traffic
(Nagios and local 157.182.* IPs, 10.10.150.4)
3. Add session number identifier per IP
(30-minute threshold)
4. Clean data (deal with missing values, errors in fields)
5. Extract features
6. Format for data mining (CSV, ARFF)
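Steps 2 and 3 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script used for the project: the record layout (pairs of IP and timestamp), the helper names `is_local` and `assign_sessions`, and the demo addresses are hypothetical. Only the 157.182. prefix, the 10.10.150.4 address, and the 30-minute threshold come from the slide.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)   # threshold stated on the slide
LOCAL_PREFIX = "157.182."             # local WVU range named on the slide
LOCAL_HOSTS = {"10.10.150.4"}         # local address named on the slide


def is_local(ip):
    """Return True for traffic that step 2 would discard (assumed filter)."""
    return ip.startswith(LOCAL_PREFIX) or ip in LOCAL_HOSTS


def assign_sessions(records):
    """Yield (ip, timestamp, session_number) for non-local requests.

    `records` is an iterable of (ip, datetime) pairs. A new session starts
    for an IP when more than SESSION_GAP has passed since its last request.
    """
    last_seen = {}   # ip -> timestamp of its previous request
    session = {}     # ip -> current session number
    for ip, ts in sorted(records, key=lambda r: r[1]):
        if is_local(ip):
            continue
        if ip not in last_seen:
            session[ip] = 1
        elif ts - last_seen[ip] > SESSION_GAP:
            session[ip] += 1
        last_seen[ip] = ts
        yield ip, ts, session[ip]


if __name__ == "__main__":
    # Made-up demo records; 203.0.113.7 is a documentation-only address.
    t0 = datetime(2012, 4, 1, 12, 0)
    demo = [
        ("203.0.113.7", t0),
        ("203.0.113.7", t0 + timedelta(minutes=10)),
        ("203.0.113.7", t0 + timedelta(minutes=50)),
        ("157.182.1.1", t0),
    ]
    for row in assign_sessions(demo):
        print(row)
```

Keeping the session counter per IP mirrors the "session number identifier per IP" wording; a globally unique session ID would work equally well for counting IP/session pairs.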
Methodology
Feature Selection
Attempting to use non-conventional features
Low-level features obtained directly (faster)
Assuming learners will discover underlying pattern(s)
Methodology
List of features extracted (69 total):
1. Individual counts for every letter, case-insensitive [a-z, A-Z] (26)
◦ For example, for the string “aaaAA”, a = 5
2. Individual counts for every digit [0-9] (10)
◦ For example, for the string “33”, numberThree = 2
3. Individual counts for every symbol (28)
[:;,"!@#$%^&*()-_+={}[]\/?.~`]
◦ For example, for “$$$”, moneySymbol = 3
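The 64 character-count features listed above (26 letters, 10 digits, 28 symbols) could be computed along these lines. This is a sketch, not the extraction code used in the project: the function name `char_features` and the feature naming scheme (`letter_a`, `digit_3`, `symbol_$`) are hypothetical, and the slide's own names such as numberThree and moneySymbol are not reproduced.

```python
import string
from collections import Counter

LETTERS = string.ascii_lowercase            # 26 case-insensitive letter counts
DIGITS = string.digits                      # 10 digit counts
SYMBOLS = ":;,\"!@#$%^&*()-_+={}[]\\/?.~`"  # the 28 symbols listed on the slide


def char_features(request):
    """Return a dict of 64 character-count features for one request string."""
    counts = Counter(request)
    features = {}
    for ch in LETTERS:
        # Upper and lower case are folded together, as in the "aaaAA" example.
        features["letter_" + ch] = counts[ch] + counts[ch.upper()]
    for ch in DIGITS:
        features["digit_" + ch] = counts[ch]
    for ch in SYMBOLS:
        features["symbol_" + ch] = counts[ch]
    return features


if __name__ == "__main__":
    print(char_features("aaaAA")["letter_a"])
    print(char_features("33")["digit_3"])
    print(char_features("$$$")["symbol_$"])
```

The three calls at the bottom correspond to the three examples given on the slide.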
Methodology
More features extracted:
1. HTTP request length in characters (1)
2. IP address (1), nominal
3. HTTP server response code (1), nominal
4. Bytes returned from server (1)
5. User agent used (1), nominal
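As an illustration, the five features above can be pulled from a single Apache log line with a regular expression. The sketch assumes the logs use Apache's combined log format and that "request length" means the length of the full request line; neither is confirmed by the slides. The function name `extra_features`, the dictionary keys, and the sample line are made up for the example.

```python
import re

# Apache combined log format: host ident user [time] "request" status bytes
# "referrer" "user-agent". The last two fields are optional here so that
# common-format lines still parse.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def extra_features(line):
    """Return the five non-character features, or None if the line is malformed."""
    m = COMBINED.match(line)
    if m is None:
        return None
    size = m.group("bytes")
    return {
        "HttpRequestLength": len(m.group("request")),
        "IP": m.group("ip"),                       # nominal
        "ResponseCode": m.group("status"),         # nominal, kept as a string
        "Bytes": 0 if size == "-" else int(size),  # "-" means no body was sent
        "UserAgent": m.group("agent") or "",       # nominal
    }


if __name__ == "__main__":
    # Made-up sample line for demonstration only.
    sample = ('203.0.113.7 - - [01/Apr/2012:12:00:00 -0400] '
              '"GET /wiki/index.php?title=Main HTTP/1.1" 200 5120 '
              '"-" "Mozilla/5.0"')
    print(extra_features(sample))
```

Requests that contain an escaped double quote will not match this simple pattern; returning None for them fits the "clean data" step, where such lines would need separate handling.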
Methodology
Some machine learning and visualization in Weka
1. Observe visualizations looking for correlations
2. Run OneR learner (various class variables)
3. Discretize Numeric attributes
4. Repeat above
5. Reach conclusions
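For readers unfamiliar with OneR, the idea behind step 2 can be sketched in a few lines of Python. This is not Weka's implementation (Weka's OneR also buckets numeric attributes itself, and Weka typically reports cross-validated accuracy). The version below assumes all attributes are already nominal, as after step 3, and reports accuracy on the training rows only. The function name `one_r` and the demo rows are hypothetical.

```python
from collections import Counter, defaultdict


def one_r(rows, class_attr):
    """Return (best_attribute, rule, training_accuracy) for nominal data.

    `rows` is a list of dicts. For each attribute, OneR maps every attribute
    value to its most frequent class and keeps the attribute whose rule
    misclassifies the fewest training rows.
    """
    best = None
    attributes = [a for a in rows[0] if a != class_attr]
    for attr in attributes:
        by_value = defaultdict(Counter)   # attribute value -> class counts
        for row in rows:
            by_value[row[attr]][row[class_attr]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        correct = sum(c.most_common(1)[0][1] for c in by_value.values())
        accuracy = correct / len(rows)
        if best is None or accuracy > best[2]:
            best = (attr, rule, accuracy)
    return best


if __name__ == "__main__":
    # Made-up rows: a discretized request length predicts the response code.
    demo = [
        {"length_bin": "short", "agent": "A", "code": "200"},
        {"length_bin": "short", "agent": "B", "code": "200"},
        {"length_bin": "long", "agent": "A", "code": "404"},
        {"length_bin": "long", "agent": "B", "code": "404"},
    ]
    print(one_r(demo, "code"))
```

Running the learner once per class variable (response code, user agent, IP address) gives one accuracy and one chosen attribute per run, which is the shape of the OneR results table in the Results slides.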
Methodology
[Figure: Percent Symbols vs. HttpRequestLength]
Methodology
[Figure: BackSlash vs. HttpRequestLength]
Results
OneR Learner Results
Datasets             Response Code                  User Agent                     IP Address
                     Accuracy  Attribute            Accuracy  Attribute            Accuracy  Attribute
Advertised Server    99.1%     Http Request Length  64%       IP                   29.2%     User Agent
Unadvertised Server  99.3%     Http Request Length  82.75%    Http Request Length  13%       User Agent
Results
Log Characteristics                     Web Advertised   Web Unadvertised
Http Requests                           23055            14234
Unique IPs                              3394             1554
Unique IP/Sessions                      11341            12134
Unique Http requests                    2630             620
Unique referrer link with UA            1864             788
Unique Referred links                   410              132
Unique User Agents                      768              374
Unique IP/UserAgents                    5245             2009
IPs that don’t report themselves        77.7% (2636)     89% (1382)
IPs that do state themselves as bots    22.3% (758)      11% (172)
Unique IPs present in both logs         8.6% (428)
Results
Log Composition Statistics    Apache Web Advertised    Apache Web Unadvertised
String “/Wiki” in requests    65.3%                    7.7%
String “/Blog” in requests    2.2%                     79.3%
Total                         67.5%                    87%
Results
Top Countries visiting sensors
Rank  Advertised Web Server  Unadvertised Web Server
1     United States          China
2     China                  United States
3     Malaysia               Russian Federation
4     France                 Saudi Arabia
5     Germany                France
6     Pakistan               Hong Kong
7     Philippines            Brazil
8     United Kingdom         Thailand
9     Poland                 Ukraine
10    Sweden                 Malaysia
Results
[Figure: Advertised Web Server Traffic Distribution by IP]
Results
[Figure: Unadvertised Web Server Traffic Distribution by IP]
Conclusion
RQ1: Can we identify patterns in logs without manually labeling data?
Common patterns found
Need validation with labeled attack data
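Conclusion
RQ2: What patterns can we identify that may be significant?
Specific symbols such as “=” and “%”, e.g. an encoded request string that translates to:
「RAPPELZ」大型アップデートEPIC6第2章「黄金の君主」の情報が公開
(English: “RAPPELZ” major update EPIC6 Chapter 2 “The Golden Monarch” information released)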
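One plausible reason the “%” symbol stands out is percent-encoding: non-ASCII text such as Japanese spam content becomes long runs of `%XX` escapes inside a URL. The sketch below assumes the logged requests were percent-encoded UTF-8, which the slides do not state. It uses a short fragment of the Japanese string from the slide and a hypothetical helper, `percent_ratio`, to show how heavily such a request is dominated by “%”.

```python
from urllib.parse import quote, unquote

decoded = "黄金の君主"      # fragment of the string shown on the slide
encoded = quote(decoded)    # how that text would appear inside a URL


def percent_ratio(request):
    """Fraction of characters in the request that are '%'."""
    return request.count("%") / len(request) if request else 0.0


if __name__ == "__main__":
    print(encoded)
    print(unquote(encoded) == decoded)
    print(percent_ratio(encoded))
    print(percent_ratio("GET /wiki/index.php HTTP/1.1"))
```

Each of these characters takes three bytes in UTF-8, so one third of the encoded text consists of “%” signs, while an ordinary ASCII request contains none.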
Future Work
Compare low-level and high-level feature prediction models
Compare results with production servers
Find good predictors for features with anomalies
References
NCSA log format
http://publib.boulder.ibm.com/iseries/v5r2/ic2924/info/rzaie/rzaielogformat.htm
Identifying Agents
http://www.jafsoft.com/searchengines/spider_hunting.html