casting out demons: sanitizing training data for anomaly sensors
TRANSCRIPT
Casting out Demons: Sanitizing Training Data for
Anomaly Sensors
Angelos Stavrou, Department of Computer Science, George Mason University
Joint work with Gabriela Cretu, Michael E. Locasto, Salvatore J. Stolfo, Angelos D. Keromytis
Anomaly Detection (AD) Systems
Supervised: dependent on labeled data, which cannot be prepared for large data sets, e.g. network packets
Semi-supervised: uses a third-party sensor to label some data as known bad; dependent on clean data for training
Unsupervised: can clean the data by determining the outliers in the training data; no good definition for an anomaly other than low-probability data
Motivation
Detection of zero-day attacks (only using the AD system)
Detection accuracy of all learning-based anomaly detectors depends heavily on the quality of the training data
Training data is often poor, severely degrading an AD's reliability as a detection and forensic-analysis tool
Rest of the Talk
Intuition
Local Training Sanitization
Distributed Cross-Sanitization
Future work
Conclusions
Intuition
Pattern of actions reflected in traces:
Regular – what we are expecting based on previous observations
Abnormal – unlikely data requiring further investigation
An attack can pass as normal traffic if it is part of the training set
Sanitize the training data by using a large set of micro-models, where attacks and non-regular data cause a localized or limited "pollution" of the training data
Training Dataset Sanitization
Attacks and accidental malformed requests/data cause a local "pollution" of training data
An attack can pass as normal traffic if it is part of the training set
We seek to remove both malicious and abnormal data from the training dataset
Related ML algorithms:
Ensemble methods [Dietterich00]
MetaCost [Domingos99]
Meta-learning [Stolfo00]
Training Strategies – Uniform Time
Divide data into multiple blocks:
micro-datasets with the same time granularity
Training Strategies – Multiple Models
Divide data into multiple blocks
Build micro-models for each block
[Diagram: blocks M1 … MK yield micro-models µM1 … µMK]
Attacks and non-regular data cause localized "pollution" of the training data
Training Strategies – Voting Models
Divide data into multiple blocks
Build micro-models for each block
Test all models against a smaller dataset
[Diagram: each micro-model µM1 … µMK tests the smaller dataset; a voting algorithm combines their verdicts]
Simple voting: all micro-models have equal weight
Weighted voting: wi = number of packets used for training µMi
Training Strategies - Sanitization
Divide data into multiple blocks
Build micro-models for each block
Test all models against a smaller dataset
Build sanitized and abnormal models
[Diagram: training phase — a voting algorithm over µM1 … µMK splits the second dataset into a sanitized model and an abnormal model]
Sanitized model: built from packets voted normal (Lj ≤ V)
Abnormal model: built from packets voted abnormal (Lj > V)
V = voting threshold
Shadow Sensor Redirection
Shadow sensor:
Heavily instrumented host-based anomaly detector, akin to an "oracle"
Performs substantially slower than the native application
Use the shadow sensor to classify or corroborate the alerts produced by the AD sensors
[Diagram: testing phase — alerts from the sanitized model are redirected to the shadow server / host-based IDS, which separates false positives from true alerts]
Feasibility and scalability depend on the number of alerts generated by the AD sensor
Overall Architecture
For each host, use a large set of training data:
Divide data into multiple blocks
Build micro-models for each block
Test all models against a smaller dataset
Sanitize data based on the previous step and build the sanitized model
Build an abnormal model as well
[Diagram: overall architecture — training phase: micro-models µM1 … µMK vote to produce a sanitized model and an abnormal model; testing phase: alerts from the sanitized model go to the shadow server / host-based IDS, which filters out false positives]
Micro-models
Partition a large training dataset into a number of smaller, time-delimited training sets => micro-datasets: T = {md1, md2, …, mdN}, where each mdi has a time granularity g
AD can be any chosen anomaly detection algorithm; T is the training dataset; M denotes the normal model produced by AD, so each micro-dataset yields a micro-model µMi = AD(mdi)
Attacks and non-regular data cause a localized or limited "pollution" of the training data
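The partitioning step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are invented, timestamps are plain epoch seconds, and any trainer that maps a dataset to a model can stand in for AD.

```python
def partition_into_micro_datasets(packets, g_seconds):
    """Split time-ordered (timestamp, payload) pairs into micro-datasets
    md_1 .. md_N, each covering g_seconds of traffic (the granularity g).
    Timestamps are epoch seconds here for simplicity."""
    if not packets:
        return []
    t0 = packets[0][0]
    buckets = {}
    for ts, payload in packets:
        idx = int((ts - t0) // g_seconds)       # which time window P falls in
        buckets.setdefault(idx, []).append(payload)
    return [buckets[i] for i in sorted(buckets)]

def train_micro_models(micro_datasets, AD):
    """Train one micro-model per micro-dataset: uM_i = AD(md_i).
    AD is any anomaly-detection trainer mapping a dataset to a model."""
    return [AD(md) for md in micro_datasets]
```

With a 3-hour granularity, 9 hours of traffic yield three micro-datasets; a toy `set`-based trainer then produces one micro-model per block.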
Voting algorithms
Using a second dataset and testing it against each micro-model µMi:
Lj,i = 0 if µMi deems the packet Pj as normal
Lj,i = 1 otherwise
The generalized label for packet Pj: Lj = (1/W) · Σi wi · Lj,i, where wi is the weight assigned to µMi and W = Σi wi
Simple voting: all micro-models have equal weight
Weighted voting: wi = proportion of all packets used for training µMi
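The voting step can be sketched as below. This is an illustrative sketch: the `is_normal` predicate stands in for testing a packet against a micro-model with whichever AD algorithm is in use.

```python
def generalized_labels(packets, micro_models, is_normal, weights=None):
    """Score every packet P_j against each micro-model uM_i:
    L[j,i] = 0 if uM_i deems P_j normal, 1 otherwise.
    Generalized label: L_j = (1/W) * sum_i w_i * L[j,i], with W = sum_i w_i.
    weights=None gives simple voting (equal weights)."""
    K = len(micro_models)
    w = weights if weights is not None else [1.0] * K
    W = float(sum(w))
    labels = []
    for pkt in packets:
        # total weight of micro-models that flag this packet as abnormal
        flagged = sum(w[i] for i in range(K) if not is_normal(micro_models[i], pkt))
        labels.append(flagged / W)
    return labels
```

A packet flagged by one of three equally weighted micro-models gets label 1/3; one flagged by all of them gets label 1.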
Sanitized and Abnormal Models
Sanitized model: Msan = AD(Tsan), where Tsan = {Pj | Lj ≤ V}
Abnormal model: Mabn = AD(Tabn), where Tabn = {Pj | Lj > V}
V = voting threshold
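Putting the threshold step in code (a sketch under the same assumptions as before: invented names, a toy `set`-based AD trainer):

```python
def build_sanitized_and_abnormal(packets, labels, V, AD):
    """Split the second dataset on the voting threshold V and train both models:
        T_san = {P_j : L_j <= V}  ->  M_san = AD(T_san)
        T_abn = {P_j : L_j  > V}  ->  M_abn = AD(T_abn)"""
    T_san = [p for p, L in zip(packets, labels) if L <= V]
    T_abn = [p for p, L in zip(packets, labels) if L > V]
    return AD(T_san), AD(T_abn)
```

Packets whose generalized label stays at or below V end up in the sanitized model; the rest form the abnormal model.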
Evaluation
Proof of concept using two content-based anomaly detectors:
Anagram:
semi-supervised learning (when using Snort)
supervised learning (without Snort)
analyzes n-grams
Payl:
unsupervised learning
analyzes byte (1-gram) frequency distributions
Evaluation dataset
300/100/100 hours of real network traffic
Voting Techniques Comparison
[Figure: performance of the Anagram sensor after sanitization for www1 — a) simple voting, b) weighted voting]
Datasets Comparison
[Figure: performance for www and lists at 3-hour granularity when using Anagram]
AD sensors comparison
Sensor                     FP (%)  FA      TP (%)  TA
Anagram                    0.07    544     0       0
Anagram with Snort         0.04    312     20.20   20
Anagram with sanitization  0.10    776     100     99
Payl                       0.84    6,558   0       0
Payl with sanitization     6.64    70,392  76.76   76
Signal-to-noise ratio comparison
Sensor                     www1   www     lists
Anagram                    0      0       0
Anagram with Snort         505    59.10   370.2
Anagram with sanitization  1000   294.11  1000
Payl                       0      6.64    1.00
Payl with sanitization     11.56  5.84    36.05
Signal-to-noise ratio TP/FP: higher values mean better results
Granularity Impact
[Figure: granularity impact on the performance of the system when using Anagram and Payl]
Training Dataset Size Impact
[Figure: impact of the size of the training dataset for www1]
AD’s Internal Threshold Impact
[Figure: impact of the anomaly detector's internal threshold for www1 when using Anagram]
Analysis of g and V
[Figure: performance of the Anagram sensor after sanitization — a) simple voting, b) weighted voting]
Shadow Sensor Performance Evaluation
Overall computational requirements of an AD sensor and a host-based sensor (e.g. STEM and DYBOC)
Expected latency: (1 − FP)·l + FP·(Os·l), where:
l is the standard latency of a protected service
Os is the shadow server overhead
FP is the false positive rate
Sensor                     STEM      DYBOC
N/A                        44*l      1.2*l
Anagram                    1.031*l   1.0001*l
Anagram with Snort         1.0172*l  1.0000*l
Anagram with sanitization  1.0430*l  1.0002*l
Payl                       1.3612*l  1.0016*l
Payl with sanitization     3.8552*l  1.0132*l
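The table entries follow from a simple expected-latency model: the (1 − FP) fraction of traffic that raises no alert pays the native latency l, while the FP fraction is redirected to the shadow sensor and pays Os·l. A small sketch (function name is illustrative):

```python
def expected_latency_multiplier(fp, os_overhead):
    """Expected per-request latency as a multiple of the native latency l:
    (1 - FP) * l for requests that raise no alert, plus FP * Os * l for
    requests redirected to the shadow sensor."""
    return (1.0 - fp) + fp * os_overhead
```

For example, Payl (FP = 0.84%) under STEM (Os = 44) gives (1 − 0.0084) + 0.0084·44 = 1.3612·l, and Anagram with sanitization (FP = 0.10%) gives 1.043·l, both matching the table.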
Caveat Emptor & Limitations
The presence of a long-lasting attack in the dataset used for computing the micro-models
Poisoning all the micro-models
AD Distributed Cross-Sanitization
Use external knowledge (models) to generate a better local normal model
Abnormal models are exchanged across collaborating sites [Stolfo00]
Re-evaluate the locally computed sanitized models
Apply model differencing
Remove remote abnormal data from the local normal model
Cross-sanitization
Direct model differencing: analytic method, difference of the models
Indirect model differencing: no analytic method, use testing
[Diagram: local sanitized model vs. remote abnormal model, compared via direct or indirect differencing]
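The indirect (testing-based) variant can be sketched as follows. This is a simplified illustration, not the paper's implementation: the `matches` predicate stands in for testing a packet against a remote abnormal model, and the `set`-based trainer is a toy AD.

```python
def cross_sanitize_indirect(local_clean, remote_abnormal_models, matches, AD):
    """Indirect model differencing: instead of analytically subtracting
    models, re-test each locally 'clean' packet against the remote abnormal
    models, drop any packet a remote model recognizes as abnormal, and
    retrain the local sanitized model on the survivors."""
    kept = [p for p in local_clean
            if not any(matches(mod, p) for mod in remote_abnormal_models)]
    return AD(kept)
```

Packets that slipped through local sanitization but are known-abnormal at a collaborating site are removed before the local model is rebuilt.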
Cross-sanitization: Evaluation
Model              www1             www              lists
                   FP (%)   TP (%)  FP (%)   TP (%)  FP (%)   TP (%)
Mpois              0.10     44.94   0.27     51.78   0.25     47.53
Mcross (direct)    0.24     100     0.71     100     0.48     100
Mcross (indirect)  0.10     100     0.26     100     0.10     100
Indirect model differencing is more expensive than direct model differencing
Method    www1       www        lists
direct    13.989s    26.359s    16.849s
indirect  1966.689s  1732.329s  685.819s
Future work
Adversarial scenarios: new techniques to resist training attacks
Distributed sanitization: a distributed architecture to share models and remove training attacks
Model updates: updating AD models to accommodate concept drift
Conclusions
A novel sanitization method that boosts the performance of out-of-the-box anomaly detectors
A simple and general method, without significant additional computational cost
An efficient and accurate online packet classifier, both in real time and in post-processing forensic analysis
Thank you!
Questions?