casting out demons: sanitizing training data for anomaly sensors
TRANSCRIPT
Casting out Demons: Sanitizing Training Data for
Anomaly Sensors
Angelos Stavrou, Department of Computer Science, George Mason University
Joint work with Gabriela Cretu, Michael E. Locasto, Salvatore J. Stolfo, Angelos D. Keromytis
Anomaly Detection (AD) Systems
Supervised: dependent on labeled data, which cannot be prepared for large data sets, e.g. network packets
Semi-supervised: uses a third-party sensor to label some data as known bad; dependent on clean data for training
Unsupervised: can clean the data by determining the outliers in the training data; no good definition for an anomaly other than low-probability data
Motivation
Detection of zero-day attacks (only using the AD system)
Detection accuracy of all learning-based anomaly detectors depends heavily on the quality of the training data
Training data is often poor, severely degrading an AD's reliability as a detection and forensic-analysis tool
Rest of the Talk
Intuition
Local Training Sanitization
Distributed Cross-Sanitization
Future work
Conclusions
Intuition
Pattern of actions reflected in traces:
Regular – what we are expecting based on previous observations
Abnormal – unlikely data requiring further investigation
An attack can pass as normal traffic if it is part of the training set
Sanitize the training data by using a large set of micro-models, where attacks and non-regular data cause a localized or limited "pollution" of the training data
Training Dataset Sanitization
Attacks and accidental malformed requests/data cause a local "pollution" of training data
An attack can pass as normal traffic if it is part of the training set
We seek to remove both malicious and abnormal data from the training dataset
Related ML algorithms:
Ensemble methods [Dietterich00]
MetaCost [Domingos99]
Meta-learning [Stolfo00]
Training Strategies – Uniform Time
Divide data into multiple blocks:
micro-datasets with the same time granularity
Training Strategies – Multiple Models
Divide data into multiple blocks
Build micro-models for each block
[Diagram: blocks M1 … MK yield micro-models µM1 … µMK]
Attacks and non-regular data cause localized "pollution" of the training data
Training Strategies – Voting Models
Divide data into multiple blocks
Build micro-models for each block
Test all models against a smaller dataset
[Diagram: each micro-model µM1 … µMK tests the smaller dataset; a voting algorithm combines their verdicts]
Simple voting: all micro-models have equal weight
Weighted voting: wi = number of packets used for training µMi
Training Strategies - Sanitization
Divide data into multiple blocks
Build micro-models for each block
Test all models against a smaller dataset
Build sanitized and abnormal models
[Diagram: training phase — a voting algorithm over µM1 … µMK splits the second dataset into a sanitized model and an abnormal model]
Sanitized model: built from packets voted normal (Lj ≤ V)
Abnormal model: built from packets voted abnormal (Lj > V)
V = voting threshold
Shadow Sensor Redirection
Shadow sensor:
Heavily instrumented host-based anomaly detector, akin to an "oracle"
Performs substantially slower than the native application
Use the shadow sensor to classify or corroborate the alerts produced by the AD sensors
[Diagram: testing phase — alerts from the sanitized model are redirected to the shadow server / host-based IDS, which separates false positives from true alerts]
Feasibility and scalability depend on the number of alerts generated by the AD sensor
Overall Architecture
For each host, use a large set of training data:
Divide data into multiple blocks
Build micro-models for each block
Test all models against a smaller dataset
Sanitize data based on the previous step and build the sanitized model
Build an abnormal model as well
[Diagram: overall architecture — training phase: micro-models µM1 … µMK vote to produce a sanitized model and an abnormal model; testing phase: alerts from the sanitized model go to the shadow server / host-based IDS, which filters out false positives]
Micro-models
Partition a large training dataset into a number of smaller, time-delimited training sets => micro-datasets: T = {md1, md2, …, mdN}, where each mdi has a time granularity g
AD can be any chosen anomaly detection algorithm; T is the training dataset; M denotes the normal model produced by AD, so each micro-dataset yields a micro-model µMi = AD(mdi)
Attacks and non-regular data cause a localized or limited "pollution" of the training data
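The partitioning step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are invented, timestamps are plain epoch seconds, and any trainer that maps a dataset to a model can stand in for AD.

```python
def partition_into_micro_datasets(packets, g_seconds):
    """Split time-ordered (timestamp, payload) pairs into micro-datasets
    md_1 .. md_N, each covering g_seconds of traffic (the granularity g).
    Timestamps are epoch seconds here for simplicity."""
    if not packets:
        return []
    t0 = packets[0][0]
    buckets = {}
    for ts, payload in packets:
        idx = int((ts - t0) // g_seconds)       # which time window P falls in
        buckets.setdefault(idx, []).append(payload)
    return [buckets[i] for i in sorted(buckets)]

def train_micro_models(micro_datasets, AD):
    """Train one micro-model per micro-dataset: uM_i = AD(md_i).
    AD is any anomaly-detection trainer mapping a dataset to a model."""
    return [AD(md) for md in micro_datasets]
```

With a 3-hour granularity, 9 hours of traffic yield three micro-datasets; a toy `set`-based trainer then produces one micro-model per block.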
Voting algorithms
Using a second dataset and testing it against each micro-model µMi:
Lj,i = 0 if µMi deems the packet Pj as normal
Lj,i = 1 otherwise
The generalized label for packet Pj: Lj = (1/W) · Σi wi · Lj,i, where wi is the weight assigned to µMi and W = Σi wi
Simple voting: all micro-models have equal weight
Weighted voting: wi = proportion of all packets used for training µMi
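The voting step can be sketched as below. This is an illustrative sketch: the `is_normal` predicate stands in for testing a packet against a micro-model with whichever AD algorithm is in use.

```python
def generalized_labels(packets, micro_models, is_normal, weights=None):
    """Score every packet P_j against each micro-model uM_i:
    L[j,i] = 0 if uM_i deems P_j normal, 1 otherwise.
    Generalized label: L_j = (1/W) * sum_i w_i * L[j,i], with W = sum_i w_i.
    weights=None gives simple voting (equal weights)."""
    K = len(micro_models)
    w = weights if weights is not None else [1.0] * K
    W = float(sum(w))
    labels = []
    for pkt in packets:
        # total weight of micro-models that flag this packet as abnormal
        flagged = sum(w[i] for i in range(K) if not is_normal(micro_models[i], pkt))
        labels.append(flagged / W)
    return labels
```

A packet flagged by one of three equally weighted micro-models gets label 1/3; one flagged by all of them gets label 1.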
Sanitized and Abnormal Models
Sanitized model: Msan = AD(Tsan), where Tsan = {Pj | Lj ≤ V}
Abnormal model: Mabn = AD(Tabn), where Tabn = {Pj | Lj > V}
V = voting threshold
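Putting the threshold step in code (a sketch under the same assumptions as before: invented names, a toy `set`-based AD trainer):

```python
def build_sanitized_and_abnormal(packets, labels, V, AD):
    """Split the second dataset on the voting threshold V and train both models:
        T_san = {P_j : L_j <= V}  ->  M_san = AD(T_san)
        T_abn = {P_j : L_j  > V}  ->  M_abn = AD(T_abn)"""
    T_san = [p for p, L in zip(packets, labels) if L <= V]
    T_abn = [p for p, L in zip(packets, labels) if L > V]
    return AD(T_san), AD(T_abn)
```

Packets whose generalized label stays at or below V end up in the sanitized model; the rest form the abnormal model.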
Evaluation
Proof of concept using two content-based anomaly detectors:
Anagram:
semi-supervised learning (when using Snort)
supervised learning (without Snort)
analyzes n-grams
Payl:
unsupervised learning
analyzes byte (1-gram) frequency distributions
Evaluation dataset
300/100/100 hours of real network traffic
Voting Techniques Comparison
[Figure: performance of the Anagram sensor after sanitization for www1 — a) simple voting, b) weighted voting]
Datasets Comparison
[Figure: performance for www and lists at 3-hour granularity when using Anagram]
AD sensors comparison
Sensor                     FP (%)  FA      TP (%)  TA
Anagram                    0.07    544     0       0
Anagram with Snort         0.04    312     20.20   20
Anagram with sanitization  0.10    776     100     99
Payl                       0.84    6,558   0       0
Payl with sanitization     6.64    70,392  76.76   76
Signal-to-noise ratio comparison
Sensor                     www1   www     lists
Anagram                    0      0       0
Anagram with Snort         505    59.10   370.2
Anagram with sanitization  1000   294.11  1000
Payl                       0      6.64    1.00
Payl with sanitization     11.56  5.84    36.05
Signal-to-noise ratio TP/FP: higher values mean better results
Granularity Impact
[Figure: granularity impact on the performance of the system when using Anagram and Payl]
Training Dataset Size Impact
[Figure: impact of the size of the training dataset for www1]
AD’s Internal Threshold Impact
[Figure: impact of the anomaly detector's internal threshold for www1 when using Anagram]
Analysis of g and V
[Figure: performance of the Anagram sensor after sanitization — a) simple voting, b) weighted voting]
Shadow Sensor Performance Evaluation
Overall computational requirements of an AD sensor and a host-based sensor (e.g. STEM and DYBOC)
Expected latency: (1 − FP)·l + FP·(Os·l), where:
l is the standard latency of a protected service
Os is the shadow server overhead
FP is the false positive rate
Sensor                     STEM      DYBOC
N/A                        44*l      1.2*l
Anagram                    1.031*l   1.0001*l
Anagram with Snort         1.0172*l  1.0000*l
Anagram with sanitization  1.0430*l  1.0002*l
Payl                       1.3612*l  1.0016*l
Payl with sanitization     3.8552*l  1.0132*l
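The table entries follow from a simple expected-latency model: the (1 − FP) fraction of traffic that raises no alert pays the native latency l, while the FP fraction is redirected to the shadow sensor and pays Os·l. A small sketch (function name is illustrative):

```python
def expected_latency_multiplier(fp, os_overhead):
    """Expected per-request latency as a multiple of the native latency l:
    (1 - FP) * l for requests that raise no alert, plus FP * Os * l for
    requests redirected to the shadow sensor."""
    return (1.0 - fp) + fp * os_overhead
```

For example, Payl (FP = 0.84%) under STEM (Os = 44) gives (1 − 0.0084) + 0.0084·44 = 1.3612·l, and Anagram with sanitization (FP = 0.10%) gives 1.043·l, both matching the table.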
Caveat Emptor & Limitations
The presence of a long-lasting attack in the dataset used for computing the micro-models
Poisoning all the micro-models
AD Distributed Cross-Sanitization
Use external knowledge (models) to generate a better local normal model
Abnormal models are exchanged across collaborating sites [Stolfo00]
Re-evaluate the locally computed sanitized models
Apply model differencing
Remove remote abnormal data from the local normal model
Cross-sanitization
Direct model differencing: analytic method, difference of the models
Indirect model differencing: no analytic method, use testing
[Diagram: local sanitized model vs. remote abnormal model, compared via direct or indirect differencing]
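The indirect (testing-based) variant can be sketched as follows. This is a simplified illustration, not the paper's implementation: the `matches` predicate stands in for testing a packet against a remote abnormal model, and the `set`-based trainer is a toy AD.

```python
def cross_sanitize_indirect(local_clean, remote_abnormal_models, matches, AD):
    """Indirect model differencing: instead of analytically subtracting
    models, re-test each locally 'clean' packet against the remote abnormal
    models, drop any packet a remote model recognizes as abnormal, and
    retrain the local sanitized model on the survivors."""
    kept = [p for p in local_clean
            if not any(matches(mod, p) for mod in remote_abnormal_models)]
    return AD(kept)
```

Packets that slipped through local sanitization but are known-abnormal at a collaborating site are removed before the local model is rebuilt.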
Cross-sanitization: Evaluation
Model              www1             www              lists
                   FP (%)   TP (%)  FP (%)   TP (%)  FP (%)   TP (%)
Mpois              0.10     44.94   0.27     51.78   0.25     47.53
Mcross (direct)    0.24     100     0.71     100     0.48     100
Mcross (indirect)  0.10     100     0.26     100     0.10     100
Indirect model differencing is more expensive than direct model differencing
Method    www1       www        lists
direct    13.989s    26.359s    16.849s
indirect  1966.689s  1732.329s  685.819s
Future work
Adversarial scenarios: new techniques to resist training attacks
Distributed sanitization: a distributed architecture to share models and remove training attacks
Model updates: updating AD models to accommodate concept drift
Conclusions
A novel sanitization method that boosts the performance of out-of-the-box anomaly detectors
A simple and general method, without significant additional computational cost
An efficient and accurate online packet classifier, both in real time and in post-processing forensic analysis
Thank you!
Questions?