TRANSCRIPT
A Geometric Framework for
Unsupervised Anomaly Detection
E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. J. Stolfo
In Applications of Data Mining in Computer Security,
pages 78-100. Kluwer Academic Publishers, 2002.
Overview
• Introduction
• Geometric Framework for Unsupervised Anomaly
Detection
• Feature Spaces and Kernel Functions
• Anomaly Detection Algorithms
• Experiments
• Conclusions
Introduction
• Intrusion detection systems (IDS)
Detect malicious activities such as DoS attacks, port scans, etc. by monitoring network traffic and/or system activities
• Most deployed systems use signature-based detection
Compare features to attack signatures provided by experts
Can only detect previously known intrusions
• Supervised machine learning approaches
Misuse detection: train a classifier using normal data and known attacks
Supervised anomaly detection: build model over the normal data and then check how well new data fits into that model
Can retrain models on new types of attacks
Introduction
• Unsupervised machine learning approaches
Take unlabeled data as input; in practice, neither labeled nor
purely normal data is available.
Attempt to separate anomalous instances from the normal ones,
then train misuse or traditional anomaly detection models over
the cleaned data
• Unsupervised Anomaly Detection
Process unlabeled data
Two main assumptions:
1) Normal instances vastly outnumber anomalies
2) Anomalies are qualitatively different from normal
instances
Geometric framework for unsupervised
anomaly detection
• Map the data into a feature space 𝑅^𝑑, either explicitly or implicitly via a kernel
• The choice of the feature map is application dependent
• Label points in the sparse regions of the feature space as anomalies
• How sparse regions are identified depends on the algorithm
• Efficiency is critical to the practical use of these algorithms, since the datasets are typically large
Feature Space
• Instances 𝑥1, 𝑥2, …, 𝑥𝑛 in input space 𝑋
𝑋 can be the space of all possible network connection
records, event logs, system call traces, etc.
• Define a feature map
𝜙: 𝑋 → 𝑌
where 𝑌 is 𝑅^𝑑 or, more generally, a Hilbert space
• Define the distance between 𝑥1 and 𝑥2:
𝑑(𝑥1, 𝑥2) = ‖𝜙(𝑥1) − 𝜙(𝑥2)‖
Feature Space
• Define a kernel function as the dot product between the
images of two elements in the feature space:
𝐾(𝑥1, 𝑥2) = ⟨𝜙(𝑥1), 𝜙(𝑥2)⟩
• The kernel function can be computed efficiently without
explicitly mapping the elements from the input space to their
images; distances follow from kernel values alone:
𝑑(𝑥1, 𝑥2) = √(𝐾(𝑥1, 𝑥1) − 2𝐾(𝑥1, 𝑥2) + 𝐾(𝑥2, 𝑥2))
• Some kernel functions correspond to an infinite-dimensional
mapping. For example, the radial basis kernel:
𝐾(𝑥1, 𝑥2) = exp(−‖𝑥1 − 𝑥2‖² / 𝜎²)
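As a minimal sketch of the idea above (not the paper's code), the distance between two points in the implicit RBF feature space can be computed from kernel evaluations alone:

```python
import math

def rbf_kernel(x1, x2, sigma=1.0):
    """Radial basis kernel: K(x1, x2) = exp(-||x1 - x2||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / sigma ** 2)

def kernel_distance(x1, x2, kernel):
    """Feature-space distance from kernel evaluations alone:
    d(x1, x2) = sqrt(K(x1, x1) - 2 K(x1, x2) + K(x2, x2))."""
    return math.sqrt(kernel(x1, x1) - 2 * kernel(x1, x2) + kernel(x2, x2))

a, b = (0.0, 0.0), (3.0, 4.0)  # illustrative 2-D points
d = kernel_distance(a, b, rfb) if False else kernel_distance(a, b, rbf_kernel)
```

Even though the RBF feature space is infinite-dimensional, the distance is obtained with three kernel calls.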
Anomaly Detection Algorithms
Algorithm 1: Cluster-based Estimation
• The basic idea is to count how many points are
“near” each point in the feature space, i.e., satisfy 𝑑(𝑥1, 𝑥2) ≤ 𝑤
• The straightforward computation has complexity 𝑂(𝑛²).
• Fixed-width clustering algorithm (as an approximation)
Anomaly Detection Algorithms
The first point becomes the center of the first cluster
For each subsequent point, if it is within 𝑤 of some
cluster’s center, it is added to that cluster. Otherwise, it
becomes the center of a new cluster
Some points may be added to multiple clusters
Complexity is 𝑂(𝑐𝑛), where 𝑐 is the number of clusters
• For points in dense regions, the estimate is inaccurate
because clusters overlap heavily there. For points in
sparse regions, the estimate is more accurate.
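The single-pass procedure above can be sketched in a few lines (Euclidean distance and the sample points are illustrative assumptions, not from the paper):

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fixed_width_clustering(points, w):
    """Single-pass fixed-width clustering, as described above.
    The first point becomes the first cluster center; each later point
    joins every cluster whose center lies within w, otherwise it starts
    a new cluster.  O(cn) for c clusters."""
    centers, counts = [], []
    for p in points:
        hit = False
        for i, c in enumerate(centers):
            if euclid(p, c) <= w:
                counts[i] += 1
                hit = True  # a point may be counted in several clusters
        if not hit:
            centers.append(p)
            counts.append(1)
    return centers, counts

# Hypothetical 2-D feature vectors: a dense region plus one isolated point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
centers, counts = fixed_width_clustering(points, w=0.5)
# Points whose cluster count is small sit in sparse regions: anomaly candidates
```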
Anomaly Detection Algorithms
Algorithm 2: k-nearest neighbor
• The basic idea is to compute the sum of the distances
to the k nearest neighbors of each point in the feature
space (referred to as the k-NN score)
• 𝑘 must exceed the number of instances of any single
attack type for the algorithm to be useful
• The straightforward computation has complexity 𝑂(𝑛²).
Anomaly Detection Algorithms
• Approximation algorithm
First cluster the data using the fixed-width clustering
algorithm of algorithm 1 with a variation where we place
each element into only one cluster.
Let 𝑐(𝑥) denote the center of the cluster that contains 𝑥
If 𝑥1 and 𝑥2 are in the same cluster:
𝑑(𝑥1, 𝑥2) ≤ 2𝑤
In all cases:
𝑑(𝑐(𝑥1), 𝑐(𝑥2)) − 2𝑤 ≤ 𝑑(𝑥1, 𝑥2) ≤ 𝑑(𝑐(𝑥1), 𝑐(𝑥2)) + 2𝑤
Anomaly Detection Algorithms
• The algorithm uses these inequalities to find the k-nearest neighbors of a point 𝑥.
• It first computes the distance from 𝑥 to each cluster center, and adds all points of the cluster with the closest center to 𝑃, the set of candidate nearest neighbors. For each point 𝑥𝑖 in 𝑃, it checks whether 𝑑(𝑥, 𝑥𝑖) < 𝑑𝑚𝑖𝑛, where 𝑑𝑚𝑖𝑛 = min𝑐∈𝐶 𝑑(𝑥, 𝑐) − 𝑤 over the clusters 𝐶 not yet examined. This guarantees that 𝑥𝑖 is closer to 𝑥 than any point in those remaining clusters, so 𝑥𝑖 is moved from 𝑃 to 𝐾. Once 𝐾 has 𝑘 elements, the algorithm terminates.
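For intuition, here is the naive k-NN score that the pruning above accelerates (a toy sketch with invented points; the cluster-based shortcut is deliberately omitted):

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_score(x, data, k):
    """k-NN anomaly score: sum of distances from x to its k nearest
    neighbors.  O(n) per query; the paper's cluster-based pruning
    (the inequalities above) is omitted for clarity."""
    dists = sorted(euclid(x, p) for p in data if p is not x)
    return sum(dists[:k])

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
dense_score = knn_score(data[0], data, k=3)
outlier_score = knn_score(data[4], data, k=3)  # isolated point scores far higher
```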
Anomaly Detection Algorithms
Algorithm 3: One-class SVM (Support Vector Machine)
Standard SVM is a supervised learning algorithm. It tries to
maximally separate two classes of data in feature space by a
hyperplane
Unsupervised (one class) SVM tries to separate the entire set
of training data from the origin
First map the feature space into a second feature space with a
radial basis kernel.
This is done by solving a quadratic program that penalizes
points not separated from the origin while simultaneously
maximizing the distance of the hyperplane from the origin.
Points that the hyperplane separates from the origin are
classified as normal; those it does not are classified as anomalous.
Anomaly Detection Algorithms
Objective function:
min 𝑤,𝜉,𝜌 (1/2)‖𝑤‖² + (1/(𝜈𝑙)) Σ𝑖 𝜉𝑖 − 𝜌
subject to ⟨𝑤, 𝜙(𝑥𝑖)⟩ ≥ 𝜌 − 𝜉𝑖, 𝜉𝑖 ≥ 0
𝜈: parameter that controls the tradeoff between maximizing the
distance from the origin and containing most of the data
𝑙: the number of data points
𝑤: the hyperplane’s normal vector in the feature space
𝜌: the hyperplane’s offset from the origin
𝜉𝑖: slack variables
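A toy illustration of the one-class SVM idea, under stated assumptions: it minimizes the standard dual of the objective above with simple pairwise updates (a teaching sketch, not the authors' solver; the data points and hyperparameters are invented):

```python
import math
import itertools

def rbf(x1, x2, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x1, x2)) / sigma ** 2)

def one_class_svm(data, nu=0.5, sigma=1.0, sweeps=50):
    """Toy one-class SVM: minimize the dual
        (1/2) sum_ij a_i a_j K(x_i, x_j),  0 <= a_i <= 1/(nu*l),  sum_i a_i = 1,
    with SMO-style pairwise updates.  Returns the decision function
    f(x) = sum_i a_i K(x_i, x) - rho; f(x) >= 0 means "normal"."""
    l = len(data)
    C = 1.0 / (nu * l)
    K = [[rbf(a, b, sigma) for b in data] for a in data]
    alpha = [1.0 / l] * l                      # feasible starting point
    for _ in range(sweeps):
        for i, j in itertools.combinations(range(l), 2):
            gi = sum(alpha[m] * K[i][m] for m in range(l))
            gj = sum(alpha[m] * K[j][m] for m in range(l))
            curv = K[i][i] - 2 * K[i][j] + K[j][j]
            if curv <= 1e-12:
                continue
            t = (gj - gi) / curv               # unconstrained optimum for the pair
            lo = max(-alpha[i], alpha[j] - C)  # keep both alphas inside [0, C]
            hi = min(C - alpha[i], alpha[j])
            t = max(lo, min(hi, t))
            alpha[i] += t
            alpha[j] -= t
    # rho: margin value at support vectors strictly inside the box
    sv = [i for i in range(l) if 1e-8 < alpha[i] < C - 1e-8]
    sv = sv or [i for i in range(l) if alpha[i] > 1e-8]
    rho = sum(sum(alpha[m] * K[i][m] for m in range(l)) for i in sv) / len(sv)
    return lambda x: sum(a * rbf(p, x, sigma) for a, p in zip(alpha, data)) - rho

# Hypothetical "normal" points; (10, 10) plays the anomaly
normal = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.1, 0.1), (-0.1, 0.1)]
decision = one_class_svm(normal, nu=0.5)
```

A point far from the training data has near-zero kernel values against every support vector, so its decision value is approximately −𝜌 < 0 and it is flagged as anomalous.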
Experiments
• Two datasets:
KDD Cup 1999
o Wide variety of simulated intrusions in a military network environment
o 4,900,000 instances
o The features are extracted from connection records. Examples are
duration, protocol type, number of bytes transferred, normal or error
status of the connection
o Four categories of attacks (total 24 attack types):
1) Denial of Service (e.g. syn flood)
2) Unauthorized access from a remote machine (e.g. password
guessing)
3) Unauthorized access to superuser functions (e.g. buffer overflow
attacks)
4) Probing (e.g. port scanning)
o Attacks are filtered so that the dataset consists of 1-1.5% attack instances
Experiments
System call data from BSM (Basic Security Module) portion of
1999 DARPA Intrusion Detection Evaluation data
o Consists of 5 weeks of BSM data of all processes on a Solaris
machine
o Consider two programs: eject and ps
o An attack can correspond to multiple processes because a
malicious process can spawn other processes. Consider an attack
detected if one of the processes corresponding to the attack is
detected
Experiments
• For network connection data, use a data-dependent normalization
feature map. It normalizes the relative distances between feature
values in the feature space
Numerical attributes: normalize to the number of standard
deviations away from the mean
Discrete attributes: an attribute with range of values Σ𝑖 maps
to |Σ𝑖| coordinates; the coordinate corresponding to the
attribute’s value is set to 1/|Σ𝑖| and the remaining coordinates
are set to 0. The distance between two vectors is thus weighted
by the size of the range of values of the discrete attribute.
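The two normalization rules above can be sketched as follows (the attribute names and domain are hypothetical examples, not taken from the dataset):

```python
import statistics

def normalize_numeric(column):
    """Numeric attribute: distance from the mean in standard deviations."""
    mu = statistics.mean(column)
    sd = statistics.pstdev(column) or 1.0   # guard against constant columns
    return [(v - mu) / sd for v in column]

def encode_discrete(value, domain):
    """Discrete attribute over domain Sigma: one coordinate per possible
    value; the matching coordinate gets 1/|Sigma|, the rest 0, so
    attributes with wide ranges contribute less to distances."""
    return [1.0 / len(domain) if v == value else 0.0 for v in domain]

protocols = ["tcp", "udp", "icmp"]          # hypothetical discrete domain
vec = encode_discrete("udp", protocols)
durations = normalize_numeric([1.0, 2.0, 3.0])
```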
• For sequences of system calls data, use a spectrum kernel
It maps short sub-sequences of the string into the feature space
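A minimal sketch of a spectrum kernel over call traces (the trace contents are invented for illustration):

```python
from collections import Counter

def spectrum_features(seq, k=3):
    """Spectrum feature map: counts of every length-k sub-sequence."""
    return Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))

def spectrum_kernel(s1, s2, k=3):
    """K(s1, s2) = dot product of the two k-gram count vectors,
    computed without materializing the full feature space."""
    f1, f2 = spectrum_features(s1, k), spectrum_features(s2, k)
    return sum(count * f2[gram] for gram, count in f1.items())

# Hypothetical system call traces (names are illustrative only)
t1 = ["open", "read", "read", "close"]
t2 = ["open", "read", "close"]
k11 = spectrum_kernel(t1, t1, k=2)
k12 = spectrum_kernel(t1, t2, k=2)
```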
Experiments
• The datasets are divided into two parts: one for training and
one for testing
• Parameters:
Experiments
• Detection rate: the number of intrusion instances detected
by the system divided by the total number of intrusion
instances present in the test set.
• False positive rate: the total number of normal instances
that were (incorrectly) classified as intrusions divided by
the total number of normal instances.
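The two metrics defined above are straightforward to compute (the labels below are a toy example, not the paper's results):

```python
def detection_and_fp_rates(y_true, y_pred):
    """y_true: 1 = intrusion, 0 = normal; y_pred: 1 = flagged as intrusion.
    Returns (detection rate, false positive rate) as defined above."""
    intrusions = sum(1 for t in y_true if t == 1)
    normals = len(y_true) - intrusions
    detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return detected / intrusions, false_pos / normals

# Toy example: 2 intrusions, 3 normal instances
dr, fpr = detection_and_fp_rates([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```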
• There were some types of attacks that the system detected
well and others that it could not detect, because under the
chosen feature map those attacks fell in the same region of
the feature space as the normal data.
Experiments
• For system call traces, all three algorithms perform
perfectly (100% accuracy)!
• There are many sub-sequences that occur in malicious
processes but not in normal processes. This explains why,
in this feature space, the intrusion processes lie
significantly far from the normal processes.
• Also, the normal processes clump together in the feature
space, which is why the intrusion processes are easily
detected as outliers.
Conclusions
• Each of these algorithms can be applied to any feature
space, which lets us apply the same algorithms to model
different kinds of data.
• On both datasets, the algorithms were able to detect
intrusions over unlabeled data. They can be applied to raw
collected system data, which does not need to be manually
labeled.
• Future work involves defining more feature maps
over different kinds of data and performing more
extensive experiments.