TRANSCRIPT
A Geometric Framework for
Unsupervised Anomaly Detection
E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. J. Stolfo
In Applications of Data Mining in Computer Security,
pages 78-100. Kluwer Academic Publishers, 2002.
Overview
• Introduction
• Geometric Framework for Unsupervised Anomaly
Detection
• Feature Spaces and Kernel Functions
• Anomaly Detection Algorithms
• Experiments
• Conclusions
Introduction
• Intrusion detection systems (IDS)
Detect malicious activities such as DoS attacks, port scans, etc. by monitoring network traffic and/or system activities
• Most deployed systems use signature-based detection
Compare features to attack signatures provided by experts
Can only detect previously known intrusions
• Supervised machine learning approaches
Misuse detection: train a classifier using normal data and known attacks
Supervised anomaly detection: build model over the normal data and then check how well new data fits into that model
Can retrain models on new types of attacks
Introduction
• Unsupervised machine learning approaches
Take unlabeled data as input; in practice, neither labeled nor
purely normal data is available.
Attempt to separate anomalous instances from the normal ones,
then train misuse or traditional anomaly detection models over
the cleaned data
• Unsupervised Anomaly Detection
Process unlabeled data
Two main assumptions:
1) Normal instances vastly outnumber anomalies
2) Anomalies are qualitatively different from normal
instances
Geometric framework for unsupervised
anomaly detection
• Map the data into a feature space 𝑅^𝑑, either explicitly or implicitly via a kernel
• The choice of the feature map is application dependent
• Label points in the sparse regions of the feature space as anomalies
• How sparse regions are identified depends on the algorithm
• Efficiency is critical to the practical use of these algorithms, since the datasets are typically large
Feature Space
• Instances 𝑥1, 𝑥2, …, 𝑥𝑛 in input space 𝑋
𝑋 can be the space of all possible network connection
records, event logs, system call traces, etc.
• Define a feature map
𝜙: 𝑋 → 𝑌
where 𝑌 is 𝑅^𝑑 or, more generally, a Hilbert space
• Define the distance between 𝑥1 and 𝑥2:
𝑑(𝑥1, 𝑥2) = ‖𝜙(𝑥1) − 𝜙(𝑥2)‖
Feature Space
• Define a kernel function as the dot product between the
images of two elements in the feature space:
𝐾(𝑥1, 𝑥2) = ⟨𝜙(𝑥1), 𝜙(𝑥2)⟩
• The kernel function can be computed efficiently without
explicitly mapping the elements from the input space to their
images; distances follow from kernel values alone:
𝑑(𝑥1, 𝑥2) = √(𝐾(𝑥1, 𝑥1) − 2𝐾(𝑥1, 𝑥2) + 𝐾(𝑥2, 𝑥2))
• Some kernel functions correspond to an infinite-dimensional
mapping. For example, the radial basis kernel:
𝐾(𝑥1, 𝑥2) = exp(−‖𝑥1 − 𝑥2‖² / 𝜎²)
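As a minimal sketch of the idea above (not the paper's code), the distance between two points in the implicit RBF feature space can be computed from kernel evaluations alone:

```python
import math

def rbf_kernel(x1, x2, sigma=1.0):
    """Radial basis kernel: K(x1, x2) = exp(-||x1 - x2||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / sigma ** 2)

def kernel_distance(x1, x2, kernel):
    """Feature-space distance from kernel evaluations alone:
    d(x1, x2) = sqrt(K(x1, x1) - 2 K(x1, x2) + K(x2, x2))."""
    return math.sqrt(kernel(x1, x1) - 2 * kernel(x1, x2) + kernel(x2, x2))

a, b = (0.0, 0.0), (3.0, 4.0)  # illustrative 2-D points
d = kernel_distance(a, b, rfb) if False else kernel_distance(a, b, rbf_kernel)
```

Even though the RBF feature space is infinite-dimensional, the distance is obtained with three kernel calls.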
Anomaly Detection Algorithms
Algorithm 1: Cluster-based Estimation
• The basic idea is to count how many points are
“near” each point in the feature space, i.e., satisfy 𝑑(𝑥1, 𝑥2) ≤ 𝑤
• The straightforward computation has complexity 𝑂(𝑛²).
• Fixed-width clustering algorithm (as an approximation)
Anomaly Detection Algorithms
The first point becomes the center of the first cluster
For each subsequent point, if it is within 𝑤 of some
cluster’s center, it is added to that cluster. Otherwise, it
becomes the center of a new cluster
Some points may be added to multiple clusters
Complexity is 𝑂(𝑐𝑛), where 𝑐 is the number of clusters
• For points in dense regions, the estimate is inaccurate
because clusters overlap heavily there. For points in
sparse regions, the estimate is more accurate.
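The single-pass procedure above can be sketched in a few lines (Euclidean distance and the sample points are illustrative assumptions, not from the paper):

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fixed_width_clustering(points, w):
    """Single-pass fixed-width clustering, as described above.
    The first point becomes the first cluster center; each later point
    joins every cluster whose center lies within w, otherwise it starts
    a new cluster.  O(cn) for c clusters."""
    centers, counts = [], []
    for p in points:
        hit = False
        for i, c in enumerate(centers):
            if euclid(p, c) <= w:
                counts[i] += 1
                hit = True  # a point may be counted in several clusters
        if not hit:
            centers.append(p)
            counts.append(1)
    return centers, counts

# Hypothetical 2-D feature vectors: a dense region plus one isolated point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
centers, counts = fixed_width_clustering(points, w=0.5)
# Points whose cluster count is small sit in sparse regions: anomaly candidates
```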
Anomaly Detection Algorithms
Algorithm 2: k-nearest neighbor
• The basic idea is to compute the sum of the distances
to the k nearest neighbors of each point in the feature
space (referred to as the k-NN score)
• 𝑘 must exceed the number of instances of any single
attack type for the algorithm to be useful
• The straightforward computation has complexity 𝑂(𝑛²).
Anomaly Detection Algorithms
• Approximation algorithm
First cluster the data using the fixed-width clustering
algorithm of algorithm 1 with a variation where we place
each element into only one cluster.
Let 𝑐(𝑥) denote the center of the cluster that contains 𝑥
If 𝑥1 and 𝑥2 are in the same cluster:
𝑑(𝑥1, 𝑥2) ≤ 2𝑤
In all cases:
𝑑(𝑐(𝑥1), 𝑐(𝑥2)) − 2𝑤 ≤ 𝑑(𝑥1, 𝑥2) ≤ 𝑑(𝑐(𝑥1), 𝑐(𝑥2)) + 2𝑤
Anomaly Detection Algorithms
• The algorithm uses these inequalities to find the k-nearest neighbors of a point 𝑥.
• It first computes the distance from 𝑥 to each cluster center, and adds all points of the cluster with the closest center to 𝑃, the set of candidate nearest neighbors. For each point 𝑥𝑖 in 𝑃, it checks whether 𝑑(𝑥, 𝑥𝑖) < 𝑑𝑚𝑖𝑛, where 𝑑𝑚𝑖𝑛 = min𝑐∈𝐶 𝑑(𝑥, 𝑐) − 𝑤 over the clusters 𝐶 not yet examined. This guarantees that 𝑥𝑖 is closer to 𝑥 than any point in those remaining clusters, so 𝑥𝑖 is moved from 𝑃 to 𝐾. Once 𝐾 has 𝑘 elements, the algorithm terminates.
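For intuition, here is the naive k-NN score that the pruning above accelerates (a toy sketch with invented points; the cluster-based shortcut is deliberately omitted):

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_score(x, data, k):
    """k-NN anomaly score: sum of distances from x to its k nearest
    neighbors.  O(n) per query; the paper's cluster-based pruning
    (the inequalities above) is omitted for clarity."""
    dists = sorted(euclid(x, p) for p in data if p is not x)
    return sum(dists[:k])

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
dense_score = knn_score(data[0], data, k=3)
outlier_score = knn_score(data[4], data, k=3)  # isolated point scores far higher
```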
Anomaly Detection Algorithms
Algorithm 3: One-class SVM (Support Vector Machine)
Standard SVM is a supervised learning algorithm. It tries to
maximally separate two classes of data in feature space by a
hyperplane
Unsupervised (one class) SVM tries to separate the entire set
of training data from the origin
First map the feature space into a second feature space with a
radial basis kernel.
This is done by solving a quadratic program that penalizes
points not separated from the origin while simultaneously
maximizing the distance of the hyperplane from the origin.
Points that the hyperplane separates from the origin are
classified as normal; those it does not are classified as anomalous.
Anomaly Detection Algorithms
Objective function:
min 𝑤,𝜉,𝜌 (1/2)‖𝑤‖² + (1/(𝜈𝑙)) Σ𝑖 𝜉𝑖 − 𝜌
subject to ⟨𝑤, 𝜙(𝑥𝑖)⟩ ≥ 𝜌 − 𝜉𝑖, 𝜉𝑖 ≥ 0
𝜈: parameter that controls the tradeoff between maximizing the
distance from the origin and containing most of the data
𝑙: the number of data points
𝑤: the hyperplane’s normal vector in the feature space
𝜌: the hyperplane’s offset from the origin
𝜉𝑖: slack variables
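A toy illustration of the one-class SVM idea, under stated assumptions: it minimizes the standard dual of the objective above with simple pairwise updates (a teaching sketch, not the authors' solver; the data points and hyperparameters are invented):

```python
import math
import itertools

def rbf(x1, x2, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x1, x2)) / sigma ** 2)

def one_class_svm(data, nu=0.5, sigma=1.0, sweeps=50):
    """Toy one-class SVM: minimize the dual
        (1/2) sum_ij a_i a_j K(x_i, x_j),  0 <= a_i <= 1/(nu*l),  sum_i a_i = 1,
    with SMO-style pairwise updates.  Returns the decision function
    f(x) = sum_i a_i K(x_i, x) - rho; f(x) >= 0 means "normal"."""
    l = len(data)
    C = 1.0 / (nu * l)
    K = [[rbf(a, b, sigma) for b in data] for a in data]
    alpha = [1.0 / l] * l                      # feasible starting point
    for _ in range(sweeps):
        for i, j in itertools.combinations(range(l), 2):
            gi = sum(alpha[m] * K[i][m] for m in range(l))
            gj = sum(alpha[m] * K[j][m] for m in range(l))
            curv = K[i][i] - 2 * K[i][j] + K[j][j]
            if curv <= 1e-12:
                continue
            t = (gj - gi) / curv               # unconstrained optimum for the pair
            lo = max(-alpha[i], alpha[j] - C)  # keep both alphas inside [0, C]
            hi = min(C - alpha[i], alpha[j])
            t = max(lo, min(hi, t))
            alpha[i] += t
            alpha[j] -= t
    # rho: margin value at support vectors strictly inside the box
    sv = [i for i in range(l) if 1e-8 < alpha[i] < C - 1e-8]
    sv = sv or [i for i in range(l) if alpha[i] > 1e-8]
    rho = sum(sum(alpha[m] * K[i][m] for m in range(l)) for i in sv) / len(sv)
    return lambda x: sum(a * rbf(p, x, sigma) for a, p in zip(alpha, data)) - rho

# Hypothetical "normal" points; (10, 10) plays the anomaly
normal = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.1, 0.1), (-0.1, 0.1)]
decision = one_class_svm(normal, nu=0.5)
```

A point far from the training data has near-zero kernel values against every support vector, so its decision value is approximately −𝜌 < 0 and it is flagged as anomalous.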
Experiments
• Two datasets:
KDD Cup 1999
o Wide variety of simulated intrusions in a military network environment
o 4,900,000 instances
o The features are extracted from connection records. Examples are
duration, protocol type, number of bytes transferred, normal or error
status of the connection
o Four categories of attacks (total 24 attack types):
1) Denial of Service (e.g. syn flood)
2) Unauthorized access from a remote machine (e.g. password
guessing)
3) Unauthorized access to superuser functions (e.g. buffer overflow
attacks)
4) Probing (e.g. port scanning)
o Attacks are filtered so that the dataset consists of 1-1.5% attack instances
Experiments
System call data from BSM (Basic Security Module) portion of
1999 DARPA Intrusion Detection Evaluation data
o Consists of 5 weeks of BSM data of all processes on a Solaris
machine
o Consider two programs: eject and ps
o An attack can correspond to multiple processes because a
malicious process can spawn other processes. Consider an attack
detected if one of the processes corresponding to the attack is
detected
Experiments
• For network connection data, use a data-dependent normalization
feature map. It normalizes the relative distances between feature
values in the feature space
Numerical attributes: normalize to the number of standard
deviations away from the mean
Discrete attributes: an attribute with range of values Σ𝑖 maps
to |Σ𝑖| coordinates; the coordinate corresponding to the
attribute’s value is set to 1/|Σ𝑖| and the remaining coordinates
are set to 0. The distance between two vectors is thus weighted
by the size of the range of values of the discrete attribute.
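The two normalization rules above can be sketched as follows (the attribute names and domain are hypothetical examples, not taken from the dataset):

```python
import statistics

def normalize_numeric(column):
    """Numeric attribute: distance from the mean in standard deviations."""
    mu = statistics.mean(column)
    sd = statistics.pstdev(column) or 1.0   # guard against constant columns
    return [(v - mu) / sd for v in column]

def encode_discrete(value, domain):
    """Discrete attribute over domain Sigma: one coordinate per possible
    value; the matching coordinate gets 1/|Sigma|, the rest 0, so
    attributes with wide ranges contribute less to distances."""
    return [1.0 / len(domain) if v == value else 0.0 for v in domain]

protocols = ["tcp", "udp", "icmp"]          # hypothetical discrete domain
vec = encode_discrete("udp", protocols)
durations = normalize_numeric([1.0, 2.0, 3.0])
```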
• For sequences of system calls data, use a spectrum kernel
It maps short sub-sequences of the string into the feature space
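A minimal sketch of a spectrum kernel over call traces (the trace contents are invented for illustration):

```python
from collections import Counter

def spectrum_features(seq, k=3):
    """Spectrum feature map: counts of every length-k sub-sequence."""
    return Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))

def spectrum_kernel(s1, s2, k=3):
    """K(s1, s2) = dot product of the two k-gram count vectors,
    computed without materializing the full feature space."""
    f1, f2 = spectrum_features(s1, k), spectrum_features(s2, k)
    return sum(count * f2[gram] for gram, count in f1.items())

# Hypothetical system call traces (names are illustrative only)
t1 = ["open", "read", "read", "close"]
t2 = ["open", "read", "close"]
k11 = spectrum_kernel(t1, t1, k=2)
k12 = spectrum_kernel(t1, t2, k=2)
```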
Experiments
• The datasets are divided into two parts: one for training and
one for testing
• Parameters:
Experiments
• Detection rate: the number of intrusion instances detected
by the system divided by the total number of intrusion
instances present in the test set.
• False positive rate: the total number of normal instances
that were (incorrectly) classified as intrusions divided by
the total number of normal instances.
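The two metrics defined above are straightforward to compute (the labels below are a toy example, not the paper's results):

```python
def detection_and_fp_rates(y_true, y_pred):
    """y_true: 1 = intrusion, 0 = normal; y_pred: 1 = flagged as intrusion.
    Returns (detection rate, false positive rate) as defined above."""
    intrusions = sum(1 for t in y_true if t == 1)
    normals = len(y_true) - intrusions
    detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return detected / intrusions, false_pos / normals

# Toy example: 2 intrusions, 3 normal instances
dr, fpr = detection_and_fp_rates([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```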
• There were some types of attacks that the system detected
well and others that it could not detect, because under the
chosen feature map those attacks fell in the same region of
the feature space as the normal data.
Experiments
• For system call traces, all three algorithms perform
perfectly (100% accuracy)!
• There are many sub-sequences that occur in malicious
processes but not in normal processes. This explains why,
in this feature space, the intrusion processes lie
significantly far from the normal processes.
• Also, the normal processes clump together in the feature
space, which is why the intrusion processes are easily
detected as outliers.
Conclusions
• Each of these algorithms can be applied to any feature
space, which lets us apply the same algorithms to model
different kinds of data.
• On both datasets, the algorithms were able to detect
intrusions over unlabeled data. They can be applied to raw
collected system data, which does not need to be manually
labeled.
• Future work involves defining more feature maps
over different kinds of data and performing more
extensive experiments.