LEARNING FROM SEQUENTIAL DATA FOR ANOMALY
DETECTION
A Dissertation Presented
by
Esra Nergis Yolacan
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in
Electrical and Computer Engineering
in the field of
Computer Engineering
Northeastern University
Boston, Massachusetts
October 2014
© Copyright 2015 by Esra Nergis Yolacan
All Rights Reserved
Abstract
Anomaly detection has been used in a wide range of real-world problems and has
received significant attention in a number of research fields over recent decades.
Anomaly detection attempts to identify events, activities, or observations that are
measurably different from an expected behavior or pattern present in a dataset. This
thesis focuses on a specific set of techniques targeting the detection of anomalous
behavior in a discrete, symbolic, and sequential dataset. Since profiling complex
sequential data is still an open problem in anomaly detection, and given that the rate
of production of sequential data in fields ranging from finance to homeland security
is exploding, there is a pressing need to develop effective detection algorithms that
can handle patterns in sequential information flows.
In this thesis, we address context-aware multi-class anomaly detection as applied
to discrete sequences and develop a context learning approach using an unsupervised
learning paradigm. We begin the anomaly detection process by applying our approach
to differentiate normal behavior classes (contexts) before attempting to model normal
behavior. This approach leads to stronger learning on each class by taking advantage
of the power of advanced models to identify normal behavior of the sequence classes.
We evaluate our discrete sequence-based anomaly detection framework using two
illustrative applications: 1) System call intrusion detection and 2) Crowd anomaly
detection. We also evaluate how clustering can guide our context-aware methodology
to positively impact the anomaly detection rate.
In this thesis, we utilize a Hidden Markov Model (HMM) to perform anomaly
detection. An HMM, the simplest dynamic Bayesian network, is a Markov model
that can be used when the states are not observable, but the observed data
depend on these hidden states. While there has been a large amount of prior
work utilizing Hidden Markov Models (HMMs) for anomaly detection, the proposed
models became overly complex when attempting to improve the detection rate while
reducing the false detection rate.
We apply HMMs to perform anomaly detection on discrete sequential data. We
utilize multiple HMMs, one for each context class. We demonstrate our multi-HMM
approach on system call anomalies in cyber security and provide results in the presence
of anomalies. Applying process trace analysis with multiple HMMs, our system call
anomaly detection achieves better results, with better-tuned model settings and a
less complex structure for detecting anomalies.
To evaluate the extensibility of our approach, we consider a second application,
crowd behavior analytics. We attempt to classify crowd behavior and treat this as an
anomaly detection problem on sequential data. We convert crowd video data into a
discrete/symbolic sequence of data. We apply computer vision techniques to generate
features from objects, and use these features for frame-based representations to model
the behavior of the crowd in a video stream. We attempt to identify anomalous
behavior of a crowd in a scene by applying machine learning techniques to understand
what it means for a video stream to be identified as “normal”. The results of applying
our context-aware multi-HMMs approach to crowd analytics show the generality of
our anomaly detection approach, and the power of our context-learning approach.
Acknowledgements
In the name of God, the Most Gracious, the Most Merciful.
I dedicate this thesis to my beloved husband, Riza, from the depths of my heart
and soul. You have supported me throughout everything; I could not have accomplished
this without you. Thank you for your remarkable patience and unwavering
love during this doctoral journey. To my loving parents, Nermin and Feridun: you
have made me the person I am becoming by instilling the importance of hard work
and higher education. Thank you for being my inspiration and wonderful role models.
To my precious brother, Emre: you have always been there cheering me up, and you
stood by me through the good times and bad. Thank you for your never-ending
motivation and for always believing in me. I am grateful to all four of you for your
presence, and I love you more than you will ever know. Thank you for all of your
endless love, support, and encouragement.
I would like to express my gratitude to my advisor, Prof. Dr. David R. Kaeli,
for his guidance, understanding, and patience during the five years of my dissertation.
Thank you for being so supportive, for giving advice, for providing persistent help,
and for encouraging me to complete this task. I would also like to thank my
committee members, Prof. Dr. Jennifer G. Dy and Dr. Fatemeh Azmandian, for their
precious time and guidance throughout this dissertation. Your thoughtful comments
were invaluable. I thank all my dear colleagues in the NUCAR group for valuable
discussions, suggestions, and, most importantly, their friendship during my studies
at Northeastern University. I would like to thank Ayse Yilmazer for being with me
and sharing her experiences during my research. Additionally, I would like to thank
our graduate coordinator, Faith Crisley, for her advice and assistance during the years
of my study.
Contents
Abstract ii
Acknowledgements iv
1 Introduction 1
1.1 Anomaly Detection on Sequential Data . . . . . . . . . . . . . . . . . 2
1.2 Challenges of Working with Sequential Data . . . . . . . . . . . . . . 9
1.3 Contributions of the Work . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Organization of Dissertation . . . . . . . . . . . . . . . . . . . . . . . 16
2 Background 18
2.1 Intrusion Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.1 Data Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.2 Detection Method . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.3 Response Type . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Crowd Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 People Counting/Density Estimation . . . . . . . . . . . . . . 24
2.2.2 People Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.3 Behavior Learning . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Anomaly Detection Algorithms . . . . . . . . . . . . . . . . . 29
2.3.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.3 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . 31
2.3.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Related Work 40
3.1 Related Work in System Call Analysis . . . . . . . . . . . . . . . . . 40
3.1.1 Data Representation in System Call Analysis . . . . . . . . . . 41
3.1.2 HMM in System Call Analysis . . . . . . . . . . . . . . . . . . 43
3.2 Related Work in Crowd Analysis . . . . . . . . . . . . . . . . . . . . 46
3.2.1 Data Representation in Crowd Analysis . . . . . . . . . . . . . 46
3.2.2 HMM in Crowd Analysis . . . . . . . . . . . . . . . . . . . . . 49
3.3 Related Work in Context-aware Systems . . . . . . . . . . . . . . . . 51
3.3.1 Context-aware Applications . . . . . . . . . . . . . . . . . . . 52
3.3.2 Context Inference . . . . . . . . . . . . . . . . . . . . . . . . . 54
4 Context Learning 56
4.1 Context in a Symbolic Sequential Data . . . . . . . . . . . . . . . . . 56
4.2 Clustering for Context Learning . . . . . . . . . . . . . . . . . . . . . 57
4.3 Parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3.1 Number of Clusters . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.2 Required Length (Number of Symbols) . . . . . . . . . . . . . 63
4.4 Summary of Context Learning . . . . . . . . . . . . . . . . . . . . . . 68
5 System Call Anomaly Detection 70
5.1 System Call Trace Dataset . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.1 Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.2 Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.3 Clustering for Context Learning . . . . . . . . . . . . . . . . . 75
5.2.4 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Behavior Learning -Training . . . . . . . . . . . . . . . . . . . . . . . 79
5.4 Test and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Summary of System Call Trace Analysis . . . . . . . . . . . . . . . . 85
6 Crowd Anomaly Detection 88
6.1 Event Recognition Video Dataset . . . . . . . . . . . . . . . . . . . . 89
6.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.1 Feature Extraction from a Video . . . . . . . . . . . . . . . . 93
6.2.2 Clustering for Context Learning . . . . . . . . . . . . . . . . . 98
6.2.3 Windowing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3 Behavior Learning and Anomaly Detection . . . . . . . . . . . . . . . 103
6.4 Testing and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.5 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.6 Summary of Crowd Analysis . . . . . . . . . . . . . . . . . . . . . . . 110
7 Summary and Conclusion 114
7.1 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Bibliography 120
List of Figures
1.1 A general scheme of anomaly detection. . . . . . . . . . . . . . . . . . 3
1.2 Example of four different symbolic discrete sequences. . . . . . . . . . 6
1.3 An example of transitional probability-based features - two-state Markov
Chains. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 ROC curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3 K-fold cross validation . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.1 Design of our system call anomaly detection framework. . . . . . . . . 71
5.2 BIC values for various number of Hidden States . . . . . . . . . . . . 80
5.3 ROC curve for Set 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 ROC curve for Set 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.5 ROC curve for Set 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.6 ROC curve for one HMM . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.7 Time-series plot for process traces. . . . . . . . . . . . . . . . . . . . 85
5.8 Process-based evaluation results. . . . . . . . . . . . . . . . . . . . . . 86
6.1 Design of our crowd anomaly detection framework. . . . . . . . . . . 89
6.2 Dataset-S3 Events, frames 50 and 150 (left-to-right) [52]. . . . . . . . 91
6.3 Feature extraction from a video sequence. . . . . . . . . . . . . . . . 94
6.4 Top-down view of the surveillance area and camera position. . . . . . 95
6.5 Projective transformation (Homography) result of an original image
frame from PETS09 event recognition dataset. . . . . . . . . . . . . . 96
6.6 ROC curve for the velocity-based symbolic sequences, training and
testing with only a single model. . . . . . . . . . . . . . . . . . . . . . 106
6.7 ROC curves for velocity-based symbolic sequences, training and testing
on a per-context basis. . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.8 The ROC curve for direction-based symbolic sequences, training and
testing using only a single model. . . . . . . . . . . . . . . . . . . . . 108
6.9 ROC curves for direction-based symbolic sequences, training and test-
ing on a per-context basis. . . . . . . . . . . . . . . . . . . . . . . . . 109
6.10 ROC curve for distance based symbolic sequences, training and testing
with only a single model. . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.11 ROC curves for distance-based symbolic sequences, training and test-
ing on a per-context basis. . . . . . . . . . . . . . . . . . . . . . . . . 111
6.12 High-level design of crowd anomaly detection framework. . . . . . . . 112
List of Tables
1.1 An example of frequency vector based features . . . . . . . . . . . . . 7
1.2 An example of distance based features. . . . . . . . . . . . . . . . . . 7
1.3 An example of user command sequences . . . . . . . . . . . . . . . . 11
1.4 Unique subsequences: (normal dataset) . . . . . . . . . . . . . . . . . 12
1.5 Subsequences extracted from test sequence . . . . . . . . . . . . . . 12
1.6 Clustered sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.7 Unique subsequences from the clustered sequences: (normal dataset). 13
2.1 Categorization of Intrusion Detection Systems . . . . . . . . . . . . . 19
2.2 Video components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.1 A sample of the UNM program trace. . . . . . . . . . . . . . . . . . . 72
5.2 A sample of extracted process traces. . . . . . . . . . . . . . . . . . . 72
5.3 System call sequence for PID:552. . . . . . . . . . . . . . . . . . . . . 73
5.4 The UNM sendmail trace dataset. . . . . . . . . . . . . . . . . . . . . 74
5.5 Clustering results of UNM sendmail processes . . . . . . . . . . . . . 76
6.1 Video sequences in the PETS09 Dataset-S3:High Level. . . . . . . . . 92
6.2 Time intervals of the events in the PETS09 Dataset-S3:High Level. . 92
6.3 Non-overlapping sequence partitions generated from the PETS09 Dataset-
S3: High Level. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.4 Normal sequence partitions for velocity-based events. . . . . . . . . . 102
Chapter 1
Introduction
The goal of anomaly detection is to identify anomalous behavior, events or items based
on deviations from expected normal cases. Anomaly detection is a research area that
has been studied extensively for a range of application domains, such as computer
and network monitoring for intrusion detection, video and image processing for crowd
analytics, activity monitoring for fraud detection, DNA analysis for mutation and
disease detection, bio-surveillance for disease outbreak detection, and sensor data
analysis for fault diagnosis. We can select a particular method to address these
problems, though the same method can be applied to other domains given a similar
data representation, problem formulation and nature of the anomalies. Particular
anomaly detection processes include outlier detection, novelty detection, deviation
detection and exception mining. The processes differ based on the application domain
and the employed detection approaches [65]. In our work, we have used the term
anomaly detection to describe the process of differentiating abnormal behavior from
normal behavior in a problem-relevant dataset.
1.1 Anomaly Detection on Sequential Data
Sequential data is a valuable source of information that is available in many aspects
of our lives, including weather prediction [126, 102], frequency analysis for phonetic
speaker recognition [22], unusual human action detection in a video [80], pattern
discovery for product placement in supermarkets [6, 56], and detection of mutations
in a gene sequence [79, 91].
Sequences can be discrete or continuous in terms of the value they take for each
uniform time interval. A continuous sequence, also known as a time series, is a sequence
of data points which are obtained by measuring a variable at discrete time points [29].
A data point in a continuous sequence may take on any value within a certain range
for the measured variable, such as the height of a tree year after year or the daily air
temperature of a city. To perform a learning task in continuous sequences, researchers
may prefer to apply dimensionality reduction and discretization methods to obtain a
discrete representation [77, 90, 56].
A discrete sequence is an ordered series of symbols, which can be characters, numbers,
or words [147]. A data point in a discrete sequence may take only values limited to a
finite alphabet; for example, a gene is a sequence of DNA nucleotides, and a program
execution trace is a sequence of system calls. In symbolic sequences, the value of a
data point is typically not meaningful individually, but it provides valuable
information when considered together with the other symbols in the sequence.
In this thesis, we demonstrate our approach using discrete (symbolic) sequences to
perform the anomaly detection task in two different application domains: 1) intrusion
detection for cyber security and 2) crowd behavior detection for physical security. In
anomaly detection-based intrusion detection, we use a sequence of system calls as
instances in a discrete (symbolic) sequence to detect abnormal program behavior.
In crowd anomaly detection, we extract features frame-by-frame from a crowd video
dataset and use those features as a discrete sequence of symbols to detect crowd
behavior.
We present a general scheme of the anomaly detection process in Figure 1.1. The
nature of the anomaly detection process requires a well-defined profile to learn the
normal behavior. To address this issue, machine learning [134], data mining [48]
and statistical methods [113] are some of the techniques used to profile the normal
behavior during anomaly detection. The extracted profile can be anything that can
distinguish normal from abnormal behaviors, such as pattern sets, rule sets, probability
distributions, or statistical models.
A general definition for describing the anomaly detection task on a discrete sequence
of symbols can be stated as:
Definition 1.1 Given a single sequence S = {s1, s2, s3, ..., sn}, where each si is a
symbol from a finite alphabet Σ, n is the number of symbols in S, and i = 1, 2, ..., n;
anomaly detection is the task of deciding whether S is normal or abnormal with
respect to learned normal behavior.
Anomaly detection techniques generally use a threshold value to decide whether to
raise an alarm and flag an anomaly. Typically, there are two different approaches
Figure 1.1: A general scheme of anomaly detection.
for the evaluation of anomalies in a discrete sequence of data. The first approach is
based on assigning an anomaly score to the entire sequence. If the anomaly score
is higher than a predefined threshold, then the sequence is labeled as abnormal [64].
In this approach, normal sequences are expected to have a lower anomaly score than
the ones that include an anomaly. An anomaly detection technique in this category
needs to apply normalization, compensating for the sequence length to provide a
fair evaluation. Although this normalization approach eliminates the impact of False
Positives (FP) in a normal sequence, it may also eliminate the True Positives (TP)
in an abnormal sequence. For example, if an abnormal sequence is too long, the
anomaly score may never reach the threshold value. Therefore, the success of this
kind of evaluation depends on the density of abnormal events in the entire sequence
and it is difficult to determine where the anomaly starts in a sequence of data.
The second approach is based on performing an evaluation on regions of the se-
quence to compute an anomaly score [63]. In this approach, abnormal portions of
a sequence can be detected when the anomaly score of the evaluated region reaches
some predefined threshold. If desired, the anomaly score for the entire sequence can
be obtained by combining all of the anomaly scores assigned to the regions. Three
advantages of this approach (which motivate our work) are as follows:
• First, since region-based analysis simplifies the data, it allows a wider array of
techniques to be applied.
• Second, since this approach works on only a small portion of the data, it enables
us to detect local anomalies which would be missed in the first approach.
• Third, since this analysis approach computes anomaly scores in regions, it performs
the anomaly detection task without needing to examine the entire sequence.
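The contrast between the two evaluation strategies can be sketched as follows. The per-symbol scorer here is a hypothetical placeholder standing in for a learned model (higher means more anomalous):

```python
# A sketch of the two evaluation strategies, assuming a hypothetical
# per-symbol scorer that stands in for a learned model.

def symbol_score(symbol):
    # Placeholder for illustration: treat the symbol 'X' as anomalous.
    return 1.0 if symbol == "X" else 0.0

def whole_sequence_score(seq):
    # Approach 1: a single length-normalized score for the entire sequence.
    return sum(symbol_score(s) for s in seq) / len(seq)

def region_scores(seq, width):
    # Approach 2: score fixed-size regions independently, so a local
    # anomaly is not diluted by the length of the whole sequence.
    return [whole_sequence_score(seq[i:i + width])
            for i in range(0, len(seq) - width + 1, width)]

seq = list("AAAAAAAAAAAAAAAAXXAA")
print(whole_sequence_score(seq))  # 0.1 — the local burst is diluted
print(region_scores(seq, 4))      # the region containing 'XX' scores 0.5
```

With a detection threshold of, say, 0.3, the whole-sequence score misses the anomaly while the region-based scores expose it.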
Typically, anomaly detection is a time-sensitive task, especially when it is
applied for security purposes. When an anomaly is detected, we need to stop the
anomalous activity, eliminate or minimize its effects, and investigate the cause of
the anomaly. Region-based evaluation, which works on partitions of a sequence,
is well suited for real-time anomaly detection applications, but detection still needs
to be implemented using efficient algorithms.
In this thesis, we present an approach that performs a region-based evaluation
by using subsequences generated from discrete sequences via a fixed-size windowing
technique. We use the term sequence to refer to an entire symbol sequence in a
sequence dataset, and the term subsequence to refer to a shorter run of consecutive
symbols.
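The fixed-size windowing step can be sketched as follows (the example string is illustrative):

```python
# Fixed-size windowing: slide a window of `width` symbols over a sequence,
# one symbol at a time, to produce its subsequences.

def subsequences(seq, width):
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]

print(subsequences("BACBABCB", 3))
# ['BAC', 'ACB', 'CBA', 'BAB', 'ABC', 'BCB']
```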
The aim of our work is to provide a generic and effective anomaly detection
framework for discrete sequences:
• The technique should be generic, so that it can be applied to discrete sequence
data collected from various domains. We evaluate our discrete sequence-based
anomaly detection framework by using two illustrative application problems:
1) Intrusion detection and 2) Crowd anomaly detection.
• The technique should be effective: a sequence-based anomaly detection
system must learn normal behavior from the given sequences in the dataset in
order to detect anomalies in the same manner as a human expert would.
The effectiveness of an anomaly detection process relies on how well the model is
designed. Therefore, the main challenge in anomaly detection is extracting beneficial
information from the given sequences. To begin to pursue the anomaly detection
problem by analyzing the ordered sequence of symbols, we need to use methods
specifically designed for this purpose.
Figure 1.2: Example of four different symbolic discrete sequences.
To highlight the importance of the methods selected, we start with an illustrative
example. In Figure 1.2, we have four equal-length discrete sequences generated from
a two-symbol alphabet Σ = {A, B}. Even though it seems obvious that each sequence
has different characteristics, it is essential to select an effective method to differentiate
between these sequences. In this example, we contrast frequency vector-based,
distance-based, and relation-based features for differentiating the sequences.
For vector-based feature extraction, we count the number of occurrences of each symbol
in the sequence. For distance-based feature extraction, we use a basic similarity
For vector-based feature extraction, we count the number of occurrences of each sym-
bol in the sequence. For distance-based feature extraction, we use a basic similarity
measure to compute the distance between given sequences. For relation-based feature
extraction, we used the transition information between the symbols.
First, a frequency-based feature vector is calculated for each sequence and presented
in Table 1.1. Although the orders of symbols in the given sequences differ
from each other, the extracted feature vectors are identical, and thus we would
be unable to differentiate between these sequences.
Table 1.1: An example of frequency vector based features
Sequences A B
Sequence 1 8 8
Sequence 2 8 8
Sequence 3 8 8
Sequence 4 8 8
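The frequency-vector computation behind Table 1.1 can be sketched as follows. Since the exact orderings of Figure 1.2 are not reproduced here, the four sequences below are assumed orderings that share the table's 8/8 symbol counts:

```python
from collections import Counter

# Four equal-length sequences over the alphabet {A, B}; the orderings are
# assumed, but each has the 8/8 symbol counts of Table 1.1.
sequences = ["ABABABABABABABAB",
             "AABBAABBAABBAABB",
             "AAAABBBBAAAABBBB",
             "AAAAAAAABBBBBBBB"]

for seq in sequences:
    counts = Counter(seq)
    # Every sequence yields the identical feature vector (8, 8),
    # so frequency features cannot tell them apart.
    print(counts["A"], counts["B"])
```

All four lines print `8 8`, illustrating why frequency features alone fail on this example.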
Second, we use Hamming Distance to compute the distances between a sequence
and the other sequences. Hamming distance is a simple distance measure that counts
the number of places where the symbols mismatch. In Table 1.2, each column shows
the distances between that sequence and the other sequences. Every sequence has
the same distance (8) to every other sequence in this example. While there are more
sophisticated and domain-specific similarity/distance measures that can be used to
evaluate discrete sequences, they do not consider the transitional probabilities between
the symbols in a sequence. Rather than extracting behavioral information, they are
better suited to computing a similarity or a distance between two sequences.
Table 1.2: An example of distance based features.
Sequences Sequence 1 Sequence 2 Sequence 3 Sequence 4
Sequence 1 0 8 8 8
Sequence 2 8 0 8 8
Sequence 3 8 8 0 8
Sequence 4 8 8 8 0
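Hamming distance itself is a one-line computation. The two example sequences below are assumed orderings consistent with the pairwise distance of 8 reported in Table 1.2:

```python
# Hamming distance: count the positions at which two equal-length
# sequences disagree. The example sequences are assumed orderings
# consistent with Table 1.2.

def hamming(a, b):
    assert len(a) == len(b), "Hamming distance requires equal-length sequences"
    return sum(x != y for x, y in zip(a, b))

print(hamming("ABABABABABABABAB", "AABBAABBAABBAABB"))  # 8
```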
Figure 1.3: An example of transitional probability-based features - two-state Markov Chains.
Third, we use the relationships between symbols. To extract behavioral information
from a sequence, we need to consider the ordering of the symbols. A
fundamental way to perform this behavior extraction is to calculate the transitional
probabilities between the symbols in a sequence. To model the relation features
based on the transitions between symbols, we use a Markov chain. We use a first-order
Markov process, the simplest Markov model, to represent the transitional probabilities
from one state (symbol) to another [20]. In Figure 1.3, we show the two-state Markov
chain used to compute the probability of the next symbol, which depends only on the
current symbol. Using the ordering information present in a symbolic sequence, we are
able to differentiate between sequences and learn the system behavior.
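Estimating first-order transition probabilities from a sequence can be sketched as follows. The example sequences are assumed orderings with identical 8/8 symbol counts, chosen to show that transition features separate sequences that frequency vectors cannot:

```python
from collections import Counter

# First-order Markov transition probabilities estimated from bigram counts.
# The example sequences are assumed orderings with identical symbol counts.

def transition_probs(seq):
    bigrams = Counter(zip(seq, seq[1:]))  # counts of (current, next) pairs
    totals = Counter(seq[:-1])            # outgoing-transition counts per symbol
    return {(a, b): n / totals[a] for (a, b), n in bigrams.items()}

print(transition_probs("ABABABABABABABAB"))  # P(B|A) = P(A|B) = 1.0
print(transition_probs("AAAAAAAABBBBBBBB"))  # dominated by self-transitions
```

The two sequences have identical frequency vectors but very different transition tables, which is exactly the ordering information a Markov model exploits.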
A Hidden Markov Model (HMM), the simplest dynamic Bayesian network, is a
Markov model that can be used when the states in a process are not observable, but
the observed data depend on these hidden states. HMMs rely on two properties:
1) the observation at time t was generated by some process whose state Ht is hidden
from the observer, and 2) the state of the hidden process satisfies the Markov
property. In this thesis, we apply HMMs to generate the normal behavior model of
subsequences for two different application domains: cyber-security intrusion detection
and crowd anomaly detection. Since we collect only symbolic sequences to detect
anomalies, the states that generate these symbols are hidden in our case. An HMM
is well suited to our problem under these circumstances, because the sequential data
we use in these domains can be considered observation data from which to learn
and detect the actual underlying program and crowd behaviors.
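How an HMM scores a discrete observation sequence can be illustrated with the forward algorithm, which sums over all hidden-state paths. The two-state model parameters below are illustrative choices, not models trained in this thesis:

```python
# A minimal sketch of scoring a discrete sequence with an HMM via the
# forward algorithm. All parameters below are illustrative, not learned.

def forward_likelihood(obs, start_p, trans_p, emit_p):
    """P(obs | model) for a discrete-emission HMM, summed over all paths."""
    states = list(start_p)
    # Initialization: start in each state and emit the first symbol.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    # Induction: fold in one observed symbol at a time.
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

start = {"H1": 0.5, "H2": 0.5}
trans = {"H1": {"H1": 0.9, "H2": 0.1}, "H2": {"H1": 0.1, "H2": 0.9}}
emit = {"H1": {"A": 0.9, "B": 0.1}, "H2": {"A": 0.1, "B": 0.9}}

# The "sticky" states fit a bursty sequence better than an alternating one.
print(forward_likelihood("AAAABBBB", start, trans, emit))
print(forward_likelihood("ABABABAB", start, trans, emit))
```

In practice, log probabilities are used to avoid underflow on long sequences, and the parameters are learned (e.g., with Baum-Welch) rather than fixed by hand.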
1.2 Challenges of Working with Sequential Data
Next, we focus on the challenges of using discrete (symbolic) sequential data for
anomaly detection. Although there are many algorithms and approaches to use in
the anomaly detection task, they are often not directly applicable to sequential
data, since they accept input data only as a vector of features. It is possible
to generate a feature vector for sequence data instances, but this is undesirable for
two reasons: First, when we consider a feature vector extraction by transforming
the data (e.g., extracting the frequency distribution of the symbols in a sequence),
the dimension of the vector is dependent on the alphabet size, and this can increase
the computation cost. Second, sequential data includes transition-based features for
the sequence. When we transform the data into a vector of features, it is no longer
possible to take advantage of the probabilistic structure of sequences. To extract the
behavior of a data sequence, we need to apply machine learning techniques without
losing the transitional features in the symbol sequences.
Another challenge encountered when using sequential data can be seen when com-
puting a similarity or dissimilarity measure between two sequences. Although there
are various similarity/dissimilarity metrics in the literature, most of them are not
applicable directly to sequential data or cannot satisfy the mandatory condition of
providing a real distance metric [61]. Also, there is typically a need to perform
preprocessing before computing a similarity/dissimilarity measure, since the sequences
in a dataset almost always differ in length, or may be too long for such measures to
be practical.
From the dataset perspective, the labeling process is often challenging, even for
vector instances. When we consider sequential data, it is even more difficult to label
abnormal segments in the sequence, since the distinction between normal and abnormal
data is imprecise, and abnormal data could be interspersed throughout the whole sequence.
All of these issues make sequential data analysis a complex and challenging process.
In this thesis, we address an additional challenge when working with sequential
data, one which affects our ability to detect anomalies in discrete sequences. In previous work on
anomaly detection with sequential data, researchers typically assumed that there is
only a single normal behavior if the data is from a single data source [32]. But in
our analysis, we found that data sequences can include multiple normal behaviors.
For example, a sequence of computer user commands would be used to learn normal
behavior of a user, and the learned behavior would be used later for anomaly
detection [31]. The computer user would have different tasks to complete on different days;
therefore, the distribution of user commands would change according to the schedule
of work to be done on each successive day. In this case, learning a normal
behavior for each similar work day should lead to improved detection accuracy versus
learning only a single normal behavior for all days.
We present an example to illustrate this problem. Let's assume that we have some
normal discrete sequences generated from an alphabet of user commands (Σ = {A, B, C}),
as displayed in Table 1.3. To generate a dictionary-based model (i.e., a set of
patterns) from these normal sequences, as shown in Table 1.4, we extract subsequences
using a sliding window of length 3.
Table 1.3: An example of user command sequences
1. BACBABCB
2. ABACBABCBABACBABCBA
3. CBABABCBAC
4. ACBABACBABABABCBA
5. AABBAABBAABBAACCBB
6. BBCCAACCAABBAA
7. CCAABBAACCAABBAACC
...
n. ...
Since we extract normal subsequences, we can detect anomalies in a test sequence
by following the same preprocessing steps and then comparing the test subsequences
with the normal data dictionary. For a given test sequence, ‘CCBBAABBAABBAACCBBACBAC’,
the generated subsequences are presented in Table 1.5. If we
compare the subsequences with the dictionary, we will find that there are no abnormal
Table 1.4: Unique subsequences (normal dataset)
ABA, BAC, ACB, CBA, BAB, ABC, BCB,
AAB, ABB, BBA, BAA, AAC, ACC,
CCB, CBB, BBC, BCC, CCA, CAA
sequences since all of the subsequences appear in the normal dataset.
Table 1.5: Subsequences extracted from test sequence
CCB, CBB, BBA, BAA, AAB, ABB, BBA,
BAA, AAB, ABB, BBA, BAA, AAC, ACC,
CCB, CBB, BBA, BAC, ACB, CBA, BAC
If we examine the normal sequences in detail, we can see the differences between
them and the test sequence: the test sequence starts with repeated symbol patterns,
but includes the other structure as well. This kind of anomaly can be detected only
if the sequences are clustered and a model is trained for each cluster. The clustering
results and the dictionary for each cluster are presented in Tables 1.6 and 1.7,
respectively.
If we classify the test sequence in order to evaluate its subsequences against the
dictionary of the relevant cluster, it falls into Cluster 2, since it is more similar to
that cluster than to the other. Since the test sequence is classified into the second
cluster, we can evaluate it using the normal dictionary patterns of Cluster 2. The
subsequences BAC, ACB, CBA, and BAC in the test sequence do not appear in
Cluster 2's normal pattern set; therefore, we can detect those subsequences, since
they are not expected under this model.
Table 1.6: Clustered sequences
1. Cluster 2. Cluster
BACBABCB AABBAABBAABBAACCBB
ABACBABCBABACBABCBA BBCCAACCAABBAA
CBABABCBAC CCAABBAACCAABBAACC
ACBABACBABABABCBA ...
...
Table 1.7: Unique subsequences from the clustered sequences: (normal dataset).
1. Cluster ABA, BAC, ACB, CBA, BAB, ABC, BCB
2. Cluster AAB, ABB, BBA, BAA, AAC, ACC, CCB, CBB, BBC, BCC, CCA, CAA
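Under the same toy setup, the per-cluster evaluation can be sketched as follows. A simple subsequence-overlap score stands in for the classification step; the names are illustrative:

```python
def subseqs(seq, w=3):
    """Sliding-window subsequences of length w."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

cluster1 = ["BACBABCB", "ABACBABCBABACBABCBA",
            "CBABABCBAC", "ACBABACBABABABCBA"]
cluster2 = ["AABBAABBAABBAACCBB", "BBCCAACCAABBAA", "CCAABBAACCAABBAACC"]

# Per-cluster dictionaries (Table 1.7)
dict1 = {s for q in cluster1 for s in subseqs(q)}
dict2 = {s for q in cluster2 for s in subseqs(q)}

test = "CCBBAABBAABBAACCBBACBAC"
# Assign the test sequence to the more similar cluster by subsequence overlap,
# then evaluate it against that cluster's dictionary only.
overlap1 = len(set(subseqs(test)) & dict1)
overlap2 = len(set(subseqs(test)) & dict2)
model = dict2 if overlap2 >= overlap1 else dict1
flagged = [s for s in subseqs(test) if s not in model]
print(flagged)  # → ['BAC', 'ACB', 'CBA', 'BAC']
```

Against the global dictionary nothing is flagged, but against the second cluster's dictionary the structural mismatch becomes visible.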
We presented a simple example to explain the motivation behind our context learn-
ing (clustering) approach. Although there may be only slight differences between the
sequences collected from real data sources, a dedicated clustering technique can be
used to categorize them.
Note that the given problem differs from typical context-based anomaly
detection in two ways: 1) defining a context is not straightforward, since we work
with discrete sequences that contain only the symbols themselves and no other context
attribute, and 2) context-based anomaly detection on sequential data generally
depends on surprise detection with respect to the previous data in the same sequence.
On the other hand, our contextual anomaly detection approach takes into account
other similar sequences in the same category, instead of relying solely on the behavior
of the current sequence.
In this thesis, we address the context-based multi-class anomaly detection prob-
lem on discrete sequences. We accomplish this by applying a clustering approach to
differentiate normal behavior classes (contexts) before implementing an anomaly de-
tection task. This approach provides a better learning model for each class by taking
advantage of the power of HMMs to capture the normal behavior of the sequence
classes.
This thesis enhances the current state-of-the-art and makes key contributions to
the following areas:
• Machine learning/data mining approach: Devising a novel, effective and generic
context-aware anomaly detection framework for discrete sequential data by
adapting sophisticated machine learning algorithms.
• Application domains: Applying the presented anomaly detection approach to
address two different application domains, each of which is critical for security.
Next, we discuss the major contributions of this work and describe the organiza-
tion of the remainder of the thesis.
1.3 Contributions of the Work
The thesis consists of two parts: Part one focuses on the basic research issues arising
in sequential data anomaly detection and presents a generic framework for anomaly
detection on discrete sequences. Part two evaluates the proposed sequential anomaly
detection approach to build real applications in multiple domains: cyber-security and
physical security.
The contributions of the thesis are:
We address potential issues in behavioral learning where the data is sequential.
Most existing classification-based anomaly detection techniques cannot be applied
directly to sequence-based anomaly detection because of several limitations. In
this thesis, we investigate the challenges of working with sequential data to understand
these limitations and to provide an anomaly detection method better suited for this
task.
We present a novel framework for context-based sequence anomaly detection, where
defining a context is not a straightforward task. Typically, a context-aware analysis
should include a context attribute to differentiate contexts and to learn a behavior
for each context. For example, the daily temperature of an area over the last few years
could be analyzed in the context of seasons or months. In our case, discrete sequences
do not include any defining attributes to identify a context in a straightforward fashion.
We fully explore the idea of context learning for incomplete discrete sequences
where the data arrives in streaming form. Most existing sequence clustering techniques
fail to account for incomplete sequences, and are limited by length and alignment
issues. We provide a fundamental methodology for context learning by clustering
sequences based on the similarity of their first l symbols, assuming that sequences
generated from a common distribution will start with similar symbol orderings. The
parameter l selects the subsequence length used for the similarity computation and
will differ according to the sequence domain. We should select an l long enough
to capture similarities between sequences, yet small enough to allow online cluster-
ing for context learning on streaming data.
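As a rough sketch of this idea (not the exact algorithm used in the thesis), sequences can be grouped online by the n-gram overlap of their first l symbols. The Jaccard measure, the threshold, and all names below are illustrative assumptions:

```python
def prefix_ngrams(seq, l=10, n=3):
    """n-grams of the first l symbols of a sequence."""
    prefix = seq[:l]
    return {prefix[i:i + n] for i in range(len(prefix) - n + 1)}

def online_cluster(sequences, l=10, n=3, threshold=0.3):
    """Assign each incoming sequence to the nearest cluster, or open a new one."""
    reps, labels = [], []  # one representative n-gram set per cluster
    for seq in sequences:
        grams = prefix_ngrams(seq, l, n)
        best, best_sim = None, 0.0
        for i, rep in enumerate(reps):
            union = grams | rep
            sim = len(grams & rep) / len(union) if union else 0.0
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            labels.append(best)
        else:
            reps.append(grams)           # open a new cluster (context)
            labels.append(len(reps) - 1)
    return labels

print(online_cluster(["ABABABAB", "ABABABBA", "CDCDCDCD", "CDCDCCDD"]))
# → [0, 0, 1, 1]
```

Because only the first l symbols are examined, each sequence can be assigned to a context as soon as its prefix has arrived, which is what makes the scheme usable on streaming data.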
We develop an anomaly detection-based intrusion detection method where the data
consists of symbolic sequences of system calls gathered from program execution traces. We
evaluated our context-based anomaly detection approach by experimenting on a well-known
benchmark dataset taken from system call traces of Unix privileged programs.
This work has been published in the proceedings of the 2014 Software Security and
Reliability Conference (SERE) [155].
We illustrate the generality of our approach by applying it to the problem of crowd
anomaly detection. We review competing techniques and approaches for crowd anal-
ysis. Our proposed method learns crowd behavior from symbolic sequences in order
to detect abnormal crowd activities. To obtain a symbolic sequence from a video, we
represent global events occurring in a scene by using symbolic data for each video
frame.
1.4 Organization of Dissertation
In this chapter, we introduce the problem of anomaly detection on discrete/symbolic
sequential data. We discuss the challenges of working on sequential data and also
provide the motivation behind using HMMs for system call and crowd event anomaly
detection. The rest of the dissertation is organized as follows:
In Chapter 2, we provide background material to this thesis. This includes a
survey of intrusion detection and crowd analysis methods, as well as in-depth expla-
nation of the techniques that have been proposed in prior work. We also provide a
background of the anomaly detection algorithms and evaluation techniques used in
this dissertation.
In Chapter 3, we discuss the related research in system call analysis, crowd analysis
and context-aware systems.
In Chapter 4, we present an automated context learning approach. We detail the
design challenges and solutions.
In Chapter 5, we describe the presented framework for system call anomaly
detection-based intrusion detection. We evaluate the accuracy of our implementa-
tion using a commonly used system call dataset.
In Chapter 6, we describe the presented framework for crowd anomaly detection.
We evaluate the accuracy of our implementation using a public benchmark video
dataset.
In Chapter 7, we conclude the thesis and summarize our work. We also suggest
possible topics for future research.
Chapter 2
Background
In this chapter, we present background information on concepts used throughout this
thesis, including intrusion detection, crowd analysis, anomaly detection algorithms
and selected machine learning algorithms.
2.1 Intrusion Detection
Intrusion is defined as “any set of actions that attempt to compromise the integrity,
confidentiality, or availability of a resource” [60]. Intrusion detection is the process
of detecting these actions by monitoring and analyzing the systems [157]. An intrusion
detection system (IDS) is a hardware or software system developed to perform the
intrusion detection process [123].
As listed in Table 2.1, IDSs can be classified from several perspectives, such as
the time of analysis, the structure of the detector, the source of the data, the data
analysis method, or the response to the detection [38, 39, 11].
Based on the time of analysis, an IDS can be categorized as online (real-time) or
offline, depending on its data monitoring and evaluation methods. While online IDSs
analyze a real-time data feed to provide real-time detection, offline IDSs evaluate
stored data periodically. Another dimension along which to differentiate IDSs is their
structure: based on the data collection and preprocessing environment, an IDS can be
either centralized or distributed. The remaining categorization criteria in Table 2.1,
data, analysis and response, are also defined as the functional components of IDSs
in [12]. These components are explained in the next subsections.
Table 2.1: Categorization of Intrusion Detection Systems
Time        Online        Offline
Structure   Centralized   Distributed
Data        Host          Network
Analysis    Misuse        Anomaly
Response    Active        Passive
2.1.1 Data Source
IDSs can be classified into two main categories depending on data sources and target
system used: 1) host-based and 2) network-based IDSs.
Host-based intrusion detection system:
A host-based intrusion detection system (HIDS) monitors host system
activities to determine whether known attack behavior or unusual behavior occurs. The
main role of host-based detection is to protect the host system by analyzing the
monitored data.
Network-based intrusion detection system:
A network-based intrusion detection system (NIDS) analyzes network-related system
activities to detect intrusions such as denial of service (DoS), port scans, remote
unauthorized access, etc.
2.1.2 Detection Method
In order to detect an intrusion, there are two different IDS modeling approaches:
1) misuse and 2) anomaly detection.
Misuse detection:
In misuse detection, the IDS is profiled using previously known intrusion
behaviors. This profile is called the signature of the attack. Signatures can be
extracted from any data source that captures an intrusion pattern,
such as user commands, system calls, audit events, network packets, keystrokes, etc.
Misuse detection is based on looking for known intrusion patterns in the target
data. If any match appears, the system raises an alarm for the detected intrusion.
This approach can sometimes detect different versions of similar kinds of
attacks, provided that the extracted intrusion profile generalizes well enough to match
the new versions. The main advantage of signature detection-based approaches is the ability to
detect known intrusions, but they are not capable of detecting novel attacks [89].
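As a toy illustration of the signature-matching idea, a trace can be scanned for known attack subsequences. The signatures and the trace below are invented for this sketch:

```python
# Known intrusion signatures as short system-call patterns (illustrative only)
signatures = {("open", "chmod", "exec"), ("socket", "bind", "exec")}

def misuse_alarm(trace, w=3):
    """Raise an alarm if any length-w window of the trace matches a signature."""
    windows = {tuple(trace[i:i + w]) for i in range(len(trace) - w + 1)}
    return bool(windows & signatures)

trace = ["read", "open", "chmod", "exec", "close"]
print(misuse_alarm(trace))  # → True
```

A trace containing none of the stored patterns raises no alarm, which is exactly why novel attacks slip past signature-based detection.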
Anomaly detection:
In anomaly detection, the IDS is profiled using previously known normal be-
havior. Again, profiles are extracted from any data source that can define normal
behavior. Unlike misuse detection, anomaly detection looks for mismatches or
deviations from a normal profile. According to predefined thresholds, mismatching
behaviors raise an alarm for an anomaly. The main advantage of this approach is the
ability to detect new types of intrusions, which may manifest as deviations from the
normal profile. The main drawback is the high rate of false positives,
which can be caused by undefined normal behavior or by noise that is not related
to an attack.
2.1.3 Response Type
In general, an IDS has two different responses: 1) active and 2) passive response.
Active response:
An active response of an IDS means providing an automatic response once an attack is
detected. The idea behind an active response is preventing the target system from any
further damages and minimizing the effect of intrusion [24]. These active responses
can be explained by three main categories: collecting additional information to resolve
the attack type and effects, changing the environment by blocking or reconfiguring
the system components, and taking action against the intruder [12]. An IDS with an
active response mechanism needs a real-time detection approach, since the time
gap between the start of an attack and its detection leaves the system vulnerable to
exploitation during that period. A major issue with this approach is the possibility of
an inappropriate response, such as blocking normal traffic, because of an incorrect
implementation or configuration [158].
Passive response:
A passive response of an IDS is based on producing an alarm for a detected attack.
This kind of response aims to inform the system user or an administrator rather than
taking actions automatically to prevent an attack [156]. The alarm may come with a
report that includes system logs, potential vulnerabilities and attack types to allow
the administrator to perform a further investigation. A major issue of this approach
is the delay between the intrusion and the human response for critical systems [125].
2.2 Crowd Analysis
Although there is no single definition of a crowd since it changes depending on the
components (such as size, density, time, etc.) that characterize the crowd, a general
definition is given by Chandella et al. [25] as a “sizable number of people gathered at
a specific location for a measurable time period, with common goals and displaying
common behaviors”. In the computer vision domain, this definition can be summa-
rized as a group of individuals in a scene.
Crowd Analysis refers to a collection of methods that focus on studying crowd
related problems. In the last decade, with the growing interest in surveillance of
crowded scenes to provide increased public safety, and with the decreasing cost of video
equipment, crowd analysis has become a popular strategy for automatically understanding
the behavior of crowded environments. In most previous work, people and their activities
are the main targets for learning, particularly in public places. Crowd analysis can be
used for various kinds of purposes, such as crowd management, public space design,
virtual environments, visual surveillance, and intelligent environments [159].
There are various classes of video components which impact the selection of
method and approach taken to perform crowd analysis [23, 159]. We divide the
video components into three categories: 1) target, 2) environment and 3) sensor-
based components as presented in Table 2.2 (the video components in our case are
shown in italic).
Table 2.2: Video components
Target Based
Number Single Multiple
Density Sparse Dense
Rigidity Rigid Non-Rigid
Occlusions Low High
Environment Based
Background Moving Static
Space Indoor Outdoor
Light Day Night
Sensor Based
Number Single Multiple
Platform Moving Static
Video type Color Gray
Resolution Low High
The main challenge in crowd analysis is working with multiple targets, since com-
puter vision techniques for object analysis are not directly applicable to multiple
objects. Besides this, the density of a crowd scene is another important factor while
choosing a method to analyze the scene [40]. Crowd density can range from sparse
to very dense, depending on the target surveillance data.
Typically, crowd analysis can be divided into three main categories: 1) people
counting, 2) tracking and 3) behavior understanding [71].
2.2.1 People Counting/Density Estimation
Crowd density estimation and people counting are two fundamental problems for
managing and planning purposes. The challenge in crowd density estimation and
people counting arises when people exhibit limited motion, such as gathering and
waiting [161]. Solutions to these two problems can also be used for object detection,
serving as a feature extraction method for further crowd analysis such as tracking and
behavior learning. There are various techniques presented in the literature that can be
grouped into three main categories:
Pixel-based analysis:
Pixel-based analysis mostly relies on background subtraction methods. Background
subtraction methods are based on modeling a background in order to differentiate the
foreground objects in each incoming frame by analyzing the frame pixel-by-pixel. Although
there are various techniques for background subtraction, applying an approximate median
filter provides better results for density estimation in a crowded scene [35].
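The approximate median filter can be sketched as follows: the background estimate is nudged one intensity step toward each incoming frame, so over time it converges to the temporal median of each pixel. This is a minimal sketch; grayscale uint8 frames and the example values are assumptions:

```python
import numpy as np

def update_background(background, frame):
    """One approximate-median update: move each background pixel one
    intensity step toward the corresponding pixel of the incoming frame."""
    bg = background.astype(np.int16)
    bg += (frame > background)   # step up where the frame is brighter
    bg -= (frame < background)   # step down where the frame is darker
    return np.clip(bg, 0, 255).astype(np.uint8)

background = np.zeros((2, 2), dtype=np.uint8)
frame = np.array([[100, 0], [0, 0]], dtype=np.uint8)
updated = update_background(background, frame)  # only pixel (0, 0) moves up by 1
```

Foreground pixels are then those whose current value differs from the background estimate by more than a threshold.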
Texture-based analysis:
The idea behind texture-based analysis is to assume that a crowd with low density
should create coarse-grained textures, while high density crowds should create finer
textures [159]. Interest point detectors are widely used to extract texture information
from images (e.g. Harris, KLT, SIFT).
Object-based analysis:
Object-based analysis aims to detect objects to count the number of targets by ex-
tracting silhouettes, edges or blobs. There are two main categories of object detection
approaches: 1) appearance based, 2) motion based.
• Appearance based: Appearance-based object detection approaches apply image
processing steps to each frame individually to detect objects of interest. The
processing steps commonly include segmentation, point detection (identifying
interest points), and templating (generating contours). Segmentation methods par-
tition an image frame by grouping similar regions in the image. The texture-
extraction method is based on applying image descriptor algorithms. The contour-
based method uses a template as a filter to detect objects.
• Motion based: Motion-based object detection approaches take advantage of
(pixel-wise) differences between frames caused by moving objects, such as back-
ground subtraction, temporal differencing, and optical flow. Background sub-
traction methods rely on modeling a background to differentiate the foreground
objects in each incoming frame. In temporal differencing, the motion is esti-
mated by using consecutive frames in a time interval to find deviations in the
following frames. When compared to background subtraction, it may be thought
of as dynamic background modeling. Optical flow is a pixel-wise short-term mo-
tion computation between consecutive frames, where each vector represents the
direction and amount of motion [107].
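A minimal sketch of temporal differencing (the threshold value and the toy frames are illustrative):

```python
import numpy as np

def motion_mask(prev_frame, frame, threshold=25):
    """Pixel-wise motion mask from the absolute difference of two frames."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold  # True where motion is detected

prev = np.zeros((4, 4), dtype=np.uint8)
curr = prev.copy()
curr[1, 1] = 200  # a single "moving" pixel
print(motion_mask(prev, curr).sum())  # → 1
```
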
2.2.2 People Tracking
The tracking task can be defined as a labeling process on detected objects or motion
flows for each frame in a video. A survey of available methods for object tracking
is presented in [154]. We summarize the people tracking methods under two main
categories: 1) detection-based, and 2) motion flow-based tracking.
Tracking by detection:
This approach is based on the association of detected objects between image frames.
Tracking by detection works better in low density crowds because of the detection
problems that arise in high density crowds.
Tracking by motion flow:
This approach is based on extracting motion flow for the objects or points that move
coherently. Texture-level analysis is used to compute regression estimates. This
approach is more suitable for structured, high density, crowds [5].
2.2.3 Behavior Learning
Crowd analysis for behavior learning is used in a wide variety of applications, includ-
ing event recognition and anomaly detection. In video analysis, event recognition is
defined as the process of monitoring and analyzing the events occurring in a video
surveillance system in order to detect signs of security problems [71].
Anomaly detection in a crowded scene can be performed for detecting abnormal
events from unusual motions (e.g., fights), sizes (e.g., a vehicle in the pedestrian road),
directions (e.g., walking through a restricted area), and velocities (e.g., an escape from
a threat) or any combination of these anomalies in the targeted scene. Because of
the limitations of acquiring components of the surveillance video, anomaly detection
is a challenging problem in crowd analysis. First of all, the type of anomaly being
looked for may depend on various factors, such as the density of the crowd (e.g.,
low or high density), the surveillance area (e.g., streets, pedestrian paths, indoor places,
concert/event areas, train stations, etc.), and the surveillance equipment (e.g., resolution
quality, number of cameras, camera platform, etc.). Although it is a difficult
task to define abnormal activity in an explicit manner, these components help to
define the target anomaly and appropriate approaches. It is important to mention
that this thesis is not trying to detect abnormal objects, which have abnormal object
sizes, speeds or directions, such as a vehicle on the pedestrian road or a person in
the restricted area [95, 15]. Instead, we aim to define behavioral structures of people
and recognize the events in the video to detect abnormal events, such as a sudden
dispersion, running or merging of considerable number of people [165, 160].
Behavior-based learning approaches can be divided into two main categories:
1) object-based and 2) holistic approaches [101].
Object-based approaches
Object-based approaches focus on segmentation and detection methods to learn be-
haviors of individuals in a crowded scene. Two of the main applications include:
1) detecting a person in a crowd whose behavior is not similar to learned behav-
ior [21], and 2) detecting abnormal interaction within groups of people [98]. This
approach is not suitable for high density crowds since it is difficult to track individu-
als where the whole process is affected by occlusions.
Holistic approaches
Holistic approaches treat the crowd as a single entity to learn the global behavior of
a crowded scene. Instead of learning behaviors from individual targets, this approach
focuses on modeling techniques to extract dominant behavior such as motion flow, or
bottlenecks [27, 115]. This approach is more suited for dense crowds.
2.3 Anomaly Detection
Anomaly-based detection depends on defining normal behavior by training a
model on the considered dataset. Researchers have performed this definition process with
various approaches and techniques, according to the specific application
domain and the structure of the dataset. The structure of the dataset is the key element
of the modeling, since it determines the applicable detection methods.
In many domains, a number of challenges arise when working with a dataset
during the anomaly detection process. These issues can be summarized as follows.
First, we want to extract a dataset which covers every possible normal behavior; even
if this is possible for the current situation, the normal behavior may change over time.
Second, we need to ensure the correctness of data samples. In some domains, such
as video anomaly detection, datasets need to be extracted from raw data via
preprocessing; if the extracted dataset has incorrect information or missing feature
points, then the anomaly detection process will not be successful. Third, we need to
collect a sufficient amount of data, since there is a minimal number of data points needed
to learn a behavior. Fourth, we need to properly label the collected dataset. It is
more appropriate to test an anomaly detection method on a real dataset to prove its
success on a real-world problem, but this adds a time-consuming step in order to provide
labels on a collected dataset. Fifth, we need to select the appropriate features. In
some cases the dataset may include unnecessary information, which can
impact the detection process negatively. In these cases, a dimensionality reduction
or feature selection technique needs to be applied to improve the efficiency of
the model.
2.3.1 Anomaly Detection Algorithms
The techniques used to detect anomalies have been developed by employing various ap-
proaches. In the anomaly detection literature, these approaches have been grouped under
different categories by researchers.
In 2004, Hodge et al. [65] reviewed outlier detection methods used in three fields
of computing: statistics, neural networks and machine learning. This prior study
also mentioned that there are hybrid systems which combine multiple methods
to achieve better results.
A study presented by Ng [104] grouped anomaly detection methods into two
classes: 1) discriminative and 2) generative. Discriminative methods include tech-
niques which fit a function mapping the input data to the output labels in order to
extract a decision rule, such as Nearest Neighbor (NN), Support Vector Machines (SVM)
and neural networks [28, 69]. Alternatively, generative methods refer to techniques
that build a model by learning the relationships between the input data and the out-
put labels to estimate the underlying system behavior class, such as Parzen windows,
mixtures of Gaussians, and the Hidden Markov Model (HMM) [152, 110, 153].
Chandola et al. [31, 30] presented a detailed review of anomaly and outlier de-
tection methods with various data sources. Their work provides a broad overview of
extensive research on anomaly detection techniques in multiple application domains.
They categorized the techniques under six groups: classification based, clustering
based, NN based, statistical, spectral and information theoretic.
Although there are different approaches to classifying anomaly detection algo-
rithms, the anomaly detection task fits well within the machine learning field, since the
nature of the anomaly detection process depends on learning behaviors or item-sets
from given data.
2.3.2 Machine Learning
Machine learning involves the design and development of algorithms that learn from train-
ing data or past experiences. Machine learning techniques focus on building models
for various purposes, such as prediction, recognition, diagnosis, detection, planning,
controlling etc. [7]. The field of machine learning overlaps with many other fields,
such as data mining, statistics and information theory, typically sharing some algo-
rithms or approaches to solve a learning problem. Since there are no strict boundaries
between these fields, an anomaly detection algorithm may fall under one or more of these
fields, according to the application and the target output. It follows that most techniques used
for anomaly detection, which include a learning phase, can also be considered machine
learning techniques.
In a generalized approach, there are three steps to be performed for a machine
learning-based anomaly detection: training, testing and analysis. The training step
includes profiling the behavior of the data. The test step refers to comparisons be-
tween the current activities and the learned behavior model. Finally, the analysis step
evaluates the test results to report significant deviations. Based on the availability of
the data labels, there are three different learning approaches for anomaly detection:
Supervised learning: Training with the normal and abnormal data with labels.
Supervised learning aims to determine the normal and abnormal classes from the
labeled data. If a test instance lies in a region of normality, it is classified as normal,
otherwise it is classified as abnormal.
Unsupervised learning: Training with the normal and abnormal data without
labels. Unsupervised learning aims to determine the anomalies without prior knowl-
edge of the data. This learning technique assumes that anomalies are separated from
the normal data and will thus appear as outliers.
Semi-supervised learning: Training with the normal data. This technique
needs pre-classified data but only learns from the data labeled as normal. Semi-
supervised learning aims to define a boundary of normality.
2.3.3 Hidden Markov Models
Figure 2.1: Hidden Markov Model
We chose HMMs as our machine learning approach for anomaly detection.
A brief description of an HMM is given in the following; the preprocessing method
and the structure of our HMM-based modeling are explained in Chapter 5 and
Chapter 6.
HMMs are stochastic models for sequential data with a broad spectrum of
uses, including speech recognition, motion/action analysis in videos,
human recognition and protein sequence alignment. A generic HMM structure is
presented in Figure 2.1 which represents the joint probability distribution over states
and observations as given in (2.1).
P(Y, X) = P(X_1 | π) ∏_{t=2}^{T} P(X_t | X_{t−1}) P(Y_t | X_t)   (2.1)
In an HMM, the states are not visible, but the observations and the probability
distributions over the observations are known. Let H = {H_1, H_2, H_3, ..., H_N} be the
set of hidden states, where N is the number of states, and let V = {v_1, v_2, v_3, ..., v_M}
be the set of observation symbols, where M is the number of observation symbols. There
are then three probability measures in an HMM: 1) A (the state transition probability
distribution), 2) B (the observation symbol probability distribution), and 3) π (the initial
state distribution). An HMM is represented in compact form as λ = {A, B, π}.
The state transition probability distribution A = {a_ij} represents the probability
of being in hidden state j at time t+1, given state i at time t:

a_ij = P[x_{t+1} = H_j | x_t = H_i]   (2.2)

where 1 ≤ i, j ≤ N, a_ij ≥ 0, and ∑_{j=1}^{N} a_ij = 1.
The observation symbol probability distribution B = {b_j(k)} represents the prob-
ability of observation symbol v_k in state j:

b_j(k) = P[v_k at time t | x_t = H_j]   (2.3)

where 1 ≤ j ≤ N, 1 ≤ k ≤ M, b_j(k) ≥ 0, and ∑_{k=1}^{M} b_j(k) = 1.
The initial state distribution π = {π_i} represents the initial probability of state i:

π_i = P[x_1 = H_i]   (2.4)

where 1 ≤ i ≤ N, π_i ≥ 0, and ∑_{i=1}^{N} π_i = 1.
The observation sequence Y is represented as:

Y = {Y_1, Y_2, ..., Y_T}   (2.5)

where T is the length of the observation sequence and each observation Y_t is one of
the symbols from set V.
There are three kinds of problems one can solve using HMMs [114].
1) Evaluation: computing the probability of the observation sequence, P[Y | λ], when
the observation sequence Y and a model λ are given. Calculating the likelihood of
a model using the forward/backward algorithm has time complexity O(N²T), where
N is the number of states and T is the length of the sequences.
2) Decoding: estimating the optimal state sequence, when the observation se-
quence Y and a model λ are given. Decoding the state sequence using the Viterbi
algorithm has time complexity O(N²T), where N is the number of states and T is the
length of the sequences.
3) Training: estimating the model parameters λ = {A, B, π}, when the obser-
vation sequence Y and the dimensions N and M are given. Estimating the model
parameters using the Baum-Welch (Expectation Maximization) algorithm has time
complexity O(N²TI), where N is the number of states, T is the length of the sequences
and I is the number of iterations.
We apply the solution to the third problem to find the best-fit model. The
estimated parameters are then used in the first problem to find the likelihood values of
the observation sequences.
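The evaluation problem can be illustrated with a direct implementation of the forward algorithm; the toy parameters λ = {A, B, π} below are invented for this sketch:

```python
import numpy as np

A = np.array([[0.7, 0.3],    # state transition probabilities a_ij
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],    # observation symbol probabilities b_j(k)
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])    # initial state distribution

def likelihood(Y):
    """P(Y | λ) via the forward algorithm, O(N^2 T)."""
    alpha = pi * B[:, Y[0]]            # initialization
    for y in Y[1:]:
        alpha = (alpha @ A) * B[:, y]  # induction
    return alpha.sum()                 # termination

print(round(likelihood([0, 1, 0]), 5))  # → 0.10893
```

In practice the parameters would first be estimated by Baum-Welch on the normal training sequences, and the likelihood of a test sequence under the resulting model would be thresholded to flag anomalies.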
2.3.4 Evaluation Metrics
There are two possible outputs of an anomaly detection task: 1) a label, or 2) a score.
Label outputs can be represented by string words, binary values, or a letter from an
alphabet of length two, such as normal or abnormal, 0 or 1, or a or b. Output scores,
on the other hand, are calculated values (also called decision values), and we can apply a
threshold to the decision value to decide whether a sample is normal or abnormal. For
either output type, we need to evaluate the results of the anomaly detection task. Next,
we present the evaluation methods that we have used to evaluate our implementation
results.
Confusion Matrix
A confusion matrix provides information about the number of normal and abnormal
instances in actuality, and the number of normal and abnormal instances in the
analyzed results. Table 2.3 shows the confusion matrix for a two-class classifier. “True
(T)” indicates that the prediction is correct, and “False (F)” is incorrect. “Positive
(P)” is used to indicate an “abnormal” class and “Negative (N)” a “normal” class.
There are four kinds of data in the confusion matrix to show the correct and incorrect
predictions of the two classes: 1) TP, 2) FP, 3) TN and 4) FN. For example, TP (True
Positive) indicates the number of “abnormal” instances that are predicted correctly.
Table 2.3: Confusion matrix
                      PREDICTED
                  Positive   Negative
ACTUAL  Positive     TP         FN
        Negative     FP         TN
Rate Measures
The basic evaluation measures are derived from the information that the confusion
matrix provides. The definitions and the formulations of these evaluation measures
are explained below.
The precision is the percentage of the correctly predicted anomalies (TP), computed
over all predicted anomalies.

Precision = TP / (TP + FP)    (2.6)
The true positive rate (TPR) is the percentage of the correctly predicted anomalies
(TP) over all the actual anomalies. The TPR is also known as the recall rate
or sensitivity.

TPR = TP / (TP + FN)    (2.7)
The true negative rate (TNR) is the percentage of the correctly predicted normal
cases (TN) over all the actual normal cases. The TNR is also known as specificity.

TNR = TN / (TN + FP)    (2.8)
The false positive rate (FPR) is the percentage of the normal cases that are
incorrectly predicted as anomalies (FP) over all the actual normal cases.

FPR = 1 − TNR = FP / (TN + FP)    (2.9)
The false negative rate (FNR) is the percentage of the anomalies that are
incorrectly predicted as normal cases (FN) over all the actual anomalies.

FNR = 1 − TPR = FN / (FN + TP)    (2.10)
The accuracy is the percentage of the correctly predicted cases (TP + TN) over
all the test samples.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (2.11)
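The measures of Eqs. (2.6)–(2.11) can be computed directly from the four confusion-matrix counts; the following is a small illustrative sketch, not code from this thesis:

```python
def rate_measures(tp, fp, tn, fn):
    """Evaluation measures of Eqs. (2.6)-(2.11), computed from the
    four confusion-matrix counts of a two-class detector."""
    return {
        "precision": tp / (tp + fp),
        "tpr":       tp / (tp + fn),   # recall / sensitivity
        "tnr":       tn / (tn + fp),   # specificity
        "fpr":       fp / (tn + fp),   # = 1 - TNR
        "fnr":       fn / (fn + tp),   # = 1 - TPR
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```

For example, a detector with TP = 40, FP = 5, TN = 45 and FN = 10 has precision 40/45, TPR 0.8, TNR 0.9 and accuracy 0.85.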
Receiver Operating Characteristic Curve
Figure 2.2: ROC curves
A Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates
the performance of a classifier by presenting the trade-off between the detection rate
(TPR) on the y-axis and the FPR on the x-axis [100]. ROC curves can be plotted for
different threshold values.
A generic example of an ROC curve is shown in Figure 2.2. Point (0, 1) is the
ideal operating point on an ROC curve, where the TPR is 100% and the FPR is 0%. The
line x = y shows the performance of a random classifier that guesses the class of each
data point [51].
There are two fundamental features of an ROC curve representation. First, it
provides a method for finding the optimum threshold value for a classifier. For example,
an operating point is likely optimal if it is closest to the ideal point; if two points on
the ROC curve are equidistant from the ideal point, then the one with the lower FPR is
preferable. Second, it provides a means of comparing classifiers. For example,
a classifier is likely optimal if its curve lies closest to the y-axis, since the optimal detector
should have the lowest FPR with the highest TPR. Another comparison technique is
computing the Area Under the Curve (AUC) of the ROC curve for each classifier. The
AUC provides a single-number discrimination measure across all possible threshold
values to assess the performance (accuracy) of a model.
Typically, it is assumed that the optimal detector has the largest AUC [19].
Although this assumption is correct in general, it is recommended to consider the
sensitivity and specificity rather than relying on the AUC values alone. Visual inspection
of the ROC curves is also needed when the misclassification costs (of FPs and FNs) are
unequal with respect to the application goal [93].
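As an illustration of how ROC points and the AUC are obtained from anomaly scores, the following sketch sweeps every distinct score as a threshold and integrates the curve with the trapezoidal rule (an illustrative sketch, not the evaluation code used in this thesis):

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep a threshold over anomaly scores and return (FPR, TPR) pairs.
    labels: 1 = abnormal (positive), 0 = normal (negative)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pts = []
    # try every distinct score as a threshold, plus the two extremes
    for thr in np.concatenate(([np.inf], np.sort(scores)[::-1], [-np.inf])):
        pred = scores >= thr                 # predicted "abnormal"
        tpr = np.mean(pred[labels == 1])     # detected anomalies
        fpr = np.mean(pred[labels == 0])     # false alarms
        pts.append((float(fpr), float(tpr)))
    return pts

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    xs, ys = zip(*sorted(points))
    area = 0.0
    for i in range(1, len(xs)):
        area += (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2.0
    return area
```

A perfectly separating scorer traces the curve through the ideal point (0, 1) and yields an AUC of 1.0, while a scorer that ranks every anomaly below every normal case yields an AUC of 0.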
Cross Validation
Cross validation is a re-sampling technique for evaluating the performance of a
training approach when only a limited sample size is available. In general, cross-validation
techniques divide a dataset into approximately equal-sized subsets and perform an
evaluation on each subset iteratively [81]. One type of cross validation is k-fold cross
validation, a common re-sampling technique for evaluating how accurately a
detection model will perform in practice. The iterative procedure for k-fold cross
validation, presented in Figure 2.3, is formally defined as follows:
• Step 1. Divide the samples randomly into k-folds (subsets).
• Step 2. Select one fold as a validation set and use the remaining k − 1 folds for
training.
• Step 3. Repeat step 2, until each subset has been tested exactly once.
Figure 2.3: K-fold cross validation
As a result, each fold plays the role of the test set exactly once, and is part of the
training set k − 1 times [99]. The overall performance of the model may be estimated
by either averaging or combining the results obtained from the test folds in each run.
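The three steps above can be sketched as follows (an illustrative sketch; the train_fn and eval_fn callables are hypothetical placeholders for a detection model's training and evaluation routines):

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Step 1: randomly partition the sample indices into k roughly
    equal-sized folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(samples, k, train_fn, eval_fn):
    """Steps 2-3: each fold serves as the validation set exactly once,
    while the remaining k-1 folds are used for training."""
    folds = k_fold_indices(len(samples), k)
    scores = []
    for test_idx in folds:
        train_idx = [j for f in folds if f is not test_idx for j in f]
        model = train_fn([samples[j] for j in train_idx])
        scores.append(eval_fn(model, [samples[j] for j in test_idx]))
    return scores  # per-fold results, to be averaged or combined
```

Setting k to the number of samples turns this procedure into leave-one-out cross validation.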
A complete cross validation may be obtained by repeating k-fold cross validation
multiple times using different folds, or by setting k to the number of data
samples (known as leave-one-out cross validation). Although the variance of the
resulting estimate is reduced as k increases, this approach is not preferred in practice
since it is computationally expensive [81].
Chapter 3
Related Work
In this chapter, we provide related work on anomaly detection for the areas of system
call analysis, crowd analytics and context-aware systems.
3.1 Related Work in System Call Analysis
As computer security has become an increasingly important issue, efficient and
accurate intrusion detection methods have become mandatory in order to protect computer
systems from intrusions. As a result, a number of Intrusion Detection Systems (IDSs)
have been developed to defend network and computer systems against possible damage
and information loss.
Anomaly detection-based intrusion detection has been an active research area
since it was proposed by Denning [41]. Anomaly detection can be used in Host-based
Intrusion Detection Systems (HIDS) using a range of different techniques and data
sources to improve computer security. There are many forms of data that can be used
for HIDS such as CPU usage, time to login, names of files accessed, user commands,
keystroke records, and system call traces. In recent work, system call traces have
commonly been used to analyze program behavior. System call behavior has been
studied extensively in prior work on HIDSs and there have been many applications
of this approach discussed in the computer security literature [41, 96, 97, 108, 120,
31, 153]. The idea behind using system call traces is based on being able to track
each request that a program makes to the operating system during its execution. A
system call sequence is a discrete sequence such that each system call belongs to a
finite alphabet of system calls executed by a particular operating system [32].
3.1.1 Data Representation in System Call Analysis
There have been a variety of approaches proposed for anomaly-based intrusion
detection using system call traces. They differ mainly in the type of data used as
input to the detector and in the specific anomaly data representation.
Some of the previous work has focused on using only system call arguments [85,
103], while others have combined the system call sequences with the arguments [94,
132]. But the majority of the previous work in this area has focused on using only
system call sequences to train a behavior model. Working only with system call
sequences, the various implementations differ in how data is represented. In gen-
eral, these representations can be grouped into two categories based on their feature
extraction methods: 1) frequency-based methods and 2) sequence-based methods.
Frequency-based feature extraction methods
The frequency-based feature extraction methods rely on the number of occurrences
of each system call. For example, using a “bag of words” representation (which is
commonly used in text classification), we can map a system call anomaly detection
to this representation. In [75], each trace is treated as a document and each system
call in a document is treated as a word. Since the “bag of words” representation
can be used with many of the machine learning algorithms, it is widely used in
system call anomaly detection [88, 151, 162]. Instead of only counting the number
of occurrences, some approaches improve detection by applying a ranking approach
based on the relative order of frequency values [137].
Another rich representation of a frequency-based vector is the term frequency-
inverse document frequency (tf-idf), where term refers to a system call and document
refers to a system call sequence [34]. In [141], several forms of tf-idf are presented
and evaluated with various classification algorithms on system call sequences and an
HTTP log dataset.
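The two frequency-based representations discussed above can be illustrated with a short sketch (the example traces and system call names are hypothetical):

```python
import math
from collections import Counter

def bag_of_words(trace, vocab):
    """Frequency vector: the count of each system call in one trace,
    treating the trace as a document and each call as a word."""
    counts = Counter(trace)
    return [counts[call] for call in vocab]

def tf_idf(traces, vocab):
    """tf-idf vectors, where a term is a system call and a document
    is a system call trace."""
    n = len(traces)
    df = {c: sum(1 for t in traces if c in t) for c in vocab}  # document frequency
    vectors = []
    for t in traces:
        counts = Counter(t)
        vectors.append([
            (counts[c] / len(t)) * math.log(n / df[c]) if df[c] else 0.0
            for c in vocab
        ])
    return vectors
```

Calls that occur in every trace receive a tf-idf weight of zero, so the representation emphasizes calls that discriminate between traces rather than those that are merely frequent.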
Sequence-based methods
Sequence-based methods use the order of system calls, or the position at which a
system call occurs, within short sequences. This information is extracted from a system
call trace. Forrest et al. [54] introduced an approach for intrusion detection based on
monitoring the system calls of a program during execution. The idea behind their work
is based on the immune system's ability to distinguish “self” from “nonself”. They
extracted normal program behaviors from system call traces to define “self” for Unix
processes. A normal behavior database is created by sliding a window of size k + 1 over
the system call sequences. Then, for each system call, they recorded the system calls
that follow it at each position from 1 to k. During the testing phase, test sequences
are scanned and the percentage of mismatches is computed by considering the maximum
number of possible pairwise mismatches for a sequence with a lookahead of k.
In [66], they improved their prior work by using a “stide” (sequence time delay
embedding) system. They first generate a normal database from the set of unique
short sequences of length k, and use the Hamming distance to compute deviations of a
test sequence from normal. The success of both approaches depends on the completeness
of the normal dataset and the length of the sliding window. If the normal dataset does not
include all possible short sequences that a program can produce during its execution,
the missing sequences will result in false positives. If the sliding window is too short,
anomalies in the test sequences can be missed, since an abnormal short sequence
may include many normal shorter sequences; if it is too long, it may consume
considerable system resources. One outcome of this prior work was the generation of
a benchmark dataset for further system call trace analysis [136].
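The sliding-window scheme described above can be illustrated with a short sketch of a stide-style normal database and mismatch rate (an illustrative simplification of [54, 66], not their exact algorithm):

```python
def sliding_windows(trace, k):
    """All contiguous length-k subsequences of a system call trace."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]

def build_normal_db(normal_traces, k):
    """stide-style normal database: the set of unique length-k
    sequences observed in normal traces."""
    db = set()
    for trace in normal_traces:
        db.update(sliding_windows(trace, k))
    return db

def mismatch_rate(test_trace, db, k):
    """Fraction of the test trace's length-k windows that are absent
    from the normal database; a high rate suggests an anomaly."""
    windows = sliding_windows(test_trace, k)
    if not windows:
        return 0.0
    return sum(w not in db for w in windows) / len(windows)
```

Note how the window length k governs the trade-off discussed above: a small k makes abnormal windows more likely to coincide with normal ones, while a large k inflates the database.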
3.1.2 HMM in System Call Analysis
A large amount of prior work has been done in the area of anomaly detection-based
HIDS. In this section, we present prior work related to anomaly detection-based
HIDSs using Hidden Markov Models for system call analysis.
Hoang et al. [64] developed a multi-layer model based on HMMs to decrease the
False Positive Rate (FPR) of system call anomaly detection. They created a normal
dataset using a sliding window run over system call traces. In their model, a first
layer checks the given test subsequence; if there is a mismatch, or if it is a rare sequence
in the normal dataset, the subsequence is sent to the HMM layer. The HMM
layer computes the likelihood value and compares it with a predefined threshold to
determine whether the sequence is normal or abnormal. Their approach decreases the
FPR, since the likelihood value of a normal subsequence is higher than that of abnormal
ones, even if it does not appear in the normal dataset. Their work also provides
detection within a short period of time, without needing to wait for the end of the
program execution.
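The two-layer decision flow can be sketched as follows (an illustrative simplification; the hmm_loglik callable and the exact decision rule are hypothetical stand-ins for the components in [64]):

```python
def two_layer_detect(subseq, normal_db, hmm_loglik, threshold):
    """Two-layer check in the spirit of [64]: a fast database lookup,
    falling back to an HMM likelihood test for unresolved subsequences."""
    if tuple(subseq) in normal_db:
        return "normal"          # resolved by the first (lookup) layer
    # mismatch or rare subsequence: defer to the HMM layer
    return "normal" if hmm_loglik(subseq) >= threshold else "abnormal"
```

The cheap first layer handles the common case, so the expensive HMM evaluation is only paid for subsequences that the normal database cannot vouch for.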
Yeung and Ding [153] experimented with Fully-Connected HMMs and Left-to-
Right HMMs as dynamic modeling approaches to evaluating sequences of system calls
and minimum cross entropy as an information-theoretic static modeling approach
using the occurrence frequency distributions of system calls. Their results showed
that the dynamic modeling approaches are more suitable for system call datasets
and that the Fully-Connected topology was more successful compared to the
Left-to-Right topology.
Du et al. [47] implemented HMM-based anomaly detection by defining two hidden
states, one normal and one abnormal. They compute the relative probability of system
call sequences to determine whether they are normal or abnormal according to a given
HMM model. Experiments on the UNM sendmail and lpr datasets
showed that the relative probability value differences between normal and abnormal
sequences are distinct. The proposed model is simple and effective when applied to
the intrusion detection problem.
Khreich et al. [78] proposed multiple-HMMs (µ-HMMs) for system call anomaly
detection to overcome the issue of selecting the number of hidden states. The µ-
HMMs approach is based on training multiple models with a varying number of
hidden states and combining the results according to the Maximum Realizable ROC
(MRROC) method. The proposed model provides better performance on system call
traces when compared with using a single HMM.
Although HMMs applied to system call sequences show better results as compared
to static approaches, there are still concerns about the required training time. In this
regard, Hu et al. [68] proposed a simple data preprocessing approach to speed up
HMM training. They improved their previous work [63], in which they proposed an
incremental HMM based on dividing the observation sequences into a number (R) of
subsequences and merging the parameter estimates from the R trained sub-HMMs,
by removing similar subsequences of system calls from the normal dataset. The
result is that the training time can be reduced by up to 50% with a reasonable FPR.
Since profiling complex sequential data is still an open problem in anomaly detection,
there is still a need for further enhancements. While there are a number of prior
approaches that used HMMs for anomaly-based intrusion detection, these models
became more complex as each improvement was added to increase the detection rate
while reducing the FP rate.
There are several reasons for this growth in complexity. First, a detailed analysis
of program traces is missing in most of the prior work, and HMMs are trained over
a wide range of normal execution behaviors. This kind of training has a huge impact
on learning a normal behavior. Second, some of the process traces collected during
different executions of a program are identical. This leads to two different problems:
1) In the training step, there would be several identical training traces; using
unique process traces is necessary in order to decrease learning time when
using HMMs. 2) In the test step, using different program traces that include identical
process traces does not provide an accurate learning and testing approach. If the
training and test sets are partitioned only by considering program execution, it is
possible to see identical process traces in the training and testing datasets at the same
time.
Our process trace clustering approach for using HMMs in system call anomaly
detection provides better results, more accurate model settings and a less complex
structure to detect anomalies. Details of our training approach are presented in
Chapter 5. In this thesis, we show how to use system call traces with an HMM method
and explore which preprocessing technique is most suitable for anomaly detection.
3.2 Related Work in Crowd Analysis
In this thesis we utilize Crowd Anomaly Detection as a motivating application. In
this section, we present prior work on crowd analysis for behavioral learning and
anomaly detection.
Anomaly detection is an active area of research in surveillance applications of
crowd scenes. Typically, the anomaly detection task involves finding data samples
that do not conform to a definition of normal. In crowd analysis, based on the target
video scene and the analysis goal, there are a number of approaches we can use for
identifying abnormal behavior and many different data (feature) types that one can
extract from a crowd video.
3.2.1 Data Representation in Crowd Analysis
In crowd analysis, feature extraction is highly dependent on the goal of the analysis,
since crowd anomaly detection approaches can be categorized according to the type
of scene representation [95]. These features may be grouped roughly into two categories
with respect to the problem of interest, the video components, and the techniques that
are used. The first category is object-level features, which are extracted by applying
methods for the detection and/or tracking of individuals. The second category is
frame-level features, which are extracted from pixels or patches present in each frame,
taking into account the dynamics across consecutive frames.
Object-level Features
In video surveillance applications, one of the main problems of interest is detecting
abnormal behaviors of individuals, or the relationships and connections between these
individuals [58].
Object-level features provide the information of individuals in the scene, including,
but not limited to, color, size, shape, speed, direction, and trajectory. Object-level
features are typically used for low to moderate-density crowds for local or global
abnormal event detection, such as people interaction (fighting, following, meeting),
restricted area access (vehicles in a pedestrian path), and similar scenarios. There
are many applications of anomaly detection using object-level features during the
surveillance of critical places where we need to understand behavior of individuals or
groups. An extensive survey of object-level approaches in transit surveillance can be
found in [23].
The trajectory feature is the one most commonly used to learn crowd behavior in
object-level analysis [36]. This approach may be used in either sparse or dense crowds
by applying appropriate techniques. Researchers have proposed various approaches to
using trajectory information for anomaly detection, such as similarity analysis between
trajectories [70, 145] and motion pattern modeling using machine learning approaches,
including HMMs [62, 74, 138] and Bayesian models [142, 143, 144].
Hervieu et al. [62] proposed a statistical trajectory-based approach addressing
two issues related to dynamic video content understanding: recognition of events
and detection of unexpected events. They compute local differential features combining
curvature and motion magnitude on the motion trajectories, and the temporal
causality of the features is then captured by HMMs. Event recognition is performed
by classifying the trajectories according to learned trajectory classes. Unexpected
events are identified by comparing the test trajectories to representative trajectories
of known classes of events.
While detection-based methods provide more accurate results, they may fail in
high-density crowds because of the high degree of occlusion and their computational
complexity [148].
Frame-level Features
Behavior learning for anomaly detection in extremely crowded scenes is another area of
great interest in video surveillance applications. In heavily crowded scenes, performing
crowd behavior analysis by tracking individuals is impractical [127]. Instead, we
can try to understand the collective behavior of a crowd to classify the events.
Frame-level features generally provide global information about a frame, or about
patches at the pixel level, including, but not limited to, color, texture, shape, and motion.
Frame-level features are mainly used in moderate to high-density crowds for abnormal
event detection, such as collective behavior (bottlenecks, lanes), sudden behavior changes
(running, evacuation), and similar analyses.
Anomaly detection via tracking individuals in extremely crowded scenes is chal-
lenging due to the high density of features and frequent occlusions [82]. Therefore,
spatio-temporal gradients and optical flow are two popular feature representations
to model motion patterns [95]. Spatio-temporal-based crowd analysis aims to detect
anomalies based on the appearance and motion of objects in a temporal time sequence
without tracking [17, 43, 82]. Optical flow technique depends on modeling the motion
information of the flow points which are generally extracted as descriptors [139].
In [33], the authors estimated optical flow to cluster crowds into groups using an
adjacency-matrix based clustering (AMC) method. They then characterized group
behaviors with a model of the force field, which provides information about the
orientation and the force of each crowd. Abnormal crowd behavior events are flagged
when the orientation of a crowd abruptly changes, or when the interactions in the crowd
differ significantly from the predicted behavior.
In [148], researchers provide an energy-based model to estimate the number of
people. To represent the spatial distribution of a crowd, they combined people count-
ing results with crowd entropy. Their work detects two abnormal activities of people:
1) gathering and 2) running.
In [36], researchers proposed an anomaly detection framework to model the spatio-
temporal distribution of crowd motions and detect anomalous events by learning
regions of interest from historical trajectory sets and the statistical template of the
pedestrian distribution. They build a Hierarchical Pedestrian Distribution, a series
of histograms at both global and local levels, as the template for the observed
movement distribution. This distribution statistically describes time-correlated crowd
events using overall crowd information and local details in the regions of interest.
Mehran et al. [101] proposed a social force model to capture the dynamics of
crowd behavior for localized anomaly detection in a crowd scene. They computed
the space-time average of optical flow across a grid of particles and obtained Force
Flow values for every pixel in every frame. To model normal behavior, they use
spatio-temporal volumes of Force Flow. Anomalies are detected based on a bag of
words model of the social force fields.
3.2.2 HMM in Crowd Analysis
A large amount of prior work has been done in the area of anomaly detection in
crowd scenes. In this section, we present prior work related to anomaly detection and
behavior learning-based crowd analysis using Hidden Markov Models. HMMs have
been used in different ways to detect abnormal behavior in a video scene depending
on the data that needs to be modeled to learn normal behavior, such as human
interaction modeling, optical flow analysis, and related behaviors.
Oliver et al. [105] used two different architectures, namely the HMM and the
Coupled-HMM (CHMM), to compare their performance in learning normal behavior
and recognizing human behaviors. Their system focused on recognizing the interactions
between people, such as following or meeting another person.
Training is performed using synthetic data, created to develop flexible and
interpretable behavior models. A CHMM can provide improved results in terms of
training efficiency and classification accuracy.
Vaswani et al. [138] proposed a “shape activity model” to model “activity” per-
formed by a group of moving and interacting objects (referred to as “landmarks”).
A continuous-state hidden Markov model is defined for landmark shape dynamics in
an activity, where the object locations at a given time form the observation vector
and the corresponding shape and motion parameters constitute the hidden-state vector.
An abnormal activity is detected as a change in the shape activity model.
The model is tested for abnormal activity-detection in an airport scenario involving
multiple interacting objects.
Andrade et al. [9, 10, 8] used HMMs with a mixture of Gaussians to characterize
the normal behavior of a crowd by learning normal motion patterns from the optical flow
of image blocks. They applied Principal Component Analysis (PCA) to build the
feature prototypes and used spectral clustering to find the optimum number of models
to group video segments with similar motion patterns, training an HMM for each
model. These HMMs are then used for event recognition and anomaly detection.
Kratz and Nishino [82, 84] extracted spatio-temporal gradients to fit a Gaussian
model and used a collection of HMMs, one for each spatial patch in a frame, to model
the structured crowd motion. In [83], the authors introduced the use of pedestrian
efficiency to detect unusual events and to track individuals in crowded scenes. They
used the trained HMMs to estimate the intended motion at each space-time location
in a different video of the same scene. For each scene, they train the HMMs on a
usual sequence and estimate the efficiency of the remaining sequences. A frame is
considered unusual if its average efficiency is below a specific threshold that is selected
empirically. In contrast to their previous work, in [83] the authors train the HMMs
on directional statistics distributions of optical flow, resulting in a more compact and
accurate representation. They showed that pedestrian efficiency can be used to
detect abnormal activities without detecting and tracking individuals.
3.3 Related Work in Context-aware Systems
In this thesis, we propose a framework for context-aware anomaly detection on
symbolic sequential datasets. Using a context-aware approach is not a new idea; it has
been used in different research areas such as web services, information services, smart
phones, and smart environments. The core of context-aware systems is the ability to
provide specific services according to the context of a user, item, task or event.
In order to understand the context-aware approaches, it is first necessary to dis-
cuss the definition of context and context awareness. Context has been studied across
different disciplines and each discipline tends to define context from its own view for
specific applications [1]. Therefore, there are many different definitions of context in
literature depending on the application area and discipline. An exhaustive review
of 150 definitions coming from different domains is presented in [16]. The authors
collected the definitions of context on the web by researching cognitive science and
related disciplines to understand the definition of context. Although their analysis
shows that the content of all the definitions can be analyzed through six essential
components (constraint, influence, behavior, nature, structure and system), they
conclude: “a definition of context depends on the field of the knowledge that it belongs
to”.
Dourish [46] also introduced a broad taxonomy, categorizing contexts into
two categories: 1) representational, where the context can be defined with observable and
stable attributes, and 2) interactional, where the context is not necessarily observable. The
most important distinction between these two categories is the definition of the relation
between context and activity: in the representational view, “context and activity are
separable”, while in the interactional view, “context arises from the activity” [46].
Considering context as an interactional problem is an alternative view of context
definition, since it looks for relevancy between activities instead of treating something
as a fixed context.
A generic definition of context is given by Hong et al. [67] as “any information that
can be used to characterize the situation of an entity”. The authors provided a
comprehensive overview and presented a classification framework for context-aware
systems. Besides this generic definition, they categorized context as external and
internal. External context refers to context data collected through physical sensors,
such as location, distance, temperature, time, etc. Internal context, on the other hand,
refers to context data extracted by analyzing the collected data to understand internal
elements, such as preferences, tasks, emotional state, etc.
3.3.1 Context-aware Applications
Context information can be used to improve services in real-life situations, a
capability called context-awareness. The main advantage of context-aware systems is
their ability to adapt their operations to the current context [13]. For example, if a
person is busy (e.g., running or working) or in a place where he/she probably wishes not
to be disturbed, then a smart phone can go silent or send a predefined automatic message
as an answer. Although context-awareness is widely used as a property of mobile
devices, allowing them to adapt their behavior to the physical environment, there are
many other applications in different domains, such as recommender systems, process
design, and recognition systems.
Recommender systems aim to generate more relevant recommendations by
providing specific services according to the contextual situation of the user [2]. A music
recommendation approach based on users' history of music preferences is presented by
Hariri et al. [59]. In their work, the context is not fully observable, but they inferred it
from users' interactions with the system. A friend and item recommendation approach
for social networks is proposed by Liu and Aberer [92]. The authors combine context
and social network information, inferring a user's preferences by learning his/her
friends' tastes to improve the quality of recommendation.
Applications of recognition systems are generally found in the computer vision
area. The activity recognition problem is one of the most important goals in many
surveillance systems. Zhu et al. [166, 167] presented a context-aware activity recognition
framework to detect anomalies. In their work, they proposed a model to learn
context patterns from a set of activities. Another work is presented by Lan et al. [86],
recognizing the actions of individuals by using contextual group activities. The contextual
information is captured from the individuals and the relations between them.
They show the importance of contextual information in activity recognition
problems.
Another common application area of context-aware systems is process design.
Process design can be applied to solve various problems in different domains. An
example of context-aware process design applied to business theory, in relation
to business process management issues, is presented in [118]. There are also various
application examples in health care. In [14], the authors presented application examples
and design principles of context-aware computing for medical work in hospitals, such as
a context-aware pill container and a context-aware bed that react according to context.
3.3.2 Context Inference
Knowledge in the form of contextual attributes in context-aware systems can be
classified into three categories [2]: 1) Fully observable: the contextual attributes,
their values and the structures are known. 2) Partially observable: some information
is known about contextual attributes. 3) Unobservable: no information is known
about contextual attributes.
Fully observable contextual attributes can be used directly to adapt the system to
the context. In the case of partially observable and unobservable contextual attributes,
however, the contexts must be inferred. In general, context inference is based on two
main steps: 1) reading sensor values that characterize the user context, such as whether
the user is moving or stationary, whether the sound is loud, or whether the temperature
is high; and 2) using these readings to identify the context. Santos et al. [122] provided a
prototype of a context inference engine for mobile applications. They applied a
classification technique since they know the context classes, which are walking, running,
idle, and resting. In order to learn the classification of contexts, they built a decision
tree using features obtained from multiple sensor readings such as sound, light,
time, etc. A context is then determined by searching through the decision tree based on
the test features extracted from sensor readings.
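A toy sketch of this kind of supervised context inference is shown below (the sensor features, their values, and the training data are hypothetical illustrations, not those of [122]):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sensor features per observation: [movement, sound, light].
X_train = [
    [0.9, 0.7, 0.8],   # running
    [0.5, 0.4, 0.8],   # walking
    [0.1, 0.3, 0.9],   # idle
    [0.0, 0.1, 0.1],   # resting
]
y_train = ["running", "walking", "idle", "resting"]

# The context classes are known in advance, so context inference reduces
# to ordinary supervised classification over the sensor features.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# At test time, features extracted from new sensor readings are pushed
# through the learned tree to determine the current context.
context = clf.predict([[0.05, 0.15, 0.12]])[0]
```

This setting contrasts with the clustering-based inference discussed next, where no labeled context classes are available.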
Instead of using sensor values to infer context attributes, there are some other
works that propose approaches to learn contexts for a problem without observable
contextual attributes. In this kind of context inference, the data needs some
preprocessing in order to generate the context models. This preprocessing is typically
a classification or clustering technique.
Tu [135] proposed an auto-context algorithm to learn a context model for computer
vision tasks. The algorithm learns a classifier for local image patches from a set of
training images and their label maps. The classifications are then used as context
information for further tasks.
In [18], the authors adapt their object tracking application to changes in the
background context, such as illumination, coloring, and scaling. In their work, the
background contexts are identified by clustering various training backgrounds in order
to deal with changing backgrounds. At test time, the best-fitting background
cluster is determined for each frame, and the corresponding object descriptor for the
determined context is selected to track objects.
In our case, we do not have any sensor values or a predefined context, but we desire
to learn the contexts from the data itself. The sequential structure and distributions are
the only information we observe to extract the contexts in our generic framework. This
context learning is a clustering problem since we do not have any prior information
about contexts, such as the number of contexts or the features that can be used
to identify the contexts. During testing, we want to identify contexts from a learned
model, which is generated by clustering similar data sequences to capture the data
context. Details of the proposed context learning approach are explained in Chapter 4.
Chapter 4
Context Learning
In this chapter, we describe an automated context learning approach for symbolic
sequential data. We first provide our definition of context for symbolic sequential
data. Next, we briefly review clustering approaches for sequential data. Then we
present our methodology to perform automated context learning.
4.1 Context in Symbolic Sequential Data
Context-aware systems have been studied across various application areas to enhance
the effectiveness of the particular application. In general, a context can be identified
with a context attribute such as time, age, weather, location or a combination of
these attributes. For example, an hourly sales metric for a retail store would display
different characteristics for different “days” of the week and a normal heart rate range
would differ according to “age” or “gender” of the people. While “day” information
could be used to identify unique contexts for sales metrics, “age” or “gender” could
be used as contexts for an expected range of heart rates from a population. Trying
to analyze a dataset without first considering these contexts may lead to confusing
behaviors or statistics. Selecting a context attribute depends on the definition of a
context. As presented in Chapter 3.3, the definition of a context may change according
to application field or analysis goal.
In this dissertation, we provide a generic framework for anomaly detection on
symbolic sequential data and context learning is used for filtering during a prepro-
cessing step in order to enhance the accuracy of our framework. The target data we
focus on contains a sequence of symbols generated from a finite symbolic alphabet;
there is no context attribute to classify data samples in a straightforward way. We
define context as a set of data instances that exhibit similar characteristics, under the
assumption that sequential data samples will be structurally similar if they are contex-
tually related. In our definition, “being in the same context” indicates that the
data samples are collected from a similar source, monitored under similar conditions,
generated for similar purposes, etc. Since the context attribute is unobservable, it is
necessary to be able to categorize the sequential data samples into the contexts with
an unsupervised learning (clustering) approach. Basically, clustering can be defined
as dividing data samples into groups according to their similarities. We allow each
cluster to represent a context. In the following steps of our anomaly detection frame-
work, identifying contexts in a dataset allows us to model normal behavior for each
context individually.
4.2 Clustering for Context Learning
Clustering is one of the most commonly used partitioning methods in the field of
data mining. A number of surveys and texts have described the richness of different
clustering methods [73, 76, 129, 163, 164]. Clustering is an unsupervised learning
technique which aims to discover the natural boundaries between data samples by
maximizing the similarity within clusters and the dissimilarity between
clusters.
Clustering algorithms can be broadly divided into two categories [72]: 1) hi-
erarchical and 2) partitioning. Hierarchical clustering algorithms recursively find
nested clusters either using agglomerative (bottom-up) or divisive (top-down) meth-
ods. Hierarchical algorithms differ in the criteria that they use to determine similar-
ity between clusters, such as single-linkage, average linkage, and complete linkage [3].
Partitioning-based clustering algorithms find all the clusters simultaneously according
to a predefined cluster number k. The most popular partitioning-based clustering al-
gorithm is k-means, because of its ease of implementation, simplicity, and efficiency [3].
There are three main categories of approaches for clustering sequential data [149]:
1) Sequence similarity, 2) Indirect sequence clustering, and 3) Statistical sequence
clustering.
1) Sequence similarity: This approach depends on defining a similarity or
dissimilarity measure to compute the distances between data samples in a dataset.
A partitioning or hierarchical clustering method can be applied on the extracted
distances.
2) Indirect sequence clustering: This approach depends on generating feature-
based representations from data samples to use classical vector space-based clustering
algorithms.
3) Statistical sequence clustering: This approach depends on modeling data
samples to capture the dynamics of each data group.
In this thesis we apply partitioning-based clustering for learning the contexts in
a symbolic (discrete) sequence dataset. Since the contextual attribute in a symbolic
sequence generally corresponds to the ordering of symbols, we need a way to cluster
symbolic sequences that captures sequential dependencies.
Sequential data collection procedures may change based on the monitored data or
the application domain. Data collection involves monitoring a data source for a period
of time. This time period can be fixed to a predefined value or it may be variable
based on the start and end times of an event. For example, when tracing a program
execution, the data collection task may last for hours or days after the execution has
started. In such cases, we want to be able to recognize the context of a sequential
stream while it is being collected. Next, we present the details of our approach for
context learning.
4.3 Parameter Selection
One of the strengths of our approach is that it can be easily used in various domains
for the anomaly detection task, where the collected data is comprised of discrete
sequences. As mentioned before, context learning is one of the main steps of our
anomaly detection framework, since employing an accurate context learner can im-
prove detection accuracy. There are two important parameters that need to be found
in order to achieve a desired context learning: 1) the number of clusters, and 2) the
number of symbols. Our goal is to identify these parameters automatically based on
the given sequences.
We determine the clustering parameters in two steps. First, the entire training
data is transformed into a feature space and the number of clusters (k value) is
estimated. Second, the sequential data is used to estimate the minimum required
length (l value) of the sequence partition that we need for clustering. To perform this
task, an estimated k value in the previous step is used.
Next, we explain the estimation of these two parameters.
4.3.1 Number of Clusters
One important input parameter to any clustering algorithm is the number of
clusters present in the data. If we do not know the number of clusters present,
we must estimate the number of clusters and then determine the membership of the
data samples to those clusters for a given dataset. There are various methods to
select the best number of clusters. We can evaluate the quality of clusters using such
techniques as cross-validation, information theoretic methods, and silhouettes. One of
our goals in context learning is to select the best number of contexts automatically.
To achieve this goal, we considered a number of methods, using efficiency as a key
metric for selecting the best cluster number selection method. To perform this task,
we selected an indirect sequence clustering technique. The steps of indirect sequence
clustering are as follows. First, we extracted vector-based features for each sequence,
where the dimension of each vector is equal to the alphabet size. Second, we applied
the x-means clustering algorithm [109] to estimate the number of clusters.
Feature Extraction
Conventional clustering methods are designed for feature vector clustering. One way
to cluster a sequential dataset is transforming the sequences into vectors of features.
Basically, a sequence can be represented in vector form by selecting the unique sym-
bols in the dataset as features. One of the simplest and most commonly used methods
to generate a feature vector is counting the occurrences of each symbol in a sequence,
which is called tf (term frequency). This class of feature vector is commonly used in
text clustering. The term frequency tf_{t,D} of term t in document D is defined as the
number of times that t occurs in D, where “term” refers to “symbol” and “document”
refers to “symbolic sequence”. Since we would like to work with a relative tf to ad-
dress varying length sequences, a tf vector can be normalized by its length. Dividing
a vector by its length makes it a unit length vector to prevent a bias toward longer
sequences. A tf vector represents the frequency of a symbol in a sequence, though all
the symbols are considered equally important while computing tf. However, generally
rare symbols in a dataset are more informative than frequent symbols that appear
in every sequence. A more enhanced feature vector would include refined weighting
methods based on the frequency of the symbols in the entire dataset. tf-idf (term
frequency - inverse document frequency) is a common feature extraction method that
weights the symbols by their tf weight and their idf weight [4, 121]. Using tf-idf, idf
normalizes the tf score by reducing the weight of symbols which occur more frequently
in the entire dataset.
We can define the idf of t by

idf_t = log(N / df_t)   (4.1)

where N is the total number of sequences and df_t (document frequency) is the number
of sequences in which t appears. df_t is a measure of the informativeness of symbol t:
lower values indicate its rareness in the dataset and provide higher weights on t.
log(N/df_t) is used instead of N/df_t to “dampen” the effect of idf.
The t-th feature of a particular sequence D can then be set to the tf-idf weight:

tf-idf_{t,D} = tf_{t,D} · idf_t   (4.2)

In tf-idf, tf increases with the number of occurrences of a symbol within a sequence,
while idf increases with the rarity of the symbol across the entire dataset.
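As a concrete sketch of this feature extraction for symbolic sequences (the function name and data layout are illustrative choices, not the dissertation's implementation):

```python
import math
from collections import Counter

def tfidf_vectors(sequences):
    """Length-normalized tf-idf feature vectors for symbolic sequences.

    `sequences` is a list of symbol lists; returns (alphabet, vectors), where
    each vector is indexed by the sorted alphabet of the whole dataset.
    """
    N = len(sequences)
    alphabet = sorted({t for seq in sequences for t in seq})
    # df_t: the number of sequences in which symbol t appears at least once
    df = Counter(t for seq in sequences for t in set(seq))
    vectors = []
    for seq in sequences:
        counts = Counter(seq)
        # relative tf (normalized by sequence length) times idf = log(N / df_t)
        vectors.append([(counts[t] / len(seq)) * math.log(N / df[t])
                        for t in alphabet])
    return alphabet, vectors
```

Note that a symbol occurring in every sequence receives weight zero, since its idf is log(1) = 0.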
Clustering by X-means
The K-means clustering algorithm partitions a set of data samples into k clusters
based on their features. This requires a predefined k value, which may not be available
in many cases. Furthermore, obtaining the number of clusters k could be the only
reason for running a clustering analysis in some cases. In data clustering, automatic
estimation of k has been one of the most difficult problems in the field [72]. Most
of the proposed solutions to this problem are based on running k-means repeatedly
with different k settings and then choosing the one that is best according to a quality
criteria, such as MML (Minimum Message Length) as applied in [53] or BIC (Bayesian
Information Criterion) as applied in [55].
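As a simplified illustration of this repeated-k-means-with-BIC recipe (not the x-means algorithm itself, and not code from the dissertation; the deterministic farthest-point initialization and the hard-assignment spherical-Gaussian BIC are our own simplifying assumptions):

```python
import math

def kmeans(points, k):
    """Lloyd's k-means with a deterministic farthest-point initialization."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))
    while True:
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: d2(p, centers[j]))].append(p)
        new = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[j]
               for j, g in enumerate(groups)]
        if new == centers:
            return groups, centers
        centers = new

def select_k(points, k_max):
    """Pick k by the BIC of a hard-assignment spherical-Gaussian model."""
    n, d = len(points), len(points[0])
    best_k, best_bic = 1, float("-inf")
    for k in range(1, k_max + 1):
        groups, centers = kmeans(points, k)
        rss = sum(sum((a - b) ** 2 for a, b in zip(p, c))
                  for g, c in zip(groups, centers) for p in g)
        var = max(rss / (n * d), 1e-12)          # pooled per-dimension variance
        llh = (sum(len(g) * math.log(len(g) / n) for g in groups if g)
               - (n * d / 2) * math.log(2 * math.pi * var) - n * d / 2)
        bic = llh - (k * (d + 1) / 2) * math.log(n)  # penalize extra parameters
        if bic > best_bic:
            best_k, best_bic = k, bic
    return best_k
```

On data with two well-separated groups, the penalty term makes k = 2 score above both the underfit k = 1 and the overfit k ≥ 3 clusterings.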
In our case, we are working with sequential data samples, and we do not know how
many clusters are present in the dataset. Weighted frequency distributions (tf-idf)
are generated from these sequential data samples and are used as feature vectors for
x-means clustering. The reason that we use the x-means algorithm for clustering is
that there is no need to know the number of clusters in advance.
The x-means algorithm, as proposed by Pelleg and Moore [109], is a k-means
extension which efficiently finds k by optimizing a quality metric. In x-means, a kd-
tree is used to identify the closest cluster centers for all the data points. The x-
means algorithm runs the k-means algorithm with k=2 repeatedly to split the cluster
centers into regions. After each run of 2-means, decisions are made for the subset
of the current cluster as to whether it should be split or not. The decision between
the subsets of each center and the center itself is made by comparing the quality of the
two structures. If the subsets of a center are better than the current center according to
a quality metric, the algorithm replaces the cluster center with its subsets. Originally,
x-means applies BIC as a quality metric, though other scoring criteria could also be
applied. Experiments show that the x-means algorithm finds the natural clusters
accurately and it is faster than repeatedly applying accelerated k-means for different
values of k [109].
4.3.2 Required Length (Number of Symbols)
Our clustering approach depends on context learning from the sequences with the
first l symbols, which allows us to work on incomplete sequences. The best number
of symbols (l) must satisfy two requirements: 1) a sequence length l has to
be long enough to detect the similarity/dissimilarity between sequences, 2) a sequence
length l has to be short enough to recognize the context in near real-time.
To perform this task, we applied sequence similarity based clustering techniques.
First, we used two sequence similarity measures to compute distance matrices be-
tween data samples by varying l. Second, we used the k-medoids clustering algorithm to
generate the clusterings by using the previously estimated number of clusters. Finally,
we evaluated the goodness of the clustering results to select the best l value.
Similarity Measures
In this thesis we have experimented with two different measures to compute dis-
tances between sequence partitions: 1) Hamming distance and 2) Longest Common
Substring (LCSub).
Hamming distance is a metric based on counting the number of positions at which
the corresponding symbols are different in two sequences of equal length. Basically,
the Hamming distance is a kind of edit distance which measures the minimum number
of substitutions required to change one string into the other. The Hamming distance
between the sequences i and j can be computed by:
dHM(i, j) = Σ_{x=1}^{l} δ(i_x, j_x)   (4.3)

where l is the length of the sequences, and δ(i_x, j_x) = 0 if i_x = j_x, δ(i_x, j_x) = 1
if i_x ≠ j_x. For example, the Hamming distance between the two sequences “ABBABC”
and “ABABAA” is 4, since they do not match in 4 positions.
Longest Common Substring is a similarity measure that finds the longest string that
is a substring of two or more strings. Note that the Longest Common Substring is a dif-
ferent measure than the Longest Common Subsequence, since only the former requires
the common symbols to be consecutive in both sequences. For example, the Longest
Common Subsequence of the sequences “ABBABC” and “ABABAA” is the string
“ABAB” of length 4, while the Longest Common Substring is the string “BAB” of
length 3. We compute distance based on the Longest Common Substring between two
sequences i and j as follows:

dLCSt(i, j) = l − length(LCSt_{i,j})   (4.4)

where l is the length of the sequences. Since the Longest Common Substring is a
similarity measure, we extract a dissimilarity by subtracting it from the length (l).
We apply both the Hamming distance and the Longest Common Substring
measures to compute a distance matrix. Since we extracted a distance matrix for
each l value in a predefined range, we normalize the distance values to eliminate the
impact of the length.
d(i, j) = (dHM(i, j) + dLCSt(i, j)) / (2 · l)   (4.5)

For example, the distance between the sequences “ABBABC” and “ABABAA” is
(4 + 3)/12 ≈ 0.58.
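The two measures and the combined distance of Equation 4.5 can be sketched as follows (the dynamic-programming substring routine is a standard textbook formulation, not code from the dissertation):

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ (Eq. 4.3)."""
    return sum(a != b for a, b in zip(s, t))

def longest_common_substring(s, t):
    """Length of the longest contiguous run of symbols common to s and t."""
    best, prev = 0, [0] * (len(t) + 1)
    for a in s:
        cur = [0] * (len(t) + 1)
        for j, b in enumerate(t, 1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def distance(s, t):
    """Normalized combined distance of Eq. 4.5 for equal-length sequences."""
    l = len(s)
    d_lcst = l - longest_common_substring(s, t)  # Eq. 4.4
    return (hamming(s, t) + d_lcst) / (2 * l)
```

For the worked example above, `distance("ABBABC", "ABABAA")` evaluates to (4 + 3)/12 ≈ 0.58.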
Clustering by K-medoids
Originally, the k-means algorithm was designed for the clustering of data vectors
which can be represented in a Euclidean space. Each cluster is centered about a
center point (centroid) which is the mean of the coordinates of the data points in the
cluster. There are two main requirements of k-means which make it inappropriate to
cluster sequential data: 1) the k-means algorithm requires that the distances between
the data points and the cluster centroids be calculated at each iteration. 2) the k-
means algorithm requires vector data, since calculating the distances from centroids
involves vector operations.
In our case, we have sequential data points, where the distance values between
datapoints are not derived from a Euclidean space. K-medoids is a variant of the
k-means algorithm where the center points (clustroids) of each cluster are chosen
from the data points, instead of computing the means. Clustroids are the data points
that minimize the total distance (4.6) within a cluster.
Σ_{j∈C_i} d(i, j)   (4.6)

where C_i is the cluster containing data point i, and d(i, j) is the distance between
i and j.
There are two main advantages of k-medoids which make it quite appropriate for
clustering sequential data: First, it does not require repeated distance calculations
at each iteration, since the medoids are the actual data points. Second, it does not
require vector data to compute dissimilarities, since a distance matrix can be computed
directly from the data points. The steps of the k-medoids algorithm [116] are as
follows:
1. Select k samples at random to be the initial cluster medoids.
2. Assign each sample in the dataset to the cluster of the closest medoids.
3. Update the set of medoids by considering Eq. 4.6.
4. Repeat steps 2 and 3 until the medoids become fixed.
We used k-medoids clustering to train a context classifier which is used to classify
new data samples based on the cluster centers.
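The four steps above can be sketched as follows, assuming a precomputed distance matrix (a minimal illustration, not the dissertation's implementation):

```python
import random

def k_medoids(dist, k, seed=0):
    """K-medoids over a precomputed distance matrix `dist` (list of lists)."""
    n = len(dist)
    rng = random.Random(seed)            # step 1: random initial medoids
    medoids = rng.sample(range(n), k)
    while True:
        # step 2: assign each sample to the cluster of the closest medoid
        clusters = [[] for _ in range(k)]
        for i in range(n):
            clusters[min(range(k), key=lambda j: dist[i][medoids[j]])].append(i)
        # step 3: the new medoid of each cluster minimizes Eq. 4.6
        new = [min(c, key=lambda m: sum(dist[m][i] for i in c)) for c in clusters]
        if new == medoids:               # step 4: stop when the medoids are fixed
            return clusters, medoids
        medoids = new
```

Because the medoids are actual data points, only lookups into the distance matrix are needed; no vector operations are performed.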
Cluster Quality
There are two basic measurements that can be used to evaluate quality of a clustering
result: 1) compactness (tightness), which is a measure of the quality of each cluster,
and 2) separation, which is a measure of the quality of the distance between clus-
ters. Silhouette is a graphical representation proposed for partition-based clustering
techniques. A silhouette captures both the clustering tightness and separation [119].
The silhouette value can be used to determine which data samples lie well within
their cluster and which do not [76]. The average silhouette value provides a metric
for judging clustering validity and quality.
The computation of a silhouette value for a data sample depends on two measures:
1) a(i): the average dissimilarity of i to all other data samples within the same cluster,
and 2) b(i): the smallest average dissimilarity of i to all data samples in other clusters.
While a(i) is captures how well sample i fits in the assigned cluster (the smaller the
a(i), the better the assignment), b(i) shows how well the sample i is different from
the other clusters. The cluster with the smallest average dissimilarity is called the
neighbor cluster, which is the second-best choice for sample i.
For each data sample i, the silhouette value can be evaluated as follows:

silhouette(i) = 1 − a(i)/b(i)   if a(i) < b(i),
                0               if a(i) = b(i),
                b(i)/a(i) − 1   if a(i) > b(i).
                                                   (4.7)
A silhouette(i) is a value in the interval [-1, 1]. A value close to “1” shows that the
data sample is properly clustered. A value close to “-1” shows that the data sample
is poorly clustered and it may be that the sample should be moved to the neighbor
cluster. A value close to “0” shows that the data sample is on the border of the current
and the neighbor cluster, and so it is unclear which cluster the sample belongs to.
The general formulation of computing a silhouette(i) is given by:
silhouette(i) = (b(i) − a(i)) / max{a(i), b(i)}   (4.8)
In order to compute a silhouette value to evaluate a clustering result, we only
need the distances between the data samples and the clustering results of these data
samples [119].
silhouette = (1/n) Σ_{i=1}^{n} silhouette(i)   (4.9)
The overall average silhouette is the mean of the silhouette values for all data samples
in a dataset (as shown in Eq. 4.9), where n is the number of data samples in the dataset.
The silhouette measure can be used to determine the number of clusters by selecting
the cluster number which maximizes the overall average silhouette value [76].
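A minimal sketch of Equations 4.8 and 4.9 over a precomputed distance matrix (illustrative code, not the dissertation's implementation):

```python
def silhouette_values(dist, clusters):
    """Per-sample silhouette values (Eq. 4.8) for a given partition."""
    scores = {}
    for ci, cluster in enumerate(clusters):
        for i in cluster:
            # a(i): average dissimilarity to the other members of i's cluster
            a = (sum(dist[i][j] for j in cluster if j != i) / (len(cluster) - 1)
                 if len(cluster) > 1 else 0.0)
            # b(i): smallest average dissimilarity to any other cluster
            b = min(sum(dist[i][j] for j in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci and other)
            scores[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return scores

def average_silhouette(dist, clusters):
    """Overall average silhouette (Eq. 4.9), used to compare clusterings."""
    scores = silhouette_values(dist, clusters)
    return sum(scores.values()) / len(scores)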
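A minimal sketch of Equations 4.8 and 4.9 over a precomputed distance matrix (illustrative code, not the dissertation's implementation):

```python
def silhouette_values(dist, clusters):
    """Per-sample silhouette values (Eq. 4.8) for a given partition."""
    scores = {}
    for ci, cluster in enumerate(clusters):
        for i in cluster:
            # a(i): average dissimilarity to the other members of i's cluster
            a = (sum(dist[i][j] for j in cluster if j != i) / (len(cluster) - 1)
                 if len(cluster) > 1 else 0.0)
            # b(i): smallest average dissimilarity to any other cluster
            b = min(sum(dist[i][j] for j in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci and other)
            scores[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return scores

def average_silhouette(dist, clusters):
    """Overall average silhouette (Eq. 4.9), used to compare clusterings."""
    scores = silhouette_values(dist, clusters)
    return sum(scores.values()) / len(scores)
```

A clustering that respects the natural groups scores close to 1, while a partition that mixes the groups scores near or below 0.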
We have used silhouettes to estimate the required length l of a sequence partition
for clustering. Here, l represents the length of a partition in a sequence, starting from
the first symbol in the time sequence. We compute the overall average silhouette for
each clustering result, which are obtained by varying l values.
4.4 Summary of Context Learning
There are two main goals of the context learning process. First, we need to determine
the contexts in a training dataset to be able to train a model for each context during
the training phase. Second, we need to recognize the contexts in a test dataset to
be able to evaluate each sequence with the corresponding model produced during the
training phase. We also want to find the context of a test sequence without requiring
the entire sequence, since we need near real-time evaluation capabilities even for very
long sequences. In order to meet this constraint, we estimate the minimum required
length (l), which is basically the number of symbols that we need to recognize the
context of a sequence.
In the context learning process, we assume that clustering feature vectors helps
to find natural classes in the dataset. We refer to these classes as contexts. In order
to extract feature vectors we apply our tf-idf method which weights the symbols
according to their importance in a given sequence and also across the entire sequence
dataset. Then, we run the x-means clustering algorithm to estimate the number of
natural clusters in a dataset.
In the context learning task, the minimum required length (l) is determined only
for test purposes, so we can recognize a context quickly. In order to estimate the
shortest length (l), starting from the first symbol in a sequence, we implement the
following steps. First we compute the distance matrices between sequences by varying
the length l. We use two string similarity/dissimilarity metrics, namely Hamming
distance and Longest Common Substring, to find the distances. Then we apply the
k-medoids clustering algorithm on these distance matrices. Finally, the estimation of
l is performed by comparing the quality of obtained clusters using silhouettes.
The context learning presented in this chapter enhances our framework in two
ways: 1) we apply a context-aware behavioral learning approach during training, 2)
we apply a context-aware anomaly detection approach during testing.
Chapter 5
System Call Anomaly Detection
System calls provide an interface between an application and the operating system’s
kernel. Since a program frequently requests services via system calls, a trace of these
system calls provides a rich profile of a program’s behavior. In this section, we present
our approach to the system call anomaly detection problem step by step. We describe
the details of our system call anomaly-based IDS. The IDS consists of two phases,
each with a number of steps:
1. Training, whose steps include: a.) Preprocessing to differentiate the various
contexts in the training dataset and to generate features. b.) Model learning to
build a model of normal behavior for each context by training a HMM for each
cluster.
2. Testing, whose steps include: a.) Preprocessing to classify the test sequence into
one of the previously learned contexts and to generate features. b.) Computing an
anomaly score to identify anomalous behavior as deviations from the model of
normal behavior. c.) Anomaly detection to provide an alarm for the abnormal
behavior that deviates from the norm as a possible threat after filtering.
Figure 5.1: Design of our system call anomaly detection framework.
An overview of our system call anomaly detection framework design is presented
in Figure 5.1. Implementation of this framework is described in the following sections:
First, the structure of the experimental dataset is detailed. Second, preprocessing and
feature extraction from this dataset are discussed. Third, behavior learning is
implemented by training HMMs. Finally, experimental results are presented.
5.1 System Call Trace Dataset
We evaluate our proposed approach on a well-known system call dataset provided by
the University of New Mexico (UNM) [136]. In the UNM benchmark dataset, each
program dataset includes several system call traces which are generated by tracing
various normal and intruded runs of a program. In this thesis, we use the sendmail
traces from the UNM dataset in our experiments. Each program trace includes system
calls associated with the corresponding process IDs, as shown in Table 5.1.
Table 5.1: A sample of the UNM program trace.
Process ID   System Call
552          19
552          105
551          5
552          104
552          104
551          5
552          106
552          105
A trace of a program execution typically includes multiple processes, as seen
in Table 5.1. In this thesis, system calls are grouped together according to their PIDs
to create an ordered system call list for each process, as presented in Table 5.2.
Table 5.2: A sample of extracted process traces.
PID:551   PID:552
5         19
5         105
          104
          104
          106
          105
We applied this PID partitioning on each trace provided in the UNM sendmail
dataset. A system call sequence for a process in the UNM sendmail dataset is shown in
Table 5.3. In this table, each number represents an index into the system call mapping
file provided with the dataset. For example, the number 1 represents the system call
fork, and the number 5 represents the system call close. In this work, these process traces are
used for training and testing.
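The PID partitioning illustrated by Tables 5.1 and 5.2 can be sketched as follows (the function name and the (pid, syscall) input layout are illustrative assumptions, not the dissertation's implementation):

```python
def group_by_pid(trace):
    """Split a program trace of (pid, syscall) pairs into per-process
    system call sequences, preserving the original ordering."""
    processes = {}
    for pid, syscall in trace:
        processes.setdefault(pid, []).append(syscall)
    return processes
```

Applied to the sample in Table 5.1, this yields the per-process sequences shown in Table 5.2.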
Table 5.3: System call sequence for PID:552.
19 105 104 104 106 105 104 104 106 105 104 104 106 54 4 5 5 40 40 4 50 5 38 1
105 104 104 106 112 19 19 105 104 104 6 6 106 78 112 105 104 104 106 78 93 101
101 100 102 105 104 104 106 93 88 112 19 128 95 1 5 95 6 6 95 5 5 5 5 5
5.2 Preprocessing
To train a model and identify the anomalous traces accurately, we have
found that effective preprocessing of system call traces is key. Our preprocessing
approach for the system call datasets includes four steps: 1) data partitioning, 2) data
reduction, 3) context learning and 4) feature extraction.
5.2.1 Partitioning
In Section 5.1 we presented the UNM sendmail system call trace dataset and discussed
how we partitioned these program traces into process traces. After this partitioning,
there are 346 processes in the normal execution and 25 processes in the abnormal
execution. The names of the program execution traces and the number of processes
are shown in Table 5.4. In this table, the # of processes column in the intrusion data
shows the number of abnormal processes versus the total processes in the correspond-
ing trace. For example, 1 of 6 means that there are 6 processes in the corresponding
execution and only 1 of the 6 process traces includes anomalous execution. Although
there are 25 process traces in the intrusion dataset, only 13 of these are abnormal
processes. Since abnormal program execution typically includes normal processes, in
our experiments we defined a program as abnormal if it includes at least one intruded
process.
Table 5.4: The UNM sendmail trace dataset.
Normal                          Intrusion
File Name   # of processes      File Name     # of processes
bounce      4                   sm-280        1 of 6
bounce 1    3                   sm-314        1 of 6
bounce 2    7                   fwd-loops-1   2
plus        26                  fwd-loops-2   1
log         147                 fwd-loops-3   2
queue       12                  fwd-loops-4   2
daemon      147                 fwd-loops-5   1 of 3
                                sm-10763      1
                                sm-10801      1
                                sm-10814      1
7 normal program traces,        10 intruded program traces,
346 normal process traces       13 intruded process traces
5.2.2 Reduction
In our second step of preprocessing, we remove identical process traces from the
normal dataset to avoid the possibility of using the same sequences in training and
testing. Identical process traces are identified by comparing each process trace with
the other process traces. We store only one copy of a repeated process in our dataset.
After this filtering is completed, we have 68 unique process traces in the normal
dataset and 10 unique process traces in the abnormal dataset. We then use this subset
of the traces in training and testing. Although applying our reduction pass over the
abnormal process traces is not necessary, we want to have a consistent reduction
method for all traces (since we will not know if a trace is anomalous until later in this
process).
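A minimal sketch of this duplicate-removal step (illustrative code; loading of the traces from files is omitted):

```python
def unique_traces(traces):
    """Keep only the first copy of each identical process trace."""
    seen, unique = set(), []
    for trace in traces:
        key = tuple(trace)          # hashable key for exact-match comparison
        if key not in seen:
            seen.add(key)
            unique.append(trace)
    return unique
```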
5.2.3 Clustering for Context Learning
Since a program trace includes execution covering a number of normal behaviors, we
analyzed unique process traces to detect any structural similarities between processes.
These similarities help us to define context-aware learning for each cluster. Examining
all of the unique processes, we clustered the UNM sendmail process traces into
3 sets based on their similarities. This analysis led us to train 3 HMMs, one for each cluster.
The concept behind using multiple HMMs is that each HMM is expected to learn
better when trained on a set of similar sequences. Clustering results of normal
and abnormal processes on a PID basis are shown in Table 5.5. In the rest of the chapter,
training, validation and test data are produced by considering this clustering process.
5.2.4 Feature Extraction
One of the most important tasks to perform during behavior learning is extracting
useful and efficient features to profile normal behavior. This step plays an important
role that impacts the success of the training phase. Before we get into the details of
feature extraction, we define the structure of the system call dataset:
Definition 5.1: A dataset (D) is a set of system call sequences (S) and it is
Table 5.5: Clustering results of UNM sendmail processes
             Normal                              Abnormal
Set 1 PIDs   Set 2 PIDs   Set 3 PIDs    Set 1 PIDs   Set 2 PIDs   Set 3 PIDs
551          552          553           170          283          163
1407         1402         554           162          317          183
12376        1551         1403          182          119          207
12387        8844         1409          206
12398                     1411          10765
12409                     1414          10803
12420                     1574          10816
12431                     1577
12272                     1578
12827                     1582
12838                     1583
12849                     12378
12883                     12400
12900                     12411
12908                     12422
8840                      12433
12807
12829
12840
12851
12902
12910
represented as:
D = {S1, S2, S3, ..., Sm} (5.1)
where m is the number of sequences (processes), i=1, 2, ..., m; each Si is the
sequence of system calls. In this example, D represents all unique process traces
extracted from sendmail dataset.
Definition 5.2: A system call sequence (S) for a process shown in Table 5.3 can
be represented as:
S = {s1, s2, s3, ..., sn} (5.2)
where n is the number of system calls in S and i=1,2,...n; si ∈ S is a system call.
In this thesis, we generate fixed-length subsequences (features) by sliding a window
across the process traces and recording each unique subsequence. If we slide a window
of size k on a sequence S, given in Equation 5.2, then we generate the following
subsequences:
{s1, s2, s3..., sk}
{s2, s3, s4..., sk+1}
{s3, s4, s5, ..., sk+2}
...
...
{sn−(k−1), sn−(k−2), ..., sn}
If each of the generated subsequences is unique, there are at most n − (k − 1)
subsequences as a result. One of the most important parameters of using fixed-length
pattern extraction is deciding the appropriate window size k. Researchers have used
different window sizes in previous work to find the optimal k. Many studies found
that the minimum required window size is “6” for system call anomaly detection on
UNM dataset [146, 50, 87, 130]. Eskin et al. [50] proposed that the optimal window
size is different for each program in the UNM dataset. They computed the conditional
entropy of each program trace for different window sizes and claimed that measuring
the regularity of data can be used to pick the window size. Lee and Xiang [87] also
proposed a similar approach that used conditional relative entropy to estimate the
window size. An alternative approach suggests that conditional entropy may not be
a universal selection metric, and supports this position by measuring the regularity
of completely random data [130].
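The sliding-window extraction described above can be sketched as follows; the trace values are illustrative system-call numbers, not taken from the UNM data:

```python
def extract_subsequences(trace, k):
    """Slide a window of length k across a trace and keep each unique
    subsequence (first-occurrence order). A trace of length n yields
    at most n - (k - 1) subsequences."""
    seen = []
    for i in range(len(trace) - k + 1):
        window = tuple(trace[i:i + k])
        if window not in seen:
            seen.append(window)
    return seen

# Toy trace of system call numbers (illustrative values)
trace = [4, 2, 66, 66, 4, 2, 66, 66, 4]
subs = extract_subsequences(trace, k=6)   # at most n - (k - 1) windows
```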
As discussed earlier, an intrusion trace may include normal process traces in addition
to the abnormal trace(s). In the UNM dataset, one of the abnormal traces
of the sendmail program (sm-280.int), which performs a decode attack, includes six
different processes. Only one of them (PID:283) is considered abnormal, since the
other process traces also appear in the normal process trace set. A detailed analysis
of the sendmail program traces shows that the suspect process trace PID:283 in the
sm-280.int file is not detectable if a window shorter than 6 is used [131]. Therefore,
using a k value smaller than 6 on process PID:283 produces only subsequences that are
also present in the normal process traces. In other words, a narrow window causes
information loss during subsequence generation.
In the system call analysis described in this thesis, features (subsequences) are
extracted using a sliding window of length 6 with a step increment of 1. To generate
the normal and the test data, we extract unique subsequences for each process trace
in the program traces. In our experiments, normal data refers to subsequences gener-
ated from all normal process traces, and abnormal data refers to the subsequences
generated from abnormal process traces. For each cluster, normal data is randomly
partitioned into two subsequence sets: 50% for the training set and 50% for the val-
idation/test set. Abnormal data is also added to the test sets for each cluster by
considering the process trace where it is generated. For the evaluation of the trained
model on each cluster, extracted subsequences are considered individually. To per-
form anomaly detection on a process trace, the subsequences that are extracted from
the corresponding process trace are used during the evaluation.
5.3 Behavior Learning - Training
In this experiment, half of the subsequences extracted from the normal traces are
used in training. We applied 10-fold cross validation to train and evaluate the model;
therefore, only 45% of the normal dataset is used in the training phase for each fold.
In the training phase, we defined the dimensions and the observation sequences
of the HMM to estimate the model parameters λ = {A,B, π}. The number of ob-
servation symbols M is equal to the number of unique system calls in the dataset.
In the sendmail dataset, there are 53 unique system calls in both normal and ab-
normal traces. Subsequences generated via a sliding window are used as observation
sequences in the HMM model. We set the window length T of the observation
sequence equal to the sliding window length k. While choosing the observation sequence
length, we considered the benchmark dataset, which requires k ≥ 6, as discussed
in Section 5.2.4. We limit the length to 6, since learning with fewer parameters results
in faster training.
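Once trained, a model λ = {A, B, π} assigns each observation subsequence a log-likelihood via the forward algorithm; these are the LL values used later for scoring. A minimal scaled-forward sketch is shown below, with made-up 2-state, 3-symbol parameters rather than the trained sendmail model:

```python
import numpy as np

def log_likelihood(A, B, pi, obs):
    """Scaled forward algorithm: log P(obs | lambda) for a discrete HMM.
    A: (N, N) transitions, B: (N, M) emissions, pi: (N,) initial
    distribution, obs: list of symbol indices of length T."""
    alpha = pi * B[:, obs[0]]
    s = alpha.sum()
    log_p = np.log(s)
    alpha = alpha / s                  # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()
        log_p += np.log(s)
        alpha = alpha / s
    return log_p

# Toy 2-state, 3-symbol model; T = 6 matches the window length above
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.5, 0.5])
ll = log_likelihood(A, B, pi, [0, 0, 1, 2, 2, 2])
```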
When training a model with an HMM, it is possible to increase the likelihood value
by increasing the number of parameters in the model. But increasing the number
of parameters can lead to overfitting. To address this issue, we applied the Bayesian
Information Criterion (BIC) [124] to select the number of hidden states in our model.
BIC introduces a penalty term for the number of parameters, while computing a
criterion score based on the maximized likelihood. Equation 5.3 provides the details
on how we compute BIC:
BIC = −2 ln(L) + p ln(n) (5.3)
where L is the maximum likelihood, p is the number of free parameters and n is
the number of data points. We experimented by varying the number of hidden states
N from 10 to 100. Normalized BIC results for this range of hidden states are shown
in Figure 5.2. Since we prefer the simplest model that best fits the data, lower
BIC scores identify the candidate numbers of hidden states for the model.
The lowest BIC values for each fold are found to vary between 40 and 60
hidden states. Considering this range of BIC values, we selected N = 53, which is
also equal to the number of unique system calls in the UNM sendmail traces.
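A sketch of this BIC-based selection is shown below. The free-parameter count for a discrete HMM is one common convention (each row of A, B, and π loses one degree of freedom), and the per-model log-likelihoods are hypothetical, not values from our experiments:

```python
import math

def bic(log_likelihood, n_states, n_symbols, n_points):
    """BIC = -2 ln(L) + p ln(n) as in Equation 5.3."""
    p = (n_states * (n_states - 1)        # transition matrix A
         + n_states * (n_symbols - 1)     # emission matrix B
         + (n_states - 1))                # initial distribution pi
    return -2.0 * log_likelihood + p * math.log(n_points)

# Hypothetical maximized log-likelihoods for three candidate models
candidates = {10: -80000.0, 53: -50000.0, 100: -49000.0}
best = min(candidates, key=lambda N: bic(candidates[N], N, 53, 40000))
```

Note how the p ln(n) penalty rules out the 100-state model even though it attains the highest likelihood among these candidates.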
Figure 5.2: BIC values for various numbers of hidden states.
5.4 Test and Evaluation
In the testing phase, we followed our clustering approach and subsequence extraction
method on the process traces. Each set is tested with the corresponding normal test data
extracted from normal process traces and the abnormal data extracted from abnormal
process traces. Results of the testing process are shown using ROC analysis: Figure
5.3 presents the ROC curve for the subsequences of Set 1, Figure 5.4 for those of
Set 2, and Figure 5.5 for those of Set 3.
Figure 5.3: ROC curve for Set 1
We present the validation test results, reporting Area Under the Curve (AUC)
values for each fold of the trained model in our ROC analysis. Our validation set
includes 50% of all of the normal subsequences and all abnormal subsequences for the
corresponding set of process traces. Results show high detection rates with very low
FPRs for the short sequences.
In Figure 5.6, we present the ROC results when using only one HMM for
training, without considering the clusters. Although this model is able to
detect most of the abnormal subsequences, we achieved better detection rates by
clustering and training an HMM for each cluster.
Figure 5.4: ROC curve for Set 2
5.5 Anomaly Detection
Test results on subsequences of the process traces show that our approach of clustering
and training multiple HMMs is successful at detecting abnormal subsequences, with
high AUC values for each set. This also shows that the LL values can be used to
predict whether a subsequence is abnormal or normal. But instead of evaluating sub-
sequences individually, an IDS needs to analyze the process trace to decide whether
it is normal or abnormal. To assign an anomaly score to a process trace, we applied an
Exponentially-Weighted Moving Average (EWMA) as a filter on the log-likelihood
values of the subsequences within each process. Since EWMA was first introduced by
Roberts [117], it has been widely used for detecting shifts in the mean of a sequence
of discrete values. EWMA applies weights on discrete decision values in an exponen-
tially decreasing order to smooth out fluctuations; the most recent values are weighted
Figure 5.5: ROC curve for Set 3
most highly. In our case, decision values are the log-likelihood values of subsequences
in process traces. If EWMA values reach a threshold at some point of evaluation,
our system generates an alarm for the considered process trace. To compute EWMA
values for each subsequence in a process trace, we use Equation 5.4:
EWMAt = αYt + (1− α) EWMAt−1 (5.4)
where Yt is the decision value (LL value) at time t, and α is the weight which determines
the depth of memory. EWMAt is the value of the EWMA at time period t. α is a
constant smoothing factor between 0 and 1 and can be computed according to a
desired filter width W as α = 2/(W + 1). If α is close to 1, the filter gives more importance to
recent decision values and discounts older values faster.
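A minimal sketch of this filter follows; the log-likelihood stream and the alarm threshold are illustrative values, not numbers from the sendmail experiments:

```python
def ewma(values, width):
    """Apply Equation 5.4 recursively with alpha = 2 / (W + 1),
    seeding the filter with the first decision value."""
    alpha = 2.0 / (width + 1)
    out = [values[0]]
    for y in values[1:]:
        out.append(alpha * y + (1 - alpha) * out[-1])
    return out

# Log-likelihoods of consecutive subsequences: a sustained drop
# pushes the smoothed score past the (illustrative) threshold.
ll = [-3.1, -2.9, -3.0, -9.8, -10.2, -9.9]
scores = ewma(ll, width=3)               # alpha = 0.5
alarm = any(s < -6.0 for s in scores)
```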
Since the EWMA filter generates an anomaly score at each point of the process
Figure 5.6: ROC curve for one HMM
trace, there is no need to wait until the entire process sequence has been evaluated to detect
abnormal behavior. Figure 5.7 presents EWMA values on a normal and an abnormal
process trace. While there is some noise in the normal process, the EWMA filter
smooths the noise. Although the abnormal process behaves normally at the beginning
of the trace, it exhibits abnormal behavior after some point. In this instance, EWMA
values are high enough to detect abnormal behavior when an anomaly starts.
In the UNM sendmail dataset, there are 78 distinct process traces, and 10 of those
are considered abnormal according to the ground truth generated by applying the
stide mechanism described earlier [66]. Figure 5.8 presents the ROC curve obtained
by varying the threshold value on the maximum EWMA filter output of each
distinct process trace.
Figure 5.7: Time-series plot for process traces.
5.6 Summary of System Call Trace Analysis
Sequential behavior modeling and prediction is only possible if the data includes
some probabilistic structure. HMMs are one of the most effective dynamic behavior
modeling approaches for learning temporal relationships between system calls in
anomaly-based intrusion detection.
Although there has been significant prior work using HMMs to learn program
behavior for anomaly detection, in this application we present a new process for pre-
processing traces, leading to better anomaly detection results. To differentiate
between various behaviors (contexts), we applied similarity-based clustering on sys-
tem call sequences in the benchmark dataset. This approach captures similar behavior
across processes, starting with similar sequential structures. Behavioral clustering
Figure 5.8: Process-based evaluation results.
approaches can be improved by applying a more sophisticated clustering algorithm
using a time window.
In this illustrative example, we provide an on-line anomaly detection framework by
using a dynamic anomaly score computed at each time step with an EWMA
filter. The EWMA smooths the decision value, as compared to using the raw log-
likelihood values of short sequences directly. Applying the EWMA filter to log-likelihood
values has two advantages in our anomaly detection framework. First, it provides
lower FPRs by smoothing the decision values, and second, our detection model can
run online by detecting abnormal behavior at the point of penetration.
Although HMMs are fast enough for on-line anomaly detection during testing,
one of the main drawbacks of using HMMs is their significant computation time
while training. The training dataset size and the observation sequence length both
have a huge impact on training time. To address this issue, we used only 45% of
the normal data for training and clustered those data into three sets to train an HMM
for each set of process traces. We also selected the shortest possible window length
that catches all abnormal subsequences in the UNM sendmail traces while extracting
observation sequences.
In this implementation, we have presented an illustrative example to describe
how to apply our proposed approach to system call traces for cyber security. We
considered the details of the UNM system call dataset by examining the various normal
behaviors in these programs. Test and detection results show that the proposed approach
provides fast and accurate anomaly detection via context-aware behavior learning.
Chapter 6
Crowd Anomaly Detection
Prior work in crowd analytics spans a large number of research studies that
consider a wide range of analysis approaches. In this chapter, we describe our ap-
proach for anomaly detection in a crowd scene to demonstrate the effectiveness and
adaptability of our framework. In our implementation, we show how to generalize
a crowd analytics problem into a behavioral learning problem, while working with
sequential data.
Anomaly detection is one of the main problems in the crowd analysis domain
where a great deal of prior work has been presented. There are two main categories
of behavior learning: 1) object-based (detection-based), and 2) holistic-based. In this
thesis, we explore detection-based methods to extract features, since they provide
more accurate information about the targets in a scene, especially when the crowd
density is low. We have applied our sequence anomaly detection approach to identify
collective behaviors of pedestrians in a crowd.
An overview of our crowd anomaly detection framework design is presented in
Figure 6.1. Implementation of this framework is described next. First, the structure
Figure 6.1: Design of our crowd anomaly detection framework.
of experimentation video dataset is detailed. Second, the preprocessing and feature
extraction process is discussed. Third, a behavioral learning method is implemented
to train the HMMs. Finally, experimental results are presented.
6.1 Event Recognition Video Dataset
In video analysis, event recognition is defined as the process of monitoring and ana-
lyzing the events that occur in a video surveillance system in order to detect signs
of security problems [71]. It is important to mention that our research is not trying
to detect abnormal objects (e.g., abnormal object size, speed or direction) [95, 15].
Instead, we aim to define behavioral properties of people and recognize the events in
the video to detect abnormal events, such as a sudden dispersion, running or merging
of a large number of people [165, 160].
We presented a background study on crowd analysis in Section 2.2 and discussed
related work on behavioral learning for anomaly detection in a crowd scene in Sec-
tion 3.2. There is a large amount of research on feature extraction in the area of com-
puter vision. Since our goal is not to develop new feature extraction methods,
we have selected well-known techniques to extract features from a video dataset.
To drive evaluation of our implementation, we have identified benchmark datasets
available for this purpose. The characteristics of our targeted video are presented
in Table 2.2 (indicated in italics). The benchmark dataset used in our experi-
ments is PETS09 (Performance Evaluation of Tracking and Surveillance 2009) crowd
dataset [52, 111] which was recorded for the workshop at Whiteknights Campus,
University of Reading, UK. PETS09 is a publicly available benchmark video dataset
which is generated to address three different crowd analysis problems: 1) Dataset S1:
person count and density estimation, 2) Dataset S2: people tracking, 3) Dataset S3:
flow analysis and event recognition [49, 150]. The event recognition dataset (Dataset-
S3: High Level) contains four video records with timestamps 14-16, 14-27, 14-31 and
14-33. These video records contain one or more of the following set of events [52]:
• Walking: Moving at a “typical” walking pace.
• Running: Moving at a “typical” running pace.
• Evacuation: Rapid dispersion, multiple divergent flows.
• Local Dispersion: Multiple, localized, divergent flows.
• Crowd Formation-Gathering/Merging: Convergence of multiple flows.
• Crowd Dispersal-Splitting: Multiple divergent flows.
Two example frames from event recognition videos are shown in Figure 6.2. There
is a running event in both samples. The video specifications are: number of
frames (Nf): 1076, resolution (R): 768×576, frames per second (FPS): 7.
Figure 6.2: Dataset-S3 Events, frames 50 and 150 (left-to-right) [52].
In this illustrative example, we use view-1 video records as our single camera view.
If there are additional video sequences in a given video record, we name them A, B
or C appended to the corresponding video record name. The sequence names, time
intervals and their lengths are provided in Table 6.1.
Since the ground truth of events is not provided for the event recognition dataset,
we composed it manually. In order to ensure accuracy of our ground truth, we also
compared it with the results reported in prior papers that utilized the dataset [57, 26,
133]. In Table 6.2, we present the ground truth for the events in these video sequences,
identifying time intervals using frame numbers. In this table, columns represent the
video sequences and rows represent the events. We also defined an additional event,
which we label as loitering, since there are sequences of frames that include neither
running nor walking events.
Table 6.1: Video sequences in the PETS09 Dataset-S3: High Level.

Sequences   Time Intervals     Length
14-16 A     14-16 [0 107]      108
14-16 B     14-16 [108 198]    91
14-16 C     14-16 [199 222]    24
14-27 A     14-27 [0 184]      185
14-27 B     14-27 [185 333]    149
14-31       14-31 [0 130]      131
14-33 A     14-33 [0 310]      311
14-33 B     14-33 [311 377]    68
Table 6.2: Time intervals of the events in the PETS09 Dataset-S3: High Level.

EVENTS      14-16 A    14-16 B     14-16 C     14-27 A            14-27 B              14-31      14-33 A     14-33 B
Loitering   [-]        [-]         [-]         [0-94] [131-184]   [185-271] [301-333]  [-]        [181-310]   [311-337]
Walking     [0-39]     [108-174]   [-]         [95-130]           [271-300]            [0-130]    [0-180]     [-]
Running     [40-107]   [175-198]   [199-222]   [-]                [-]                  [-]        [-]         [338-377]
Evacuation  [-]        [-]         [-]         [-]                [-]                  [-]        [-]         [338-377]
Dispersion  [-]        [-]         [-]         [95-130]           [271-300]            [-]        [-]         [-]
Merging     [-]        [-]         [-]         [-]                [-]                  [-]        [0-180]     [-]
Splitting   [-]        [-]         [-]         [-]                [-]                  [60-130]   [-]         [-]
We detect and track people in the PETS09 Dataset-S3:High Level benchmark
videos in order to extract multi-object features for each frame. Next, we explain the
feature extraction processes in more detail.
6.2 Preprocessing
We performed a set of data preprocessing steps on each video dataset to prepare it for
training and testing tasks in our crowd anomaly detection framework. These steps
include: a) feature extraction from video, b) context learning, and c) windowing.
Next, we discuss each of these preprocessing steps in detail.
6.2.1 Feature Extraction from a Video
Feature extraction refers to a set of techniques that are applied to extract symbolic
discrete sequences from a video (frame sequence). These techniques include both
computer vision algorithms used to extract features, and machine learning algorithms
to classify the data in the form of a symbolic sequence.
In Figure 6.3, we present a generic view of a video sequence. Video, which is
a sequence of image frames, is a rich source of information. In order to extract
information from a frame and use it to drive our experiments, we need to apply
computer vision techniques. In our framework, we extract multi-object based features
for each frame in a video stream by tracking individuals in a crowd.
Object tracking is one of the most heavily researched areas in computer vision.
Although tracking is a research subject by itself, it is also used in a number of other
areas of research including object recognition, motion recognition, traffic monitoring,
and others. In our work we adopt a tracking by detection methodology [154] to track
multiple people in a video sequence. Then, we extract histogram features for each
frame and prepare these features to use in training. In our framework, we break fea-
ture extraction into four steps: 1) detection, 2) tracking, 3) projective transformation
and 4) feature histogram generation.
Object Detection
In order to detect an object, the most desirable property of a visual feature is its
uniqueness, especially if we want to distinguish objects using a feature space. For
Figure 6.3: Feature extraction from a video sequence.
this purpose we compute motion, size and direction features of the objects in every
frame.
We utilize blob detection by first computing a background subtraction. In our
framework, the first step is to discriminate between moving foreground objects and
the background. The background can be determined by computing the distribution
of the typical values each pixel takes, and the foreground can be detected by iden-
tifying pixel values that are sufficiently different from the background distribution.
Large connected foreground regions form blobs and can be considered as detected
objects [154].
Object Tracking
Tracking can be defined as an assignment process to connect and label a sequence
of objects across multiple image frames in a video. We use position, velocity and
size features of the objects in the video frame, and record object state information.
To perform tracking, we predict the next state of an object by using its current and
previous states via Kalman filtering [128, 37].
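A single predict/update cycle of such a constant-velocity Kalman tracker can be sketched as follows; the process and measurement noise magnitudes q and r are illustrative, not tuned values from our tracker:

```python
import numpy as np

def kalman_step(x, P, z, dt=1.0, q=1e-2, r=1.0):
    """State x = [px, py, vx, vy]; z is the measured blob centroid."""
    F = np.eye(4)                       # constant-velocity motion model
    F[0, 2] = F[1, 3] = dt
    H = np.eye(2, 4)                    # we observe position only
    Q, R = q * np.eye(4), r * np.eye(2)
    x = F @ x                           # predict
    P = F @ P @ F.T + Q
    y = np.asarray(z, float) - H @ x    # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ y                       # update
    P = (np.eye(4) - K @ H) @ P
    return x, P

x = np.array([0.0, 0.0, 1.0, 0.0])      # moving right at 1 px/frame
P = np.eye(4)
x, P = kalman_step(x, P, z=(1.1, 0.05)) # noisy centroid measurement
```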
Projective transformation - Homography
The tracking information we extract is not from a top-down view, since we are using a
single fixed camera and the camera views the surveillance area from an angle.
Using this data directly could be problematic when computing distance and speed
features. In order to accurately identify an object's position, we applied a projective
transformation technique on the tracked points [106]. A projective transformation,
which is also called homography, is a geometric transformation between two image
planes. Figure 6.4 presents the top-down view of the area and the camera position.
Figure 6.4: Top-down view of the surveillance area and camera position.
For each tracked point in each frame, we applied our projective transformation
and used the resulting transformed tracking information to extract feature
histograms. A reference image frame from the PETS09 event recognition dataset is
shown in Figure 6.5 (a) and its transformation is shown in Figure 6.5 (b).
(a) Original image frame. (b) Transformed image frame.
Figure 6.5: Projective transformation (homography) result of an original image frame from the PETS09 event recognition dataset.
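Applying the transformation to tracked points can be sketched as follows. The 3x3 matrix H is made up for illustration; in practice it is estimated from ground-plane reference points (e.g., with OpenCV's findHomography):

```python
import numpy as np

def apply_homography(H, points):
    """Lift (x, y) points to homogeneous coordinates, apply the 3x3
    projective transform H, and de-homogenize."""
    pts = np.hstack([np.asarray(points, float), np.ones((len(points), 1))])
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Illustrative homography (not the matrix estimated for PETS09)
H = np.array([[1.0, 0.2,   5.0],
              [0.0, 1.5,  -3.0],
              [0.0, 0.001, 1.0]])
top_down = apply_homography(H, [(100.0, 200.0), (50.0, 80.0)])
```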
Histogram Extraction
Although we can detect and track individuals, our aim is to analyze the collective be-
haviors of people from a sequence of frames. In this application, we extracted three
types of features, which are distance, velocity and direction, for each frame, based on
the extracted multi-object features. We generate three feature histograms from the
extracted features:
• Distance-based features: The position features generated by object tracking are
considered as nodes in a graph for each frame, and the distance between each
pair is computed to extract distance-based histograms for each frame. The
histogram is normalized by the number of distance values.
• Velocity-based features: For each object, the positional distance change is com-
puted using the current and the previous position in order to extract velocity-
based histograms, computed in terms of pixels. The histogram is divided into
5 equal bins (1: very slow, 2: slow, 3: normal, 4: fast, 5: very fast), cap-
turing the minimum and the maximum speed of people in the video dataset.
Then, the histogram is normalized by the number of people.
• Direction-based features: While computing the velocity of each object, the di-
rection of the position change is also extracted to generate direction-based his-
tograms for each frame. Direction of an object is determined to be in one of 8
bins in the range [0, 2π). A histogram is computed for each frame that includes
the relative number of people in each of the 8 bins.
The reason for extracting multiple types of features is so that we can cover comparable event
types for a range of applications. For example, while running or walking events can
be learned via velocity-based features, splitting, local dispersion and merging events
can only be learned if we include distance and/or direction-based features. Since the
defined events are not correlated to the number of people in the scene, we generated
normalized feature histograms to avoid any bias depending on the number of people
in a scene.
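As one concrete example of this construction, the direction-based histogram can be sketched as follows; the positions are illustrative, not tracker output:

```python
import math

def direction_histogram(prev_pos, curr_pos, n_bins=8):
    """Bin each object's displacement direction into one of n_bins bins
    over [0, 2*pi) and normalize by the number of objects."""
    hist = [0.0] * n_bins
    for (x0, y0), (x1, y1) in zip(prev_pos, curr_pos):
        angle = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
        hist[int(angle / (2 * math.pi) * n_bins)] += 1.0
    return [h / len(curr_pos) for h in hist]

# Three people moving right, one moving up (illustrative positions)
prev = [(0, 0), (1, 1), (2, 0), (5, 5)]
curr = [(1, 0), (2, 1), (3, 0), (5, 6)]
hist = direction_histogram(prev, curr)
```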
Symbolic Representation of Histograms
Our work aims to analyze the behavior of people using a sequence of frames. These
behaviors need to be distilled into a set of symbols. By generating feature histograms we
can convert the video data to a sequence of feature histograms. Instead of analyzing
feature histograms directly, we need to cluster these histograms to create a common
representation (symbol) for each similar frame. By representing each frame with a
symbol from a finite set of symbols, video analysis for the event recognition problem
becomes a problem of event recognition in a symbolic/discrete sequence. In order
to create such symbols, we used the x-means clustering algorithm with Euclidean
distances.
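A sketch of this symbolization step is shown below. As a simplification, a plain k-means loop with a fixed k stands in for x-means, which additionally estimates the number of clusters:

```python
import numpy as np

def symbolize(histograms, k, iters=20, seed=0):
    """Replace each frame's feature histogram with the index of its
    nearest centroid, yielding a discrete symbol sequence."""
    X = np.asarray(histograms, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Two visibly different frame "behaviors" collapse to two symbols
hists = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
symbols = symbolize(hists, k=2)
```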
6.2.2 Clustering for Context Learning
The PETS09 dataset was not originally designed to be used for testing anomaly
detection algorithms, but instead designed to test recognition of events. In order to
detect abnormous activities, first we need to define the anomalies that we are looking
for. We defined an abnormal event as a behavioral change according to a learned
behavior, such as sudden changes in velocity or direction of people. The problem we
address is detecting these kind of anomalies from a sequence of symbols.
As presented in Table 6.1, there are only 8 sequences (video records) in the entire
dataset. We can not use these sequences directly for training or testing for two reasons:
1) Number of sequences (8) is not enough to work with. 2) Training dataset is need
to include only normal sequences, and these 8 sequences are including abnormal time
intervals (event changes). Therefore, before starting context learning, we first need
to identify training and test sequences.
In previous work, researchers looked for various kinds of anomalies and applied
different approaches while selecting training and test sequences. For example, anoma-
lies can be detected in each sequence by training with a normal partition of the same
sequence [140]. The authors define normal partitions by manually selecting time in-
tervals that contain only normal events using their definition of an anomaly. We
follow a similar methodology, but instead of training and testing each sequence indi-
vidually, we partition the sequences to generate our training dataset. Our partitions
include multiple normal and abnormal behaviors. Partitioning is performed by di-
viding the sequences into non-overlapping smaller video sequences. In Table 6.3, we
present these smaller video partitions by assigning a sequence ID, and any partition
that includes an event change is labeled as abnormal. Only the partitions void of any
behavioral changes are considered as normal. Following this partitioning approach,
we obtain 33 sequences which include 24 normal and 9 abnormal sequences.
In our video analytics application, although the contexts can be identified manu-
ally by considering predefined events, we followed our context learning approach by
assuming we are given a set of symbolic sequences which we know are normal. As
discussed earlier, predefined events in the PETS09 dataset can be detected through
the type of features we have extracted. Since we are looking for changes in behavior to
detect them as anomalous, we want to first learn normal behaviors. Next we describe
our context learning process for each feature type.
• Velocity-based context learning: The symbolic sequences generated through the
velocity-based feature histograms can be used to learn normal behaviors for
loitering, walking and running events. While partitioning the velocity-based
symbolic sequences to generate a training dataset, we restrict our attention to
normal sequences for each of these behaviors. It is important to note that all of
these behaviors by themselves are normal. For example, while individuals that
are running do not constitute an abnormal event, this sequence is abnormal if
people suddenly start running after loitering or walking events.
Table 6.3: Non-overlapping sequence partitions generated from the PETS09 Dataset-S3: High Level.
Seq. ID   Sequence Partitions   Label
S1        14-16 A [0-30]        normal
S2        14-16 A [31-60]       abnormal
S3        14-16 A [61-90]       normal
S4        14-16 A [91-107]      normal
S5        14-16 B [108-140]     normal
S6        14-16 B [141-180]     abnormal
S7        14-16 B [181-198]     normal
S8        14-16 C [199-222]     normal
S9        14-27 A [0-40]        normal
S10       14-27 A [41-70]       normal
S11       14-27 A [71-110]      abnormal
S12       14-27 A [111-140]     abnormal
S13       14-27 A [141-184]     normal
S14       14-27 B [185-210]     normal
S15       14-27 B [211-250]     normal
S16       14-27 B [251-280]     abnormal
S17       14-27 B [281-310]     abnormal
S18       14-27 B [311-333]     normal
S19       14-31 [0-25]          normal
S20       14-31 [26-50]         normal
S21       14-31 [51-90]         abnormal
S22       14-31 [91-110]        normal
S23       14-31 [111-130]       normal
S24       14-33 A [0-50]        normal
S25       14-33 A [51-90]       normal
S26       14-33 A [91-130]      normal
S27       14-33 A [131-170]     normal
S28       14-33 A [171-200]     abnormal
S29       14-33 A [201-250]     normal
S30       14-33 A [251-310]     normal
S31       14-33 B [311-330]     normal
S32       14-33 B [331-350]     abnormal
S33       14-33 B [351-377]     normal
• Distance-based context learning: The symbolic sequences generated through the
distance-based feature histograms can be used to learn normal behavior for
evacuation, dispersion, merging and splitting events. For example, during an
evacuation event, all of the objects start close to each other, and then suddenly
the distances between objects become larger. While large distances between the
objects are not abnormal by themselves, the sequence becomes abnormal when
the distances increase suddenly.
• Direction-based context learning: The symbolic sequences generated through
the direction-based feature histograms can be used to learn normal behavior for
evacuation, dispersion, merging and splitting events. For example, in the case
of a splitting event, while the objects are moving in the same direction, they
will start to diverge in different directions. Although moving in different
directions is not deemed abnormal by itself, a sequence will become abnormal after we
observe such a change in behavior.
When analyzing the datasets presented in Tables 6.2 and 6.3, we see that they
contain loitering, walking and running events. We also see that there are transitions
in the sequence of events. In other words, people need to be in one of these three
states, which we refer to as velocity-based events. Therefore, we can first apply the
context learning process on the velocity-based symbolic sequences that we have generated.
Details of our context learning approach are provided in Chapter 4. In order to use
these sequences during the training task, we randomly select half of the normal
sequences from Table 6.4. In this table, we only include normal sequences where each
partition contains a single type of event (i.e., walking, running or loitering) over all
frames.
In this example, since the event types and the ground truth are already defined,
Table 6.4: Normal sequence partitions for velocity-based events.
EVENTS     Sequence Partitions
Loitering  S9, S10, S13, S14, S15, S18, S29, S30, S31
Walking    S1, S5, S19, S20, S22, S23, S24, S25, S26, S27
Running    S3, S4, S7, S8, S33
it is possible to evaluate the results of context learning as a velocity-based event
recognition. The applied context learning process generated three context clusters by
correctly grouping the sequences from the same event together.
We follow similar context learning steps for the distance- and direction-based symbolic
sequences. In distance-based context learning, the results are separated into three
clusters. In direction-based context learning, the results are separated into four clus-
ters. We trained an HMM for each context in a given symbolic sequence type. This
leads us to generate three HMMs for velocity-based sequences, three HMMs for
distance-based sequences, and four HMMs for direction-based sequences.
6.2.3 Windowing
Since the video sequences are not of equal length, and the number of sequences is
small and insufficient to properly train our model, we applied a windowing technique
to extract fixed length subsequences from the generated symbolic sequential data. In
order to extract subsequences, we slide a window of length k across each sequence.
While selecting a good k value, we considered the required number of frames needed to
capture event changes in the PETS09 event recognition videos. We manually selected
k = 10, since this is the minimum number of frames we needed in order to visually
differentiate between two different events in a sequence of frames.
6.3 Behavior Learning and Anomaly Detection
In order to learn behaviors, we need to train the HMMs. Basically, HMM training is
defined as finding the best-fit model parameters. These parameters include the state
transition and observation symbol probability distributions, which are specific for a
given set of observation data.
The same training sequences for context learning are used during the HMMs
training. A windowing technique is applied to these training sequences in order to
generate observation data for HMM training. For each feature type, we trained a
HMM for every context that was identified during the context learning process. For
example, we trained three HMMs for velocity-based symbolic sequences by seeding
each of them with one of the observation data sets extracted from the sequence
clusters.
In the evaluation of our crowd analysis framework, we applied 10-fold cross validation
during training and testing of the model. In the training phase, we defined
the dimensions and the observation sequences of the HMMs to estimate the model
parameters λ = {A, B, π}.
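The dimensions of λ = {A, B, π} can be sketched as follows. This is our own illustrative initialization, not the dissertation's implementation; `init_hmm` is a hypothetical helper, and actual training would re-estimate these parameters (e.g., with Baum-Welch):

```python
import numpy as np

def init_hmm(n_states, n_symbols, seed=0):
    """Initialize lambda = {A, B, pi} with random row-stochastic values.
    A:  state transition matrix (n_states x n_states)
    B:  symbol emission matrix (n_states x n_symbols)
    pi: initial state distribution (length n_states)"""
    rng = np.random.default_rng(seed)
    A = rng.random((n_states, n_states))
    B = rng.random((n_states, n_symbols))
    pi = rng.random(n_states)
    A /= A.sum(axis=1, keepdims=True)   # each row is a distribution
    B /= B.sum(axis=1, keepdims=True)
    pi /= pi.sum()
    return A, B, pi

# One model per context; for velocity-based sequences N = M = 8.
velocity_models = {ctx: init_hmm(n_states=8, n_symbols=8, seed=ctx)
                   for ctx in range(3)}
```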
The subsequences generated using the sliding window are used as the observation
sequences for the HMM. We set the window length T of the observation
sequence equal to k, the width of the sliding window. The number of observation sym-
bols M and the number of hidden states N are set equal to the number of unique
symbols in each sequence type. While clustering the histograms to capture their
symbolic representations, we use a unique number of symbols for each type of feature
histogram. Therefore, we need to assign M and N values for each sequence type,
according to the size of the symbolic alphabet, as listed below:
• Velocity-based symbolic sequences: |Σ| = 8 symbols
• Distance-based symbolic sequences: |Σ| = 15 symbols
• Direction-based symbolic sequences: |Σ| = 12 symbols
Next, we test and evaluate our ability to detect abnormal events in the subse-
quences.
6.4 Testing and Evaluation
In the testing phase, we followed all of the preprocessing steps we applied during
training. We can summarize these steps as follows:
• First, we extract three different features for each object in a frame: i) position,
ii) direction and iii) speed. After this point, every step is performed for each
feature type separately.
• Second, features from multiple objects are used to construct normalized feature
histograms for each frame, one for each feature type. This provides us with
three histograms for each frame, and three histogram sequences for each video.
• Third, we classify the histograms to aggregate the similar feature histograms
into a single symbolic value within the symbolic alphabet which was identified
in the training phase. Then, we generate three symbolic sequences for each
sequence of frames by representing feature histograms with their corresponding
symbols.
• Fourth, we attempt to identify contexts in the test sequences by classifying them
according to the first l symbols in the sequence. We previously discussed how we
determine the required length for this classification in Chapter 4. The values
for our three sequence types are l_velocity = 6 symbols, l_distance = 12 symbols,
l_direction = 11 symbols.
• Fifth, a windowing technique is applied to generate fixed-length subsequences
from the videos, which are then used during testing. It is important to note that
we are applying this subsequence extraction process within each context, only
after the context recognition step. In this way, we are careful not to mix
subsequences across contexts: all subsequences generated from a sequence are tested
within the same HMM.
• Sixth, trained HMMs are used for the evaluation of subsequences (observa-
tion sequences). The evaluation of an observation sequence can be defined as
computing the probability of observing that sequence from a given HMM. This
probability is based on computing the log-likelihood of the observation sequence
by applying the forward algorithm [114]. For each subsequence in the test dataset,
these log-likelihood values are used as decision values to detect if there is an
anomaly.
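The evaluation in the sixth step can be sketched with the scaled forward algorithm. This is a generic textbook implementation, not the dissertation's code, and the decision threshold is an assumed parameter:

```python
import numpy as np

def forward_log_likelihood(obs, A, B, pi):
    """Log-likelihood log P(obs | lambda) via the scaled forward algorithm.
    obs is a list of symbol indices; A, B, pi form lambda = {A, B, pi}."""
    alpha = pi * B[:, obs[0]]            # initialization
    log_prob = np.log(alpha.sum())
    alpha /= alpha.sum()                 # rescale to avoid underflow
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]   # induction step
        c = alpha.sum()
        log_prob += np.log(c)
        alpha /= c
    return log_prob
```

A test subsequence would then be flagged as anomalous when its log-likelihood under the corresponding context's HMM falls below a decision threshold.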
Next, we present results for abnormal event detection in three types of sequences.
Velocity-based Context-aware Subsequence Test
In Figure 6.6, we show the ROC curve for 10-fold cross validation, first without
considering any context. In these results, the video sequences are represented using
only the symbols generated through the velocity-based features, and only one HMM
is trained for all the sequences.
Figure 6.6: ROC curve for the velocity-based symbolic sequences, training and testing with only a single model.
In Figure 6.7 (a), (b) and (c), we show the results from the same symbolic
sequences, but this time contexts are identified using our context learning approach.
Each subsequence is tested within the model generated from samples from the corre-
sponding context.
Direction-based Context-aware Subsequence Test
We present results for direction-based symbolic sequences without contexts in
Figure 6.8, and with contexts in Figure 6.9 (a), (b), (c) and (d).
Distance-based Context-aware Subsequence Test
Anomaly detection ROC curves obtained for the distance-based symbolic sequences
are shown in Figure 6.10 when using a single model and in Figure 6.11 (a), (b) and (c)
when using multiple normal models, one for each context.
(a) Context-1 test set. (b) Context-2 test set.
(c) Context-3 test set.
Figure 6.7: ROC curves for velocity-based symbolic sequences, training and testing on a per-context basis.
Figure 6.8: The ROC curve for direction-based symbolic sequences, training and testing using only a single model.
6.5 Anomaly Detection
Our proposed method for crowd anomaly detection depends on the evaluation of
symbolic sequences generated from three different features of the people in the crowd.
In the previous section, we presented results generated by applying our methodology to
the generated subsequences. In this section, we present the detection results for the
video sequences given in Table 6.3. In order to evaluate a video sequence, we use the
anomaly scores of its subsequences and apply an exponentially weighted moving average (EWMA) as a filter.
A high-level design of the entire crowd anomaly detection system is presented in
Figure 6.12. Since there are three symbolic sequences for each short video, we evaluated
results for each of them individually, and then combined the results into a final anomaly
detection result. We consider a sequence abnormal if it is detected as abnormal by any
of the three individual evaluations.
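The filtering and OR-combination described above can be sketched as follows. This is our own illustration: we assume higher scores mean more anomalous (e.g., negative log-likelihoods), and the per-feature thresholds are hypothetical tuning parameters:

```python
import numpy as np

def ewma(scores, alpha=0.3):
    """Exponentially weighted moving average of per-subsequence scores."""
    out = np.empty(len(scores))
    out[0] = scores[0]
    for t in range(1, len(scores)):
        out[t] = alpha * scores[t] + (1 - alpha) * out[t - 1]
    return out

def video_is_abnormal(scores_by_feature, thresholds, alpha=0.3):
    """OR-combination: a video is abnormal if the smoothed score for ANY
    of the three feature types crosses that feature's threshold."""
    return any(ewma(s, alpha).max() > thresholds[f]
               for f, s in scores_by_feature.items())
```

The EWMA smooths out isolated noisy scores, so a single spurious subsequence score is less likely to trigger an alarm than a sustained run of high scores.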
(a) Context-1 test set. (b) Context-2 test set.
(c) Context-3 test set. (d) Context-4 test set.
Figure 6.9: ROC curves for direction-based symbolic sequences, training and testing on a per-context basis.
Figure 6.10: ROC curve for distance-based symbolic sequences, training and testing with only a single model.
6.6 Summary of Crowd Analysis
In this thesis, we proposed a new method to detect anomalies in a crowded scene. We
have characterized crowd behavior by extracting three different symbolic sequences
based on three feature types. We targeted predefined events in the PETS09 event
recognition dataset. The features extracted were based on the velocity, distance and
direction of individuals present in each frame. Since anomalies have a wide range of
different characteristics, these features may be insufficient to detect all possible ab-
normal activities. For example, although the velocity-based symbolic sequences are
very helpful to learn different crowd behaviors, our results show that in the PETS09
dataset, distance-based and direction-based features are better predictors of abnor-
mal events. Therefore, we analyzed the video sequences by considering each of these
sequences separately to achieve a final anomaly detection result. In the end, we com-
bined the test results from all three symbolic sequences. If any of these tests detect
an anomaly, we labeled that subsequence as abnormal.
(a) Context-1 test set. (b) Context-2 test set.
(c) Context-3 test set.
Figure 6.11: ROC curves for distance-based symbolic sequences, training and testing on a per-context basis.
Figure 6.12: High-level design of crowd anomaly detection framework.
A major challenge in crowd analysis is the generation of ground truth for the video
dataset. Since the PETS09 event recognition dataset only includes video frames, and
ground truth is not provided for either tracking or event recognition, we had
to generate the ground truth manually. The ground-truth extraction for detection
and tracking of multiple objects was performed using an annotation tool presented by
Dollar et al. [44, 45, 42]. This tool makes video annotation efficient by
interpolating the detection points in the intermediate frames between two manually
selected frames. Using this tool, we extracted the ground truth for tracking each
object semi-manually.
In this experiment, we show that it is possible to detect crowd anomalies from
symbolic sequential representations of frames. With the illustrative example provided
in this thesis, we also showed the generality of our anomaly detection framework on
symbolic sequences. Our experimental results show that context learning achieves
significantly better detection performance than a no-context approach for the anomaly
detection task on symbolic sequences.
Chapter 7
Summary and Conclusion
This thesis has presented a new approach for improving anomaly detection on sequen-
tial data. The thesis focuses on a specific set of techniques targeting the detection of
anomalous behavior in a discrete, symbolic, and sequential dataset.
In Chapter 1, we motivated and outlined the goals of this thesis. We also
highlighted some of the key challenges of performing anomaly detection on
symbolic sequences. In Chapter 2, we presented background information for anomaly
detection algorithms and discussed the two application domains we focused on in this
thesis: system call intrusion detection and crowd analytics. In Chapter 3, we
surveyed related work for the two application domains in terms of sequence anomaly detection
and motivated our work on context-based learning as it applies in different applica-
tion domains. In Chapter 4, we described our context learning approach as applied
on symbolic sequential data. In Chapter 5, we presented our system call anomaly
detection framework for Host-Based Intrusion Detection. In Chapter 6, we presented
our crowd anomaly detection framework for an automatic crowd surveillance system.
In this chapter, we summarize the contributions of this thesis and provide direc-
tions for future work.
7.1 Contributions of the Thesis
In our anomaly detection design, we focus on the basic research issues arising in
sequential data analysis, and claim that the anomaly detection task on symbolic se-
quential data can be improved by first identifying multiple normal behaviors from
a data source. The results of our illustrative examples indicate that sufficient in-
formation exists in symbolic sequences to model multi-class normal behavior of a
data source. The key to extracting this valuable information is to apply the proper
partitioning method.
In summary, this thesis enhances the current state-of-the-art and makes key con-
tributions to the following areas:
• Machine learning / data mining: We have devised a novel, effective, and generic
context-aware anomaly detection framework for discrete sequential data by
adapting sophisticated machine learning algorithms.
• Application domains: We applied our novel anomaly detection approach to two
different application domains, each of which is critical to security, cyber and
physical in nature.
The framework presented in this thesis analyzes symbolic sequential data sets by
learning normal behavior models during the training phase, and detecting abnormal
events in the testing phase. We divide the training phase into two stages: context
learning and normal behavior learning for each context. In the first stage, the symbolic
sequences in a dataset are clustered. In the second stage, a HMM is trained to model
each cluster. As in the training, the testing phase is also divided into two stages. In
the first stage, sequences are classified into one of the previously identified contexts.
In the second stage, the corresponding trained HMM is used to identify how likely
the test sequence can be considered normal.
Our results show that our context-aware anomaly detection approach, when run
on symbolic sequences, achieves significantly better detection performance than a
no-context approach for the datasets tested in each domain. Furthermore, we addressed several issues in the design
of a practical anomaly detection system when working with symbolic sequences. We
considered the impact of context clustering, window size selection, and filtering on
our detection results.
This thesis makes contributions to the field of machine learning by proposing a
novel approach to parameter estimation for sequential data clustering during context
learning. This approach includes two steps: 1) estimating the number of natural
clusters, 2) estimating the minimum required subsequence length to be able to classify
a sequence accurately. In the first step, the idea is to identify similar sequences using
their frequency distributions to estimate the number of clusters in a dataset. In the
second step, the idea is to perform this natural clustering by computing the distances
between sequences with a combination of string similarity measures. The second step
estimates the number of symbols that are needed to perform accurate analysis on a
stream of sequential symbols. The advantages of performing context learning with
only a portion of a sequence can be seen in the testing phase: it is critical to detect
anomalies while the event is occurring, in both cyber and physical security.
Next, we present future research directions that this work can take.
7.2 Future Work
In this dissertation, we have designed a context-aware anomaly detection framework
for symbolic sequences. We anticipate this thesis will spawn a number of future
research threads. First of all, our current work explores anomaly detection on two
application domains, but there are many other sequential data applications that could
benefit from our approach. We see DNA sequencing, spam filtering, sales analytics,
and social network analytics as promising future avenues of research on
sequential data.
Our work in unsupervised context learning depends on estimating two parameters:
the number of clusters k and the required subsequence length l. The proposed
methodology is effective for grouping similar sequences. However, our approach comes
with a few limitations which can be addressed in future work. First, the current
method estimates the number of clusters k from frequency distributions, which can
potentially hide the sequential nature of the data. Future work could focus on estimating both k and
l simultaneously. An improved method may estimate these parameters in an incre-
mental way by applying a model selection criterion such as BIC (Bayesian information criterion) [124] or AIC (Akaike
information criterion) [112].
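For instance, BIC trades off model fit against model size as BIC = p·ln(n) − 2·ln(L̂), where p is the number of free parameters, n the sample count, and L̂ the maximized likelihood. A minimal sketch of such selection follows; the `fit_model` callback is a hypothetical stand-in for fitting a clustering with k clusters and returning its log-likelihood, and using p = k is a simplification for illustration:

```python
import numpy as np

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; lower values are preferred."""
    return n_params * np.log(n_samples) - 2.0 * log_likelihood

def select_k(candidate_ks, fit_model, n_samples):
    """Pick the cluster count whose fitted model minimizes BIC."""
    scores = {k: bic(fit_model(k), n_params=k, n_samples=n_samples)
              for k in candidate_ks}
    return min(scores, key=scores.get)
```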
Second, the current method used to measure distances between sequences may
be prone to noise, since it is based on string-based similarity metrics. The approach
records exact position matches and the longest common consecutive subsequence.
Future work could include improving the distance computation method by considering
alignment-based similarity measures.
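The two ingredients of the current distance computation can be sketched as follows (the exact weighting used in the thesis is not reproduced; the function names are ours):

```python
def position_matches(s, t):
    """Count exact position-wise symbol matches between two sequences."""
    return sum(a == b for a, b in zip(s, t))

def longest_common_substring(s, t):
    """Length of the longest common consecutive subsequence (substring),
    computed by dynamic programming over symbol positions."""
    best = 0
    prev = [0] * (len(t) + 1)
    for ch in s:
        cur = [0] * (len(t) + 1)
        for j, ch_t in enumerate(t, 1):
            if ch == ch_t:
                cur[j] = prev[j - 1] + 1   # extend the common run
                best = max(best, cur[j])
        prev = cur
    return best
```

Both measures are position-sensitive, which is why a single inserted symbol can shift every later position and inflate the distance; an alignment-based measure would be more tolerant of such insertions and deletions.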
In future work, we also would like to investigate:
System call anomaly detection:
• Additional information: In our current work, each unique system call in a program
execution trace is represented with a symbol from a finite set of symbols.
Future work may include extending our method by incorporating the system
call attributes. Considering these attributes would help to guide a more robust
context learning approach, and could lead to improved detection rates.
Crowd anomaly detection:
• Feature extraction: Our current feature extraction approach is based on detec-
tion of multiple objects and tracking those objects in every frame. The current
tracking methodology may be improved by investigating more sophisticated de-
tection and tracking algorithms. In order to detect abnormal events in highly
crowded scenes, we may be able to extract holistic features without applying
a detection methodology [101].
• Symbol representation: An issue we would face when applying our work in a
real-world setting would be the generation of symbolic representations of feature
histograms while testing with new data. In the current work, since we were
provided with training and testing data, we could apply clustering to assign
a symbol for each feature histogram in a data preparation step. Therefore,
we could generate symbolic sequences from the same symbolic alphabet for
both the training and test data. In a live system scenario, there would be
some feature histograms which are not very similar to any feature histogram
clusters. Although this would be a sign of a new event or an anomaly, we still
want to evaluate this sequence of symbols to detect abnormal events. There
are two ways to address this problem: 1) We can classify these new histograms
by selecting the most similar cluster and assigning its symbol from the current
alphabet. If the assigned symbol is wrong, this might degrade the entire
anomaly detection system, since our approach depends on learning behaviors
from a sequence of symbols. 2) We could define a threshold value for symbol
assignment: if none of the histogram clusters is close to the incoming sample,
we create a new symbol. In this way, since there are new symbols in the test
sequence, the anomaly detection system would directly produce an alarm. In this
case, a system administrator needs to decide whether this event is an anomaly or a
new normal event that was missed during training. If the sample is deemed
normal, then the system needs to be retrained to adapt to this new behavior. As future
work, we want to improve our framework by considering this problem in a live
streaming scenario.
• Additional events: In our current work, the number of events is somewhat
limited, and depends on the benchmark dataset we have used. Future work
can explore how we can apply anomaly detection to a wider variety of events
occurring in sequential data.
Bibliography
[1] Adomavicius, G., and Tuzhilin, A. Context-aware recommender systems.
In Recommender systems handbook. Springer, 2011, pp. 217–253.
[2] Adomavicius, G., and Tuzhilin, A. Context-aware recommender systems.
In Recommender systems handbook. Springer, 2011, pp. 217–253.
[3] Aggarwal, C. C., and Zhai, C. A survey of text clustering algorithms. In
Mining Text Data. Springer, 2012, pp. 77–128.
[4] Aizawa, A. An information-theoretic perspective of tf–idf measures. Informa-
tion Processing & Management 39, 1 (2003), 45–65.
[5] Ali, S., and Shah, M. Floor fields for tracking in high density crowd scenes.
In Computer Vision–ECCV 2008. Springer, 2008, pp. 1–14.
[6] Aloysius, G., and Binu, D. An approach to products placement in super-
markets using prefixspan algorithm. Journal of King Saud University-Computer
and Information Sciences 25, 1 (2013), 77–87.
[7] Alpaydin, E. Introduction to machine learning. MIT press, 2004.
[8] Andrade, E., Blunsden, O., and Fisher, R. Performance analysis of
event detection models in crowded scenes. In Visual Information Engineering,
2006. VIE 2006. IET International Conference on (2006), IET, pp. 427–432.
[9] Andrade, E. L., Blunsden, S., and Fisher, R. B. Hidden markov models
for optical flow analysis in crowds. In Pattern Recognition, 2006. ICPR 2006.
18th International Conference on (2006), vol. 1, IEEE, pp. 460–463.
[10] Andrade, E. L., Blunsden, S., and Fisher, R. B. Modelling crowd
scenes for event detection. In Pattern Recognition, 2006. ICPR 2006. 18th
International Conference on (2006), vol. 1, IEEE, pp. 175–178.
[11] Axelsson, S. Intrusion detection systems: A survey and taxonomy. Tech.
rep., 2000.
[12] Bace, R., and Mell, P. Nist special publication on intrusion detection
systems. Tech. rep., DTIC Document, 2001.
[13] Baldauf, M., Dustdar, S., and Rosenberg, F. A survey on context-
aware systems. International Journal of Ad Hoc and Ubiquitous Computing 2,
4 (2007), 263–277.
[14] Bardram, J. E. Applications of context-aware computing in hospital work:
examples and design principles. In Proceedings of the 2004 ACM symposium on
Applied computing (2004), ACM, pp. 1574–1579.
[15] Basharat, A., Gritai, A., and Shah, M. Learning object motion patterns
for anomaly detection and improved object detection. In Computer Vision and
Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (2008), IEEE,
pp. 1–8.
[16] Bazire, M., and Brezillon, P. Understanding context before using it. In
Modeling and using context. Springer, 2005, pp. 29–40.
[17] Boiman, O., and Irani, M. Detecting irregularities in images and in video.
International Journal of Computer Vision 74, 1 (2007), 17–31.
[18] Borji, A., Frintrop, S., Sihite, D. N., and Itti, L. Adaptive object
tracking by learning background context. In Computer Vision and Pattern
Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference
on (2012), IEEE, pp. 23–30.
[19] Bradley, A. P. The use of the area under the roc curve in the evaluation of
machine learning algorithms. Pattern recognition 30, 7 (1997), 1145–1159.
[20] Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of
Markov Chain Monte Carlo. Taylor & Francis US, 2011.
[21] Brostow, G. J., and Cipolla, R. Unsupervised bayesian detection of
independent motion in crowds. In Computer Vision and Pattern Recognition,
2006 IEEE Computer Society Conference on (2006), vol. 1, IEEE, pp. 594–601.
[22] Campbell, W., Campbell, J., Reynolds, D., Jones, D., and Leek, T.
Phonetic speaker recognition with support vector machines. In Advances in
Neural Information Processing Systems (2004).
[23] Candamo, J., Shreve, M., Goldgof, D. B., Sapper, D. B., and
Kasturi, R. Understanding transit scenes: A survey on human behavior-
recognition algorithms. Intelligent Transportation Systems, IEEE Transactions
on 11, 1 (2010), 206–224.
[24] Caswell, B., Beale, J., and Baker, A. Snort Intrusion Detection and
Prevention Toolkit. Syngress, 2007.
[25] Challenger, R., Clegg, C., and Robinson, M. Understanding crowd
behaviours, vol. 1: Practical guidance and lessons identified. London: The
Stationery Office (TSO) (2010).
[26] Chan, A. B., Morrow, M., and Vasconcelos, N. Analysis of crowded
scenes using holistic properties. In Performance Evaluation of Tracking and
Surveillance workshop at CVPR (2009), pp. 101–108.
[27] Chan, A. B., and Vasconcelos, N. Mixtures of dynamic textures. In
Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on
(2005), vol. 1, IEEE, pp. 641–647.
[28] Chan, K., Lee, T.-W., Sample, P. A., Goldbaum, M. H., Weinreb,
R. N., and Sejnowski, T. J. Comparison of machine learning and traditional
classifiers in glaucoma diagnosis. Biomedical Engineering, IEEE Transactions
on 49, 9 (2002), 963–974.
[29] Chan, K.-P., and Fu, A.-C. Efficient time series matching by wavelets. In
Data Engineering, 1999. Proceedings., 15th International Conference on (1999),
IEEE, pp. 126–133.
[30] Chandola, V., Banerjee, A., and Kumar, V. Outlier detection: A
survey. Tech. Rep. 07-017, Department of Computer Science, University of
Minnesota, 2007.
[31] Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A
survey. ACM Computing Surveys (CSUR) 41, 3 (2009), 1–58.
[32] Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection for
discrete sequences: A survey. Knowledge and Data Engineering, IEEE Trans-
actions on 24, 5 (2012), 823–839.
[33] Chen, D.-Y., and Huang, P.-C. Motion-based unusual event detection in
human crowds. Journal of Visual Communication and Image Representation
22, 2 (2011), 178–186.
[34] Chen, W.-H., Hsu, S.-H., and Shen, H.-P. Application of svm and ann
for intrusion detection. Computers & Operations Research 32, 10 (2005), 2617–
2634.
[35] Cheung, S.-C. S., and Kamath, C. Robust techniques for background
subtraction in urban traffic video. In Proceedings of SPIE (2004), vol. 5308,
pp. 881–892.
[36] Chong, X., Liu, W., Huang, P., and Badler, N. I. Hierarchical crowd
analysis and anomaly detection. Journal of Visual Languages & Computing
(2013).
[37] Comaniciu, D., Ramesh, V., and Meer, P. Kernel-based object tracking.
Pattern Analysis and Machine Intelligence, IEEE Transactions on 25, 5 (2003),
564–577.
[38] Debar, H., Dacier, M., and Wespi, A. Towards a taxonomy of intrusion-
detection systems. Computer Networks 31, 8 (1999), 805–822.
[39] Debar, H., Dacier, M., and Wespi, A. A revised taxonomy for intrusion-
detection systems. In Annales des telecommunications (2000), vol. 55, Springer,
pp. 361–378.
[40] Dehghan, A., Idrees, H., Zamir, A. R., and Shah, M. Keynote: au-
tomatic detection and tracking of pedestrians in videos with various crowd
densities, 2011.
[41] Denning, D. E. An intrusion-detection model. Software Engineering, IEEE
Transactions on, 2 (1987), 222–232.
[42] Dollar, P. Piotr's Image and Video Matlab Toolbox (PMT). http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.
[43] Dollar, P., Rabaud, V., Cottrell, G., and Belongie, S. Behavior
recognition via sparse spatio-temporal features. In 2005 IEEE International
Workshop on Visual Surveillance and Performance Evaluation of Tracking and
Surveillance (2005), IEEE, pp. 65–72.
[44] Dollar, P., Wojek, C., Schiele, B., and Perona, P. Pedestrian detec-
tion: A benchmark. In CVPR (2009).
[45] Dollar, P., Wojek, C., Schiele, B., and Perona, P. Pedestrian detec-
tion: An evaluation of the state of the art. PAMI 34 (2012).
[46] Dourish, P. What we talk about when we talk about context. Personal and
ubiquitous computing 8, 1 (2004), 19–30.
[47] Du, Y., Wang, H., and Pang, Y. A hidden markov models-based anomaly
intrusion detection method. In Intelligent Control and Automation, 2004.
WCICA 2004. Fifth World Congress on (2004), vol. 5, IEEE, pp. 4348–4351.
[48] Ektefa, M., Memar, S., Sidi, F., and Affendey, L. S. Intrusion detec-
tion using data mining techniques. In Information Retrieval & Knowledge Man-
agement,(CAMP), 2010 International Conference on (2010), IEEE, pp. 200–
203.
[49] Ellis, A., Shahrokni, A., and Ferryman, J. M. Pets2009 and winter-pets
2009 results: A combined evaluation. In Performance Evaluation of Tracking
and Surveillance (PETS-Winter), 2009 Twelfth IEEE International Workshop
on (2009), IEEE, pp. 1–8.
[50] Eskin, E., Lee, W., and Stolfo, S. J. Modeling system calls for intru-
sion detection with dynamic window sizes. In DARPA Information Survivabil-
ity Conference & Exposition II, 2001. DISCEX’01. Proceedings (2001), vol. 1,
IEEE, pp. 165–175.
[51] Fawcett, T. An introduction to roc analysis. Pattern recognition letters 27,
8 (2006), 861–874.
[52] Ferryman, J., and Ellis, A. Pets2010: Dataset and challenge. In Advanced
Video and Signal Based Surveillance (AVSS), 2010 Seventh IEEE International
Conference on (2010), IEEE, pp. 143–150.
[53] Figueiredo, M. A., and Jain, A. K. Unsupervised learning of finite mixture
models. Pattern Analysis and Machine Intelligence, IEEE Transactions on 24,
3 (2002), 381–396.
[54] Forrest, S., Hofmeyr, S. A., Somayaji, A., and Longstaff, T. A. A
sense of self for unix processes. In Security and Privacy, 1996. Proceedings.,
1996 IEEE Symposium on (1996), IEEE, pp. 120–128.
[55] Fraley, C., and Raftery, A. E. How many clusters? which clustering
method? answers via model-based cluster analysis. The computer journal 41,
8 (1998), 578–588.
[56] Fu, T.-c. A review on time series data mining. Engineering Applications of
Artificial Intelligence 24, 1 (2011), 164–181.
[57] Garate, C., Bilinsky, P., and Bremond, F. Crowd event recognition
using hog tracker. In Performance Evaluation of Tracking and Surveillance
(PETS-Winter), 2009 Twelfth IEEE International Workshop on (2009), IEEE,
pp. 1–6.
[58] Ge, W., Collins, R. T., and Ruback, R. B. Vision-based analysis of
small groups in pedestrian crowds. Pattern Analysis and Machine Intelligence,
IEEE Transactions on 34, 5 (2012), 1003–1016.
[59] Hariri, N., Mobasher, B., and Burke, R. Context-aware music recom-
mendation based on latenttopic sequential patterns. In Proceedings of the sixth
ACM conference on Recommender systems (2012), ACM, pp. 131–138.
[60] Heady, R., Luger, G., Maccabe, A., and Servilla, M. The architecture
of a network-level intrusion detection system. Department of Computer Science,
College of Engineering, University of New Mexico, 1990.
[61] Herranz, J., Nin, J., and Sole, M. Optimal symbol alignment distance: a
new distance for sequences of symbols. Knowledge and Data Engineering, IEEE
Transactions on 23, 10 (2011), 1541–1554.
[62] Hervieu, A., Bouthemy, P., and Le Cadre, J.-P. A statistical video
content recognition method using invariant features on object trajectories. Cir-
cuits and Systems for Video Technology, IEEE Transactions on 18, 11 (2008),
1533–1543.
[63] Hoang, X., and Hu, J. An efficient hidden markov model training scheme
for anomaly intrusion detection of server applications based on system calls. In
Networks, 2004.(ICON 2004). Proceedings. 12th IEEE International Conference
on (2004), vol. 2, IEEE, pp. 470–474.
[64] Hoang, X. D., Hu, J., and Bertok, P. A multi-layer model for anomaly
intrusion detection using program sequences of system calls. In Networks,
ICON2003. The 11th IEEE International Conference on (2003), IEEE, pp. 531–
536.
[65] Hodge, V. J., and Austin, J. A survey of outlier detection methodologies.
Artificial Intelligence Review 22, 2 (2004), 85–126.
[66] Hofmeyr, S. A., Forrest, S., and Somayaji, A. Intrusion detection using
sequences of system calls. Journal of computer security 6, 3 (1998), 151–180.
[67] Hong, J.-y., Suh, E.-h., and Kim, S.-J. Context-aware systems: A litera-
ture review and classification. Expert Systems with Applications 36, 4 (2009),
8509–8522.
[68] Hu, J., Yu, X., Qiu, D., and Chen, H.-H. A simple and efficient hidden
markov model scheme for host-based anomaly intrusion detection. Network,
IEEE 23, 1 (2009), 42–47.
[69] Hu, W., Liao, Y., and Vemuri, V. R. Robust anomaly detection using
support vector machines. In Proceedings of the international conference on
machine learning (2003), pp. 282–289.
[70] Hu, W., Xiao, X., Fu, Z., Xie, D., Tan, T., and Maybank, S. A
system for learning statistical motion patterns. Pattern Analysis and Machine
Intelligence, IEEE Transactions on 28, 9 (2006), 1450–1464.
[71] Jacques Junior, J. C. S., Musse, S. R., and Jung, C. R. Crowd analysis
using computer vision techniques. Signal Processing Magazine, IEEE 27, 5
(2010), 66–77.
[72] Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition
Letters 31, 8 (2010), 651 – 666.
[73] Jain, A. K., and Dubes, R. C. Algorithms for clustering data. Prentice-Hall,
Inc., 1988.
[74] Jiang, F., Wu, Y., and Katsaggelos, A. K. Abnormal event detection
based on trajectory clustering by 2-depth greedy search. In Acoustics, Speech
and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on
(2008), IEEE, pp. 2129–2132.
[75] Kang, D.-K., Fuller, D., and Honavar, V. Learning classifiers for misuse
and anomaly detection using a bag of system calls representation. In Informa-
tion Assurance Workshop, 2005. IAW’05. Proceedings from the Sixth Annual
IEEE SMC (2005), IEEE, pp. 118–125.
[76] Kaufman, L., and Rousseeuw, P. J. Finding groups in data: an introduc-
tion to cluster analysis. Wiley series in probability and mathematical statistics
(2005).
[77] Keogh, E., Lonardi, S., and Chiu, B.-c. Finding surprising patterns in a
time series database in linear time and space. In Proceedings of the eighth ACM
SIGKDD international conference on Knowledge discovery and data mining
(2002), ACM, pp. 550–556.
[78] Khreich, W., Granger, E., Sabourin, R., and Miri, A. Combining
hidden markov models for improved anomaly detection. In Communications,
2009. ICC’09. IEEE International Conference on (2009), IEEE, pp. 1–6.
[79] Kinde, I., Wu, J., Papadopoulos, N., Kinzler, K. W., and Vogel-
stein, B. Detection and quantification of rare mutations with massively paral-
lel sequencing. Proceedings of the National Academy of Sciences 108, 23 (2011),
9530–9535.
[80] Klaser, A., Marszałek, M., Schmid, C., and Zisserman, A. Human
focused action localization in video. In Trends and Topics in Computer Vision.
Springer, 2012, pp. 219–233.
[81] Kohavi, R., et al. A study of cross-validation and bootstrap for accuracy
estimation and model selection. In IJCAI (1995), vol. 14, pp. 1137–1145.
[82] Kratz, L., and Nishino, K. Anomaly detection in extremely crowded scenes
using spatio-temporal motion pattern models. In Computer Vision and Pattern
Recognition, 2009. CVPR 2009. IEEE Conference on (2009), IEEE, pp. 1446–
1453.
[83] Kratz, L., and Nishino, K. Going with the flow: pedestrian efficiency in
crowded scenes. In Computer Vision–ECCV 2012. Springer, 2012, pp. 558–572.
[84] Kratz, L., and Nishino, K. Tracking pedestrians using local spatio-temporal
motion patterns in extremely crowded scenes. Pattern Analysis and Machine
Intelligence, IEEE Transactions on 34, 5 (2012), 987–1002.
[85] Kruegel, C., Mutz, D., Valeur, F., and Vigna, G. On the detection
of anomalous system call arguments. In Computer Security–ESORICS 2003.
Springer, 2003, pp. 326–343.
[86] Lan, T., Wang, Y., Yang, W., Robinovitch, S. N., and Mori, G.
Discriminative latent models for recognizing contextual group activities. Pattern
Analysis and Machine Intelligence, IEEE Transactions on 34, 8 (2012), 1549–
1562.
[87] Lee, W., and Xiang, D. Information-theoretic measures for anomaly de-
tection. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE
Symposium on (2001), IEEE, pp. 130–143.
[88] Liao, Y., and Vemuri, V. R. Use of k-nearest neighbor classifier for intrusion
detection. Computers & Security 21, 5 (2002), 439–448.
[89] Liao, Y., and Vemuri, V. R. Using text categorization techniques for in-
trusion detection. In USENIX Security Symposium (2002), vol. 12, pp. 51–59.
[90] Lin, J., Keogh, E., Lonardi, S., and Patel, P. Finding motifs in time
series. In Proc. of the 2nd Workshop on Temporal Data Mining (2002), pp. 53–
68.
[91] Liu, G., McDaniel, T. K., Falkow, S., and Karlin, S. Sequence anoma-
lies in the cag7 gene of the helicobacter pylori pathogenicity island. Proceedings
of the National Academy of Sciences 96, 12 (1999), 7011–7016.
[92] Liu, X., and Aberer, K. Soco: a social network aided context-aware recom-
mender system. In Proceedings of the 22nd international conference on World
Wide Web (2013), International World Wide Web Conferences Steering Com-
mittee, pp. 781–802.
[93] Lobo, J. M., Jimenez-Valverde, A., and Real, R. AUC: a misleading
measure of the performance of predictive distribution models. Global Ecology
and Biogeography 17, 2 (2008), 145–151.
[94] Maggi, F., Matteucci, M., and Zanero, S. Detecting intrusions through
system call sequence and argument analysis. Dependable and Secure Computing,
IEEE Transactions on 7, 4 (2010), 381–395.
[95] Mahadevan, V., Li, W., Bhalodia, V., and Vasconcelos, N. Anomaly
detection in crowded scenes. In Computer Vision and Pattern Recognition
(CVPR), 2010 IEEE Conference on (2010), IEEE, pp. 1975–1981.
[96] Markou, M., and Singh, S. Novelty detection: a review - part 1: statistical
approaches. Signal processing 83, 12 (2003), 2481–2497.
[97] Markou, M., and Singh, S. Novelty detection: a review - part 2: neural
network based approaches. Signal processing 83, 12 (2003), 2499–2521.
[98] Marques, J. S., Jorge, P. M., Abrantes, A. J., and Lemos, J. Tracking
groups of pedestrians in video sequences. In Computer Vision and Pattern
Recognition Workshop, 2003. CVPRW'03. Conference on (2003), vol. 9, IEEE,
pp. 101–101.
[99] Maruster, L., Weijters, A. T., van der Aalst, W. W., and van den
Bosch, A. Process mining: Discovering direct successors in process logs. In
Discovery Science (2002), Springer, pp. 364–373.
[100] Maxion, R. A., and Roberts, R. R. Proper use of ROC curves in In-
trusion/Anomaly Detection. University of Newcastle upon Tyne, Computing
Science, 2004.
[101] Mehran, R., Oyama, A., and Shah, M. Abnormal crowd behavior detec-
tion using social force model. In Computer Vision and Pattern Recognition,
2009. CVPR 2009. IEEE Conference on (2009), IEEE, pp. 935–942.
[102] Montgomery, D. C., Jennings, C. L., and Kulahci, M. Introduction to
time series analysis and forecasting, vol. 526. John Wiley & Sons, 2011.
[103] Mutz, D., Valeur, F., Vigna, G., and Kruegel, C. Anomalous sys-
tem call detection. ACM Transactions on Information and System Security
(TISSEC) 9, 1 (2006), 61–93.
[104] Ng, B. Survey of anomaly detection methods. United States Department of
Energy, 2006.
[105] Oliver, N. M., Rosario, B., and Pentland, A. P. A bayesian computer
vision system for modeling human interactions. Pattern Analysis and Machine
Intelligence, IEEE Transactions on 22, 8 (2000), 831–843.
[106] Park, S., and Trivedi, M. M. Homography-based analysis of people and
vehicle activities in crowded scenes. In Applications of Computer Vision, 2007.
WACV’07. IEEE Workshop on (2007), IEEE, pp. 51–51.
[107] Parvathy, R., Thilakan, S., Joy, M., and Sameera, K. Anomaly detec-
tion using motion patterns computed from optical flow. In Advances in Com-
puting and Communications (ICACC), 2013 Third International Conference on
(2013), IEEE, pp. 58–61.
[108] Patcha, A., and Park, J.-M. An overview of anomaly detection techniques:
Existing solutions and latest technological trends. Computer Networks 51, 12
(2007), 3448–3470.
[109] Pelleg, D., and Moore, A. W. X-means: Extending k-means with efficient
estimation of the number of clusters. In ICML (2000), pp. 727–734.
[110] Pelleg, D., and Moore, A. W. Active learning for anomaly and rare-
category detection. In Advances in Neural Information Processing Systems
(2005), pp. 1073–1080.
[111] PETS. Performance evaluation of tracking and surveillance 2009 benchmark
data, 2009. [Online; accessed 08-March-2014].
[112] Posada, D., and Buckley, T. R. Model selection and model averaging
in phylogenetics: advantages of akaike information criterion and bayesian ap-
proaches over likelihood ratio tests. Systematic biology 53, 5 (2004), 793–808.
[113] Qayyum, A., Islam, M., and Jamil, M. Taxonomy of statistical based
anomaly detection techniques for intrusion detection. In Emerging Technologies,
2005. Proceedings of the IEEE Symposium on (2005), IEEE, pp. 270–276.
[114] Rabiner, L. R. A tutorial on hidden markov models and selected applications
in speech recognition. Proceedings of the IEEE 77, 2 (1989), 257–286.
[115] Reisman, P., Mano, O., Avidan, S., and Shashua, A. Crowd detection in
video sequences. In Intelligent Vehicles Symposium, 2004 IEEE (2004), IEEE,
pp. 66–71.
[116] Reynolds, A. P., Richards, G., and Rayward-Smith, V. J. The appli-
cation of k-medoids and pam to the clustering of rules. 173–178.
[117] Roberts, S. W. Control chart tests based on geometric moving averages.
Technometrics 1, 3 (1959), pp. 239–250.
[118] Rosemann, M., Recker, J., and Flender, C. Contextualisation of busi-
ness processes. International Journal of Business Process Integration and Man-
agement 3, 1 (2008), 47–60.
[119] Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and
validation of cluster analysis. Journal of computational and applied mathematics
20 (1987), 53–65.
[120] Salem, M. B., Hershkop, S., and Stolfo, S. J. A survey of insider
attack detection research. In Insider Attack and Cyber Security. Springer, 2008,
pp. 69–90.
[121] Salton, G., and Buckley, C. Term-weighting approaches in automatic text
retrieval. Information processing & management 24, 5 (1988), 513–523.
[122] Santos, A. C., Cardoso, J. M., Ferreira, D. R., Diniz, P. C., and
Chaínho, P. Providing user context for mobile and social networking appli-
cations. Pervasive and Mobile Computing 6, 3 (2010), 324–341.
[123] Scarfone, K., and Mell, P. Guide to intrusion detection and prevention
systems (IDPS). NIST Special Publication 800-94 (2007).
[124] Schwarz, G. Estimating the dimension of a model. The annals of statistics
6, 2 (1978), 461–464.
[125] Shameli-Sendi, A., Ezzati-Jivan, N., Jabbarifar, M., and Dagenais,
M. Intrusion response systems: survey and taxonomy. SIGMOD Rec 12 (2012),
1–14.
[126] Sharma, N., Sharma, P., Irwin, D., and Shenoy, P. Predicting so-
lar generation from weather forecasts using machine learning. In Smart Grid
Communications (SmartGridComm), 2011 IEEE International Conference on
(2011), IEEE, pp. 528–533.
[127] Solmaz, B., Moore, B. E., and Shah, M. Identifying behaviors in crowd
scenes using stability analysis for dynamical systems. Pattern Analysis and
Machine Intelligence, IEEE Transactions on 34, 10 (2012), 2064–2070.
[128] Stauffer, C., and Grimson, W. E. L. Learning patterns of activity using
real-time tracking. Pattern Analysis and Machine Intelligence, IEEE Transac-
tions on 22, 8 (2000), 747–757.
[129] Steinbach, M., Karypis, G., Kumar, V., et al. A comparison of docu-
ment clustering techniques. In KDD workshop on text mining (2000), vol. 400,
pp. 525–526.
[130] Tan, K. M., and Maxion, R. A. "Why 6?" Defining the operational limits
of stide, an anomaly-based intrusion detector. In Security and Privacy, 2002.
Proceedings. 2002 IEEE Symposium on (2002), IEEE, pp. 188–201.
[131] Tan, K. M., and Maxion, R. A. Determining the operational limits of an
anomaly-based intrusion detector. Selected Areas in Communications, IEEE
Journal on 21, 1 (2003), 96–110.
[132] Tandon, G., and Chan, P. Learning rules from system call arguments
and sequences for anomaly detection. In ICDM Workshop on Data Mining for
Computer Security (DMSEC) (2003), pp. 20–29.
[133] Thida, M., Eng, H.-L., Dorothy, M., and Remagnino, P. Learning
video manifold for segmenting crowd events and abnormality detection. In
Computer Vision–ACCV 2010. Springer, 2011, pp. 439–449.
[134] Tsai, C.-F., Hsu, Y.-F., Lin, C.-Y., and Lin, W.-Y. Intrusion detection
by machine learning: A review. Expert Systems with Applications 36, 10 (2009),
11994–12000.
[135] Tu, Z. Auto-context and its application to high-level vision tasks. In Com-
puter Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on
(2008), IEEE, pp. 1–8.
[136] UNM. UNM system call dataset, 2013. [Online; accessed 28-November-2013].
[137] Varghese, S. M., and Jacob, K. P. Process profiling using frequencies of
system calls. In Availability, Reliability and Security, 2007. ARES 2007. The
Second International Conference on (2007), IEEE, pp. 473–479.
[138] Vaswani, N., Roy-Chowdhury, A. K., and Chellappa, R. "Shape
activity": a continuous-state hmm for moving/deforming shapes with application
to abnormal activity detection. Image Processing, IEEE Transactions on 14,
10 (2005), 1603–1616.
[139] Wang, S., and Miao, Z. Anomaly detection in crowd scene. In Signal
Processing (ICSP), 2010 IEEE 10th International Conference on (2010), IEEE,
pp. 1220–1223.
[140] Wang, T., and Snoussi, H. Histograms of optical flow orientation for abnor-
mal events detection. In Performance Evaluation of Tracking and Surveillance
(PETS), 2013 IEEE International Workshop on (2013), IEEE, pp. 45–52.
[141] Wang, W., Zhang, X., and Gombault, S. Constructing attribute weights
from computer audit data for effective intrusion detection. Journal of Systems
and Software 82, 12 (2009), 1974–1981.
[142] Wang, X., et al. Learning motion patterns using hierarchical bayesian mod-
els. PhD thesis, Massachusetts Institute of Technology, 2009.
[143] Wang, X., Ma, K. T., Ng, G.-W., and Grimson, W. E. L. Trajec-
tory analysis and semantic region modeling using nonparametric hierarchical
bayesian models. International journal of computer vision 95, 3 (2011), 287–
312.
[144] Wang, X., Ma, X., and Grimson, E. Unsupervised activity perception
by hierarchical bayesian models. In Computer Vision and Pattern Recognition,
2007. CVPR’07. IEEE Conference on (2007), IEEE, pp. 1–8.
[145] Wang, X., Tieu, K., and Grimson, E. Learning semantic scene models by
trajectory analysis. In Computer Vision–ECCV 2006. Springer, 2006, pp. 110–
123.
[146] Warrender, C., Forrest, S., and Pearlmutter, B. Detecting intru-
sions using system calls: Alternative data models. In Security and Privacy,
1999. Proceedings of the 1999 IEEE Symposium on (1999), IEEE, pp. 133–145.
[147] Xing, Z., Pei, J., and Keogh, E. A brief survey on sequence classification.
ACM SIGKDD Explorations Newsletter 12, 1 (2010), 40–48.
[148] Xiong, G., Cheng, J., Wu, X., Chen, Y.-L., Ou, Y., and Xu, Y.
An energy model approach to people counting for abnormal crowd behavior
detection. Neurocomputing 83 (2012), 121–135.
[149] Xu, R., and Wunsch, D. Survey of clustering algorithms. Neural Networks,
IEEE Transactions on 16, 3 (2005), 645–678.
[150] Yang, J., Vela, P., Shi, Z., and Teizer, J. Probabilistic multiple people
tracking through complex situations. In 11th IEEE International Workshop on
Performance Evaluation of Tracking and Surveillance (2009).
[151] Ye, N., and Chen, Q. An anomaly detection technique based on a chi-
square statistic for detecting intrusions into information systems. Quality and
Reliability Engineering International 17, 2 (2001), 105–112.
[152] Yeung, D.-Y., and Chow, C. Parzen-window network intrusion detectors.
In Pattern Recognition, 2002. Proceedings. 16th International Conference on
(2002), vol. 4, IEEE, pp. 385–388.
[153] Yeung, D.-Y., and Ding, Y. Host-based intrusion detection using dynamic
and static behavioral models. Pattern recognition 36, 1 (2003), 229–243.
[154] Yilmaz, A., Javed, O., and Shah, M. Object tracking: A survey. Acm
computing surveys (CSUR) 38, 4 (2006), 13.
[155] Yolacan, E. N., Dy, J. G., and Kaeli, D. R. System call anomaly detec-
tion using multi-hmms. In Software Security and Reliability-Companion (SERE-
C), 2014 IEEE Eighth International Conference on (2014), IEEE, pp. 25–30.
[156] Yue, W. T., and Cakanyıldırım, M. A cost-based analysis of intrusion de-
tection system configuration under active or passive response. Decision Support
Systems 50, 1 (2010), 21–31.
[157] Zamboni, D., et al. Using internal sensors for computer intrusion detection.
Center for Education and Research in Information Assurance and Security,
Purdue University (2001).
[158] Zaraska, K. Ids active response mechanisms: Countermeasure subsystem for
prelude ids. Tech. rep., Citeseer, 2002.
[159] Zhan, B., Monekosso, D. N., Remagnino, P., Velastin, S. A., and
Xu, L.-Q. Crowd analysis: a survey. Machine Vision and Applications 19, 5-6
(2008), 345–357.
[160] Zhang, Y., Ge, W., Chang, M.-C., and Liu, X. Group context learning
for event recognition. In Applications of Computer Vision (WACV), 2012 IEEE
Workshop on (2012), IEEE, pp. 249–255.
[161] Zhang, Z., and Li, M. Crowd density estimation based on statistical analysis
of local intra-crowd motions for public area surveillance. Optical Engineering
51, 4 (2012), 047204–1.
[162] Zhang, Z., and Shen, H. Application of online-training svms for real-time
intrusion detection with different considerations. Computer Communications
28, 12 (2005), 1428–1442.
[163] Zhao, Y., and Karypis, G. Evaluation of hierarchical clustering algorithms
for document datasets. In Proceedings of the eleventh international conference
on Information and knowledge management (2002), ACM, pp. 515–524.
[164] Zhao, Y., and Karypis, G. Empirical and theoretical comparisons of selected
criterion functions for document clustering. Machine Learning 55, 3 (2004),
311–331.
[165] Zhou, B., Wang, X., and Tang, X. Understanding collective crowd be-
haviors: Learning a mixture model of dynamic pedestrian-agents. In Computer
Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (2012),
IEEE, pp. 2871–2878.
[166] Zhu, Y., Nayak, N. M., and Roy-Chowdhury, A. K. Context-aware
activity recognition and anomaly detection in video. Selected Topics in Signal
Processing, IEEE Journal of 7, 1 (2013), 91–101.
[167] Zhu, Y., Nayak, N. M., and Roy-Chowdhury, A. K. Context-aware
modeling and recognition of activities in video. In Computer Vision and Pattern
Recognition (CVPR), 2013 IEEE Conference on (2013), IEEE, pp. 2491–2498.