Upload: vanhuong

Post on 17-Dec-2018

Page 1: CHAPTER 2 LITERATURE SURVEY - shodhganga.inflibnet.ac.inshodhganga.inflibnet.ac.in/bitstream/10603/17838/5/12_chapter 2.pdf · 29 Table 2.2 Leading IDS products currently available

CHAPTER 2

LITERATURE SURVEY

A large proportion of research in developing IDS focuses on developing

new system architectures to improve the accuracy and completeness of the IDS.

Several research areas in the domain of IDS help move the field towards

a set of ideal requirements as listed in Table 2.1.

Table 2.1 Ideal requirements of IDS

Requirement       Description
Accuracy          No false positives
Completeness      No false negatives
Performance       Real time detection
Fault Tolerance   IDS not becoming security vulnerability itself
Timeliness        Quick propagation of information in the network to react to potential intrusions using IDS
Scalability       Handling large amounts of data

Intrusion detection procedures fall into three categories, which

differ in the reference data used for detecting unusual activity. Signature

based detection, or misuse detection (MD), matches signatures of known malicious activity. Anomaly detection (AD)

relies on a profile of normal system activity, and protocol based or specification

based detection relies on constraints that characterize the normal behavior of a

particular protocol or program. The trend is to apply machine learning (ML) to IDS, since it offers

flexibility for detection and lends itself conveniently to AD. AD operates

on the assumption that attacks differ from normal activity and tries to focus on

identifying unusual behavior in a host or a network. However, it is now common to

develop hybrid systems, which may combine misuse and anomaly detectors, host

based and network based modules, and event correlation and stateless detectors.

With increasing research on hybrid IDS, recent research focuses on correlating alerts

between the different modules in an efficient manner [3, 4]. Alert aggregation is one

such area in which similar alerts/events are grouped into a single generalized event.

This reduces the amount of data that the system administrator must analyze to

detect an intrusion. Event correlation is another research area, which is

well established for MD. MD based IDS often use a set of rules or signatures

as the attack model, with each rule usually dedicated to detecting a different attack. Earlier

research work emphasized that the data set for analysis can be obtained from real traffic,

sanitized traffic and simulated traffic [5,6]. But in real time operation, a fast response to external

events within an extremely short time is demanded and expected. Therefore, an

alternative algorithm to implement real time learning is imperative for critical

applications for fast changing environments. Even for offline applications, speed is

still a need. A real time learning algorithm that reduces training time and human

effort to nearly zero would always be of considerable value. The advent of new

technologies has greatly increased the ability to monitor and resolve the details of

changes in order to analyze them better. Analyzing large amounts of data is still a

challenge. To identify frequently changing trends, the data need to be analyzed and

corrected. In some cases, feature selection may improve the performance of

detection, as it reduces complexity by reducing the number of dimensions.

Researchers have proposed several methods of feature selection to achieve real time

IDS. The major benefit of feature selection is that the amount of data required to

process is significantly reduced, without compromising the performance of the

detection.
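
As a minimal sketch of how feature selection shrinks the data to be processed, the following Python snippet drops features whose variance across records falls below a threshold. The variance criterion, the names select_features and project, and the sample records are illustrative assumptions; the methods surveyed here may use quite different selection criteria.

```python
from statistics import pvariance


def select_features(records, min_variance=1e-3):
    """Return indices of columns whose population variance exceeds min_variance."""
    columns = list(zip(*records))  # transpose rows into columns
    return [i for i, col in enumerate(columns) if pvariance(col) > min_variance]


def project(records, keep):
    """Keep only the selected columns of every record."""
    return [[row[i] for i in keep] for row in records]


# Hypothetical connection records: [duration, constant flag, bytes sent]
records = [
    [0.1, 1.0, 200.0],
    [0.4, 1.0, 180.0],
    [2.5, 1.0, 950.0],
]
keep = select_features(records)   # the constant column carries no information
reduced = project(records, keep)  # every record now has two features instead of three
```

A feature that never varies cannot help separate normal from intrusive records, so removing it reduces the data volume without affecting detection.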

2.1 CURRENT IDS PRODUCTS

IDS can be classified according to many different features [7,8].

Table 2.2 lists some of the currently available IDS with features.

Table 2.2 Leading IDS products currently available

Name Description

SNORT

An open source network Intrusion Prevention and Detection System (IDS/IPS). SNORT, developed by Sourcefire, combines the benefits of signature, protocol and anomaly based inspection. SNORT is one of the most widely deployed IDS/IPS technologies worldwide.

COUNTERACT

Delivers an entirely unique approach to prevent network intrusions. The system stops attackers based on their “proven intent” to attack. It does not use signatures, AD or pattern matching of any kind. To launch an attack, an attacker needs knowledge about a network's resources. Prior to an attack, intruders compile vulnerability and configuration information by scanning and probing. This information is used to launch attacks based on the unique structure and characteristics of the targeted network. These characteristics of intruders are used by COUNTERACT to prevent intrusions.

AIRMAGNET

Provides a simple, scalable WLAN monitoring solution. This enables an organization to proactively mitigate all types of wireless threats.

BRO – IDS

An open source, Unix based NIDS. Bro passively monitors network traffic and looks for suspicious activity. Bro detects intrusions by first parsing network traffic to extract its application level semantics and then executing event oriented analyzers, which compare the activity with patterns deemed to be troublesome.

CISCO INTRUSION PREVENTION SYSTEM (IPS)

One of the most widely deployed IPS. It provides protection against more than 30,000 known threats, timely signature updates and Cisco Global Correlation to dynamically recognize, evaluate, and stop emerging Internet threats. Cisco IPS includes industry leading research and the expertise of Cisco Security Intelligence Operations. It also protects against increasingly sophisticated attacks, including directed attacks, worms, botnets, malware and application abuse. It provides intrusion prevention that stops outbreaks at the network level and supports a wide range of deployment options, with near real time updates for the most recent threats.

JUNIPER NETWORKS INTRUSION DETECTION AND PREVENTION (IDP)

Offers comprehensive coverage by leveraging multiple detection mechanisms. Backed by Juniper Networks Security Lab, signatures for detection of new attacks are generated on a daily basis. Because Juniper works very closely with many software vendors to assess new vulnerabilities, it’s not uncommon for IDP Series to be equipped to prevent attacks which have not yet occurred. Such coverage ensures that organizations do not merely react to new attacks but can proactively secure the network from future attacks.

McAfee HOST INTRUSION PREVENTION FOR SERVER

Defends servers from known and new zero day attacks with McAfee Host Intrusion Prevention. Boosts security at low cost and simplifies compliance by reducing the frequency of patching new signatures.

SOURCEFIRE INTRUSION PREVENTION SYSTEM

Built on the foundation of the award-winning Snort® rules-based detection engine. It uses a powerful combination of vulnerability and AD based inspection methods.

STRATA GUARD IDS / IPS

This award winning high speed IDS/IPS gives real time protection from network attacks and malicious traffic. It prevents malware, spyware, port scans, viruses, and DoS and Distributed DoS (DDoS) attacks.

Over the years, researchers and designers have used many techniques to

design IDS. However, each existing IDS has had one or more issues. Current

AD methods are mainly classified as Statistical Anomaly Detection, Detection

Based on Neural Networks, Detection Based on DM, etc. An IDS for AD

should first learn the characteristics of normal and abnormal activities, and

then detect traffic that deviates from normal activities. AD tries to determine

whether deviation from established normal usage patterns can be flagged as

intrusions [9]. AD techniques are based on the assumption that misuse or intrusive

behavior deviates from normal system procedure [10]. The advantage of AD is that

it can detect attacks that have never been seen before, but it is ineffective in detecting

insider attacks. Shoubridge [11] developed an IDS that can analyze critical network

events and trends. The authors of [12,13] represent the dynamic network as a directed

graph and calculate similarity measures that show a change in the trend of the

network behavior over time. With the same principle, Pincombe [14] developed

IDS that uses graph distance metrics, such as weight, modality, and diameter, to

compute graph similarities. Cumulative summation and minimum mean square

errors are then used recursively to detect CP. Although this method is faster

compared to previous methods, it did not provide good results for all graph distance

metrics. Hence, it remains an open question which distance measure is best suited to

different types of graphs.

Recently DM and ML methods [15-18] for a data stream have been

actively proposed. A data stream is an ordered sequence of objects o1,…,on that must

be accessed in the same order. It can be read only once or a specified number of

times. Hence, it is not possible to maintain all the objects of a data stream in the

main memory. Each object should be examined only once to analyze the data

stream. The memory space for data stream analysis should be confined finitely,

although new objects get generated infinitely over time. Newly generated objects

should be analyzed as quickly as possible to maintain up to date results with

minimum false alarms. Therefore, reducing false positives is a major area of research.

Currently the detection of outliers has gained significant research interest with the

insight that outliers can be the key discovery for a possible new attack.
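
The single-pass, bounded-memory constraint described above can be illustrated with a small sketch. It uses Welford's online algorithm to maintain the mean and variance of a stream; the choice of this algorithm, the class name RunningStats and the sample values are assumptions made for illustration and are not taken from the cited work [15-18].

```python
class RunningStats:
    """One-pass mean and variance (Welford's algorithm) in constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new object into the summary; the object itself is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two objects have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical stream
    stats.update(value)  # each object is examined exactly once
```

Only three numbers are kept regardless of how many objects arrive, which is what allows the summary to stay up to date on an unbounded stream.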

2.2 OUTLIER DETECTION (OD)

OD refers to the problem of finding interesting patterns in data that are

very different from the rest of the data. Such a pattern, once found, may often contain

useful information regarding abnormal behavior of the system. These patterns are

usually called outliers or noise. OD is an extensively researched area that finds

immense use in application domains such as credit card fraud detection, illicit access

in computer networks, military surveillance for enemy activities and many others.

The OD approaches found in the literature [19-21] have varying scopes and

abilities. Due to the lack of prior knowledge about the collected data set, the OD problem falls

into the category of unsupervised learning. Another area of research is semi

supervised OD, where some examples of outliers and inliers are available as a

training set. The semi supervised outlier detection methods perform better than

unsupervised methods since additional label information is available. But such

outlier samples for training are not always available and if available may be diverse.

Thus learning from known types of outliers is not necessarily useful in detecting

unknown types of outliers. OD searches for objects that do not follow rules and

expectations in the data set. The detection of an outlier may be evidence that there

are new trends/patterns in data. Although outliers are considered noise or errors,

they may carry important information. OD depends on the applied detection

methods and also data structures that are used. Depending on the approaches used in

OD, the methodologies can be broadly classified as Distance based, Density based

or Soft Computing based. Selecting subspaces in the case of OD is a complex and a

challenging problem [22] and outliers are rare and very hard to collect [23].

Rejecting some dimensions for the sake of easy calculation may lead to some loss of

important and also interesting knowledge.
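
To make the distance based family of OD methods concrete, here is a minimal sketch that scores every point by the distance to its k-th nearest neighbour, so that isolated points receive high scores. The scoring rule, the function name knn_outlier_scores and the sample points are illustrative assumptions rather than a method taken from [19-23].

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical two-dimensional feature vectors; the last one lies far from the rest
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = knn_outlier_scores(points, k=2)
outlier_index = max(range(len(points)), key=scores.__getitem__)
```

This brute-force version compares every pair of points, so it is only meant to show the idea; it is unsupervised, in line with the observation that labelled outliers are rarely available.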

2.3 STATISTICAL BASED ANOMALY DETECTION

Statistics is a widely used tool for building behavior based IDS [24,25].

The behavior of the system or the user is measured by a number of variables sampled

over time. These include:

1. User login and logout time of each session.

2. Duration of the resource usage.

3. Amount of processor and memory consumed during that session etc.

4. Number of commands executed that are sampled over time.

One of the popular IDS is Intrusion Detection Expert System (IDES) that

works on statistical based AD. IDES monitors the users, remote hosts and target

systems with different parameters that include CPU usage, command usage, network

activity etc. Vectors are formed with these parameters and statistical profiles are

updated to reflect the new user behavior. To detect anomalies, IDES processes each

new data set and verifies it against the known profile. If any deviations are detected,

they are reported as probable intrusions. IDES is not suitable if the parameters have

multimodal distributions. This problem is addressed in the next version of IDES,

known as Next-generation Intrusion Detection Expert System (NIDES) [26]. NIDES

stores only statistics such as frequencies, means, variances, and covariance of the

profile since storing the audit data itself is too cumbersome. Given a profile with n

measures, NIDES characterizes any point in the n-space of the measures to be

anomalous if it is sufficiently far from an expected or defined value. NIDES

evaluates the total deviation and not just the deviation of each individual measure.

Wisdom and Sense [27] is specifically designed using statistical anomaly detection

that analyzes behavior of users. Based on the activities of users over a period of time

the system updates a set of rules that statistically describe the behavior of the users.

Current behavior is then matched against these rules to detect inconsistent behavior.

These rules are regularly updated to analyze/detect new usage patterns. One

simple method is to model a system that keeps averages of all or any one of these

variables and detects whether thresholds are exceeded based on the standard

deviation. This model is too simple to represent the data faithfully. Even

comparing the variables of individual users with aggregate group statistics may not

yield much improvement. Therefore, a more complex model needs to be developed

that compares long term and short term profiles of user or system activities. These

profiles are periodically updated as user behavior changes, and this model is

now used in a number of intrusion detection tools and prototypes.
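
The simple averaging model mentioned above can be sketched as follows: a profile stores the mean and standard deviation of one variable, and a new observation is flagged when it lies more than a chosen number of standard deviations from the mean. The three-sigma default, the function names and the sample data are assumptions for illustration only; they are not the profile format of IDES or NIDES.

```python
from statistics import mean, stdev


def build_profile(history):
    """Summarize past observations of one variable as (mean, standard deviation)."""
    return mean(history), stdev(history)


def is_anomalous(value, profile, n_sigmas=3.0):
    """Flag a value lying more than n_sigmas standard deviations from the mean."""
    mu, sigma = profile
    return abs(value - mu) > n_sigmas * sigma


# Hypothetical CPU seconds consumed in past sessions of one user
cpu_history = [10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 11.0, 12.0]
profile = build_profile(cpu_history)

flag_heavy = is_anomalous(40.0, profile)   # far outside the usual range
flag_usual = is_anomalous(12.5, profile)   # within normal variation
```

A single threshold per variable ignores correlations between variables and slow drifts in behavior, which is why the text argues for richer long term and short term profiles.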

2.4 MACHINE LEARNING FOR ANOMALY DETECTION

AI is the simulation of human intelligence in machines with a feature to

be able to make decisions. ML is a branch of AI that is specifically concerned with

enabling machines to understand information. Recent research focuses more on the

combination of techniques to improve the detection rates of ML classifiers. For

example, Mahbod et al. [28] examined the performance of seven ML algorithms on

the KDD Cup’99 dataset. They found that different techniques performed better on

different classes of intrusion. By combining the best techniques for each class,

the overall performance of the detector was increased. However, there are still

discrepancies in the findings reported in the literature as to how well different

techniques perform on the different classes of intrusions. ML is an ideal technology

for defending against attacks. Knowing that programmers tend to repeat mistakes, it

provides defenders with an advantage by detecting flaws before an intrusion

happens. Sophisticated IDS may use statistical techniques such as Naïve Bayes [29]

to find new vulnerabilities. This enables the defender to capture, mislead or use

other counter measures against the attacker. ML provides an advantage to the

defender because it can detect any anomaly. Thus, the attacker would need to hide

byte patterns in addition to finding and exploiting vulnerabilities. This requires the

attacker to add complexity to bypass defenses. The IDS will learn with each attack

and ML makes the system more intelligent and secure over time. ML is an

algorithmic method in which an application automatically learns from the input and

provides the feedback to improve its performance over time. Unlike statistical

methods, which aim at determining the deviations in traffic features, the ML based

approach aims at detecting anomalies using some unique mechanism. ML methods focus

on finding relationships in data by analyzing the process, and are classified as:

1. Supervised Learning (SL) – Attempts to learn some function with

given input vector and actual output.

2. Unsupervised Learning (UL) – Attempts to learn only with given

input vector by identifying relationships among data.

3. Reinforcement Learning (RL) [30,31] – Learns with a single bit of

information which indicates to the neuron whether the output is good

or bad.

These ML techniques can also recognize the patterns not presented

during a training phase. Some of the ML techniques used to detect attacks are Naïve

Bayesian, SVM and ANN. Most of the ML algorithms applied to intrusion detection

have not considered minimizing false alarms. The cost associated with a false

alarm is higher than that of a missed detection.

2.5 MACHINE LEARNING VERSUS STATISTICAL TECHNIQUES

A wide range of real world applications are discussed in the community

of Statistical Analysis and Data Mining. Statistical techniques usually assume an

underlying distribution of data and require the elimination of data instances

containing noise. Statistical methods, though computationally intense, can be

applied to analyze the data. Statistical methods are widely used to build behavior-

based IDS. The behavior of the system is measured by a number of variables

sampled over time such as the resource usage duration, the amount of processor-

memory-disk resources consumed during that session etc. The model keeps averages

of all the variables and detects whether thresholds are exceeded based on the

standard deviation of the variable.

2.6 INSTANCE BASED LEARNING (IBL)

Researchers have also employed IBL techniques in intrusion detection

and event correlation/fault management as a means to obtain a more flexible system

compared with most Expert Systems (ES). The drawback of using ES is the need to extract

knowledge of intrusions and code it in the form of rules. This is difficult and

time consuming, as is managing and updating the rule base dynamically.

Another problem is that specific rules cannot detect slight variations of known

attacks. IBL addresses these problems by reasoning from previously solved instances/cases,

unlike ES, which require prior knowledge to determine specific rules [32]. The

knowledge repository of instances/cases can be updated automatically and the

system can learn from its own experience during operation. However, IBL is not as

efficient as ES in performing event correlation and has high memory requirements

as it is necessary to store a large number of cases/rules. Case Based Reasoning

(CBR) may be used to improve the performance in acquiring and representing the

knowledge for IDS. Lane [33] developed an IDS that performs anomaly detection by

means of IBL. The system builds user profiles from UNIX commands and

uses them to detect long term unconventional use as well as misuse of data.

The research focus is on data reduction techniques, addressing the general issue of

high memory requirement of IBL. However, IBL was not able to maintain the

characteristics of the users as compared to clustering method. Hence, clustering is

considered as the better alternative.
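
As a rough sketch of the instance based idea, the snippet below stores previously observed command windows of a user and flags a new window when even its closest stored instance is dissimilar. The position-wise similarity measure, the threshold, the function names and the sample commands are all hypothetical; the similarity measure used in [33] may differ.

```python
def similarity(seq_a, seq_b):
    """Count positions at which two equal-length command sequences agree."""
    return sum(a == b for a, b in zip(seq_a, seq_b))


def best_match(sequence, instances):
    """Similarity of the sequence to its closest stored instance."""
    return max(similarity(sequence, inst) for inst in instances)


def is_anomalous(sequence, instances, min_similarity=2):
    """Flag a sequence whose closest stored instance is still too dissimilar."""
    return best_match(sequence, instances) < min_similarity


# Hypothetical profile: command windows previously observed for one user
profile = [
    ("ls", "cd", "vi", "make"),
    ("cd", "ls", "vi", "gcc"),
    ("ls", "vi", "make", "ls"),
]

familiar = is_anomalous(("ls", "cd", "vi", "gcc"), profile)
unfamiliar = is_anomalous(("nc", "chmod", "wget", "sh"), profile)
```

Every stored instance is compared against each new window, which illustrates the memory and computation cost that motivates the data reduction techniques mentioned above.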

2.7 CHANGE POINT TECHNIQUE

Large scale computer network intrusions during the final stages can be

identified by observing the abrupt changes/threshold in the network traffic [34].

However, these changes are hard to detect and difficult to distinguish from usual

traffic fluctuations in the early stages. Researchers have developed efficient adaptive

sequential and batch sequential methods for an early detection of attacks/intrusions

that lead to changes in network traffic. These methods employ a statistical analysis

of network traffic to detect very subtle traffic changes. The algorithms are based on

change point (CP) detection methods that use a threshold to raise an alarm. The CP algorithms

are self learning, allow for the detection of attacks with a small average delay and

are computationally simple and thus can be implemented online. Application of CP

models falls into various categories such as Gaussian observations with varying

mean or variance, Poisson process with a piecewise constant rate, changing linear

regression models and Hidden Markov Models (HMM) with time varying transition

matrices. CP detection methods can be divided into two categories, posterior and

sequential. In posterior tests the entire data set is collected first and CP is detected

off-line based on the analysis on the data collected. In contrast sequential tests are

done on-line with the data collected and the analysis is made on the fly. In the

research work on statistical data analysis, detecting changes in mean of a given data

series plays an important role. Some of the approaches for CP detection [35-37] are

Chauvenet’s Criterion, Peirce’s Criterion, CUmulative SUM (CUSUM),

Generalized Likelihood Ratio (GLR) and Direct Density Ratio (DDR) estimation. Density ratio

methods such as Kernel Mean Matching have been actively explored by the ML

community.

2.7.1 Coefficient of Variation (CV)

For certain types of data, the standard deviation increases in proportion to the

average: if the average shifts upwards by at least 50%, so does the standard deviation

[38]. Common examples include filling processes, system

measurements and the accuracy of systems. For such processes, CV, the ratio of the

standard deviation to the average, gives a better characterization of the variation.
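
A short worked example may help here. Assuming hypothetical measurements whose spread grows with their level, scaling the data by 1.5 raises both the mean and the standard deviation by 50%, while the coefficient of variation stays the same. The function name and the sample values are illustrative.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Ratio of the sample standard deviation to the sample mean."""
    return stdev(samples) / mean(samples)


low = [10.0, 12.0, 11.0, 9.0]          # hypothetical measurements
high = [x * 1.5 for x in low]          # mean and spread both shift up by 50%

cv_low = coefficient_of_variation(low)
cv_high = coefficient_of_variation(high)
```

Because the two values agree, a threshold expressed in terms of CV does not need to be retuned when the overall level of the process changes.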

2.7.2 Chauvenet's Criterion [25,39]

For a given sample of N measurements, this criterion defines an acceptable scatter

around the mean value. All data points which fall within a band around the mean

that corresponds to a probability of [1 - 1/(2N)] should be retained. Data points are

considered for rejection only if the probability of obtaining their deviation from the

mean is less than 1/(2N).
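
The rejection rule can be sketched in a few lines of Python, assuming the data are approximately normal so that the two-sided tail probability of a deviation can be computed with the complementary error function. The function name chauvenet_rejects and the sample readings are hypothetical.

```python
import math
from statistics import mean, stdev


def chauvenet_rejects(samples):
    """Return the samples that Chauvenet's criterion marks for rejection."""
    n = len(samples)
    mu, sigma = mean(samples), stdev(samples)
    rejected = []
    for x in samples:
        z = abs(x - mu) / sigma
        # Two-sided probability of a deviation at least this large under a normal law
        prob = math.erfc(z / math.sqrt(2.0))
        if prob < 1.0 / (2.0 * n):
            rejected.append(x)
    return rejected


# Hypothetical readings with one suspicious value
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 15.0]
rejected = chauvenet_rejects(readings)
```

Note that the mean and standard deviation are computed with the suspect point included, so a single extreme value inflates the band it is tested against; the criterion is usually applied once rather than repeatedly.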

2.7.3 Peirce's criterion [40]

This technique applies a rigorous method based on probability theory

which can be used to eliminate data “outliers” or spurious data in a better way.

However, Peirce's criterion can be applied more generally to a data set which

follows Gaussian distributions. A piecewise segmented function, as proposed by

Stephen M Ross, caters for time dependent data, where the CPs are

the points between successive segments. A CP may be detected by finding the

point at which the total error of the local models fitted to the data segments before and

after that point is minimized. However, it is computationally expensive to converge

to such a point as the local model fitting is required as many times as the number of

points between the successive points whenever the data is given as an input.

2.7.4 CUSUM (CUmulative SUM) [41-43]

CUSUM charts can be used to detect deviations from a given

predetermined value. This method computes the deviation of each observed

data point from the desired process mean. This is accumulated over time to compute the

CUSUM at each given point. The basic rules for interpreting CUSUM values are: if

the data is above the overall average, the CUSUM value increases; if the data is below

the overall average, the CUSUM value decreases; and a sudden change in the direction

of the CUSUM indicates that the values have shifted. The CUSUM method applies a hypothesis test to

distinguish between acceptable and unacceptable (quality) attribute values. CUSUM

can also be used to detect a shift in a normal mean based on inferences of the normal

distribution. It should be noted that the data provided to CUSUM calculations have

to follow the normal distribution. The continuous normal or Gaussian probability

distribution is parameterized by the population mean μ and the population variance

σ².
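
A minimal sketch of a one-sided CUSUM detector follows. It accumulates deviations above a target mean, ignores deviations smaller than a slack value, and raises an alarm when the accumulated sum exceeds a threshold. The slack and threshold parameters, the function name cusum_alarm and the sample traffic values are assumptions chosen for illustration; they are not prescribed by [41-43].

```python
def cusum_alarm(samples, target, slack, threshold):
    """Return the index of the first upward-shift alarm, or None if none occurs."""
    s = 0.0
    for i, x in enumerate(samples):
        # Accumulate deviations above the target, discounting each by the slack
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return i
    return None


# Hypothetical per-interval traffic counts; the level rises from the sixth value on
traffic = [10.0, 9.0, 11.0, 10.0, 10.0, 13.0, 14.0, 13.0, 15.0]
alarm_at = cusum_alarm(traffic, target=10.0, slack=0.5, threshold=5.0)
```

Small fluctuations around the target are absorbed by the slack and the reset to zero, whereas a sustained shift accumulates quickly, which is why the alarm follows the change with only a short delay.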

2.7.5 Generalized Likelihood Ratio (GLR) [37, 44]

This is an intuitive approach for handling the testing problems based on

discrepancy measures. The logarithmic value of the likelihood ratio between two

consecutive intervals in time-series data will be monitored for detecting change

points. The above premise has been extensively explored in the DM community in

connection with real world applications. Because of the computational cost of the

GLR, nonlinear models such as NN were previously not employed, even for off line

analysis. Recent advances in both training algorithms and computer speed have

made it possible to implement GLR for both off line and real time applications.
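
To illustrate monitoring the log likelihood ratio between two consecutive intervals, the sketch below assumes Gaussian data with a known standard deviation and tests for a change in mean only. Under that assumption the log GLR for windows of sizes n1 and n2 with means m1 and m2 reduces to (n1 n2 / (n1 + n2)) (m1 - m2)² / (2σ²). The window length, the function names and the sample series are hypothetical.

```python
from statistics import mean


def glr_mean_shift(left, right, sigma):
    """Log GLR for a mean change between two windows of Gaussian data, sigma known."""
    n1, n2 = len(left), len(right)
    diff = mean(left) - mean(right)
    return (n1 * n2 / (n1 + n2)) * diff ** 2 / (2.0 * sigma ** 2)


def scan(series, window, sigma):
    """Log GLR at every split point that has a full window on both sides."""
    return {
        t: glr_mean_shift(series[t - window:t], series[t:t + window], sigma)
        for t in range(window, len(series) - window + 1)
    }


# Hypothetical series whose mean jumps from about 0 to about 2 at index 5
series = [0.0, 0.1, -0.1, 0.0, 0.1, 2.0, 2.1, 1.9, 2.0, 2.1]
scores = scan(series, window=3, sigma=0.5)
change_point = max(scores, key=scores.get)
```

In an online setting the statistic would be compared against a threshold as each new window arrives instead of taking the maximum over a complete series.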

2.7.6 DDR (Direct Density Ratio) [45,46]

This is an estimation that has been actively explored in the ML

community. Kernel Mean Matching (KMM) avoids density estimation and directly

gives an estimate of the importance at test points. The values of the importance are

unknown in practice, so they need to be estimated from the sample data that is

collected. If the training and test densities are estimated separately from the data

samples, then it is possible to estimate the importance by taking the ratio of the two

estimated densities. But this approach will suffer from the curse of dimensionality, if

the data has neither low dimensions nor a simple distribution. Vapnik [47] suggested

that DDR estimation is very crucial in statistical learning.

2.8 APPLICATION OF DM IN DEVELOPING IDS

Due to the large volume of intrusion detection data sets, researchers have

applied many DM and ML algorithms for detecting intrusions. DM with ML can be

defined as the process of extracting patterns from large data sets by combining

methods from statistics and AI techniques. DM is seen as an increasingly important

tool by an enterprise to transform data into Business Intelligence (BI) giving an

informational advantage. It is also currently used in a wide range of profiling

practices, such as marketing, surveillance, fraud detection, and scientific discovery

[48-50]. The relevance of DM in detecting intrusion is still an open research area in

intelligent computing. DM can be used to clean, classify and study large amounts of

network data to correlate violations for intrusion detection. The main reason for using

DM techniques for IDS is due to the enormous volume of existing and newly

appearing network data that require processing. The amount of data accumulated

each day by a network is enormous. DM algorithms can be used for misuse

detection and anomaly detection (AD). Many DM algorithms have already been used

for AD, such as DT, Naïve Bayesian (NB), Neural Networks (NN), SVM etc.

The earlier work emphasized that data can be obtained by three

ways [51]:

i. By using real traffic.

ii. Using sanitized traffic.

iii. Using simulated traffic.

But in real time operation, a fast response to external events within an extremely short

time is demanded and expected. Therefore, an alternative algorithm to implement

real time learning is imperative for critical applications in fast changing

environments. Even for offline applications, speed is still needed, and a real-time

learning algorithm that reduces training time and human effort to nearly zero would

always be of considerable value. Mining data in real time remains a major challenge.

2.8.1 Artificial Neural Networks (ANN)

ANN consists of a collection of processing units called neurons that are

highly interconnected in a given topology. ANN is able to learn by example

and to generalize from limited, noisy, and incomplete data. Hence ANN has been

successfully employed in a wide range of data intensive applications. The

contributions and performance of ANN in the intrusion detection domain are reviewed

below by network type.

2.8.2 Feed Forward Neural Networks (FFNN)

FFNN is the first and the simplest type of ANN devised. Two types of

FFNN are commonly used in modeling either normal or intrusive patterns.

2.8.2.1 Multi Layered Feed Forward (MLFF) Neural Networks

MLFF networks can be trained with various learning techniques, the most popular being Back

Propagation (MLFFBP). MLFFBP networks were applied to develop IDS primarily

for anomaly detection at the user behavior level [52,53]. To distinguish between normal

and abnormal behavior, Seth Freeman [54] used a data set that consists of user

behavior. Ryan [55] considered command patterns and their frequency of

execution. More recent research interest is in detecting software behavior that is

described by sequences of system calls. Since system call sequences are more stable

than commands, Ghosh [56] built a model using MLFFBP for the lpr program and

the DARPA BSM98 dataset. Detailed descriptions of this dataset can be found at

http://www.ll.mit.edu/IST/ideval/data/data_index.html. Network traffic is

another vital data source, and MLFFBP can be applied to network packets for the detection of

misuse. Although the training and test iterations required a day to complete,

experiments showed that MLFFBP, used as a binary classifier, could correctly

identify attacks in the test data. MLFFBP can also be used as a Multi Class

Classifier (MCC). Such an NN has multiple output neurons and is more flexible.

Mukkamala, Sung and Ajith [57] compared twelve different learning algorithms on

the KDD99 dataset. They found that resilient back propagation achieved the best

performance in terms of accuracy and training time.

2.8.2.2 Radial Basis Function Neural Networks (RBFNN)

RBFNN are another popular type of FFNN. The classification is

performed by measuring distances between inputs and the centers of the RBFNN

hidden neurons. RBFNN are much faster than back propagation networks and are suitable for

problems with large data sets [58]. Many researchers [59, 60] have developed

systems using RBFNN that can learn from multiple local clusters for well known

attacks and for normal events. One hybrid approach integrates both misuse

and anomaly detection in a hierarchical RBF network. The first layer has an RBF

anomaly detector that identifies whether an event is normal or not. Anomalous events

are then passed through a chain of RBF misuse detectors, each for a specific type of attack.

Anomalous events that could not be classified were saved to a database, and a

C-Means clustering algorithm clustered these events into different groups. Later a

misuse RBF detector was trained on each group and added to the misuse detector

chain. In this way, all intrusion events were automatically and adaptively detected and

labeled.
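
The distance based classification performed by an RBF network can be sketched as follows. The sketch assumes Gaussian basis functions; the centers, the width and the output weights are hypothetical hand-set values chosen for illustration, whereas in practice they would be learned from training data.

```python
import math

def rbf_activations(x, centers, width):
    """Gaussian activation of each hidden neuron, based on the distance to its center."""
    acts = []
    for c in centers:
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        acts.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    return acts

def rbf_classify(x, centers, width, weights, labels):
    """Weighted sum of hidden activations per output neuron; the largest output wins."""
    acts = rbf_activations(x, centers, width)
    scores = [sum(w * a for w, a in zip(row, acts)) for row in weights]
    return labels[scores.index(max(scores))]

# Hypothetical two-feature example: one center for normal events, one for an attack cluster.
centers = [(0.1, 0.1), (0.9, 0.8)]
weights = [[1.0, 0.0],   # output neuron for "normal"
           [0.0, 1.0]]   # output neuron for "attack"
labels = ["normal", "attack"]

near_normal = rbf_classify((0.2, 0.1), centers, 0.3, weights, labels)
near_attack = rbf_classify((0.8, 0.9), centers, 0.3, weights, labels)
```

An input is assigned to the class whose center it lies closest to, which is why adding a newly trained detector for a new cluster does not require retraining the existing ones.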

Since RBF and MLFF networks are both widely used, Jiang and Zhang [61]

compared them for misuse and anomaly detection on the

KDD99 dataset. Their experiments showed that for misuse detection, BP has a

slightly better performance than RBF in terms of detection rate and false positive

rate, but requires a longer training time. For AD, the RBF network improves

performance with a high detection rate and a low false positive rate, and requires

less training time. In general, RBF networks achieve better performance, which was

also concluded by Hofmann et al. [62] using the DARPA98 dataset.

2.8.3 Recurrent Neural Networks (RNN)

It is important but difficult to detect attacks spread over a period of time.

The window size used in predicting user behavior should therefore be adjustable.

A large window size is needed to capture deterministic behavior when users

perform a particular job. During this time their behavior is stable and predictable.

When users are switching from one job to another, behavior becomes unstable and

unpredictable. Hence a small window size is required in order to quickly forget

meaningless past events. The inclusion of memory in NN led to the invention of

RNN, such as the Elman network [63]. RNN was used in forecasting applications, where a

network predicted the next event given an input sequence. If the predicted output

deviated from the actual event, an alarm was generated. Sheikhan

et al. [64] modified the RNN model with three layers. The results showed that the

model had an improvement in Classification Rate (CR), Detection Rate (DR) and

Cost Per Example (CPE). The model was compared with similar related works and

also with simulated MLP and Elman-based intrusion detectors. Ghosh et al. [65]

compared RNN with an MLFFBP network for forecasting system call sequences, and

the results showed that RNN achieved the best performance, with a detection

accuracy of 77.3% and zero false positives. Cheng et al. [66] developed an RNN to

detect network anomalies using the KDD99 dataset and emphasized the importance

of payload information in network packets. They showed that discarding the

payload leads to an undesirable information loss and indicated that with payload

information the system outperformed RNN. Much research work confirms that RNN

outperforms MLFF networks in detection accuracy and generalization capability.

The Cerebellar Model Articulation Controller (CMAC) NN [67] is an additional

type of RNN which is capable of incremental learning. This avoids

retraining the NN every time a new intrusion is detected.

2.8.4 Self Organizing Maps (SOM)

SOM and Adaptive Resonance Theory (ART) are two unsupervised

Neural Networks based on statistical clustering algorithms. They group objects by

similarity measure and are suitable for intrusion detection tasks. When grouped

in this way, normal behavior is densely populated around one or two centers, whereas abnormal

behavior or intrusions appear in sparse regions as outliers. SOM are Single Layer

Feed Forward Networks (SLFFN) in which data is clustered on a low dimensional grid

[68]. SOM preserves the topological relationships of input data according to their similarity

and is one of the most popular NN. Fox first employed SOM to detect viruses in a

multiuser machine in 1990. Researchers [69, 70] used SOMs to learn patterns of

normal system activities, which were then used for misuse detection. Other

classification algorithms, such as FFNN, were then trained on the output from the

SOM. Sarasamma et al. [71] proposed a method that calculates the probability that a

record mapped to a heterogeneous neuron belongs to a given type of attack. A confidence

factor was defined to determine the type of attack that dominated the neuron. They

showed that different subsets of features were good at detecting different attacks.

The results showed that false positive rates were significantly reduced in hierarchical

SOMs as compared to single layer SOMs. Rhodes [72] examined network packets

and stated that every network protocol layer has a unique structure and function.

Malicious activities aimed at a specific protocol should therefore also be unique, and it is

unrealistic to build a single SOM to tackle all these activities. They organized a

multilayer SOM in which each layer corresponds to one protocol layer. Zanero [73]

analyzed the payload of network packets and proposed a multi layer detection

framework. The K-means algorithm was used to avoid calculating the distance between

each neuron. This greatly improved the runtime efficiency of the algorithm.
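
A single SOM training step can be sketched as below: the neuron closest to the input is found, and it and its grid neighbours are moved towards the input. This is a minimal illustration that assumes a one-dimensional grid, squared Euclidean distance, and a fixed learning rate and neighbourhood radius; the values are hypothetical and the sketch does not reproduce any of the cited systems.

```python
def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to the input x."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i]))

def som_step(weights, x, learning_rate=0.5, radius=1):
    """Move the best matching unit and its grid neighbours towards x (1-D grid)."""
    bmu = best_matching_unit(weights, x)
    updated = []
    for i, w in enumerate(weights):
        if abs(i - bmu) <= radius:
            w = [wi + learning_rate * (xi - wi) for wi, xi in zip(w, x)]
        updated.append(list(w))
    return bmu, updated

# Hypothetical 1-D map with four neurons and two input features.
weights = [[0.0, 0.0], [0.3, 0.3], [0.6, 0.6], [1.0, 1.0]]
bmu, new_weights = som_step(weights, [0.9, 0.9])
```

Because neighbouring neurons are updated together, inputs that are similar end up mapped to nearby grid positions, which is the topology preserving property mentioned above.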

Several NN techniques have been used in intrusion detection and are

regarded as landmarks in the development of IDS. The aim is to simulate the

operation of the human brain and to make the IDS flexible and adaptable to environmental

changes. An alternative approach to training ANNs uses an evolutionary algorithm to evolve

the weights of the ANN, referred to as an ENN [74]. Hybrid systems developed

using NN and Fuzzy Logic [75] performed well with limited training sets of labeled

alerts. Such hybrid systems provided a marked improvement in solutions for

real-world problems.

2.8.5 Bayesian Networks (BN)

BN is a probabilistic model that represents a set of variables and their

probabilistic independencies. BN are directed acyclic graphs with nodes

representing variables and edges representing the encoded conditional dependencies

between the variables [76]. They have been applied to AD in different ways and

have been utilized in the decision process of hybrid systems. Ben et al. [77]

developed an AD system that employed NB, a two layer BN that assumes complete

independence between the nodes. Most hybrid systems

obtain high false alarm rates because of the simplistic way in which

the outputs of their techniques are combined during the decision phase, and BN offer a

more sophisticated way of combining them. A hybrid host

based AD system may consist of detection techniques such as analyzing string length,

character distribution structure, and identifying learned tokens, in which a BN can

be used to decide the final output classification. Generally, in anomaly intrusion

detection, the number of possible features is large, but an attacker’s activity is

usually related to just a few features. Furthermore, the effectiveness of a specific

feature mainly depends on the behavior, and for this reason activity can be analyzed

using each feature independently. A typical AD method relies on statistical

analysis, with the advantage that it can generate a concise profile containing only a

statistical summary without maintaining the activities themselves. This lessens the

computation overhead of real time intrusion monitoring. However, when the value

of each feature varies widely, the statistical summary fails to form a concise

profile. Moreover, most conventional classification algorithms [78] do not consider

any updates in a data set and are not suitable for real time data. Consequently, the

concept of updating should be incorporated, and a classification method that

considers updates in the data set has been proposed. The basic assumption of

conventional classification algorithms is that the data set is fixed and available

before classification can be performed. This assumption is valid only when static

knowledge embedded in a specific data set is the target of clustering. Therefore, it is

very important to identify an appropriate data set that reflects the characteristics of

the target application domain very well. Hence conventional classification

algorithms pose limitations, as the normal behavior of a user is generally analyzed

off-line. Kok-Chin Khor et al. [79] implemented BN after selecting important features

with a feature selection algorithm and a filter approach. With respect to

performance, they concluded that the BN performed equivalently well in detecting

network attacks. Mutz [80] extended the work by proposing an application based

IDS that considers system call arguments during analysis of user commands. Most

IDS exclude this information, which is one reason for the occurrence of False

Negatives (FN), as it is possible to execute intrusions with valid system calls.

The authors also considered parameters such as CPU load, since an IDS that takes up

too many resources may prevent the user from using

the computer efficiently. In their work the CPU load remained relatively low, and

during stress tests the increase in CPU load was within 20% on average.
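
The independence assumption behind NB can be illustrated with a small categorical classifier. This is a minimal sketch assuming Laplace smoothing and two hypothetical connection features (a protocol and a connection flag); the records and labels are invented for illustration and are not drawn from any of the cited data sets.

```python
import math
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, feature index) -> value counts
    vocab = defaultdict(set)              # feature index -> values seen in training
    for rec, lab in zip(records, labels):
        for j, v in enumerate(rec):
            value_counts[(lab, j)][v] += 1
            vocab[j].add(v)
    return class_counts, value_counts, vocab

def predict_nb(model, rec):
    """Pick the class with the highest log posterior, treating features as independent."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for lab, n in class_counts.items():
        score = math.log(n / total)
        for j, v in enumerate(rec):
            # Laplace smoothing keeps unseen values from zeroing the product.
            score += math.log((value_counts[(lab, j)][v] + 1) / (n + len(vocab[j])))
        if score > best_score:
            best, best_score = lab, score
    return best

# Hypothetical training records: (protocol, connection flag).
records = [("tcp", "SF"), ("tcp", "SF"), ("udp", "SF"),
           ("tcp", "S0"), ("tcp", "S0"), ("icmp", "S0")]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]
model = train_nb(records, labels)
```

Each feature contributes its own factor to the posterior, so the model needs only per-feature counts and can be updated incrementally as new labeled records arrive.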

2.8.6 Decision Trees (DT)

DTs are popular in IDS, as they yield good performance and offer some

benefits over other ML techniques. For example, they learn quickly compared to

ANN, and the tree structure built from the training data can be used to produce rules

for ES. DTs cannot generalize to new attacks in the same manner as certain other

ML approaches, and they are not suitable for anomaly detection. Recent findings

demonstrate that DTs are very sensitive to the training data and do not learn well

from imbalanced data. DTs have been successfully applied to IDS both as

standalone detectors and as part of hybrid systems. An example of the success of DTs is an

application of a C5.0 DT [81]. Much work has been carried out to examine the

performance of several ML techniques on the KDD Cup 99 data set, including a

C4.5 DT. The DT provided good accuracy but did not perform as well as other

techniques on some classes of intrusion. An ANN and k-means clustering obtained

higher detection rates and were able to generalize from learned data to new, unseen data.

Classification is a method of mapping from a set of attributes to a particular class.

DT induction is one of the classification algorithms in DM. The DT classifies a

given data item using the values of its attributes. The DT is constructed from a

pre-classified data set, which is also known as the training set. The main approach is to

select the attribute that best divides the data items into their classes. The major

problem is deciding which attribute will best partition the data into the various classes.

The ID3 algorithm uses the Information Gain (IG) approach to solve this problem,

based on the concept of Entropy, which measures the impurity of a set of data items. DT

induction has been implemented with several algorithms. ID3 was later extended

to C4.5 and C5.0. C4.5 avoids over fitting the data and can handle continuous

attributes. C4.5 builds the tree from a set of data items, using the best attribute as a test

to divide the data items into subsets, and then applies the same procedure to

each subset recursively. The best attribute to divide the subset at each stage is

selected using the IG of the attributes. Intrusion detection can be considered as a

classification problem in which each network connection is identified as either an

attack or normal based on some existing data. DT can solve the problem of intrusion

detection by learning a model from the data set. The DT can then be used to

classify a new data item into one of the classes specified in the data set. Learning

is based on the training data, and the model can then predict future data as either attack or

normal. DT works well with large data sets, which is important because large amounts of

network data flow across computer networks. The high performance of DT makes them

applicable to real time intrusion detection. Generalization accuracy is another

useful property of DT for an intrusion detection model. New attacks on the system that are small

variations of known attacks can also be detected after the model is built. This ability

to detect new intrusions is due to the generalization accuracy of DT.
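
The attribute selection step described above can be made concrete with a short sketch of entropy and IG for categorical attributes. The function names and the four toy records are hypothetical; the sketch shows only how the best attribute for one split is chosen, not a complete ID3 or C4.5 implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a set of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in entropy obtained by splitting on attribute index attr."""
    n = len(labels)
    partitions = {}
    for row, lab in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(lab)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels):
    """Attribute index with the highest information gain."""
    return max(range(len(rows[0])), key=lambda a: information_gain(rows, labels, a))

# Hypothetical records: (protocol, connection flag) with class labels.
rows = [("tcp", "SF"), ("tcp", "S0"), ("udp", "SF"), ("udp", "S0")]
labels = ["normal", "attack", "normal", "attack"]
split_on = best_attribute(rows, labels)
```

In this toy data the flag separates the classes perfectly while the protocol carries no information, so the flag is selected for the first split. Tree induction repeats the same choice recursively on each resulting subset.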

2.8.7 Support Vector Machines (SVM)

SVM is a supervised learning algorithm that is used increasingly in IDS.

The classification performance of the SVM model is reported to be better than that of other

classification methods, such as ANN [82]. One benefit of SVMs is that they learn very

effectively from high dimensional data. An SVM maps input feature vectors into a

higher dimensional feature space through some nonlinear mapping. SVMs can learn

a larger set of patterns and are able to scale better, because the classification

complexity does not depend on the dimensionality of the feature space. SVMs also

have the ability to update the training patterns dynamically whenever a new

pattern is detected during classification. The main disadvantage is that SVM can only

handle binary class classification whereas intrusion detection requires multi-class

classification.

The SVM is one of the most successful classification algorithms in the

DM area. However, its training time is a serious problem for large

data sets, which limits its use in DM applications that require the processing of

huge data sets. Training an SVM on a data set consisting of

one million records would normally take years. Many researchers have worked to enhance SVM in

order to improve its training performance [83-85], either through

random selection or through approximation of the marginal classifier. These approaches are

still not feasible, as they require multiple scans of the entire data set, which is also

expensive to perform [86]. Seo [87] applied SVM to Host-based AD of

masquerades. The work analyzes sequences of UNIX commands executed by

users on a host. Kim applied SVM with a Radial Basis Function (RBF) kernel,

analyzing commands over a sliding window, and achieved a detection rate of 80.1%.

Seo examined two different kernels, K-gram and String kernel, which yielded higher

detection rates of 89.61% and 97.40%, respectively. The drawback is the same as

with the RBF kernel employed by Kim, namely that the false positive rate is higher.

Seo also examined a hybrid of the two kernel methods, which gave nearly identical

results to those obtained by Kim. An unsupervised class of SVM was proposed by Dennis

[88], which has been adopted in several studies comparing its performance with

clustering techniques. SVMs are supervised learning algorithms that have been

applied increasingly to misuse detection in the last decade. One of the primary

benefits of SVMs is that they learn very effectively from high dimensional data.

Furthermore, they are trained very quickly. Mukkamala [89] conducted a

comparative study of feed forward MLP and SVM for misuse detection. Identical

detection rates were obtained, and the SVM training time was shorter

than that of the MLP. SVM algorithms are binary classifiers, which suffice only for

distinguishing between normal and attack. Recent SVM algorithms support multi

class learning [90]. The approach is to combine several two-class SVMs. Sung

and Mukkamala [91] applied SVM to network based intrusion detection with five

SVMs. For each SVM, the training data is partitioned into two classes as

normal or intrusions. The combination rule adopted is that the class of the SVM with the highest

output value is taken as the final output. Peddabachigari [92] conducted a practical

analysis of SVM and DT as standalone detectors and also as hybrids.

Comparing their performance, the results indicate that the hybrid

method performs better. Due to the magnitude of data involved in network-based

intrusion detection, Rung [93] proposed a hybrid which combines SVM with a

weighted voting scheme to shorten the training time. A hierarchical

clustering algorithm was employed to locate boundary points in the data that best

separate the two classes. These points were then used to train the SVM in an

iterative process. During each iteration the support vectors were recalculated and the

SVM was tested against a stopping criterion to determine whether a desired

threshold of accuracy had been exceeded. The evaluation was done on the DARPA98

data set, and the accuracy improved, mainly because more DoS attacks were correctly

classified. However, there was an increase in false positive rates.

Song [94] proposed a Robust SVM (RSVM) that was developed to better deal with

noise. The RSVM was applied to host based intrusion detection by Hu [95]. The

benefit of using RSVM is that it produces fewer support vectors, which makes it a

quicker algorithm.
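
The rule of taking the SVM with the highest output value as the final output can be sketched as follows. The sketch assumes linear two-class SVMs whose weights and biases are already known; the three class names and all parameter values are hypothetical hand-set numbers, not trained models from the cited studies.

```python
def decision_value(weights, bias, x):
    """Signed output of one linear two-class SVM: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def one_vs_rest_predict(models, x):
    """Evaluate every two-class SVM and return the class with the highest output."""
    outputs = {label: decision_value(w, b, x) for label, (w, b) in models.items()}
    return max(outputs, key=outputs.get), outputs

# Hypothetical, hand-set parameters for three one-vs-rest SVMs (not trained values).
models = {
    "normal": ([-1.0, -1.0], 1.0),
    "dos":    ([2.0, 0.0], -1.0),
    "probe":  ([0.0, 2.0], -1.0),
}
label, outputs = one_vs_rest_predict(models, [0.9, 0.1])
```

Each SVM only has to separate its own class from everything else, so a multi-class detector can be assembled from binary classifiers without changing the underlying algorithm.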

Ganapathy [96] pointed out that SVM can obtain generalization ability

with less training time through simulation experiments on a few artificial and real

benchmark function approximation and classification problems. They have indicated

that SVMs can perform well in text classification problems. A significant recent

contribution showed the relationship between Extreme Learning Machines (ELM)

and SVM in the context of classification [97]. Researchers have since

made a more in depth exploration of this relationship, and compared the

performance of ELM, SVM, and Least Squares SVM (LSSVM) [98]. ELM provides

a unified learning platform for different applications, such as regression, binary, and

multiclass classification, for the LSSVM, Proximal SVM (PSVM) [99] and other

regularization algorithms. ELM avoids the manual tuning of control

parameters such as learning rate and learning epochs, which are difficult to manage in

traditional approaches, and reaches good solutions analytically. ELM can be

implemented and used easily, and its fast learning speed and response time

are keys to success in the design of IDS. The ELM algorithm

tends to achieve similar or better generalization performance at much faster learning

speed than the SVM and LSSVM algorithms. However, several

aspects still need further consideration. Recent experimental investigations focus

mainly on comparisons of SVM and ELM. Both have been applied to a variety of

examples, but the advantages and disadvantages of applying these methods to

real time network intrusion detection are still unknown. Knowing such information may

provide more insight into the SVM and ELM algorithms, because the former is based

on the Structural Risk Minimization (SRM) principle, which is especially suited to

learning from small samples, while the latter is based on the inductive principle known as

Empirical Risk Minimization (ERM). The results can strengthen the understanding

of the essential relationship between SVM and ELM. They can also serve as

complementary knowledge for the past experimental and theoretical comparisons

between them. SVM algorithms are binary classifiers that are sufficient to

distinguish between normal and intrusive data. Recent SVM algorithms support

multi class learning. The approach combines several two-class SVMs, and for each

SVM the training data is partitioned into two classes so that one represents an

original class and the other represents the attacks. It is also necessary to specify

an upper bound parameter C, which is determined experimentally. This requires a

cross-validation procedure, which is wasteful of both computation and data.

Kernel based ML algorithms are based on mapping data from the original

input feature space to a kernel feature space of higher dimensionality to solve a

linear problem in that space. These methods allow us to interpret and design learning

algorithms geometrically in the kernel space. SVM is one of several kernel

based techniques available in the field of ML. The choice of a proper kernel function

plays an important role in SVM based classification and regression, but it is difficult to

choose the one that gives the best generalization for a given dataset. Many kernels

have been proposed in the SVM literature. Cheng [100] created a kernel function

suitable for the training data using a GA mechanism and showed that the genetic

kernel has good generalization abilities when compared with the polynomial and the

RBF kernel functions.

Ye [101] proposed an orthogonal Chebyshev kernel function. Chebyshev

polynomials are first constructed through the Chebyshev formulae. Then, based on these

polynomials, Chebyshev kernels are created that satisfy Mercer's condition. They

showed that it is possible to reduce the number of support vectors using this kernel.

Wang et al. [102] proposed the Weighted Mahalanobis Distance Kernels. They first

find the data structure for each class in the input space via agglomerative

hierarchical clustering and then construct the weighted Mahalanobis distance kernels,

which are affected by the size of the clusters they reside in. Xu [103] proposed using the

weighted Levenshtein distance as a kernel function for strings. They used the UCI

splice site recognition dataset for testing their proposed specific kernel which got the

best results in this problem. They used the boosting paradigm to construct the

learned kernel. Their approach is suitable in learning tasks where the test data

distribution is different from the training data distribution. Lodhi [104] introduced a

novel kernel for comparing two text documents. The kernel is an inner product in the

feature space consisting of all subsequences of length k. A subsequence is any ordered

sequence of k characters occurring in the text, though not necessarily contiguously.

These subsequences are weighted by a decay factor of their full

length in the text, hence putting some emphasis on contiguous characters. Rieck et al.

[105] proposed an algorithm for the computation of similarity measures for sequential

data. The algorithm uses suffix trees for efficient calculation of various kernel

functions. Its worst-case run-time is linear in the length of the sequences and

independent of the underlying embedding language, which can cover words,

k-grams or all contained subsequences. Experiments with network intrusion

detection, DNA analysis and text processing applications

demonstrate the utility of distances and similarity coefficients for sequences as

alternatives to classical kernel functions.
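
A kernel over k-grams, one of the embedding languages mentioned above, can be sketched as the inner product of two k-gram count vectors. This is a plain counting implementation for illustration only: it is not the suffix tree algorithm of [105] and does not achieve its linear run-time, and the command sequences are hypothetical.

```python
from collections import Counter

def kgram_counts(sequence, k):
    """Count every contiguous k-gram in a sequence of symbols."""
    return Counter(tuple(sequence[i:i + k]) for i in range(len(sequence) - k + 1))

def kgram_kernel(seq_a, seq_b, k=2):
    """Inner product of the two k-gram count vectors."""
    a, b = kgram_counts(seq_a, k), kgram_counts(seq_b, k)
    return sum(count * b[gram] for gram, count in a.items())

# Hypothetical command sequences issued by two users.
s1 = ["ls", "cd", "ls", "cd"]
s2 = ["cd", "ls", "cd", "vi"]
similarity = kgram_kernel(s1, s2)
```

Sequences that share many short contiguous patterns obtain a high kernel value, so the function can be passed to a kernel based classifier in place of a polynomial or RBF kernel.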

Many of the detection results reported to date using ML algorithms such as

DT, NN and SVM indicate that attacks involving more features in the data set have

substantially lower detection rates. Hence feature relevance analysis is another

research area of interest to substantiate the performance of ML IDS. The objective is

to investigate the relevance of the features with respect to dataset labels. That is, for

normal behavior and each type of attack the system should determine the most

relevant feature, which best discriminates the given class from the others.

To achieve this, IG, which is the underlying feature selection measure for

constructing DT, can be used. For a given class, the feature with the highest IG is

considered the most discriminative feature. Researchers have proposed several

methods of feature selection to achieve real time IDS. The major benefit of feature

selection is that the amount of data to be processed is significantly reduced

without compromising detection performance. In some cases, feature

selection may even improve detection performance, as it reduces the

complexity of the problem by reducing its dimensionality.

2.9 IMPORTANCE OF FEATURE SELECTION FOR IDS

Data preprocessing is considered an important step in IDS. The

amount of data that needs to be examined to detect an attack is very large

even for a small network. Analysis is difficult, as the number of features

available in the data set can make it harder to detect suspicious behavior patterns.

As complex relationships exist between the features, it is better to reduce the amount

of data to be processed by the IDS. This is particularly important if real time intrusion

detection is required. Data can be reduced by filtering out the data

that is not useful. Data can also be grouped or clustered, storing the

characteristics of the clusters instead of the individual data. Feature selection is an

important preprocessing technique that can improve classification performance by

reducing the computational complexity, and it is an important step in

building intrusion detection models [106,107]. It also increases the available

time for detecting intrusions, but most of the work is still done manually and

feature selection depends strongly on expert domain knowledge. ML techniques

provide the wrapper and the filter models for automatic feature selection. The major

problem that many researchers face is how to choose the optimal set of features,

because not all features are relevant to the learning algorithm. Irrelevant and

redundant features with noisy data can affect the learning algorithm by severely

degrading the performance with respect to training and testing time. Feature

selection has been shown to have a significant impact on the performance of

classifiers. Many researchers, as in [108-110], illustrate that feature selection can

reduce the building and testing time of a classifier. Currently the two most important

models are the filter model and the wrapper model. In the filter model, the

statistical characteristics of a data set are considered directly, without reference to any

learning algorithm. The filter model uses a measure such as correlation, consistency,

or distance to compute the relevance of a set of features. In contrast, the

wrapper model assesses the selected features by the performance of a learning algorithm:

it uses the predictive accuracy of a classifier as a

means to evaluate the “goodness” of a feature set. Hence the wrapper model

requires more time [111] and more computational resources to find the

best feature subsets. To increase

computational efficiency, the filter method is usually used for selecting features

from high dimensional data sets. It is well known that redundant features can

reduce the performance of IDS. A major challenge in the IDS feature selection

process is to choose appropriate measures that can precisely determine the relevance

and the relationship between features of a given data set.
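
The contrast between the two models can be illustrated with a minimal sketch. It is not taken from the surveyed work: the function names, the choice of Pearson correlation as the filter measure, and the nearest-centroid classifier used inside the wrapper are all assumptions made for illustration. The wrapper here searches exhaustively, which is only feasible for very small feature counts and shows why it costs more than the filter.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / (var_x * var_y) ** 0.5


def filter_select(rows, labels, k):
    """Filter model: rank features by |correlation with the label| and keep
    the top k. No learning algorithm is consulted."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels))
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def centroid_accuracy(rows, labels, features):
    """Training accuracy of a nearest-centroid classifier (an illustrative
    stand-in for any learning algorithm) restricted to a feature subset."""
    centroids = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [mean(r[j] for r in members) for j in features]

    def predict(r):
        def sq_dist(c):
            return sum((r[j] - m) ** 2 for j, m in zip(features, centroids[c]))
        return min(centroids, key=sq_dist)

    return mean(1.0 if predict(r) == y else 0.0 for r, y in zip(rows, labels))


def wrapper_select(rows, labels, k):
    """Wrapper model: score every k-feature subset by classifier accuracy."""
    n_features = len(rows[0])
    best = max(combinations(range(n_features), k),
               key=lambda fs: centroid_accuracy(rows, labels, fs))
    return list(best)
```

The filter makes one pass over each feature, whereas the wrapper trains and evaluates a classifier for every candidate subset, which is the source of its higher time and resource requirements.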

2.10 RELEVANCE VECTOR MACHINES (RVM)

In spite of good performance with different datasets, SVM still suffers

from shortcomings such as difficulty in visualizing and interpreting the model and

the need to choose a kernel and kernel specific parameters. Recently RVM, another

kernel based approach, has been explored for classification and regression problems.

RVM, proposed by Tipping [112], is a sparse ML algorithm that is similar to the SVM

in many respects. RVM is of interest to the research community because it offers a

number of advantages over the SVM: the availability of probabilistic predictions,

the ability to use arbitrary kernel functions and the absence of a regularization

parameter that has to be set. RVM is based on a Bayesian formulation of a linear

model with an appropriate sparse weight prior distribution. The sparseness property

enables selection of the proper kernel at each location by pruning all irrelevant

kernels, which results in a sparse data representation. As a result, RVMs can generalize

well and provide inferences at very low computational cost [113]. Through the use

of proper kernels, SVM can achieve good generalization performance. Among its

desirable properties, SVM fits functions in high dimensional feature spaces, where a

large space of functions is available. It is also sparse, which means that only a

subset of the training data set is retained at runtime, improving computational

efficiency. Although relatively sparse, SVM makes unnecessary use of basis

functions, as the number of Support Vectors (SVs) required typically grows linearly

with the size of the training data set. SVM outputs a point estimate in regression

and a binary decision in classification. As a result it is difficult to estimate the

conditional distribution that captures the uncertainty in a prediction. In SVM the

kernel function must also be a continuous symmetric kernel of a positive integral

operator in order to satisfy Mercer’s condition, a restriction that RVM does not

impose. While maintaining classification accuracy, RVM can yield a

decision function that is much sparser than that of SVM. This leads to a

significant reduction in the computational complexity of the decision function,

making it more suitable for real time applications.

The RVM produces a function composed of a set of kernel

functions, also known as basis functions, and a set of weights. This function

represents a model of the system that is presented to the learning process through

a set of training data. Once the learning process has calculated the kernels and

weights, they and the model function defined by the weighted sum of kernels are fixed.

From the set of training vectors the RVM selects a sparse subset of input vectors

which are deemed to be relevant by the probabilistic learning scheme [114]. This

subset is used to build a function that estimates the output of the system from

its inputs. These relevance vectors form the basis functions that make up the

model function.
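
The fixed model function described above, a weighted sum of kernels centred on the retained relevance vectors, can be sketched as follows. This covers prediction only, not the Bayesian learning procedure, and the Gaussian kernel, the `gamma` width parameter and all function names are assumptions chosen for illustration. For two-class problems the probabilistic output is obtained by passing the weighted sum through a logistic sigmoid.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is an illustrative width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def rvm_decision(x, relevance_vectors, weights, bias=0.0, kernel=rbf_kernel):
    """Model function: bias plus the weighted sum of kernels, one kernel
    per retained relevance vector."""
    return bias + sum(w * kernel(x, rv)
                      for w, rv in zip(weights, relevance_vectors))


def rvm_probability(x, relevance_vectors, weights, bias=0.0, kernel=rbf_kernel):
    """Two-class probabilistic output: logistic sigmoid of the decision value."""
    y = rvm_decision(x, relevance_vectors, weights, bias, kernel)
    return 1.0 / (1.0 + math.exp(-y))
```

Because only the relevance vectors enter the sum, the cost of one prediction grows with the number of retained vectors rather than with the size of the training set.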

2.11 CURRENT STATE OF IDS

IDS deployments typically combine security functions such as a firewall and IPS/IDS

with filtering functions such as anti-spam, antivirus and URL filtering. A recent

challenge in developing IDS is to build security software solutions and appliances

that defend against the threats faced by enterprise networks. The main focus is to

develop systems that work in real time with detection, prevention and response [115].

Detection can be done through either static signatures or anomaly detection. New

research focuses on approaches that secure the network by reasoning about

risks. Such reasoning can take place before an attack happens and so limits exposure to

threats. A general framework for IPS uses a trigger based approach to perform reactive

network measurement [116]. NN approaches combine the complexity of some of

the statistical techniques with the ML objective of imitating human intelligence.

This is done at a more “unconscious” level, and hence there is no accompanying

ability to make learned concepts transparent to the user. Although a variety of

security tools incorporating AD functionality exists, important problems remain

to be solved. IDS are continuously evolving with the goal of improving the security and

protection of networks and computer infrastructures, but several open research

issues remain. Some of the most significant challenges in the area are:

1. Low detection efficiency: The high FP rate calls for the

exploration and development of new, accurate processing schemes, as

well as better structured approaches to modeling network systems.

2. Low throughput and high cost: Because of the high data rates involved,

work aims to optimize intrusion detection using grid

techniques and distributed detection paradigms.

3. Absence of appropriate metrics: Owing to the lack of a general

framework for evaluating and comparing different techniques, assessing

IDS is a real challenge. Research shows that most IDS

perform poorly in defending themselves from attacks, and significant

effort is needed to improve intrusion detection technology in

this respect.

2.11.1 Intrusion Prevention System (IPS)

The inadequacies inherent in current defenses have driven the

development of a new breed of security products known as IPS [117]. IPS software

has all the capabilities of IDS and can also attempt to stop possible incidents. This

section provides an overview of IPS technologies and describes the key functions

and methodologies that they use. An overview of the major classes of IPS technologies

is also provided in [6, 58]. The purpose of an IPS is not only to detect that an attack

is occurring, but also to stop it. It can therefore be regarded as an advanced

combination of a firewall and an IDS. Recent trends in industry show that more and

more companies are choosing IPS-based solutions over IDS-based solutions,

primarily because of the need to actively block worm and hacker attacks instead of

passively monitoring them as an IDS would do. IPS research has its roots

in IDS research, and some researchers define IPSs as IDSs with

added functionalities.

An IPS can thus be defined as an in-line product that focuses on identifying

and blocking malicious network activity in real time. In general, there are two

categories, namely rate based and content based IPS. The devices often look like

firewalls and often have some basic firewall functionality. However, firewalls block all

traffic except that for which they have a reason to pass, whereas IPS pass all traffic

except that for which they have a reason to block.
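
The difference in default policy between the two device types can be made concrete with a minimal sketch. The packet representation (a dictionary) and the rule format (a predicate on a packet) are hypothetical and chosen only for illustration.

```python
def firewall_allows(packet, allow_rules):
    """Default deny: pass a packet only if some allow rule matches it."""
    return any(rule(packet) for rule in allow_rules)


def ips_allows(packet, block_rules):
    """Default allow: pass a packet unless some block rule matches it."""
    return not any(rule(packet) for rule in block_rules)
```

With empty rule sets the firewall passes nothing and the IPS passes everything, which is exactly the asymmetry described above.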

2.11.1.1 Rate based IPS

Rate based IPS blocks traffic on the basis of network load, such as too many

packets in a specified time, too many connections per unit time or too many

errors generated. When any of these conditions occurs, a rate based IPS kicks

in and blocks, throttles or otherwise mediates the traffic. The most useful rate based IPS

combine powerful configuration options with a range of response

technologies. Typical measures include limiting queries to the DNS server and

other simple rules covering bandwidth and connection limits. A rate based IPS

can set a threshold for the maximum amount of traffic that may be directed

at a given port or service. If the threshold is exceeded, the IPS blocks all further

traffic from the offending source IP only, still allowing other users to use that service.
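
The threshold behavior described above can be sketched as a sliding-window counter. The class name, the per-(source, port) counting scheme and the permanent block are assumptions made for illustration; commercial products differ in how they count traffic and how long a block lasts.

```python
from collections import defaultdict, deque


class RateBasedIPS:
    """Illustrative rate-based blocker; not modeled on any specific product."""

    def __init__(self, max_packets, window_seconds):
        self.max_packets = max_packets        # threshold per source and port
        self.window_seconds = window_seconds  # length of the sliding window
        self.recent = defaultdict(deque)      # (src_ip, dst_port) -> timestamps
        self.blocked = set()                  # sources that exceeded the threshold

    def allow(self, src_ip, dst_port, now):
        """Return True to pass the packet, False to drop it."""
        if src_ip in self.blocked:
            return False
        times = self.recent[(src_ip, dst_port)]
        while times and now - times[0] >= self.window_seconds:
            times.popleft()                   # forget packets outside the window
        times.append(now)
        if len(times) > self.max_packets:
            self.blocked.add(src_ip)          # block this source only
            return False
        return True
```

Choosing `max_packets` and `window_seconds` is the baseline problem discussed next: the sketch only enforces a threshold, it does not help establish one.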

The major problem in deploying rate based IPS products is deciding what

constitutes an overload. For any rate based IPS to work properly, the network owner

needs to know not only what “normal” traffic levels are but also other network

details such as how many connections their servers can handle. However, most

commercial products do not yet provide any help in establishing this base line

behavior, but require the services of a “trained” product specific systems engineer

who often spends hours on site setting up the IPS. Because rate based IPS require

frequent tuning and adjustment, they are most useful in very high volume Web,

application and mail server environments.

2.11.1.2 Content based IPS

Content based IPS is also referred to as signature and anomaly based IPS. It

blocks traffic on the basis of attack signatures and protocol anomalies and is the

natural evolution of IDS and firewalls. If packets do not comply with TCP/IP,

or if any suspicious behavior is detected, the IPS triggers and blocks future traffic

from that host. Recent content based IPS offer a range of techniques for

identifying malicious content and many options for handling attacks, ranging

from simply dropping bad packets to dropping future packets from the same attacker,

together with advanced reporting and alerting strategies. Because content based IPS offer

intrusion detection like technology for identifying and blocking threats, they can be used

deep inside the network to complement firewalls and provide security policy

enforcement. They also often require less manual maintenance and fine tuning than

rate based methods to perform a useful function. The major challenge in

designing an IPS is that it must work in line, presenting a potential

choke point and single point of failure. If a passive IDS fails, the worst that can happen

is that some attempted attacks go undetected. If an in-line device fails, it can

seriously impact the performance of the network: latency may rise to an

unacceptable value, and a self inflicted DoS condition may also

occur. Even if the IPS device does not fail altogether, it still has the potential to act

as a bottleneck, increasing latency and reducing throughput as it struggles to keep up

with a Gigabit or more of network traffic.

2.11.2 Intrusion Response System (IRS)

The task of most traditional IDSs is to detect intrusion, but once the alert

is generated human intervention is required, and implementing an automated

response is certainly a challenge. For a traditional IRS, such a response involves

notifying the central decision core, waiting for its arbitration and applying the decision.

Current IRSs meet only a subset of the above challenges, and none addresses all of them. The general principles followed in the development of IRSs naturally classify them into two categories.

2.11.2.1 Static Decision Making

This class of IRS provides a static mapping from each alert raised by the detector to the response that is to be deployed. The IRS is essentially a look-up table in which the administrator has anticipated all possible alerts in the system and an expert has indicated the response to be taken for each. In some cases the response site is the same as the site from which the alarm was flagged, as with the responses often bundled with anti-virus products (disallow access to the file that was detected to be infected) or network-based IDS (terminate a network connection that matches a signature for anomalous behavior).
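
A static mapping of this kind amounts to a dictionary look-up, as the following minimal sketch suggests. The alert names, the response names and the fallback to notifying the administrator are hypothetical; the two table entries mirror the anti-virus and network-based IDS examples given above.

```python
# Hypothetical alert-to-response table, fixed in advance by an expert.
STATIC_RESPONSES = {
    "infected_file": "deny_file_access",
    "signature_match": "terminate_connection",
}


def static_response(alert_type, table=STATIC_RESPONSES,
                    default="notify_administrator"):
    """Return the pre-assigned response for an alert type. Alerts that were
    not anticipated fall back to an assumed default action."""
    return table.get(alert_type, default)
```

The table never changes at run time, so an alert that the administrator did not anticipate can only receive the default action.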

2.11.2.2 Dynamic Decision Making

This class of IRS reasons about an ongoing attack based on the observed alerts and determines an appropriate response to take. The first step in the reasoning process is to determine which services in the system are likely to be affected, taking into account the characteristics of the detector, the network topology, etc. The actual choice of response then depends on a host of factors, such as the amount of evidence about the attack and the severity of the response. The challenges in designing an IRS are that attacks launched through automated scripts are fast moving and that the owner of the distributed system does not have knowledge of, or access to, the internals of the different services.
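
One way to picture the trade-off between evidence and response severity is the sketch below. It is a deliberately simplified assumption, not a description of any published IRS: alerts are treated as independent, each response carries a hypothetical severity and a minimum evidence level, and the most severe response that the evidence justifies is chosen. Determining the affected services from the network topology is omitted.

```python
# (name, severity, minimum evidence required); all values are hypothetical.
RESPONSES = [
    ("log_only", 0, 0.0),
    ("throttle_service", 1, 0.5),
    ("isolate_host", 2, 0.9),
]


def combined_evidence(alert_confidences):
    """Probability that at least one alert is genuine, assuming the alerts
    are independent (a simplifying assumption)."""
    p_all_false = 1.0
    for confidence in alert_confidences:
        p_all_false *= 1.0 - confidence
    return 1.0 - p_all_false


def dynamic_response(alert_confidences, responses=RESPONSES):
    """Choose the most severe response whose evidence requirement is met."""
    evidence = combined_evidence(alert_confidences)
    eligible = [r for r in responses if evidence >= r[2]]
    return max(eligible, key=lambda r: r[1])[0]
```

Unlike the static table, the same alert can lead to different responses here, depending on what else has been observed.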

2.11.3 Artificial Immune Systems

Artificial Immune Systems (AIS) have been extensively researched in the last decade, mainly for AD, as the model lends itself conveniently to it. Several researchers concluded that the model has problems with scalability, which limits its application to real problems. Consequently, some researchers considered alternative models, while others have in recent years proposed enhancements to address scalability [118-120].