MINISTRY OF EDUCATION AND TRAINING
VIETNAMESE ACADEMY OF SCIENCE AND TECHNOLOGY

GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
————————————

NGUYEN VAN TRUONG

IMPROVING SOME ARTIFICIAL IMMUNE ALGORITHMS
FOR NETWORK INTRUSION DETECTION

THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN MATHEMATICS

Hanoi - 2019


Major: Mathematical foundations for Informatics
Code: 62 46 01 10

Scientific supervisors:
1. Assoc. Prof., Dr. Nguyen Xuan Hoai
2. Assoc. Prof., Dr. Luong Chi Mai

Acknowledgments

First of all, I would like to thank my principal supervisor, Assoc. Prof., Dr. Nguyen Xuan Hoai, for introducing me to the field of Artificial Immune Systems. He guided me step by step through research activities such as seminar presentations and paper writing, and his insight has been a constant source of help. His constructive criticism sustained me throughout my PhD journey. I also wish to thank my co-supervisor, Assoc. Prof., Dr. Luong Chi Mai, who was always enthusiastic in our discussions of promising research questions. It has been a pleasure and a privilege to work with her. This thesis would not have been possible without my supervisors' support.

I gratefully acknowledge the support of the Institute of Information Technology, Vietnamese Academy of Science and Technology, and of Thai Nguyen University of Education. I also thank the National Foundation for Science and Technology Development (NAFOSTED) and the ASEAN-European Academic University Network (ASEA-UNINET) for their financial support.

I thank M.Sc. Vu Duc Quang, M.Sc. Trinh Van Ha and M.Sc. Pham Dinh Lam, my co-authors on published papers. I thank Assoc. Prof., Dr. Tran Quang Anh and Dr. Nguyen Quang Uy for many helpful insights into my research. I thank my colleagues, especially my labmate Mr. Nguyen Tran Dinh Long, at the IT Research & Development Center, HaNoi University.

Finally, I thank my family for their endless love and steady support.

Certificate of Originality

I hereby declare that this submission is my own work, carried out under my scientific supervisors, Assoc. Prof., Dr. Nguyen Xuan Hoai and Assoc. Prof., Dr. Luong Chi Mai. I declare that it contains no material previously published or written by another person, except where due reference is made in the text of the thesis. In addition, I certify that all my co-authors have allowed me to present our joint work in this thesis.

Hanoi, 2019
PhD. student
Nguyen Van Truong

Contents

List of Figures
List of Tables
Notation and Abbreviation

INTRODUCTION
  Motivation
  Objectives
  Problem statements
  Outline of thesis

1 BACKGROUND
  1.1 Detection of Network Anomalies
    1.1.1 Host-Based IDS
    1.1.2 Network-Based IDS
    1.1.3 Methods
    1.1.4 Tools
  1.2 A brief overview of human immune system
  1.3 AIS for IDS
    1.3.1 AIS model for IDS
    1.3.2 AIS features for IDS
  1.4 Selection algorithms
    1.4.1 Negative Selection Algorithms
    1.4.2 Positive Selection Algorithms
  1.5 Basic terms and definitions
    1.5.1 Strings, substrings and languages
    1.5.2 Prefix trees, prefix DAGs and automata
    1.5.3 Detectors
    1.5.4 Detection in r-chunk detector-based positive selection
    1.5.5 Holes
    1.5.6 Performance metrics
    1.5.7 Ring representation of data
    1.5.8 Frequency trees
  1.6 Datasets
    1.6.1 The DARPA-Lincoln datasets
    1.6.2 UT dataset
    1.6.3 Netflow dataset
    1.6.4 Discussions
  1.7 Summary

2 COMBINATION OF NEGATIVE SELECTION AND POSITIVE SELECTION
  2.1 Introduction
  2.2 Related works
  2.3 New Positive-Negative Selection Algorithm
  2.4 Experiments
  2.5 Summary

3 GENERATION OF COMPACT DETECTOR SET
  3.1 Introduction
  3.2 Related works
  3.3 New negative selection algorithm
    3.3.1 Detectors set generation under rcbvl matching rule
    3.3.2 Detection under rcbvl matching rule
  3.4 Experiments
  3.5 Summary

4 FAST SELECTION ALGORITHMS
  4.1 Introduction
  4.2 Related works
  4.3 A fast negative selection algorithm based on r-chunk detector
  4.4 A fast negative selection algorithm based on r-contiguous detector
  4.5 Experiments
  4.6 Summary

5 APPLYING HYBRID ARTIFICIAL IMMUNE SYSTEM FOR NETWORK SECURITY
  5.1 Introduction
  5.2 Related works
  5.3 Hybrid positive selection algorithm with chunk detectors
  5.4 Experiments
    5.4.1 Datasets
    5.4.2 Data preprocessing
    5.4.3 Performance metrics and parameters
    5.4.4 Performance
  5.5 Summary

CONCLUSIONS
  Contributions of this thesis
  Future works
  Published works

BIBLIOGRAPHY

List of Figures

1.1 Classification of anomaly-based intrusion detection methods
1.2 Multi-layered protection and elimination architecture
1.3 Multi-layer AIS model for IDS
1.4 Outline of a typical negative selection algorithm
1.5 Outline of a typical positive selection algorithm
1.6 Example of a prefix tree and a prefix DAG
1.7 Existence of holes
1.8 Negative selections with 3-chunk and 3-contiguous detectors
1.9 A simple ring-based representation (b) of a string (a)
1.10 Frequency trees for all 3-chunk detectors
2.1 Binary tree representation of the detectors set generated from S
2.2 Conversion of a positive tree to a negative one
2.3 Diagram of the Detector Generation Algorithm
2.4 Diagram of the Positive-Negative Selection Algorithm
2.5 One node is reduced in a tree: a compact positive tree has 4 nodes (a) and its conversion (a negative tree) has 3 nodes (b)
2.6 Detection time of NSA and PNSA
2.7 Nodes reduction on trees created by PNSA on Netflow dataset
2.8 Comparison of nodes reduction on Spambase dataset
3.1 Diagram of an algorithm to generate a perfect rcbvl detectors set
4.1 Diagram of the algorithm to generate the positive r-chunk detectors set
4.2 A prefix DAG G and an automaton M
4.3 Diagram of the algorithm to generate the negative r-contiguous detectors set
4.4 An automaton representing a 3-contiguous detectors set
4.5 Comparison of ratios of runtime of r-chunk detector-based NSA to runtime of Chunk-NSA
4.6 Comparison of ratios of runtime of r-contiguous detector-based NSA to runtime of Cont-NSA

List of Tables

1.1 Performance comparison of NSAs on linear strings and ring strings
2.1 Comparison of memory and detection time reductions
2.2 Comparison of nodes generation on Netflow dataset
3.1 Data and parameters distribution for experiments and results comparison
4.1 Comparison of our results with the runtimes of previously published algorithms
4.2 Comparison of Chunk-NSA with r-chunk detector-based NSA
4.3 Comparison of proposed Cont-NSA with r-contiguous detector-based NSA
5.1 Features for NIDS
5.2 Distribution of flows and parameters for experiments
5.3 Comparison between PSA2 and other algorithms
5.4 Comparison between ring string-based PSA2 and linear string-based PSA2

Notation and Abbreviation

Notation

ℓ            Length of data samples
Sr           Set of ring representations of all strings in S
|X|          Cardinality of set X
Σ            An alphabet, a nonempty and finite set of symbols
Σ^k          Set of all strings of length k on alphabet Σ, where k is a positive integer
Σ*           Set of all strings on alphabet Σ, including the empty string
r            Matching threshold
D_p^i        Set of all positive r-chunk detectors at position i
D_n^i        Set of all negative r-chunk detectors at position i
CHUNKp(S, r) Set of all positive r-chunk detectors
CHUNK(S, r)  Set of all negative r-chunk detectors
CONT(S, r)   Set of all r-contiguous detectors
L(X)         Set of all nonself strings detected by X
rcbvl        r-contiguous bits with variable length

    Abbreviation

    AIS Artificial Immune System

    ACC Accuracy Rate

    ACO Ant Colony Optimization

    ANIDS Anomaly Network Intrusion Detection System

    BBNN Block-Based Neural Network

    Chunk-NSA Chunk Detector-Based Negative Selection Algorithm

    Cont-NSA Contiguous Detector-Based Negative Selection Algorithm

    DR Detection Rate

    DAG Directed Acyclic Graph

    FAR False Alarm Rate

    GA Genetic Algorithm

    HIS Human Immune System

    HIDS Host Intrusion Detection System

    IDS Intrusion Detection System


    ML Machine Learning

    MLP Multilayer Perceptron

    NIDS Network Intrusion Detection System

    NS Negative Selection

    NSA Negative Selection Algorithm

    NSM Negative Selection Mutation

    PNSA Positive-Negative Selection Algorithm

    PSA Positive Selection Algorithm

    PSA2 Two-class Positive Selection Algorithm

    PSO Particle Swarm Optimization

    PSOGSA Particle Swarm Optimization-Gravitational Search Algorithm

    RNSA Real-valued NSA

    SVM Support Vector Machines

    TCP Transmission Control Protocol

    VNSA Variable length detector-based NSA


    INTRODUCTION

    Motivation

Internet users and computer networks are suffering from a rapidly increasing number of attacks. To keep them safe, effective security monitoring systems, such as Intrusion Detection Systems (IDS), are needed. However, intrusion detection faces a number of difficult problems: large network traffic volumes, imbalanced data distributions, decision boundaries between normal and abnormal actions that are hard to determine, and a requirement for continuous adaptation to a constantly changing environment. As a result, many researchers have tried different types of approaches to build reliable intrusion detection systems.

Computational intelligence techniques, known for their adaptability, fault tolerance, high computational speed and resilience to noisy information, are promising alternative approaches to the problem.

One of the promising computational intelligence methods for intrusion detection that has emerged recently is the artificial immune system (AIS), inspired by the biological immune system. The negative selection algorithm (NSA), a dominant model of AIS, is widely used in intrusion detection systems (IDS) [55, 52]. Despite its successful application, NSA has some weaknesses: (1) high false positive rate (false alarm rate) and false negative rate; (2) high training and testing time; (3) an exponential relationship between the size of the training data and the number of detectors possibly generated for testing; (4) changeable definitions of "normal data" and "abnormal data" in a dynamic network environment [55, 79, 92]. To overcome these limitations, recent works concentrate on complex structures of immune detectors, matching methods and hybrid NSAs [11, 94, 52].

Following the trends mentioned above, in this thesis we investigate the ability of NSA to combine with other classification methods and propose more effective data representations to mitigate some of NSA's weaknesses.

The scientific meaning of the thesis is to provide further background for improving the performance of AIS-based computer security in particular and of IDS in general.

The practical meaning of the thesis is to assist computer security practitioners and experts in implementing their IDS with new AIS-derived features.

The major contributions of this research are: proposing a new representation of data for better IDS performance; proposing a combination of existing algorithms, as well as some statistical approaches, in a uniform framework; and proposing a complete and non-redundant detector representation to achieve optimal time and memory complexity.

    Objectives

Since data representation is one of the factors that affect training and testing time, a compact and complete detector generation algorithm is investigated.

The thesis investigates optimal algorithms to generate detector sets in AIS. They help to reduce both the training time and the detection time of AIS-based IDSs.

The thesis also proposes and investigates an AIS-based IDS that can promptly detect attacks, whether known or never seen before. The proposed system uses AIS combined with statistics as its analysis method and flow-based network traffic as experimental data.

    Problem statements

Since the NSA has the limitations listed in the first section, this thesis concentrates on three problems:

1. The first problem is to find compact representations of data. The objective of this problem's solution is not only to minimize memory storage but also to reduce testing time.

2. The second problem is to propose algorithms that reduce training time and testing time compared with all existing related algorithms.

3. The third problem is to improve detection performance, reducing false alarm rates while keeping the detection rate and accuracy rate as high as possible.

Solutions to these problems can partly address the first three weaknesses listed in the first section. Regarding the last weakness of NSAs, the changeable definitions of "normal data" and "abnormal data" in a dynamic network environment, we consider it a risk in our proposed algorithms and leave it for future work.

Logically, it is impossible to find a single optimal algorithm that both reduces time and memory complexity and obtains the best detection performance; these aspects always conflict with each other. Thus, in each chapter, we propose algorithms that solve each problem fairly independently.

The intrusion detection problem addressed in this thesis can be informally stated as follows:

Given a finite set S of network flows, each labeled as self (normal) or nonself (abnormal), the objective is to build classifying models on S that can correctly label an unlabeled network flow s.
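This problem statement can be made concrete with a toy sketch. It classifies binary-encoded flows with the standard r-chunk matching rule (which the thesis formalizes in Chapter 1): a flow is labeled nonself as soon as one of its length-r windows was never observed in the self set at that position. The function names and the tiny self set below are illustrative only, not the thesis's own code.

```python
# Toy detector-based classifier for the stated problem, using the standard
# r-chunk matching rule over fixed-length binary strings.

def train(self_set, ell, r):
    """Record, for each window position i, the r-length chunks seen in self."""
    seen = [set() for _ in range(ell - r + 1)]
    for s in self_set:
        for i in range(ell - r + 1):
            seen[i].add(s[i:i + r])
    return seen

def classify(model, s, r):
    """'nonself' iff some window of s matches a negative r-chunk detector,
    i.e. a chunk never observed in self at that position."""
    for i, chunks in enumerate(model):
        if s[i:i + r] not in chunks:
            return "nonself"
    return "self"

S = ["0110", "0100"]                 # self (normal) flows as bit strings
model = train(S, ell=4, r=2)
print(classify(model, "0100", 2))    # -> self
print(classify(model, "1111", 2))    # -> nonself
```

Training here is linear in the size of S; the detector representations studied in later chapters aim at the same kind of cost while guaranteeing completeness and non-redundancy of the detector set.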

    Outline of thesis

The first chapter introduces the background knowledge necessary to discuss the algorithms proposed in the following chapters. First, detection of network anomalies is briefly introduced. Following that, the human immune system, artificial immune systems, machine learning and their relevance are reviewed and discussed. Then, the popular datasets used for experiments in the thesis are examined.

In Chapter 2, a combination method of selection algorithms is presented. The proposed technique helps to reduce the storage of detectors generated in the training phase. Testing time, an important measurement in IDS, is also reduced as a direct consequence of the smaller memory footprint. A tree structure is used in this chapter (and in Chapter 5) to improve time and memory complexity.

A complete and non-redundant detector set, also called a perfect detector set, is necessary to achieve acceptable self and nonself coverage of classifiers. A selection algorithm to generate a perfect detector set is investigated in Chapter 3. Each detector in the set is a string concatenated from overlapping classical ones. Unlike the approaches in the other chapters, the discrete structure of the string-based detectors in this chapter makes them suitable for detection in distributed environments.

Chapter 4 presents two selection algorithms with a fast training phase. These optimal algorithms can generate a detector set in linear time with respect to the size of the training data. The experimental results and theoretical proofs show that the proposed algorithms outperform all existing ones in terms of training time. In terms of detection time, the first algorithm is linear and the second is polynomial.

Chapter 5 mainly introduces a hybrid approach combining a positive selection algorithm with statistics for a more effective NIDS. The frequencies of self and nonself data (strings) are stored in the leaves of the trees representing detectors. This information plays an important role in improving the performance of the proposed algorithms. The hybrid approach results in a new positive selection algorithm for two-class classification that can be trained with samples from both self and nonself data.

Chapter 1

BACKGROUND

The human immune system (HIS) successfully protects our bodies against attacks from various harmful pathogens, such as bacteria, viruses and parasites. It distinguishes pathogens from self-tissue and then eliminates them. This provides a rich source of inspiration for computer security systems, especially intrusion detection systems [92]. Hence, applying theoretical immunology, observed immune functions, and their principles and models to IDS has gradually developed into a new research field, called artificial immune systems (AIS).

How to apply the remarkable features of the HIS to achieve a scalable and robust IDS is considered a research gap in the field of computer security. In this chapter, we introduce the background knowledge necessary to discuss the algorithms proposed in the following chapters, which can partly fill this gap.

First, a brief introduction to network anomaly detection is presented. We then give an overview of the HIS. Next, immune selection algorithms, detectors, performance metrics and their relevance are reviewed and discussed. Finally, some popular datasets are examined.

    1.1 Detection of Network Anomalies

The idea of intrusion detection is predicated on the belief that an intruder's behavior is noticeably different from that of a legitimate user and that many unauthorized actions are detectable [65]. Intrusion detection systems (IDSs) are deployed as a second line of defense along with other preventive security mechanisms, such as user authentication and access control. Based on its deployment, an IDS can act either as a host-based or as a network-based IDS.

1.1.1 Host-Based IDS

A Host-Based IDS (HIDS) monitors and analyzes the internals of a computing system. A HIDS may detect internal activity such as which program accesses what resources and attempts illegitimate access, for example, an activity that modifies the system password database. Similarly, a HIDS may look at the state of a system and its stored information, whether in RAM, in the file system, in log files or elsewhere. Thus, one can think of a HIDS as an agent that monitors whether anything or anyone, internal or external, has circumvented the security policy that the operating system tries to enforce [12].

1.1.2 Network-Based IDS

A Network-Based IDS (NIDS) detects intrusions in network data. Intrusions typically occur as anomalous patterns. Most techniques model the data in a sequential fashion and detect anomalous subsequences. The primary reason for these anomalies is the attacks launched by outside attackers who want to gain unauthorized access to the network to steal information or to disrupt it. In a typical setting, a network is connected to the rest of the world through the Internet. The NIDS reads all incoming packets or flows, trying to find suspicious patterns. For example, if a large number of TCP connection requests to a very large number of different ports are observed within a short time, one could assume that someone is committing a port scan against some of the computers in the network. Port scans are mostly detected in the same manner that an ordinary intrusion detection system detects incoming shell codes. In addition to inspecting incoming traffic, a NIDS also provides valuable information about intrusions from outgoing or local traffic. Some attacks might even be staged from the inside of a monitored network or network segment, and are therefore not regarded as incoming traffic at all. The data available to intrusion detection systems can be at different levels of granularity, such as packet-level traces or Cisco NetFlow data. The data is high dimensional, typically with a mix of categorical and continuous numeric attributes. Misuse-based NIDSs attempt to search for known intrusive patterns, while an anomaly-based intrusion detector searches for unusual patterns. Today, intrusion detection research is mostly concentrated on anomaly-based network intrusion detection because it can detect both known and unknown attacks [12].
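The port-scan heuristic just described can be sketched in a few lines: flag any source address that contacts more than a threshold number of distinct destination ports within a short time window. The 60-second window and the 100-port threshold below are illustrative values, not figures taken from the thesis.

```python
# Hedged sketch of the port-scan heuristic described above: a source that
# contacts many distinct destination ports within a short window is flagged.
from collections import defaultdict

def find_port_scanners(flows, window=60.0, max_ports=100):
    """flows: iterable of (timestamp, src_ip, dst_port) records."""
    seen = defaultdict(list)            # src_ip -> [(timestamp, dst_port)]
    scanners = set()
    for ts, src, port in sorted(flows):
        # keep only this source's records that fall inside the time window
        recent = [(t, p) for t, p in seen[src] if ts - t <= window]
        recent.append((ts, port))
        seen[src] = recent
        if len({p for _, p in recent}) > max_ports:
            scanners.add(src)
    return scanners

# A burst of 150 distinct ports from one host within a few seconds,
# against steady single-port traffic from another host:
burst = [(i / 100, "10.0.0.9", 1000 + i) for i in range(150)]
normal = [(float(i), "10.0.0.5", 80) for i in range(150)]
print(find_port_scanners(burst + normal))   # flags only 10.0.0.9
```

A production NIDS would of course stream flows rather than sort them in memory; the sketch only illustrates the detection rule itself.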

    1.1.3 Methods

On the basis of the availability of prior knowledge, the detection mechanism used, the mode of performance and the ability to detect attacks, existing anomaly detection methods are categorized into six broad categories [41], as shown in Fig. 1.1, which is adapted from [12].

[Figure 1.1 depicts a taxonomy tree rooted at Anomaly Detection: Supervised Learning (Parametric, Non-Parametric); Unsupervised Learning (Clustering, Association Mining, Outlier Mining); Probabilistic; Soft Computing (ANN-based, Rough Set-based, Fuzzy Logic, GA-based & Ant Colony, Artificial Immune System); Knowledge-based (Rule-based & Expert System-based, Ontology & Logic-based); Combination Learners (Ensemble-based, Fusion-based, Hybrid).]

Figure 1.1: Classification of anomaly-based intrusion detection methods

AIS is a fairly new research subfield of computational intelligence, which is concerned with systems that act intelligently: what such a system does is appropriate for its circumstances and its goals; it is flexible to changing environments and changing goals; it learns from experience; and it makes appropriate choices given perceptual limitations and finite computation [68].

1.1.4 Tools

IDS tools are used for purposes such as information gathering, victim identification, packet capture, network traffic analysis and visualization of traffic behavior. Examples of such tools, both commercial and free, include Snort, Suricata, Bro, OSSEC, Samhain, Cisco Secure IDS, CyberCop, and RealSecure. Immune-related IDS tools include LISYS [10], which is based on TCP packets, and MILA [26], a multilevel immune learning algorithm proposed for novel pattern recognition.

However, despite their initially promising and influential properties, immune-based IDSs never made it beyond the prototype stage [83]. Two main issues that impeded the progress of immune algorithms have been identified: the large computational cost of achieving acceptable coverage of the potentially anomalous region [54], and the failure of these algorithms to generalize properly beyond the training set [79].

    1.2 A brief overview of human immune system

    Mainly being inspired by the human immune system, researchers have devel-

    oped AISs intellectually and innovatively. Physical barriers, physiological barriers, an

    innate immune system, and an adaptive immune system are main factors of a multi-

    layered protection architecture included in our human immune system; among which,

    the adaptive immune system being capable of adaptively recognizing specific types of

    pathogens, and memorizing them for accelerated future responses is a complex of a

    variety of molecules, cells, and organs spread all over the body [46]. Pathogens are for-

    eign substances like viruses, parasites and bacteria which attack the body. Figure 1.2,

    adapted from [77], presents a multi-layered protection and elimination architecture.

    T cells and B cells cooperate to distinguish self from nonself. On the one hand,

    T cells recognize antigens with the help of major histocompatibility complex (MHC)

    molecules. Antigen presenting cells ingest and fragment antigens to peptides. MHC

    molecules transport these peptides to the surface of antigen presenting cells. T cells,

    whose receptors bind with these peptide-MHC combinations, are said to recognize

  • 9

    Figure 1.2: Multi-layered protection and elimination architecture

    antigens. On the other hand, B cells recognize antigens by binding their receptors

    directly to antigens. The bindings actually are chemical bonds between receptors and

    epitopes. The more complementary the structure and the charge between receptors and

    epitopes are, the more likely binding will occur. The strength of the bond is termed

    affinity. To avoid autoimmunity, T cells and B cells must pass a negative selection

    stage, where lymphocytes matching self cells are killed.

    Prior to negative selection, T cells undergo positive selection. This is because in

    order to bind to the peptide-MHC combinations, they must recognize self MHC first.

    Thus, the positive selection will eliminate T cells with weak bonds to self MHC. T cells

    and B cells, which survive the negative selection, become mature, and enter the blood

    stream to perform the detection task. Since these mature lymphocytes have never

    encountered antigens, they are naive. Naive T cells and B cells can possibly auto-react

    with self cells, because some peripheral self proteins are never presented during the

    negative selection stage. To prevent self-attack, naive cells need two signals in order

    to be activated: one occurs when they bind to antigens, and the other is from other

    sources as a confirmation. Naive T helper cells receive the second signal from innate

    system cells. In the event that they are activated, T cells begin to clone. Some of

    the clones will send out signals to stimulate macrophages or cytotoxic T cells to kill

    antigens, or send out signals to activate B cells. Others will form memory T cells. The

    activated B cells migrate to a lymph node. In the lymph node, a B cell will clone itself.

  • 10

    Meanwhile, somatic hyper mutation is triggered, whose rate is 10 times higher than

    that of the germ line mutation, and is inversely proportional to the affinity. Mutation

    changes the receptor structures of offspring; hence offspring have to bind to pathogenic

    epitopes captured within the lymph nodes. If they do not bind, they will simply die

    after a short time. If, however, they succeed in binding, they will leave the lymph

    node and differentiate into plasma or memory B cells.

    In summary, the HIS is a distributed, self-organizing and lightweight defense

    system for the body. These remarkable features align well with the design goals of

    an intrusion detection system, thus resulting in a scalable and robust system [53].

    1.3 AIS for IDS

    1.3.1 AIS model for IDS

    Figure 1.3 illustrates the steps necessary to obtain an AIS solution for a security

    problem, as first envisioned by de Castro and Timmis [27] and later adopted

    by Fernandes et al. [35]. Firstly, the security domain of the system to model needs

    to be identified. Secondly, the immune entities that best fit the needs of the system

    should be picked from the immunological theories; this eases the choice of

    representation for the entities. In the affinity-measure step, one should define

    a matching rule that determines whether two elements bind.

    Figure 1.3: Multi-layer AIS model for IDS


    1.3.2 AIS features for IDS

    According to Kim et al. [55], AIS features can be illustrated and summarized

    as follows.

    Firstly, a distributed IDS supports robustness, configurability, extendibility and

    scalability. It is robust since the failure of one local intrusion detection process does

    not cripple the overall IDS. It is also easy to configure a system since each intrusion

    detection process can be simply tailored for the local requirements of a specific host.

    The addition of new intrusion detection processes running on different operating sys-

    tems does not require modification of existing processes and hence it is extensible. It

    can also scale better, since the high volume of audit data is distributed amongst many

    local hosts and is analyzed by those hosts.

    Secondly, a self-organizing IDS provides adaptability and global analysis. With-

    out external management or maintenance, a self organizing IDS automatically detects

    intrusion signatures which are previously unknown and/or distributed, and eliminates

    and/or repairs compromised components. Such a system is highly adaptive because

    there is no need for manual updates of its intrusion signatures as network environments

    change. Global analysis emerges from the interactions among a large number of varied

    intrusion detection processes.

    Next, a lightweight IDS supports efficiency and dynamic features. A lightweight

    IDS does not impose a large overhead on a system or place a heavy burden on CPU

    and I/O. It places minimal work on each component of the IDS. The primary functions

    of hosts and networks are not adversely affected by the monitoring. It also dynamically

    covers intrusion and non-intrusion pattern spaces at any given time, rather than

    maintaining the entire intrusion and non-intrusion patterns.

    One more important feature is a multi-layered IDS which increases robustness.

    The failure of one-layer defense does not necessarily allow an entire system to be

    compromised. While a distributed IDS allocates intrusion detection processes across

    several hosts, a multi-layered IDS places different levels of sensors at one monitoring

    place.

    Additionally, a diverse IDS provides robustness. A variety of different intrusion


    detection processes spread across hosts will slow an attack that has successfully com-

    promised one or more hosts. This is because an understanding of the intrusion process

    at one site provides limited or no information on intrusion processes at other sites.

    Finally, it is a disposable IDS that increases robustness, extendibility and config-

    urability. A disposable IDS does not depend on any single component. Any component

    can be easily and automatically replaced with other components. These properties are

    important in an effective IDS, as well as being established properties of the HIS.

    1.4 Selection algorithms

    The main developments within AIS have focused on three immunological theories:

    clonal selection, immune networks and negative selection. Negative selection

    approaches are based on self-nonself discrimination in biological systems. This property

    makes them attractive for computer and network security researchers. A survey by G. C.

    Silva and D. Dasgupta in [71] showed that in the five-year period 2008-2013, NSAs

    predominated over all other AIS models in terms of published papers relating to both

    network security and anomaly detection. This trend motivates much of the research work in

    this thesis.

    Another AIS model, the positive selection algorithm (PSA), is also investigated. Under

    some conditions, we will prove in a following section that PSA is equivalent to NSA in terms

    of anomaly detection performance.

    1.4.1 Negative Selection Algorithms

    Negative selection is a mechanism employed to protect the body against self-

    reactive lymphocytes. Such lymphocytes can occur because the building blocks of

    antibodies are different gene segments that are randomly composed and undergo a fur-

    ther somatic hypermutation process. Therefore, this process can produce lymphocytes

    which are able to recognise self-antigens [85].

    NSAs are among the most popular and extensively studied techniques in ar-

    tificial immune systems that simulate the negative selection process of the biological

    immune system. Stephanie Forrest et al. [38] proposed an algorithmic model of this


    process, which can be considered as a classifier that learns from only self samples

    (negative examples)1.

    A typical NSA comprises two phases: detector generation and detection [7,

    50]. In the detector generation phase (Fig. 1.4.a), the detector candidates are generated

    by some random processes and censored by matching them against given self samples

    taken from a set S (representing the system components). The candidates that match

    any element of S are eliminated and the rest are kept and stored in the detector

    set D. In the detection phase (Fig. 1.4.b), the collection of detectors are used to

    distinguish self (system components) from nonself (outliers, anomalies, etc.). If an incoming

    data instance matches any detector, it is claimed as nonself or anomaly. Figure 1.4 is

    adapted from [38].


    (a) Generation of detector set (b) Detection of new instances

    Figure 1.4: Outline of a typical negative selection algorithm.
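The two phases outlined above can be made concrete with a minimal Python sketch. This is an illustration only, not the thesis implementation: it enumerates the whole binary universe exhaustively in place of random candidate generation (feasible only for tiny string lengths) and uses the positional r-contiguous matching rule.

```python
from itertools import product

def matches_rcb(d, s, r):
    # r-contiguous rule: detector d matches string s if they agree
    # on at least r contiguous positions
    return any(d[i:i+r] == s[i:i+r] for i in range(len(s) - r + 1))

def generate_detectors(self_set, ell, r):
    # censoring phase: keep every candidate that matches no self sample
    # (exhaustive enumeration stands in for random generation)
    return {''.join(c) for c in product('01', repeat=ell)
            if not any(matches_rcb(''.join(c), s, r) for s in self_set)}

def classify(x, detectors, r):
    # detection phase: nonself iff some detector matches
    return 'nonself' if any(matches_rcb(d, x, r) for d in detectors) else 'self'
```

For instance, with ℓ = 5, r = 3 and the self set {01111, 00111, 10000, 10001, 10010, 10110, 11111}, exactly two detectors survive the censoring: 01011 and 11011.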

    The concept of matching, or recognition, is used both in the detector generation phase

    and in the anomaly detection phase. Regardless of representation, a matching rule on

    a detector d and a data sample s can be informally defined as a distance measure

    between d and s within a threshold. The matching threshold exposes the concept of partial

    matching: two points do not have to be exactly the same to be considered matching.

    1 At the time of writing this thesis, the paper had been cited more than 2300 times.


    A partial matching rule can support approximation or generalization in the algorithms.

    The choice of the matching rule or of the threshold in a matching rule must be

    application-specific and representation-dependent [51]. For real-valued representations,

    some popular rules are the Euclidean distance and the Manhattan distance. For string

    representations, the rcb (r-contiguous bits) matching rule and the r-chunk matching rule

    are the best known; they are formally presented in the following section.

    Since its introduction, NSA has had many applications such as in computer virus

    detection [37, 5], monitoring UNIX processes [36], anomaly detection [22, 26], intrusion

    detection [19, 54, 46, 59, 18, 93], scheduling [64], fault detection and diagnosis [45, 72],

    negative database [33, 98], negative authentication [25, 20]. Moreover, NSA has been

    quite successfully applied in immunology where they are used as models to provide

    insight into fundamental principles of immunity and infection [15], and to illustrate

    the immunological processes such as HIV infection [56, 57].

    The most significant characteristics of an NSA, which account for its uniqueness and

    strength, are:

    • No prior knowledge of nonself is required [29].

    • It is inherently distributable; no communication between detectors is needed [30].

    • It can hide the self concept [33].

    • Compared with other change detection methods, NSAs do not depend on prior

    knowledge of what is defined as normal. Consequently, the checking activity at each

    site can be based on that site's unique signature, while the same algorithm is used over

    multiple sites.

    • The quality of the check can be traded off against the cost of performing a check

    [38].

    • Symmetric protection is provided, so malicious manipulation of the detector set

    can be detected through the normal behavior of the system [38].

    • If the process of generating detectors is costly, it can be distributed to multiple

    sites because of its inherent parallel characteristics.


    • Detection is tunable to balance between coverage (matching probability) and the

    number of detectors [29].

    1.4.2 Positive Selection Algorithms

    Contrary to NSAs, PSAs have been less studied in the literature. PSAs are

    mainly developed and applied in intrusion detection [23, 73, 44, 66], malware detection [39],

    spam detection [81], and classification [40, 67]. Stibor et al. [80] argue that

    positive selection might have better detection performance than negative selection.

    However, for problems and applications where the number of detectors generated by NSAs is

    much smaller than the number of self samples, negative selection is obviously a better choice [51].

    Similar to NSA, a PSA contains two phases: detector generation and detection.

    In the detector generation phase (Fig. 1.5.a), the detector candidates are generated by

    some random processes and matched against the given self sample set S. The candi-

    dates that do not match any element in S are eliminated and the rest are kept and

    stored in the detector set D. In the detection phase (Fig. 1.5.b), the collection of de-

    tectors are used to distinguish self from nonself. If incoming data instance matches any

    detector, it is claimed as self.

    (a) Generation of detector set (b) Detection of new instances

    Figure 1.5: Outline of a typical positive selection algorithm.

    In other words, detector modeling involves generating a


    set of strings (patterns) that either do not match any string in a training dataset too strongly

    (negative selection) or weakly match at least one string from the same dataset (positive

    selection). Having obtained the detectors, one usually examines a testing dataset

    (i.e., "antigens"), for which one searches for one or all matching detectors for classification.

    1.5 Basic terms and definitions

    In selection algorithms, an essential component is the matching rule, which determines

    the similarity between detectors and self samples (in the detector generation

    phase) and incoming data instances (in the detection phase). Obviously, the matching

    rule depends on the detector representation. In this thesis, both self and nonself cells

    are represented as strings of fixed length. This is a simple and popular

    representation for detectors and data in AIS, and other representations (such as real-valued

    ones) can be reduced to binary strings, a special case of strings [42, 51].

    1.5.1 Strings, substrings and languages

    An alphabet Σ is a nonempty, finite set of symbols. A string s ∈ Σ∗ is a

    sequence of symbols from Σ, and its length is denoted by |s|. A string is called the empty

    string if its length equals 0. Given an index i ∈ {1, . . . , |s|}, s[i] is the symbol

    at position i in s. Given two indices i and j, whenever j ≥ i, s[i . . . j] is the

    substring of s with length j − i + 1 that starts at position i, and whenever j < i,

    s[i . . . j] is the empty string. If i = 1, then s[i . . . j] is a prefix of s and, if j = |s|,

    then s[i . . . j] is a suffix of s. For a proper prefix or suffix s′ of s, we have in addition

    |s′| < |s|. Given a string s ∈ Σℓ, another string d ∈ Σr with 1 ≤ r ≤ ℓ, and an index

    i ∈ {1, . . . , ℓ − r + 1}, we say that d occurs in s at position i if s[i . . . i + r − 1] = d.

    Moreover, the concatenation of two strings s and s′ is written s + s′.

    A set of strings S ⊆ Σ∗ is called a language. For two indices i and j, we define

    S[i . . . j] = {s[i . . . j]|s ∈ S}.


    1.5.2 Prefix trees, prefix DAGs and automata

    A prefix tree T is a rooted directed tree with edge labels from Σ where for all

    c ∈ Σ, every node has at most one outgoing edge labeled with c. For a string s, we write

    s ∈ T if there is a path from the root of T to a leaf such that s is the concatenation

    of the labels on this path. The language L(T ) described by T is defined as the set of

    all strings that have a nonempty prefix s ∈ T. For example, for T as in Fig. 1.6.a we

    have 0 ∈ T and 10 ∈ T, but 1 ∉ T. Furthermore, 0 ∈ L(T) and 01 ∈ L(T) since 0 ∈ T, and

    11 ∉ L(T) since no prefix of 11 lies in T. Trees for a self dataset and a nonself dataset are

    called positive trees and negative trees, respectively.
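A prefix tree and its two membership tests (s ∈ T and s ∈ L(T)) can be rendered in a few lines of Python. This is an illustrative sketch of the definitions above, not the data structure used later in the thesis:

```python
class PrefixTree:
    def __init__(self):
        self.children = {}  # symbol -> PrefixTree

    def insert(self, s):
        node = self
        for c in s:
            node = node.children.setdefault(c, PrefixTree())

    def contains(self, s):
        # s ∈ T: s is the concatenation of labels on a root-to-leaf path
        node = self
        for c in s:
            if c not in node.children:
                return False
            node = node.children[c]
        return not node.children  # the path must end at a leaf

    def in_language(self, s):
        # s ∈ L(T): some nonempty prefix of s lies in T
        node = self
        for c in s:
            if c not in node.children:
                return False
            node = node.children[c]
            if not node.children:
                return True
        return False
```

On the tree of Fig. 1.6.a (built from the strings 0 and 10), `contains` and `in_language` reproduce the example above: 0 ∈ T, 10 ∈ T, 1 ∉ T, 01 ∈ L(T), 11 ∉ L(T).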

    A prefix DAG D is a directed acyclic graph with edge labels from Σ, where

    again for all c ∈ Σ, every node has at most one outgoing edge labeled with c. Similar

    to prefix trees, the terms root and leaf are used to refer to a node without incoming and

    outgoing edges, respectively.

    a leaf node in D that is labeled by s. Given n ∈ D, the language L(D,n) contains

    all strings that have a nonempty prefix that labels a path from n to some leaf. For

    instance, if D is the DAG in Fig. 1.6.b and n is its lower left node, then L(D,n) consists

    of all strings starting with 11. Moreover, we define L(D) = ∪_{n is a root of D} L(D, n).

    A finite automaton is a tuple M = (Q, q0, Qa, Σ, ∆), where Q is a set of states

    with a distinguished initial state q0 ∈ Q, Qa ⊆ Q is the set of accepting states, Σ is the

    alphabet of M, and ∆ ⊆ Q × Σ × Q is the transition relation. Furthermore, we assume

    that the transition relation is unambiguous: for every q ∈ Q and every c ∈ Σ there is

    at most one q′ ∈ Q with (q, c, q′) ∈ ∆. It is common to represent the transition relation

    as a graph with node set Q (with the initial state and the accepting states highlighted

    properly) and labeled edges (a c-labeled edge from q to q′ if q′ ∈ Q with (q, c, q′) ∈ ∆).

    An automaton M is said to accept a string s if its graph contains a path from q0 to

    some q ∈ Qa whose concatenated edge labels equal s (note that this path may contain

    loops). The language L(M) contains all strings accepted by M .

    A prefix DAG can be turned into a finite automaton to decide the membership

    of strings in languages. The detailed steps of this process are presented in Chapter 4.
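Because the transition relation is unambiguous, running the automaton on a string is a simple deterministic walk. The following Python sketch (an illustration of the definition, with a hypothetical automaton encoded as a transition dictionary) checks acceptance:

```python
def accepts(delta, q0, accepting, s):
    # delta maps (state, symbol) -> state; unambiguity of the transition
    # relation makes this dictionary encoding possible
    q = q0
    for c in s:
        if (q, c) not in delta:
            return False  # no transition: the run dies
        q = delta[(q, c)]
    return q in accepting
```

For example, an automaton accepting exactly the strings that start with 11 (like L(D, n) in the prefix DAG example) can be given as `delta = {('q0','1'):'q1', ('q1','1'):'q2', ('q2','0'):'q2', ('q2','1'):'q2'}` with accepting state q2; note that the accepting state carries a loop, as permitted by the definition.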



    Figure 1.6: Example of a prefix tree and a prefix DAG.

    1.5.3 Detectors

    In PSAs and NSAs, an essential component is the matching rule, which determines

    the similarity between detectors and self samples (in the detector generation

    phase) and incoming data instances (in the detection phase). Obviously, the matching

    rule depends on the detector representation. For string-based AIS, the r-chunk and

    r-contiguous matching rules are among the most common. The r-chunk matching

    rule can be seen as a generalisation of the r-contiguous matching rule, which helps

    AIS achieve better results on data where adjacent regions of the input data sequence

    are not necessarily semantically correlated, such as in network data packets [9].

    An important difference between the rcb and r-chunk matching rules is the holes, or

    undetectable strings, that they may induce. This concept is presented in Section 1.5.5.

    Given a nonempty, finite alphabet Σ, positive and negative

    r-chunk detectors, r-contiguous detectors and rcbvl detectors can be defined as follows:

    Definition 1.1 (Positive r-chunk detectors). Given a self set S ⊆ Σℓ, a tuple (d, i) of

    a string d ∈ Σr, where r ≤ ℓ, and an index i ∈ {1, ..., ℓ − r + 1} is a positive r-chunk

    detector if there exists an s ∈ S such that d occurs in s at position i.

    Definition 1.2 (Negative r-chunk detectors). Given a self set S ⊆ Σℓ, a tuple (d, i) of

    a string d ∈ Σr, r ≤ ℓ, and an index i ∈ {1, ..., ℓ − r + 1} is a negative r-chunk detector

    if d does not occur at position i in any s ∈ S.

    Although the approaches proposed in the following chapters can be implemented

    over any finite alphabet, all examples use binary strings, Σ = {0, 1}, for ease of

    understanding.

    Example 1.1. Let ℓ = 6, r = 3. Given a set S of five self strings: s1 = 010101, s2

    = 111010, s3 = 101101, s4 = 100011, s5 = 010111. The set of some positive r-chunk

    detectors is {(010,1), (111,1), (101,2), (110,2), (010,3), (101,3), (101,4), (010,4),

    (111,4)}. The set of some negative r-chunk detectors is {(000,1), (001,1), (011,1),

    (001,2), (010,2), (100,2), (000,3), (100,3), (000,4), (001,4), (100,4)}.
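Definitions 1.1 and 1.2 can be checked mechanically. The following Python sketch (illustrative, binary alphabet assumed) enumerates the full positive and negative r-chunk detector sets, of which Example 1.1 lists a subset:

```python
from itertools import product

def chunk_detectors(self_set, ell, r):
    """All positive and negative r-chunk detectors (Defs. 1.1 and 1.2).
    Returns (Dp, Dn): dicts mapping a 1-based position i to sets of r-strings."""
    all_chunks = {''.join(c) for c in product('01', repeat=r)}
    Dp, Dn = {}, {}
    for i in range(1, ell - r + 2):
        # r-substrings occurring at position i in some self string
        seen = {s[i-1:i-1+r] for s in self_set}
        Dp[i] = seen
        Dn[i] = all_chunks - seen  # everything else is a negative detector
    return Dp, Dn
```

On the self set of Example 1.1 this gives, e.g., Dp[1] = {010, 111, 101, 100} and Dn[3] = {000, 100, 111}; at every position the two sets partition Σ^3.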

    Definition 1.3. Given a self set S ⊆ Σℓ, a string d ∈ Σℓ is an r-contiguous detector if

    d[i . . . i + r − 1] does not occur at position i in any s ∈ S, for all i ∈ {1, ..., ℓ − r + 1}.

    Example 1.2. Let ℓ = 5, r = 3. Given a set of 7 self strings S = {01111, 00111,

    10000, 10001, 10010, 10110, 11111}. The set of all 3-contiguous detectors is {01011,

    11011}. This example is adapted from [32].

    We also use the following notations:

    • Dpi = {(d, i) | (d, i) is a positive r-chunk detector} is the set of all positive r-chunk

    detectors at position i, i = 1, . . . , ℓ − r + 1.

    • Dni = {(d, i) | (d, i) is a negative r-chunk detector} is the set of all negative r-chunk

    detectors at position i, i = 1, . . . , ℓ − r + 1.

    • CHUNKp(S, r) = ∪_{i=1}^{ℓ−r+1} Dpi is the set of all positive r-chunk detectors.

    • CHUNK(S, r) = ∪_{i=1}^{ℓ−r+1} Dni is the set of all negative r-chunk detectors.

    • CONT(S, r) is the set of all r-contiguous detectors, i.e., detectors that do not match

    any string in S.

    • For a given detector set X, L(X) is the set of all nonself strings detected by X.

    We also say that Σℓ \ L(X) is the set of all self strings detected by X.

    Example 1.3. Let ℓ = 5 and matching threshold r = 3. Suppose that we have the set S

    of six self strings s1 = 00000, s2 = 00010, s3 = 10110, s4 = 10111, s5 = 11000, s6 =

    11010. Dp1 = {(000,1), (101,1), (110,1)} (Dp1 is the set of all leftmost substrings of

    length r of all s ∈ S), Dn1 = {(001,1), (010,1), (011,1), (100,1), (111,1)}, Dp2 =

    {(000,2), (001,2), (011,2), (100,2), (101,2)}, Dn2 = {(010,2), (110,2), (111,2)}, Dp3

    = {(000,3), (010,3), (110,3), (111,3)}, Dn3 = {(001,3), (011,3), (100,3), (101,3)}

    (note that Dpi ∪ Dni = Σ3, i = 1, 2, 3).


    The self space covered by CHUNKp(S, 3) is {0, 1}^5 \ L(CHUNKp(S, 3))

    = {00000, 00001, 00010, 00011, 00110, 00111, 01000, 01001, 01010, 01011, 01110,

    01111, 10000, 10001, 10010, 10011, 10100, 10101, 10110, 10111, 11000, 11001, 11010,

    11011, 11110, 11111}. The set of all strings detected by CHUNK(S, 3) is L(CHUNK(S, 3))

    = {00001, 00011, 00100, 00101, 00110, 00111, 01000, 01001, 01010, 01011, 01100,

    01101, 01110, 01111, 10000, 10001, 10010, 10011, 10100, 10101, 11001, 11011, 11100,

    11101, 11110, 11111}.

    Definition 1.4. Given a self set S ⊆ Σℓ. A triple (d, i, j) of a string d ∈ Σk, where

    k = j − i + r, an index i ∈ {1, ..., ℓ − r + 1} and an index j ∈ {i, ..., ℓ − r + 1} is called

    a negative detector under the rcbvl matching rule if d does not occur at position i in any s ∈ S.

    In other words, a triple (d, i, j) is a rcbvl detector if there exist (j − i + 1) negative r-chunk

    detectors (d1, i), ..., (dj−i+1, j) such that dk and dk+1 are two (r − 1)-symbol overlapping strings,

    k = 1, ..., j − i.

    Example 1.4. Given ℓ, r and the set S of self strings as in Example 1.1, S = {010101,

    111010, 101101, 100011, 010111}. The triple (0001,1,2) is a rcbvl detector because there

    exist two negative 3-chunk detectors (000,1), (001,2) such that 000 and 001 are two 2-bit

    overlapping strings. A detector set under the rcbvl matching rule contains 5 variable-length

    detectors {(0001,1,2), (00100,1,3), (100,4,4), (011110,1,4), (11000,1,3)}. It is a minimal

    detector set (23 bits) that covers the whole detection space of the r-chunk detector set in

    Example 1.1 (45 bits).
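The decomposition in Definition 1.4 lends itself to a direct mechanical check. The sketch below (illustrative only, assuming the detector string spans positions i through j + r − 1, i.e. |d| = j − i + r) verifies a candidate triple against the negative r-chunk detector sets:

```python
from itertools import product

def negative_chunks(self_set, ell, r):
    # Dn[i]: r-strings not occurring at position i in any self string
    all_r = {''.join(c) for c in product('01', repeat=r)}
    return {i: all_r - {s[i-1:i-1+r] for s in self_set}
            for i in range(1, ell - r + 2)}

def is_rcbvl_detector(d, i, j, Dn, r):
    # (d, i, j) is an rcbvl detector iff |d| = j - i + r and every
    # r-window of d is a negative r-chunk detector at its position;
    # consecutive windows overlap in r - 1 symbols by construction
    return len(d) == j - i + r and all(
        d[k:k+r] in Dn[i + k] for k in range(j - i + 1))
```

On the self set of Example 1.1, the triples (0001,1,2), (11000,1,3) and (100,4,4) pass this check, while (010,1,1) fails because 010 occurs at position 1 in a self string.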

    The matching threshold r plays an important role in selection algorithms: its value

    can be used to balance between underfitting and overfitting. Our proposed

    methods in Chapter 5 investigate this value in combination with simple statistics to achieve

    better detection performance.

    1.5.4 Detection in r-chunk detector-based positive selection

    It can be seen from Example 1.3 that L(CHUNKp(S, r)) = {00100, 00101,

    01100, 01101, 11100, 11101} ≠ L(CHUNK(S, r)), so the detection coverage of Dn

    is not the same as that of Dp. This is undesirable for the combination of PSA and


    NSA. Hence, to combine PSA and NSA in a unified framework, we have to change the

    original semantic of positive selection in the detection phase as follows.

    Definition 1.5 (Detection in positive selection). If a new instance matches ℓ − r + 1

    positive r-chunk detectors (d_i, i), i = 1, . . . , ℓ − r + 1, one at each position, it is claimed

    as self; otherwise it is claimed as nonself.

    With this new detection semantics, the following theorem on the equivalence of the

    detection coverage of r-chunk-based PSA and NSA can be stated.

    Theorem 1.1 (Detection Coverage). The detection coverage of positive and negative

    selection algorithms coincide.

    L(CHUNKp(S, r)) = L(CHUNK(S, r)) (1.1)

    Proof. From the description of NSAs (see Fig. 1.4), if a new data instance matches

    a negative r-chunk detector, then it is claimed as nonself; otherwise it is claimed as

    self. Obviously, this is dual to the detection of new data instances in positive selection

    as given in Definition 1.5.

    This theorem lays the foundation for our novel Positive-Negative Selection

    Algorithm (PNSA) proposed in the next chapter.
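The duality in the proof can also be checked mechanically. The Python sketch below (illustrative only) writes out both detection semantics explicitly and compares them exhaustively on the small universe of Example 1.3:

```python
from itertools import product

def chunk_sets(self_set, ell, r):
    # Dp[i]: r-chunks occurring at position i (0-based here) in some self string
    return [{s[i:i+r] for s in self_set} for i in range(ell - r + 1)]

def negative_claims_nonself(x, Dp, r):
    # NSA: nonself iff some window of x is a negative detector at its position
    return any(x[i:i+r] not in Dp[i] for i in range(len(x) - r + 1))

def positive_claims_nonself(x, Dp, r):
    # PSA under Definition 1.5: self iff every window matches a positive
    # detector, i.e. nonself iff not all windows match
    return not all(x[i:i+r] in Dp[i] for i in range(len(x) - r + 1))
```

On the self set of Example 1.3, both classifiers agree on every string of {0,1}^5 and claim exactly the 26 strings of L(CHUNK(S, 3)) as nonself, as Theorem 1.1 asserts.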

    1.5.5 Holes

    The generalization performed by selection algorithms corresponds to the strings

    that are labeled self even though they do not occur in the self sample S. These strings

    are called holes. In the rcb matching rule literature, there are two types of holes:

    crossover holes and length-limit holes. A crossover hole is a crossover of certain self strings, in

    which all its substrings of length r occur in self strings. A length-limit hole is one that has

    at least one substring of length r that exists in CHUNK(S, r). The r-chunk matching

    rule eliminates the problem of length-limit holes.

    Fig. 1.7 illustrates the existence of holes in a self and nonself space composed

    of self and detector strings. The string universe Σℓ is depicted as a square region;

    each dark circle represents a detector, and the grid-patterned shape in the middle is self.

    Figure 1.7: Existence of holes.

    The universe is classified by the detector set as self (the grid region and the holes, shown

    as white regions) and nonself (the dark region covered by the circles).

    In fact, holes are not a "problem", but, as pointed out by Stibor et al. [78], they

    are a necessary property of the selection algorithms. Without holes, the algorithms

    would do nothing but memorize the training data for classification naively.

    Fig. 1.8 shows the set of seven self strings from Example 1.2, S ⊂ {0, 1}^5 (left),

    along with CHUNK(S, 3) (middle) and CONT(S, 3) (right). For both detector types,

    the induced bipartitionings of the shape space {0, 1}^5 are illustrated with strings that

    are classified as nonself having a gray background and strings that are classified as self

    having a white background. Bold strings are members of the self-set. Holes are the

    strings that are classified as self but do not occur in the self-set S (non-bold, non-shaded

    strings). This figure is adapted from [32].

    1.5.6 Performance metrics

    We used three metrics to evaluate each machine learning technique. Detection

    rate (DR), Accuracy rate (ACC) and False alarm rate (FAR) are defined as:

    DR = TP / (TP + FN) (1.2)

    ACC = (TP + TN) / (TP + TN + FP + FN) (1.3)

    FAR = FP / (FP + TN) (1.4)


    Figure 1.8: Negative selections with 3-chunk and 3-contiguous detectors.

    where TP (True Positive) is the number of true positives (correctly classified

    as nonself), TN (True Negative) is the number of true negatives (correctly classified

    as self), FP (False Positive) is the number of false positives (classified as nonself,

    actually self), and FN (False Negative) is the number of false negatives (classified as

    self, actually nonself).
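Equations (1.2)-(1.4) translate directly into code; the counts below are hypothetical, chosen only to exercise the formulas:

```python
def metrics(tp, tn, fp, fn):
    # Equations (1.2)-(1.4) over the four confusion-matrix counts
    dr = tp / (tp + fn)                    # detection rate
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy rate
    far = fp / (fp + tn)                   # false alarm rate
    return dr, acc, far
```

For instance, with 80 true positives, 15 true negatives, 5 false positives and 0 false negatives, this yields DR = 1.0, ACC = 0.95 and FAR = 0.25.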

    We used the 10-fold cross-validation and holdout techniques to evaluate our

    approaches in experiments. Regarding the former, the dataset was randomly partitioned

    into 10 subsets. Of the 10 subsets, a single subset was retained for testing, and the

    others were used as training data. The process was then repeated 10 times, with each

    of the 10 subsets used exactly once as the testing data. The 10 results from the folds

    were then averaged to produce a single performance figure. Regarding the latter, the dataset is

    split into two groups: a training set used to train the classifier and a test set (or holdout

    set) used to estimate the performance of the classifier.
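The fold construction described above can be sketched as follows (an illustrative helper, not the thesis's experimental code; the fold count k and seed are assumptions):

```python
import random

def kfold_splits(n, k=10, seed=0):
    # randomly partition n sample indices into k folds; each fold is
    # held out once as the test set while the rest form the training set
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]
```

Each index appears in exactly one test fold, so averaging the k per-fold results uses every sample once for testing and k − 1 times for training.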

    1.5.7 Ring representation of data

    As is known, most AIS-based applications use two types of data representation:

    strings and real-valued vectors. Both popular types are linear structures of symbols or

    numbers, and they may omit information at the edges (the beginning and the end) of these

    structures. For example, to detect a new instance s as self or nonself in the detection

    phases of PSA and NSA, it takes ℓ − r + 1 steps in the worst case to match s against each

    detector at positions 1, ..., ℓ − r + 1. Therefore, for a given detector d, the symbols d[1]

    and s[1] are used in only one match, the symbols d[2] and s[2] are used in two matches, etc.

    In other words, the positions in linear structures are not equal in terms of matching times.

    Our earlier experimental implementation of NSA on binary ring-based strings

    provides a motivation for addressing this problem. A set of 50,000 random self strings and

    a set of 10,000 random nonself strings, all of length 50, were used in the experiment.

    Table 1.1 shows the experimental results for values of r ranging from 10 to 16 under

    the 10-fold cross-validation technique. The results show that both the detection rate and

    the accuracy rate of the ring-based NSA are higher than those of the linear-based one, while

    the false alarm rates are relatively similar. Accordingly, we could use ring structures instead

    of linear ones for more exact classification.

    Table 1.1: Performance comparison of NSAs on linear strings and ring strings.

     r    NSA on linear strings        NSA on ring strings
          ACC      DR       FAR       ACC      DR       FAR
    10    0.8343   0.0102   0.0008    0.8345   0.0123   0.0010
    11    0.8380   0.0535   0.0051    0.8390   0.0655   0.0063
    12    0.8488   0.1817   0.0177    0.8522   0.2193   0.0212
    13    0.8677   0.4054   0.0399    0.8723   0.4704   0.0465
    14    0.8875   0.6456   0.0640    0.8932   0.7184   0.0719
    15    0.9023   0.8293   0.0829    0.9075   0.8888   0.0888
    16    0.9112   0.9340   0.0934    0.9139   0.9662   0.0964

    With reference to string-based detector sets, a simple technique for this approach

    is to concatenate each string representing a detector with its own first k symbols. Each new

    linear string is a ring representation of its original string. Fig. 1.9 shows a ring

    representation (b) of an original string (a) with k = 3.

    Given a set of strings S ⊂ Σℓ, the set Sr ⊂ Σℓ+r−1 contains the ring representations

    of all strings in S, obtained by concatenating each string s ∈ S with its first r − 1 symbols.
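The construction of Sr, and its effect on the number of r-windows, can be sketched in two lines of Python (an illustration of the definition above):

```python
def to_ring(s, r):
    # append the first r-1 symbols so that every one of the |s|
    # positions starts a complete r-window
    return s + s[:r-1]

def ring_windows(s, r):
    # all r-windows of the ring string: |s| windows instead of |s|-r+1,
    # so every symbol of s participates in exactly r matches
    ring = to_ring(s, r)
    return [ring[i:i+r] for i in range(len(s))]
```

For example, with r = 3 the string 10110 becomes the ring string 1011010, whose five windows are 101, 011, 110, 101 and 010.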

    Figure 1.9: A simple ring-based representation (b) of a string (a).

    Note that we can easily apply the idea of ring strings to other data representations

    in AIS. One way to do this, for instance, is to create other structures, such as trees

    and automata, from the set Sr instead of S as usual.

    1.5.8 Frequency trees

    Given a set D of equal-length strings, a tree T on D, denoted TD, is a rooted

    directed tree with edge labels from Σ where for all c ∈ Σ, every node has at most one

    outgoing edge labeled with c. For a string s, we write s ∈ T if there is a path from the

    root of T to a leaf such that s is the concatenation of the labels on this path. Each

    leaf is associated with an integer, namely the frequency of the string s ∈ D that is

    the concatenation of the labels on the path ending at this leaf. This tree structure is

    a compact representation of r-chunk detectors in our algorithm in Chapter 5.

    Example 1.5. Let ℓ = 5 and matching threshold r = 3. Suppose that we have the

    set S of four strings: s1 = 00000, s2 = 10110, s3 = 10111, s4 = 11111. Sr =

    {0000000, 1011010, 1011110, 1111111}. S1 = {(000,1), (101,1), (111,1)}, S2 =

    {(000,2), (011,2), (111,2)}, S3 = {(000,3), (110,3), (111,3)}, S4 = {(000,4), (101,4),

    (111,4)}, S5 = {(000,5), (010,5), (110,5), (111,5)}.

    Assume that S = N ∪ A, where the set of normal data N = {s1, s2} and the set of

    abnormal data A = {s3, s4}.

    Nr = {0000000, 1011010} (ring representations of all strings in N). N1 =

    {(000,1), (101,1)}, N2 = {(000,2), (011,2)}, N3 = {(000,3), (110,3)}, N4= {(000,4),

    (101,4)}, N5 = {(000,5), (010,5)}.

    Ar = {1011110, 1111111}. A1 = {(101,1), (111,1)}, A2 = {(011,2), (111,2)},

    A3 = {(111,3)}, A4= {(111,4)}, A5 = {(110,5), (111,5)}.

    Ten trees representing all 3-chunk detectors are shown in Fig. 1.10. The five trees TNi (TAi),

    i = 1, . . . , 5, are in the first (second) row, from left to right, respectively.

    Figure 1.10: Frequency trees for all 3-chunk detectors.

    Call to mind that there are some strings that belong to both positive trees and

    negative trees. For example, the substring s2[1 . . . 3] = 101 satisfies s2[1 . . . 3] ∈ TN1

    and s2[1 . . . 3] ∈ TA1. This situation could lead to errors in the detection phase. Therefore,

the frequencies of matches will be used to improve the detection performance of the algorithms. The detailed technique for using these frequencies is presented in Chapter 5.
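As a small illustration (a sketch, not the thesis implementation), the per-position r-chunk sets and their frequencies from Example 1.5 can be computed directly from the ring representations:

```python
from collections import Counter

def ring(s, r):
    """Ring representation: append the first r-1 symbols to the end."""
    return s + s[:r - 1]

def chunk_counts(strings, r):
    """For each position i, count how often each r-chunk occurs among the
    ring representations; the counts are the leaf frequencies of the trees."""
    ell = len(strings[0])
    return [Counter(ring(s, r)[i:i + r] for s in strings) for i in range(ell)]

# The four strings of Example 1.5
S = ["00000", "10110", "10111", "11111"]
counts = chunk_counts(S, 3)
# counts[0] contains the chunks {000, 101, 111}, matching S1 in Example 1.5
```

Here counts[i] corresponds to the set Si+1 of the example; for instance, the chunk 101 occurs twice at position 1, which is the kind of frequency stored at the tree leaves.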

    1.6 Datasets

There are two basic types of NIDSs, depending on the source of data to be analyzed: packet-based NIDSs and flow-based ones. M. H. Bhuyan et al. reviewed in [13] that most NIDSs are packet-based. However, we concentrate only on flow-based NIDSs for three reasons: 1) they can detect some special attacks, such as DDoS or RDP Brute Force, more efficiently and faster than payload-based ones, since less information needs to be analyzed [87, 76]; 2) flow-based anomaly detection methods process only packet headers and therefore reduce the data amount and processing time, enabling high-speed detection on large networks; this addresses the scalability problem under increasing network usage and load [76]; 3) flow-based NIDSs raise fewer privacy issues than packet-based ones because of the absence of payload [76].


    1.6.1 The DARPA-Lincoln datasets

So as to evaluate the performance of different intrusion detection methodologies, MIT's Lincoln Laboratory, sponsored by the DARPA ITO and the Air Force Research Laboratory, gathered the DARPA-Lincoln datasets, which cover nine weeks in 1998: seven weeks of training data and two weeks of test data. The datasets are collected and stored in Tcpdump form; they are the data source from which datasets such as KDD99 [3] and NetFlow [86] are extracted.

More than 300 instances of 38 different attacks were launched against victim UNIX hosts in the attack data; each attack falls into one of four categories: Denial of Service (DoS), Probe, User to Root (U2R), and Remote to Local (R2L). For each week, inside and outside network traffic data, audit data recorded by the Basic Security Module on Solaris hosts, and file system dumps from UNIX hosts were collected. In 1999, another series of datasets containing three weeks of training data and two weeks of test data was collected. More than 200 instances of 58 attack types were launched against victim UNIX and Windows NT hosts and a Cisco router. In 2000, three additional scenario-specific datasets were generated to address distributed DoS and Windows NT attacks. Detailed descriptions of these datasets can be found at [1].

    1.6.2 UT dataset

A public labeled flow-based dataset is provided in [75]. This dataset was captured by monitoring a honeypot hosted in the University of Twente network, so we call it the UT dataset. It has three categories of traffic: malicious, unknown, and side-effect. It contains 14,170,132 flows, which are mostly of a malicious nature. Only a small number of flows, 5,968 or 0.042%, are unlabeled (unknown); these are considered as normal data in our experiments in the following chapters. Each flow in the

dataset has 13 fields: id (the ID of the flow), src_ip (anonymized source IP address, encoded as a 32-bit number), dst_ip (anonymized destination IP address, encoded as a 32-bit number), packets (number of packets in the flow), octets (number of bytes in the flow), start_time (UNIX start time, in seconds), start_msec (milliseconds part of the start time), end_time (UNIX end time, in seconds), end_msec (milliseconds part of the end time), src_port (source port number), dst_port (destination port number), tcp_flags (TCP flags of the flow), and prot (IP protocol number).

    Examples of a normal flow and an attack flow are (393, 3145344965, 2463760020,

    3, 168, 1222173606, 974, 1222173610, 239, 0, 769, 0, 1) and (1, 2463760020, 3752951033,

    1, 60, 1222173605, 985, 1222173605, 985, 4534, 22, 2, 6), respectively. The ID of a flow

    is used to distinguish attack flows from the others.
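To make the field layout concrete, the following sketch parses the two example records above and derives a flow duration. The field names follow the description in the text; this is an illustration, not code from the thesis:

```python
from collections import namedtuple

# The 13 fields of a UT flow record, in the order described above.
Flow = namedtuple("Flow", [
    "id", "src_ip", "dst_ip", "packets", "octets",
    "start_time", "start_msec", "end_time", "end_msec",
    "src_port", "dst_port", "tcp_flags", "prot",
])

normal = Flow(393, 3145344965, 2463760020, 3, 168,
              1222173606, 974, 1222173610, 239, 0, 769, 0, 1)
attack = Flow(1, 2463760020, 3752951033, 1, 60,
              1222173605, 985, 1222173605, 985, 4534, 22, 2, 6)

def duration_ms(f):
    """Flow duration in milliseconds from the second/millisecond fields."""
    return (f.end_time - f.start_time) * 1000 + (f.end_msec - f.start_msec)
```

For the normal example flow this yields a duration of 3,265 ms, while the attack flow starts and ends in the same millisecond.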

1.6.3 NetFlow dataset

The packet-based DARPA dataset [1] is used in [86] to generate a flow-based DARPA dataset, called NetFlow. This dataset focuses only on flows to a specific port and an IP address that receives the highest number of attacks. It includes all 129,571 flows (including attacks) to and from the victims. Each flow in the dataset has 10 fields: Source IP, Destination IP, Source Port, Destination Port, Packets, Octets, Start Time, End Time, Flags, and Proto. All 24,538 attack flows are labeled with text labels, such as neptune, portsweep, ftpwrite, etc.

Examples of a normal flow and an attack flow are (172.16.112.20, 172.16.112.50, 53, 32961, 1, 161, 1999-03-05T08:17:10, 1999-03-05T08:17:10, 17, 00) and (209.167.99.71, 172.16.112.50, 10353, 4288, 1, 46, 1999-03-12T17:23:19, 1999-03-12T17:23:19, 6, 02:::portsweep), respectively.
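As the portsweep example suggests, the attack label rides on the Flags field after a ":::" separator. A minimal sketch for separating flags from labels (an assumption about the record format inferred from the example above, not thesis code):

```python
def split_label(flags_field):
    """Split a NetFlow Flags field such as '02:::portsweep' into
    (flags, label); normal flows carry no label part."""
    if ":::" in flags_field:
        flags, label = flags_field.split(":::", 1)
        return flags, label
    return flags_field, None
```

With this convention, split_label("02:::portsweep") separates the TCP flags "02" from the attack label "portsweep", while a plain "00" is treated as a normal flow.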

    1.6.4 Discussions

This thesis does not address techniques or algorithms for NIDSs over a wide variety of datasets. Although there exist many other well-known datasets for IDS, such as the KDD99 datasets [3], the NSL-KDD dataset [4], and the FDFA datasets [2], they all fall outside the scope of this flow-oriented thesis.

There have been a number of different viewpoints and studies criticizing the DARPA datasets. McHugh [62] gave one of the most important, and deeply critical, assessments of the DARPA datasets. Among other issues, this assessment points out that normal and attack data have unrealistic data rates, that training datasets for anomaly detection are lacking for their intended purpose, and that no validation efforts were made; it thereby shows that some evaluation methodologies are questionable and may have biased the results. Furthermore, Mahoney and Chan [60], whose work can be seen as a confirmation of McHugh's findings, revealed that many attributes had small and fixed ranges in simulation, but large and growing ranges in real traffic.

Despite these criticisms, the DARPA-Lincoln datasets play a vital role as publicly available data and remain the most sophisticated benchmark for researchers evaluating intrusion detection or machine learning algorithms [92]. One important aspect of this data is that it serves as a proxy for developing, testing and evaluating detection algorithms, rather than as a solid dataset for a real-time system. If a detection algorithm achieves high performance on the DARPA data, it is more likely to perform well in a real network environment [88].

    1.7 Summary

This chapter presents the background topics used in this thesis, including the HIS, IDS, and AIS. Some terms and definitions that will be used throughout the thesis are also stated clearly. Two important data structures, ring-based and frequency-based, can be used to improve the classification rate of selection algorithms. Some popular performance metrics and three well-known datasets for NIDS are presented and discussed. These datasets will be used for experiments in the other chapters. Besides, under the new semantics of detection in the r-chunk based PSA (Section 1.5.4), we proved an important theorem on the coincidence of the detection coverage of PSA and that of NSA. This theorem leads to one contribution of the thesis that will be discussed in the next chapter.


    Chapter 2

COMBINATION OF NEGATIVE SELECTION AND POSITIVE SELECTION

It can be seen from Theorem 1.1 in Chapter 1 that the r-chunk based PSA and NSA are dual in terms of detection. This motivates our approach of combining the two selection algorithms in a way that does not affect the detection performance of either.

    2.1 Introduction

NSA and PSA are computational models inspired by the negative and positive selection processes of the biological immune system. Of the two, NSA has been studied more extensively, resulting in more variants and applications [51]. However, all existing string-based NSAs have worst-case exponential memory complexity for storing the detectors set, which limits their practical applicability [31]. In this chapter, we introduce a novel selection algorithm that employs binary representation and the r-chunk matching rule for detectors. The new algorithm combines negative and positive selection to reduce both detector storage complexity and detection time, while maintaining the same detection coverage as that of NSAs (PSAs).

In the following section, we review some related works. Section 2.3 presents in detail a new r-chunk type selection algorithm, called PNSA, that combines positive and negative selection. In our proposed approach, binary trees are used as the data structure for storing the detectors set to reduce memory complexity, and thereby the time complexity of the detection phase. Section 2.4 details preliminary experimental


results. A summary of the chapter is given in the final section.

    2.2 Related works

Both PSA and NSA achieve quite similar performance in detecting novelty in data patterns [24]. Dasgupta et al. [21] conducted one of the earliest experiments on combining positive with negative selection. The combined process is embedded in a genetic algorithm using a fitness function that assigns a weight to each bit based on domain knowledge. Their method aims to reduce neither detector storage complexity nor detection time. Esponda et al. [34] proposed a generic NSA for anomaly detection problems. Their model of normal behavior is constructed from an observed sample of normally occurring patterns. Such a model could represent either the set of allowed patterns (positive detection) or the set of anomalous patterns (negative detection). However, their NSA is not concerned with combining positive and negative selection in the detection phase as the proposed algorithm is. Stibor et al. [80] argued that positive selection might have better detection performance than negative selection. However, the choice between positive and negative selection obviously depends on the representation used in the AIS-based application.

To the best of our knowledge, there has been no published attempt at combining r-chunk type PSA and NSA for the purpose of reducing detector storage complexity and detection time complexity.

    2.3 New Positive-Negative Selection Algorithm

Our algorithm first constructs ℓ − r + 1 binary trees (called positive trees) corresponding to the ℓ − r + 1 positive r-chunk detector sets Dpi, i = 1, . . . , ℓ − r + 1. Then, all complete subtrees of these trees are removed to achieve a compact representation of the positive r-chunk detectors while maintaining the detection coverage. Finally, for every ith positive tree, we decide whether or not it should be converted to the negative tree, which covers the negative r-chunk detector set Dni. The decision depends on which tree is more compact. When this process is done, we have ℓ − r + 1 compact binary trees, some of which represent positive r-chunk detectors while the others represent negative ones.

The r-chunk matching rule on binary trees is implemented as follows: a given sample s matches the ith positive (negative) tree if s[i . . . i + k] is a path from the root to a leaf, i = 1, . . . , ℓ − r + 1, k < r. The detection phase can be conducted by traversing the compact binary trees iteratively one by one: a sample s is claimed as nonself if it matches a negative tree or if it fails to match some positive tree; otherwise it is considered as self.

Example 2.1. For the set of six self strings from Example 1.3, S = {00000, 00010, 10110, 10111, 11000, 11010}, where ℓ = 5 and r = 3, six binary trees (in which the left and right children are labeled 0 and 1, respectively) represent the six 3-chunk detector sets (Dpi and Dni, i = 1, 2, 3), as depicted in Fig. 2.1. In the figure, dashed arrows in some positive trees mark the complete subtrees that will be removed to achieve a compact tree representation. The positive trees for Dp1, Dp2 and Dp3 are in (a), (c) and (e), respectively; the negative trees for Dn1, Dn2 and Dn3 are in (b), (d) and (f), respectively.

The numbers of nodes of the trees in Figures 2.1.a - 2.1.f (after deleting complete subtrees) are 9, 10, 7, 6, 8 and 8, respectively. Therefore, the chosen final trees are those in Figures 2.1.a (9 nodes), 2.1.d (6 nodes) and 2.1.e or 2.1.f (8 nodes). In a real implementation, it is unnecessary to generate both positive trees and negative trees. Since each Dpi could dually be represented either by a positive or a negative tree, we only need to generate (compact) positive trees. If a compact positive tree T has more leaves than internal nodes with a single child, the corresponding negative tree T′ has fewer nodes than T. Therefore, T′ should be used instead of T to represent Dni more compactly. Figure 2.3 presents a diagram of the algorithm. The following example illustrates this observation.

Example 2.2. Consider again the set of six self strings S from Example 1.3, S = {00000, 00010, 10110, 10111, 11000, 11010}. The compact positive tree for the positive 3-chunk detector set Dp2 = {(000,2); (001,2); (011,2); (100,2); (101,2)} is shown


Figure 2.1: Binary tree representation of the detectors set generated from S.

    in Fig. 2.2.a. This tree has three leaves and two nodes that have only one child (in

    dotted circles) so it should be converted to the corresponding negative tree as illustrated

    in Fig. 2.2.b.

Figure 2.2: Conversion of a positive tree to a negative one.


Algorithm 2.1 Detector Generation Algorithm.

1: procedure DetectorGeneration(S, r, T )
   Input: A set of self strings S ⊆ Σ^ℓ, a matching threshold r ∈ {1, . . . , ℓ}.
   Output: A set T of ℓ − r + 1 prefix trees presenting all r-chunk detectors.
2:   T = ∅
3:   for i = 1, . . . , ℓ − r + 1 do
4:     create an empty prefix positive tree Ti
5:     for all s ∈ S do
6:       insert every s[i . . . i + r − 1] into Ti
7:     end for
8:     for all internal nodes n ∈ Ti do
9:       if n is the root of a complete binary subtree then
10:        delete this subtree
11:      end if
12:    end for
13:    if (number of leaves of Ti) > (number of nodes of Ti that have only one child) then
14:      for all internal nodes ∈ Ti do
15:        if it has only one child then
16:          if the child is a leaf then
17:            delete the child
18:          end if
19:          create the other child for it
20:        end if
21:      end for
22:      mark Ti as a negative tree
23:    end if
24:    T = T ∪ {Ti}
25:  end for
26: end procedure
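The following Python sketch mirrors the idea of Algorithm 2.1 using nested dicts as prefix trees (a leaf is an empty dict). Instead of converting a positive tree in place, it builds the positive tree and the negative tree (from the complementary chunk set), compacts both, and keeps the smaller one; this is an illustrative reimplementation, not the thesis code:

```python
from itertools import product

def build(chunks):
    """Prefix tree over a set of equal-length binary chunks."""
    tree = {}
    for c in chunks:
        node = tree
        for b in c:
            node = node.setdefault(b, {})
    return tree

def compact(node, depth):
    """Delete complete binary subtrees; return True iff the subtree rooted
    here is complete down to the remaining depth."""
    if depth == 0:
        return True
    if set(node) == {"0", "1"}:
        left = compact(node["0"], depth - 1)
        right = compact(node["1"], depth - 1)
        if left and right:
            node.clear()          # the whole subtree below is complete
            return True
    else:
        for b in node:
            compact(node[b], depth - 1)
    return False

def size(node):
    """Number of nodes, counting the root."""
    return 1 + sum(size(child) for child in node.values())

def generate(S, r):
    """For each position, keep the smaller of the compact positive and
    compact negative trees (ties go to the positive tree)."""
    ell = len(S[0])
    trees = []
    for i in range(ell - r + 1):
        pos_chunks = {s[i:i + r] for s in S}
        neg_chunks = {"".join(p) for p in product("01", repeat=r)} - pos_chunks
        pos, neg = build(pos_chunks), build(neg_chunks)
        compact(pos, r)
        compact(neg, r)
        trees.append(("negative", neg) if size(neg) < size(pos)
                     else ("positive", pos))
    return trees
```

For S and r = 3 from Example 2.1 this yields compact tree sizes 9, 6 and 8 with kinds positive, negative, positive, matching the choice of Figures 2.1.a, 2.1.d and 2.1.e.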

The proposed technique is summarized in Algorithm 2.1 and Algorithm 2.2. The first algorithm, Algorithm 2.1, generates ℓ − r + 1 trees, each of which is labeled as positive or negative. The process of generating the compact binary (positive and negative) trees representing the complete r-chunk detectors set is conducted in the outer "for" loop. First, each binary positive tree Ti is constructed by the first inner loop. Then, the compactification of each Ti is conducted by the second one, i = 1, . . . , ℓ − r + 1. The conversion of a positive tree to a negative one takes place in the "if" statement after the second inner "for" loop. The procedure for recognizing a given cell string s as self or nonself is carried out by the last "while . . . do" and "if . . . then . . . else" statements. Figure 2.4 presents a diagram of the algorithm.


    Figure 2.3: Diagram of the Detector Generation Algorithm.

The detection phase is performed by the second algorithm, Algorithm 2.2, and is illustrated by the following example.

Example 2.3. Given S and r as in Example 1.3, S = {00000, 00010, 10110, 10111, 11000, 11010}, and s = 10100 as the inputs of the algorithm, three binary trees are constructed as the detectors set, shown in Figures 2.1.a, 2.1.d and 2.1.e. The output of the algorithm is "s is nonself" because the substring s[2 . . . 4] = 010 of s matches the negative tree T2 (equivalently, 010 is not a path in the positive tree for Dp2).

Algorithm 2.2 Positive-Negative Selection Algorithm.

1: procedure PNSA(T , r, s)
   Input: A set T of ℓ − r + 1 prefix trees presenting all r-chunk detectors, a matching threshold r ∈ {1, . . . , ℓ}, an unlabeled string s ∈ Σ^ℓ.
   Output: A label of s (as self or nonself).
2:   flag = true      ▷ A temporary boolean variable
3:   i = 1
4:   while (i ≤ ℓ − r + 1) and (flag = true) do
5:     if (Ti is a positive tree) and (s ∉ Ti) then
6:       flag = false
7:     end if
8:     if (Ti is a negative tree) and (s ∈ Ti) then
9:       flag = false
10:    end if
11:    i = i + 1
12:  end while
13:  if flag = false then
14:    output s is nonself
15:  else
16:    output s is self
17:  end if
18: end procedure
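A matching and classification sketch for Algorithm 2.2 over such trees (nested dicts, leaf = empty dict). The three trees below are hand-built compact versions of Figures 2.1.a, 2.1.d and 2.1.e from Example 2.1, so this is an illustration rather than the thesis implementation:

```python
def matches(tree, chunk):
    """r-chunk matching on a tree: the chunk matches if following its bits
    from the root reaches a leaf (possibly before all bits are consumed)."""
    node = tree
    for b in chunk:
        if not node:           # reached a leaf early (compacted subtree)
            return True
        if b not in node:
            return False
        node = node[b]
    return not node            # leaf reached exactly at depth r

def pnsa(trees, s, r):
    """trees[i] = (kind, tree) for position i; kind is 'positive' or
    'negative'. Returns 'self' or 'nonself' as in Algorithm 2.2."""
    for i, (kind, tree) in enumerate(trees):
        m = matches(tree, s[i:i + r])
        if (kind == "positive" and not m) or (kind == "negative" and m):
            return "nonself"
    return "self"

# Compact trees chosen in Example 2.1: positive Dp1, negative Dn2, positive Dp3
T1 = {"0": {"0": {"0": {}}}, "1": {"0": {"1": {}}, "1": {"0": {}}}}
T2 = {"0": {"1": {"0": {}}}, "1": {"1": {}}}
T3 = {"0": {"0": {"0": {}}, "1": {"0": {}}}, "1": {"1": {}}}
trees = [("positive", T1), ("negative", T2), ("positive", T3)]
```

With these trees, the string s = 10100 of Example 2.3 is labeled nonself, while the self strings of S are labeled self.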

From the description of DetectorGeneration, it is straightforward to show that it takes |S| · (ℓ − r + 1) · r steps to generate all necessary trees (detector generation time complexity) and (ℓ − r + 1) · r steps to verify a cell string as self or nonself in the worst case (worst-case detection time complexity). These time complexities are similar to those of popular NSAs (PSAs) such as the one proposed in [31]. However, by using compact positive and negative binary trees for storing the detectors set, PNSA can reduce the storage complexity of the detectors set in comparison with other r-chunk type single NSAs or PSAs that store detectors as binary strings. This storage complexity reduction can potentially lead to better detection time complexity in real and average cases. To see this, first, let the following theorem be stated:

Figure 2.4: Diagram of the Positive-Negative Selection Algorithm.

Theorem 2.1 (PNSA detector storage complexity). Given a self set S and an integer ℓ, the procedure DetectorGeneration produces a detector (binary) tree set that has in total at most (ℓ − r + 1) · 2^(r−2) fewer nodes than the detector tree set created by a PSA or NSA alone, where r ∈ {2, . . . , ℓ − r + 1}.


Proof. We only prove the theorem for the PSA case; the NSA case can be proven in a similar way. Because ℓ − r + 1 positive trees can be built from the self set S, the theorem is proved if at most 2^(r−2) nodes can be reduced from each positive tree. The theorem is proved by induction on r (also the height of the binary trees).

Note that when converting a positive tree to a negative tree, the reduction in the number of nodes is exactly the number of leaf nodes minus the number of internal nodes that have only one child.

When r = 2, there are 16 possible positive trees of height 2. By examining all 16 cases, we have found that the maximum reduction in the number of nodes is 1 = 2^(2−2). One example of these cases is the positive tree that has 2 leaf nodes after compactification, as in Fig. 2.5.a. Since it has two leaf nodes and one one-child internal node, after being converted to the corresponding negative tree, the number of nodes is reduced by 2 − 1 = 1.

Figure 2.5: One node is reduced in a tree: a compact positive tree has 4 nodes (a) and its conversion (a negative tree) has 3 nodes (b).
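The base case can also be checked exhaustively: over all 16 chunk subsets of Σ², the maximum node reduction between the compact positive tree and its compact negative counterpart is exactly 1. A brute-force sketch (an illustrative check, not thesis code; trees are nested dicts with empty dicts as leaves):

```python
from itertools import product, combinations

def build(chunks):
    tree = {}
    for c in chunks:
        node = tree
        for b in c:
            node = node.setdefault(b, {})
    return tree

def compact(node, depth):
    """Delete complete binary subtrees; True iff this subtree is complete."""
    if depth == 0:
        return True
    if set(node) == {"0", "1"}:
        left = compact(node["0"], depth - 1)
        right = compact(node["1"], depth - 1)
        if left and right:
            node.clear()
            return True
    else:
        for b in node:
            compact(node[b], depth - 1)
    return False

def size(node):
    return 1 + sum(size(child) for child in node.values())

def compact_size(chunks, r):
    t = build(chunks)
    compact(t, r)
    return size(t)

universe = {"".join(p) for p in product("01", repeat=2)}
reductions = []
for k in range(len(universe) + 1):
    for sub in combinations(sorted(universe), k):   # every subset of chunks
        pos = set(sub)
        reductions.append(compact_size(pos, 2) - compact_size(universe - pos, 2))
# max(reductions) is the maximum base-case reduction, 1 = 2**(2-2)
```

The subset {00, 01, 10}, for instance, gives a 4-node compact positive tree versus a 3-node negative tree, exactly the situation of Fig. 2.5.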

Suppose that the theorem's conclusion holds for all r < k. We shall prove that it also holds for k. This follows from the observation that among all positive trees of height k, there is at least one tree whose left subtree and right subtree (of height k − 1) can each be reduced by at least 2^((k−1)−2) nodes after conversion.

A real experiment on a network intrusion dataset, reported at the end of the following section, shows that the storage reduction is only about 0.35% of this maximum.


    2.4 Experiments

Next, we investigate the possible impact of the reduction in detector storage complexity of PNSA on the real (average) detection time in comparison with a single NSA (PSA). All experiments are performed on a laptop computer with Windows 8 Pro 64-bit, Intel Core i5-3210M CPU 2.50GHz (4 CPUs), and 4GB RAM.

Table 2.1 shows the results on detector memory storage and detection time of PNSA compared to a popular NSA proposed in [31] for some combinations of S, ℓ and r. The training dataset of selves S contains randomly generated binary strings. The memory reduction is measured as the ratio of the reduction in the number of nodes of the binary tree detectors generated by PNSA to the number of nodes of the binary tree detectors generated by the NSA in [31]. The comparative results show that when ℓ and r are sufficiently large, the detector storage and the detection time of PNSA are significantly smaller than those of the NSA in [31] (36% and 50% less, respectively).

    Table 2.1: Comparison of memory and detection time reductions.

    |S|     ℓ    r    Memory (%)   Time (%)
    1,000   50   12   0            0
    2,000   30   15   2.5          5
    2,000   40   17   25.9         42.7
    2,000   50   20   36.3         50

We have conducted another experiment by choosing ℓ = 40 and |S| = 20,000 (S is a set of randomly generated binary strings of length ℓ) and varying r (from 15 to 40). Then, ℓ − r + 1 trees were created using the single NSA and another ℓ − r + 1 compact trees were created using PNSA. Next, both detector sets were used to detect every s ∈ S. Fig. 2.6 depicts the detection time of PNSA and NSA in the experiment. The results show that the PNSA detection time is significantly smaller than that of NSA. For instance, when r is from 20 to 34, detection in PNSA is about 4.46 times faster than in NSA.

The next experiment is conducted on the NetFlow dataset, a conversion of Tcpdump data from the well-known DARPA dataset to NetFlow [86]. We use all 105,033 normal flows as self samples. This self set is first converted to binary strings of length 104; then we run


Figure 2.6: Detection time of NSA and PNSA.

our algorithm with r changing from 5 to 45. Table 2.2 shows some of the experimental steps; the percentage of node reduction is in the final column. Fig. 2.7 depicts the reduction in nodes in trees created by PNSA in comparison to that of NSA for all r = 3, . . . , 45. It shows that the reduction is more than one third when the matching threshold is greater than 19.

    Table 2.2: Comparison of nodes generation on Netflow dataset.

    r NSA PNSA Reduction(%)5 727 706 2.8910 33,461 31,609 5.5315 1,342,517 1,154,427 14.0120 9,428,132 6,157,766 34.6825 18,997,102 11,298,739 40.5230 29,668,240 17,080,784 42.4235 42,596,987 2