MINISTRY OF EDUCATION AND TRAINING
VIETNAMESE ACADEMY OF SCIENCE AND TECHNOLOGY

GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
————————————

NGUYEN VAN TRUONG

IMPROVING SOME ARTIFICIAL IMMUNE ALGORITHMS
FOR NETWORK INTRUSION DETECTION

THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN MATHEMATICS

Hanoi - 2019


Major: Mathematical foundations for Informatics
Code: 62 46 01 10

Scientific supervisors:
1. Assoc. Prof., Dr. Nguyen Xuan Hoai
2. Assoc. Prof., Dr. Luong Chi Mai

Acknowledgments

First of all, I would like to thank my principal supervisor, Assoc. Prof., Dr. Nguyen Xuan Hoai, for introducing me to the field of Artificial Immune Systems. He guided me step by step through research activities such as seminar presentations and paper writing, and his insight has been a constant source of help. His constructive criticism sustained me throughout my PhD journey. I also wish to thank my co-supervisor, Assoc. Prof., Dr. Luong Chi Mai, who was always enthusiastic in our discussions of promising research questions. It has been a pleasure and a privilege to work with her. This thesis would not have been possible without my supervisors' support.

I gratefully acknowledge the support of the Institute of Information Technology, Vietnamese Academy of Science and Technology, and of Thai Nguyen University of Education. I also thank the National Foundation for Science and Technology Development (NAFOSTED) and the ASEAN-European Academic University Network (ASEA-UNINET) for their financial support.

I thank M.Sc. Vu Duc Quang, M.Sc. Trinh Van Ha and M.Sc. Pham Dinh Lam, my co-authors on published papers. I thank Assoc. Prof., Dr. Tran Quang Anh and Dr. Nguyen Quang Uy for many helpful insights into my research. I thank my colleagues, especially my labmate Mr. Nguyen Tran Dinh Long, at the IT Research & Development Center, HaNoi University.

Finally, I thank my family for their endless love and steady support.

Certificate of Originality

I hereby declare that this submission is my own work, carried out under my scientific supervisors, Assoc. Prof., Dr. Nguyen Xuan Hoai and Assoc. Prof., Dr. Luong Chi Mai. I declare that it contains no material previously published or written by another person, except where due reference is made in the text of the thesis. In addition, I certify that all my co-authors have allowed me to present our joint work in this thesis.

Hanoi, 2019
PhD. student
Nguyen Van Truong

Contents

List of Figures
List of Tables
Notation and Abbreviation

INTRODUCTION
  Motivation
  Objectives
  Problem statements
  Outline of thesis

1 BACKGROUND
  1.1 Detection of Network Anomalies
    1.1.1 Host-Based IDS
    1.1.2 Network-Based IDS
    1.1.3 Methods
    1.1.4 Tools
  1.2 A brief overview of human immune system
  1.3 AIS for IDS
    1.3.1 AIS model for IDS
    1.3.2 AIS features for IDS
  1.4 Selection algorithms
    1.4.1 Negative Selection Algorithms
    1.4.2 Positive Selection Algorithms
  1.5 Basic terms and definitions
    1.5.1 Strings, substrings and languages
    1.5.2 Prefix trees, prefix DAGs and automata
    1.5.3 Detectors
    1.5.4 Detection in r-chunk detector-based positive selection
    1.5.5 Holes
    1.5.6 Performance metrics
    1.5.7 Ring representation of data
    1.5.8 Frequency trees
  1.6 Datasets
    1.6.1 The DARPA-Lincoln datasets
    1.6.2 UT dataset
    1.6.3 Netflow dataset
    1.6.4 Discussions
  1.7 Summary

2 COMBINATION OF NEGATIVE SELECTION AND POSITIVE SELECTION
  2.1 Introduction
  2.2 Related works
  2.3 New Positive-Negative Selection Algorithm
  2.4 Experiments
  2.5 Summary

3 GENERATION OF COMPACT DETECTOR SET
  3.1 Introduction
  3.2 Related works
  3.3 New negative selection algorithm
    3.3.1 Detectors set generation under rcbvl matching rule
    3.3.2 Detection under rcbvl matching rule
  3.4 Experiments
  3.5 Summary

4 FAST SELECTION ALGORITHMS
  4.1 Introduction
  4.2 Related works
  4.3 A fast negative selection algorithm based on r-chunk detector
  4.4 A fast negative selection algorithm based on r-contiguous detector
  4.5 Experiments
  4.6 Summary

5 APPLYING HYBRID ARTIFICIAL IMMUNE SYSTEM FOR NETWORK SECURITY
  5.1 Introduction
  5.2 Related works
  5.3 Hybrid positive selection algorithm with chunk detectors
  5.4 Experiments
    5.4.1 Datasets
    5.4.2 Data preprocessing
    5.4.3 Performance metrics and parameters
    5.4.4 Performance
  5.5 Summary

CONCLUSIONS
  Contributions of this thesis
  Future works
  Published works

BIBLIOGRAPHY

List of Figures

1.1 Classification of anomaly-based intrusion detection methods
1.2 Multi-layered protection and elimination architecture
1.3 Multi-layer AIS model for IDS
1.4 Outline of a typical negative selection algorithm
1.5 Outline of a typical positive selection algorithm
1.6 Example of a prefix tree and a prefix DAG
1.7 Existence of holes
1.8 Negative selections with 3-chunk and 3-contiguous detectors
1.9 A simple ring-based representation (b) of a string (a)
1.10 Frequency trees for all 3-chunk detectors
2.1 Binary tree representation of the detectors set generated from S
2.2 Conversion of a positive tree to a negative one
2.3 Diagram of the Detector Generation Algorithm
2.4 Diagram of the Positive-Negative Selection Algorithm
2.5 One node is reduced in a tree: a compact positive tree has 4 nodes (a) and its conversion (a negative tree) has 3 nodes (b)
2.6 Detection time of NSA and PNSA
2.7 Nodes reduction on trees created by PNSA on Netflow dataset
2.8 Comparison of nodes reduction on Spambase dataset
3.1 Diagram of an algorithm to generate a perfect rcbvl detectors set
4.1 Diagram of the algorithm to generate the positive r-chunk detectors set
4.2 A prefix DAG G and an automaton M
4.3 Diagram of the algorithm to generate the negative r-contiguous detectors set
4.4 An automaton representing a 3-contiguous detectors set
4.5 Comparison of ratios of runtime of r-chunk detector-based NSA to runtime of Chunk-NSA
4.6 Comparison of ratios of runtime of r-contiguous detector-based NSA to runtime of Cont-NSA

List of Tables

1.1 Performance comparison of NSAs on linear strings and ring strings
2.1 Comparison of memory and detection time reductions
2.2 Comparison of nodes generation on Netflow dataset
3.1 Data and parameters distribution for experiments and results comparison
4.1 Comparison of our results with the runtimes of previously published algorithms
4.2 Comparison of Chunk-NSA with r-chunk detector-based NSA
4.3 Comparison of proposed Cont-NSA with r-contiguous detector-based NSA
5.1 Features for NIDS
5.2 Distribution of flows and parameters for experiments
5.3 Comparison between PSA2 and other algorithms
5.4 Comparison between ring string-based PSA2 and linear string-based PSA2

Notation and Abbreviation

Notation

ℓ            Length of data samples
Sr           Set of ring representations of all strings in S
|X|          Cardinality of set X
Σ            An alphabet, a nonempty and finite set of symbols
Σ^k          Set of all strings of length k on alphabet Σ, where k is a positive integer
Σ*           Set of all strings on alphabet Σ, including the empty string
r            Matching threshold
D_p^i        Set of all positive r-chunk detectors at position i
D_n^i        Set of all negative r-chunk detectors at position i
CHUNKp(S, r) Set of all positive r-chunk detectors
CHUNK(S, r)  Set of all negative r-chunk detectors
CONT(S, r)   Set of all r-contiguous detectors
L(X)         Set of all nonself strings detected by X
rcbvl        r-contiguous bits with variable length

    Abbreviation

    AIS Artificial Immune System

    ACC Accuracy Rate

    ACO Ant Colony Optimization

    ANIDS Anomaly Network Intrusion Detection System

    BBNN Block-Based Neural Network

    Chunk-NSA Chunk Detector-Based Negative Selection Algorithm

    Cont-NSA Contiguous Detector-Based Negative Selection Algorithm

    DR Detection Rate

    DAG Directed Acyclic Graph

    FAR False Alarm Rate

    GA Genetic Algorithm

    HIS Human Immune System

    HIDS Host Intrusion Detection System

    IDS Intrusion Detection System


    ML Machine Learning

    MLP Multilayer Perceptron

    NIDS Network Intrusion Detection System

    NS Negative Selection

    NSA Negative Selection Algorithm

    NSM Negative Selection Mutation

    PNSA Positive-Negative Selection Algorithm

    PSA Positive Selection Algorithm

    PSA2 Two-class Positive Selection Algorithm

    PSO Particle Swarm Optimization

    PSOGSA Particle Swarm Optimization-Gravitational Search Algorithm

    RNSA Real-valued NSA

    SVM Support Vector Machines

    TCP Transmission Control Protocol

    VNSA Variable length detector-based NSA


    INTRODUCTION

    Motivation

Internet users and computer networks are suffering from a rapidly increasing number of attacks. To keep them safe, effective security monitoring systems, such as Intrusion Detection Systems (IDS), are needed. However, intrusion detection faces a number of difficult problems: large network traffic volumes, imbalanced data distributions, decision boundaries between normal and abnormal actions that are hard to determine, and a requirement for continuous adaptation to a constantly changing environment. As a result, many researchers have tried different types of approaches to build reliable intrusion detection systems.

Computational intelligence techniques, known for their adaptability, fault tolerance, high computational speed and resilience to noisy information, are promising alternative approaches to the problem.

One of the promising computational intelligence methods for intrusion detection that has emerged recently is the artificial immune system (AIS), inspired by the biological immune system. The negative selection algorithm (NSA), a dominant model of AIS, is widely used in intrusion detection systems (IDS) [55, 52]. Despite its successful application, NSA has some weaknesses: (1) high false positive rate (false alarm rate) and false negative rate; (2) high training and testing time; (3) an exponential relationship between the size of the training data and the number of detectors possibly generated for testing; (4) changeable definitions of "normal data" and "abnormal data" in a dynamic network environment [55, 79, 92]. To overcome these limitations, recent works concentrate on complex structures of immune detectors, matching methods and hybrid NSAs [11, 94, 52].

Following the trends mentioned above, in this thesis we investigate the ability of NSA to combine with other classification methods and propose more effective data representations to mitigate some of NSA's weaknesses.

The scientific meaning of the thesis is to provide further background for improving the performance of AIS-based computer security in particular and of IDS in general.

The practical meaning of the thesis is to assist computer security practitioners and experts in implementing their IDS with new AIS-derived features.

The major contributions of this research are: proposing a new representation of data for better IDS performance; proposing a combination of existing algorithms, as well as some statistical approaches, in a uniform framework; and proposing a complete and non-redundant detector representation to achieve optimal time and memory complexity.

    Objectives

Since data representation is one of the factors that affect training and testing time, a compact and complete detector generation algorithm is investigated.

The thesis investigates optimal algorithms to generate detector sets in AIS. They help to reduce both the training time and the detection time of AIS-based IDSs.

The thesis also proposes and investigates an AIS-based IDS that can promptly detect attacks, whether known or never seen before. The proposed system uses AIS combined with statistics as its analysis method and flow-based network traffic as experimental data.

    Problem statements

Since the NSA has the limitations listed in the first section, this thesis concentrates on three problems:

1. The first problem is to find compact representations of data. The objective of this problem's solution is not only to minimize memory storage but also to reduce testing time.

2. The second problem is to propose algorithms that reduce training time and testing time compared with all existing related algorithms.

3. The third problem is to improve detection performance, reducing false alarm rates while keeping the detection rate and accuracy rate as high as possible.

Solutions to these problems can partly address the first three weaknesses listed in the first section. Regarding the last weakness of NSAs, the changeable definitions of "normal data" and "abnormal data" in a dynamic network environment, we consider it a risk in our proposed algorithms and leave it for future work.

Logically, it is impossible to find a single optimal algorithm that both reduces time and memory complexity and obtains the best detection performance; these aspects always conflict with each other. Thus, in each chapter, we propose algorithms that solve each problem fairly independently.

The intrusion detection problem addressed in this thesis can be informally stated as follows:

Given a finite set S of network flows, each labeled as self (normal) or nonself (abnormal), the objective is to build classifying models on S that can correctly label an unlabeled network flow s.
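This problem statement can be made concrete with a toy sketch. It classifies binary-encoded flows with the standard r-chunk matching rule (which the thesis formalizes in Chapter 1): a flow is labeled nonself as soon as one of its length-r windows was never observed in the self set at that position. The function names and the tiny self set below are illustrative only, not the thesis's own code.

```python
# Toy detector-based classifier for the stated problem, using the standard
# r-chunk matching rule over fixed-length binary strings.

def train(self_set, ell, r):
    """Record, for each window position i, the r-length chunks seen in self."""
    seen = [set() for _ in range(ell - r + 1)]
    for s in self_set:
        for i in range(ell - r + 1):
            seen[i].add(s[i:i + r])
    return seen

def classify(model, s, r):
    """'nonself' iff some window of s matches a negative r-chunk detector,
    i.e. a chunk never observed in self at that position."""
    for i, chunks in enumerate(model):
        if s[i:i + r] not in chunks:
            return "nonself"
    return "self"

S = ["0110", "0100"]                 # self (normal) flows as bit strings
model = train(S, ell=4, r=2)
print(classify(model, "0100", 2))    # -> self
print(classify(model, "1111", 2))    # -> nonself
```

Training here is linear in the size of S; the detector representations studied in later chapters aim at the same kind of cost while guaranteeing completeness and non-redundancy of the detector set.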

    Outline of thesis

The first chapter introduces the background knowledge necessary to discuss the algorithms proposed in the following chapters. First, detection of network anomalies is briefly introduced. Following that, the human immune system, artificial immune systems, machine learning and their relevance are reviewed and discussed. Then, the popular datasets used for experiments in the thesis are examined.

In Chapter 2, a combination method of selection algorithms is presented. The proposed technique helps to reduce the storage of detectors generated in the training phase. Testing time, an important measurement in IDS, is also reduced as a direct consequence of the smaller memory footprint. A tree structure is used in this chapter (and in Chapter 5) to improve time and memory complexity.

A complete and non-redundant detector set, also called a perfect detector set, is necessary to achieve acceptable self and nonself coverage of classifiers. A selection algorithm to generate a perfect detector set is investigated in Chapter 3. Each detector in the set is a string concatenated from overlapping classical ones. Unlike the approaches in the other chapters, the discrete structure of the string-based detectors in this chapter makes them suitable for detection in distributed environments.

Chapter 4 presents two selection algorithms with a fast training phase. These optimal algorithms can generate a detector set in linear time with respect to the size of the training data. The experimental results and theoretical proofs show that the proposed algorithms outperform all existing ones in terms of training time. In terms of detection time, the first algorithm is linear and the second is polynomial.

Chapter 5 mainly introduces a hybrid approach combining a positive selection algorithm with statistics for a more effective NIDS. The frequencies of self and nonself data (strings) are stored in the leaves of the trees representing detectors. This information plays an important role in improving the performance of the proposed algorithms. The hybrid approach results in a new positive selection algorithm for two-class classification that can be trained with samples from both self and nonself data.

Chapter 1

BACKGROUND

The human immune system (HIS) successfully protects our bodies against attacks from various harmful pathogens, such as bacteria, viruses and parasites. It distinguishes pathogens from self-tissue and then eliminates them. This provides a rich source of inspiration for computer security systems, especially intrusion detection systems [92]. Hence, applying theoretical immunology, observed immune functions, and their principles and models to IDS has gradually developed into a new research field, called artificial immune systems (AIS).

How to apply the remarkable features of the HIS to achieve a scalable and robust IDS is considered a research gap in the field of computer security. In this chapter, we introduce the background knowledge necessary to discuss the algorithms proposed in the following chapters, which can partly fill this gap.

First, a brief introduction to network anomaly detection is presented. We then give an overview of the HIS. Next, immune selection algorithms, detectors, performance metrics and their relevance are reviewed and discussed. Finally, some popular datasets are examined.

    1.1 Detection of Network Anomalies

The idea of intrusion detection is predicated on the belief that an intruder's behavior is noticeably different from that of a legitimate user and that many unauthorized actions are detectable [65]. Intrusion detection systems (IDSs) are deployed as a second line of defense along with other preventive security mechanisms, such as user authentication and access control. Based on its deployment, an IDS can act either as a host-based or as a network-based IDS.

1.1.1 Host-Based IDS

A Host-Based IDS (HIDS) monitors and analyzes the internals of a computing system. A HIDS may detect internal activity such as which program accesses what resources and attempts illegitimate access, for example, an activity that modifies the system password database. Similarly, a HIDS may look at the state of a system and its stored information, whether in RAM, in the file system, in log files or elsewhere. Thus, one can think of a HIDS as an agent that monitors whether anything or anyone, internal or external, has circumvented the security policy that the operating system tries to enforce [12].

1.1.2 Network-Based IDS

A Network-Based IDS (NIDS) detects intrusions in network data. Intrusions typically occur as anomalous patterns. Most techniques model the data in a sequential fashion and detect anomalous subsequences. The primary reason for these anomalies is the attacks launched by outside attackers who want to gain unauthorized access to the network to steal information or to disrupt it. In a typical setting, a network is connected to the rest of the world through the Internet. The NIDS reads all incoming packets or flows, trying to find suspicious patterns. For example, if a large number of TCP connection requests to a very large number of different ports are observed within a short time, one could assume that someone is committing a port scan against some of the computers in the network. Port scans are mostly detected in the same manner that an ordinary intrusion detection system detects incoming shell codes. In addition to inspecting incoming traffic, a NIDS also provides valuable information about intrusions from outgoing or local traffic. Some attacks might even be staged from the inside of a monitored network or network segment, and are therefore not regarded as incoming traffic at all. The data available to intrusion detection systems can be at different levels of granularity, such as packet-level traces or Cisco NetFlow data. The data is high dimensional, typically with a mix of categorical and continuous numeric attributes. Misuse-based NIDSs attempt to search for known intrusive patterns, while an anomaly-based intrusion detector searches for unusual patterns. Today, intrusion detection research is mostly concentrated on anomaly-based network intrusion detection because it can detect both known and unknown attacks [12].
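The port-scan heuristic just described can be sketched in a few lines: flag any source address that contacts more than a threshold number of distinct destination ports within a short time window. The 60-second window and the 100-port threshold below are illustrative values, not figures taken from the thesis.

```python
# Hedged sketch of the port-scan heuristic described above: a source that
# contacts many distinct destination ports within a short window is flagged.
from collections import defaultdict

def find_port_scanners(flows, window=60.0, max_ports=100):
    """flows: iterable of (timestamp, src_ip, dst_port) records."""
    seen = defaultdict(list)            # src_ip -> [(timestamp, dst_port)]
    scanners = set()
    for ts, src, port in sorted(flows):
        # keep only this source's records that fall inside the time window
        recent = [(t, p) for t, p in seen[src] if ts - t <= window]
        recent.append((ts, port))
        seen[src] = recent
        if len({p for _, p in recent}) > max_ports:
            scanners.add(src)
    return scanners

# A burst of 150 distinct ports from one host within a few seconds,
# against steady single-port traffic from another host:
burst = [(i / 100, "10.0.0.9", 1000 + i) for i in range(150)]
normal = [(float(i), "10.0.0.5", 80) for i in range(150)]
print(find_port_scanners(burst + normal))   # flags only 10.0.0.9
```

A production NIDS would of course stream flows rather than sort them in memory; the sketch only illustrates the detection rule itself.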

    1.1.3 Methods

On the basis of the availability of prior knowledge, the detection mechanism used, the mode of performance and the ability to detect attacks, existing anomaly detection methods are categorized into six broad categories [41], as shown in Fig. 1.1, which is adapted from [12].

[Figure 1.1 depicts a taxonomy tree rooted at Anomaly Detection: Supervised Learning (Parametric, Non-Parametric); Unsupervised Learning (Clustering, Association Mining, Outlier Mining); Probabilistic; Soft Computing (ANN-based, Rough Set-based, Fuzzy Logic, GA-based & Ant Colony, Artificial Immune System); Knowledge-based (Rule-based & Expert System-based, Ontology & Logic-based); Combination Learners (Ensemble-based, Fusion-based, Hybrid).]

Figure 1.1: Classification of anomaly-based intrusion detection methods

AIS is a fairly new research subfield of computational intelligence, which is concerned with systems that act intelligently: what such a system does is appropriate for its circumstances and its goals; it is flexible to changing environments and changing goals; it learns from experience; and it makes appropriate choices given perceptual limitations and finite computation [68].

1.1.4 Tools

IDS tools are used for purposes such as information gathering, victim identification, packet capture, network traffic analysis and visualization of traffic behavior. Examples of such tools, both commercial and free, include Snort, Suricata, Bro, OSSEC, Samhain, Cisco Secure IDS, CyberCop, and RealSecure. Immune-related IDS tools include LISYS [10], which is based on TCP packets, and MILA [26], a multilevel immune learning algorithm proposed for novel pattern recognition.

However, despite their initially promising and influential properties, immune-based IDSs never made it beyond the prototype stage [83]. Two main issues that impeded the progress of immune algorithms have been identified: the large computational cost of achieving acceptable coverage of the potentially anomalous region [54], and the failure of these algorithms to generalize properly beyond the training set [79].

    1.2 A brief overview of human immune system

    Mainly being inspired by the human immune system, researchers have devel-

    oped AISs intellectually and innovatively. Physical barriers, physiological barriers, an

    innate immune system, and an adaptive immune system are main factors of a multi-

    layered protection architecture included in our human immune system; among which,

    the adaptive immune system being capable of adaptively recognizing specific types of

    pathogens, and memorizing them for accelerated future responses is a complex of a

    variety of molecules, cells, and organs spread all over the body [46]. Pathogens are for-

    eign substances like viruses, parasites and bacteria which attack the body. Figure 1.2,

    adapted from [77], presents a multi-layered protection and elimination architecture.

    T cells and B cells cooperate to distinguish self from nonself. On the one hand,

    T cells recognize antigens with the help of major histocompatibility complex (MHC)

    molecules. Antigen presenting cells ingest and fragment antigens to peptides. MHC

    molecules transport these peptides to the surface of antigen presenting cells. T cells,

    whose receptors bind with these peptide-MHC combinations, are said to recognize

  • 9

    Figure 1.2: Multi-layered protection and elimination architecture

    antigens. On the other hand, B cells recognize antigens by binding their receptors

    directly to antigens. The bindings actually are chemical bonds between receptors and

    epitopes. The more complementary the structure and the charge between receptors and

    epitopes are, the more likely binding will occur. The strength of the bond is termed

    affinity. To avoid autoimmunity, T cells and B cells must pass a negative selection

    stage, where lymphocytes matching self cells are killed.

    Prior to negative selection, T cells undergo positive selection. This is because in

    order to bind to the peptide-MHC combinations, they must recognize self MHC first.

    Thus, the positive selection will eliminate T cells with weak bonds to self MHC. T cells

    and B cells, which survive the negative selection, become mature, and enter the blood

    stream to perform the detection task. Since these mature lymphocytes have never

    encountered antigens, they are naive. Naive T cells and B cells can possibly auto-react

    with self cells, because some peripheral self proteins are never presented during the

    negative selection stage. To prevent self-attack, naive cells need two signals in order

    to be activated: one occurs when they bind to antigens, and the other is from other

    sources as a confirmation. Naive T helper cells receive the second signal from innate

    system cells. In the event that they are activated, T cells begin to clone. Some of

    the clones will send out signals to stimulate macrophages or cytotoxic T cells to kill

    antigens, or send out signals to activate B cells. Others will form memory T cells. The

    activated B cells migrate to a lymph node. In the lymph node, a B cell will clone itself.

  • 10

    Meanwhile, somatic hyper mutation is triggered, whose rate is 10 times higher than

    that of the germ line mutation, and is inversely proportional to the affinity. Mutation

    changes the receptor structures of offspring; hence offspring have to bind to pathogenic

    epitopes captured within the lymph nodes. If they do not bind, they will simply die

    after a short time. If, however, they succeed in binding, they will leave the lymph

    node and differentiate into plasma or memory B cells.

    In summary, the HIS is a distributed, self-organizing and lightweight defense

    system for the body. These remarkable features align well with the design goals of

    an intrusion detection system, thus resulting in a scalable and robust system [53].

    1.3 AIS for IDS

    1.3.1 AIS model for IDS

    Figure 1.3 illustrates the steps necessary to obtain an AIS solution for a security

    problem, as first envisioned by de Castro and Timmis [27] and later adopted

    by Fernandes et al. [35]. Firstly, the security domain of the system to model needs

    to be identified. Secondly, the immune entities that best fit the needs of the system

    should be picked from the immunological theories; this eases the choice of

    representation for the entities. In the affinity-measure step, one should define

    a matching rule that determines whether two elements bind.

    Figure 1.3: Multi-layer AIS model for IDS


    1.3.2 AIS features for IDS

    According to Kim et al. [55], AIS features can be illustrated and summarized

    as follows.

    Firstly, a distributed IDS supports robustness, configurability, extendibility and

    scalability. It is robust since the failure of one local intrusion detection process does

    not cripple the overall IDS. It is also easy to configure a system since each intrusion

    detection process can be simply tailored for the local requirements of a specific host.

    The addition of new intrusion detection processes running on different operating sys-

    tems does not require modification of existing processes and hence it is extensible. It

    can also scale better, since the high volume of audit data is distributed amongst many

    local hosts and is analyzed by those hosts.

    Secondly, a self-organizing IDS provides adaptability and global analysis. With-

    out external management or maintenance, a self organizing IDS automatically detects

    intrusion signatures which are previously unknown and/or distributed, and eliminates

    and/or repairs compromised components. Such a system is highly adaptive because

    there is no need for manual updates of its intrusion signatures as network environments

    change. Global analysis emerges from the interactions among a large number of varied

    intrusion detection processes.

    Next, a lightweight IDS supports efficiency and dynamic features. A lightweight

    IDS does not impose a large overhead on a system or place a heavy burden on CPU

    and I/O. It places minimal work on each component of the IDS. The primary functions

    of hosts and networks are not adversely affected by the monitoring. It also dynamically

    covers intrusion and non-intrusion pattern spaces at any given time, rather than

    maintaining the entire intrusion and non-intrusion patterns.

    One more important feature is a multi-layered IDS which increases robustness.

    The failure of one-layer defense does not necessarily allow an entire system to be

    compromised. While a distributed IDS allocates intrusion detection processes across

    several hosts, a multi-layered IDS places different levels of sensors at one monitoring

    place.

    Additionally, a diverse IDS provides robustness. A variety of different intrusion


    detection processes spread across hosts will slow an attack that has successfully com-

    promised one or more hosts. This is because an understanding of the intrusion process

    at one site provides limited or no information on intrusion processes at other sites.

    Finally, it is a disposable IDS that increases robustness, extendibility and config-

    urability. A disposable IDS does not depend on any single component. Any component

    can be easily and automatically replaced with other components. These properties are

    important in an effective IDS, as well as being established properties of the HIS.

    1.4 Selection algorithms

    The main developments within AIS have focused on three immunological theories:

    clonal selection, immune networks and negative selection. Negative selection

    approaches are based on self-nonself discrimination in biological systems. This property

    makes them attractive for computer and network security researchers. A survey by G. C.

    Silva and D. Dasgupta in [71] showed that in the five-year period 2008-2013, NSAs

    predominated over all other AIS models in terms of published papers relating to both

    network security and anomaly detection. This trend motivates much of the research work in

    this thesis.

    Another AIS model, the positive selection algorithm (PSA), is also investigated. Under

    some conditions, we will prove in a following section that PSA is equivalent to NSA in terms

    of anomaly detection performance.

    1.4.1 Negative Selection Algorithms

    Negative selection is a mechanism employed to protect the body against self-

    reactive lymphocytes. Such lymphocytes can occur because the building blocks of

    antibodies are different gene segments that are randomly composed and undergo a fur-

    ther somatic hypermutation process. Therefore, this process can produce lymphocytes

    which are able to recognise self-antigens [85].

    NSAs are among the most popular and extensively studied techniques in ar-

    tificial immune systems that simulate the negative selection process of the biological

    immune system. Stephanie Forrest et al. [38] proposed an algorithmic model of this


    process, which can be considered as a classifier that learns from only self samples

    (negative examples)1.

    A typical NSA comprises two phases: detector generation and detection [7,

    50]. In the detector generation phase (Fig. 1.4.a), the detector candidates are generated

    by some random processes and censored by matching them against given self samples

    taken from a set S (representing the system components). The candidates that match

    any element of S are eliminated and the rest are kept and stored in the detector

    set D. In the detection phase (Fig. 1.4.b), the collection of detectors are used to

    distinguish self (system components) from nonself (outliers, anomalies, etc.). If an incoming

    data instance matches any detector, it is claimed as nonself or anomaly. Figure 1.4 is

    adapted from [38].


    (a) Generation of detector set (b) Detection of new instances

    Figure 1.4: Outline of a typical negative selection algorithm.
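The two phases outlined above can be made concrete with a minimal Python sketch. This is an illustration only, not the thesis implementation: it enumerates the whole binary universe exhaustively in place of random candidate generation (feasible only for tiny string lengths) and uses the positional r-contiguous matching rule.

```python
from itertools import product

def matches_rcb(d, s, r):
    # r-contiguous rule: detector d matches string s if they agree
    # on at least r contiguous positions
    return any(d[i:i+r] == s[i:i+r] for i in range(len(s) - r + 1))

def generate_detectors(self_set, ell, r):
    # censoring phase: keep every candidate that matches no self sample
    # (exhaustive enumeration stands in for random generation)
    return {''.join(c) for c in product('01', repeat=ell)
            if not any(matches_rcb(''.join(c), s, r) for s in self_set)}

def classify(x, detectors, r):
    # detection phase: nonself iff some detector matches
    return 'nonself' if any(matches_rcb(d, x, r) for d in detectors) else 'self'
```

For instance, with ℓ = 5, r = 3 and the self set {01111, 00111, 10000, 10001, 10010, 10110, 11111}, exactly two detectors survive the censoring: 01011 and 11011.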

    The concept of matching, or recognition, is used both in the detector generation phase

    and in the anomaly detection phase. Regardless of representation, a matching rule on

    a detector d and a data sample s can be informally defined as a distance measure

    between d and s within a threshold. The matching threshold exposes the concept of partial

    matching: two points do not have to be exactly the same to be considered matching.

    1 At the time of writing this thesis, the paper had been cited more than 2300 times.


    A partial matching rule can support approximation or generalization in the algorithms.

    The choice of the matching rule or of the threshold in a matching rule must be

    application-specific and representation-dependent [51]. For real-valued representations,

    some popular rules are the Euclidean distance and the Manhattan distance. For string

    representations, the rcb (r-contiguous bits) matching rule and the r-chunk matching rule

    are the best known; they are formally presented in the following section.

    Since its introduction, NSA has had many applications such as in computer virus

    detection [37, 5], monitoring UNIX processes [36], anomaly detection [22, 26], intrusion

    detection [19, 54, 46, 59, 18, 93], scheduling [64], fault detection and diagnosis [45, 72],

    negative database [33, 98], negative authentication [25, 20]. Moreover, NSA has been

    quite successfully applied in immunology where they are used as models to provide

    insight into fundamental principles of immunity and infection [15], and to illustrate

    the immunological processes such as HIV infection [56, 57].

    The most significant characteristics of an NSA, which account for its uniqueness and

    strength, are:

    • No prior knowledge of nonself is required [29].

    • It is inherently distributable; no communication between detectors is needed [30].

    • It can hide the self concept [33].

    • Compared with other change detection methods, NSAs do not depend on prior

    knowledge of what is defined as normal. Consequently, the checking activity at each

    site can be based on that site's unique signature, while the same algorithm is used over

    multiple sites.

    • The quality of the check can be traded off against the cost of performing a check

    [38].

    • Symmetric protection is provided, so malicious manipulation of the detector set

    can be detected through the normal behavior of the system [38].

    • If the process of generating detectors is costly, it can be distributed to multiple

    sites because of its inherent parallel characteristics.


    • Detection is tunable to balance between coverage (matching probability) and the

    number of detectors [29].

    1.4.2 Positive Selection Algorithms

    Contrary to NSAs, PSAs have been less studied in the literature. PSAs are

    mainly developed and applied in intrusion detection [23, 73, 44, 66], malware detection [39],

    spam detection [81], and classification [40, 67]. Stibor et al. [80] argue that

    positive selection might have better detection performance than negative selection.

    However, for problems and applications where the number of detectors generated by NSAs is

    much smaller than the number of self samples, negative selection is obviously a better choice [51].

    Similar to NSA, a PSA contains two phases: detector generation and detection.

    In the detector generation phase (Fig. 1.5.a), the detector candidates are generated by

    some random processes and matched against the given self sample set S. The candi-

    dates that do not match any element in S are eliminated and the rest are kept and

    stored in the detector set D. In the detection phase (Fig. 1.5.b), the collection of de-

    tectors are used to distinguish self from nonself. If incoming data instance matches any

    detector, it is claimed as self.

    (a) Generation of detector set (b) Detection of new instances

    Figure 1.5: Outline of a typical positive selection algorithm.

    In other words, detector modeling involves generating a


    set of strings (patterns) that either do not match any string in a training dataset too strongly

    (negative selection) or weakly match at least one string from the same dataset (positive

    selection). Having obtained the detectors, one usually examines a testing dataset

    (i.e., "antigens"), for which one searches for one or all matching detectors for classification.

    1.5 Basic terms and definitions

    In selection algorithms, an essential component is the matching rule, which determines

    the similarity between detectors and self samples (in the detector generation

    phase) and incoming data instances (in the detection phase). Obviously, the matching

    rule depends on the detector representation. In this thesis, both self and nonself cells

    are represented as strings of fixed length. This is a simple and popular

    representation for detectors and data in AIS, and other representations (such as real-valued

    ones) can be reduced to binary strings, a special case of strings [42, 51].

    1.5.1 Strings, substrings and languages

    An alphabet Σ is a nonempty, finite set of symbols. A string s ∈ Σ∗ is a

    sequence of symbols from Σ, and its length is denoted by |s|. A string is called the empty

    string if its length equals 0. Given an index i ∈ {1, . . . , |s|}, s[i] is the symbol

    at position i in s. Given two indices i and j, whenever j ≥ i, s[i . . . j] is the

    substring of s with length j − i + 1 that starts at position i, and whenever j < i,

    s[i . . . j] is the empty string. If i = 1, then s[i . . . j] is a prefix of s and, if j = |s|,

    then s[i . . . j] is a suffix of s. For a proper prefix or suffix s′ of s, we have in addition

    |s′| < |s|. Given a string s ∈ Σℓ, another string d ∈ Σr with 1 ≤ r ≤ ℓ, and an index

    i ∈ {1, . . . , ℓ − r + 1}, we say that d occurs in s at position i if s[i . . . i + r − 1] = d.

    Moreover, the concatenation of two strings s and s′ is written s + s′.

    A set of strings S ⊆ Σ∗ is called a language. For two indices i and j, we define

    S[i . . . j] = {s[i . . . j]|s ∈ S}.


    1.5.2 Prefix trees, prefix DAGs and automata

    A prefix tree T is a rooted directed tree with edge labels from Σ where for all

    c ∈ Σ, every node has at most one outgoing edge labeled with c. For a string s, we write

    s ∈ T if there is a path from the root of T to a leaf such that s is the concatenation

    of the labels on this path. The language L(T ) described by T is defined as the set of

    all strings that have a nonempty prefix s ∈ T. For example, for T as in Fig. 1.6.a we

    have 0 ∈ T and 10 ∈ T, but 1 ∉ T. Furthermore, 0 ∈ L(T) and 01 ∈ L(T) since 0 ∈ T, and

    11 ∉ L(T) since no prefix of 11 lies in T. Trees for a self dataset and a nonself dataset are

    called positive trees and negative trees, respectively.
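A prefix tree and its two membership tests (s ∈ T and s ∈ L(T)) can be rendered in a few lines of Python. This is an illustrative sketch of the definitions above, not the data structure used later in the thesis:

```python
class PrefixTree:
    def __init__(self):
        self.children = {}  # symbol -> PrefixTree

    def insert(self, s):
        node = self
        for c in s:
            node = node.children.setdefault(c, PrefixTree())

    def contains(self, s):
        # s ∈ T: s is the concatenation of labels on a root-to-leaf path
        node = self
        for c in s:
            if c not in node.children:
                return False
            node = node.children[c]
        return not node.children  # the path must end at a leaf

    def in_language(self, s):
        # s ∈ L(T): some nonempty prefix of s lies in T
        node = self
        for c in s:
            if c not in node.children:
                return False
            node = node.children[c]
            if not node.children:
                return True
        return False
```

On the tree of Fig. 1.6.a (built from the strings 0 and 10), `contains` and `in_language` reproduce the example above: 0 ∈ T, 10 ∈ T, 1 ∉ T, 01 ∈ L(T), 11 ∉ L(T).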

    A prefix DAG D is a directed acyclic graph with edge labels from Σ, where

    again for all c ∈ Σ, every node has at most one outgoing edge labeled with c. Similar

    to prefix trees, the terms root and leaf are used to refer to a node without incoming and

    outgoing edges, respectively.

    a leaf node in D that is labeled by s. Given n ∈ D, the language L(D,n) contains

    all strings that have a nonempty prefix that labels a path from n to some leaf. For

    instance, if D is the DAG in Fig. 1.6.b and n is its lower left node, then L(D,n) consists

    of all strings starting with 11. Moreover, we define L(D) = ∪_{n is a root of D} L(D, n).

    A finite automaton is a tuple M = (Q, q0, Qa, Σ, ∆), where Q is a set of states

    with a distinguished initial state q0 ∈ Q, Qa ⊆ Q is the set of accepting states, Σ is the

    alphabet of M, and ∆ ⊆ Q × Σ × Q is the transition relation. Furthermore, we assume

    that the transition relation is unambiguous: for every q ∈ Q and every c ∈ Σ there is

    at most one q′ ∈ Q with (q, c, q′) ∈ ∆. It is common to represent the transition relation

    as a graph with node set Q (with the initial state and the accepting states highlighted

    properly) and labeled edges (a c-labeled edge from q to q′ if q′ ∈ Q with (q, c, q′) ∈ ∆).

    An automaton M is said to accept a string s if its graph contains a path from q0 to

    some q ∈ Qa whose concatenated edge labels equal s (note that this path may contain

    loops). The language L(M) contains all strings accepted by M .

    A prefix DAG can be turned into a finite automaton to decide the membership

    of strings in languages. The detailed steps of this process are presented in Chapter 4.
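Because the transition relation is unambiguous, running the automaton on a string is a simple deterministic walk. The following Python sketch (an illustration of the definition, with a hypothetical automaton encoded as a transition dictionary) checks acceptance:

```python
def accepts(delta, q0, accepting, s):
    # delta maps (state, symbol) -> state; unambiguity of the transition
    # relation makes this dictionary encoding possible
    q = q0
    for c in s:
        if (q, c) not in delta:
            return False  # no transition: the run dies
        q = delta[(q, c)]
    return q in accepting
```

For example, an automaton accepting exactly the strings that start with 11 (like L(D, n) in the prefix DAG example) can be given as `delta = {('q0','1'):'q1', ('q1','1'):'q2', ('q2','0'):'q2', ('q2','1'):'q2'}` with accepting state q2; note that the accepting state carries a loop, as permitted by the definition.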



    Figure 1.6: Example of a prefix tree and a prefix DAG.

    1.5.3 Detectors

    In PSAs and NSAs, an essential component is the matching rule, which determines

    the similarity between detectors and self samples (in the detector generation

    phase) and incoming data instances (in the detection phase). Obviously, the matching

    rule depends on the detector representation. For string-based AIS, the r-chunk and

    r-contiguous matching rules are among the most common. The r-chunk matching

    rule can be seen as a generalisation of the r-contiguous matching rule, which helps

    AIS achieve better results on data where adjacent regions of the input data sequence

    are not necessarily semantically correlated, such as in network data packets [9].

    An important difference between the rcb and r-chunk matching rules is the holes, or

    undetectable strings, that they may induce. This concept is presented in Section 1.5.5.

    Given a nonempty, finite alphabet Σ, positive and negative

    r-chunk detectors, r-contiguous detectors and rcbvl detectors can be defined as follows:

    Definition 1.1 (Positive r-chunk detectors). Given a self set S ⊆ Σℓ, a tuple (d, i) of

    a string d ∈ Σr, where r ≤ ℓ, and an index i ∈ {1, ..., ℓ − r + 1} is a positive r-chunk

    detector if there exists an s ∈ S such that d occurs in s at position i.

    Definition 1.2 (Negative r-chunk detectors). Given a self set S ⊆ Σℓ, a tuple (d, i) of

    a string d ∈ Σr, r ≤ ℓ, and an index i ∈ {1, ..., ℓ − r + 1} is a negative r-chunk detector

    if d does not occur at position i in any s ∈ S.

    Although the approaches proposed in the following chapters can be implemented

    over any finite alphabet, all examples use binary strings, Σ = {0, 1}, for ease of

    understanding.

    Example 1.1. Let ℓ = 6, r = 3. Given a set S of five self strings: s1 = 010101, s2

    = 111010, s3 = 101101, s4 = 100011, s5 = 010111. The set of some positive r-chunk

    detectors is {(010,1), (111,1), (101,2), (110,2), (010,3), (101,3), (101,4), (010,4),

    (111,4)}. The set of some negative r-chunk detectors is {(000,1), (001,1), (011,1),

    (001,2), (010,2), (100,2), (000,3), (100,3), (000,4), (001,4), (100,4)}.
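Definitions 1.1 and 1.2 can be checked mechanically. The following Python sketch (illustrative, binary alphabet assumed) enumerates the full positive and negative r-chunk detector sets, of which Example 1.1 lists a subset:

```python
from itertools import product

def chunk_detectors(self_set, ell, r):
    """All positive and negative r-chunk detectors (Defs. 1.1 and 1.2).
    Returns (Dp, Dn): dicts mapping a 1-based position i to sets of r-strings."""
    all_chunks = {''.join(c) for c in product('01', repeat=r)}
    Dp, Dn = {}, {}
    for i in range(1, ell - r + 2):
        # r-substrings occurring at position i in some self string
        seen = {s[i-1:i-1+r] for s in self_set}
        Dp[i] = seen
        Dn[i] = all_chunks - seen  # everything else is a negative detector
    return Dp, Dn
```

On the self set of Example 1.1 this gives, e.g., Dp[1] = {010, 111, 101, 100} and Dn[3] = {000, 100, 111}; at every position the two sets partition Σ^3.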

    Definition 1.3. Given a self set S ⊆ Σℓ, a string d ∈ Σℓ is an r-contiguous detector if

    d[i . . . i + r − 1] does not occur at position i in any s ∈ S, for all i ∈ {1, ..., ℓ − r + 1}.

    Example 1.2. Let ℓ = 5, r = 3. Given a set of 7 self strings S = {01111, 00111,

    10000, 10001, 10010, 10110, 11111}. The set of all 3-contiguous detectors is {01011,

    11011}. This example is adapted from [32].

    We also use the following notations:

    • Dpi = {(d, i) | (d, i) is a positive r-chunk detector} is the set of all positive r-chunk

    detectors at position i, i = 1, . . . , ℓ − r + 1.

    • Dni = {(d, i) | (d, i) is a negative r-chunk detector} is the set of all negative r-chunk

    detectors at position i, i = 1, . . . , ℓ − r + 1.

    • CHUNKp(S, r) = ∪_{i=1}^{ℓ−r+1} Dpi is the set of all positive r-chunk detectors.

    • CHUNK(S, r) = ∪_{i=1}^{ℓ−r+1} Dni is the set of all negative r-chunk detectors.

    • CONT(S, r) is the set of all r-contiguous detectors, i.e., detectors that do not match

    any string in S.

    • For a given detector set X, L(X) is the set of all nonself strings detected by X.

    We also say that Σℓ \ L(X) is the set of all self strings detected by X.

    Example 1.3. Let ℓ = 5 and matching threshold r = 3. Suppose that we have the set S

    of six self strings s1 = 00000, s2 = 00010, s3 = 10110, s4 = 10111, s5 = 11000, s6 =

    11010. Dp1 = {(000,1), (101,1), (110,1)} (Dp1 is the set of all leftmost substrings of

    length r of all s ∈ S), Dn1 = {(001,1), (010,1), (011,1), (100,1), (111,1)}, Dp2 =

    {(000,2), (001,2), (011,2), (100,2), (101,2)}, Dn2 = {(010,2), (110,2), (111,2)}, Dp3

    = {(000,3), (010,3), (110,3), (111,3)}, Dn3 = {(001,3), (011,3), (100,3), (101,3)}

    (note that Dpi ∪ Dni = Σ3, i = 1, 2, 3).


    The self space covered by CHUNKp(S, 3) is {0, 1}^5 \ L(CHUNKp(S, 3))

    = {00000, 00001, 00010, 00011, 00110, 00111, 01000, 01001, 01010, 01011, 01110,

    01111, 10000, 10001, 10010, 10011, 10100, 10101, 10110, 10111, 11000, 11001, 11010,

    11011, 11110, 11111}. The set of all strings detected by CHUNK(S, 3) is L(CHUNK(S, 3))

    = {00001, 00011, 00100, 00101, 00110, 00111, 01000, 01001, 01010, 01011, 01100,

    01101, 01110, 01111, 10000, 10001, 10010, 10011, 10100, 10101, 11001, 11011, 11100,

    11101, 11110, 11111}.

    Definition 1.4. Given a self set S ⊆ Σℓ. A triple (d, i, j) of a string d ∈ Σk, where

    k = j − i + r, an index i ∈ {1, ..., ℓ − r + 1} and an index j ∈ {i, ..., ℓ − r + 1} is called

    a negative detector under the rcbvl matching rule if d does not occur at position i in any s ∈ S.

    In other words, a triple (d, i, j) is a rcbvl detector if there exist (j − i + 1) negative r-chunk

    detectors (d1, i), ..., (dj−i+1, j) such that dk and dk+1 are two (r − 1)-symbol overlapping strings,

    k = 1, ..., j − i.

    Example 1.4. Given ℓ, r and the set S of self strings as in Example 1.1, S = {010101,

    111010, 101101, 100011, 010111}. The triple (0001,1,2) is a rcbvl detector because there

    exist two negative 3-chunk detectors (000,1), (001,2) such that 000 and 001 are two 2-bit

    overlapping strings. A detector set under the rcbvl matching rule contains 5 variable-length

    detectors {(0001,1,2), (00100,1,3), (100,4,4), (011110,1,4), (11000,1,3)}. It is a minimal

    detector set (23 bits) that covers the whole detection space of the r-chunk detector set in

    Example 1.1 (45 bits).
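The decomposition in Definition 1.4 lends itself to a direct mechanical check. The sketch below (illustrative only, assuming the detector string spans positions i through j + r − 1, i.e. |d| = j − i + r) verifies a candidate triple against the negative r-chunk detector sets:

```python
from itertools import product

def negative_chunks(self_set, ell, r):
    # Dn[i]: r-strings not occurring at position i in any self string
    all_r = {''.join(c) for c in product('01', repeat=r)}
    return {i: all_r - {s[i-1:i-1+r] for s in self_set}
            for i in range(1, ell - r + 2)}

def is_rcbvl_detector(d, i, j, Dn, r):
    # (d, i, j) is an rcbvl detector iff |d| = j - i + r and every
    # r-window of d is a negative r-chunk detector at its position;
    # consecutive windows overlap in r - 1 symbols by construction
    return len(d) == j - i + r and all(
        d[k:k+r] in Dn[i + k] for k in range(j - i + 1))
```

On the self set of Example 1.1, the triples (0001,1,2), (11000,1,3) and (100,4,4) pass this check, while (010,1,1) fails because 010 occurs at position 1 in a self string.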

    The matching threshold r plays an important role in selection algorithms: its value

    can be used to balance between underfitting and overfitting. Our proposed

    methods in Chapter 5 investigate this value in combination with simple statistics to achieve

    better detection performance.

    1.5.4 Detection in r-chunk detector-based positive selection

    It can be seen from Example 1.3 that L(CHUNKp(S, r)) = {00100, 00101,

    01100, 01101, 11100, 11101} ≠ L(CHUNK(S, r)), so the detection coverage of Dn

    is not the same as that of Dp. This is undesirable for the combination of PSA and


    NSA. Hence, to combine PSA and NSA in a unified framework, we have to change the

    original semantic of positive selection in the detection phase as follows.

    Definition 1.5 (Detection in positive selection). If a new instance matches ℓ − r + 1

    positive r-chunk detectors (d_i, i), i = 1, . . . , ℓ − r + 1, one at each position, it is claimed

    as self; otherwise it is claimed as nonself.

    With this new detection semantics, the following theorem on the equivalence of the

    detection coverage of r-chunk-based PSA and NSA can be stated.

    Theorem 1.1 (Detection Coverage). The detection coverage of positive and negative

    selection algorithms coincide.

    L(CHUNKp(S, r)) = L(CHUNK(S, r)) (1.1)

    Proof. From the description of NSAs (see Fig. 1.4), if a new data instance matches

    a negative r-chunk detector, then it is claimed as nonself; otherwise it is claimed as

    self. Obviously, this is dual to the detection of new data instances in positive selection

    as given in Definition 1.5.

    This theorem lays the foundation for our novel Positive-Negative Selection

    Algorithm (PNSA) proposed in the next chapter.
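The duality in the proof can also be checked mechanically. The Python sketch below (illustrative only) writes out both detection semantics explicitly and compares them exhaustively on the small universe of Example 1.3:

```python
from itertools import product

def chunk_sets(self_set, ell, r):
    # Dp[i]: r-chunks occurring at position i (0-based here) in some self string
    return [{s[i:i+r] for s in self_set} for i in range(ell - r + 1)]

def negative_claims_nonself(x, Dp, r):
    # NSA: nonself iff some window of x is a negative detector at its position
    return any(x[i:i+r] not in Dp[i] for i in range(len(x) - r + 1))

def positive_claims_nonself(x, Dp, r):
    # PSA under Definition 1.5: self iff every window matches a positive
    # detector, i.e. nonself iff not all windows match
    return not all(x[i:i+r] in Dp[i] for i in range(len(x) - r + 1))
```

On the self set of Example 1.3, both classifiers agree on every string of {0,1}^5 and claim exactly the 26 strings of L(CHUNK(S, 3)) as nonself, as Theorem 1.1 asserts.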

    1.5.5 Holes

    The generalization performed by selection algorithms corresponds to the strings

    that are labeled self even though they do not occur in the self sample S. These strings

    are called holes. In the rcb matching rule literature, there are two types of holes:

    crossover holes and length-limit holes. A crossover hole is a crossover of certain self strings, in

    which all its substrings of length r occur in self strings. A length-limit hole is one that has

    at least one substring of length r that exists in CHUNK(S, r). The r-chunk matching

    rule eliminates the problem of length-limit holes.

    Fig. 1.7 illustrates the existence of holes in a self and nonself space composed

    of self and detector strings. The string universe Σℓ is depicted as a square region;

    each dark circle represents a detector, and the grid-patterned shape in the middle is self.

    Figure 1.7: Existence of holes.

    The universe is classified by the detector set as self (the grid region and the holes, shown

    as white regions) and nonself (the dark region covered by the circles).

    In fact, holes are not a "problem", but, as pointed out by Stibor et al. [78], they

    are a necessary property of the selection algorithms. Without holes, the algorithms

    would do nothing but memorize the training data for classification naively.

    Fig. 1.8 shows the set of seven self strings from Example 1.2, S ⊂ {0, 1}^5 (left),

    along with CHUNK(S, 3) (middle) and CONT(S, 3) (right). For both detector types,

    the induced bipartitionings of the shape space {0, 1}^5 are illustrated with strings that

    are classified as nonself having a gray background and strings that are classified as self

    having a white background. Bold strings are members of the self-set. Holes are the

    strings that are classified as self but do not occur in the self-set S (non-bold, non-shaded

    strings). This figure is adapted from [32].

    1.5.6 Performance metrics

    We used three metrics to evaluate each machine learning technique. Detection

    rate (DR), Accuracy rate (ACC) and False alarm rate (FAR) are defined as:

    DR = TP / (TP + FN) (1.2)

    ACC = (TP + TN) / (TP + TN + FP + FN) (1.3)

    FAR = FP / (FP + TN) (1.4)


    Figure 1.8: Negative selections with 3-chunk and 3-contiguous detectors.

    where TP (True Positive) is the number of true positives (correctly classified

    as nonself), TN (True Negative) is the number of true negatives (correctly classified

    as self), FP (False Positive) is the number of false positives (classified as nonself,

    actually self), and FN (False Negative) is the number of false negatives (classified as

    self, actually nonself).
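Equations (1.2)-(1.4) translate directly into code; the counts below are hypothetical, chosen only to exercise the formulas:

```python
def metrics(tp, tn, fp, fn):
    # Equations (1.2)-(1.4) over the four confusion-matrix counts
    dr = tp / (tp + fn)                    # detection rate
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy rate
    far = fp / (fp + tn)                   # false alarm rate
    return dr, acc, far
```

For instance, with 80 true positives, 15 true negatives, 5 false positives and 0 false negatives, this yields DR = 1.0, ACC = 0.95 and FAR = 0.25.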

    We used the 10-fold cross-validation and holdout techniques to evaluate our

    approaches in experiments. Regarding the former, the dataset was randomly partitioned

    into 10 subsets. Of the 10 subsets, a single subset was retained for testing, and the

    others were used as training data. The process was then repeated 10 times, with each

    of the 10 subsets used exactly once as the testing data. The 10 results from the folds

    were then averaged to produce a single performance figure. Regarding the latter, the dataset is

    split into two groups: a training set used to train the classifier and a test set (or holdout

    set) used to estimate the performance of the classifier.
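The fold construction described above can be sketched as follows (an illustrative helper, not the thesis's experimental code; the fold count k and seed are assumptions):

```python
import random

def kfold_splits(n, k=10, seed=0):
    # randomly partition n sample indices into k folds; each fold is
    # held out once as the test set while the rest form the training set
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]
```

Each index appears in exactly one test fold, so averaging the k per-fold results uses every sample once for testing and k − 1 times for training.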

    1.5.7 Ring representation of data

    As is known, most AIS-based applications use two types of data representation:

    strings and real-valued vectors. Both popular types are linear structures of symbols or

    numbers, and they may omit information at the edges (the beginning and the end) of these

    structures. For example, to detect a new instance s as self or nonself in the detection

    phases of PSA and NSA, it takes ℓ − r + 1 steps in the worst case to match s against each

    detector at positions 1, ..., ℓ − r + 1. Therefore, for a given detector d, the symbols d[1]

    and s[1] are used in only one match, the symbols d[2] and s[2] are used in two matches, etc.

    In other words, the positions in linear structures are not equal in terms of matching times.

    Our earlier experimental implementation of NSA on binary ring-based strings

    provides a motivation for addressing this problem. A set of 50,000 random self strings and

    a set of 10,000 random nonself strings, all of length 50, were used in the experiment.

    Table 1.1 shows the experimental results for values of r ranging from 10 to 16 under

    the 10-fold cross-validation technique. The results show that both the detection rate and

    the accuracy rate of the ring-based NSA are higher than those of the linear-based one, while

    the false alarm rates are relatively similar. Accordingly, we could use ring structures instead

    of linear ones for more exact classification.

    Table 1.1: Performance comparison of NSAs on linear strings and ring strings.

     r    NSA on linear strings        NSA on ring strings
          ACC      DR       FAR       ACC      DR       FAR
    10    0.8343   0.0102   0.0008    0.8345   0.0123   0.0010
    11    0.8380   0.0535   0.0051    0.8390   0.0655   0.0063
    12    0.8488   0.1817   0.0177    0.8522   0.2193   0.0212
    13    0.8677   0.4054   0.0399    0.8723   0.4704   0.0465
    14    0.8875   0.6456   0.0640    0.8932   0.7184   0.0719
    15    0.9023   0.8293   0.0829    0.9075   0.8888   0.0888
    16    0.9112   0.9340   0.0934    0.9139   0.9662   0.0964

    With reference to string-based detector sets, a simple technique for this approach

    is to concatenate each string representing a detector with its own first k symbols. Each new

    linear string is a ring representation of its original string. Fig. 1.9 shows a ring

    representation (b) of an original string (a) with k = 3.

    Given a set of strings S ⊂ Σℓ, the set Sr ⊂ Σℓ+r−1 contains the ring representations

    of all strings in S, obtained by concatenating each string s ∈ S with its first r − 1 symbols.
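The construction of Sr, and its effect on the number of r-windows, can be sketched in two lines of Python (an illustration of the definition above):

```python
def to_ring(s, r):
    # append the first r-1 symbols so that every one of the |s|
    # positions starts a complete r-window
    return s + s[:r-1]

def ring_windows(s, r):
    # all r-windows of the ring string: |s| windows instead of |s|-r+1,
    # so every symbol of s participates in exactly r matches
    ring = to_ring(s, r)
    return [ring[i:i+r] for i in range(len(s))]
```

For example, with r = 3 the string 10110 becomes the ring string 1011010, whose five windows are 101, 011, 110, 101 and 010.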

    Figure 1.9: A simple ring-based representation (b) of a string (a).

    Note that we can easily apply the idea of ring strings to other data representations

    in AIS. One way to do this, for instance, is to create other structures, such as trees

    and automata, from the set Sr instead of S as usual.

    1.5.8 Frequency trees

    Given a set D of equal-length strings, a tree T on D, denoted TD, is a rooted

    directed tree with edge labels from Σ where for all c ∈ Σ, every node has at most one

    outgoing edge labeled with c. For a string s, we write s ∈ T if there is a path from the

    root of T to a leaf such that s is the concatenation of the labels on this path. Each

    leaf is associated with an integer, namely the frequency of the string s ∈ D that is

    the concatenation of the labels on the path ending at this leaf. This tree structure is

    a compact representation of r-chunk detectors in our algorithm in Chapter 5.

    Example 1.5. Let ℓ = 5 and matching threshold r = 3. Suppose that we have the

    set S of four strings: s1 = 00000, s2 = 10110, s3 = 10111, s4 = 11111. Sr =

    {0000000, 1011010, 1011110, 1111111}. S1 = {(000,1), (101,1), (111,1)}, S2 =

    {(000,2), (011,2), (111,2)}, S3 = {(000,3), (110,3), (111,3)}, S4 = {(000,4), (101,4),

    (111,4)}, S5 = {(000,5), (010,5), (110,5), (111,5)}.

    Assume that S = N ∪ A, where the set of normal data N = {s1, s2} and the set of

    abnormal data A = {s3, s4}.

    Nr = {0000000, 1011010} (ring representations of all strings in N). N1 =

    {(000,1), (101,1)}, N2 = {(000,2), (011,2)}, N3 = {(000,3), (110,3)}, N4= {(000,4),

    (101,4)}, N5 = {(000,5), (010,5)}.

    Ar = {1011110, 1111111}. A1 = {(101,1), (111,1)}, A2 = {(011,2), (111,2)},

    A3 = {(111,3)}, A4= {(111,4)}, A5 = {(110,5), (111,5)}.

    Ten trees representing all 3-chunk detectors are shown in Fig. 1.10. The five trees TNi (TAi),

    i = 1, . . . , 5, are in the first (second) row, from left to right, respectively.

    Figure 1.10: Frequency trees for all 3-chunk detectors.

    Call to mind that there are some strings that belong to both positive trees and

    negative trees. For example, the substring s2[1 . . . 3] = 101 satisfies s2[1 . . . 3] ∈ TN1

    and s2[1 . . . 3] ∈ TA1. This situation could lead to errors in the detection phase. Therefore,

the frequencies of matches will be used to improve the detection performance of the algorithms. The detailed technique for using these frequencies is presented in Chapter 5.
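As a small illustration (a sketch, not the thesis implementation), the per-position r-chunk sets and their frequencies from Example 1.5 can be computed directly from the ring representations:

```python
from collections import Counter

def ring(s, r):
    """Ring representation: append the first r-1 symbols to the end."""
    return s + s[:r - 1]

def chunk_counts(strings, r):
    """For each position i, count how often each r-chunk occurs among the
    ring representations; the counts are the leaf frequencies of the trees."""
    ell = len(strings[0])
    return [Counter(ring(s, r)[i:i + r] for s in strings) for i in range(ell)]

# The four strings of Example 1.5
S = ["00000", "10110", "10111", "11111"]
counts = chunk_counts(S, 3)
# counts[0] contains the chunks {000, 101, 111}, matching S1 in Example 1.5
```

Here counts[i] corresponds to the set Si+1 of the example; for instance, the chunk 101 occurs twice at position 1, which is the kind of frequency stored at the tree leaves.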

    1.6 Datasets

There are two basic types of NIDSs, depending on the source of data to be analyzed: packet-based NIDSs and flow-based ones. M. H. Bhuyan et al. reviewed in [13] that most NIDSs are packet-based. However, we concentrate only on flow-based NIDSs for three reasons: 1) they can detect some special attacks, such as DDoS or RDP Brute Force, more efficiently and faster than payload-based ones, since less information needs to be analyzed [87, 76]; 2) flow-based anomaly detection methods process only packet headers and therefore reduce the data amount and processing time, enabling high-speed detection on large networks; this addresses the scalability problem under increasing network usage and load [76]; 3) flow-based NIDSs raise fewer privacy issues than packet-based ones because of the absence of payload [76].


    1.6.1 The DARPA-Lincoln datasets

So as to evaluate the performance of different intrusion detection methodologies, MIT's Lincoln Laboratory, sponsored by the DARPA ITO and the Air Force Research Laboratory, gathered the DARPA-Lincoln datasets, which cover nine weeks in 1998: seven weeks of training data and two weeks of test data. The datasets are collected and stored in Tcpdump form; they are the data source from which datasets such as KDD99 [3] and NetFlow [86] are extracted.

More than 300 instances of 38 different attacks were launched against victim UNIX hosts in the attack data; each attack falls into one of four categories: Denial of Service (DoS), Probe, User to Root (U2R), and Remote to Local (R2L). For each week, inside and outside network traffic data, audit data recorded by the Basic Security Module on Solaris hosts, and file system dumps from UNIX hosts were collected. In 1999, another series of datasets containing three weeks of training data and two weeks of test data was collected. More than 200 instances of 58 attack types were launched against victim UNIX and Windows NT hosts and a Cisco router. In 2000, three additional scenario-specific datasets were generated to address distributed DoS and Windows NT attacks. Detailed descriptions of these datasets can be found at [1].

    1.6.2 UT dataset

A public labeled flow-based dataset is provided in [75]. This dataset was captured by monitoring a honeypot hosted in the University of Twente network, so we call it the UT dataset. It has three categories of traffic: malicious, unknown, and side-effect. It contains 14,170,132 flows, which are mostly of a malicious nature. Only a small number of flows, 5,968 or 0.042%, are unlabeled (unknown); these are considered as normal data in our experiments in the following chapters. Each flow in the

dataset has 13 fields: id (the ID of the flow), src_ip (anonymized source IP address, encoded as a 32-bit number), dst_ip (anonymized destination IP address, encoded as a 32-bit number), packets (number of packets in the flow), octets (number of bytes in the flow), start_time (UNIX start time, in seconds), start_msec (milliseconds part of the start time), end_time (UNIX end time, in seconds), end_msec (milliseconds part of the end time), src_port (source port number), dst_port (destination port number), tcp_flags (TCP flags of the flow), and prot (IP protocol number).

    Examples of a normal flow and an attack flow are (393, 3145344965, 2463760020,

    3, 168, 1222173606, 974, 1222173610, 239, 0, 769, 0, 1) and (1, 2463760020, 3752951033,

    1, 60, 1222173605, 985, 1222173605, 985, 4534, 22, 2, 6), respectively. The ID of a flow

    is used to distinguish attack flows from the others.
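To make the field layout concrete, the following sketch parses the two example records above and derives a flow duration. The field names follow the description in the text; this is an illustration, not code from the thesis:

```python
from collections import namedtuple

# The 13 fields of a UT flow record, in the order described above.
Flow = namedtuple("Flow", [
    "id", "src_ip", "dst_ip", "packets", "octets",
    "start_time", "start_msec", "end_time", "end_msec",
    "src_port", "dst_port", "tcp_flags", "prot",
])

normal = Flow(393, 3145344965, 2463760020, 3, 168,
              1222173606, 974, 1222173610, 239, 0, 769, 0, 1)
attack = Flow(1, 2463760020, 3752951033, 1, 60,
              1222173605, 985, 1222173605, 985, 4534, 22, 2, 6)

def duration_ms(f):
    """Flow duration in milliseconds from the second/millisecond fields."""
    return (f.end_time - f.start_time) * 1000 + (f.end_msec - f.start_msec)
```

For the normal example flow this yields a duration of 3,265 ms, while the attack flow starts and ends in the same millisecond.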

1.6.3 NetFlow dataset

The packet-based DARPA dataset [1] is used in [86] to generate a flow-based DARPA dataset, called NetFlow. This dataset focuses only on flows to a specific port and an IP address that receives the highest number of attacks. It includes all 129,571 flows (including attacks) to and from the victims. Each flow in the dataset has 10 fields: Source IP, Destination IP, Source Port, Destination Port, Packets, Octets, Start Time, End Time, Flags, and Proto. All 24,538 attack flows are labeled with text labels, such as neptune, portsweep, ftpwrite, etc.

Examples of a normal flow and an attack flow are (172.16.112.20, 172.16.112.50, 53, 32961, 1, 161, 1999-03-05T08:17:10, 1999-03-05T08:17:10, 17, 00) and (209.167.99.71, 172.16.112.50, 10353, 4288, 1, 46, 1999-03-12T17:23:19, 1999-03-12T17:23:19, 6, 02:::portsweep), respectively.
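As the portsweep example suggests, the attack label rides on the Flags field after a ":::" separator. A minimal sketch for separating flags from labels (an assumption about the record format inferred from the example above, not thesis code):

```python
def split_label(flags_field):
    """Split a NetFlow Flags field such as '02:::portsweep' into
    (flags, label); normal flows carry no label part."""
    if ":::" in flags_field:
        flags, label = flags_field.split(":::", 1)
        return flags, label
    return flags_field, None
```

With this convention, split_label("02:::portsweep") separates the TCP flags "02" from the attack label "portsweep", while a plain "00" is treated as a normal flow.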

    1.6.4 Discussions

This thesis does not address techniques or algorithms for NIDSs over a wide variety of datasets. Although there exist many other well-known datasets for IDS, such as the KDD99 datasets [3], the NSL-KDD dataset [4], and the FDFA datasets [2], they all fall outside the scope of this flow-oriented thesis.

There have been a number of different viewpoints and studies criticizing the DARPA datasets. McHugh [62] gave one of the most important, and deeply critical, assessments of the DARPA datasets. Among other issues, this assessment points out that normal and attack data have unrealistic data rates, that training datasets for anomaly detection are lacking for their intended purpose, and that no validation efforts were made; it thereby shows that some evaluation methodologies are questionable and may have biased the results. Furthermore, Mahoney and Chan [60], whose work can be seen as a confirmation of McHugh's findings, revealed that many attributes had small and fixed ranges in simulation, but large and growing ranges in real traffic.

Despite these criticisms, the DARPA-Lincoln datasets play a vital role as publicly available data and remain the most sophisticated benchmark for researchers evaluating intrusion detection or machine learning algorithms [92]. One important aspect of this data is that it serves as a proxy for developing, testing and evaluating detection algorithms, rather than as a solid dataset for a real-time system. If a detection algorithm achieves high performance on the DARPA data, it is more likely to perform well in a real network environment [88].

    1.7 Summary

This chapter presents the background topics used in this thesis, including the HIS, IDS, and AIS. Some terms and definitions that will be used throughout the thesis are also stated clearly. Two important data structures, ring-based and frequency-based, can be used to improve the classification rate of selection algorithms. Some popular performance metrics and three well-known datasets for NIDS are presented and discussed. These datasets will be used for experiments in the other chapters. Besides, under the new semantics of detection in the r-chunk based PSA (Section 1.5.4), we proved an important theorem on the coincidence of the detection coverage of PSA and that of NSA. This theorem leads to one contribution of the thesis that will be discussed in the next chapter.


    Chapter 2

COMBINATION OF NEGATIVE SELECTION AND POSITIVE SELECTION

It can be seen from Theorem 1.1 in Chapter 1 that the r-chunk based PSA and NSA are dual in terms of detection. This motivates our approach of combining the two selection algorithms in a way that does not affect the detection performance of either.

    2.1 Introduction

NSA and PSA are computational models inspired by the negative and positive selection processes of the biological immune system. Of the two, NSA has been studied more extensively, resulting in more variants and applications [51]. However, all existing string-based NSAs have worst-case exponential memory complexity for storing the detectors set, which limits their practical applicability [31]. In this chapter, we introduce a novel selection algorithm that employs binary representation and the r-chunk matching rule for detectors. The new algorithm combines negative and positive selection to reduce both detector storage complexity and detection time, while maintaining the same detection coverage as that of NSAs (PSAs).

In the following section, we review some related works. Section 2.3 presents in detail a new r-chunk type selection algorithm, called PNSA, that combines positive and negative selection. In our proposed approach, binary trees are used as the data structure for storing the detectors set to reduce memory complexity, and thereby the time complexity of the detection phase. Section 2.4 details preliminary experimental


results. A summary of the chapter is given in the final section.

    2.2 Related works

Both PSA and NSA achieve quite similar performance in detecting novelty in data patterns [24]. Dasgupta et al. [21] conducted one of the earliest experiments on combining positive with negative selection. The combined process is embedded in a genetic algorithm using a fitness function that assigns a weight to each bit based on domain knowledge. Their method aims to reduce neither detector storage complexity nor detection time. Esponda et al. [34] proposed a generic NSA for anomaly detection problems. Their model of normal behavior is constructed from an observed sample of normally occurring patterns. Such a model could represent either the set of allowed patterns (positive detection) or the set of anomalous patterns (negative detection). However, their NSA is not concerned with combining positive and negative selection in the detection phase as the proposed algorithm is. Stibor et al. [80] argued that positive selection might have better detection performance than negative selection. However, the choice between positive and negative selection obviously depends on the representation used in the AIS-based application.

To the best of our knowledge, there has been no published attempt at combining r-chunk type PSA and NSA for the purpose of reducing detector storage complexity and detection time complexity.

    2.3 New Positive-Negative Selection Algorithm

Our algorithm first constructs ℓ − r + 1 binary trees (called positive trees) corresponding to the ℓ − r + 1 positive r-chunk detector sets Dpi, i = 1, . . . , ℓ − r + 1. Then, all complete subtrees of these trees are removed to achieve a compact representation of the positive r-chunk detectors while maintaining the detection coverage. Finally, for every ith positive tree, we decide whether or not it should be converted to the negative tree, which covers the negative r-chunk detector set Dni. The decision depends on which tree is more compact. When this process is done, we have ℓ − r + 1 compact binary trees, some of which represent positive r-chunk detectors while the others represent negative ones.

The r-chunk matching rule on binary trees is implemented as follows: a given sample s matches the ith positive (negative) tree if s[i . . . i + k] is a path from the root to a leaf, i = 1, . . . , ℓ − r + 1, k < r. The detection phase can be conducted by traversing the compact binary trees iteratively one by one: a sample s is claimed as nonself if it matches a negative tree or if it fails to match some positive tree; otherwise it is considered as self.

Example 2.1. For the set of six self strings from Example 1.3, S = {00000, 00010, 10110, 10111, 11000, 11010}, where ℓ = 5 and r = 3, six binary trees (in which the left and right children are labeled 0 and 1, respectively) represent the six 3-chunk detector sets (Dpi and Dni, i = 1, 2, 3), as depicted in Fig. 2.1. In the figure, dashed arrows in some positive trees mark the complete subtrees that will be removed to achieve a compact tree representation. The positive trees for Dp1, Dp2 and Dp3 are in (a), (c) and (e), respectively; the negative trees for Dn1, Dn2 and Dn3 are in (b), (d) and (f), respectively.

The numbers of nodes of the trees in Figures 2.1.a - 2.1.f (after deleting complete subtrees) are 9, 10, 7, 6, 8 and 8, respectively. Therefore, the chosen final trees are those in Figures 2.1.a (9 nodes), 2.1.d (6 nodes) and 2.1.e or 2.1.f (8 nodes). In a real implementation, it is unnecessary to generate both positive trees and negative trees. Since each Dpi could dually be represented either by a positive or a negative tree, we only need to generate (compact) positive trees. If a compact positive tree T has more leaves than internal nodes with a single child, the corresponding negative tree T′ has fewer nodes than T. Therefore, T′ should be used instead of T to represent Dni more compactly. Figure 2.3 presents a diagram of the algorithm. The following example illustrates this observation.

Example 2.2. Consider again the set of six self strings S from Example 1.3, S = {00000, 00010, 10110, 10111, 11000, 11010}. The compact positive tree for the positive 3-chunk detector set Dp2 = {(000,2); (001,2); (011,2); (100,2); (101,2)} is shown


Figure 2.1: Binary tree representation of the detectors set generated from S.

    in Fig. 2.2.a. This tree has three leaves and two nodes that have only one child (in

    dotted circles) so it should be converted to the corresponding negative tree as illustrated

    in Fig. 2.2.b.

Figure 2.2: Conversion of a positive tree to a negative one.


Algorithm 2.1 Detector Generation Algorithm.

1: procedure DetectorGeneration(S, r, T )
   Input: A set of self strings S ⊆ Σ^ℓ, a matching threshold r ∈ {1, . . . , ℓ}.
   Output: A set T of ℓ − r + 1 prefix trees presenting all r-chunk detectors.
2:   T = ∅
3:   for i = 1, . . . , ℓ − r + 1 do
4:     create an empty prefix positive tree Ti
5:     for all s ∈ S do
6:       insert every s[i . . . i + r − 1] into Ti
7:     end for
8:     for all internal nodes n ∈ Ti do
9:       if n is the root of a complete binary subtree then
10:        delete this subtree
11:      end if
12:    end for
13:    if (number of leaves of Ti) > (number of nodes of Ti that have only one child) then
14:      for all internal nodes ∈ Ti do
15:        if it has only one child then
16:          if the child is a leaf then
17:            delete the child
18:          end if
19:          create the other child for it
20:        end if
21:      end for
22:      mark Ti as a negative tree
23:    end if
24:    T = T ∪ {Ti}
25:  end for
26: end procedure
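The following Python sketch mirrors the idea of Algorithm 2.1 using nested dicts as prefix trees (a leaf is an empty dict). Instead of converting a positive tree in place, it builds the positive tree and the negative tree (from the complementary chunk set), compacts both, and keeps the smaller one; this is an illustrative reimplementation, not the thesis code:

```python
from itertools import product

def build(chunks):
    """Prefix tree over a set of equal-length binary chunks."""
    tree = {}
    for c in chunks:
        node = tree
        for b in c:
            node = node.setdefault(b, {})
    return tree

def compact(node, depth):
    """Delete complete binary subtrees; return True iff the subtree rooted
    here is complete down to the remaining depth."""
    if depth == 0:
        return True
    if set(node) == {"0", "1"}:
        left = compact(node["0"], depth - 1)
        right = compact(node["1"], depth - 1)
        if left and right:
            node.clear()          # the whole subtree below is complete
            return True
    else:
        for b in node:
            compact(node[b], depth - 1)
    return False

def size(node):
    """Number of nodes, counting the root."""
    return 1 + sum(size(child) for child in node.values())

def generate(S, r):
    """For each position, keep the smaller of the compact positive and
    compact negative trees (ties go to the positive tree)."""
    ell = len(S[0])
    trees = []
    for i in range(ell - r + 1):
        pos_chunks = {s[i:i + r] for s in S}
        neg_chunks = {"".join(p) for p in product("01", repeat=r)} - pos_chunks
        pos, neg = build(pos_chunks), build(neg_chunks)
        compact(pos, r)
        compact(neg, r)
        trees.append(("negative", neg) if size(neg) < size(pos)
                     else ("positive", pos))
    return trees
```

For S and r = 3 from Example 2.1 this yields compact tree sizes 9, 6 and 8 with kinds positive, negative, positive, matching the choice of Figures 2.1.a, 2.1.d and 2.1.e.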

The proposed technique is summarized in Algorithm 2.1 and Algorithm 2.2. The first algorithm, Algorithm 2.1, generates ℓ − r + 1 trees, each of which is labeled as positive or negative. The process of generating the compact binary (positive and negative) trees representing the complete r-chunk detectors set is conducted in the outer "for" loop. First, each binary positive tree Ti is constructed by the first inner loop. Then, the compactification of each Ti is conducted by the second one, i = 1, . . . , ℓ − r + 1. The conversion of a positive tree to a negative one takes place in the "if" statement after the second inner "for" loop. The procedure for recognizing a given cell string s as self or nonself is carried out by the last "while . . . do" and "if . . . then . . . else" statements. Figure 2.4 presents a diagram of the algorithm.


    Figure 2.3: Diagram of the Detector Generation Algorithm.

The detection phase is performed by the second algorithm, Algorithm 2.2, and is illustrated by the following example.

Example 2.3. Given S and r as in Example 1.3, S = {00000, 00010, 10110, 10111, 11000, 11010}, and s = 10100 as the inputs of the algorithm, three binary trees are constructed as the detectors set, shown in Figures 2.1.a, 2.1.d and 2.1.e. The output of the algorithm is "s is nonself" because the substring s[2 . . . 4] = 010 of s matches the negative tree T2 (equivalently, 010 is not a path in the positive tree for Dp2).

Algorithm 2.2 Positive-Negative Selection Algorithm.

1: procedure PNSA(T , r, s)
   Input: A set T of ℓ − r + 1 prefix trees presenting all r-chunk detectors, a matching threshold r ∈ {1, . . . , ℓ}, an unlabeled string s ∈ Σ^ℓ.
   Output: A label of s (as self or nonself).
2:   flag = true      ▷ A temporary boolean variable
3:   i = 1
4:   while (i ≤ ℓ − r + 1) and (flag = true) do
5:     if (Ti is a positive tree) and (s ∉ Ti) then
6:       flag = false
7:     end if
8:     if (Ti is a negative tree) and (s ∈ Ti) then
9:       flag = false
10:    end if
11:    i = i + 1
12:  end while
13:  if flag = false then
14:    output s is nonself
15:  else
16:    output s is self
17:  end if
18: end procedure
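A matching and classification sketch for Algorithm 2.2 over such trees (nested dicts, leaf = empty dict). The three trees below are hand-built compact versions of Figures 2.1.a, 2.1.d and 2.1.e from Example 2.1, so this is an illustration rather than the thesis implementation:

```python
def matches(tree, chunk):
    """r-chunk matching on a tree: the chunk matches if following its bits
    from the root reaches a leaf (possibly before all bits are consumed)."""
    node = tree
    for b in chunk:
        if not node:           # reached a leaf early (compacted subtree)
            return True
        if b not in node:
            return False
        node = node[b]
    return not node            # leaf reached exactly at depth r

def pnsa(trees, s, r):
    """trees[i] = (kind, tree) for position i; kind is 'positive' or
    'negative'. Returns 'self' or 'nonself' as in Algorithm 2.2."""
    for i, (kind, tree) in enumerate(trees):
        m = matches(tree, s[i:i + r])
        if (kind == "positive" and not m) or (kind == "negative" and m):
            return "nonself"
    return "self"

# Compact trees chosen in Example 2.1: positive Dp1, negative Dn2, positive Dp3
T1 = {"0": {"0": {"0": {}}}, "1": {"0": {"1": {}}, "1": {"0": {}}}}
T2 = {"0": {"1": {"0": {}}}, "1": {"1": {}}}
T3 = {"0": {"0": {"0": {}}, "1": {"0": {}}}, "1": {"1": {}}}
trees = [("positive", T1), ("negative", T2), ("positive", T3)]
```

With these trees, the string s = 10100 of Example 2.3 is labeled nonself, while the self strings of S are labeled self.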

From the description of DetectorGeneration, it is straightforward to show that it takes |S| · (ℓ − r + 1) · r steps to generate all necessary trees (detector generation time complexity) and (ℓ − r + 1) · r steps to verify a cell string as self or nonself in the worst case (worst-case detection time complexity). These time complexities are similar to those of popular NSAs (PSAs) such as the one proposed in [31]. However, by using compact positive and negative binary trees for storing the detectors set, PNSA can reduce the storage complexity of the detectors set in comparison with other r-chunk type single NSAs or PSAs that store detectors as binary strings. This storage complexity reduction can potentially lead to better detection time complexity in real and average cases. To see this, first, let the following theorem be stated:

Figure 2.4: Diagram of the Positive-Negative Selection Algorithm.

Theorem 2.1 (PNSA detector storage complexity). Given a self set S and an integer ℓ, the procedure DetectorGeneration produces a detector (binary) tree set that has in total at most (ℓ − r + 1) · 2^(r−2) fewer nodes than the detector tree set created by a PSA or NSA alone, where r ∈ {2, . . . , ℓ − r + 1}.


Proof. We only prove the theorem for the PSA case; the NSA case can be proven in a similar way. Because ℓ − r + 1 positive trees can be built from the self set S, the theorem is proved if at most 2^(r−2) nodes can be reduced from each positive tree. The theorem is proved by induction on r (also the height of the binary trees).

Note that when converting a positive tree to a negative tree, the reduction in the number of nodes is exactly the number of leaf nodes minus the number of internal nodes that have only one child.

When r = 2, there are 16 possible positive trees of height 2. By examining all 16 cases, we have found that the maximum reduction in the number of nodes is 1 = 2^(2−2). One example of these cases is the positive tree that has 2 leaf nodes after compactification, as in Fig. 2.5.a. Since it has two leaf nodes and one one-child internal node, after being converted to the corresponding negative tree, the number of nodes is reduced by 2 − 1 = 1.

Figure 2.5: One node is reduced in a tree: a compact positive tree has 4 nodes (a) and its conversion (a negative tree) has 3 nodes (b).
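The base case can also be checked exhaustively: over all 16 chunk subsets of Σ², the maximum node reduction between the compact positive tree and its compact negative counterpart is exactly 1. A brute-force sketch (an illustrative check, not thesis code; trees are nested dicts with empty dicts as leaves):

```python
from itertools import product, combinations

def build(chunks):
    tree = {}
    for c in chunks:
        node = tree
        for b in c:
            node = node.setdefault(b, {})
    return tree

def compact(node, depth):
    """Delete complete binary subtrees; True iff this subtree is complete."""
    if depth == 0:
        return True
    if set(node) == {"0", "1"}:
        left = compact(node["0"], depth - 1)
        right = compact(node["1"], depth - 1)
        if left and right:
            node.clear()
            return True
    else:
        for b in node:
            compact(node[b], depth - 1)
    return False

def size(node):
    return 1 + sum(size(child) for child in node.values())

def compact_size(chunks, r):
    t = build(chunks)
    compact(t, r)
    return size(t)

universe = {"".join(p) for p in product("01", repeat=2)}
reductions = []
for k in range(len(universe) + 1):
    for sub in combinations(sorted(universe), k):   # every subset of chunks
        pos = set(sub)
        reductions.append(compact_size(pos, 2) - compact_size(universe - pos, 2))
# max(reductions) is the maximum base-case reduction, 1 = 2**(2-2)
```

The subset {00, 01, 10}, for instance, gives a 4-node compact positive tree versus a 3-node negative tree, exactly the situation of Fig. 2.5.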

Suppose that the theorem's conclusion holds for all r < k. We shall prove that it also holds for k. This follows from the observation that among all positive trees of height k, there is at least one tree whose left subtree and right subtree (of height k − 1) can each be reduced by at least 2^((k−1)−2) nodes after conversion.

A real experiment on a network intrusion dataset, reported at the end of the following section, shows that the storage reduction is only about 0.35% of this maximum.


    2.4 Experiments

Next, we investigate the possible impact of the reduction in detector storage complexity of PNSA on the real (average) detection time in comparison with a single NSA (PSA). All experiments are performed on a laptop computer with Windows 8 Pro 64-bit, Intel Core i5-3210M CPU 2.50GHz (4 CPUs), and 4GB RAM.

Table 2.1 shows the results on detector memory storage and detection time of PNSA compared to a popular NSA proposed in [31] for some combinations of S, ℓ and r. The training dataset of selves S contains randomly generated binary strings. The memory reduction is measured as the ratio of the reduction in the number of nodes of the binary tree detectors generated by PNSA to the number of nodes of the binary tree detectors generated by the NSA in [31]. The comparative results show that when ℓ and r are sufficiently large, the detector storage and the detection time of PNSA are significantly smaller than those of the NSA in [31] (36% and 50% less, respectively).

    Table 2.1: Comparison of memory and detection time reductions.

    |S|     ℓ    r    Memory (%)   Time (%)
    1,000   50   12   0            0
    2,000   30   15   2.5          5
    2,000   40   17   25.9         42.7
    2,000   50   20   36.3         50

We have conducted another experiment by choosing ℓ = 40 and |S| = 20,000 (S is a set of randomly generated binary strings of length ℓ) and varying r (from 15 to 40). Then, ℓ − r + 1 trees were created using the single NSA and another ℓ − r + 1 compact trees were created using PNSA. Next, both detector sets were used to detect every s ∈ S. Fig. 2.6 depicts the detection time of PNSA and NSA in the experiment. The results show that the PNSA detection time is significantly smaller than that of NSA. For instance, when r is from 20 to 34, detection in PNSA is about 4.46 times faster than in NSA.

The next experiment is conducted on the NetFlow dataset, a conversion of Tcpdump data from the well-known DARPA dataset to NetFlow [86]. We use all 105,033 normal flows as self samples. This self set is first converted to binary strings of length 104; then we run


Figure 2.6: Detection time of NSA and PNSA.

our algorithm with r changing from 5 to 45. Table 2.2 shows some of the experimental steps; the percentage of node reduction is in the final column. Fig. 2.7 depicts the reduction in nodes in trees created by PNSA in comparison to that of NSA for all r = 3, . . . , 45. It shows that the reduction is more than one third when the matching threshold is greater than 19.

    Table 2.2: Comparison of nodes generation on Netflow dataset.

    r NSA PNSA Reduction(%)5 727 706 2.8910 33,461 31,609 5.5315 1,342,517 1,154,427 14.0120 9,428,132 6,157,766 34.6825 18,997,102 11,298,739 40.5230 29,668,240 17,080,784 42.4235 42,596,987 2