
Online Intrusion Alert Aggregation with Generative Data Stream Modeling

Alexander Hofmann, Member, IEEE, and Bernhard Sick, Member, IEEE

Abstract—Alert aggregation is an important subtask of intrusion detection. The goal is to identify and to cluster different alerts—produced by low-level intrusion detection systems, firewalls, etc.—belonging to a specific attack instance which has been initiated by an attacker at a certain point in time. Thus, meta-alerts can be generated for the clusters that contain all the relevant information, while the amount of data (i.e., alerts) can be reduced substantially. Meta-alerts may then be the basis for reporting to security experts or for communication within a distributed intrusion detection system. We propose a novel technique for online alert aggregation which is based on a dynamic, probabilistic model of the current attack situation. Basically, it can be regarded as a data stream version of a maximum likelihood approach for the estimation of the model parameters. With three benchmark data sets, we demonstrate that it is possible to achieve reduction rates of up to 99.96 percent while the number of missing meta-alerts is extremely low. In addition, meta-alerts are generated with a delay of typically only a few seconds after observing the first alert belonging to a new attack instance.

Index Terms—Intrusion detection, alert aggregation, generative modeling, data stream algorithm.


1 INTRODUCTION

Intrusion detection systems (IDS) are, besides other protective measures such as virtual private networks, authentication mechanisms, or encryption techniques, very important to guarantee information security. They help to defend against the various threats to which networks and hosts are exposed by detecting the actions of attackers or attack tools in a network- or host-based manner with misuse or anomaly detection techniques [1].

At present, most IDS are quite reliable in detecting suspicious actions by evaluating TCP/IP connections or log files, for instance. Once an IDS finds a suspicious action, it immediately creates an alert which contains information about the source, target, and estimated type of the attack (e.g., SQL injection, buffer overflow, or denial of service). As the intrusive actions caused by a single attack instance—which is the occurrence of an attack of a particular type that has been launched by a specific attacker at a certain point in time—are often spread over many network connections or log file entries, a single attack instance often results in hundreds or even thousands of alerts. IDS usually focus on detecting attack types, but not on distinguishing between different attack instances. In addition, even low rates of false alerts could easily result in a high total number of false alerts if thousands of network packets or log file entries are inspected. As a consequence, the IDS creates many alerts at a low level of abstraction. It is extremely difficult for a human security expert to inspect this flood of alerts, and decisions that follow from single alerts might be wrong with a relatively high probability.

In our opinion, a “perfect” IDS should be situation-aware [2] in the sense that at any point in time it should “know” what is going on in its environment regarding attack instances (of various types) and attackers. In this paper, we make an important step toward this goal by introducing and evaluating a new technique for alert aggregation. Alerts may originate from low-level IDS such as those mentioned above, from firewalls (FW), etc. Alerts that belong to one attack instance must be clustered together and meta-alerts must be generated for these clusters. The main goal is to reduce the amount of alerts substantially without losing any important information which is necessary to identify ongoing attack instances. We want to have no missing meta-alerts, but in turn we accept false or redundant meta-alerts to a certain degree.

This problem is not new, but current solutions are typically based on a quite simple sorting of alerts, e.g., according to their source, destination, and attack type. Under real conditions such as the presence of classification errors of the low-level IDS (e.g., false alerts), uncertainty with respect to the source of the attack due to spoofed IP addresses, or wrongly adjusted time windows, for instance, such an approach fails quite often.

Our approach has the following distinct properties:

. It is a generative modeling approach [3] using probabilistic methods. Assuming that attack instances can be regarded as random processes “producing” alerts, we aim at modeling these processes using approximative maximum likelihood parameter estimation techniques. Thus, the beginning as well as the completion of attack instances can be detected.

. It is a data stream approach, i.e., each observed alert is processed only a few times [4]. Thus, it can be applied online and under harsh timing constraints.


. The authors are with the Computationally Intelligent Systems Group (CIS), Department of Informatics and Mathematics, University of Passau, Innstraße 33, 94032 Passau, Germany. E-mail: {hofmann, sick}@fim.uni-passau.de.

Manuscript received 26 Sept. 2008; revised 1 May 2009; accepted 5 July 2009; published online 11 Aug. 2009. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TDSC-2008-09-0152. Digital Object Identifier no. 10.1109/TDSC.2009.36.



The remainder of this paper is organized as follows: In Section 2, some related work is presented. Section 3 describes the proposed alert aggregation approach, and Section 4 provides experimental results for the alert aggregation using various data sets. Finally, Section 5 summarizes the major findings.

2 RELATED WORK

Most existing IDS are optimized to detect attacks with high accuracy. However, they still have various disadvantages that have been outlined in a number of publications and a lot of work has been done to analyze IDS in order to direct future research (cf. [5], for instance). Besides others, one drawback is the large amount of alerts produced. Recent research focuses on the correlation of alerts from (possibly multiple) IDS. If not stated otherwise, all approaches outlined in the following present either online algorithms or—as we see it—can easily be extended to an online version.

Probably, the most comprehensive approach to alert correlation is introduced in [6]. One step in the presented correlation approach is attack thread reconstruction, which can be seen as a kind of attack instance recognition. No clustering algorithm is used, but a strict sorting of alerts within a temporal window of fixed length according to the source, destination, and attack classification (attack type). In [7], a similar approach is used to eliminate duplicates, i.e., alerts that share the same quadruple of source and destination address as well as source and destination port. In addition, alerts are aggregated (online) into predefined clusters (so-called situations) in order to provide a more condensed view of the current attack situation. The definition of such situations is also used in [8] to cluster alerts. In [9], alert clustering is used to group alerts that belong to the same attack occurrence. Even though called clustering, there is no clustering algorithm in a classic sense. The alerts from one (or possibly several) IDS are stored in a relational database and a similarity relation—which is based on expert rules—is used to group similar alerts together. Two alerts are defined to be similar, for instance, if both occur within a fixed time window and their source and target match exactly. As already mentioned, these approaches are likely to fail under real-life conditions with imperfect classifiers (i.e., low-level IDS) with false alerts or wrongly adjusted time windows.

Another approach to alert correlation is presented in [10]. A weighted, attribute-wise similarity operator is used to decide whether to fuse two alerts or not. However, as already stated in [11] and [12], this approach suffers from the high number of parameters that need to be set. The similarity operator presented in [13] has the same disadvantage—there are lots of parameters that must be set by the user and there is no or only little guidance in order to find good values. In [14], another clustering algorithm that is based on attribute-wise similarity measures with user-defined parameters is presented. However, a closer look at the parameter setting reveals that the similarity measure, in fact, degenerates to a strict sorting according to the source and destination IP addresses and ports of the alerts. The drawbacks that arise thereof are the same as those mentioned above.

In [15], three different approaches are presented to fuse alerts. The first, quite simple one groups alerts according to their source IP address only. The other two approaches are based on different supervised learning techniques. Besides a basic least-squares error approach, multilayer perceptrons, radial basis function networks, and decision trees are used to decide whether to fuse a new alert with an already existing meta-alert (called scenario) or not. Due to the supervised nature, labeled training data need to be generated, which could be quite difficult in case of various attack instances.

The same or quite similar techniques as described so far are also applied in many other approaches to alert correlation, especially in the field of intrusion scenario detection. Prominent research in scenario detection is described in [16], [17], [18], for example. More details can be found in [19].

In [20], an offline clustering solution based on the CURE algorithm is presented. The solution is restricted to numerical attributes. In addition, the number of clusters must be set manually. This is problematic, as in fact it assumes that the security expert has knowledge about the actual number of ongoing attack instances. The alert clustering solution described in [11] is more related to ours. A link-based clustering approach is used to repeatedly fuse alerts into more generalized ones. The intention is to discover the reasons for the existence of the majority of alerts, the so-called root causes, and to eliminate them subsequently. An attack instance in our sense can also be seen as a kind of root cause, but in [11] root causes are regarded as “generally persistent” which does not hold for attack instances that occur only within a limited time window. Furthermore, only root causes that are responsible for a majority of alerts are of interest and the attribute-oriented induction algorithm is forced “to find large clusters” as the alert load can thus be reduced at most. Attack instances that result in a small number of alerts (such as PHF or FFB) are likely to be ignored completely. The main difference to our approach is that the algorithm can only be used in an offline setting and is intended to analyze historical alert logs. In contrast, we use an online approach to model the current attack situation. The alert clustering approach described in [12] is based on [11] but aims at reducing the false positive rate. The created cluster structure is used as a filter to reduce the amount of created alerts. Those alerts that are similar to already known false positives are kept back, whereas alerts that are considered to be legitimate (i.e., dissimilar to all known false positives) are reported and not further aggregated. The same idea—but based on a different offline clustering algorithm—is presented in [21].

A completely different clustering approach is presented in [22]. There, the reconstruction error of an autoassociator neural network (AA-NN) is used to distinguish different types of alerts. Alerts that yield the same (or a similar) reconstruction error are put into the same cluster. The approach can be applied online, but an offline training phase and training data are needed to train the AA-NN and also to manually adjust intervals for the reconstruction error that determine which alerts are clustered together. In addition, it turned out that due to the dimensionality reduction by the AA-NN, alerts of different types can have the same reconstruction error, which leads to erroneous clustering.

In our prior work, we applied the well-known c-means clustering algorithm in order to identify attack instances [23]. However, this algorithm also works in a purely offline manner.

3 A NOVEL ONLINE ALERT AGGREGATION TECHNIQUE

In this section, we describe our new alert aggregation approach which is—at each point in time—based on a probabilistic model of the current situation. To outline the preconditions and objectives of alert aggregation, we start with a short sketch of our intrusion framework. Then, we briefly describe the generation of alerts and the alert format. We continue with a new clustering algorithm for offline alert aggregation which is basically a parameter estimation technique for the probabilistic model. After that, we extend this offline method to an algorithm for data stream clustering which can be applied to online alert aggregation. Finally, we make some remarks on the generation of meta-alerts.

3.1 Collaborating Intrusion Detection Agents

In our work, we focus on a system of structurally very similar so-called intrusion detection (ID) agents. Through self-organized collaboration, these ID agents form a distributed intrusion detection system (DIDS).

Fig. 1 outlines the layered architecture of an ID agent: The sensor layer provides the interface to the network and the host on which the agent resides. Sensors acquire raw data from both the network and the host, filter incoming data, and extract interesting and potentially valuable (e.g., statistical) information which is needed to construct an appropriate event. At the detection layer, different detectors, e.g., classifiers trained with machine learning techniques such as support vector machines (SVM) or conventional rule-based systems such as Snort [24], assess these events and search for known attack signatures (misuse detection) and suspicious behavior (anomaly detection). In case of attack suspicion, they create alerts which are then forwarded to the alert processing layer. Alerts may also be produced by FW or the like. At the alert processing layer, the alert aggregation module has to combine alerts that are assumed to belong to a specific attack instance. Thus, so-called meta-alerts are generated. Meta-alerts are used or enhanced in various ways, e.g., scenario detection or decentralized alert correlation. An important task of the reaction layer is reporting.

The overall architecture of the distributed intrusion detection system and a framework for large-scale simulations are described in [25], [26] in more detail.

In our layered ID agent architecture, each layer assesses, filters, and/or aggregates information produced by a lower layer. Thus, relevant information gets more and more condensed and certain, and, therefore, also more valuable. We aim at realizing each layer in a way such that the recall of the applied techniques is very high, possibly at the cost of a slightly poorer precision [27]. In other words, with the alert aggregation module—on which we focus in this paper—we want to have a minimal number of missing meta-alerts (false negatives) and we accept some false meta-alerts (false positives) and redundant meta-alerts in turn.

3.2 Alert Generation and Format

In this section, we make some comments on the information contained in alerts, the objects that must be aggregated, and on their format. As the concrete content and format depend on a specific task and on certain realizations of the sensors and detectors, some more details will be given in Section 4 together with the experimental conditions.

At the sensor layer, sensors determine the values of attributes that are used as input for the detectors as well as for the alert clustering module. Attributes in an event that are independent of a particular attack instance can be used for classification at the detection layer. Attributes that are (or might be) dependent on the attack instance can be used in an alert aggregation process to distinguish different attack instances. A perfect partition into dependent and independent attributes, however, cannot be made. Some are clearly dependent (such as the source IP address which can identify the attacker), some are clearly independent (such as the destination port which usually is 80 in case of web-based attacks), and lots are both (such as the destination port which can be a hint to the attacker’s actual target service as well as to an attack tool specifically designed to target a particular service only). In addition to the attributes produced by the sensors, alert aggregation is based on additional attributes generated by the detectors. Examples are the estimated type of the attack instance that led to the generation of the alert (e.g., SQL injection, buffer overflow, or denial of service), and the degree of uncertainty associated with that estimate.

An alert $\mathbf{a}$ consists of a number of $D$ attributes. We assume that $D_m$ ($D_m \leq D$) of these attributes are categorical and the other $D - D_m$ are continuous (real valued). Without loss of generality, we arrange these attributes such that

$$\mathbf{a} = (\underbrace{\mathbf{a}_1, \ldots, \mathbf{a}_{D_m}}_{\text{categorical}}, \underbrace{a_{D_m+1}, \ldots, a_D}_{\text{continuous}}). \qquad (1)$$

An extension to other attribute types (e.g., integers) would be straightforward.


Fig. 1. Architecture of an intrusion detection agent.


Examples for continuous or approximately continuous attributes are durations or the amount of transmitted data. Examples for categorical attributes are IP addresses, port numbers, or TCP status flags. For the categorical attributes, we apply a 1-of-$K_d$ coding scheme where $K_d$ is the number of different categories of attribute $d \in \{1, \ldots, D_m\}$. The value of such an attribute is represented by a vector

$$\mathbf{a}_d = (a_{d1}, \ldots, a_{dK_d}), \qquad (2)$$

with $a_{dj} = 1$ if $a_d$ belongs to category $j$ and $a_{dj} = 0$ otherwise. For continuous attributes, no dedicated coding scheme is required. The values are just denoted by a scalar value $a_d \in \mathbb{R}$ for all $d \in \{D_m+1, \ldots, D\}$.

Typically, the number of different categories can be kept low by defining appropriate equivalence classes. For example, we do not define 65,535 different categories for the TCP port numbers. Instead, using the IANA port number list [28], we define three different categories: well-known ports (0–1,023), registered ports (1,024–49,151), and dynamic and/or private ports (49,152–65,535). IP addresses, for instance, are subdivided according to the categories defined in RFC 1597 and RFC 790 [29], [30]. That is, domain knowledge must be used to define equivalence classes. Thus, the number of categories can be reduced substantially which makes the alert aggregation easier without a loss of significant information. Quite similar schemes have already successfully been applied (see, e.g., [9], [11], [23]).
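To make this concrete, the following minimal sketch shows how an alert with categorical and continuous attributes could be encoded, using the IANA port ranges from above as equivalence classes. The attribute selection and the helper names (port_category, encode_alert, etc.) are illustrative choices of ours, not taken from the paper.

```python
# Minimal sketch of the alert encoding described above (illustrative only).

def port_category(port: int) -> int:
    """Map a TCP port to one of three IANA-based equivalence classes."""
    if port <= 1023:
        return 0      # well-known ports
    if port <= 49151:
        return 1      # registered ports
    return 2          # dynamic and/or private ports

def one_of_k(category: int, k: int) -> list[int]:
    """1-of-K coding: a K-dimensional 0/1 vector with a single 1 (cf. (2))."""
    vec = [0] * k
    vec[category] = 1
    return vec

def encode_alert(src_port: int, dst_port: int, duration: float, num_bytes: float) -> list[float]:
    """Encode one alert: categorical attributes first, continuous ones last (cf. (1))."""
    categorical = one_of_k(port_category(src_port), 3) + one_of_k(port_category(dst_port), 3)
    continuous = [duration, num_bytes]
    return categorical + continuous

# Example: an alert for a connection from a dynamic source port to port 80.
print(encode_alert(51234, 80, duration=0.42, num_bytes=1337.0))
```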

3.3 Offline Alert Aggregation

In this section, we introduce an offline algorithm for alert aggregation which will be extended to a data stream algorithm for online aggregation in Section 3.4.

Assume that a host with an ID agent is exposed to a certain intrusion situation as sketched in Fig. 2: One or several attackers launch several attack instances belonging to various attack types. The attack instances each cause a number of alerts with various attribute values. Only two of the attributes are shown and the correspondence of alerts and (true or estimated) attack instances is indicated by different symbols. Fig. 2a shows a view on the “ideal world” which an ID agent does not have. The agent only has observations of the detectors (alerts) in the attribute space without attack instance labels as outlined in Fig. 2b. The task of the alert aggregation module is now to estimate the assignment to instances by using the unlabeled observations only and by analyzing the cluster structure in the attribute space. That is, it has to reconstruct the attack situation. Then, meta-alerts can be generated that are basically an abstract description of the cluster of alerts assumed to originate from one attack instance. Thus, the amount of data is reduced substantially without losing important information. Fig. 2c shows the result of a reconstruction of the situation. There may be different potentially problematic situations:

1. False alerts are not recognized as such and wrongly assigned to clusters: This situation is acceptable as long as the number of false alerts is comparably low.

2. True alerts are wrongly assigned to clusters: This situation is not really problematic as long as the majority of alerts belonging to that cluster is correctly assigned. Then, no attack instance is missed.

3. Clusters are wrongly split: This situation is undesired but clearly unproblematic as it leads to redundant meta-alerts only. Only the data reduction rate is lower; no attack instance is missed.

4. Several clusters are wrongly combined into one: This situation is definitely problematic as attack instances may be missed.

According to our objectives (cf. Section 3.1), we must try to avoid the latter situation but we may accept the former three situations to a certain degree.

Fig. 2. Example illustrating the alert aggregation task and possible problems (artificial attack situation). (a) Idealized world: In the idealized IDS, the detectors do not make errors (no false and missing alerts) and the correct assignment of alerts to attack instances is known (indicated by different symbols). (b) Actual observations: The alerts produced by a real detection layer. The task of the alert aggregation is to reconstruct the attack situation by means of these observations only (including false alerts). (c) Reconstruction: The result of the aggregation (correspondence of alerts and clusters/meta-alerts) together with four different types of problems that may arise.

How can the set of samples be clustered (i.e., aggregated) to generate meta-alerts? Here, the answer to this question is identical to the answer to the following: How can an attack situation be modeled with a parameterized probabilistic model and how can the parameters be estimated from the observations?

First, we introduce a model for a single attack instance. We regard an attack instance as a random process generating alerts that are distributed according to a certain multivariate probability distribution. The alert space is composed of several attributes. We have categorical attributes which we assume to be multinomially distributed. That is, for an attribute $\mathbf{a}_d \in \{\mathbf{a}_1, \ldots, \mathbf{a}_{D_m}\}$ (cf. (2)), we use

$$\mathcal{M}(\mathbf{a}_d \mid \boldsymbol{\delta}_d) = \prod_{k=1}^{K_d} (\delta_{dk})^{a_{dk}}, \qquad (3)$$

with a parameter vector $\boldsymbol{\delta}_d = (\delta_{d1}, \ldots, \delta_{dK_d})$ and the restrictions $\sum_{k=1}^{K_d} \delta_{dk} = 1$ and $\delta_{dk} \geq 0$ for all $k$. The continuous attributes must be chosen such that a normal distribution (Gaussian distribution) can be assumed. That is, for an attribute $a_d \in \{a_{D_m+1}, \ldots, a_D\}$ we use

$$\mathcal{N}\left(a_d \mid \mu_d, \sigma_d^2\right) = \frac{1}{\sqrt{2\pi}\,\sigma_d} \exp\left(-\frac{1}{2\sigma_d^2}\,(a_d - \mu_d)^2\right), \qquad (4)$$

with mean $\mu_d \in \mathbb{R}$ and standard deviation $\sigma_d \in \mathbb{R}^+$.

Assuming further that the attributes are not correlated (i.e., statistically independent from each other), the multivariate hybrid distribution $\mathcal{H}$ for one attack instance gets

$$\mathcal{H}\left(\mathbf{a} \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\delta}\right) = \prod_{d=1}^{D_m} \mathcal{M}(\mathbf{a}_d \mid \boldsymbol{\delta}_d) \cdot \prod_{d=D_m+1}^{D} \mathcal{N}\left(a_d \mid \mu_d, \sigma_d^2\right), \qquad (5)$$

with corresponding parameter vectors $\boldsymbol{\mu}$, $\boldsymbol{\sigma}^2$, and $\boldsymbol{\delta}$. That is, the joint distribution is a product of the univariate distributions. The distribution underlying an overall attack situation consisting of $J \in \mathbb{N}$ attack instances is then assumed to be a mixture of $J$ hybrid distributions which are then called components of this mixture distribution:

$$\mathcal{S}\left(\mathbf{a} \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\delta}, \boldsymbol{\pi}\right) = \sum_{j=1}^{J} \pi_j \cdot \mathcal{H}\left(\mathbf{a} \mid \boldsymbol{\mu}_j, \boldsymbol{\sigma}_j^2, \boldsymbol{\delta}_j\right), \qquad (6)$$

with $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_J)$, $\pi_j \geq 0$ for all $j$, and $\sum_{j=1}^{J} \pi_j = 1$. The $\pi_j$ are called mixing coefficients.

Reconstructing an attack situation from observed samples as sketched above means to estimate all the parameters of the mixture distribution by means of those samples. The standard approach for such a problem is a maximum likelihood (ML) estimation [3].

That is, we are given a set of $N$ alerts $A = \{\mathbf{a}^{(1)}, \ldots, \mathbf{a}^{(N)}\}$ originating from the situation that must be reconstructed. Assuming that these samples are independent and identically distributed (i.i.d.) according to $\mathcal{S}$, the so-called likelihood function

$$\mathcal{S}\left(A \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\delta}, \boldsymbol{\pi}\right) = \prod_{n=1}^{N} \mathcal{S}\left(\mathbf{a}^{(n)} \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\delta}, \boldsymbol{\pi}\right) \qquad (7)$$

must be maximized with respect to all parameters. In the case of mixture models, it is not possible to evolve a closed formula for the optimization. Instead, an iterative method called expectation maximization (EM) must be applied (cf. [3], for instance).

The resulting EM procedure for our attack situation model is shown in Algorithm 1. It iteratively maximizes the likelihood with two alternating computation steps: E (expectation) and M (maximization). The E step assigns the alerts to components—resulting in a partition of the set $A$ with $J$ clusters—and the M step optimizes the parameters of the mixture model.

Some additional remarks must be made:

Initialization of model parameters. The aim of the initialization is to find good initial values. Instead of using a random initialization which results in higher runtimes and suboptimal solutions, we use a heuristic which we have successfully applied to the training of radial basis function neural networks [31]. This heuristic selects as initial cluster centers a set of alerts with a large spread in the attribute space.

Hard assignment of alerts to components. More general EM algorithms make a gradual assignment of alerts to components in the E step (cf. responsibilities in [3]). In practical applications, a hard assignment reduces the runtimes significantly at the cost of slightly worse solutions in some situations. In our case, this is acceptable as we do not want to find the optimal model parameters at the end, but to generate the optimal set of meta-alerts.

Stopping criterion. An EM algorithm guarantees that the set of parameters is improved in each step. In addition, due to the hard assignment of alerts, there exists a limited number of possible assignments. Thus, it can be guaranteed that the algorithm converges. For the sake of simplicity, however, we usually run the algorithm for a fixed number of iterations.

Fixed mixing coefficients. One of the main difficulties in alert aggregation is the wide range of possible cluster sizes. There are clusters that contain thousands of alerts, but there are also clusters that consist of a few alerts only. For instance, a Neptune attack instance may result in 200,000 alerts whereas a PHF attack instance may consist of only five alerts [32]. Thus, in contrast to a more general EM approach, it is important to fix the mixing coefficients (here: $\pi_j = \frac{1}{J}$). Otherwise, if the mixing coefficients were estimated from the observed samples, the EM algorithm would focus on the optimization of the parameters of “heavy” components while neglecting the “light” ones.



One important task has not been mentioned so far: The estimation of the number of components (clusters) $J$ from the set of observed samples $A$. Up to now, we assumed that this number is given. In our case, the number of clusters shall correspond to the number of attack instances in an ideal case. To solve this problem, cluster validation measures can be used to assess the quality of a clustering result. A clustering algorithm is started several times with a varying number of clusters, the quality for each clustering result is assessed numerically, and finally the results are compared and the number that optimizes this measure is chosen (so-called relative criterion, cf. [33]). Examples for suitable measures are the Dunn index [34], the Davies-Bouldin index [35], and the CDbw index [36]. The idea that forms the basis of those measures is that they search for a clustering result that consists of clusters that are both compact and separable. Although the usability of such measures has been proven in many (offline) application examples, they cannot be applied in a data streaming case as they suffer from high runtimes, e.g., $\mathcal{O}(N^2)$ in the case of the Dunn index. We need a measure that can be updated incrementally when a new alert arrives. Therefore, we chose a density-based approach to estimate separability and compactness.

As a final result of the EM algorithm, we not only know the optimized parameter values of the model, but also the final assignment of alerts to components. These clusters define a partition of the sample set.

First, we define the compactness of a single cluster associated with a component $j$ by

$$\gamma_j = \min_{\mathbf{a}^{(n)} \text{ assigned to } j} \mathcal{H}\left(\mathbf{a}^{(n)} \mid \boldsymbol{\mu}_j, \boldsymbol{\sigma}_j^2, \boldsymbol{\delta}_j\right). \qquad (8)$$

Then, the compactness of the overall partition is

$$\gamma = \min_{j \in \{1, \ldots, J\}} \gamma_j. \qquad (9)$$

That means, the compactness is influenced by the alert with the lowest likelihood. As a consequence, we get a bias toward a higher number of components, which is desirable in our application (high recall required, lower precision accepted). Note that we can easily update this measure in $\mathcal{O}(1)$ if a new alert is assigned to a component.

Second, we define the separability of a pair of clusters associated with components $i$ and $j$ by

$$\rho(i, j) = \frac{1}{2} \left( \frac{\mathcal{H}\left(\mathbf{a}^{(*)}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\sigma}_j^2, \boldsymbol{\delta}_j\right)}{\mathcal{H}\left(\mathbf{a}^{(*)}_j \mid \boldsymbol{\mu}_j, \boldsymbol{\sigma}_j^2, \boldsymbol{\delta}_j\right)} + \frac{\mathcal{H}\left(\mathbf{a}^{(*)}_j \mid \boldsymbol{\mu}_i, \boldsymbol{\sigma}_i^2, \boldsymbol{\delta}_i\right)}{\mathcal{H}\left(\mathbf{a}^{(*)}_i \mid \boldsymbol{\mu}_i, \boldsymbol{\sigma}_i^2, \boldsymbol{\delta}_i\right)} \right), \qquad (10)$$

where we use the alerts with the highest likelihoods

$$\mathbf{a}^{(*)}_h = \operatorname*{arg\,max}_{\mathbf{a}^{(n)} \text{ assigned to } h} \mathcal{H}\left(\mathbf{a}^{(n)} \mid \boldsymbol{\mu}_h, \boldsymbol{\sigma}_h^2, \boldsymbol{\delta}_h\right). \qquad (11)$$

Then, the separability of the overall partition is

$$\rho = \min_{i, j \in \{1, \ldots, J\},\, i \neq j} \rho(i, j). \qquad (12)$$

An incremental version of this measure with $\mathcal{O}(1)$ time complexity is also straightforward.

Finally, the overall cluster validation measure $q$ for a clustering result is

$$q = \frac{\gamma}{\rho}. \qquad (13)$$

We search for a number $J$ that maximizes this measure. In Section 3.4, it will become clear that only a small number of choices for $J$ must actually be tested by our online aggregation algorithm.
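The measures (8)-(13) only require, per cluster, the minimum member likelihood and the maximum-likelihood member itself, which is why they can be maintained incrementally. The following sketch computes the validation measure in batch form from a likelihood function passed in as a callable; the function names and the symbols follow our reconstruction above, not a code listing from the paper. A streaming implementation would instead keep, per component, the running minimum likelihood and the current maximum-likelihood alert, updating both in constant time per assignment.

```python
def compactness(clusters, likelihood):
    """gamma: minimum over clusters of the minimum member likelihood (cf. (8), (9))."""
    return min(min(likelihood(a, j) for a in alerts) for j, alerts in clusters.items())

def separability(clusters, likelihood):
    """rho: minimum over cluster pairs of the symmetric cross-likelihood ratio (cf. (10)-(12))."""
    # a_star[h]: the alert of cluster h with the highest likelihood under its own component.
    a_star = {h: max(alerts, key=lambda a: likelihood(a, h)) for h, alerts in clusters.items()}
    best = float("inf")
    ids = list(clusters)
    for i in ids:
        for j in ids:
            if i >= j:
                continue
            rho_ij = 0.5 * (likelihood(a_star[i], j) / likelihood(a_star[j], j)
                            + likelihood(a_star[j], i) / likelihood(a_star[i], i))
            best = min(best, rho_ij)
    return best

def validation_measure(clusters, likelihood):
    """q = gamma / rho; the number of components J is chosen to maximize q (cf. (13))."""
    return compactness(clusters, likelihood) / separability(clusters, likelihood)
```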

3.4 Data Stream Alert Aggregation

In this section, we describe how the offline approach is extended to an online approach working for dynamic attack situations.

Assume that in the environment observed by an ID agent, attackers initiate new attack instances that cause alerts for a certain time interval until this attack instance is completed. Thus, at any point in time the ID agent—which is assumed to have a model of the current situation, cf. Fig. 3a—has several tasks, cf. Fig. 3b:

1. Component adaption: Alerts associated with already recognized attack instances must be identified as such and assigned to already existing clusters while adapting the respective component parameters.

2. Component creation (novelty detection): The occurrence of new attack instances must be stated. New components must be parameterized accordingly.


3. Component deletion (obsoleteness detection): The completion of attack instances must be detected and the respective components must be deleted from the model.

That is, the ID agent must be situation-aware and try to keep its model of the current attack situation permanently up to date, see Fig. 3c.

Fig. 3. Example illustrating the principle of online alert aggregation (artificial attack situation). (a) Existing model: Components have been created by the alert aggregation module. These components are the basis for meta-alert generation. (b) Assignment problem: New observations must either be assigned to an existing component, which is then adapted, or a new component must be created. Also, outdated components must be deleted. (c) Adapted model: The new situation after a few steps. One component has been created, one component has been deleted, and the other components have been adapted accordingly.

Clearly, there is a trade-off between runtime (or reaction time) and accuracy. For example, it is hardly possible to decide upon the existence of a new attack instance when only one observation is made. From the viewpoint of our objectives (cf. Section 3.1), tasks 1 and 2 are more time critical than task 3.

From a probabilistic viewpoint, we can state that our overall random process is nonstationary in a certain sense, which can be regarded as being equivalent to changing the mixing coefficients at certain points in time. A mixing coefficient is either zero or the reciprocal of the number of “active” components (for the time interval of the respective attack instance). With appropriate novelty and obsoleteness detection mechanisms, we aim at detecting these points in time with both sufficient certainty and timeliness.

The offline aggregation algorithm provides the basic idea for the online case, too. We have to consider, however, that we want to have a data stream algorithm with short runtimes. In the following, we do not distinguish between the following two views on solutions of the modeling problem: the probabilistic model with $J$ components with parameters $\boldsymbol{\mu}_j$, $\boldsymbol{\sigma}_j^2$, and $\boldsymbol{\delta}_j$ for $j \in \{1, \ldots, J\}$ and mixing coefficients $\frac{1}{J}$, and the partition of the overall set of alerts into a set of clusters $C = \{C_1, \ldots, C_J\}$. That is, we equate the terms partition and model as well as cluster and component, respectively. For online clustering, we require that each alert $\mathbf{a}$ has a time stamp $ts(\mathbf{a})$ (time of observation) and that we always know the current point in time $t$.

Algorithm 2 describes the online alert aggregation. If a new alert is observed, we first have to decide whether a first component has to be created. In this case, we initialize its parameters with information taken from this alert. Random, small values are added, for example, to prevent any subsequent optimization steps from running into singularities of the respective likelihood function [3]. Otherwise, we have to decide whether the alert has to be associated with an existing component or not, i.e., whether we believe that it belongs to an ongoing attack instance or not. Provisionally, we assign the alert to the most likely component (E step) and optimize the parameters of this component (M step). For the reason of temporal efficiency, we do not conduct a sequence of E and M steps for the overall model. In some tests [19], it turned out that our main goal—not to miss any attack instances, see Section 3.1—can be achieved this way with substantially lower runtimes but at the cost of some redundant meta-alerts (due to splits of clusters, see Section 3.3).

The assignment of the alert to an existing component is not accepted in every case, but only if the quality of the model increases or does not decrease too much, e.g., not more than 15 percent (realized by means of a threshold $\theta$). Otherwise, we consider the possibility that a new attack instance may have been initiated, do not modify the model, and temporarily store the alert in a buffer until this belief is corroborated by additional observations.
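A compressed, structural view of this processing step could look as follows. This is our paraphrase, not the paper's Algorithm 2: all collaborators are passed in as parameters, and theta stands for the relative-degradation threshold mentioned above.

```python
def process_alert(alert, components, buffer, theta,
                  new_component, most_likely, adapt, undo, model_quality):
    """One step of the online aggregation loop (our paraphrase, not the paper's Algorithm 2).

    new_component(alert) builds a component from a single alert (with small random offsets
    to avoid likelihood singularities), most_likely picks the component with the highest
    hybrid likelihood, adapt/undo apply and revert a local M step for one component, and
    model_quality is the validation measure q of Section 3.3.
    """
    if not components:
        components.append(new_component(alert))       # very first component
        return
    q_before = model_quality(components)
    j = most_likely(alert, components)                 # provisional hard E step
    adapt(components[j], alert)                        # local M step for component j only
    if model_quality(components) >= (1.0 - theta) * q_before:
        return                                         # assignment accepted
    undo(components[j], alert)                         # reject: leave the model unchanged ...
    buffer.append(alert)                               # ... and buffer the alert (possible novelty)
```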

Novelty handling basically means that we empty the buffer by assigning its content either to existing components or to new components. This procedure is initiated either when the temporal spread of the buffer content is too large or when the content is no longer homogeneous in the sense that we assume that another new attack instance may have been initiated:

. Temporal spread: As the rate of incoming alerts depends on the current attack situation, it changes heavily over time, ranging from thousands of alerts per minute to only a few alerts per hour. Thus, to keep the response time short, we have to take into account the temporal spread of the buffer content. The spread is defined as the difference between the time stamps of the oldest and the most recent alert in the buffer.

. Homogeneity: The goal is to ensure that only alerts that are similar to each other are stored in the buffer. Thus, it is possible that the novelty handling conducts—for temporal performance reasons—only a local optimization of the model (see below). We regard the alerts in the buffer as being similar if they all have the same most likely component.

That is, to initiate the novelty handling for a given model $C$ and a buffer $B$, we use the function

$$\mathrm{novelty}(\mathbf{a}) = \begin{cases} \mathrm{true}, & \text{if } \Big( \max_{\mathbf{b} \in B} \{ ts(\mathbf{a}) - ts(\mathbf{b}) \} \geq \Delta \Big) \;\vee\; \Big( \operatorname*{arg\,max}_{j \in \{1, \ldots, |C|\}} \mathcal{H}\big(\mathbf{a} \mid \boldsymbol{\mu}_j, \boldsymbol{\sigma}_j^2, \boldsymbol{\delta}_j\big) \neq \operatorname*{arg\,max}_{j \in \{1, \ldots, |C|\}} \mathcal{H}\big(\mathbf{b} \mid \boldsymbol{\mu}_j, \boldsymbol{\sigma}_j^2, \boldsymbol{\delta}_j\big) \text{ for any } \mathbf{b} \in B \Big), \\ \mathrm{false}, & \text{otherwise}, \end{cases} \qquad (14)$$

with a user-defined threshold $\Delta$ to state novelty. Note that spread and homogeneity can both be checked with low temporal effort.
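The two triggers in (14) translate directly into a cheap per-alert check, sketched below; ts is the alert time stamp, delta_max plays the role of the temporal threshold $\Delta$, and most_likely is the same argmax-over-components helper as in the sketch above (all names are ours).

```python
def novelty(alert, buffer, components, delta_max, ts, most_likely):
    """Return True if the buffered alerts should be flushed into the novelty handling (cf. (14))."""
    if not buffer:
        return False
    # Trigger 1: the temporal spread of the buffer content exceeds the threshold.
    if max(ts(alert) - ts(b) for b in buffer) >= delta_max:
        return True
    # Trigger 2: the buffer is no longer homogeneous, i.e., not all buffered alerts
    # share the new alert's most likely component.
    j = most_likely(alert, components)
    return any(most_likely(b, components) != j for b in buffer)
```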

Algorithm 3 describes the novelty handling itself. Basically, to adapt the overall model, we run the offline aggregation algorithm several times with different possible component numbers to choose the optimal number. However, due to the homogeneity of the buffer, we may restrict the optimization to the alerts in the buffer and in one “neighbor” cluster on the one hand, and to a relatively small user-defined maximum number of components $K$ on the other, without violating our main goal [19]. The result of this local optimization is finally fused with the unmodified parts of the model.

In order to reduce the runtime of this algorithm further, we may reduce the number of alerts that have to be processed by means of an appropriate subsampling technique.
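Our reading of this local re-optimization (the paper's Algorithm 3 is not reproduced here) is sketched below: the offline algorithm is run on the buffer plus the alerts of the neighbor cluster for component counts up to K, and the candidate partition with the best validation measure replaces the neighbor cluster. run_offline_em, validation, and K are placeholders for the EM procedure, the measure q, and the user-defined maximum, respectively.

```python
def handle_novelty(buffer, neighbor_alerts, K, run_offline_em, validation):
    """Locally re-cluster the buffered alerts together with the alerts of one neighbor cluster.

    This is a paraphrase of the novelty handling, not the paper's exact Algorithm 3.
    run_offline_em(alerts, k) returns a partition into k clusters (e.g., the hard-EM
    sketch of Section 3.3); validation(partition) is the measure q of Section 3.3.
    """
    local_alerts = list(buffer) + list(neighbor_alerts)
    best_partition, best_q = None, float("-inf")
    for k in range(1, K + 1):                 # only a small number of choices for J is tested
        partition = run_offline_em(local_alerts, k)
        q = validation(partition)
        if q > best_q:
            best_partition, best_q = partition, q
    buffer.clear()
    return best_partition   # to be fused with the unmodified parts of the model by the caller
```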

To initiate the obsoleteness handling for a given model $C$ and component $C_j$, we use

$$\mathrm{obsoleteness}(C_j) = \begin{cases} \mathrm{true}, & \text{if } t \geq \max_{\mathbf{c} \in C_j} \{ ts(\mathbf{c}) \} + \tau, \\ \mathrm{false}, & \text{otherwise}, \end{cases} \qquad (15)$$

where $\tau$ is a convex combination of a multiple of the mean interarrival time between the alerts in the cluster and prior knowledge which is faded out when the number of alerts in this cluster gets larger. That is, we use a kind of estimate of the arrival time of the next alert belonging to the corresponding attack instance. If obsoleteness is stated, the corresponding component is simply deleted from the model.

Note that the mixing coefficients are always defined implicitly by the number of currently existing components.
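One possible realization of the obsoleteness test (15), with the timeout $\tau$ blended from the observed mean interarrival time and prior knowledge, is sketched below. The blending weight and the prior value are illustrative choices of ours, not values taken from the paper.

```python
def obsolete(component_timestamps, now, prior_timeout, multiple=3.0):
    """Return True if the component should be deleted (cf. (15)).

    tau is a convex combination of a multiple of the mean interarrival time of the
    cluster's alerts and a prior timeout; the prior is faded out as the cluster grows.
    """
    n = len(component_timestamps)
    if n >= 2:
        spans = sorted(component_timestamps)
        mean_gap = (spans[-1] - spans[0]) / (n - 1)   # mean interarrival time
    else:
        mean_gap = prior_timeout
    w = 1.0 / n                                       # prior weight, fading out with n (our choice)
    tau = w * prior_timeout + (1.0 - w) * multiple * mean_gap
    return now >= max(component_timestamps) + tau

# Example: a cluster whose last alert was seen long ago is declared obsolete.
print(obsolete([0.0, 2.0, 4.0, 6.0], now=120.0, prior_timeout=60.0))
```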

3.5 Meta-Alert Generation and Format

With the creation of a new component, an appropriate meta-alert that represents the information about the component in an abstract way is created. Every time a new alert is added to a component, the corresponding meta-alert is updated incrementally, too. That is, the meta-alert “evolves” with the component. Meta-alerts may be the basis for a whole set of further tasks (cf. Fig. 1):

. Sequences of meta-alerts may be investigated further in order to detect more complex attack scenarios (e.g., by means of hidden Markov models).

. Meta-alerts may be exchanged with other ID agents in order to detect distributed attacks such as one-to-many attacks.

. Based on the information stored in the meta-alerts, reports may be generated to inform a human security expert about the ongoing attack situation.

Meta-alerts could be used at various points in time from the initial creation until the deletion of the corresponding component (or even later). For instance, reports could be created immediately after the creation of the component or—which could be more preferable in some cases—a sequence of updated reports could be created in regular time intervals. Another example is the exchange of meta-alerts between ID agents: Due to high communication costs, meta-alerts could be exchanged based on the evaluation of their interestingness [37].

According to the task for which meta-alerts are used, they may contain different attributes. Examples for those attributes are aggregated alert attributes (e.g., lists or intervals of source addresses or targeted service ports, or a time interval that marks the beginning and the end—if available—of the attack instance), attributes extracted from the probabilistic model (e.g., the distribution parameters or the number of alerts assigned to the component), an aggregated alert assessment provided by the detection layer (e.g., the attack type classification or the classification confidence), and also information about the current attack situation (e.g., the number of recent attacks of the same or a similar type, links to attacks originating from the same or a similar source).
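One way to realize the incremental “evolution” of a meta-alert is to keep only aggregates that can be updated per alert in constant time; the attribute selection below (source set, port interval, time interval, counts) is an illustrative subset of the examples listed above, not a format prescribed by the paper.

```python
from dataclasses import dataclass, field

@dataclass
class MetaAlert:
    """Abstract description of one component, updated incrementally (illustrative attributes)."""
    sources: set = field(default_factory=set)       # observed source addresses
    port_min: int = 65535                            # interval of targeted service ports
    port_max: int = 0
    t_first: float = float("inf")                    # beginning of the attack instance
    t_last: float = float("-inf")                    # most recent alert (end, if instance is over)
    n_alerts: int = 0                                # number of alerts assigned to the component
    type_counts: dict = field(default_factory=dict)  # detector's attack type votes

    def update(self, src, dst_port, ts, attack_type):
        """Fold one newly assigned alert into the meta-alert."""
        self.sources.add(src)
        self.port_min = min(self.port_min, dst_port)
        self.port_max = max(self.port_max, dst_port)
        self.t_first = min(self.t_first, ts)
        self.t_last = max(self.t_last, ts)
        self.n_alerts += 1
        self.type_counts[attack_type] = self.type_counts.get(attack_type, 0) + 1

# Example: two alerts of one brute-force-like instance against port 22.
m = MetaAlert()
m.update("10.0.0.5", 22, ts=100.0, attack_type="brute-force")
m.update("10.0.0.5", 22, ts=101.5, attack_type="brute-force")
print(m.n_alerts, m.t_last - m.t_first)
```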

4 EXPERIMENTAL RESULTS

This section evaluates the new alert aggregation approach. We use three different data sets to demonstrate the feasibility of the proposed method: The first is the well-known DARPA intrusion detection evaluation data set [32], for the second we used real-life network traffic data collected at our university campus network, and the third contains firewall log messages from a commercial Internet service provider. All experiments were conducted on a PC with 2.20 GHz and 2 GB of RAM.

4.1 Description of the Benchmark Data Sets

4.1.1 DARPA Data

For the DARPA evaluation [32], several weeks of training and test data have been generated on a test bed that emulates a small government site. The network architecture as well as the generated network traffic has been designed to be similar to that of an Air Force base. We used the TCP/IP network dump as input data and analyzed all 104 TCP-based attack instances (corresponding to more than 20 attack types) that have been launched against the various target hosts. As sketched in Section 3.2, sensors extract statistical information from the network traffic data. At the detection layer, we apply SVM to classify the sensor events. By applying a varying threshold to the output of the classifier, a so-called receiver operating characteristics (ROC) curve can be created [38]. The ROC curve in Fig. 4 plots the true positive rate (TPR, number of true positives divided by the sum of true positives and false negatives) against the false positive rate (FPR, number of false positives divided by the sum of false positives and true negatives) for the trained SVM. Each point of the curve corresponds to a specific threshold. Four operating points (OP) are marked. OP 1 is the one with the smallest overall error, but as we want to realize a high recall, we also investigate three more operating points which exhibit higher TPR at the cost of an increased FPR. We will also investigate the aggregation under idealized conditions where we assume to have a perfect detector layer with no missing and no false alerts at all. As attributes for the alerts, we use the source and destination IP address, the source and destination port, the attack type, and the creation time differences (based on the creation time stamps).
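For reference, the two rates plotted in the ROC curve follow directly from the confusion counts; a minimal helper (the counts in the example call are made up for illustration):

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """True positive rate and false positive rate as used for the ROC curve."""
    tpr = tp / (tp + fn)   # detected attack events among all attack events
    fpr = fp / (fp + tn)   # false alerts among all benign events
    return tpr, fpr

print(rates(tp=950, fn=50, fp=30, tn=99970))   # e.g., TPR 0.95 at FPR 0.0003
```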

Table 1 shows the number of alerts produced for the different OP and also for the idealized condition, i.e., a perfect detection layer. In addition, the number of attack instances for which at least one alert is generated by the detector layer is also given. Note that we have 104 attack instances in the data set altogether. For OP 1, there are three attack instances for which not even a single alert is created, i.e., these instances are already missed at the detection layer. The three instances are missed because there are only a few training samples of the corresponding attack types in the data set, which results in a bad generalization performance of the SVM. Switching from OP 1 to OP 2, alerts are created for one more instance. OP 3 and OP 4 do not yield any further improvement.

We are aware of the various critique of the DARPA benchmark data (e.g., published in [39], [40]) and the limitations that emerge thereof. Despite all disadvantages, this data set is still frequently used—not least because it is a well-documented and publicly available data set that allows the comparison of results of different researchers. In order to achieve fair and realistic results, we carefully analyzed all known deficiencies and omitted features that could bias the detector classification performance. Besides, most of the deficiencies of this data set have an impact on the detector layer but only little influence on the alert aggregation results. A detailed discussion can be found in [19] and [41].

4.1.2 Campus Network Data

To assess the performance of our approach in more detail, we also conducted our own attack experiments. We launched several brute force password guessing attacks against the mail server (POP3) of our campus network and recorded the network traffic. The attack instances differed in origin, start time, duration, and password guessing rate. The attack schedule was designed to reflect situations which we regard as being difficult to recognize. In particular, we have

1. several concurrent attack instances (up to seven),
2. partially and completely overlapping attack instances,
3. several instances within a short time interval,
4. different attack instances from similar sources,
5. different attack durations, and
6. an attacker that changes his IP address during the attack.

In order to demonstrate that the proposed technique can also be used with a conventional signature-based detector, the captured traffic was analyzed by the open source IDS Snort [24], which detected all 17 attack instances that have been launched and produced 128,816 alerts (cf. Table 1). The alert format equals the one used for the SVM detector, i.e., the alerts exhibit the source and destination IP address, the source and destination port, the attack type, and creation time differences. Snort was configured to match our network topology and we turned off rudimental alert aggregation features. In order to achieve a high recall, we activated all available rule sets—the official rule sets as well as available community rules, which both are available at the Snort web page [42]. Activating all rules leads to a false alert rate of 0.33 percent. The FPR is based on the assumption that all alerts that are not classified with the attack type that we launched are false alerts. There is no missing alert rate given in Table 1 for this data set for two reasons: First, it cannot be guaranteed that there are no unknown attacks in the data set that were started by real attackers and, second, we do not know exactly how many alerts should be created by the attacks we launched.

Fig. 4. ROC curve for the SVM detector.

TABLE 1: Input of the Alert Aggregation Algorithm

4.1.3 Internet Service Provider Firewall Logs

The third data set used here differs from the previous ones as we actually do not have a detector layer that performs a classification and searches for known attacks. Instead, we use log messages that were generated by a CISCO PIX system, which is a well-known commercial FW and network address translation (NAT) device. Typical PIX log messages are DENY or ACCEPT messages—depending on whether the incoming or outgoing network packets or connections are forbidden or allowed according to the firewall rule set. Here, the alerts consist of the source and destination IP address, the source and destination port, the creation time differences, and the PIX message type. Details on the log message format can be found in [43]. The PIX was installed in the network of an Internet service provider within a real-life environment. During the six hours of collection time, 4,898 log messages—we also use the term alerts in the following—were created. As the data were not collected within a controlled simulation environment, we do not have any information whether attacks occurred or not, and specifications of TPR and FPR are not possible. Nonetheless, this data set can be used to demonstrate the broad applicability of the proposed online alert aggregation technique.

4.2 Performance Measures

In order to assess the performance of the alert aggregation, we evaluate the following measures:

Percentage of detected instances (p). We regard an attack instance as being detected if there is at least one meta-alert that predominantly contains alerts of that particular instance. The percentage of detected attack instances p can thus be determined by dividing the number of instances that are detected by the total number of instances in the data set. The measure is computed with respect to the instances covered by the output of the detection layer, i.e., instances missed by the detectors are not considered.

Number of meta-alerts (MA) and reduction rate (r). The number of meta-alerts (MA) is further divided into the number of attack meta-alerts $MA_{\text{attack}}$, which predominantly contain true alerts, and the number of nonattack meta-alerts $MA_{\text{nonattack}}$, which predominantly contain false alerts. The reduction rate r is 1 minus the number of created meta-alerts MA divided by the total number of alerts N.

Average runtime ($t_{\text{avg}}$) and worst case runtime ($t_{\text{worst}}$). The average runtime is measured in milliseconds per alert. Assuming up to several hundred thousand alerts a day, $t_{\text{avg}}$ should stay clearly below 100 ms per alert. The worst case runtime $t_{\text{worst}}$, which is measured in seconds, states how long it takes at most to execute the while loop of Algorithm 2, which may include the execution of Algorithms 3 and 1 (the latter several times).

Meta-alert creation delay (d). It is obvious that there is a certain delay until a meta-alert is created for a new attack instance. The meta-alert creation delay d measures the delay between the actual beginning of the instance (i.e., the creation time of the first alert) and the creation of the first meta-alert for that instance. We investigate how many seconds the algorithm needs to create 90 percent ($d_{90\%}$), 95 percent ($d_{95\%}$), and 100 percent ($d_{100\%}$) of the meta-alerts.
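These measures can be computed directly from the final assignment of alerts to meta-alerts; a small sketch (the definitions follow the text above, the variable names are ours):

```python
def detection_percentage(detected_instances: int, instances_at_detection_layer: int) -> float:
    """p: share of instances covered by at least one predominantly correct meta-alert."""
    return 100.0 * detected_instances / instances_at_detection_layer

def reduction_rate(num_meta_alerts: int, num_alerts: int) -> float:
    """r = 1 - MA / N."""
    return 1.0 - num_meta_alerts / num_alerts

def creation_delays(first_alert_ts: dict, first_meta_alert_ts: dict) -> list[float]:
    """d per instance: delay between the first alert and the first meta-alert of that instance."""
    return sorted(first_meta_alert_ts[i] - first_alert_ts[i] for i in first_meta_alert_ts)

# Example: the idealized DARPA case reported below (324 meta-alerts for about 1.6 million alerts).
print(f"r = {reduction_rate(324, 1_600_000):.4%}")
```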

4.3 Results

In the following, the results for the alert aggregation are presented. For all experiments, the same parameter settings are used. We set the threshold $\theta$ that decides whether to add a new alert to an existing component or not to five percent, and the value of the threshold $\Delta$ that specifies the allowed temporal spread of the alert buffer to 180 seconds. $\theta$ was set to that low value in order to ensure that even a quite small degradation of the cluster quality, which could indicate a new attack instance, results in a new component. A small value of $\theta$, of course, results in more components and, thus, in a lower reduction rate, but it also reduces the risk of missing attack instances. The parameter $\Delta$, which is used in the novelty assessment function, controls the maximum time that new alerts are allowed to reside in the buffer $B$. In order to keep the response time short, we set it to 180 s which we think is a reasonable value. For both parameters, there were large intervals in which parameter values could be chosen without deteriorating the results. A detailed analysis and discussion of the effects of different parameter settings can be found in [19].

4.3.1 DARPA Data

Results for the DARPA data set are given in Table 2. First of all, it must be stated that there is an operation point of the SVM at the detection layer (OP 1) where we do not miss any attack instances at all (at least, none in addition to those already missed at the detection layer). The reduction rate of 99.87 percent is extremely high, and the detection delay is only 5.41 s in the worst case (d100%). Average and worst case runtimes are very good, too.

All OP will now be analyzed in much more detail.

All attack instances for which the detector produces at least a single alert are detected in the idealized case and with OP 1 and OP 2. Choosing another OP, the rate of detected instances drops to 98.04 percent (OP 3) and 99.02 percent (OP 4). In OP 3, a FORMAT instance and a MULTIHOP instance are missed. In OP 4, only the FORMAT instance could not be detected. A further analysis identified the following reasons:

. The main reason in the case of the FORMAT instance is the small number of only four alerts. Those alerts are created by the detector layer for all OP, i.e., there is obviously no benefit from choosing an OP with a higher FPR. By increasing the FPR, the true FORMAT alerts are erroneously merged with false alerts into one cluster. Hence, as the false alerts easily outnumber the four true FORMAT alerts within this cluster, the FORMAT instance gets lost.

. For the MULTIHOP instance, for which we have 19 alerts, the situation is more complex. The instance is only missed in OP 3 and not in OP 4. In OP 3, the downside of a higher FPR outweighs the benefit of a higher TPR—the MULTIHOP alerts are merged with a large number of false alerts. Further increasing the FPR (OP 4) leads to more false alerts as well, but, in this case, also to a further split of clusters such that the false alerts and the MULTIHOP alerts are placed into separate clusters.

Next, we analyze the number of meta-alerts MA and the reduction rate r. In the idealized case, 324 meta-alerts are created. Compared to the roughly 1.6 million alerts, we get a reduction rate of 99.98 percent, which is a reduction of more than three orders of magnitude. Unfortunately, with the exception of the first of the seven weeks, it was not possible to achieve the ideal case with exactly one meta-alert for every attack instance. Basically, there are four reasons:

Distinguishable steps of an attack type. Often, a split of attack instances into more meta-alerts is caused by the nature of the attacks themselves. Actually, many attack types consist of different, clearly distinguishable steps. As an example, the FTP-WRITE attack exhibits three such steps: an FTP login on port 21, an FTP data transfer on port 20, and a remote login on port 513. Thus, a split into three related meta-alerts is quite natural. Subsequent tasks at the alert processing layer are supposed to handle such multistep attack scenarios (cf. Fig. 1).

Several independent attackers. In the DARPA data set, some attack instances are labeled as a single attack instance although they in fact comprise the actions of several independent attackers. A typical example is a WAREZCLIENT instance in week four: There, attackers download illegally provided software from a compromised FTP server. As there is no cooperation between the attackers as in the case of a coordinated denial of service attack, for example, an individual meta-alert is created for every attacker.

Long attack duration. Attack instances with a long duration are often split into several meta-alerts. Typical examples are slow or hidden port scans or (distributed) denial of service attacks, which can last several hours.

Bidirectional communication. TCP/IP-based communication between two hosts results in packets transmitted in both directions. If the detector layer produces alerts for both directions (e.g., due to malicious packets), the source and destination IP address are swapped, which in the end results in two meta-alerts. This problem could be solved with an appropriate preprocessing step.
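One conceivable preprocessing step, sketched here in Python (the field names src_ip, src_port, dst_ip, and dst_port are illustrative and not taken from the paper), maps both directions of a connection to the same canonical endpoint order before clustering:

def canonicalize_endpoints(alert):
    """Return a copy of the alert with its two endpoints in canonical order.

    Sketch only: 'alert' is assumed to be a dict with the illustrative keys
    'src_ip', 'src_port', 'dst_ip', and 'dst_port'.
    """
    src = (alert["src_ip"], alert["src_port"])
    dst = (alert["dst_ip"], alert["dst_port"])
    # Order the endpoints lexicographically so that alerts for both
    # directions of one connection share identical attribute values.
    first, second = sorted([src, dst])
    normalized = dict(alert)
    normalized["src_ip"], normalized["src_port"] = first
    normalized["dst_ip"], normalized["dst_port"] = second
    return normalized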

With an increasing FPR, the number of meta-alerts also increases, from 1,976 meta-alerts in OP 1 to 8,588 meta-alerts in OP 4. Even though the number of meta-alerts increases markedly, we still have a reduction rate of 99.46 percent in OP 4, which would heavily reduce the—possibly manual—effort of a human security expert who analyzes this flood of information. It is also interesting to see that, although the overall number of meta-alerts MA increases, the number of attack clusters MAattack remains roughly constant. The majority of the additionally created meta-alerts contains only false alerts.
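As a quick plausibility check of this value (taking N to be the roughly 1.6 million alerts of the DARPA set, as stated above):

\[
r = 1 - \frac{MA}{N} \approx 1 - \frac{8588}{1.6 \cdot 10^{6}} \approx 0.9946,
\]

i.e., about 99.46 percent, matching the value reported in Table 2.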

With more meta-alerts, the runtime increases. Even in OP 4, with an average runtime of 0.97 ms per alert, the proposed technique is still very efficient. Fig. 5 shows the component creation delays for the four operating points. The figure depicts the percentage of attack instances for which a meta-alert was created after a time not exceeding the value specified at the x-axis. It can be seen that the creation delay is within the range of a few seconds and, thus, meets the requirements for an online application. Table 2 displays the times after which meta-alerts were created for 90, 95, and 100 percent of the attack instances. Note that these values correspond to points on the curve in Fig. 5. Interestingly, in the idealized case, the delay is much higher, which can be explained by the novelty detection mechanism (cf. (14)). In the case of a more homogeneous alert stream, the novelty handling is mainly influenced by the temporal spread, whereas in the case of a heterogeneous alert stream due to false alerts, the novelty handling is started more often, which results in lower component creation delay times.

TABLE 2
Results of the Online Alert Aggregation for Three Benchmark Data Sets

Fig. 5. Cumulative creation delays for meta-alerts.

4.3.2 Campus Network Data

For the campus network data, for which the IDS Snort was used to create alerts, quite similar results could be achieved (see Table 2). All attack instances that have been launched were correctly detected. For the 17 attack instances with 128,816 alerts, 52 meta-alerts were created, which is equivalent to a reduction rate of 99.96 percent. Again, the majority of meta-alerts is caused by false alerts. We have 20 attack meta-alerts and 32 nonattack meta-alerts. For three of the 17 attack instances, a redundant meta-alert was created. The reasons are similar to the ones described for the DARPA data set. It is worth mentioning that for the attack instance where the attacker changed his IP address during the attack, only one meta-alert was created. The results for the runtime as well as for the cluster creation delay are similar to those of the DARPA data, too.

4.3.3 Internet Service Provider Firewall Logs

For the firewall log data, the proposed alert aggregation could also be applied successfully. As Table 2 shows, 56 meta-alerts were created for the 4,898 alerts, which is a reduction rate of 98.86 percent. As it is not possible to specify a percentage of detected attack instances, we analyzed the content of the 56 resulting meta-alerts: In many cases, it is possible to find a particular reason for the meta-alerts, e.g., we identified meta-alerts that subsume all alerts that are caused by obviously forbidden accesses from hosts within the DMZ to an external printer using the Printer PDL Datastream Port (HP JetDirect, port 9100). Another example is meta-alerts that subsume alerts for illegal network time protocol (NTP, port 123) accesses from hosts outside the DMZ. The average runtime turns out to be somewhat higher, at 1.53 ms per alert, but, interestingly, the worst case runtime tworst is, at 0.27 s, much smaller than for the previous data sets. This effect might be explained by the more heterogeneous alert stream in the case of the firewall log data, which results in an increased number of calls of Algorithm 3 due to the novelty handling: On the one hand, the more heterogeneous the alert stream is, the more often the (time expensive) Algorithm 3 is called, which increases the overall runtime. On the other hand, due to the increased number of calls of Algorithm 3, every call is conducted on an alert buffer that contains only a few new alerts, which reduces the runtime of a single call of Algorithm 3—and, thus, the worst case runtime.

4.4 Conclusion

The experiments demonstrated the broad applicability of the proposed online alert aggregation approach. We analyzed three different data sets and showed that machine-learning-based detectors, conventional signature-based detectors, and even firewalls can be used as alert generators. In all cases, the amount of data could be reduced substantially. Although there are situations as described in Section 3.3—especially clusters that are wrongly split—the instance detection rate is very high: None or only very few attack instances were missed. Runtime and component creation delay are well suited for an online application.

5 SUMMARY AND OUTLOOK

We presented a novel technique for online alert aggregation and generation of meta-alerts. We have shown that the sheer amount of data that must be reported to a human security expert or communicated within a distributed intrusion detection system, for instance, can be reduced significantly. The reduction rate with respect to the number of alerts was up to 99.96 percent in our experiments. At the same time, the number of missing attack instances is extremely low or even zero in some of our experiments, and the delay for the detection of attack instances is within the range of only a few seconds.

In the future, we will develop techniques for interestingness-based communication strategies for a distributed IDS. This IDS will be based on organic computing principles [44]. In addition, we will investigate how human domain knowledge can be used to improve the detection processes further. We will also apply our techniques to benchmark data that fuse information from heterogeneous sources (e.g., combining host and network-based detection).

ACKNOWLEDGMENTS

This work was partly supported by the German Research Foundation (DFG) under grant number SI 674/3-2. The authors would like to thank D. Fisch for his support in preparing one of the data sets. The authors highly appreciate the suggestions of the anonymous reviewers that helped them to improve the quality of the article.

REFERENCES

[1] S. Axelsson, "Intrusion Detection Systems: A Survey and Taxonomy," Technical Report 99-15, Dept. of Computer Eng., Chalmers Univ. of Technology, 2000.

[2] M.R. Endsley, "Theoretical Underpinnings of Situation Awareness: A Critical Review," Situation Awareness Analysis and Measurement, M.R. Endsley and D.J. Garland, eds., chapter 1, pp. 3-32, Lawrence Erlbaum Assoc., 2000.

[3] C.M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[4] M.R. Henzinger, P. Raghavan, and S. Rajagopalan, Computing on Data Streams. Am. Math. Soc., 1999.

[5] A. Allen, "Intrusion Detection Systems: Perspective," Technical Report DPRO-95367, Gartner, Inc., 2003.

[6] F. Valeur, G. Vigna, C. Krugel, and R.A. Kemmerer, "A Comprehensive Approach to Intrusion Detection Alert Correlation," IEEE Trans. Dependable and Secure Computing, vol. 1, no. 3, pp. 146-169, July-Sept. 2004.

[7] H. Debar and A. Wespi, "Aggregation and Correlation of Intrusion-Detection Alerts," Recent Advances in Intrusion Detection, W. Lee, L. Me, and A. Wespi, eds., pp. 85-103, Springer, 2001.

[8] D. Li, Z. Li, and J. Ma, "Processing Intrusion Detection Alerts in Large-Scale Network," Proc. Int'l Symp. Electronic Commerce and Security, pp. 545-548, 2008.

[9] F. Cuppens, "Managing Alerts in a Multi-Intrusion Detection Environment," Proc. 17th Ann. Computer Security Applications Conf. (ACSAC '01), pp. 22-31, 2001.

[10] A. Valdes and K. Skinner, "Probabilistic Alert Correlation," Recent Advances in Intrusion Detection, W. Lee, L. Me, and A. Wespi, eds., pp. 54-68, Springer, 2001.

[11] K. Julisch, "Using Root Cause Analysis to Handle Intrusion Detection Alarms," PhD dissertation, Universität Dortmund, 2003.


[12] T. Pietraszek, "Alert Classification to Reduce False Positives in Intrusion Detection," PhD dissertation, Universität Freiburg, 2006.

[13] F. Autrel and F. Cuppens, "Using an Intrusion Detection Alert Similarity Operator to Aggregate and Fuse Alerts," Proc. Fourth Conf. Security and Network Architectures, pp. 312-322, 2005.

[14] G. Giacinto, R. Perdisci, and F. Roli, "Alarm Clustering for Intrusion Detection Systems in Computer Networks," Machine Learning and Data Mining in Pattern Recognition, P. Perner and A. Imiya, eds., pp. 184-193, Springer, 2005.

[15] O. Dain and R. Cunningham, "Fusing a Heterogeneous Alert Stream into Scenarios," Proc. 2001 ACM Workshop Data Mining for Security Applications, pp. 1-13, 2001.

[16] P. Ning, Y. Cui, D.S. Reeves, and D. Xu, "Techniques and Tools for Analyzing Intrusion Alerts," ACM Trans. Information and System Security, vol. 7, no. 2, pp. 274-318, 2004.

[17] F. Cuppens and R. Ortalo, "LAMBDA: A Language to Model a Database for Detection of Attacks," Recent Advances in Intrusion Detection, H. Debar, L. Me, and S.F. Wu, eds., pp. 197-216, Springer, 2000.

[18] S.T. Eckmann, G. Vigna, and R.A. Kemmerer, "STATL: An Attack Language for State-Based Intrusion Detection," J. Computer Security, vol. 10, nos. 1/2, pp. 71-103, 2002.

[19] A. Hofmann, "Alarmaggregation und Interessantheitsbewertung in einem dezentralisierten Angriffserkennungsystem," PhD dissertation, Universität Passau, under review.

[20] M.S. Shin, H. Moon, K.H. Ryu, K. Kim, and J. Kim, "Applying Data Mining Techniques to Analyze Alert Data," Web Technologies and Applications, X. Zhou, Y. Zhang, and M.E. Orlowska, eds., pp. 193-200, Springer, 2003.

[21] J. Song, H. Ohba, H. Takakura, Y. Okabe, K. Ohira, and Y. Kwon, "A Comprehensive Approach to Detect Unknown Attacks via Intrusion Detection Alerts," Advances in Computer Science—ASIAN 2007, Computer and Network Security, I. Cervesato, ed., pp. 247-253, Springer, 2008.

[22] R. Smith, N. Japkowicz, M. Dondo, and P. Mason, "Using Unsupervised Learning for Network Alert Correlation," Advances in Artificial Intelligence, R. Goebel, J. Siekmann, and W. Wahlster, eds., pp. 308-319, Springer, 2008.

[23] A. Hofmann, D. Fisch, and B. Sick, "Identifying Attack Instances by Alert Clustering," Proc. IEEE Three-Rivers Workshop Soft Computing in Industrial Applications (SMCia '07), pp. 25-31, 2007.

[24] M. Roesch, "Snort—Lightweight Intrusion Detection for Networks," Proc. 13th USENIX Conf. System Administration (LISA '99), pp. 229-238, 1999.

[25] O. Buchtala, W. Grass, A. Hofmann, and B. Sick, "A Distributed Intrusion Detection Architecture with Organic Behavior," Proc. First CRIS Int'l Workshop Critical Information Infrastructures (CIIW '05), pp. 47-56, 2005.

[26] D. Fisch, A. Hofmann, V. Hornik, I. Dedinski, and B. Sick, "A Framework for Large-Scale Simulation of Collaborative Intrusion Detection," Proc. IEEE Conf. Soft Computing in Industrial Applications (SMCia '08), pp. 125-130, 2008.

[27] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. Wiley Interscience, 2001.

[28] IANA, "Port Numbers," http://www.iana.org/assignments/port-numbers, May 2009.

[29] Y. Rekhter, B. Moskowitz, D. Karrenberg, and G. de Groot, "RFC 1597—Address Allocation for Private Internets," http://www.faqs.org/rfcs/rfc1597.html, Mar. 1994.

[30] J. Postel, "RFC 790—Assigned Numbers," http://www.faqs.org/rfcs/rfc790.html, Sept. 1981.

[31] O. Buchtala, A. Hofmann, and B. Sick, "Fast and Efficient Training of RBF Networks," Artificial Neural Networks and Neural Information Processing—ICANN/ICONIP 2003, O. Kaynak, E. Alpaydin, E. Oja, and L. Xu, eds., pp. 43-51, Springer, 2003.

[32] R.P. Lippmann, D.J. Fried, I. Graf, J.W. Haines, K.R. Kendall, D. McClung, D. Weber, S.E. Webster, D. Wyschogrod, R.K. Cunningham, and M.A. Zissman, "Evaluating Intrusion Detection Systems: The 1998 DARPA Offline Intrusion Detection Evaluation," Proc. DARPA Information Survivability Conf. and Exposition (DISCEX), vol. 2, pp. 12-26, 2000.

[33] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "On Clustering Validation Techniques," J. Intelligent Information Systems, vol. 17, nos. 2/3, pp. 107-145, 2001.

[34] J.C. Dunn, "Well Separated Clusters and Optimal Fuzzy Partitions," J. Cybernetics, vol. 4, pp. 95-104, 1974.

[35] D.L. Davies and D.W. Bouldin, "A Cluster Separation Measure," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224-227, Apr. 1979.

[36] M. Halkidi and M. Vazirgiannis, "Clustering Validity Assessment Using Multi Representatives," Proc. SETN Conf., vol. 2, pp. 237-249, 2002.

[37] A. Hofmann, I. Dedinski, B. Sick, and H. de Meer, "A Novelty-Driven Approach to Intrusion Alert Correlation Based on Distributed Hash Tables," Proc. 12th IEEE Symp. Computers and Comm. (ISCC '07), pp. 71-78, 2007.

[38] F. Provost and T. Fawcett, "Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining (KDD '97), pp. 43-48, 1997.

[39] J. McHugh, "Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory," ACM Trans. Information and System Security, vol. 3, no. 4, pp. 262-294, 2000.

[40] M.V. Mahoney and P.K. Chan, "An Analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for Network Anomaly Detection," Recent Advances in Intrusion Detection, G. Vigna, E. Jonsson, and C. Krugel, eds., pp. 220-237, Springer, 2003.

[41] A. Hofmann, D. Fisch, and B. Sick, "Improving Intrusion Detection Training Data by Network Traffic Variation," Proc. IEEE Three-Rivers Workshop Soft Computing in Industrial Applications, pp. 25-31, 2007.

[42] Sourcefire, Inc., http://www.snort.org/, 2009.

[43] CISCO Systems, Inc., "Cisco PIX Firewall System Log Messages, Version 6.3," http://www.cisco.com/en/US/docs/security/pix/pix63/system/message/pixemsgs.html, 2009.

[44] Organic Computing, R.P. Würtz, ed., Springer, 2008.

Alexander Hofmann received the diploma (master's degree) in computer science in 2002 from the University of Passau, Germany. He worked toward the PhD degree as a research assistant in the area of computer security as a member of the research group "Computationally Intelligent Systems" chaired by Bernhard Sick. Currently, he is employed as a senior software engineer with Capgemini sd&m AG. He has published several articles in the areas of intrusion detection, data mining, neural networks, and peer-to-peer technologies. He is a member of the IEEE and the ACM.

Bernhard Sick received the diploma (1992) and PhD (1999) degrees in computer science from the University of Passau, Germany. Currently, he is an assistant professor (Privatdozent) at the Faculty of Computer Science and Mathematics of the University of Passau, Germany. There, he is conducting research in the areas of machine learning, organic computing, collaborative data mining, intrusion detection, and biometrics. He authored more than 70 peer-reviewed publications in these areas. He is a member of the IEEE and an associate editor of the IEEE Transactions on Systems, Man, and Cybernetics, Part B. He holds one patent and received several thesis, best paper, teaching, and inventor awards.

