[lecture notes in computer science] computational intelligence volume 4114 || intrusion detection...

D.-S. Huang, K. Li, and G.W. Irwin (Eds.): ICIC 2006, LNAI 4114, pp. 724 – 729, 2006. © Springer-Verlag Berlin Heidelberg 2006

Intrusion Detection Based on Data Mining*

Jian Yin, Fang Mei, and Gang Zhang

Department of Computer Science, Sun Yat-Sen University, Guangzhou 510275, China [email protected]

Abstract. Many traditional algorithms detect intrusions by comparing a single metric, computed from multiple events, with a fixed threshold. In this paper we present a metric-vector-based detection algorithm and introduce a sample distance defined for both discrete and continuous data, so that the algorithm also applies to heterogeneous datasets. Experiments on the MIT Lincoln Lab data show that the proposed algorithm is effective and efficient.

1 Introduction

Data mining methods have recently become widely used in intrusion detection, especially in the analysis of host audit data. Well-known probabilistic techniques include decision trees, Hotelling's T² test, the chi-square test, and first-order and high-order Markov models [1],[2]. These algorithms mark anomalous states on the basis of selected data features, but most of them reduce multiple events to a single metric and detect intrusions by comparing that metric with a fixed threshold. Other research has focused on redefining the distance function, for example the Minkowski or Euclidean distance, but such functions do not handle data that mix discrete and continuous attributes.

This paper focuses on representing multiple events as a vector for intrusion detection within a formal model. We first propose an algorithm for generating the multi-event vector and a definition based on the heterogeneous distance function [3]; we then generalize the algorithm to heterogeneous datasets; finally, we present the corresponding detection algorithm and compare it with traditional algorithms.

2 Multi Event-Based Vector on Heterogeneous Dataset

To describe a single event, we only need a variable X that gives the type of the event occurring at a given moment. If there are n types of events, this variable can take n values. To describe the frequency property of multiple events, we use an array of random variables, (X1, X2, ..., Xn), which represents the frequencies of the n different event types in a given sequence of events.

* This work is supported by the National Natural Science Foundation of China (60573097), Natural Science Foundation of Guangdong Province (05200302, 04300462), Research Foundation of National Science and Technology Plan Project (2004BA721A02), Research Foundation of Science and Technology Plan Project in Guangdong Province (2005B10101032) and Research Foundation of Disciplines Leading to Doctorate degree of Chinese Universities (20050558017).


Many algorithms perform well on homogeneous datasets but poorly on heterogeneous ones. This paper generalizes the traditional algorithm on the basis of the research results of Bernhard Scholkopf [4],[5],[6], combined with the heterogeneous distance function [6] proposed by D. Randall Wilson. We first define the distance for both the discrete and the continuous attributes of a heterogeneous dataset.

Definition 1. Normalized distance: Let x and y be two data points with continuous attributes in a heterogeneous dataset X, and let $x_a$ and $y_a$ be their values of the a-th attribute. The normalized distance between the two points on the a-th attribute is

$$\mathrm{normalized\_diff}_a(x, y) = \frac{|x_a - y_a|}{4\sigma_a},$$

where $\sigma_a$ is the standard deviation of the a-th attribute over the dataset.
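As a minimal sketch of Definition 1, the following Python function computes the distance on one continuous attribute. It assumes that σa is the population standard deviation of the attribute; the function name, the sample values and the use of `statistics.pstdev` are illustrative choices, not part of the original algorithm description.

```python
import statistics

def normalized_diff(x_a, y_a, sigma_a):
    """Distance between two continuous values of attribute a, scaled by 4 * sigma_a."""
    if sigma_a <= 0:
        raise ValueError("sigma_a must be positive")
    return abs(x_a - y_a) / (4.0 * sigma_a)

# Hypothetical values of one continuous attribute across the dataset.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sigma = statistics.pstdev(values)  # population standard deviation of the sample

# Distance between the largest and smallest value: |9 - 2| / (4 * sigma).
print(normalized_diff(9.0, 2.0, sigma))
```

Dividing by 4σa means that values lying within two standard deviations on either side of the mean differ by at most 1, which keeps continuous attributes on a scale comparable to the discrete ones.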

Definition 2. VDM (Value Difference Metric): Let x and y be two data points with discrete attributes in a heterogeneous dataset X, and let $x_a$ and $y_a$ be their values of the a-th attribute. The VDM distance between the two points on the a-th attribute is

$$\mathrm{normalized\_vdm}_a(x, y) = \sum_{c=1}^{C} \left( \frac{N_{a,x,c}}{N_{a,x}} - \frac{N_{a,y,c}}{N_{a,y}} \right)^{2},$$

where $N_{a,x}$ is the number of data points in the whole dataset X whose a-th attribute has the value $x_a$, $N_{a,x,c}$ is the number of those data points whose output type is c, and C is the number of output types in the dataset.
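The counts in Definition 2 can be collected in a single pass over the data. The sketch below is one possible reading of the definition, assuming both attribute values occur at least once in the dataset; the function names, the toy event types and the class labels are hypothetical.

```python
from collections import Counter

def vdm_tables(values, labels):
    """Count N_{a,x} (per value) and N_{a,x,c} (per value and class) for one attribute."""
    n_x = Counter(values)
    n_xc = Counter(zip(values, labels))
    return n_x, n_xc

def normalized_vdm(x_a, y_a, n_x, n_xc, classes):
    """Sum over classes of the squared difference of class-conditional frequencies."""
    if n_x[x_a] == 0 or n_x[y_a] == 0:
        raise ValueError("both values must occur in the dataset")
    total = 0.0
    for c in classes:
        p_x = n_xc[(x_a, c)] / n_x[x_a]
        p_y = n_xc[(y_a, c)] / n_x[y_a]
        total += (p_x - p_y) ** 2
    return total

# Hypothetical toy data: attribute value = event type, label = output type.
values = ["login", "login", "login", "exec", "exec", "exec", "exec"]
labels = ["normal", "normal", "attack", "attack", "attack", "attack", "normal"]
n_x, n_xc = vdm_tables(values, labels)
print(normalized_vdm("login", "exec", n_x, n_xc, ["normal", "attack"]))
```

Two discrete values are close under this metric when they are associated with the output types in similar proportions, regardless of how the values themselves are named or ordered.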

3 Mining Intrusion Data

3.1 Multi-event-Based Vector Generation

We use the vector X = (X1, X2, ..., Xn) to represent the n event types of the original host audit dataset. In the chi-square multivariate test, the detection value is calculated as

$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - \mathrm{Avg}(x_i))^2}{\mathrm{Avg}(x_i)}.$$

This test considers only the mean vector of X, so detection relies solely on a mean shift in one or more of the n variables. The formula also shows that the metrics calculated from the multiple events in the event vector are summarized into a single value, which hides the detailed information carried by individual events. To preserve these details, we propose a new algorithm that computes a vector of per-event metrics. Its output is a vector based on the input mean vector; in other words, we expand the summation into a vector. Each per-event term is calculated in the same way as in the chi-square algorithm. The vector is:

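$$\left[ \frac{(x_1 - \mathrm{Avg}(x_1))^2}{\mathrm{Avg}(x_1)},\ \frac{(x_2 - \mathrm{Avg}(x_2))^2}{\mathrm{Avg}(x_2)},\ \ldots,\ \frac{(x_n - \mathrm{Avg}(x_n))^2}{\mathrm{Avg}(x_n)} \right]. \tag{1}$$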

Each term of this vector corresponds to one event type in the host audit dataset. With this vector, we obtain the characteristics of the audit data for each event type rather than a single summarized value.
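The difference between the summarized chi-square value and the per-event vector (1) can be illustrated with a short Python sketch. The function names and the sample frequencies are hypothetical; the sketch assumes every mean Avg(xi) is positive.

```python
def chi_square_terms(x, avg):
    """Per-event terms (x_i - Avg(x_i))^2 / Avg(x_i), as in vector (1)."""
    if len(x) != len(avg):
        raise ValueError("x and avg must have the same length")
    return [(xi - ai) ** 2 / ai for xi, ai in zip(x, avg)]

def chi_square_statistic(x, avg):
    """Single summarized value: the sum of the per-event terms."""
    return sum(chi_square_terms(x, avg))

avg = [10.0, 5.0, 2.0]       # hypothetical mean frequencies learned from training data
observed = [12.0, 5.0, 8.0]  # hypothetical frequencies in one test window

print(chi_square_terms(observed, avg))      # one value per event type
print(chi_square_statistic(observed, avg))  # the classical single detection value
```

In this toy window the third event type accounts for almost all of the deviation, a fact that is visible in the vector but lost once the terms are summed.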


3.2 The Generalized Vector on Heterogeneous Dataset

Handling heterogeneous data in the sample dataset has always been a difficulty in intrusion detection. D. Randall Wilson proposed a more efficient distance function, HVDM (Heterogeneous Value Difference Metric) [6], which reflects the different impact of different attributes on the detection result and measures the difference between data points effectively.

Given x, y ∈ X, the HVDM distance between x and y is defined as

$$H(x, y) = \sqrt{\sum_{a=1}^{m} \left[ d_a(x, y) \right]^2},$$

where m is the number of attributes. If the a-th attribute is continuous, then $d_a(x, y) = \mathrm{normalized\_diff}_a(x, y)$. If the a-th attribute is discrete, then $d_a(x, y) = \mathrm{normalized\_vdm}_a(x, y)$.

Based on the above, we generalize formula (1) into formula (2) as follows:

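$$\left[ \frac{(H(x_1 - \mathrm{Avg}(x_1)))^2}{\mathrm{Avg}(x_1)},\ \frac{(H(x_2 - \mathrm{Avg}(x_2)))^2}{\mathrm{Avg}(x_2)},\ \ldots,\ \frac{(H(x_n - \mathrm{Avg}(x_n)))^2}{\mathrm{Avg}(x_n)} \right]. \tag{2}$$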
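The HVDM distance H used in formula (2) combines the two per-attribute distances of Definitions 1 and 2. The following Python sketch shows one way to compute it, under several assumptions: continuous attributes use the normalized difference, discrete attributes use the VDM, the square root follows Wilson's HVDM definition, and every discrete value occurs in the dataset. The record layout, parameter names and sample data are hypothetical.

```python
import math
from collections import Counter

def hvdm(x, y, kinds, sigmas, n_x, n_xc, classes):
    """HVDM distance between two records x and y (sequences of attribute values).

    kinds[a]        -- "continuous" or "discrete" for attribute a
    sigmas[a]       -- standard deviation of continuous attribute a
    n_x[a], n_xc[a] -- Counters of value and (value, class) occurrences
                       for discrete attribute a
    """
    total = 0.0
    for a, kind in enumerate(kinds):
        if kind == "continuous":
            d = abs(x[a] - y[a]) / (4.0 * sigmas[a])
        else:
            d = sum(
                (n_xc[a][(x[a], c)] / n_x[a][x[a]]
                 - n_xc[a][(y[a], c)] / n_x[a][y[a]]) ** 2
                for c in classes
            )
        total += d ** 2
    return math.sqrt(total)

# Hypothetical records: (duration, event type), each with an output type label.
records = [(1.0, "login"), (2.0, "login"), (3.0, "login"),
           (9.0, "exec"), (8.0, "exec"), (7.0, "exec"), (5.0, "exec")]
labels = ["normal", "normal", "attack", "attack", "attack", "attack", "normal"]

kinds = ["continuous", "discrete"]
durations = [r[0] for r in records]
mean = sum(durations) / len(durations)
sigmas = {0: math.sqrt(sum((v - mean) ** 2 for v in durations) / len(durations))}
events = [r[1] for r in records]
n_x = {1: Counter(events)}
n_xc = {1: Counter(zip(events, labels))}

print(hvdm(records[0], records[3], kinds, sigmas, n_x, n_xc, ["normal", "attack"]))
```

Because each per-attribute distance is normalized before it is squared and summed, no single attribute dominates the result merely because of its unit or value range.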

To compare these terms with a single threshold, a normalization step is necessary. We use the geometric mean to normalize the vector. Let

$$\mathrm{pAvg} = \sqrt[n]{\mathrm{Avg}(x_1)\,\mathrm{Avg}(x_2) \cdots \mathrm{Avg}(x_n)},$$

and let vector (2) be $(p_1, p_2, \ldots, p_n)$. The normalized vector is then

$$\left( \frac{p_1}{\mathrm{pAvg} + p_1},\ \frac{p_2}{\mathrm{pAvg} + p_2},\ \ldots,\ \frac{p_n}{\mathrm{pAvg} + p_n} \right).$$
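The normalization by the geometric mean can be sketched in a few lines of Python. This is an illustration under the assumption that all means Avg(xi) are positive and all terms pi are non-negative; the function names and the sample numbers are hypothetical.

```python
import math

def geometric_mean(avgs):
    """pAvg: the n-th root of the product of the per-event means (all must be > 0)."""
    return math.exp(sum(math.log(a) for a in avgs) / len(avgs))

def normalize_vector(p, avgs):
    """Map each term p_i to p_i / (pAvg + p_i), which lies in [0, 1) for p_i >= 0."""
    p_avg = geometric_mean(avgs)
    return [pi / (p_avg + pi) for pi in p]

avgs = [2.0, 4.0, 8.0]  # hypothetical per-event means; their geometric mean is 4
p = [0.0, 4.0, 12.0]    # hypothetical terms of vector (2)

print(normalize_vector(p, avgs))  # approximately [0.0, 0.5, 0.75]
```

Computing the geometric mean through logarithms avoids overflow when n is large; the mapping p / (pAvg + p) is monotonic, so the ordering of the terms is preserved while every term is bounded and can be compared with one threshold.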

3.3 Mining Anomaly Status

We use two methods. The first is an area comparison method based on the area enclosed by the curve of the normalized vector and the X- and Y-axes. The second is curve fitting. For the area comparison method, we describe the result vector as a curve and calculate the area between the curve and the X-axis by integrating from 0 to n:

$$\int_0^n f(x)\,dx = \sum_{i=1}^{n} \left( \frac{p_i}{\mathrm{pAvg} + p_i} \cdot p_i \right) = \sum_{i=1}^{n} \left[ \frac{(p_i)^2}{\mathrm{pAvg} + p_i} \right]$$

The result of the integral is compared with that of the standard curve (built from the average means). A larger result indicates that an anomaly may have occurred.
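The area comparison reduces to the sum of pi² / (pAvg + pi) over all event types, followed by a comparison with the same sum for the standard curve. The sketch below assumes that the reference vector for the standard curve is already available; how it is derived from the training data, as well as the function names and sample values, are assumptions made for illustration.

```python
def area(p, p_avg):
    """Sum of p_i^2 / (pAvg + p_i) over all event types."""
    return sum(pi * pi / (p_avg + pi) for pi in p)

def is_anomalous(p_test, p_reference, p_avg):
    """Flag the test vector when its area exceeds that of the reference curve."""
    return area(p_test, p_avg) > area(p_reference, p_avg)

p_avg = 4.0                    # hypothetical geometric mean of the per-event means
p_reference = [1.0, 1.0, 1.0]  # hypothetical terms for the standard curve
p_test = [0.0, 4.0, 12.0]      # hypothetical terms for one test window

print(area(p_reference, p_avg))
print(area(p_test, p_avg))
print(is_anomalous(p_test, p_reference, p_avg))
```

In practice the comparison would be made against a tunable threshold rather than a strict inequality, which is what produces the different operating points of an ROC curve.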

For the curve fitting method, we do not actually calculate the area between the test curve and the standard curve. Instead, a discrete point comparison is used: only the points that appear in the multi-event vector are compared. The detailed C++ code for this procedure is available from the authors by email on request.

4 Experiment Results

4.1 Training and Test Data

We use the 1999 DARPA intrusion detection evaluation data of MIT Lincoln Lab as our training and test data. This dataset was obtained from a simulated real network environment and includes both normal data and intrusion data. It can be downloaded from the following URL [7],[8]: http://www.ll.mit.edu/IST/ideval/data/1999/1999_data_index.html

This dataset provides both training data and test data. The training set consists of three weeks of data: the first and third weeks contain only normal behavior, while the second week contains attacks. The fourth and fifth weeks are the test data, which contain about 201 instances of 56 types of intrusion. The dataset provides the original BSM binary output files and the BSM structure files, as well as the original shell files used to initialize the BSM audit information [8].

The experimental results are based on this dataset. First, the training data are used to construct the model of normal behavior. Then the test data are used to obtain the experimental results and to plot the ROC curve.

4.2 Results of the Two Detection Methods

For the area comparison method, we use ROC curves to describe the experimental results. Each point on an ROC curve gives the pair of hit rate and false alarm rate obtained for one signal threshold. By varying the signal threshold, we obtain the whole ROC curve. The closer the curve is to the top-left corner of the chart, the better the detection performance of the intrusion detection algorithm.
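The construction of an ROC curve by sweeping the signal threshold can be sketched as follows. The detection values, ground-truth flags and thresholds are hypothetical, and the sketch assumes the data contain at least one attack sample and at least one normal sample.

```python
def roc_points(scores, is_attack, thresholds):
    """Return one (false_alarm_rate, hit_rate) pair per threshold.

    A sample is flagged as an intrusion when its score is strictly
    greater than the threshold.
    """
    n_attack = sum(is_attack)
    n_normal = len(is_attack) - n_attack
    points = []
    for t in thresholds:
        hits = sum(1 for s, a in zip(scores, is_attack) if a and s > t)
        false_alarms = sum(1 for s, a in zip(scores, is_attack) if not a and s > t)
        points.append((false_alarms / n_normal, hits / n_attack))
    return points

scores = [0.2, 0.4, 0.9, 1.5, 3.0, 0.3]               # hypothetical detection values
is_attack = [False, False, True, False, True, False]  # hypothetical ground truth

for far, hit in roc_points(scores, is_attack, [0.0, 0.5, 1.0, 2.0, 5.0]):
    print(f"false alarm rate={far:.2f}  hit rate={hit:.2f}")
```

A low threshold flags everything and yields the point (1, 1); a high threshold flags nothing and yields (0, 0). The useful operating points lie in between, and a better detector keeps the hit rate high while the false alarm rate drops.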

Figure 1 compares the chi-square multivariate test with our multi-event vector area comparison. The figure shows that our algorithm is more accurate and more effective than the chi-square test. In this figure, our algorithm uses the normalization procedure described in Section 3.2.

Figure 2 shows the ROC curve obtained without the normalization procedure. Comparing the curves shows that the normalization procedure increases detection accuracy. The reason is that, with normalization, the result vector reflects not only mean shifts but also the interrelationship among multiple events. The experimental results suggest that there may be some internal relationship between events in intrusion detection.

[Figure 1: ROC curves; x-axis: false alarm rate (0.00 to 1.00), y-axis: hit rate (0.00 to 1.00); curves shown: Reference Line, Multi-Event Area, Chi-Square]

Fig. 1. Comparison of Chi-Square and Multi-Event Area


For the curve fitting method, we observe the output under different threshold values between 0% and 100% and look for the threshold at which the hit rate reaches its maximum. Table 1 lists the hit rate obtained under each threshold value.

The hit rates in Table 1 were obtained by comparing the output for the test data with the characteristics of the training data. As the table shows, the best threshold lies between 80% and 90%.

Table 1. Hit rates for different thresholds

Threshold    Hit Rate
10%          5.53%
20%          13.81%
30%          23.11%
40%          31.92%
50%          52.00%
60%          61.19%
70%          75.07%
80%          91.12%
90%          80.83%

[Figure 2: ROC curve; x-axis: false alarm rate (0.00 to 1.00), y-axis: hit rate (0.00 to 1.00)]

Fig. 2. ROC curve without the normalization procedure

5 Conclusion

This paper proposed an algorithm for detecting and mining intrusion data on heterogeneous datasets. The core idea is to use multi-event vectors in place of the single probability value that serves as the only measurement in traditional algorithms. A distance concept is introduced, and the vector is generalized to heterogeneous datasets by processing discrete and continuous attributes separately. Experiments show that this measure describes the audit dataset more precisely and that the generalized vector improves accuracy and validity on heterogeneous data while keeping the time and space complexity of the original algorithm.

References

1. Ye, N., Li, X.Y., Chen, Q., Emran, S.M., Xu, M.M.: Probabilistic Techniques for Intrusion Detection Based on Computer Audit Data. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, Vol. 31, No. 4 (2001) 266–274

2. Lee, W.K., Stolfo, S.J., Mok, K.W.: A Data Mining Framework for Building Intrusion Detection Models. Proceedings of the 1999 IEEE Symposium on Security and Privacy (1999) 120–132

3. Yoshida, K.: Entropy Based Intrusion Detection. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM 2003), Vol. 2 (2003) 840–843

4. Wilson, D.R., Martinez, T.R.: Improved Heterogeneous Distance Functions. Journal of Artificial Intelligence Research, Vol. 6, No. 1 (1997) 1–34

5. Scholkopf, B., Platt, J.C.: Estimating the Support of a High-dimensional Distribution. Neural Computation, Vol. 13, No. 7 (2001) 1443–1472

6. Scholkopf, B.: The Kernel Trick for Distance. Microsoft Research, Tech. Rep. MSR-TR-2000-51 (2000)

7. Scholkopf, B.: Statistical Learning and Kernel Methods. Microsoft Research, Tech. Rep. MSR-TR-2000-23 (2000)

8. Haines, J.W., Rossey, L.M., Lippmann, R.P., Cunningham, R.K.: Extending the DARPA Off-line Intrusion Detection Evaluations. Proceedings of the DARPA Information Survivability Conference & Exposition II (DISCEX '01), Vol. 1 (2001) 35–45

9. http://www.ll.mit.edu/IST/ideval/data/1999/1999_data_index.html

10. Lindqvist, U., Porras, P.A.: eXpert-BSM: A Host-based Intrusion Detection Solution for Sun Solaris. Proceedings of the 17th Annual Computer Security Applications Conference (ACSAC 2001) (2001) 240–251
