KDD Cup ’99: Classifier Learning Predictive Model for Intrusion Detection Charles Elkan 1999 Conference on Knowledge Discovery and Data Mining Presented by Chris Clifton


Page 1: KDD Cup ’99:  Classifier Learning Predictive Model for Intrusion Detection

KDD Cup ’99: Classifier Learning
Predictive Model for Intrusion Detection

Charles Elkan
1999 Conference on Knowledge Discovery and Data Mining

Presented by Chris Clifton

Page 2

KDD Cup Overview

• Held annually in conjunction with Knowledge Discovery and Data Mining Conference (now ACM-sponsored)

• Challenge problem(s) released well before conference
– Goal is to give best solution to problem
– Relatively informal “contest”
– Gives “standard” test for comparing techniques

• Winner announced at KDD conference
– Lots of recognition to winner

Page 3

Classifier Learning for Intrusion Detection

• One of two KDD’99 challenge problems
– Other was a knowledge discovery problem

• Goal is to learn a classifier that labels TCP/IP connections as intrusion or okay
– Data: Collection of features describing each TCP connection
– Class: Non-attack or type of attack

• Scoring: Cost per Test Sample
– Wrong answers penalized based on type of “wrong”

Page 4

Data: TCP “connection” information

• Dataset developed for 1998 DARPA Intrusion Detection Evaluation Program
– Nine weeks of raw TCP dump data from simulated USAF LAN
– Simulated attacks to give positive examples
– Processed into 5 million training “connections”, 2 million test
– Some “attributes” derived from raw data

• Twenty-four attack types in training data, in four classes:
– DOS: denial-of-service, e.g., SYN flood;
– R2L: unauthorized access from a remote machine, e.g., guessing password;
– U2R: unauthorized access to local superuser (root) privileges, e.g., various “buffer overflow” attacks;
– probing: surveillance and other probing, e.g., port scanning.

• Test set includes fourteen attack types not found in training set

Page 5

Basic features of individual TCP connections

feature name    description                                                   type
duration        length (number of seconds) of the connection                  continuous
protocol_type   type of the protocol, e.g. tcp, udp, etc.                     discrete
service         network service on the destination, e.g., http, telnet, etc.  discrete
src_bytes       number of data bytes from source to destination               continuous
dst_bytes       number of data bytes from destination to source               continuous
flag            normal or error status of the connection                      discrete
land            1 if connection is from/to the same host/port; 0 otherwise    discrete
wrong_fragment  number of ``wrong'' fragments                                 continuous
urgent          number of urgent packets                                      continuous

Page 6

Content features within a connection suggested by domain knowledge

feature name        description                                              type
hot                 number of ``hot'' indicators                             continuous
num_failed_logins   number of failed login attempts                          continuous
logged_in           1 if successfully logged in; 0 otherwise                 discrete
num_compromised     number of ``compromised'' conditions                     continuous
root_shell          1 if root shell is obtained; 0 otherwise                 discrete
su_attempted        1 if ``su root'' command attempted; 0 otherwise          discrete
num_root            number of ``root'' accesses                              continuous
num_file_creations  number of file creation operations                       continuous
num_shells          number of shell prompts                                  continuous
num_access_files    number of operations on access control files             continuous
num_outbound_cmds   number of outbound commands in an ftp session            continuous
is_hot_login        1 if the login belongs to the ``hot'' list; 0 otherwise  discrete
is_guest_login      1 if the login is a ``guest'' login; 0 otherwise         discrete

Page 7

Traffic features computed using a two-second time window

feature name        description                                                                                  type
count               number of connections to the same host as the current connection in the past two seconds     continuous

Note: The following features refer to these same-host connections.

serror_rate         % of connections that have ``SYN'' errors                                                    continuous
rerror_rate         % of connections that have ``REJ'' errors                                                    continuous
same_srv_rate       % of connections to the same service                                                         continuous
diff_srv_rate       % of connections to different services                                                       continuous
srv_count           number of connections to the same service as the current connection in the past two seconds  continuous

Note: The following features refer to these same-service connections.

srv_serror_rate     % of connections that have ``SYN'' errors                                                    continuous
srv_rerror_rate     % of connections that have ``REJ'' errors                                                    continuous
srv_diff_host_rate  % of connections to different hosts                                                          continuous
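To make the two-second window concrete, here is a minimal sketch of how a few of these traffic features could be derived from a time-ordered list of connections. The `Conn` record, its field names, and the choice to count the current connection inside its own window are assumptions made for illustration; the slides do not say how the original preprocessing handled these details. Rates are returned as fractions between 0 and 1.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 2.0  # window length given on the slide


@dataclass
class Conn:
    """Hypothetical minimal connection record (field names are assumptions)."""
    time: float     # connection time in seconds
    dst_host: str   # destination host
    service: str    # destination service, e.g. "http"


def traffic_features(history, current):
    """Compute a few window features for `current` from earlier connections.

    Assumption: the current connection counts toward its own window, so the
    denominators below are never zero.
    """
    window = [c for c in history
              if 0.0 <= current.time - c.time <= WINDOW_SECONDS]
    window.append(current)

    same_host = [c for c in window if c.dst_host == current.dst_host]
    same_srv = [c for c in window if c.service == current.service]

    count = len(same_host)
    srv_count = len(same_srv)
    same_srv_rate = sum(c.service == current.service for c in same_host) / count
    srv_diff_host_rate = sum(c.dst_host != current.dst_host for c in same_srv) / srv_count
    return {
        "count": count,
        "srv_count": srv_count,
        "same_srv_rate": same_srv_rate,
        "diff_srv_rate": 1.0 - same_srv_rate,
        "srv_diff_host_rate": srv_diff_host_rate,
    }


history = [
    Conn(0.0, "a", "http"),   # older than two seconds, ignored
    Conn(8.5, "a", "http"),
    Conn(9.0, "a", "ftp"),
    Conn(9.5, "b", "http"),
]
features = traffic_features(history, Conn(10.0, "a", "http"))
print(features["count"], features["srv_count"])  # 3 3
```

In the example, three connections in the window go to host "a" and three use the http service, so both counts are 3, and two of the three same-host connections share the current service.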

Page 8

Scoring

• Each prediction gets a score:
– Row is correct answer
– Column is prediction made

• Score is average over all predictions

          normal  probe  DOS  U2R  R2L
normal         0      1    2    2    2
probe          1      0    2    2    2
DOS            2      1    0    2    2
U2R            3      2    2    0    2
R2L            4      2    2    2    0
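The scoring rule can be sketched in a few lines: look up the cost for each (correct answer, prediction) pair and average over all predictions. The function name `average_cost` and the toy labels are illustrative assumptions; only the cost values come from the slide.

```python
CLASSES = ["normal", "probe", "DOS", "U2R", "R2L"]
INDEX = {name: i for i, name in enumerate(CLASSES)}

# COST[actual][predicted], copied from the matrix on the slide.
COST = [
    [0, 1, 2, 2, 2],  # actual normal
    [1, 0, 2, 2, 2],  # actual probe
    [2, 1, 0, 2, 2],  # actual DOS
    [3, 2, 2, 0, 2],  # actual U2R
    [4, 2, 2, 2, 0],  # actual R2L
]


def average_cost(actual, predicted):
    """Average misclassification cost over paired label lists."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty label lists of equal length")
    total = sum(COST[INDEX[a]][INDEX[p]] for a, p in zip(actual, predicted))
    return total / len(actual)


# Toy example (made-up labels, not contest data).
actual = ["normal", "DOS", "R2L", "U2R"]
predicted = ["normal", "probe", "normal", "U2R"]
print(average_cost(actual, predicted))  # (0 + 1 + 4 + 0) / 4 = 1.25
```

Note the asymmetry: calling an R2L attack "normal" costs 4, while calling normal traffic an attack costs at most 2.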

Page 9

Results

• Twenty-four entries, scores (average cost per test example; lower is better):
0.2331 0.2356 0.2367 0.2411 0.2414 0.2443 0.2474 0.2479
0.2523 0.2530 0.2531 0.2545 0.2552 0.2575 0.2588 0.2644
0.2684 0.2952 0.3344 0.3767 0.3854 0.3899 0.5053 0.9414

• 1-Nearest Neighbor scored 0.2523

Page 10

Winning Method: Bagged Boosting

• Submitted by Bernhard Pfahringer, ML Group, Austrian Research Institute for AI

• 50 samples drawn from the original set of roughly 5 million examples
– Contrary to standard bagging, the sampling was slightly biased:
– all of the examples of the two smallest classes, U2R and R2L
– 4000 PROBE, 80000 NORMAL, and 400000 DOS examples
– duplicate entries in the original data set removed

• Ten C5 decision trees induced from each sample
– used both C5's error-cost and boosting options

• Final predictions computed from the 50 single predictions (one per training sample) by minimizing “conditional risk”
– minimizes sum of error-costs times class-probabilities

• Training took approximately 1 day on a 2-processor 200 MHz Sparc
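The “conditional risk” step can be illustrated with a small sketch: for each candidate prediction, sum the error-costs weighted by the estimated class probabilities, then pick the prediction with the smallest sum. The cost values are those from the Scoring slide. How the winning entry turned its 50 predictions into class probabilities is not described here, so the simple averaging in `average_probs` is an assumption, as are all function names.

```python
CLASSES = ["normal", "probe", "DOS", "U2R", "R2L"]

# COST[actual][predicted], from the Scoring slide.
COST = [
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
]


def average_probs(per_model_probs):
    """Average class-probability vectors from several models (an assumption)."""
    n = len(per_model_probs)
    return [sum(p[k] for p in per_model_probs) / n for k in range(len(CLASSES))]


def conditional_risks(probs):
    """risk[j] = sum over actual classes i of probs[i] * COST[i][j]."""
    k = len(CLASSES)
    return [sum(probs[i] * COST[i][j] for i in range(k)) for j in range(k)]


def min_risk_class(probs):
    """Name of the prediction with the smallest conditional risk."""
    risks = conditional_risks(probs)
    best = min(range(len(risks)), key=risks.__getitem__)
    return CLASSES[best]


# Two made-up models that disagree between normal and R2L.
probs = average_probs([
    [0.8, 0.0, 0.0, 0.0, 0.2],
    [0.4, 0.0, 0.0, 0.0, 0.6],
])
print(min_risk_class(probs))  # R2L
```

With averaged probabilities of about 0.6 for normal and 0.4 for R2L, predicting normal has risk 0.4 × 4 = 1.6 while predicting R2L has risk 0.6 × 2 = 1.2, so the less probable but more costly-to-miss class is chosen.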

Page 11

Confusion Matrix (Breakdown of score)

Classes are numbered in the order of the cost matrix: 0 = normal, 1 = probe, 2 = DOS, 3 = U2R, 4 = R2L.

Winning Entry:

                            Predicted
actual           0       1       2       3       4  %correct
0            60262     243      78       4       6     99.5%
1              511    3471     184       0       0     83.3%
2             5299    1328  223226       0       0     97.1%
3              168      20       0      30      10     13.2%
4            14527     294       0       8    1360      8.4%
%correct     74.6%   64.8%   99.9%   71.4%   98.8%

For 1-NN:

                            Predicted
actual           0       1       2       3       4  %correct
0            60322     212      57       1       1     99.6%
1              697    3125     342       0       2     75.0%
2             6144      76  223633       0       0     97.3%
3              209       5       1       8       5      3.5%
4            15785     308       1       0      95      0.6%
%correct     72.5%   83.9%   99.8%   88.9%   92.2%
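As a check on how the confusion matrices relate to the reported scores, the sketch below weights each cell by the corresponding entry of the cost matrix from the Scoring slide and divides by the number of test examples. It assumes the class numbering 0 to 4 follows the cost-matrix order (normal, probe, DOS, U2R, R2L); the function names are illustrative.

```python
# COST[actual][predicted], from the Scoring slide.
COST = [
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
]

# Rows are actual classes, columns are predicted classes.
WINNER = [
    [60262, 243, 78, 4, 6],
    [511, 3471, 184, 0, 0],
    [5299, 1328, 223226, 0, 0],
    [168, 20, 0, 30, 10],
    [14527, 294, 0, 8, 1360],
]
ONE_NN = [
    [60322, 212, 57, 1, 1],
    [697, 3125, 342, 0, 2],
    [6144, 76, 223633, 0, 0],
    [209, 5, 1, 8, 5],
    [15785, 308, 1, 0, 95],
]


def average_cost(confusion):
    """Total cost of all cells divided by the number of test examples."""
    n = len(confusion)
    total_cost = sum(confusion[i][j] * COST[i][j]
                     for i in range(n) for j in range(n))
    total_examples = sum(sum(row) for row in confusion)
    return total_cost / total_examples


def row_percent_correct(confusion):
    """Per actual class: share of its examples that were predicted correctly."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(confusion)]


print(round(average_cost(WINNER), 4))  # 0.2331
print(round(average_cost(ONE_NN), 4))  # 0.2523
```

Both values match the scores listed on the Results slide. Most of the cost comes from class 4 (R2L) examples predicted as class 0 (normal), each of which costs 4.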

Page 12

Analysis of winning entry

• Result comparable to 1-NN except on “rare” classes
– Training sample of winner biased to rare classes
– Does this give us a general principle?

• Misses badly for some attack categories
– True for 1-NN as well
– Problem with feature set?

Page 13

Second and Third places (Probably not statistically significant)

• Itzhak Levin, LLSoft, Inc.: Kernel Miner
– Link broken?

• Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin, MP13, Moscow, Russia

• Verbal rules constructed by an expert
• First echelon of voting decision trees
• Second echelon of voting decision trees

– Steps are applied sequentially
– Branch to the next step occurs whenever the current one has failed to recognize the connection
– Trees constructed using their own (previously developed) tree learning algorithm
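The sequential structure of the third-place entry can be sketched as a cascade: each step either returns a label or signals that it failed to recognize the connection, in which case control passes to the next step. The three stage functions below are toy stand-ins invented for illustration; they are not the entrants' expert rules or trees, and voting within an echelon is not modeled.

```python
def cascade(stages, connection):
    """Return the label from the first stage that recognizes the connection."""
    for stage in stages:
        label = stage(connection)
        if label is not None:  # this stage recognized the connection
            return label
    return None  # no stage recognized it


# Toy stand-ins (hypothetical rules, not from the source).
def expert_rules(conn):
    return "DOS" if conn.get("land") == 1 else None


def first_echelon(conn):
    return "R2L" if conn.get("num_failed_logins", 0) >= 3 else None


def second_echelon(conn):
    return "normal" if conn.get("logged_in") == 1 else None


stages = [expert_rules, first_echelon, second_echelon]
print(cascade(stages, {"land": 0, "num_failed_logins": 5}))  # R2L
```

Returning `None` for "not recognized" keeps each stage independent, so stages can be reordered or replaced without touching the cascade itself.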