
An Analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for Network Anomaly Detection

Matt Mahoney

[email protected]

Feb. 18, 2003

Is the DARPA/Lincoln Labs IDS Evaluation Realistic?

• The most widely used intrusion detection evaluation data set.

• 1998 data used in KDD cup competition with 25 participants.

• 8 participating organizations submitted 18 systems to the 1999 evaluation.

• Tests host or network based IDS.

• Tests signature or anomaly detection.

• 58 types of attacks (more than any other evaluation).

• 4 target operating systems.

• Training and test data released after evaluation to encourage IDS development.

Problems with the LL Evaluation

• Background network data is synthetic.

• SAD (Simple Anomaly Detector) detects too many attacks.

• Comparison with real traffic – range of attribute values is too small and static (TTL, TCP options, client addresses…).

• Injecting real traffic removes suspect detections from PHAD, ALAD, LERAD, NETAD, and SPADE.

1. Simple Anomaly Detector (SAD)

• Examines only inbound client TCP SYN packets.

• Examines only one byte of the packet.

• Trains on attack-free data (week 1 or 3).

• A value never seen in training is an anomaly.

• If there have been no anomalies for 60 seconds, then output an alarm with score 1 (sketched below).

[Example byte stream – Train: 001110111, Test: 010203001323011 – with anomalies grouped into 60 sec. windows]
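A minimal sketch of SAD in Python, assuming packets arrive as (timestamp, raw bytes) pairs and that byte_offset selects the single modeled byte (these input conventions are illustrative, not from the original):

```python
# Minimal sketch of SAD: one modeled byte, alarm only on novel values,
# and no alarm if an anomaly occurred within the previous 60 seconds.

def train_sad(training_packets, byte_offset):
    """Collect the set of byte values seen in attack-free training traffic."""
    return {pkt[byte_offset] for _, pkt in training_packets}

def test_sad(test_packets, allowed, byte_offset, quiet_period=60.0):
    """Output (timestamp, score 1) alarms for never-seen values, but only if
    no anomaly has occurred in the previous 60 seconds."""
    alarms = []
    last_anomaly = float("-inf")
    for timestamp, pkt in test_packets:
        if pkt[byte_offset] not in allowed:
            if timestamp - last_anomaly >= quiet_period:
                alarms.append((timestamp, 1))
            last_anomaly = timestamp
    return alarms
```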

DARPA/Lincoln Labs Evaluation

• Weeks 1 and 3: attack-free training data.

• Week 2: training data with 43 labeled attacks.

• Weeks 4 and 5: 201 test attacks.

[Testbed diagram: attacks arrive from the Internet through a router; an inside sniffer monitors traffic to the SunOS, Solaris, Linux, and NT targets]

SAD Evaluation

• Develop on weeks 1-2 (available in advance of 1999 evaluation) to find good bytes.

• Train on week 3 (no attacks).

• Test on weeks 4-5 inside sniffer (177 visible attacks).

• Count detections and false alarms using 1999 evaluation criteria.

SAD Results

• Variants (bytes) that do well: source IP address (any of 4 bytes), TTL, TCP options, IP packet size, TCP header size, TCP window size, source and destination ports.

• Variants that do well on weeks 1-2 (available in advance) usually do well on weeks 3-5 (evaluation).

• Very low false alarm rates.

• Most detections are not credible.

SAD vs. 1999 Evaluation

• The top system in the 1999 evaluation, Expert 1, detects 85 of 169 visible attacks (50%) at 100 false alarms (10 per day) using a combination of host and network based signature and anomaly detection.

• SAD detects 79 of 177 visible attacks (45%) with 43 false alarms using the third byte of the source IP address.

1999 IDS Evaluation vs. SAD

[Bar chart comparing recall % (0–100) and precision for Expert 1, Expert 2, Dmine, Forensics, and the SAD variants (SAD TTL, TCP Hdr, Src IP[3])]

SAD Detections by Source Address (that should have been missed)

• DOS on public services: apache2, back, crashiis, ls_domain, neptune, warezclient, warezmaster

• R2L on public services: guessftp, ncftp, netbus, netcat, phf, ppmacro, sendmail

• U2R: anypw, eject, ffbconfig, perl, sechole, sqlattack, xterm, yaga

2. Comparison with Real Traffic

• Anomaly detection systems flag rare events (e.g. previously unseen addresses or ports).

• “Allowed” values are learned during training on attack-free traffic.

• Novel values in background traffic would cause false alarms.

• Are novel values more common in real traffic?

Measuring the Rate of Novel Values

• r = Number of values observed in training.

• r1 = Fraction of values seen exactly once (Good-Turing probability estimate that the next value will be novel).

• rh = Fraction of values seen only in second half of training.

• rt = Fraction of training time to observe half of all values.

Larger values in real data would suggest a higher false alarm rate.
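A minimal sketch computing these four measures from chronological lists of observed attribute values and timestamps (the input format is an assumption for illustration):

```python
# Minimal sketch of the four measures above, given chronological lists of
# observed attribute values (e.g., TCP SYN source addresses) and timestamps.
from collections import Counter

def novelty_stats(values, timestamps):
    counts = Counter(values)
    r = len(counts)                                          # distinct values
    r1 = sum(1 for c in counts.values() if c == 1) / r       # seen exactly once
    first_half = set(values[: len(values) // 2])
    rh = sum(1 for v in counts if v not in first_half) / r   # only in 2nd half

    # rt: fraction of training time needed to observe half of all values
    seen, t_half = set(), timestamps[-1]
    for v, t in zip(values, timestamps):
        seen.add(v)
        if len(seen) * 2 >= r:
            t_half = t
            break
    duration = timestamps[-1] - timestamps[0]
    rt = (t_half - timestamps[0]) / duration if duration else 0.0
    return r, r1, rh, rt
```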

Network Data for Comparison

• Simulated data: inside sniffer traffic from weeks 1 and 3, filtered from 32M packets to 0.6M packets.

• Real data: collected from www.cs.fit.edu Oct-Dec. 2002, filtered from 100M to 1.6M.

• Traffic is filtered and rate limited to extract start of inbound client sessions (NETAD filter, passes most attacks).
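The sketch below illustrates one plausible way to extract the starts of inbound client sessions; it is not the actual NETAD filter, and the packet representation, home-network test, and per-flow packet limit are all assumptions:

```python
# Hypothetical illustration of a session-start filter (not the actual NETAD
# filter): keep only inbound packets, and only the first few packets of each
# inbound client-to-server flow. Field names and the limit are assumptions.

def filter_session_starts(packets, home_prefix="192.168.", max_per_flow=8):
    """packets: iterable of dicts with 'src', 'dst', and 'dport' keys."""
    kept_per_flow = {}
    for pkt in packets:
        if pkt["src"].startswith(home_prefix):
            continue                      # outbound traffic: drop
        flow = (pkt["src"], pkt["dst"], pkt["dport"])
        n = kept_per_flow.get(flow, 0)
        if n < max_per_flow:              # pass only the start of the flow
            kept_per_flow[flow] = n + 1
            yield pkt
```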

Attributes measured

• Packet header fields (all filtered packets) for Ethernet, IP, TCP, UDP, ICMP.

• Inbound TCP SYN packet header fields.

• HTTP, SMTP, and SSH requests (other application protocols are not present in both sets).

Comparison results

• Synthetic attributes are too predictable: TTL, TOS, TCP options, TCP window size, HTTP, SMTP command formatting.

• Too few sources: Client addresses, HTTP user agents, ssh versions.

• Too “clean”: no checksum errors, fragmentation, garbage data in reserved fields, malformed commands.

TCP SYN Source Address

            Simulated   Real
Packets, n  50650       210297
r           29          24924
r1          0           45%
rh          3%          53%
rt          0.1%        49%

r1 ≈ rh ≈ rt ≈ 50% is consistent with a Zipf distribution and a constant growth rate of r.

Real Traffic is Less Predictable

[Plot: r (number of values) vs. time for synthetic and real traffic]

3. Injecting Real Traffic

• Mix equal durations of real traffic into weeks 3-5 (both sets filtered, 344 hours each).

• We expect r ≥ max(r_sim, r_real) (realistic false alarm rate).

• Modify PHAD, ALAD, LERAD, NETAD, and SPADE not to separate data.

• Test at 100 false alarms (10 per day) on 3 mixed sets.

• Compare fraction of “legitimate” detections on simulated and mixed traffic for median mixed result.
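A minimal sketch of one way to mix the two traces, assuming each is a time-sorted list of (timestamp, packet) pairs and that the real trace is simply shifted to start at the same time as the simulated one (an assumption for illustration, not necessarily the exact procedure used):

```python
# Minimal sketch: shift the real trace so both traces start together, then
# merge the two time-sorted streams chronologically (illustration only).
import heapq

def mix_traces(sim_trace, real_trace):
    if not sim_trace or not real_trace:
        return list(sim_trace) + list(real_trace)
    offset = sim_trace[0][0] - real_trace[0][0]
    shifted = [(t + offset, pkt) for t, pkt in real_trace]
    return list(heapq.merge(sim_trace, shifted, key=lambda x: x[0]))
```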

PHAD

• Models 34 packet header fields – Ethernet, IP, TCP, UDP, ICMP

• Global model (no rule antecedents).

• Only novel values are anomalous.

• Anomaly score = tn/r, where
  – t = time since last anomaly
  – n = number of training packets
  – r = number of allowed values

• No modifications needed
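To make the tn/r score concrete, here is a minimal sketch for a single header field; PHAD models 34 such fields, and this per-field class is only an illustration:

```python
# Minimal sketch of tn/r scoring for one packet-header field (PHAD models 34).

class FieldModel:
    def __init__(self):
        self.allowed = set()      # values seen in training
        self.n = 0                # number of training packets
        self.last_anomaly = 0.0   # time of the last anomaly in this field

    def train(self, value):
        self.allowed.add(value)
        self.n += 1

    def score(self, value, now):
        """Return t*n/r if the value is novel, else 0."""
        if value in self.allowed or not self.allowed:
            return 0.0
        t = now - self.last_anomaly
        self.last_anomaly = now
        return t * self.n / len(self.allowed)
```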

ALAD

• Models inbound TCP client requests – addresses, ports, flags, application keywords.

• Score = tn/r

• Conditioned on destination port/address.

• Modified to remove address conditions and protocols not present in real traffic (telnet, FTP).

LERAD

• Models inbound client TCP (addresses, ports, flags, 8 words in payload).

• Learns conditional rules with high n/r.

• Discards rules that generate false alarms in the last 10% of training data (sketched below).

• Modified to weight rules by fraction of real traffic.

Example rule: If port = 80 then word1 = GET, POST (n/r = 10000/2)
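A minimal sketch of the validation step, with an illustrative (assumed) rule representation; the rule-learning algorithm itself is not shown:

```python
# Minimal sketch of LERAD's validation step: discard any candidate rule that
# raises a false alarm on the last 10% of the (attack-free) training data.

class Rule:
    def __init__(self, condition, attribute, allowed, n):
        self.condition = condition    # e.g. lambda rec: rec["port"] == 80
        self.attribute = attribute    # e.g. "word1"
        self.allowed = allowed        # e.g. {"GET", "POST"}
        self.n = n                    # training records matching the condition

    def violates(self, record):
        return (self.condition(record)
                and record[self.attribute] not in self.allowed)

def validate(rules, training_records):
    """Keep only rules that never fire on the last 10% of training data."""
    tail = training_records[-max(1, len(training_records) // 10):]
    return [rule for rule in rules
            if not any(rule.violates(rec) for rec in tail)]
```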

NETAD

• Models inbound client request packet bytes – IP, TCP, TCP SYN, HTTP, SMTP, FTP, telnet.

• Score = tn/r + t_i/f_i, also allowing previously seen values to be scored (sketched below).
  – t_i = time since value i was last seen
  – f_i = frequency of i in training

• Modified to remove telnet and FTP.
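A minimal sketch of this scoring idea for one modeled byte; the real NETAD covers many byte positions across several packet types, so this only illustrates the score:

```python
# Minimal sketch of the NETAD-style score for one modeled byte: t*n/r for
# novel values, plus t_i/f_i for values seen before (illustration only).

class ByteModel:
    def __init__(self):
        self.freq = {}            # value -> training count (f_i)
        self.last_seen = {}       # value -> time last seen (for t_i)
        self.n = 0                # number of training packets
        self.last_anomaly = 0.0

    def train(self, value, now):
        self.freq[value] = self.freq.get(value, 0) + 1
        self.last_seen[value] = now
        self.n += 1

    def score(self, value, now):
        if value not in self.freq:                    # novel value: t*n/r
            t = now - self.last_anomaly
            self.last_anomaly = now
            s = t * self.n / max(1, len(self.freq))
        else:                                         # seen before: t_i/f_i
            s = (now - self.last_seen[value]) / self.freq[value]
        self.last_seen[value] = now
        return s
```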

SPADE (Hoagland)

• Models inbound TCP SYN.

• Score = 1/P(src IP, dest IP, dest port).

• Probability by counting.

• Always in training mode.

• Modified by randomly replacing real destination IP with one of 4 simulated targets.
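A minimal sketch of counting-based scoring in the spirit of SPADE; the exact probability model SPADE uses internally may differ, so this only illustrates score = 1/P with counts updated continuously:

```python
# Minimal sketch of SPADE-style scoring: estimate P(src IP, dst IP, dst port)
# by counting, score = 1/P, and keep updating counts (always in training mode).
from collections import Counter

class Spade:
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def score(self, src_ip, dst_ip, dst_port):
        key = (src_ip, dst_ip, dst_port)
        self.total += 1
        self.counts[key] += 1
        p = self.counts[key] / self.total    # probability estimate by counting
        return 1.0 / p                       # rare triples score high
```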

Criteria for Legitimate Detection

• Source address – target server must authenticate source.

• Destination address/port – attack must use or scan that address/port.

• Packet header field – attack must write/modify the packet header (probe or DOS).

• No U2R or Data attacks.

Mixed Traffic: Fewer Detections, but More are Legitimate

[Bar chart: detections out of 177 at 100 false alarms (scale 0–140) for PHAD, ALAD, LERAD, NETAD, and SPADE, showing Total and Legitimate detections]

Conclusions

• SAD suggests the presence of simulation artifacts and artificially low false alarm rates.

• The simulated traffic is too clean, static and predictable.

• Injecting real traffic reduces suspect detections in all 5 systems tested.

Limitations and Future Work

• Only one real data source tested – may not generalize.

• Tests on real traffic cannot be replicated due to privacy concerns (root passwords in the data, etc.).

• Each IDS must be analyzed and modified to prevent data separation.

• Is host data affected (BSM, audit logs)?

Limitations and Future Work

• Real data may contain unlabeled attacks. We found over 30 suspicious HTTP requests in our data (to a Solaris-based host).

IIS exploit with double URL encoding (IDS evasion?)

GET /scripts/..%255c%255c../winnt/system32/cmd.exe?/c+dir

Probe for Code Red backdoor:

GET /MSADC/root.exe?/c+dir HTTP/1.0

Further Reading

An Analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for Network Anomaly Detection

By Matthew V. Mahoney and Philip K. Chan

Dept. of Computer Sciences Technical Report CS-2003-02

http://cs.fit.edu/~mmahoney/paper7.pdf