Statistical Methods for Detecting Computer Attacks from Streaming Internet Data. Ginger Davis, University of Virginia, Systems & Information Engineering Department. Joint work with: David Marchette & Karen Kafadar. INTERFACE 2008, May 22, 2008.


Statistical Methods for Detecting Computer Attacks from Streaming

Internet Data

Ginger Davis, University of Virginia

Systems & Information Engineering Department

Joint work with:

David Marchette & Karen Kafadar

INTERFACE 2008

May 22, 2008


Outline

• Motivation

• Data

• TCP Classification

• Graphical Displays


Motivation

• Cyber attacks on computer networks are threats to nearly all operations in society.

• We need computational tools and statistical methods to identify attacks and stop them before they force shutdowns.

• Use patterns in Internet traffic data to

– Perform user profiling

– Detect anomalies, network interruptions, unusual behavior, masquerades


Project Background

[Figure: Personal Computer + The Internet (Circa 2006) = Burning Power Transformer (May 2007)]

Facts:

• The Internet is growing

• Computer network attacks are increasing

• Need for network security research & tools


Previous Work in Detecting Aberrations

• Examples

– Disease surveillance

– Nuclear product manufacturing

– Fraud detection (credit cards; phone use)

• These data sets are often

– Reasonably small (say, fewer than 100 per day)

– Easily stratified (by disease, site, cardholder)

– Approximately independent

• Can often apply Statistical Process Control tools
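As an illustration of the kind of Statistical Process Control tool meant here, the following is a minimal sketch of a Shewhart-style control chart with k-sigma limits. The daily counts, the function names, and the choice k = 3 are assumptions made for the example and do not come from the talk.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, upper) k-sigma limits estimated from in-control baseline counts."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - k * spread, center + k * spread


def flag_out_of_control(observations, limits):
    """Return the indices of observations that fall outside the control limits."""
    lower, upper = limits
    return [i for i, x in enumerate(observations) if x < lower or x > upper]


# Made-up daily counts (e.g. reported cases per day) from an in-control period.
baseline = [42, 38, 45, 40, 41, 39, 44, 43, 37, 41]
limits = control_limits(baseline)

# New observations to monitor; the third value is deliberately unusual.
new_days = [40, 43, 71, 39]
flagged = flag_out_of_control(new_days, limits)
print(limits, flagged)
```

A chart like this assumes small, roughly independent observations, which is why it suits the data sets listed above better than streaming Internet traffic.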


Features of Internet Traffic Data

• Relentless (“streaming”)

• Not independent of other systems: thousands of messages from thousands of ports/addresses each minute

• Diverse (text, numeric, image)

• Dispersed (geographically)

• Data often do not follow a convenient mathematical probability density function (pdf)


Four Stages of Data Graphics

• Static Graphics

– Scatterplot, conditioning plot, density plot

• Interactive Graphics

– Brushing, cropping, cutting, coloring, rotating, linked plots

• Dynamic Graphics (interact directly with fixed size data set on the client)

– Recursive or dynamically smoothed plot, mode tree

• Evolutionary Graphics (continually evolving streaming data sets)

– Waterfall diagram, streaming chart, skyline plot


Challenges

• Internet traffic data are streaming

• Raw data are unusable and require pre-processing

• Detecting anomalies requires characterizing typical behavior


Specific challenges for streaming data

• Data value – what to collect/discard/save for later

• Data warehouse – acquisition, storage, distribution

• Tools/algorithms for pre-processing

• Methods for analysis

– Robustness, sufficiency

• Informative visual displays


Internet Traffic Data

• All internet communications are transmitted via packets.

• Fundamental unit of information is a packet

• Packet consists of data and headers that control communication

– Internet Protocol (IP) addresses

– Transmission Control Protocol (TCP)
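To make the header structure concrete, here is a minimal sketch that unpacks the fixed 20-byte IPv4 and TCP headers with Python's standard `struct` module. The function names and the hand-built sample packet are assumptions for illustration; IP options, TCP options, and checksum validation are ignored.

```python
import socket
import struct

# TCP flag bits in the flags byte of the TCP header.
TCP_FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_ipv4_header(data: bytes) -> dict:
    """Unpack the fixed 20-byte part of an IPv4 header (network byte order)."""
    (ver_ihl, _tos, total_len, _ident, _flags_frag,
     ttl, proto, _cksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", data[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,  # IHL counts 32-bit words
        "total_len": total_len,
        "ttl": ttl,
        "protocol": proto,  # 6 means TCP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }


def parse_tcp_header(data: bytes) -> dict:
    """Unpack the fixed 20-byte part of a TCP header (network byte order)."""
    (src_port, dst_port, seq, ack, offset_reserved,
     flags, window, _cksum, _urgent) = struct.unpack("!HHIIBBHHH", data[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (offset_reserved >> 4) * 4,  # data offset counts 32-bit words
        "flags": {name: bool(flags & bit) for name, bit in TCP_FLAGS.items()},
        "window": window,
    }


# Hand-built sample packet: a SYN from 10.0.0.1:51000 to 10.0.0.2:80 (checksums left at 0).
ip_bytes = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 64, 6, 0,
                       socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
tcp_bytes = struct.pack("!HHIIBBHHH", 51000, 80, 1000, 0, 5 << 4, 0x02, 65535, 0, 0)
packet = ip_bytes + tcp_bytes

ip_header = parse_ipv4_header(packet)
tcp_header = parse_tcp_header(packet[ip_header["header_len"]:])
print(ip_header)
print(tcp_header)
```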


IP Header (Marchette 2001)


TCP


TCP Header


Hierarchy of Data

• Packets

– Identifying characteristics

– Bytes of information being sent

• Flows

– Communication between a source and a destination

• Connections

– Collection of source flows and destination flows

• Activity

– Collection of similar connections

• User session

– Collection of activities
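The lower levels of this hierarchy can be sketched as simple groupings of packets. In the sketch below, the `Packet` fields, the key functions, and the sample addresses are hypothetical: a flow is taken to be all packets sent in one direction between two endpoints, and a connection is taken to be the pair of flows between the same two endpoints.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int
    nbytes: int


def flow_key(p: Packet):
    """Directed key: one flow per (source endpoint, destination endpoint)."""
    return (p.src, p.sport, p.dst, p.dport)


def connection_key(p: Packet):
    """Undirected key: both directions of a conversation map to the same key."""
    a, b = (p.src, p.sport), (p.dst, p.dport)
    return (a, b) if a <= b else (b, a)


def group(packets, key):
    """Group packets into lists by the given key function."""
    groups = defaultdict(list)
    for p in packets:
        groups[key(p)].append(p)
    return dict(groups)


# Made-up packets: a client talking to a web server and one reply.
packets = [
    Packet("10.0.0.1", 51000, "10.0.0.2", 80, 60),
    Packet("10.0.0.2", 80, "10.0.0.1", 51000, 1500),
    Packet("10.0.0.1", 51000, "10.0.0.2", 80, 52),
]

flows = group(packets, flow_key)
connections = group(packets, connection_key)
print(len(flows), "flows in", len(connections), "connection")
```

Activities and user sessions would be built the same way, by grouping connections on whatever notion of similarity the analyst chooses.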


Hierarchy of Data Example


Goal for Data Hierarchy

• Develop models for each level of the hierarchy that depend on the models for the other levels


TCP Classification

• Detecting anomalies requires characterizing typical behavior

• We will classify network traffic according to its application


Background

Motivation:

• Port numbers map packets to their respective applications

• The only thing that matters is that the two communicating hosts know which port number to look for

• Malicious users can run other applications over a well-known port such as 80 (web traffic) and, as a result, are less likely to be noticed.


Goal and Objective

Goal:

To prevent malicious users from disguising their activities.

Objective:

To develop classification tree and multinomial logit models that correctly identify application protocols from session variable characteristics


Data

Preliminary Data Processing Methodology:

• Convert: Binary -> Text -> SQL

• Proved to be slow and inefficient

• Inadequate session aggregation results

[Diagram: GMU Inward Traffic, GMU Outward Traffic -> Step 1. Wireshark Merge & Data Dump -> Packet Dump File -> Step 2. C++ Parser -> MySQL Packet Table -> Step 3. Session Aggregator -> MySQL Session Table]

Diagram legend: Process, Data File, Database


Data

Revised Data Processing Methodology:

• Convert: Binary -> Text -> SQL

• Faster, more efficient, tracks more variables for each session

[Diagram: Libpcap Files -> Step 1. C++ Binary Packet Importer -> MySQL Packet Table -> Step 2. Revised Session Aggregator -> MySQL Session Table]

Diagram legend: Process, Data File, Database
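The project's importer was written in C++ and is not shown in the slides. As a rough idea of what reading libpcap files directly involves, the sketch below walks the classic pcap layout (a 24-byte global header followed by 16-byte record headers) in Python. The function name and the in-memory sample file are assumptions; the nanosecond-resolution and pcapng variants are not handled.

```python
import io
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic pcap, microsecond timestamps


def read_pcap(stream):
    """Yield (timestamp_seconds, raw_packet_bytes) for each record in a classic pcap stream."""
    header = stream.read(24)
    if len(header) < 24:
        raise ValueError("truncated pcap global header")
    # The magic number tells us the byte order used by the writer.
    if struct.unpack("<I", header[:4])[0] == PCAP_MAGIC:
        endian = "<"
    elif struct.unpack(">I", header[:4])[0] == PCAP_MAGIC:
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    while True:
        record = stream.read(16)
        if len(record) < 16:
            break  # end of file
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(endian + "IIII", record)
        yield ts_sec + ts_usec / 1e6, stream.read(incl_len)


# Build a tiny in-memory capture with one 3-byte "packet" so the sketch runs anywhere.
buf = io.BytesIO()
buf.write(struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1))
payload = b"\x01\x02\x03"
buf.write(struct.pack("<IIII", 1, 500000, len(payload), len(payload)))
buf.write(payload)
buf.seek(0)

records = list(read_pcap(buf))
print(records)
```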


Data

Session Aggregation Process:

• Ordered observations in database by time

• Logically grouped each packet into a session using standard TCP semantics

• Created unique session definitions

• Maintained averages and variances for each session’s variables

• Determined and marked session completion status according to TCP semantics

• Packet and session tables were linked by foreign keys
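The aggregation steps above can be sketched as follows. This is only an illustration: the `Session` class, the use of Welford's online algorithm for the running mean and variance, and the rule that a session is "complete" when both a SYN and a FIN were seen (and "reset" when an RST was seen) are assumptions, not the project's actual definitions.

```python
from dataclasses import dataclass

FIN, SYN, RST, ACK = 0x01, 0x02, 0x04, 0x10  # TCP flag bits


@dataclass
class Session:
    n: int = 0              # packets seen so far
    mean_len: float = 0.0   # running mean of packet length
    m2: float = 0.0         # running sum of squared deviations (Welford)
    saw_syn: bool = False
    saw_fin: bool = False
    saw_rst: bool = False

    def add_packet(self, length: int, flags: int) -> None:
        """Update running statistics and flag bookkeeping with one packet."""
        self.n += 1
        delta = length - self.mean_len
        self.mean_len += delta / self.n
        self.m2 += delta * (length - self.mean_len)
        self.saw_syn |= bool(flags & SYN)
        self.saw_fin |= bool(flags & FIN)
        self.saw_rst |= bool(flags & RST)

    @property
    def variance(self) -> float:
        """Sample variance of packet length (0.0 until two packets are seen)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def status(self) -> str:
        """Assumed completion rule: RST wins, then SYN plus FIN means complete."""
        if self.saw_rst:
            return "reset"
        if self.saw_syn and self.saw_fin:
            return "complete"
        return "incomplete"


# Made-up, time-ordered packets of one session: (length in bytes, TCP flags).
session = Session()
for length, flags in [(40, SYN), (60, ACK), (80, FIN | ACK)]:
    session.add_packet(length, flags)

print(session.mean_len, session.variance, session.status)
```

Keeping only running summaries means each packet is touched once, which matters when the packet table holds millions of rows.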


Data

Enterprise Data Set

• Collected by: Lawrence Berkeley National Laboratory

• Contains: 129,903,861 TCP packets; 453,135 TCP sessions

GMU Data Set

• Collected by: George Mason University

• Contains: 7,024,590 TCP packets; 91,016 TCP sessions

House Data Set

• Collected by: Capstone Team 8

• Contains: 1,110,335 TCP packets; 21,311 TCP sessions


Model Creation

Training and Testing Data Set Creation

[Diagram: MySQL Session Table -> Step 1. IMiner csv Export -> Complete Sessions csv File, All Sessions csv File -> Step 2. IMiner Training and Testing Data Set Creation -> Training Data (70% of Original Observations), Testing Data (30% of Original Observations)]

Diagram legend: Process, Data File, Database
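The project performed this split in IMiner. A comparable 70/30 split can be sketched in a few lines of Python; the random shuffle, the fixed seed, and the function name are assumptions, since the slides do not say how observations were assigned to each set.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Stand-in for session records exported from the session table.
sessions = list(range(100))
train, test = split_train_test(sessions)
print(len(train), len(test))
```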


Model Creation

Scenarios Used in Data Analysis

Real World Corporate Scenario

Used all application ports present in the data sets

Idealized Scenario

Used only “top” application ports in the data sets

Home Network Scenario

Used only http, https, pop, and smtp application ports present in the data sets


Model Creation

Classification Tree Algorithm Parameters

• RPART – originally developed for R

• Dependent Variable – Application Port

• Independent Variables – 39 session variables

• Splitting Criteria – Gini Index
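For readers unfamiliar with the splitting criterion, the sketch below computes the Gini index of a set of class labels and the weighted Gini index of a candidate split. It is a plain-Python illustration, not RPART's implementation, and the port labels are made up.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini index of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Made-up application-port labels for four sessions at a tree node.
node = [80, 80, 443, 25]
parent_impurity = gini(node)

# A candidate split that isolates the port-80 sessions on the left.
child_impurity = split_gini([80, 80], [443, 25])
print(parent_impurity, child_impurity)
```

A tree-growing algorithm would evaluate many candidate splits on the session variables and keep the one that reduces the impurity the most.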


Model Creation

Classification Tree Snapshot


Model Creation

Multinomial Logit Models

• Dependent variable – Application Port

• Independent variables – 39 session variables
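A multinomial logit model assigns each class a linear score in the predictors and converts the scores to probabilities with a softmax, with one reference class whose coefficients are fixed at zero. The sketch below shows that calculation only; the coefficients, the two-variable input, and the function name are hypothetical and were not fitted to any of the data sets.

```python
import math


def multinomial_logit_probs(x, coefs):
    """Return class probabilities given predictors x.

    coefs maps class label -> (intercept, weights); the reference class
    is represented by an intercept and weights of zero.
    """
    scores = {
        label: intercept + sum(w * xi for w, xi in zip(weights, x))
        for label, (intercept, weights) in coefs.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical coefficients for three application ports and two session variables.
coefs = {
    80: (0.0, [0.0, 0.0]),    # reference class
    25: (0.0, [-1.0, 0.0]),
    443: (0.0, [1.0, 0.0]),
}
probs = multinomial_logit_probs([1.0, 0.0], coefs)
predicted_port = max(probs, key=probs.get)
print(probs, predicted_port)
```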


Results: Classification Trees

Real World Corporate Scenario: All Ports and All Variables

Takeaway: Good prediction capability within the same data set; inconsistent results when benchmarked against different data sets.


Results: Classification Trees

Idealized Scenario: Top Ports and All Variables

Takeaway: Significant prediction improvement for the Enterprise data set. Limiting the ports removed noise from the data.


Results: Classification Trees

Home Network Scenario: Four Ports and All Variables

Takeaway: Improved prediction results both within and across data sets.


Results: Classification Trees

Port 80 Across Data Sets – 4 Application Ports

Takeaway: HTTP traffic (port 80) predictions appear to be robust across the models when only four application ports are considered.


Results: Multi-categorical Logistical

Idealized Scenario: Top Ports – All Variables

Takeaway: Weaker prediction results in the Enterprise data set. Practical in a real-time environment given appropriate environment/implementation.


Conclusion

Project Takeaways:

• Replicated / expanded prior research work successfully on real network data

• Used a fast/exportable model creation and classification process -> Classification Trees

• Created a robust toolkit for processing and storing network data


Future Work

• Implement classification trees in a real network security application

• Handle minority class presence in the data

• Make use of pruning to develop smaller models


Evolutionary Displays for EDA


Waterfall Diagrams (Wegman & Marchette 2003)


Summary / Future Work