predicting zero-day software vulnerabilities through data-mining --third presentation


Post on 25-Feb-2016


Page 1: Predicting zero-day software vulnerabilities through data-mining --Third Presentation


PREDICTING ZERO-DAY SOFTWARE VULNERABILITIES THROUGH DATA-MINING

--THIRD PRESENTATION

Su Zhang

Page 2: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Outline

• Quick Review
• Data Source – NVD
• Data Preprocessing
• Experimental Results
• An Essential Limitation
• An Alternative Feature
• Conclusion
• Future Work

Page 3: Predicting zero-day software vulnerabilities through data-mining --Third Presentation


Quick Review

Page 4: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Source Database – NVD

• National Vulnerability Database
  – U.S. government repository of standard vulnerability management data.
  – Data included in each NVD entry:
    • Published date and time
    • Vulnerable software's CPE specification
    • CVSS (Common Vulnerability Scoring System)
    • External links/references/summary

Page 5: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Instances

• An instance is a tuple that combines configuration information with a vulnerability.
  – <CPE, Vulnerability>
  – e.g. (Microsoft, windows7, sp1, CVSS, vulnerability1)
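As a rough illustration of how such a tuple could be assembled, the sketch below splits a CPE 2.2 URI (`cpe:/part:vendor:product:version:update`) into its fields and pairs it with a vulnerability label. The names `Instance` and `parse_cpe_uri` are hypothetical, and the CVSS field from the slide's example is left out for brevity.

```python
from typing import NamedTuple, Optional


class Instance(NamedTuple):
    """Hypothetical <CPE, Vulnerability> tuple; not the author's actual schema."""
    part: str
    vendor: str
    product: str
    version: Optional[str]
    update: Optional[str]
    vulnerability: str


def parse_cpe_uri(cpe_uri: str, vulnerability: str) -> Instance:
    """Split a CPE 2.2 URI into its first five components."""
    prefix = "cpe:/"
    if not cpe_uri.startswith(prefix):
        raise ValueError(f"not a CPE 2.2 URI: {cpe_uri!r}")
    fields = cpe_uri[len(prefix):].split(":")
    # Pad so that missing trailing components become empty strings.
    fields += [""] * (5 - len(fields))
    part, vendor, product, version, update = fields[:5]
    # Empty components are reported as None (unknown).
    return Instance(part, vendor, product, version or None, update or None,
                    vulnerability)


if __name__ == "__main__":
    print(parse_cpe_uri("cpe:/o:linux:linux_kernel:2.6.3", "vulnerability1"))
```

Missing trailing components (for example, no update field) come back as `None`, so instances without version information remain distinguishable from versioned ones.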

Page 6: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Number of Instances

[Bar chart "Instances Table": number of instances per vendor (Linux, Sun, Cisco, Mozilla, Microsoft, Apple, PHP, IBM, Adobe, others); vertical axis from 0 to 60000.]

Page 7: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Number of CVEs

[Bar chart "Vulnerability Table": number of vulnerabilities (Vul_Num) per vendor (Microsoft, Sun, Apple, IBM, Oracle, Cisco, Mozilla, Linux, HP, rest); vertical axis from 0 to 2500.]

Page 8: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Data Preprocessing

• NVD data: training/testing dataset
  – Start from 2005, since the data before that looks unstable.
  – Remove some obvious errors in NVD (e.g. “cpe:/o:linux:linux_kernel:390”).
• Attributes
  – Published time: month and day, or epoch time.
  – Version: discretization/binning.
  – Versiondiff: a normalized difference between two versions.
    • Radix-based versiondiff.
    • Counter (rank)-based versiondiff.
  – Vendor: removed (we built only one model for each vendor).
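The two time formats and the 2005 cutoff can be sketched as follows. This is only an illustration: the function names are hypothetical, the input is assumed to be an ISO 8601 timestamp string, and timestamps without a timezone are assumed to be UTC.

```python
from datetime import datetime, timezone


def time_features(published: str) -> dict:
    """Return both time encodings: (month, day) and epoch seconds."""
    dt = datetime.fromisoformat(published)
    if dt.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return {"month": dt.month, "day": dt.day, "epoch": int(dt.timestamp())}


def keep_from_2005(published: str) -> bool:
    """Filter used to drop entries published before 2005."""
    return datetime.fromisoformat(published).year >= 2005


if __name__ == "__main__":
    print(time_features("2005-01-01T00:00:00+00:00"))
    print(keep_from_2005("2004-12-31T23:59:59+00:00"))
```

The month/day encoding keeps seasonal position but discards the year, while epoch time keeps absolute ordering; that difference is what the later comparison of the two time formats measures.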

Page 9: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Predictive & Predicted Attributes

• Predictive features
  – Time
  – Versiondiff
  – TTPV (Time to Previous Vulnerability)
  – CVSS (Common Vulnerability Scoring System)
• Predicted feature (intermediate result)
  – TTNV (Time to Next Vulnerability)
    • We believe this feature could quantify the risk level of software.
• Final result
  – Quantitative risk level indicator
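One plausible way to derive TTPV and TTNV for a single product is to sort its vulnerability publication dates and take the gaps to the neighbouring entries. The sketch below assumes gaps are measured in days; the function name `ttpv_ttnv` is hypothetical.

```python
from datetime import date
from typing import List, Optional, Tuple


def ttpv_ttnv(published: List[date]) -> List[Tuple[date, Optional[int], Optional[int]]]:
    """For each publication date, return (date, TTPV, TTNV) in days.

    TTPV is None for the earliest entry and TTNV is None for the latest,
    because no previous/next vulnerability is known for them.
    """
    ordered = sorted(published)
    rows = []
    for i, day in enumerate(ordered):
        ttpv = (day - ordered[i - 1]).days if i > 0 else None
        ttnv = (ordered[i + 1] - day).days if i + 1 < len(ordered) else None
        rows.append((day, ttpv, ttnv))
    return rows


if __name__ == "__main__":
    for row in ttpv_ttnv([date(2010, 1, 11), date(2010, 1, 1), date(2010, 1, 31)]):
        print(row)
```

The last entry of every product has no TTNV, so it can serve as a predictive input but not as a labelled training example.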

Page 10: Predicting zero-day software vulnerabilities through data-mining --Third Presentation


Fitness Indicator - Correlation Coefficient [13]
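Assuming the indicator is the standard (Pearson) correlation coefficient described in [13], it can be written as below. The symbols are assumptions made for illustration: x_i is the predicted TTNV of instance i, y_i is the observed TTNV, and bars denote sample means over the n instances.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Values of r close to 1 indicate that predictions track the observed values closely; values near 0 indicate no linear relationship.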

Page 11: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Training/Testing dataset

• We used a training : testing ratio of 2 : 1 in our experiments.

• All training data is earlier than all testing data.
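A chronological 2 : 1 split of this kind can be sketched as follows. The function name and the `key` argument (which extracts the timestamp from an instance) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    instances: Sequence[T],
    key: Callable[[T], object],
    train_parts: int = 2,
    test_parts: int = 1,
) -> Tuple[List[T], List[T]]:
    """Sort by time, then cut so that every training item precedes every test item."""
    ordered = sorted(instances, key=key)
    cut = len(ordered) * train_parts // (train_parts + test_parts)
    return ordered[:cut], ordered[cut:]


if __name__ == "__main__":
    data = [(5, "e"), (1, "a"), (3, "c"), (2, "b"), (6, "f"), (4, "d")]
    train, test = chronological_split(data, key=lambda row: row[0])
    print(train, test)
```

Splitting by time rather than at random avoids training on vulnerabilities published after the ones being predicted.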

Page 12: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Correlation Coefficient for Linux Vulnerabilities Using Two Formats of Time

Page 13: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Counter (Rank) Based Versiondiff

• We rank all versions regardless of their values.
  – If a product has only three versions, 5.0, 2.2 and 2.1, their values are replaced by 3, 2 and 1.
  – i.e. versiondiff (5.0, 2.2) = versiondiff (2.2, 2.1), and versiondiff (5.0, 2.1) = 2 * versiondiff (2.2, 2.1).
• Characteristic:
  – This scheme neglects the quantitative differences between versions. The radix is a “dynamic” number that depends on how many distinct versions exist.
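The ranking scheme can be sketched as below. The function names are hypothetical, versions are assumed to be purely numeric dotted strings, and the difference is left unnormalized.

```python
from typing import Dict, Iterable, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '2.6.3' into (2, 6, 3) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def rank_versions(versions: Iterable[str]) -> Dict[str, int]:
    """Replace each distinct version by its 1-based rank in sorted order."""
    ordered = sorted(set(versions), key=version_key)
    return {version: rank for rank, version in enumerate(ordered, start=1)}


def rank_versiondiff(a: str, b: str, ranks: Dict[str, int]) -> int:
    """Counter (rank) based versiondiff: distance between the two ranks."""
    return abs(ranks[a] - ranks[b])


if __name__ == "__main__":
    ranks = rank_versions(["5.0", "2.2", "2.1"])
    print(ranks)
    print(rank_versiondiff("5.0", "2.1", ranks))
```

With the three versions from the slide, 5.0 and 2.2 end up exactly as far apart as 2.2 and 2.1, which is the loss of quantitative information the slide points out.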

Page 14: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Fixed Radix (100) Versiondiff

• The radix for each sub-version is a fixed value: 100.
  – Versiondiff(2.1, 3.1) = 100
  – Versiondiff(3.3, 3.1) = 2
• Underlying principle:
  – A difference between major versions suggests a higher degree of dissimilarity than a difference between (relatively) minor versions. [14]
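A minimal sketch of the fixed-radix scheme reads each version as a number in base 100 and takes the absolute difference. The function names and the `depth` parameter (how many components are compared) are assumptions, and every component is assumed to be smaller than the radix.

```python
def radix_value(version: str, radix: int = 100, depth: int = 2) -> int:
    """Read the first `depth` components of a version as digits in base `radix`."""
    parts = [int(part) for part in version.split(".")][:depth]
    # Pad short versions with zeros so all values share the same scale.
    parts += [0] * (depth - len(parts))
    value = 0
    for part in parts:
        value = value * radix + part
    return value


def radix_versiondiff(a: str, b: str, radix: int = 100, depth: int = 2) -> int:
    """Fixed-radix versiondiff: absolute difference of the two radix values."""
    return abs(radix_value(a, radix, depth) - radix_value(b, radix, depth))


if __name__ == "__main__":
    print(radix_versiondiff("2.1", "3.1"))  # 100
    print(radix_versiondiff("3.3", "3.1"))  # 2
```

Both values from the slide are reproduced with `depth=2`; a larger depth scales a major-version step by a further factor of 100 per extra component.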

Page 15: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Correlation Coefficient for Linux Vulnerabilities Using Two Formats of Versiondiff

Page 16: Predicting zero-day software vulnerabilities through data-mining --Third Presentation


CVSS Metrics

• Access vector {ADJACENT_NETWORK, NETWORK, LOCAL}

• Confidentiality {COMPLETE, PARTIAL, NONE}

• Integrity {COMPLETE, PARTIAL, NONE}

• Availability {COMPLETE, PARTIAL, NONE}

Page 17: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Correlation Coefficient for Adobe Vulnerabilities Using CVSS Metrics or Not

Page 18: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Software (Linux Kernel) Version Discretization/Binning

• Rationale: group values with high similarity.

• How?
  – Round each version to its third most significant sub-version.
  – E.g. Bin (2.6.3.1.2) = 2.6.3
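Judging from the example Bin (2.6.3.1.2) = 2.6.3, the "rounding" amounts to keeping the first three components and discarding the rest. A minimal sketch under that assumption (the function name `bin_version` is hypothetical):

```python
def bin_version(version: str, depth: int = 3) -> str:
    """Keep only the first `depth` dot-separated components of a version."""
    return ".".join(version.split(".")[:depth])


if __name__ == "__main__":
    print(bin_version("2.6.3.1.2"))  # 2.6.3
```

Versions that already have three or fewer components are returned unchanged.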

Page 19: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Software Version (Linux Kernel) Discretization/Binning (Cont)

• Why & why not?
  – Why 3? More than half of the instances (31834/56925) have a version longer than 3.
  – Why not 4? Only 1% of the instances (665/56925) have a version longer than 4.
  – Why not 2? A difference in the third sub-version is regarded as a huge dissimilarity for the Linux kernel. [1]
  – Why not Microsoft? Versions of Microsoft products are naturally discrete (all of them have numeric versions less than 20).

Page 20: Predicting zero-day software vulnerabilities through data-mining --Third Presentation


Correlation Coefficient for Linux Vulnerabilities Using Binned Versions or Not

Page 21: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

An Essential Problem of Versiondiff

• Most new vulnerabilities that affect the current version affect previous versions as well.
  – Microsoft Bulletin.
  – Adobe Bulletin.
  – Therefore, most versiondiff values are zero (or unknown).
    • Microsoft: 85.2% (14229/16699)
    • Linux: 61.6% (39448/64052)
    • Mozilla: 53.4% (12057/22566)
    • …

Page 22: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

A Possible Alternative Attribute

• Number of occurrences of each version of each software.
  – This could illustrate the trend of each version, since the number of occurrences keeps increasing and most instances will have a meaningful value (instead of zero).
  – This attribute only follows our intuition; we could not find a rationale behind it.
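One reading of this attribute is a running count: for each instance, how many times its (software, version) pair has appeared so far in chronological order. The sketch below follows that interpretation, which is an assumption; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def occurrence_counts(instances: Iterable[Tuple[str, str]]) -> List[int]:
    """Running number of occurrences of each (software, version) pair.

    `instances` must already be in chronological order.
    """
    seen = defaultdict(int)
    counts = []
    for software, version in instances:
        seen[(software, version)] += 1
        counts.append(seen[(software, version)])
    return counts


if __name__ == "__main__":
    stream = [("firefox", "3.0"), ("firefox", "3.5"), ("firefox", "3.0"),
              ("ie", "8"), ("firefox", "3.0")]
    print(occurrence_counts(stream))
```

Unlike versiondiff, this value is never zero, which is the property the slide is after.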

Page 23: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Microsoft

• Windows
  – Instances have no version information. Instead of the aforementioned attribute, we use the number of occurrences of the given software (Windows).
• Non-Windows applications
  – Instances include version information. We used the aforementioned attribute as one of the predictive features.

Page 24: Predicting zero-day software vulnerabilities through data-mining --Third Presentation


Windows and non-windows instances

Page 25: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Different Applications Have Quite Different Trends

• Firefox
  – Models built on it give promising results (the correlation coefficient is close to 0.7 for both training and test data).
  – Adding CVSS or not does not affect the results.
• Internet Explorer
  – It has similar results when CVSS is added.
  – But its results are extremely bad without CVSS.

Page 26: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Correlation Coefficient for IE Vulnerabilities Using CVSS Metrics or Not

Page 27: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Correctly Classified Rate for Firefox Vulnerabilities Using CVSS Metrics or Not

Page 28: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Google (Chrome)

• It is becoming an increasingly vulnerable vendor (in terms of number of instances).

• It has more than 10,000 instances.

• However, more than half of them appeared within two months (Apr-May 2010).

Page 29: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Conclusion

• Conclusion: Vendor-based models cannot be built now because of the limitations of NVD data. However, building models for groups of similar applications is another possibility.
• Why?
  – The trend of TTNV is not stable (as shown in the previous tests).
  – Some errors could dramatically affect the results.
  – Inconsistent definitions (caused by different maintainers). [12]
  – Version information could not be used effectively.

Page 30: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

Future Work

• Number of zero-day vulnerabilities of each software
  – This may need life-cycle information.
• CVSS score
  – Indicates the risk levels of different vulnerabilities.

Page 31: Predicting zero-day software vulnerabilities through data-mining --Third Presentation


Questions & Discussions

Thank you!

Page 32: Predicting zero-day software vulnerabilities through data-mining --Third Presentation

References

• [1] Andrew Buttner et al, “Common Platform Enumeration (CPE) – Specification,” 2008.
• [2] NVD, http://nvd.nist.gov/home.cfm.
• [3] O. H. Alhazmi et al, “Modeling the Vulnerability Discovery Process,” 2005.
• [4] Omar H. Alhazmi et al, “Prediction Capabilities of Vulnerability Discovery Models,” 2006.
• [5] Andy Ozment, “Improving Vulnerability Discovery Models,” 2007.
• [6] R. Gopalakrishna and E. H. Spafford, “A trend analysis of vulnerabilities,” 2005.
• [7] Christopher M. Bishop, “Pattern Recognition and Machine Learning,” 2006.
• [8] Xinming Ou et al, “MulVAL: A logic-based network security analyzer,” 2005.
• [9] Kyle Ingols et al, “Modeling Modern Network Attacks and Countermeasures Using Attack Graphs,” 2009.
• [10] Miles A. McQueen et al, “Empirical Estimates and Observations of 0Day Vulnerabilities,” 2009.
• [11] Alex J. Smola et al, “A Tutorial on Support Vector Regression,” 1998.
• [12] Andy Ozment, “Vulnerability Discovery & Software Security,” Ph.D. dissertation.
• [13] Correlation Coefficient, http://mathworld.wolfram.com/CorrelationCoefficient.html.
• [14] Microsoft Software Versioning, http://msdn.microsoft.com/en-us/library/system.version.aspx.