A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS Silvio Cesare, Deakin University

Post on 22-Apr-2015

Page 1: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

A WHIRLWIND TOUR

OF ACADEMIC TECHNIQUES

FOR REAL-WORLD SECURITY RESEARCHERS

Silvio Cesare, Deakin University

Page 2: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Introduction

Started off in industry (Qualys, now Volvent).

Have a Masters by Research.

About to receive a PhD from Deakin University.

Last 5 years in post-graduate University research.

Learnt some cool things along the way.

Page 3: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

What did I do at University?

Malwise v1 (Masters)

Malware variant detection system.

Malwise v2

Improved version.

Simseer Search

More improved malware variant search service.

Simseer Cluster

Binary clustering service.

Simseer

Binary comparison and visualization service.

Clonewise

Automated detection of embedded libraries in source.

Bugalyze

Detection of bugs using data flow analysis.

Page 4: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Outline

Mathematical Objects

Comparing

Similarity Searching

Classification

Clustering

Program Analysis

Page 5: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

An incomplete list of mathematical objects

Strings

Vectors

Sets

Sets of Objects

Trees

Graphs

Page 6: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Objects

Different objects have different comparison costs.

Example

Comparing two vectors is fairly fast.

Exact matching of two strings is fairly fast.

Inexact matching of two strings is somewhere in between: slower than exact matching, faster than comparing graphs.

Comparing two graphs is slow.

[Figure: two strings aligned character by character; sequence alignment is O(mn).]

Page 7: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Transforming one object to another

Problem

Comparing two 100kb strings using the edit distance is impractically slow.

Solution

Transform the strings into vectors.

Then, use a vector comparison – which is fast.

Examples

Comparing malware samples

Finding near duplicate web pages

Comparing E-Mails

ed(“hello”, “ggello”) = 2
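
The edit distance on this slide can be computed with the classic O(mn) dynamic programme. The following is a minimal sketch, assuming unit costs for insertion, deletion, and substitution (the Levenshtein distance); the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, in O(len(a) * len(b)) time."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("hello", "ggello"))
```

The quadratic cost is why this becomes impractical for two 100 KB strings: the inner loop would run on the order of 10^10 times.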

Page 8: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

N-Grams

Extract all N-length substrings (N-Grams) from

original string.

From training set of strings, choose best N-Grams.

Each unique N-Gram is an index in a vector.

The value of the element is the number of times it

occurs.

W|IEH}R

W|IE

|IEH

IEH}

EH}R
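
The steps above can be sketched in a few lines. This is an illustration only: the slide does not say how the "best" N-Grams are chosen, so the sketch assumes the most frequent ones in the training set, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(s, n):
    """All substrings of length n, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def choose_vocabulary(training_strings, n, size):
    """Assumption: 'best' means most frequent across the training set."""
    totals = Counter()
    for s in training_strings:
        totals.update(ngrams(s, n))
    return [gram for gram, _ in totals.most_common(size)]


def feature_vector(s, vocabulary, n):
    """One element per vocabulary N-Gram; the value is its count in s."""
    counts = Counter(ngrams(s, n))
    return [counts[gram] for gram in vocabulary]


print(ngrams("W|IEH}R", 4))
print(feature_vector("ababab", ["ab", "ba", "zz"], 2))
```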

Page 9: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Another N-Gram example

Extract N-Grams

Represent new object as a ‘Set of N-Grams’

Compare sets using set similarity metrics

Page 10: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

A Graph problem

Graph problems like approximate similarity are slow to solve.

Decompose graph into subgraphs of at most k-nodes.

Canonicalize small graphs, represent by adjacency matrix, transform to string.

Graph is now a ‘Set of Strings’.

Optionally represent as vector of ‘important k-subgraphs’.

Use Vector distance metrics to compare, index, and search.
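
A rough sketch of the decomposition, under stated assumptions: each subgraph is grown breadth-first from a start node until it holds k nodes, and the breadth-first visiting order stands in for canonicalization. A real canonical form needs more care than this, and the example graph is hypothetical.

```python
from collections import deque


def k_subgraph_strings(graph, k):
    """Turn a graph {node: [successors]} into a set of adjacency-matrix strings."""
    signatures = set()
    for start in graph:
        # Grow a subgraph of at most k nodes breadth-first from `start`.
        order = [start]
        queue = deque([start])
        while queue and len(order) < k:
            node = queue.popleft()
            for succ in graph.get(node, []):
                if succ not in order and len(order) < k:
                    order.append(succ)
                    queue.append(succ)
        # Adjacency matrix with rows and columns in visiting order.
        index = {node: i for i, node in enumerate(order)}
        rows = []
        for node in order:
            row = ["0"] * len(order)
            for succ in graph.get(node, []):
                if succ in index:
                    row[index[succ]] = "1"
            rows.append("".join(row))
        signatures.add("/".join(rows))
    return signatures


# Hypothetical control flow graph: a loop containing an if/else.
cfg = {
    "L_0": ["L_1", "L_7"],
    "L_1": ["L_2", "L_4"],
    "L_2": ["L_5"],
    "L_4": ["L_5"],
    "L_5": ["L_0"],
    "L_7": [],
}
print(sorted(k_subgraph_strings(cfg, 3)))
```

Because node names never appear in the strings, two graphs with the same local shapes produce the same set, which can then be compared with set similarity measures.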

Page 11: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

K-subgraph decomposition

[Figure: a control flow graph with nodes L_0 to L_7 (edges labelled true) is decomposed into k-node subgraphs, and each subgraph is written out as an adjacency matrix of 0s and 1s.]

Page 12: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Graphs – Case Study

Implemented in Malwise and Simseer

Take control flow graphs of programs.

Decompile into strings.

One:

Consider program as a vector of N-Grams of

decompiled strings.

Two:

Consider program as a set of strings.

[Figure: control flow graph with nodes L_0 to L_7 and edges labelled true, shown next to its decompiled form.]

proc(){
L_0:
    while (v1 || v2) {
L_1:
        if (v3) {
L_2:
        } else {
L_4:
        }
L_5:
    }
L_7:
    return;
}

Page 13: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Final Remarks on Objects

Know how to represent your problem.

Look into how the representation can be

approximated

By transforming it into another object

Vectors are often a good choice.

Page 14: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Comparing

Problem

Measure the similarity of (or distance between) two objects.

Solution

Represent objects mathematically.

Use one of a multitude of mathematical measures.

Examples

Malware similarity

Near duplicate web pages

Page 15: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Comparing Sets

A set is a collection of elements.

Given an equality function between elements, we

can measure set similarity.

Inexact matching

Jaccard index: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Dice coefficient: $s = \frac{2|A \cap B|}{|A| + |B|}$
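
Both measures are one-liners over Python sets. A minimal sketch; returning 1.0 for two empty sets is a convention assumed here, not something the slide specifies.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(a | b)


def dice(a, b):
    """Dice coefficient: twice the intersection over the sum of the sizes."""
    if not a and not b:
        return 1.0  # same assumed convention as above
    return 2 * len(a & b) / (len(a) + len(b))


a = {1, 2, 3}
b = {2, 3, 4}
print(jaccard(a, b), dice(a, b))
```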

Page 16: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Comparing Vectors – Ugh, math.

Euclidean Distance: $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$

Manhattan Distance: $d(p, q) = \sum_{i=1}^{n} |q_i - p_i|$

Cosine Similarity: $\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$
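
The three measures translate directly into code. A minimal sketch over plain Python sequences of equal length; the function names are chosen here for illustration.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def manhattan(p, q):
    """Sum of the absolute differences."""
    return sum(abs(qi - pi) for pi, qi in zip(p, q))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; undefined for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(cosine_similarity((1, 0), (0, 1)))
```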

Page 17: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Vector distance – a different look

A vector is an n-dimensional point in space.

E.g., a 2-d vector is <x,y>

Page 18: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Cosine similarity

Line from origin to n-dimensional point.

Given 2 lines, what’s the angle (theta) between

them?

The smaller the angle, the more similar.

Point A

Point B

Theta

Page 19: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Comparing Vectors – Case Study

Malwise v2

Feature vector of N-Grams of decompiled flowgraphs

Manhattan Distance

Simseer Search

Same feature vector

Euclidean Distance

Page 20: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Comparing Sets – Case Study

Malwise v1

An element is a graph invariant of the control flow

graph, represented as an integer.

A program is a set of integers.

Compare similarity between two programs using

Dice coefficient.

Page 21: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Malwise v1 - Comparing Sets

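[Figure: a four-node control flow graph (nodes 1 to 4, branch edges labelled T and F), with the outgoing edges of each node listed below.]

(1 -> 2), (1 -> 4)

(2 -> 3), ()

(), ()

(4 -> 3), ()

[Equation: a weighted form of the Dice coefficient s(A, B), with a weight w_x for each element x.]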

Page 22: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Comparing Sets of Strings in Malwise v2 – Case Study

String is a decompiled flowgraph.

Program is a set of strings.

Edit distance between strings.

Construct 1:1 mapping between elements of sets:

Such that the sum of distances is minimized.

Solved using ‘combinatorial optimisation’

Assignment Problem

Solution by “graph matching”
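
To make the 1:1 mapping concrete, here is a sketch that tries every mapping and keeps the cheapest. Brute force is swapped in for a real assignment-problem solver purely for illustration, so it only suits very small sets; it also assumes both sets have the same size, and the example strings are used only as sample data.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def min_cost_matching(p_strings, q_strings):
    """Return (total distance, pairs) for the cheapest 1:1 mapping."""
    if len(p_strings) != len(q_strings):
        raise ValueError("this sketch assumes equally sized sets")
    cost = [[edit_distance(p, q) for q in q_strings] for p in p_strings]
    best_total, best_perm = None, None
    # Exhaustive search over all n! mappings: illustration only.
    for perm in permutations(range(len(q_strings))):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    pairs = [(p_strings[i], q_strings[j]) for i, j in enumerate(best_perm)]
    return best_total, pairs


total, pairs = min_cost_matching(["BW|{B}BR", "BSSR", "BR"],
                                 ["BI{B}BR", "BSR", "BSSSR"])
print(total, pairs)
```

A polynomial-time assignment algorithm gives the same answer without enumerating every mapping, which matters once a program has more than a handful of strings.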

Page 23: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Malwise v2 - Comparing Sets of Strings

[Figure: the strings of two programs p and q (BW|{B}BR, BI{B}BR, BSSR, BSR, BSSSR, BR) are matched pairwise, with d = ed(p, q) for each matched pair. The control flow graph and decompiled proc() listing from the Graphs case study are repeated, with proc() encoded as the string W|IEH}R.]

Page 24: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Final Remarks on Comparing

Inexact matching is your friend.

Try to use known distance metrics.

They have useful properties and index better.

If it’s too slow to compare, transform the object.

Page 25: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Similarity Searching

Problem

Find all ‘similar’ objects to my query in a database

Example

Find all words in a dictionary with at most 3 differences

to my query word.

This problem is known as a ‘similarity search’

Solution

Naive exhaustive search.

Better to use ‘Metric Trees’
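
As one concrete example of a metric tree, here is a sketch of a BK-tree for the dictionary example. The BK-tree is chosen here because it is short; it is not claimed to be the structure used in the talk's own tools. It relies on the triangle inequality to skip subtrees that cannot contain a match.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class BKTree:
    """Each node is (word, {distance_to_child: child_node})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, radius):
        """All (distance, word) pairs with distance <= radius."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= radius:
                results.append((d, word))
            # Triangle inequality: matches can only sit under edges
            # labelled within [d - radius, d + radius].
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)


words = ["hello", "help", "hell", "shell", "world", "yellow"]
tree = BKTree(edit_distance)
for w in words:
    tree.add(w)
print(tree.search("hello", 1))
```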

Page 26: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Similarity Search Constraints

Variations

K-nearest neighbours – the k closest objects to the query.

All objects within a specific distance to the query.

Search based on using a ‘metric distance’.

Metric distances satisfy mathematical properties: non-negativity, d(x, y) = 0 only when x = y, symmetry, and the triangle inequality.

Examples

Euclidean Distance

Jaccard Distance

Cosine Distance is not a metric (it can violate the triangle inequality).
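
A small numeric check of that last point, assuming the common definition of cosine distance as one minus cosine similarity: going from a to c via b is "shorter" than going directly, which a metric never allows.

```python
import math


def cosine_distance(a, b):
    """Assumed definition: 1 - cos(theta) between non-zero vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a, b, c = (1, 0), (1, 1), (0, 1)
direct = cosine_distance(a, c)
via_b = cosine_distance(a, b) + cosine_distance(b, c)
# A metric requires direct <= via_b; here the inequality fails.
print(direct, via_b, direct <= via_b)
```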

Page 27: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Searching – Case Study

Malwise v2

Distance metric is Manhattan Distance.

Use VP-Trees to index and search in stage 1.

Use DBM-Trees to index and search in stage 2.

Implemented using open source GBDI Arboretum

library.

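[Figure: range search around a query q with radius r, showing the distance d(p, q) to an object p; labels: Malware, Query; panels: Query Malicious, Query Benign.]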

Page 28: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Final Remarks on Searching

Searching for inexact matches is useful.

Use good distance metrics.

Use open source libraries.

Page 29: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Classification

The problem:

Given a set of N classes.

And a query object.

Assign one of the classes to the object.

Examples

Is this binary (malicious, not malicious)?

Is this gmail email (primary, social, promotional)?

Is this web page (defaced, not defaced)?

[Figure: objects grouped into Class A and Class B.]

Page 30: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Classification Methodology

Supervised Learning

Given a training set of objects labelled by their class.

Build a model.

Then use the model to classify unknown objects.

Unsupervised Learning

No labelled data exists.

“Cluster” objects into classes.

Use clusters to train model.

Then classify as normal.
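
The supervised workflow (labelled training set, build a model, classify unknowns) can be sketched with about the simplest model there is, a nearest-centroid classifier. The classifier choice, the feature values, and the labels below are all made up for illustration.

```python
import math


def train_centroids(samples):
    """Build the model: one mean feature vector (centroid) per class label."""
    sums, counts = {}, {}
    for vector, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify(model, vector):
    """Assign the class whose centroid is closest in Euclidean distance."""
    def dist(centroid):
        return math.sqrt(sum((c - v) ** 2 for c, v in zip(centroid, vector)))
    return min(model, key=lambda label: dist(model[label]))


# Invented two-feature training set, labelled by class.
training = [
    ([1, 9], "malicious"), ([2, 8], "malicious"),
    ([8, 1], "benign"), ([9, 2], "benign"),
]
model = train_centroids(training)
print(classify(model, [2, 7]), classify(model, [7, 3]))
```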

Page 31: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Classification – What do I have to do?

Represent objects using “feature vectors”

A vector is an array.

Each element represents a “feature”.

The value of the element tends to be a count of something, or a size.

Feature examples

The number of times a dictionary word such as “Hello” appears in an E-Mail.

The size of a binary.

The number of times LoadLibraryA is executed.

Page 32: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Classification – WEKA?

Put the feature vectors into the text-based ARFF file

format.

Plug into the WEKA machine learning toolkit.

Experiment with different classifiers.

Part of your labelled data can be used to evaluate

the accuracy.

Page 33: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Weka ARFF file

@RELATION iris

@ATTRIBUTE sepallength NUMERIC

@ATTRIBUTE sepalwidth NUMERIC

@ATTRIBUTE petallength NUMERIC

@ATTRIBUTE petalwidth NUMERIC

@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA

5.1,3.5,1.4,0.2,Iris-setosa

4.9,3.0,1.4,0.2,Iris-setosa

4.7,3.2,1.3,0.2,Iris-setosa

4.6,3.1,1.5,0.2,Iris-setosa

5.0,3.6,1.4,0.2,?
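
Generating such a file from feature vectors takes only a few lines. A minimal sketch, assuming all features are numeric and the last attribute is the class; the helper name `to_arff` is hypothetical and not part of WEKA.

```python
def to_arff(relation, attributes, class_values, rows):
    """Render (feature_list, label) rows as ARFF text; label None becomes '?'."""
    lines = ["@RELATION " + relation, ""]
    for name in attributes:
        lines.append("@ATTRIBUTE %s NUMERIC" % name)
    lines.append("@ATTRIBUTE class {%s}" % ",".join(class_values))
    lines += ["", "@DATA"]
    for features, label in rows:
        values = [str(v) for v in features]
        # '?' is how ARFF marks a missing value, here the unknown class.
        values.append(label if label is not None else "?")
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


text = to_arff(
    "iris",
    ["sepallength", "sepalwidth", "petallength", "petalwidth"],
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    [([5.1, 3.5, 1.4, 0.2], "Iris-setosa"),
     ([4.9, 3.0, 1.4, 0.2], "Iris-setosa"),
     ([5.0, 3.6, 1.4, 0.2], None)],
)
print(text)
```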

Page 34: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

WEKA

[Figure: WEKA, University of Waikato.]

Page 35: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Classification – Case Study

Clonewise

Feature vector is set of features extracted from a pair

of packages.

Classify - do these packages share code (yes, no)?

Classify – is the 1st package embedded in the 2nd

package (yes, no)?

Page 36: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Final Remarks on Classification

Lots of problems can be framed as classification.

Learn how to use WEKA.

Vectors are very good representations.

Page 37: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Clustering

Problem

To group together “similar” objects under some notion

of similarity.

Easy solution

Represent objects using “feature vectors”.

Plug into WEKA.

[Figure: packages in Fedora Linux.]

Page 38: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Clustering - Case Study

Simseer Cluster

Represent binaries using N-Grams of decompiled

flowgraphs.

Use most frequent N-Grams as features.

Distance measure is cosine distance.
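
For a feel of how clustering with cosine distance behaves, here is a sketch that links any two feature vectors closer than a threshold and reports the connected groups. This simple threshold linkage is an assumption made for illustration, not the Simseer Cluster algorithm, and the vectors are invented.

```python
import math


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity of non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def cluster(vectors, threshold):
    """Group vector indices so that linked pairs are within `threshold`."""
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) <= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Invented N-Gram count vectors for four binaries.
vectors = [(5, 1, 0), (4, 1, 0), (0, 1, 6), (0, 2, 5)]
print(cluster(vectors, 0.1))
```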

Page 39: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Final Remarks on Clustering

A classic machine learning problem.

Again, learn to use WEKA.

Page 40: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Program Analysis

An incredibly large and deep field.

This section skims the surface.

Main approaches

Theorem Proving

Model Checking

Abstract Interpretation

Data Flow Analysis

Page 41: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Model Checking

Looks at program states generated by a program.

Some states indicate bugs.

Try BLAST, a model checker for small C programs.

Caveat - it’s pretty old now.

Page 42: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Theorem Proving - SMT

SMT – what is it?

An equation solver that covers the types of operations seen in machine code.

Approach for Bug Detection

User input can generally be anything, so treat it as a “symbolic” variable.

The rest is concrete.

Simulate execution of the program, plugging all the machine code that is executed into the solver formulas.

Concolic execution

Combining symbolic execution with concrete execution.

Page 43: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Concolic Execution

At branches, can we have user input that forces us

to go down each path?

Use the SMT solver to tell us.

Launch execution down ‘feasible’ paths.

Use the solver to tell us if bugs are present.

What user input, if any, can make this pointer NULL?
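
The question being put to the solver can be illustrated without one. The sketch below stands in for an SMT query by exhaustively trying every value of a one-byte input against a branch condition; this enumeration is a deliberate simplification that only works for tiny input domains, and the program under test is hypothetical.

```python
def feasible_inputs(path_condition, domain=range(256)):
    """Every input in the domain that satisfies the branch condition."""
    return [x for x in domain if path_condition(x)]


# Hypothetical program under test, for one byte of user input x:
#     if ((x * 3 + 1) % 256 == 0x10) ptr = NULL;
def reaches_null_branch(x):
    return (x * 3 + 1) % 256 == 0x10


# Hypothetical infeasible branch: an even number never equals 1.
def never_taken(x):
    return (x * 2) % 256 == 1


print(feasible_inputs(reaches_null_branch))
print(feasible_inputs(never_taken))
```

An empty result means the path is infeasible and need not be explored. A real solver answers the same question symbolically, without enumerating the input space.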

Page 44: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Concolic path-sensitive analysis

lea 0x4(%esp),%ecx
and $0xfffffff0,%esp
pushl -0x4(%ecx)
push %ebp
mov %esp,%ebp
push %ecx
sub $0x24,%esp
call 4011b0 <___main>
movl $0x0,-0x8(%ebp)
jmp 40115f <_main+0x2f>

movl $0x4020a0,(%esp)
call 4011b8 <_puts>
addl $0x1,-0x8(%ebp)

cmpl $0x9,-0x8(%ebp)
jle 40114f <_main+0x1f>

add $0x24,%esp
pop %ecx
pop %ebp
lea -0x4(%ecx),%esp
ret

[Figure: control flow graph of the four basic blocks above, with paths numbered 1 to 4.]

Page 45: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Abstract Interpretation

Abstract the execution of the program.

Example

Only consider the sign of a variable, not the actual

value.

Requires a transfer function

What an instruction does to the abstract data.

And a Join/Meet function

How data is combined when it meets from different

control flow.
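
The sign example can be written out as a tiny abstract domain. A minimal sketch, assuming integer variables and four abstract values (negative, zero, positive, unknown); the names are chosen here for illustration.

```python
NEG, ZERO, POS, TOP = "-", "0", "+", "?"  # TOP means the sign is unknown


def abstract(n):
    """Map a concrete integer to its sign."""
    return ZERO if n == 0 else (POS if n > 0 else NEG)


def join(a, b):
    """Combine facts arriving from two control flow paths."""
    return a if a == b else TOP


def add(a, b):
    """Transfer function for x + y on signs."""
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    if a == b and a != TOP:
        return a  # (+) + (+) is +, (-) + (-) is -
    return TOP    # mixed signs or unknown operands


def mul(a, b):
    """Transfer function for x * y on signs."""
    if ZERO in (a, b):
        return ZERO  # zero times anything is zero
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG


# x = 5; if (...) y = x * x; else y = -3; z = y * 0;
x = abstract(5)
y = join(mul(x, x), abstract(-3))
z = mul(y, abstract(0))
print(x, y, z)
```

The sign of y is lost at the join, yet the analysis still proves z is zero without ever knowing a concrete value.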

Page 46: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Data Flow Analysis

Similar to abstract interpretation.

Uses a transfer function and a join.

Implement both using a monotone framework.

Data Flow analysis is used by compilers.

Classic data flow problems

Reaching definitions: which assignments to a variable can reach a given point in the program.

Live variables: knowing if a variable will be read again before being assigned a new value.
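
The second problem, live variable analysis, fits in a short fixed-point loop. A minimal sketch, assuming one statement per node with explicit sets of defined and used variables; the four-statement program is hypothetical.

```python
def liveness(nodes):
    """nodes: {name: (defs, uses, successors)}. Returns (live_in, live_out)."""
    live_in = {n: set() for n in nodes}
    live_out = {n: set() for n in nodes}
    changed = True
    while changed:  # iterate until nothing changes (fixed point)
        changed = False
        for n, (defs, uses, succs) in nodes.items():
            out = set()
            for s in succs:
                out |= live_in[s]  # join: union over successors
            # Transfer: variables read here, plus those live afterwards
            # that this statement does not overwrite.
            new_in = set(uses) | (out - set(defs))
            if out != live_out[n] or new_in != live_in[n]:
                live_out[n], live_in[n] = out, new_in
                changed = True
    return live_in, live_out


# 1: x = input()   2: y = x + 1   3: x = 0   4: return x + y
program = {
    "1": ({"x"}, set(), ["2"]),
    "2": ({"y"}, {"x"}, ["3"]),
    "3": ({"x"}, set(), ["4"]),
    "4": (set(), {"x", "y"}, []),
}
live_in, live_out = liveness(program)
print(live_out)
```

After statement 2 only y is live: the old value of x is overwritten at statement 3 before anything reads it again.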

Page 47: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Data Flow Analysis – Case Study

Implemented in Bugalyze.

Example bug detection

After free(ptr), where is ptr used before it is reassigned, and is it used in another free?

Has found real bugs in Debian Linux.

Still a work-in-progress.

Page 48: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Bugalyze – Case Study

Page 49: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Final Remarks on Program Analysis

A wide and deep field.

Good to know the basic approaches.

Reversing is becoming more rigorous (think HexRays).

Page 50: A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Conclusion

Academia has some useful techniques.

It’s good to know some of the basic methods.

They will improve industrial programs.

Any questions?