a whirlwind tour of academic techniques for real-world security researchers
A WHIRLWIND TOUR
OF ACADEMIC TECHNIQUES
FOR REAL-WORLD SECURITY RESEARCHERS
Silvio Cesare, Deakin University
Introduction
Started off in industry (Qualys, now Volvent).
Have a Masters by Research.
About to receive a PhD from Deakin University.
Last 5 years in post-graduate University research.
Learnt some cool things along the way.
What did I do at University?
Malwise v1 (Masters)
Malware variant detection system.
Malwise v2
Improved version.
Simseer Search
Further improved malware variant search service.
Simseer Cluster
Binary clustering service.
Simseer
Binary comparison and visualization service.
Clonewise
Automated detection of embedded libraries in source.
Bugalyze
Detection of bugs using data flow analysis.
Outline
Mathematical Objects
Comparing
Similarity Searching
Classification
Clustering
Program Analysis
An incomplete list of mathematical objects
Strings
Vectors
Sets
Sets of Objects
Trees
Graphs
Objects
Different objects have different comparison costs.
Example
Comparing two vectors is fairly fast.
Exact matching two strings is fairly fast.
Inexact matching of two strings is moderately fast.
Comparing two graphs is slow.
[Figure: sequence alignment of two strings of lengths m and n, O(mn).]
Transforming one object to another
Problem
Comparing two 100 KB strings using the edit distance is impractically slow.
Solution
Transform the strings into vectors.
Then, use a vector comparison – which is fast.
Examples
Comparing malware samples
Finding near duplicate web pages
Comparing E-Mails
ed(“hello”, “ggello”) = 2
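The edit distance used in the example above can be sketched with the standard O(mn) dynamic programme. This is a minimal illustration, not the talk's own code; the function name `edit_distance` is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


print(edit_distance("hello", "ggello"))
```

The table has (m+1) x (n+1) cells, which is why two 100 KB strings are impractical to compare this way.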
N-Grams
Extract all N-length substrings (N-Grams) from the original string.
From a training set of strings, choose the best N-Grams.
Each unique N-Gram is an index in a vector.
The value of the element is the number of times the N-Gram occurs.
W|IEH}R
W|IE
|IEH
IEH}
EH}R
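A minimal sketch of the string-to-vector transformation. The helper names are hypothetical, and picking the most frequent N-Grams is only one possible reading of "best".

```python
from collections import Counter


def ngrams(s: str, n: int) -> list[str]:
    """All N-length substrings of s, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def choose_features(training: list[str], n: int, top: int) -> list[str]:
    """Assumption: 'best' means the `top` most frequent N-Grams in the training set."""
    counts = Counter(g for s in training for g in ngrams(s, n))
    return [g for g, _ in counts.most_common(top)]


def to_vector(s: str, features: list[str], n: int) -> list[int]:
    """Element i counts how often features[i] occurs in s."""
    counts = Counter(ngrams(s, n))
    return [counts[g] for g in features]


print(ngrams("W|IEH}R", 4))
```

Once every string is a fixed-length vector, any vector distance can be used in place of the edit distance.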
Another N-Gram example
Extract N-Grams
Represent new object as a ‘Set of N-Grams’
Compare sets using set similarity metrics
A Graph problem
Graph problems like approximate similarity are slow to solve.
Decompose graph into subgraphs of at most k-nodes.
Canonicalize small graphs, represent by adjacency matrix, transform to string.
Graph is now a ‘Set of Strings’.
Optionally represent as vector of ‘important k-subgraphs’.
Use Vector distance metrics to compare, index, and search.
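The decomposition steps above can be sketched as follows. This is an illustration under assumptions: subgraphs are grown breadth-first from each node, and canonicalization is brute force over node orderings, which is only workable for small k. The talk does not specify either choice.

```python
from collections import deque
from itertools import permutations


def k_subgraph(graph: dict, start: str, k: int) -> list:
    """Collect up to k nodes reachable from `start` in breadth-first order."""
    seen, queue = [start], deque([start])
    while queue and len(seen) < k:
        for succ in graph.get(queue.popleft(), []):
            if succ not in seen and len(seen) < k:
                seen.append(succ)
                queue.append(succ)
    return seen


def canonical_string(graph: dict, nodes: list) -> str:
    """Smallest adjacency-matrix bit string over all orderings of `nodes`."""
    best = None
    for order in permutations(nodes):
        bits = "".join("1" if v in graph.get(u, []) else "0"
                       for u in order for v in order)
        if best is None or bits < best:
            best = bits
    return best


def decompose(graph: dict, k: int) -> set:
    """Represent the graph as a set of canonical k-subgraph strings."""
    return {canonical_string(graph, k_subgraph(graph, n, k)) for n in graph}


# Hypothetical diamond-shaped graph, given as successor lists.
g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(decompose(g, 3))
```

Because the canonical form ignores node names, two graphs with the same shape but different labels produce the same strings.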
K-subgraph decomposition
[Figure: a control flow graph with nodes L_0 to L_7 is decomposed into k-node subgraphs; each subgraph is written out as an adjacency matrix bit string.]
Graphs – Case Study
Implemented in Malwise and Simseer
Take control flow graphs of programs.
Decompile into strings.
One:
Consider program as a vector of N-Grams of
decompiled strings.
Two:
Consider program as a set of strings.
[Figure: control flow graph with nodes L_0 to L_7 and branch edges labelled true, alongside its decompiled form.]
proc(){
L_0:
    while (v1 || v2) {
L_1:
        if (v3) {
L_2:
        } else {
L_4:
        }
L_5:
    }
L_7:
    return;
}
Final Remarks on Objects
Know how to represent your problem.
Look into how the representation can be approximated by transforming it into another object.
Vectors are often a good choice.
Comparing
Problem
Measure the similarity of (or distance between) two objects.
Solution
Represent objects mathematically.
Choose from the multitude of mathematical measures available.
Examples
Malware similarity
Near duplicate web pages
Comparing Sets
A set is a collection of elements.
Given an equality function between elements, we
can measure set similarity.
Inexact matching
Jaccard index
Dice coefficient
$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
$$s(A,B) = \frac{2|A \cap B|}{|A| + |B|}$$
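Both set measures are a few lines each. A minimal sketch; returning 1.0 for two empty sets is an assumption made here to avoid dividing by zero.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a: set, b: set) -> float:
    """Twice the intersection size over the sum of the set sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


a, b = {1, 2, 3}, {2, 3, 4}
print(jaccard(a, b), dice(a, b))
```

Both range from 0 (disjoint) to 1 (identical), and elements only need an equality test, here Python's hashing of set members.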
Comparing Vectors – Ugh, math.
Euclidean Distance
Manhattan Distance
Cosine Similarity
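Euclidean distance: $$d(p,q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
Manhattan distance: $$d(p,q) = \sum_{i=1}^{n} |p_i - q_i|$$
Cosine similarity: $$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$$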
Vector distance – a different look
A vector is an n-dimensional point in space.
E.g., a 2-d vector is <x,y>
Cosine similarity
Line from origin to n-dimensional point.
Given 2 lines, what’s the angle (theta) between
them?
The smaller the angle, the more similar.
[Figure: lines from the origin to Point A and Point B, with angle Theta between them.]
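The three vector measures named on these slides can be sketched in plain Python. The function names are assumptions, and vectors are taken to be equal-length sequences of numbers.

```python
import math


def euclidean(p, q) -> float:
    """Straight-line distance between two n-dimensional points."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def manhattan(p, q) -> float:
    """Sum of the absolute per-dimension differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def cosine_similarity(a, b) -> float:
    """cos(theta) between the lines from the origin to a and b.

    Undefined (division by zero) if either vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(euclidean((0, 0), (3, 4)), manhattan((1, 2), (4, 6)))
print(cosine_similarity((1, 0), (0, 1)))
```

Cosine similarity depends only on direction: scaling a vector leaves it unchanged, whereas both distances grow with the scale.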
Comparing Vectors – Case Study
Malwise v2
Feature vector of N-Grams of decompiled flowgraphs
Manhattan Distance
Simseer Search
Same feature vector
Euclidean Distance
Comparing Sets – Case Study
Malwise v1
An element is a graph invariant of the control flow
graph, represented as an integer.
A program is a set of integers.
Compare similarity between two programs using
Dice coefficient.
Malwise v1 - Comparing Sets
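[Figure: a control flow graph with nodes 1 to 4 and true/false branch edges, listed as successor pairs: (1 -> 2), (1 -> 4); (2 -> 3), (); (), (); (4 -> 3), ().]
[Equation: a weighted form of the Dice coefficient s(A,B), with per-element terms w_i and x_i summed over A, B, and their intersection.]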
Comparing Sets of Strings in Malwise v2 – Case Study
String is a decompiled flowgraph.
Program is a set of strings.
Edit distance between strings.
Construct 1:1 mapping between elements of sets:
Such that the sum of distances is minimized.
Solved using ‘combinatorial optimisation’
Assignment Problem
Solution by “graph matching”
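A minimal sketch of the 1:1 mapping as an assignment problem. It tries every permutation, so it only suits small sets; a real system would use a polynomial-time assignment algorithm. Padding the smaller set with empty strings is an assumption, and which slide string belongs to which program is also assumed.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def set_distance(ps, qs) -> int:
    """Minimum total edit distance over all 1:1 mappings between ps and qs."""
    ps, qs = list(ps), list(qs)
    size = max(len(ps), len(qs))
    ps += [""] * (size - len(ps))   # assumption: unmatched strings pair with ""
    qs += [""] * (size - len(qs))
    cost = [[edit_distance(p, q) for q in qs] for p in ps]
    return min(sum(cost[i][j] for i, j in enumerate(perm))
               for perm in permutations(range(size)))


program_a = ["BW|{B}BR", "BI{B}BR", "BSSR", "BSR", "BSSSR", "BR"]
program_b = ["BW|{B}BR", "BSSR", "BR"]
print(set_distance(program_a, program_b))
```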
Malwise v2 - Comparing Sets of Strings
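[Figure: strings p and q from two programs are matched 1:1, with d = ed(p,q) on each matched pair.]
BW|{B}BR
BI{B}BR
BSSR
BSR
BSSSR
BR
BW|{B}BR
BSSR
BR
[Figure: control flow graph with nodes L_0 to L_7 and branch edges labelled true, with its string form and decompiled code.]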
W|IEH}R
proc(){
L_0:
    while (v1 || v2) {
L_1:
        if (v3) {
L_2:
        } else {
L_4:
        }
L_5:
    }
L_7:
    return;
}
Final Remarks on Comparing
Inexact matching is your friend.
Try to use known distance metrics.
They have useful properties and index better.
If it’s too slow to compare, transform the object.
Similarity Searching
Problem
Find all ‘similar’ objects to my query in a database
Example
Find all words in a dictionary with at most 3 differences
to my query word.
This problem is known as a ‘similarity search’
Solution
Naive exhaustive search.
Better to use ‘Metric Trees’
Similarity Search Constraints
Variations
K-nearest neighbours – the k closest objects to the query.
All objects within a specific distance to the query.
Search based on using a ‘metric distance’.
Metric distances satisfy mathematical properties.
Examples
Euclidean Distance
Jaccard Distance
Counter-example: Cosine Distance is not a metric.
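One of the metric properties is the triangle inequality, d(x,z) <= d(x,y) + d(y,z). A small check, assuming cosine distance is defined as 1 minus cosine similarity, shows where it fails. The three points are hypothetical.

```python
import math
from itertools import permutations


def cosine_distance(a, b) -> float:
    """Assumed definition: 1 - cos(theta). Undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def manhattan(p, q) -> float:
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def violates_triangle_inequality(dist, points, eps=1e-12) -> bool:
    """True if some triple has d(x,z) > d(x,y) + d(y,z)."""
    return any(dist(x, z) > dist(x, y) + dist(y, z) + eps
               for x, y, z in permutations(points, 3))


points = [(1, 0), (1, 1), (0, 1)]
print(violates_triangle_inequality(cosine_distance, points))
print(violates_triangle_inequality(manhattan, points))
```

Metric trees prune branches using exactly this inequality, which is why a non-metric distance cannot be indexed safely with them.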
Searching – Case Study
Malwise v2
Distance metric is Manhattan Distance.
Use VP-Trees to index and search in stage 1.
Use DBM-Trees to index and search in stage 2.
Implemented using open source GBDI Arboretum
library.
[Figure: range search of radius r around a query q, with d(p,q) the distance to a malware sample p; queries are labelled malicious or benign.]
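A minimal vantage-point tree sketch with a range search, written from the general idea rather than from Malwise or the Arboretum library. The class and method names are assumptions. It needs a metric distance, since the pruning relies on the triangle inequality.

```python
import random


class VPTree:
    def __init__(self, points, dist):
        self.dist = dist
        self.root = self._build(list(points))

    def _build(self, points):
        if not points:
            return None
        vp = points.pop(random.randrange(len(points)))  # random vantage point
        if not points:
            return (vp, 0.0, None, None)
        dists = [self.dist(vp, p) for p in points]
        mu = sorted(dists)[len(dists) // 2]             # median distance to vp
        inner = [p for p, d in zip(points, dists) if d < mu]
        outer = [p for p, d in zip(points, dists) if d >= mu]
        return (vp, mu, self._build(inner), self._build(outer))

    def range_search(self, query, radius):
        """Return all indexed points within `radius` of `query`."""
        found, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            vp, mu, inner, outer = node
            d = self.dist(query, vp)
            if d <= radius:
                found.append(vp)
            if d - radius < mu:    # query ball may overlap the inner region
                stack.append(inner)
            if d + radius >= mu:   # query ball may overlap the outer region
                stack.append(outer)
        return found


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


tree = VPTree([(0, 0), (1, 1), (5, 5), (9, 9)], manhattan)
print(sorted(tree.range_search((0, 1), 2)))
```

Whole subtrees are skipped whenever the query ball cannot reach them, which is what makes this faster than the naive exhaustive search on large databases.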
Final Remarks on Searching
Searching for inexact matches is useful.
Use good distance metrics.
Use open source libraries.
Classification
The problem:
Given a set of N classes.
And a query object.
Assign one of the classes to the object.
Examples
Is this binary (malicious, not malicious)?
Is this Gmail email (primary, social, promotional)?
Is this web page (defaced, not defaced)?
[Figure: objects separated into Class A and Class B.]
Classification Methodology
Supervised Learning
Given a training set of objects labelled by their class.
Build a model.
Then use the model to classify unknown objects.
Unsupervised Learning
No labelled data exists.
“Cluster” objects into classes.
Use clusters to train model.
Then classify as normal.
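As one concrete instance of supervised learning, here is a k-nearest-neighbours sketch. The classifier choice, the Manhattan distance, and the toy training data are all assumptions for illustration; the "model" is simply the stored training set.

```python
from collections import Counter


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def knn_classify(training, query, k=3):
    """training: list of (feature_vector, label) pairs.

    Returns the majority label among the k training objects nearest to query.
    """
    nearest = sorted(training, key=lambda item: manhattan(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled training set of 2-d feature vectors.
training = [
    ((1, 1), "benign"), ((2, 1), "benign"), ((1, 2), "benign"),
    ((9, 9), "malicious"), ((8, 9), "malicious"), ((9, 8), "malicious"),
]
print(knn_classify(training, (8, 8)))
```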
Classification – What do I have to do?
Represent objects using “feature vectors”
A vector is an array.
Each element represents a “feature”.
The value of the element tends to be a count of something, or a size.
Feature examples
The number of times a dictionary word such as “Hello” appears in an E-Mail.
The size of a binary.
The number of times LoadLibraryA is executed.
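A sketch of turning an E-Mail into a feature vector of word counts plus a size. The dictionary words and the function name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical dictionary; each word becomes one element of the vector.
DICTIONARY = ["hello", "invoice", "password"]


def email_features(text: str) -> list[int]:
    """One count per dictionary word, followed by the message size in characters."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    return [words[w] for w in DICTIONARY] + [len(text)]


print(email_features("Hello, hello! Your password?"))
```

Every object maps to a vector of the same length, so objects of very different sizes become directly comparable.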
Classification – WEKA?
Put the feature vectors into the text-based ARFF file
format.
Plug into the WEKA machine learning toolkit.
Experiment with different classifiers.
Part of your labelled data can be used to evaluate
the accuracy.
Weka ARFF file
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,?
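Since ARFF is plain text, feature vectors can be written out with a few lines of Python. A minimal sketch for numeric attributes and one nominal class; the function name is an assumption, and an unlabelled row gets `?`, the ARFF marker for a missing value.

```python
def to_arff(relation, attributes, classes, rows) -> str:
    """rows: list of (feature_vector, label) pairs; label None means unknown."""
    lines = [f"@RELATION {relation}"]
    lines += [f"@ATTRIBUTE {name} NUMERIC" for name in attributes]
    lines.append("@ATTRIBUTE class {" + ",".join(classes) + "}")
    lines.append("@DATA")
    for features, label in rows:
        values = ",".join(str(v) for v in features)
        lines.append(values + "," + (label or "?"))
    return "\n".join(lines) + "\n"


print(to_arff(
    "iris",
    ["sepallength", "sepalwidth", "petallength", "petalwidth"],
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    [([5.1, 3.5, 1.4, 0.2], "Iris-setosa"), ([5.0, 3.6, 1.4, 0.2], None)],
))
```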
WEKA
[Figure: screenshot of the WEKA toolkit. 10/25/2013, University of Waikato.]
Classification – Case Study
Clonewise
Feature vector is set of features extracted from a pair
of packages.
Classify - do these packages share code (yes, no)?
Classify – is the 1st package embedded in the 2nd
package (yes, no)?
Final Remarks on Classification
Lots of problems can be framed as classification.
Learn how to use WEKA.
Vectors are very good representations.
Clustering
Problem
To group together “similar” objects under some notion
of similarity.
Easy solution
Represent objects using “feature vectors”.
Plug into WEKA.
[Figure: clustering of packages in Fedora Linux.]
Clustering - Case Study
Simseer Cluster
Represent binaries using N-Grams of decompiled
flowgraphs.
Use most frequent N-Grams as features.
Distance measure is cosine distance.
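A small clustering sketch using cosine distance. Single-linkage grouping with a fixed threshold is an assumption chosen for brevity; the talk does not say which clustering algorithm Simseer Cluster uses.

```python
import math


def cosine_distance(a, b) -> float:
    """1 - cos(theta); undefined for all-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster(vectors, threshold):
    """Group vector indices; any pair within `threshold` shares a cluster."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) <= threshold:
                parent[find(i)] = find(j)   # merge the two clusters

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical N-Gram count vectors for four binaries.
print(cluster([(1, 0), (10, 1), (0, 1), (1, 12)], 0.1))
```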
Final Remarks on Clustering
A classic machine learning problem.
Again, learn to use WEKA.
Program Analysis
An incredibly large and deep field.
This section skims the surface.
Main approaches
Theorem Proving
Model Checking
Abstract Interpretation
Data Flow Analysis
Model Checking
Looks at program states generated by a program.
Some states indicate bugs.
Try BLAST, a model checker for small C programs.
Caveat - it’s pretty old now.
Theorem Proving - SMT
SMT – what is it?
An equation solver that covers the types of operations seen in machine code.
Approach for Bug Detection
User input can be anything generally, so treat this as a “symbolic” variable.
The rest is concrete.
Simulate execution of the program, plugging all the machine code that is executed into the solver formulas.
Concolic execution
Combining symbolic execution with concrete execution.
Concolic Execution
At branches, can we have user input that forces us
to go down each path?
Use the SMT solver to tell us.
Launch execution down ‘feasible’ paths.
Use the solver to tell us if bugs are present.
What user input, if any, can make this pointer NULL?
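The path-feasibility question can be illustrated with a toy. Here a brute-force search over 8-bit inputs stands in for the SMT solver, so this is not real symbolic execution, only the shape of the idea. The branch conditions are hypothetical.

```python
def feasible_input(constraints, domain=range(256)):
    """Stand-in for an SMT solver: an input satisfying every constraint, or None."""
    for x in domain:
        if all(c(x) for c in constraints):
            return x
    return None


def explore(branches, path=(), constraints=()):
    """Return (path, witness input) for every feasible path through `branches`."""
    witness = feasible_input(constraints)
    if witness is None:
        return []                         # infeasible path: prune it
    if len(path) == len(branches):
        return [(path, witness)]
    cond = branches[len(path)]
    taken = explore(branches, path + (True,), constraints + (cond,))
    skipped = explore(branches, path + (False,),
                      constraints + (lambda x, c=cond: not c(x),))
    return taken + skipped


# Three hypothetical branches on one symbolic input byte x.
branches = [lambda x: x > 100, lambda x: x % 2 == 0, lambda x: x < 50]
for path, x in explore(branches):
    print(path, x)
```

Each witness is a concrete input that drives execution down that path. Asking whether a bug condition is reachable is the same query with the bug condition added as one more constraint.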
Concolic path-sensitive analysis
[Figure: control flow graph of the code below, one box per basic block, with numbered paths.]
lea 0x4(%esp),%ecx
and $0xfffffff0,%esp
pushl -0x4(%ecx)
push %ebp
mov %esp,%ebp
push %ecx
sub $0x24,%esp
call 4011b0 <___main>
movl $0x0,-0x8(%ebp)
jmp 40115f <_main+0x2f>

movl $0x4020a0,(%esp)
call 4011b8 <_puts>
addl $0x1,-0x8(%ebp)

cmpl $0x9,-0x8(%ebp)
jle 40114f <_main+0x1f>

add $0x24,%esp
pop %ecx
pop %ebp
lea -0x4(%ecx),%esp
ret
Abstract Interpretation
Abstract the execution of the program.
Example
Only consider the sign of a variable, not the actual
value.
Requires a transfer function
What an instruction does to the abstract data.
And a Join/Meet function
How data is combined when it meets from different
control flow.
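The sign example can be sketched as a tiny abstract domain. The encoding, the function names, and the choice of operations are assumptions; a full analysis would also need a bottom element and a fixed-point loop over the control flow graph.

```python
NEG, ZERO, POS, TOP = "-", "0", "+", "?"   # TOP means the sign is unknown


def abstract(n: int) -> str:
    """Map a concrete integer to its sign."""
    return ZERO if n == 0 else (POS if n > 0 else NEG)


def join(a: str, b: str) -> str:
    """Combine facts arriving from two control flow paths."""
    return a if a == b else TOP


def transfer_mul(a: str, b: str) -> str:
    """What multiplication does to signs."""
    if ZERO in (a, b):
        return ZERO
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG


def transfer_add(a: str, b: str) -> str:
    """What addition does to signs."""
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    return a if a == b and a != TOP else TOP


# if (c) x = 5; else x = 7;   then   y = x * -2;
x = join(abstract(5), abstract(7))
y = transfer_mul(x, abstract(-2))
print(x, y)
```

The analysis never learns the value of x, yet it can still prove that y is negative on every path.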
Data Flow Analysis
Similar to abstract interpretation.
Uses a transfer function and a join.
Implement both using a monotone framework.
Data Flow analysis is used by compilers.
Classic data flow problems
Reaching definitions: which assignments to a variable can reach a given point.
Liveness: knowing if a variable will be read again before being assigned a new value.
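The second classic problem, liveness, can be sketched as an iterate-until-stable loop. The block names and the example loop are hypothetical; `use` holds variables read before any write in a block, and `defs` holds variables written.

```python
def liveness(blocks, successors):
    """blocks: name -> (use, defs) sets. Returns (live_in, live_out) per block."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:                      # repeat until a fixed point is reached
        changed = False
        for b, (use, defs) in blocks.items():
            # Join: a variable is live out if it is live into any successor.
            out = set().union(*(live_in[s] for s in successors.get(b, [])))
            # Transfer: reads make a variable live, writes kill it.
            new_in = use | (out - defs)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out


# i = 0; while (i <= 9) { puts(...); i += 1; }
blocks = {
    "entry": (set(), {"i"}),
    "cond": ({"i"}, set()),
    "body": ({"i"}, {"i"}),
    "exit": (set(), set()),
}
successors = {"entry": ["cond"], "cond": ["body", "exit"], "body": ["cond"]}
live_in, live_out = liveness(blocks, successors)
print(live_in, live_out)
```

The sets only ever grow and are bounded by the number of variables, so the loop terminates. That is the monotone framework in miniature.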
Data Flow Analysis – Case Study
Implemented in Bugalyze.
Example bug detection
After free(ptr), where is ptr used before it is reassigned, and is it used in another free?
Has found real bugs in Debian Linux.
Still a work-in-progress.
Bugalyze – Case Study
Final Remarks on Program Analysis
A wide and deep field.
Good to know the basic approaches.
Reversing is becoming more rigorous (think HexRays).
Conclusion
Academia has some useful techniques.
It’s good to know some of the basic methods.
Will improve industrial programs.
Any questions?