A Machine Learning Approach to Android Malware Detection

Justin Sahs and Prof. Latifur Khan


The Problem

Smartphones represent a significant and growing proportion of computing devices

Android in particular is the fastest growing smartphone platform, and has 52.5% market share*

The power of the Android platform allows for applications providing a variety of services, including sensitive services like banking

This power can also be leveraged by malware

*http://www.gartner.com/it/page.jsp?id=1848514

The Problem (cont.)

The Android Market grew from 2,300 applications in March 2009 to 400,000 applications by January 2012, and this growth has attracted a significant increase in malware for Android.

TrendMicro, a global leader in antivirus, has predicted that Android malware will grow to 129,000 instances by December 2012.

Anyone can develop Android applications and host them on the Android Market. Online markets do not have a process to check Android applications for malware.

On Feb 2 this year, Google added a new security feature to its Android Market to fight malware; it scans every new submission and current apps for anomalous behavior.* This new system does not apply to alternative markets.

…

[Figure: Various Android Markets]

*Google Android Bouncer

The Problem (cont.)

Smartphones are becoming increasingly ubiquitous. A report from Gartner shows that over 100 million smartphones were sold in the first quarter of 2011, an increase of 85% over the first quarter of 2010*.

Malware often disguises itself as a normal application

Malware can cause financial loss and theft of private information

Users need robust malware detection software

*http://www.gartner.com/it/page.jsp?id=1689814

Static Analysis: Data Mining Approach

[Figure: Malware Application Detection. Training Apps → Training → Model (One Class SVM); Testing Apps → Testing → Prediction]

Feature Extraction

We use an open-source library called Androguard to extract features from applications:

• Permissions

• Control Flow Graphs (one for every method in the application)

We use these extracted features to train a machine learning-based classifier

Feature set is not homogeneous (bit-vector, string and graph representations)

Feature Extraction: Acquiring Applications


APK files are Android package files. Applications that are packaged as APKs can be installed on any compatible Android device.

We harvested benign APK files from the official Android Market using the android-market-api* and combined them with a collection of known malware

We used 2,172 APK files in our analysis.

*http://code.google.com/p/android-market-api/

Background: Structure of an Android Application

AndroidManifest.xml: contains permissions and other metadata

META-INF/: contains application signing information

assets/: contains any auxiliary files. The Android framework does not generate IDs for assets; they are accessed through the AssetManager API.

classes.dex: the compiled program code

res/: contains auxiliary files (resources) with IDs generated by the Android framework

resources.arsc: contains compiled XML files and resources
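Because an APK is a ZIP archive, its top-level layout can be inspected with the Python standard library. The following is a minimal sketch, not part of the original pipeline; the function name `top_level_entries` and the dummy entry names in the demo are illustrative assumptions.

```python
import io
import zipfile


def top_level_entries(apk):
    """Return the sorted top-level names inside an APK (a ZIP archive).

    `apk` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(apk) as archive:
        return sorted({name.split("/", 1)[0] for name in archive.namelist()})


# Demo on a tiny in-memory archive that imitates an APK layout.
# The entries are empty dummies, not a real application.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as fake_apk:
    for name in ["AndroidManifest.xml", "classes.dex", "META-INF/CERT.RSA",
                 "assets/data.bin", "res/layout/main.xml", "resources.arsc"]:
        fake_apk.writestr(name, b"")

print(top_level_entries(buffer))
```

On a real APK the same function would be called with the file path instead of the in-memory buffer.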

Classification

Classification: Permissions

Built-in permissions

• Access to hardware and certain parts of the Android API

• Based on a list of 121 standard built-in permissions, we construct a 121-bit vector, with a 1 for each requested permission and a 0 otherwise

Non-standard permissions

• Mainly access to other applications’ APIs

• We split the strings into three sections: a prefix (usually “com” or “org”), a section of organization and product identifiers, and the permission name, ignoring instances of the strings “android” and “permission,” which are ubiquitous
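The two permission encodings can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the three-entry `STANDARD` list stands in for the real list of 121 built-in permissions, `com.example.app.permission.READ_DATA` is a made-up permission, and splitting the permission name on underscores is an assumption.

```python
IGNORED = {"android", "permission"}  # ubiquitous strings that are dropped


def permission_bit_vector(requested, standard_permissions):
    """One bit per standard permission: 1 if requested, 0 otherwise."""
    requested = set(requested)
    return [1 if name in requested else 0 for name in standard_permissions]


def split_permission(permission):
    """Split a non-standard permission into (prefix, identifiers, name tokens)."""
    *scope, name = permission.split(".")
    scope = [part for part in scope if part not in IGNORED]
    # Assumption: the permission name is tokenized on underscores.
    return scope[:1], scope[1:], name.split("_")


# Hypothetical stand-in for the 121 standard built-in permissions.
STANDARD = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.VIBRATE",
]
requested = ["android.permission.SEND_SMS", "com.example.app.permission.READ_DATA"]

print(permission_bit_vector(requested, STANDARD))
# [0, 1, 0]
print(split_permission("com.example.app.permission.READ_DATA"))
# (['com'], ['example', 'app'], ['READ', 'DATA'])
```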

Classification: Permissions (example)

Requested Permissions:

Built-in:
android.permission.WRITE_EXTERNAL_STORAGE
android.permission.CALL_PHONE
android.permission.EXPAND_STATUS_BAR
android.permission.GET_TASKS
android.permission.READ_CONTACTS
android.permission.SET_WALLPAPER
android.permission.SET_WALLPAPER_HINTS
android.permission.VIBRATE
android.permission.WRITE_SETTINGS
android.permission.READ_PHONE_STATE
android.permission.ACCESS_NETWORK_STATE
android.permission.WRITE_APN_SETTINGS
android.permission.RECEIVE_SMS
android.permission.RECEIVE_MMS
android.permission.RECEIVE_WAP_PUSH
android.permission.INTERNET
android.permission.SEND_SMS
android.permission.READ_SMS
android.permission.WRITE_SMS

Non-standard:
com.android.launcher.permission.INSTALL_SHORTCUT
com.android.launcher.permission.UNINSTALL_SHORTCUT
com.android.launcher.permission.READ_SETTINGS
com.android.launcher.permission.WRITE_SETTINGS
android.permission.GLOBAL_SEARCH_CONTROL

Represented as a bit vector:

00000100 00000000 00000000 00100000 00000000 00010000 01000000 10000000 00000100 00101000 01110001 00000000 00011000 00000101 00100001 1

And three sets of strings:

[“com”],[“launcher”],[“CONTROL”, “GLOBAL”, “INSTALL”, “READ”, “SEARCH”, “SETTINGS”, “SHORTCUT”]

Classification: Control Flow Graphs (CFGs)


Constructed from the compiled bytecode of the application

Each method can be represented as a graph

• Nodes represent contiguous sequences of non-jump instructions

• Edges represent jumps (goto, if, loops, etc.)

CFGs encode the behavior of the methods they represent, and are therefore a potential source of discriminating information

The actual bytecode is often obfuscated, either by the compiler for optimization or deliberately to prevent reverse engineering or detection

We perform reduction on the extracted CFGs to counteract obfuscation

Classification: CFG Reduction

We reduce graphs according to three rules:

1) Contiguous instruction blocks are merged

2) Unconditional jumps are merged with their target

3) Contiguous conditional jumps that share a destination are merged.

[Figure: example graphs illustrating reduction rules (1), (2), and (3)]
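As an illustration of rule (1) only, the sketch below merges a block into its predecessor when the predecessor has that block as its single successor and the block has no other predecessor. This is one possible reading of "contiguous instruction blocks", not the authors' implementation; the graph encoding (a successor dictionary plus an instruction list per node) and the function name are assumptions.

```python
def merge_contiguous_blocks(successors, instructions):
    """Merge u -> v whenever v is u's only successor and u is v's only predecessor.

    `successors` maps every node to a list of successor nodes (every node
    must appear as a key); `instructions` maps every node to its instruction list.
    Returns new dictionaries; the inputs are not modified.
    """
    succ = {node: list(targets) for node, targets in successors.items()}
    instrs = {node: list(body) for node, body in instructions.items()}
    changed = True
    while changed:
        changed = False
        # Recompute predecessor lists for the current graph.
        preds = {node: [] for node in succ}
        for u, targets in succ.items():
            for v in targets:
                preds[v].append(u)
        for u in list(succ):
            if len(succ[u]) == 1:
                v = succ[u][0]
                if v != u and preds[v] == [u]:
                    instrs[u] += instrs.pop(v)  # append v's instructions to u
                    succ[u] = succ.pop(v)       # u inherits v's outgoing edges
                    changed = True
                    break
    return succ, instrs


# Hypothetical method: a -> b -> c, then c branches to d or e.
cfg = {"a": ["b"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}
body = {"a": ["i1"], "b": ["i2"], "c": ["i3"], "d": ["i4"], "e": ["i5"]}
reduced, merged = merge_contiguous_blocks(cfg, body)
print(reduced)  # {'a': ['d', 'e'], 'd': [], 'e': []}
print(merged["a"])  # ['i1', 'i2', 'i3']
```

Rules (2) and (3) would need the jump type of each edge, which this simplified encoding does not carry.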

Classification: Training and Testing

Once we have extracted our four feature representations, we use them to train a One-Class Support Vector Machine (1C-SVM)

The 1C-SVM is designed to detect test examples that are significantly different from the training data

• We have far more examples of benign applications than malware to train on

One-Class SVM (1C-SVM)

A Support Vector Machine (SVM) finds the maximum-margin separating hyperplane between the positive and negative training examples in some feature space

• i.e., it maximizes the distance between the hyperplane and the closest examples from each class

The SVM uses comparison functions called kernels to map each extracted feature into a high-dimensional feature space

• The linear separation of the data in the feature space may correspond to a very non-linear separation of the original data

• Each kernel takes two feature representations as input and outputs a number that measures similarity

Features and Kernels

String Kernel

Graph Kernel

• The Set Kernel applies some other kernel to each pair of elements from the two input sets, e.g. the String Kernel if the elements are strings

Classification: Training and Testing (cont.)


We use a data mining library, scikit-learn (http://scikit-learn.org/), which implements a convenient wrapper around the popular LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)

The use of an SVM requires specialized functions called kernels that are used to compare features between applications

• We implement these kernels ourselves
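A minimal sketch of how a custom kernel can be passed to scikit-learn's one-class SVM through a precomputed kernel matrix. The toy bit-vectors, the `nu` value, and the helper `matching_bits_kernel` are illustrative assumptions; this is not the authors' code or data.

```python
import numpy as np
from sklearn.svm import OneClassSVM


def matching_bits_kernel(X, Y):
    """Kernel matrix whose (i, j) entry counts the equal bits of X[i] and Y[j]."""
    return (X[:, None, :] == Y[None, :, :]).sum(axis=2).astype(float)


# Hypothetical feature vectors; real inputs would come from feature extraction.
train = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [1, 0, 0, 0], [1, 1, 1, 0]])
test = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])

K_train = matching_bits_kernel(train, train)  # shape (n_train, n_train)
K_test = matching_bits_kernel(test, train)    # shape (n_test, n_train)

model = OneClassSVM(kernel="precomputed", nu=0.5)
model.fit(K_train)
predictions = model.predict(K_test)  # +1 = resembles training data, -1 = outlier
print(predictions)
```

With `kernel="precomputed"`, training takes the square train-by-train matrix and prediction takes a test-by-train matrix, so any hand-written kernel can be used without changing the SVM itself.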

Kernels

We have three feature representations (bit-vectors, strings, and graphs)

We have three kernels, each of which takes two feature representations as input, and outputs a measure of similarity

1) A bit-vector kernel that counts the number of equivalent bits

Example: let our two bit-vectors be

<0 0 1 1 1 0 1>
<1 0 1 0 1 0 1>

Then, we have 5 matching bits, so a kernel value of 5
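The bit-vector kernel is small enough to write out directly. A minimal sketch (the function name is an assumption), checked against the two vectors from the example:

```python
def bit_vector_kernel(a, b):
    """Count the positions at which two equal-length bit-vectors agree."""
    if len(a) != len(b):
        raise ValueError("bit-vectors must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y)


print(bit_vector_kernel([0, 0, 1, 1, 1, 0, 1], [1, 0, 1, 0, 1, 0, 1]))  # 5
```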

Kernels (cont.)

2) A kernel over strings that counts the number of common subsequences between two strings, weighted by length

Length is measured by the distance between the first and last elements in both strings

For example, the strings “abc” and “bxc” have as common subsequences “b”, “c”, and “bc”, which have lengths 1, 1 and 2+3=5, respectively.
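The slides do not state how the length enters the weight. The sketch below assumes the common gap-weighted formulation, in which each occurrence of a subsequence is weighted by `decay ** span` and the weights from both strings are multiplied, so "bc" contributes `decay ** (2 + 3)`. In this formulation a single shared character contributes `decay ** (1 + 1)`. The `decay` parameter and the brute-force enumeration are assumptions, suitable only for short strings.

```python
from collections import defaultdict
from itertools import combinations


def subsequence_features(s, decay):
    """Map each subsequence of s to the sum of decay**span over its occurrences."""
    features = defaultdict(float)
    for size in range(1, len(s) + 1):
        for positions in combinations(range(len(s)), size):
            span = positions[-1] - positions[0] + 1  # first to last character
            features["".join(s[i] for i in positions)] += decay ** span
    return features


def subsequence_kernel(s, t, decay=0.5):
    """Sum, over common subsequences, of the product of their weights in s and t."""
    fs = subsequence_features(s, decay)
    ft = subsequence_features(t, decay)
    return sum(weight * ft[u] for u, weight in fs.items() if u in ft)


# Common subsequences of "abc" and "bxc" are "b", "c" and "bc":
# 0.5**2 + 0.5**2 + 0.5**5 = 0.53125
print(subsequence_kernel("abc", "bxc"))
```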

Kernels (cont.)

3) A graph kernel that iteratively relabels the nodes of each graph based on that node’s children’s labels

Original labels are based on the instructions present in each node

The generated labels are counted to generate a vector

The kernel returns the dot product of these two vectors

Example:

[Figure: Original Graphs. Graph A has nodes labeled 2, 1, and 4; Graph B has nodes labeled 2, 1, 4, and 4]

Kernels (cont.)

[Figure: node labels of Graph A and Graph B at each relabeling step]

Step          Graph A                  Graph B
Original      2, 1, 4                  2, 1, 4, 4
Iteration 1   214, 14, 4               214, 14, 4, 4
Iteration 2   214144, 144, 4           214144, 144, 4, 4
Iteration 3   2141441444, 1444, 4      2141441444, 1444, 4, 4

Kernels (cont.)

Labels and count vectors:

Label     1   2   4   14   144   214   1444   214144   2141441444
Graph A   1   1   4   1    1     1     1      1        1
Graph B   1   1   8   1    1     1     1      1        1

The dot product of these two vectors is 1*1 + 1*1 + 4*8 + 1*1 + 1*1 + 1*1 + 1*1 + 1*1 + 1*1 = 40
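The relabeling procedure can be sketched in a few lines. The edge structure is not recoverable from the slide text, so the graphs below are an assumption chosen to reproduce the slide's labels: node 2 points to nodes 1 and 4, node 1 points to node 4, and Graph B has one additional node labeled 4 with no edges. The new label is assumed to be the node's current label followed by its children's labels in sorted order.

```python
from collections import Counter


def label_counts(labels, children, iterations):
    """Count every label a graph produces over the original labeling and each relabeling."""
    current = dict(labels)
    counts = Counter(current.values())
    for _ in range(iterations):
        # New label = own label + sorted labels of the node's children.
        current = {
            node: current[node] + "".join(sorted(current[c] for c in children.get(node, [])))
            for node in current
        }
        counts.update(current.values())
    return counts


def graph_kernel(graph_a, graph_b, iterations=3):
    """Dot product of the two label-count vectors."""
    counts_a = label_counts(*graph_a, iterations)
    counts_b = label_counts(*graph_b, iterations)
    return sum(count * counts_b[label] for label, count in counts_a.items())


# Assumed reconstruction of the example graphs (node names are arbitrary).
edges = {"top": ["left", "right"], "left": ["right"]}
graph_a = ({"top": "2", "left": "1", "right": "4"}, edges)
graph_b = ({"top": "2", "left": "1", "right": "4", "extra": "4"}, edges)

print(graph_kernel(graph_a, graph_b))  # 40
```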

Kernels (cont.)

Additionally, we have a kernel over sets, which applies some other kernel, k0, over the elements of each set. It applies the element kernel to every pair of elements in the two sets, and exponentiates these values, so that the better matches (higher values) are emphasized:

Then we feed the sets of strings from the non-standard permissions feature and the sets of graphs from the CFG feature into this with the string kernel and graph kernel, respectively
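A sketch of one plausible reading of the set kernel: apply the element kernel to every pair, raise each value to a power, and sum. The exact form used in the slides is not recoverable from the text, so the summation and the `power` parameter are assumptions, as is the toy element kernel `shared_letters`.

```python
def set_kernel(xs, ys, element_kernel, power=2):
    """Sum of element_kernel(x, y) ** power over all pairs (x, y).

    Raising to a power above 1 emphasizes the better matches (higher values).
    """
    return sum(element_kernel(x, y) ** power for x in xs for y in ys)


def shared_letters(a, b):
    """Toy element kernel: number of distinct characters two strings share."""
    return len(set(a) & set(b))


# Pairs: ("READ", "READ") -> 4 ** 2 = 16, ("SEND", "READ") -> 2 ** 2 = 4
print(set_kernel(["READ", "SEND"], ["READ"], shared_letters))  # 20
```

In the pipeline described above, `element_kernel` would be the string kernel for the non-standard permission sets and the graph kernel for the sets of CFGs.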

Kernels (cont.)

Each of these kernel values is normalized, then summed to form the final kernel value

One such value is calculated for every pair of training examples, generating a kernel matrix

The kernel matrix is used to train the 1C-SVM

During testing, one value is calculated for each pair of training and testing examples.
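The combination step can be sketched as below. The slides do not say which normalization is used; this sketch assumes the usual cosine normalization k(x, y) / sqrt(k(x, x) k(y, y)). The feature names, the toy kernels, and the sample applications are hypothetical.

```python
import math


def normalize(kernel):
    """Wrap a kernel so that every example has similarity 1 with itself."""
    def normalized(x, y):
        scale = math.sqrt(kernel(x, x) * kernel(y, y))
        return kernel(x, y) / scale if scale else 0.0
    return normalized


def combine(feature_kernels):
    """Sum the normalized per-feature kernels into one kernel over applications."""
    normalized = {name: normalize(k) for name, k in feature_kernels.items()}

    def combined(app_a, app_b):
        return sum(k(app_a[name], app_b[name]) for name, k in normalized.items())
    return combined


def kernel_matrix(rows, cols, kernel):
    """One kernel value for every (row, column) pair of applications."""
    return [[kernel(r, c) for c in cols] for r in rows]


# Toy kernels and applications, for illustration only.
kernel = combine({
    "bits": lambda a, b: sum(1 for x, y in zip(a, b) if x == y),
    "tokens": lambda a, b: len(set(a) & set(b)),
})
train = [{"bits": [1, 0, 1], "tokens": ["READ", "SMS"]},
         {"bits": [1, 1, 1], "tokens": ["SEND", "SMS"]}]
test = [{"bits": [0, 0, 1], "tokens": ["READ"]}]

K_train = kernel_matrix(train, train, kernel)  # train x train, used for fitting
K_test = kernel_matrix(test, train, kernel)    # test x train, used for prediction
print(K_train)
print(K_test)
```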

Experimental Results

We tested our system with 2081 benign applications and 91 malicious applications

The system correctly classifies approximately 90% of malware, but only correctly classifies approximately 50% of benign applications

We also tested against each of the individual features alone

Background: Measures of Quality

We examine several measures of quality:

• True Positive Rate (aka Recall): the proportion of actual malware that our model classifies as malware

• False Negative Rate: the proportion of actual malware that our model classifies as benign; “miss” rate

• True Negative Rate: the proportion of actual benign applications that our model classifies as benign

• False Positive Rate: the proportion of actual benign applications that our model classifies as malware; “false alarm” rate

• Precision: the proportion of malware-classified applications that are actually malware

• F1: the harmonic mean of precision and recall; this gives a measure of quality between precision and recall, closer to the worse of the two

• F2: like F1, but with recall weighted twice as much as precision

• F½: like F1, but with precision weighted twice as much as recall
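These measures follow directly from the four confusion-matrix counts, with malware as the positive class. A minimal sketch using the standard F-beta formula; the counts in the demo are hypothetical and are not results from this work.

```python
def quality_measures(tp, fn, tn, fp):
    """Compute the listed measures from confusion-matrix counts.

    tp/fn count actual malware, tn/fp count actual benign applications.
    Assumes every denominator is non-zero.
    """
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)

    def f_beta(beta):
        # beta > 1 weights recall more heavily, beta < 1 weights precision more.
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "true_positive_rate": recall,
        "false_negative_rate": fn / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "false_positive_rate": fp / (tn + fp),
        "precision": precision,
        "f1": f_beta(1),
        "f2": f_beta(2),
        "f0.5": f_beta(0.5),
    }


# Hypothetical counts: 10 malware samples and 10 benign applications.
measures = quality_measures(tp=8, fn=2, tn=5, fp=5)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Because recall exceeds precision in this toy case, F2 comes out higher than F1, and F½ lower.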

Experimental Results (cont.)

Experimental Results (cont.)

Note: The downward trend in precision and F-measures is due to the increasing benign sample size and fixed malware sample size

Conclusions and Future Work

The high true positive rate is promising, but the low true negative rate leaves much room for improvement

There are a number of areas ripe for future investigation:

• Additional features from static analysis or even dynamic analysis

• New and better kernels and feature representations

• Alternative models such as the Semi-Supervised SVM, Kernel PCA or probabilistic models
