automated classification and analysis of internet malware m. bailey j. oberheide j. andersen z. m....

Automated Classification and Analysis of Internet Malware

M. BaileyJ. OberheideJ. AndersenZ. M. MaoF. JahanianJ. Nazario

RAID 2007

Presented by Mike Hsiao 20081107

University of Michigan,Arbor Network

2

Outline

Introduction Anti-Virus Clustering of Malware Behavior-Based Malware Clustering Evaluation Related Work Conclusion

3

Introduction

Current different anti-virus products characterize malware in ways that are inconsistent across anti-virus products, incomplete across malware, and fall to be concise in there semantics.

The authors propose a new classification technique that describes malware behavior in terms of system state changes.

Automated Classification and Analysis of Internet Malware

4

Introduction (cont’d)

Spam, phishing, denial of service attacks, botnets, and worms largely depend on some form of malicious code, commonly referred to as malware.– Exploiting software vulnerability– Tricking users into running malicious code

5

Introduction (challenges)

Agobot (name of a malware) has been observed to have more than 580 variants.

– Agobot variants have the ability to perform DoS attacks, steal bank passwords and account, propagate over the network using a diverse set of remote exploits, use polymorphism and disassembly, and even patch vulnerabilities and remove competing malware.

A recent Microsoft survey found more than 43,000 new variants of backdoor trojans and bots during the first half of 2006.

multi-vector

6

Introduction

The authors developed a dynamic analysis approach, based on the execution of malware and the casual tracing of the OS objects created due to malware execution.

The reduced collection of these user-visible system state changes is used to create a fingerprint of the malware’s behavior.

– These fingerprints are more invariant and useful than abstract code sequence (representing program behaviors)

7

Introduction

These can be directly used in assessing the potential damage incurred, enabling detection and classification of new threats, and assisting in the risk assessment of these threats in mitigation and clean up.

The author provide a method for automatically categorizing these malware profiles into groups that reflect similar classes of behaviors.

8

Outline


9

Understanding Anti-Virus Malware Labeling

In order to accurately characterize the ability of AV to provide meaningful labels for malware, …

Note: AV systems rarely use the exact same labels for a threat, and users of these systems have come to expect simple naming differences across vendors.

e.g, WORM_MSBLAST.A

10

A pool of malware classified by AVs as SDBot families

The classification of SDBot is ambiguous.

11

Properties of a Labeling System

Consistency– Identical items must and similar items should be assigned t

he same label.

Completeness– A label should be generated for as many items as possible.

Conciseness– The labels should be sufficient in number to reflect the uniq

ue properties of interest, while avoiding superfluous labels.

12

Outline


13

Defining and Generating Malware Behaviors

Individual system calls may be at a level that is too low for abstracting semantically meaningful information

– a higher abstraction level is needed to effectively describe the behavior of malware.

The authors define the behavior of malware in terms of non-transient state changes that the malware causes on the system.

– spawned process, modified registry keys, modified files, network connection attempts.

14

Clustering of Malware

Ten unique malware samples - P: number of process - F: file - R: registry - N: network

Our approach to generating meaningful labels is achieved through clustering of the behavioral fingerprints.

15

Comparing Individual Malware Behaviors - NCD

Intuitively, Normalized Compression Distance (NCD) represents the overlap in information between two samples.

C(x) is the zlib-compressed length of x.

16

Constructing Relationships Between Malware

dist

ance

17

Extracting Meaningful Groupsdi

stan

ce

c1c2

c3c4

Clusters are constructed from the tree by first calculating the inconsistency coefficient of each cluster, and then thresholding based on the coefficient.

18

Outline


19

Comparing AV Groupings and Behavioral Clustering

The propose method created 403 cluster from 3,698 individual malware.

– http://www.eecs.umich.edu/~mibailey/malware/

The authors expect that a behavior-based approach would separate out these more general classes if their behavior differs, and aggregate across the more specific classes if behaviors are shared.

20

Comparing AV Groupings and Behavioral Clustering (example)

Symantec, who adopts a more general approach, has two binaries identified as “back-door.sdbot”.

They were divided into separate clusters in our evaluation based on

– differing processes created, – differing back-door ports, – differing methods of process invocation or reboot, – and the presence of AV avoidance in one of the samples.

21

Comparing AV Groupings and Behavioral Clustering (example)

FProt, which has a high propensity to label each malware sample individually,

– had 47 samples that were identified as belonging to the sdbot family.

FProt provided 46 unique labels for these samples, nearly one unique label per sample.

In our clustering, these 46 unique labels were collapsed into 15 unique clusters reflecting their overlap in behaviors.

22

Measuring the Completeness, Conciseness and Consistency

No such behavior - P: number of process - F: file - R: registry - N: network

In the large sample, roughly 2,200 binaries shared exactly identical behavior with another sample. When grouped, these 2,200 binaries created 267 groups in which each sample in the group had exactly the same behavior.

23

Application of Clustering and Behavior Signatures (1/2)

Classifying Emerging Threats– For example, cluster c156 consists of three malware sampl

es that exhibit malicious bot-related behavior, including IRC command and control activities.

– Each of the 75 behaviors observed in the cluster is shared with other samples of the group—96.92% on average, meaning the malware samples within the cluster have almost identical behavior.

– It is clear that our behavioral classification would assist in identifying these samples as emerging threats through their extensive malicious behavioral profile.

24

Application of Clustering and Behavior Signatures (2/2)

Resisting Binary Polymorphism

Examining the Malware Behaviors

25

Outline


26

Related Work

Content-based signatures– insufficient to cope with emerging threats due to intentional

evasion

Lower-layer behavioral profiles– individual system calls, instruction-based code templates, s

hellcode, network connection and session behavior– do not provide semantic value in explaining behaviors exhib

ited by a malware variant or family

Ellis– similar data being sent from one to the next

27

Outline


28

Conclusion

They showed that AV systems are incomplete in that they fail to detect or provide labels.– not consistent– vary widely in their conciseness

Create a behavioral fingerprint of the malware’s activity– the state changes that are a causal result of the in

fection.

29

Comments

Host-based observation– v.s. network observation

Classify collected malware– v.s. detect malicious behavior

Closely to understand what are happening while these malware are executed.– v.s. the revealed behaviors that reflect the abnor

malities of compromised service

automated classification and analysis of internet malware m. bailey j. oberheide j. andersen z. m....

Documents

execution of malware

malware execution

competing malware

malware profiles

aa pool of malware

automated classification

classification of new

new classification technique