bioinformatics techniques for metamorphic malware analysis and detection: grijesh
DESCRIPTION
ABSTRACT: Modern malware that are metamorphic or polymorphic in nature mutate their code by employing code obfuscation and encryption methods to thwart detection. Thus, conventional signature based scanners fail to detect these malware. In order to address the problems of detecting known variants of metamorphic malware, we propose a method using bioinformatics techniques effectively used for Protein and DNA matching. Instead of using exact signature matching methods, more sophisticated signature(s) are extracted using multiple sequence alignment (MSA). The results show that the proposed method is capable of identifying malware variants with minimum false alarms and misses. Also, the detection rate achieved with our proposed method is better compared to commercial antivirus products used in the study.
Status: This work has been accepted by 8th IEEE International Conference on Innovations in Information Technology (Innovations'12).
Link: http://ieeexplore.ieee.org/xpl/login.jsp?reload=true&tp=&arnumber=6207739&url=http://ieeexplore.ieee.org/iel5/6203543/6207707/06207739.pdf?arnumber=6207739
-
5/28/2018 Bioinformatics Techniques for Metamorphic Malware Analysis and Detection: Grijesh
1/60
A
M.Tech DISSERTATION REPORT
on
BioInformatics Techniques for Metamorphic Malware Analysis and Detection
Submitted for partial fulfillment for the degree of
Master of Technology
(Computer Engineering)
in
Department of Computer Engineering
(June-2011)
Supervisors: Dr. Vijay Laxmi, Dr. Manoj Singh Gaur
By: Grijesh Chauhan (2009PCP116)
MALAVIYA NATIONAL INSTITUTE OF TECHNOLOGY JAIPUR
Department of Computer Engineering
Malaviya National Institute of Technology Jaipur
Rajasthan - 302017
CERTIFICATE
This is to certify that the Dissertation Report on BioInformatics Techniques
for Metamorphic Malware Detection, by Grijesh Chauhan, is the work
completed under my supervision, hence approved for submission in partial
fulfillment for the Master of Technology in Computer Engineering during academic
session 2009-2011.
(Dr. Vijay Laxmi), Reader and Head of Department, M.N.I.T., Jaipur. Date:
(Dr. M.S. Gaur), Professor, M.N.I.T., Jaipur. Date:
Declaration
I, Grijesh Chauhan, declare that this Dissertation titled, BioInformatics Techniques
for Metamorphic Malware Analysis and Detection and the work presented
in it are my own. I confirm that:
This work was done wholly or mainly while in candidature for an M.Tech.
degree at MNIT.
Where any part of this Dissertation has previously been submitted for a
degree or any other qualification at MNIT or any other institution, this has
been clearly stated.
Where I have consulted the published work of others, this is always clearly
attributed.
Where I have quoted from the work of others, the source is always given.
With the exception of such quotations, this Dissertation is entirely my own
work.
I have acknowledged all main sources of help.
Signed:
Date:
Abstract
Modern malware that is metamorphic or polymorphic in nature mutates its code by
employing code obfuscation and encryption methods to thwart detection. Conventional
signature based scanners fail to detect such malware. Also, a signature based scanner
requires frequent updates, and the size of its database increases exponentially. In order
to address the problem of detecting known variants of metamorphic malware, we propose
a method known as MetamOrphic Malware Exploration Techniques using MSA (MOMENTUM),
which uses bioinformatics techniques for protein and DNA matching. Instead of a fixed
signature, more sophisticated signature(s) are extracted using multiple sequence
alignment (MSA). Experiments are conducted over an obfuscated malware data set collected
from VX Heavens, tools and user agencies, and benign samples gathered from a fresh
installation of the Windows XP operating system, Cygwin etc. Experiments are performed
by segregating the data set into two parts: one for modeling signatures and the other
reserved for testing. The results show that the proposed method is capable of identifying
malware variants with minimum false alarms and misses.
Acknowledgements
I take immense pleasure in expressing my deep and sincere gratitude to my esteemed
guide, Dr. Vijay Laxmi (Head of the Department, Department of Computer Engineering,
Malaviya National Institute of Technology), and to Dr. Manoj Singh Gaur
(Professor, Department of Computer Engineering, Malaviya National Institute of
Technology) for their invaluable guidance and for spending precious hours on my
work. Their excellent cooperation and suggestions through stimulating and beneficial
discussions provided me with an impetus to work and made the completion
of this work possible.
My sincere thanks to all faculty members of the Department of Computer Engineering,
MNIT Jaipur, for their constant support and for imparting the best knowledge in the
M.Tech course.
I would like to thank all non-teaching staff members of the Department of Computer
Engineering, Malaviya National Institute of Technology, Jaipur, and all those people
whose kind favors I have received in completing this Dissertation work.
I will always be indebted to my parents for their support and prayers in completing
this work successfully. I thank my friends who have directly or indirectly
contributed by giving their valuable suggestions.
Signed:
Date:
Contents
Declaration
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Objective
1.3 Related Work
1.4 Contributions of Thesis
1.5 Outline
2 Malware and Types
2.1 Types of Malware
2.1.1 Virus
2.1.2 Worms
2.1.3 Trojans
2.1.4 Backdoors
2.1.5 Logic Bombs
2.1.6 Adware
2.2 Polymorphic
2.3 Metamorphic
2.3.1 Dead Code Insertion
2.3.2 Reorder Instruction using Jump
2.3.3 Equivalent Instruction Substitution
2.3.4 Subroutine Inlining and Outlining
2.3.5 Independent Instruction Permutation
2.4 Detection Techniques
2.4.1 Static Detection
2.4.2 Dynamic Detection
2.4.3 Heuristic Detection
3 Bioinformatics Techniques
3.1 Global Alignment
3.1.1 Needleman-Wunsch Method
3.1.2 Levenshtein distance
3.2 Local Alignment
3.3 Phylogenetic Tree
3.4 Multiple Sequence Alignment Method
3.4.1 Iterative Alignment
3.4.2 Progressive Alignment
4 Metamorphic Malware Exploration Technique Using MSA (MOMENTUM)
4.1 Data acquisition
4.2 Analysis of metamorphism in Tools/Real malware
4.2.1 Type of obfuscation
4.2.2 Identification of Base Malware
4.3 Signature Modeling and Testing
4.3.1 Single Signature
4.3.2 Group Signature
4.3.3 Testing
5 Result and Inferences
5.1 Evaluation Metrics
5.2 Intra Family Analysis
5.3 Inter Family Analysis
5.4 Comparative Analysis
5.5 Testing with Signature
5.6 Comparative Analysis with Antiviruses
6 Conclusions and Future Work
A Executable Unpacking
A.1 Symptoms of Packed Malicious Executables
A.2 Manual Unpacking of Packed Executable
A.3 Executable Unpacking using Ether
Bibliography
List of Figures
2.1 Metamorphic malware variants using obfuscation and embedded with metamorphic engine
2.2 Subroutine Inlining and Subroutine Outlining
2.3 Subroutine Permutation
3.1 Global Alignment for DNA Sequences
3.2 Local Alignment for DNA Sequences
3.3 Phylogenetic tree and alignment of sequences
3.4 Multiple Aligned opcode sequences corresponding to malware samples
3.5 Progressive Alignment
4.1 Brief Outline of Method for Metamorphic Malware Detection
4.2 Method for Investigation of Metamorphism
4.3 Sum of Pair
4.4 Signature Modeling and Testing
4.5 Extraction of single signature
4.6 Wildcard based representation of Group signature
5.1 Intra Family Analysis of malware (Synthetic and Real)
5.2 Inter Family Analysis of malware (Synthetic and Real)
5.3 Detection rate of antiviruses compared with different type of constructed signature
A.1 Portable Executable Unpacking Procedure
A.2 Userspace Unpacking using Ether
List of Tables
2.1 Different types of Junk code instructions used by metamorphic engine
2.2 Dictionary of equivalent instructions
4.1 Dataset Description
4.2 Instruction Replacements
5.1 Comparative Analysis of Malware Samples
5.2 Evaluation Metrics for different types of signatures
Chapter 1
Introduction
The advent of the Internet has increased the presence of malware in the digital world.
The majority of transactions are now performed online, often by naive users, which has
increased the threat of stolen passwords, transaction credentials and personal information.
The term malware generally refers to all software with illicit intent. Malware is
categorized into computer viruses, worms, Trojans, backdoors, rootkits etc. Malware can
also be categorized by mode of propagation: mobile malware such as worms, spyware and
botnets, and static malware such as viruses. These malicious programs aim to replicate
by exploiting system vulnerabilities.
Conventional malware scanners detect malware by matching signatures of known samples.
Signature based scanners are fast but have certain limitations: (a) they fail to detect
unseen malware, (b) they lack semantic knowledge of the samples and (c) they fail to
detect obfuscated or encrypted instances. A minor change in the code of a malicious
sample is enough to thwart detection.
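The limitation of exact signature matching can be illustrated with a minimal Python sketch. The byte string below is a hypothetical signature (an ordinary x86 function prologue), and `matches_signature` is an illustrative helper, not part of any scanner discussed in this thesis. Inserting a single NOP inside the matched region is enough to make the exact match fail.

```python
# Hypothetical signature: push ebp; mov ebp, esp; sub esp, 0x10
SIGNATURE = bytes([0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x10])


def matches_signature(sample: bytes, signature: bytes = SIGNATURE) -> bool:
    """Exact matching: report a hit only if the signature occurs verbatim."""
    return signature in sample


# Original sample: the signature surrounded by a NOP and a RET.
original = bytes([0x90]) + SIGNATURE + bytes([0xC3])

# Variant with the same behaviour, but one NOP (0x90) inserted
# between "mov ebp, esp" and "sub esp, 0x10".
variant = bytes([0x90, 0x55, 0x8B, 0xEC, 0x90, 0x83, 0xEC, 0x10, 0xC3])

print(matches_signature(original))
print(matches_signature(variant))
```

The first call reports a match and the second does not, although the two byte strings differ by a single instruction that has no effect.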
Antivirus companies have developed better methods for identifying malware, but malware
writing is growing in sophistication and continues to challenge scanners. Identification
of polymorphic and metamorphic malware is difficult, as a simple change in the byte
pattern significantly changes the signature of the samples. Maintaining a signature for
each malware results in (a) growth of the malware database and (b) the risk that a system
is infected by new samples before a signature is created. The detection process can be
categorized as (a) static analysis and (b) dynamic analysis. In static analysis, malware
is analyzed by checking the structure (content) of the assembly code without executing
the samples. Thus, the system is not infected, and maliciousness is inferred either by
constructing the control flow graph or from the frequencies of opcodes. In dynamic
analysis, each malware sample is executed in a controlled environment. The impact of
infection is monitored by inspecting the traces left by malware samples (system registry,
processor registers etc.). This method gives more refined output but is expensive with
respect to running time.
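As an illustration of the opcode-frequency view used in static analysis, the following Python sketch counts mnemonics in a disassembly listing. The listing format (one instruction per line, `;` comments, labels ending in `:`) and the function name `opcode_frequencies` are assumptions made for this example; real disassembler output would need its own parser.

```python
from collections import Counter


def opcode_frequencies(listing: str) -> Counter:
    """Count mnemonics in a listing with one instruction per line."""
    opcodes = []
    for line in listing.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if not line or line.endswith(":"):     # skip blank lines and labels
            continue
        opcodes.append(line.split()[0].lower())
    return Counter(opcodes)


# Hypothetical listing, not taken from any sample in the data set.
LISTING = """
start:
    push ebp
    mov  ebp, esp
    mov  eax, [ebp+8]   ; first argument
    add  eax, 1
    pop  ebp
    ret
"""

freq = opcode_frequencies(LISTING)
print(freq)
```

No sample is executed at any point; only the text of the disassembly is inspected, which is what makes the static approach safe but blind to packed or encrypted code.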
1.1 Motivation
Metamorphic malware mutates its code on each replication while preserving the
functionality of the code. The code is mutated with the help of a small mutation engine
called the metamorphic engine. Metamorphic malware uses different obfuscation mechanisms
to evade conventional signature based scanners, which rely on exact string matching
techniques. The metamorphic engine is the prime element that keeps the malware hidden
from antivirus products. Also, the metamorphic engine is designed to be small so as to
bypass detection [8]. This indicates that the metamorphic engine performs structural
transformations of the code with a limited set of replacements. A total change in the
code is impossible, since the functionality of the malware variant would change and the
variant might lose its maliciousness by producing unnecessary code. Malicious programs
are less diverse than benign ones, since maliciousness must be preserved for infection
and propagation.
DNA and proteins mutate from one generation to another, inheriting some functional and
structural similarity from their ancestors. In this work it is assumed that metamorphic
malware, like a DNA/protein sequence, transforms its code through modifications in the
opcode sequence. The mismatches in the opcode sequence from one generation to another
may be considered points of mutation. Thus, exact string matching techniques would
fail to detect new malware variants. At this point we shift from the general area of
exact matching and exact pattern discovery to the general area of inexact, approximate
matching and sequence alignment. A bioinformatics sequence alignment method is used in
this work; it aligns sequences based on their evolutionary relationship and is found to
be better for signature extraction and detection of malware variants.
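The idea of treating mismatches between opcode sequences as points of mutation can be sketched in Python. This example uses the standard library's `difflib.SequenceMatcher` as a stand-in for inexact matching; it is not the bioinformatics alignment method used in this thesis, and the two opcode sequences are hypothetical.

```python
from difflib import SequenceMatcher


def mutation_points(parent, child):
    """Return the non-matching edit operations between two opcode sequences."""
    matcher = SequenceMatcher(a=parent, b=child, autojunk=False)
    return [
        (tag, parent[i1:i2], child[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical opcode sequences of a parent and one mutated generation.
parent = ["push", "mov", "xor", "call", "pop", "ret"]
child = ["push", "mov", "nop", "sub", "call", "pop", "ret"]

# Similarity in [0, 1]; an exact string match would simply report "different".
similarity = SequenceMatcher(a=parent, b=child, autojunk=False).ratio()

print(mutation_points(parent, child))
print(round(similarity, 3))
```

Here the only reported difference is the replacement of `xor` by `nop, sub`, while the remaining five opcodes line up, so the two generations still score as highly similar.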
1.2 Objective
Motivated by bioinformatics techniques, the objective of this thesis is to detect
metamorphic malware. Using the sequence alignment method, two types of signature(s) are
constructed for each malware family: (a) a group signature and (b) a single signature.
Unseen malware is tested against the extracted signature(s). In addition, the obfuscation
and metamorphism in malware constructors and real malware are explored to identify the
prominent types of instructions used for mutating the malware.
1.3 Related Work
The authors of [14] and [15] created a rewriting engine for detecting morphed malware
variants. The analysis of malware variants is based on the syntactic as well as the
semantic structure of a program. Signatures of malware are represented in the form of a
control flow graph (CFG). The signature matching technique is based on tree automata.
Krugel et al. [16] proposed a method based on code analysis to identify structural
similarity between malicious codes (worms). The method is based on the CFG generated
for a worm, which serves as a fingerprint for that worm. Their system is found to be
resilient against common code transformation techniques.
The authors of [17] proposed a novel method for analyzing malware based on a code graph.
Each malware executable was inspected, and the instructions corresponding to the system
call sequence were represented in the form of a topological graph. The proposed code
graph system was used to differentiate malware from benign programs by checking the
applicability of specific system calls.
The authors of [9] proposed a semantics based approach for detecting variants of malware.
This method is based on the functionality of the system calls executed by malware
samples. The main focus is to identify all instructions, and their parameters, that are
used for invoking a system call. They propose a pattern matching technique that is able
to identify semantically equivalent parts of code. The method is capable of identifying
programs that are related to each other and ones that are totally dissimilar. Rachit et
al. [13] created a malware normalizer making use of term rewriting rules. The method was
applied to the virus named Win32.Evol. The main objective of their work was to reduce
program variants to a smaller number of variants, i.e. to convert all programs into a
normal form.
In Hunting for metamorphic engines [10], Hidden Markov Models (HMMs) were used to
represent statistical properties of a set of metamorphic virus variants. The metamorphic
virus data set was generated from the following metamorphic engines: Second Generation
virus generator (G2), Next Generation Virus Construction Kit (NGVCK), Virus Creation Lab
for Win32 (VCL32) and Mass Code Generator (MPCGEN). An HMM is trained on a family of
metamorphic viruses and is then used to determine whether a given program is similar to
the viruses the HMM represents.
In [11], the critical API calls were extracted statically using IDA-Pro [6]. Thus, all
late-bound API calls made using GetProcAddress, LoadLibraryEx, etc. are not taken into
account. In addition, this approach did not work for packed malware.
The authors of [1] proposed a phylogeny model of the kind used in bioinformatics for
extracting information from genes, proteins or nucleotide sequences. An n-gram feature
extraction technique was proposed, and fixed permutations were applied to the code to
generate new sequences, called n-perms. Since new variants of malware evolve by
incorporating permutations, the proposed n-perm model was developed to capture
instruction and block permutations. The experiment was conducted on a limited data set
consisting of 9 benign samples and 141 worms collected from VX Heavens [2]. The proposed
method showed that similar variants appeared closer together in the phylogenetic tree,
where each node represented a malware variant. The work did not show how the n-perm
model would behave if the instructions in a block of code were replaced by equivalent
instructions that either expand or shrink the size of blocks (with respect to the number
of instructions in a block).
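The difference between n-grams and n-perms can be sketched as follows. This is a simplified reading in which an n-perm is taken to be an n-gram with the order of its elements disregarded; the function names and the two opcode sequences are illustrative assumptions, not the implementation of [1].

```python
def ngrams(seq, n):
    """All contiguous windows of length n, order preserved."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def nperms(seq, n):
    """n-grams with ordering ignored: each window is sorted into a canonical form."""
    return [tuple(sorted(gram)) for gram in ngrams(seq, n)]


# Hypothetical variants: the first two instructions are permuted in b.
a = ["mov", "add", "push", "call"]
b = ["add", "mov", "push", "call"]

shared_ngrams = set(ngrams(a, 3)) & set(ngrams(b, 3))
shared_nperms = set(nperms(a, 3)) & set(nperms(b, 3))

print(shared_ngrams)
print(shared_nperms)
```

The two variants share no 3-gram, but they do share one 3-perm, which is why an order-insensitive feature can still relate variants that differ only by instruction permutation. Replacing an instruction by an equivalent one changes the window contents themselves, which is the case the paragraph above notes was not addressed.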
1.4 Contributions of Thesis
In this thesis a novel method to detect metamorphic malware variants is proposed.
The method is based on static analysis, where the unpacked samples are disassembled
and the opcode sequences of the samples are used for comparison. The authors of [7]
showed that there is a large difference between the opcode sequences of malicious and
benign samples. Thus, opcodes can be used to represent malware samples as sequences. An
evolutionary tree, also known as a phylogenetic tree, is constructed for each family of
malware. A threshold within the family is computed, and unseen samples are detected using
this threshold. Two types of signatures, called (a) group signature and (b) single
signature, are constructed for a family. Both are extracted using multiple sequence
alignment (MSA), which is primarily used in the area of bioinformatics. Our experiments
show promising results and demonstrate the effectiveness of the method for detecting
known samples of metamorphic malware with few false alarms. Experiments have been
conducted on an obfuscated malware data set collected from VX Heavens [2] and from user
agencies. Malware variants are also prepared using constructors like NGVCK, MPCGEN, G2
and PSMPC. Through our experiments we have found that the obfuscation in samples created
using the constructors is minimal. The obfuscation consists primarily of simple
instruction replacement and junk code insertion, with the code reordered using jump
instructions. Also, most of the malware families generated using the constructors
overlap, indicating minimal obfuscation of the code from one generation to the next.
1.5 Outline
Chapter 2 gives an introduction to malware and the different types of malcode. The
chapter discusses the infection and propagation modes used by malicious software. Then
polymorphic malware is briefly introduced, followed by a detailed explanation of
metamorphic malware. Later in the chapter, malware detection techniques are described.
Chapter 3 discusses various bioinformatics techniques used in DNA/protein sequence
alignment. Two types of sequence alignment, known as global and local alignment, are
described. The phylogenetic tree, used to represent evolutionary relationships, is
explained with a brief outline of its construction techniques. The end of the chapter
describes in detail Multiple Sequence Alignment (MSA), the method used for aligning more
than two sequences. The iterative and progressive methods for constructing an MSA are
also introduced.
Chapter 4 describes the proposed method and its implementation, known as Metamorphic
Malware Exploration Technique Using MSA (MOMENTUM). This chapter explains in detail the
dataset preprocessing, which involves unpacking and classification into different
families. It describes the steps involved in exploring metamorphism in synthetic and real
malware data and highlights the prominent opcode sequences used by malware. Signature
modeling is explained in detail, along with the testing of unseen samples against the
extracted signatures to validate the hypothesis for detection.
Chapter 5 gives details of the experiments conducted, along with the analysis of results.
Finally, conclusions and future work are discussed in Chapter 6.
Chapter 2
Malware and Types
Malware can be defined as programs with unethical intentions. They contain instructions
that try to find vulnerabilities of computer systems in an unauthorized manner in order
to infect machines or steal valuable information from them. Once installed, some malware
gives remote attackers access to the user's machine. Malicious software can be
categorized as computer viruses, worms, Trojans, backdoors, adware, spyware etc. Many
malicious programs are distributed along with freeware or open source software with the
motive of making money. They are primarily installed on computer systems while browsing
sites from which games, movies, web browsers, music etc. are downloaded. A compromised
machine exposes useful information about the system and user to the attacker's machine,
such as (a) credit card numbers or (b) the root password, or (c) the compromised system
is used to launch attacks or send spam messages to other systems. Once the system is
infected, the malware may try to delete system files, change registry entries, hide the
task manager or launch spying software that monitors the user's keystrokes.
2.1 Types of Malware
Malware can be classified based on its mode of infection and propagation mechanism.
Modern malware is more sophisticated in terms of the complexity of its behaviour and
the appearance of its code. Present day malware employs anti-debugging and anti-virtual
machine checks and stays dormant in order to evade detection. As antivirus products
become more powerful, malware writing becomes more complex and continues to challenge
them. A brief outline of the various types of malware is given in the subsequent
subsections.
2.1.1 Virus
A computer virus is a program that infects a system by replication. Viruses use a host
program for infection and are propagated only through human intervention. A virus is
activated only when the infected program is executed. Some viruses are harmful and
some are written for fun. Harmful viruses may delete system files or freeze the computer
by occupying a large volume of hard disk space. Harmless computer viruses display
messages to attract users but still replicate by creating clones of themselves. Normally,
computer viruses target autorun files, executable system files and macros of document
files for the purpose of replication. Computer viruses have four basic functions: (a) a
search routine, which locates a program or file with a specific file extension to infect;
once such a file is found, the virus marks it to avoid infecting it again or searching
already infected files; (b) a copy routine, which copies the malicious code into a host
file; this malicious code may be prepended, appended or added at different locations of
the host file; (c) an anti-detection mechanism to evade detection by antivirus products,
such as encryption, code morphing or interrupt vector table modification; (d) a payload,
which is the main part of any virus.
2.1.2 Worms
Worms are malicious programs that, like computer viruses, are self-replicating but use
the Internet to spread. The most striking feature of a worm is that it does not require
human intervention to spread. Worms exploit two fundamental kinds of vulnerability to
propagate: (a) software bugs and (b) security holes. A typical software bug is the buffer
overflow vulnerability, which arises when a program uses functions like strcpy instead
of bounded functions like strncpy; it allows the attacker to supply oversized input and
copy malicious code into memory, as with the well-known finger program. A similar type
of software bug is found in programs like sendmail, which delivers messages to programs
residing on the local or a remote machine. The recipient program executes, in a new
shell, a script that is present in the body of the message. A worm also scans for open
ports to launch different types of attacks. It can also spread through email by sending
spam messages to the contact list of a particular user account. In most cases the user
is indirectly led to open or download attachments, which triggers the malicious
activities of the worm. Once a vulnerable system is located, the worm scans the
/etc/passwd file for encrypted passwords and tries to crack them by making multiple
attempts. Once a username and password are obtained, any malicious code can be remotely
executed by the worm using a utility like rexec.
2.1.3 Trojans
A Trojan horse is a non-self-replicating program that enters the computer unnoticed
and is usually disguised as a legitimate application. Once the system is infected by a
Trojan, it gives an attacker at a remote location unrestricted access to the user's
system. These malicious programs require a host program in which they hide. The basic
components of a Trojan horse are a server and a client program. The server offers a
program that attracts the user, in the form of games, images, videos etc., in which the
malicious program hides. After these applications are downloaded to the system, the
machine is infected and the Trojan (client program) performs spying activity.
2.1.4 Backdoors
A backdoor is a program created to bypass network security checks and open a channel
through which the attacker can control, spy on or interact with the victim machine.
Backdoors are planted in software (open source or freeware) before its distribution.
When this software is installed and executed, the backdoor opens the channel and connects
to the remote machine to leak valuable information concerning the user and the computer
system. Some backdoors are created for legitimate purposes, in order to avoid the time
consuming authentication performed when debugging a network server [18]. Sometimes
backdoors make use of Trojans to compromise a computer system. The user's machine is
victimized when an image or video containing a backdoor is downloaded. Many backdoors are
installed when an ActiveX control is installed on the user's system while browsing
certain sites. Most browsers prompt the user when an ActiveX control is downloaded, to
protect the machine from attack.
2.1.5 Logic Bombs
This category of malware can exist stand-alone or be interleaved inside a legitimate
program. Logic bombs do not replicate and have two basic components: (a) the payload,
which is capable of performing malicious activities like formatting the hard disk or
deleting system files, and (b) the trigger, which makes the logic bomb more dangerous, as
it stays dormant until a specific event occurs and then delivers its malicious payload.
2.1.6 Adware
Adware forces unsolicited advertisements on the user while browsing the Internet. It
gathers browsing behaviour and is planted by many companies to create interest in
shopping by popping up numerous advertisements. Sometimes adware is very dangerous, as
it redirects to unsolicited sites that require users to fill in information such as
email passwords and credit card or CVV numbers, and that log keystrokes to gather this
valuable information.
Most of the popular malware today employs encryption and obfuscation to evade
detection. Such malware is called polymorphic and metamorphic malware and is
described in the subsequent sections.
2.2 Polymorphic
Polymorphic malware encrypts its code with a random key to avoid detection.
Each polymorphic virus has a polymorphic engine, called the virus decryption
routine (VDR), which generates new keys and contains the decryption module for
decrypting the encrypted malicious body responsible for infecting applications
and the system. Once executed, the virus is re-encrypted and added to another
vulnerable host application. Thus, when an antivirus scans the malware for a
signature it finds a different pattern (as the keys are different), and
detection is thwarted.
Malware scanners perform in-memory scanning of each suspicious sample for
detection. Ultimately a malware needs to execute in order to infect the machine,
hence it must reside in main memory. Thus, the antivirus scans through all
samples in memory and matches all patterns against the signatures in its
repository.
A major weakness of polymorphic malware is its decryption algorithm. If the
scanner can locate the decryption algorithm, then this can become a signature
for identifying the polymorphic malware. To avoid detection, malware authors
scramble statements or replace some registers with unused registers to obtain
different byte patterns. Another approach is to prepare a dictionary of binary
code fragments and their equivalent replacements. Using this table, the
polymorphic engine can automatically identify binary patterns and replace them
with equivalent code from the dictionary to generate new malware variants.
2.3 Metamorphic
Metamorphic malware is very sophisticated in nature, as it completely modifies
its code upon each replication to generate a new malware variant. This makes it
very difficult for antivirus products to identify metamorphic malware using
signature matching techniques. Metamorphic malware contains an engine, normally
referred to as the metamorphic engine, which mutates the code from one
generation to the next. Normally the size of the metamorphic engine is kept
small in order to avoid detection. A metamorphic engine alters the program by
applying various obfuscation techniques such as (a) junk code insertion, (b)
instruction permutation by reordering the control flow using jump instructions,
(c) equivalent instruction replacement and (d) subroutine inlining and
outlining. Figure 2.1 shows metamorphic malware embedded with a metamorphic
engine using obfuscation transformations.
2.3.1 Dead Code Insertion
In this technique some garbage code or NOP instructions are inserted into the
actual code. This is the simplest form of obfuscation, as it does not reorder
the program code. Garbage code is inserted to confuse the scanner by increasing
the number of irrelevant byte patterns in the malicious samples. Dead code
insertion is illustrated by the instructions marked as garbage code in the
following code snippet.
Figure 2.1: Metamorphic malware variants using obfuscation and embedded with metamorphic engine.
mov eax, 020H
mov eax, eax ;Garbage Code
mov ebx, 0ABH
add eax, ebx
add eax, 00H ;Garbage Code
push eax
pop ebx
push eax ;Garbage Code
pop eax ;Garbage Code
nop ;Garbage Code
add eax, ebx
add eax, 00H ;Garbage Code
mul ecx
mov [esi], ebx
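Seen from the detection side, instructions such as those marked as garbage code above can be filtered out before comparing samples. The following is a minimal Python sketch of that idea, not the method used in this work: the pattern list and the function name `strip_junk` are hypothetical, only a few single-instruction patterns and the adjacent `push reg` / `pop reg` pair are handled, and side effects on the flags register are ignored.

```python
import re

# Illustrative junk patterns (an assumed subset, not an exhaustive list).
JUNK_PATTERNS = [
    re.compile(r"^nop$"),
    re.compile(r"^mov (\w+), \1$"),               # mov reg, reg
    re.compile(r"^(add|sub|or|xor) \w+, 0+h?$"),  # op reg, 0
]


def strip_junk(lines):
    """Return normalized instructions with comments and simple junk removed."""
    kept = []
    for line in lines:
        # Drop the comment, lowercase, and collapse whitespace.
        ins = " ".join(line.split(";")[0].lower().split())
        if not ins:
            continue
        if any(p.match(ins) for p in JUNK_PATTERNS):
            continue
        # "push reg" immediately followed by "pop reg" cancels out.
        m = re.match(r"^pop (\w+)$", ins)
        if m and kept and kept[-1] == f"push {m.group(1)}":
            kept.pop()
            continue
        kept.append(ins)
    return kept


snippet = [
    "mov eax, 020H", "mov eax, eax ;Garbage Code", "mov ebx, 0ABH",
    "add eax, ebx", "add eax, 00H ;Garbage Code", "push eax", "pop ebx",
    "push eax ;Garbage Code", "pop eax ;Garbage Code", "nop ;Garbage Code",
    "add eax, ebx", "add eax, 00H ;Garbage Code", "mul ecx", "mov [esi], ebx",
]
cleaned = strip_junk(snippet)
```

Applied to the snippet above, the sketch keeps the eight instructions that are not marked as garbage code.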
Some of the junk instructions used are listed in Table 2.1. The left-hand
column of the table gives the instruction and the right-hand column gives its
meaning.
Table 2.1: Different types of junk code instructions used by metamorphic engine.
Instructions        Meaning
NOP                 No Operation
CLD                 No Operation
PUSHFD POPFD        No Operation
PUSHAD POPAD        No Operation
MOV REG, REG        REG := REG
ADD REG, 0          REG := REG + 0
OR REG, 0           REG := REG | 0
AND REG, -1         REG := REG & -1
PUSH REG POP REG    No Operation
XCHG REG, REG       No Operation
XOR REG, 0          No Operation
SUB REG, 0          No Operation
SBB REG, 0          No Operation
ADC REG, 0          No Operation
SHL REG, 0          No Operation
SHR REG, 0          No Operation
AND REG, 1          REG := REG & 1
2.3.2 Reorder Instruction using Jump
In this technique the virus adds jump instructions and garbage code in each
mutant. Win95/Zperm is an example of this technique. Since the virus body is
not constant, string based detection is not possible. Consider the following
piece of code without any jump instructions:
instruction 1 ; entry point
instruction 2
instruction 3
.
.
.
instruction n
In later generations the engine modifies the virus body by inserting jump
instructions at random positions, as shown below.
instruction 2
jump 3
instruction 4
jump n
instruction 1 ;entry point
jump 2
instruction 3
jump 4
.
.
.
instruction n
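The reordered listing above still executes in the original order once the jumps are followed. A minimal sketch of recovering that order is given below; the `("ins", k)` / `("jmp", k)` encoding of the layout and the function name `linearize` are assumptions made for illustration, and only unconditional jumps to known instruction labels are modelled.

```python
def linearize(layout, entry_label):
    """Follow jumps from the entry point; return instruction labels in execution order."""
    # Position of every instruction label inside the permuted layout.
    position = {label: i for i, (kind, label) in enumerate(layout) if kind == "ins"}
    order, seen = [], set()
    pc = position[entry_label]
    while pc < len(layout) and pc not in seen:  # 'seen' guards against jump loops
        seen.add(pc)
        kind, label = layout[pc]
        if kind == "jmp":
            pc = position[label]   # unconditional jump to the labelled instruction
        else:
            order.append(label)
            pc += 1                # fall through to the next slot
    return order


# A five-instruction version of the permuted body shown above.
layout = [
    ("ins", 2), ("jmp", 3), ("ins", 4), ("jmp", 5),
    ("ins", 1), ("jmp", 2), ("ins", 3), ("jmp", 4), ("ins", 5),
]
print(linearize(layout, entry_label=1))  # prints [1, 2, 3, 4, 5]
```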
2.3.3 Equivalent Instruction Substitution
Some malware, such as Win95/Zperm [21] and Win32.Evol [8], use equivalent
instruction substitution as an obfuscation mechanism. In our proposed code
morpher, we use a dictionary of instructions which can be replaced by
equivalent instructions. Instruction replacement can either expand or shrink
the code size of the offspring. Our morpher generally increases the size of the
generated variants. Table 2.2 lists instructions and their equivalent sets of
instructions.
2.3.4 Subroutine Inlining and Outlining
Subroutine inlining is a method in which a call to a subroutine is replaced by
the subroutine's definition. It is a form of program obfuscation which replaces
some or all calls to the subroutine with the code of its body. Code outlining
divides a block of code into one or more subroutines and adds a call to each
newly created subroutine. Figure 2.2 shows an example of subroutine inlining
for two subroutine calls, S1() and S2(), and outlining of code to create a new
subroutine S12().
Figure 2.2: Subroutine Inlining and Subroutine Outlining. (The figure shows the
calls to S1 and S2 replaced by their bodies, and a block of the resulting code
outlined into a new subroutine S12 that is invoked with call S12.)
Table 2.2: Dictionary of equivalent instructions.
Instructions             Equivalent Instructions
ADD REG, -1              NEG REG; NOT REG or NOT REG; NEG REG
ADD REG, 0               NOP
ADD REG, 1               INC REG or NOT REG; NEG REG or NEG REG; NOT REG
AND REG, -1              NOP
XOR REG, -1              NOT REG
XOR MEM, -1              NOT MEM
MOV REG, REG             NOP
SUB REG, IMM             ADD REG, -IMM
SUB MEM, IMM             ADD MEM, -IMM
AND REG, 0               MOV REG, 0
AND REG, REG             CMP REG, 0
JMP REG                  PUSH REG; RET
AND MEM, 0               MOV MEM, 0
XOR REG, REG             MOV REG, 0
SUB REG, REG             MOV REG, 0
OR REG, REG              CMP REG, 0
MOV REG1, REG2           PUSH REG2; POP REG1 or XCHG REG1, REG2
NOP                      PUSHFD; POPFD or PUSHAD; POPAD or PUSH REG; POP REG
XOR REG, 0               MOV REG, 0
XOR MEM, 0               MOV MEM, 0
ADD MEM, 0               NOP
OR REG, 0                NOP
OR MEM, 0                NOP
AND MEM, -1              NOP
TEST REG, REG            CMP REG, 0
LEA REG, [IMM]           MOV REG, IMM
LEA REG, [REG+IMM]       ADD REG, IMM
LEA REG1, [REG2]         MOV REG1, REG2
LEA REG1, [REG1+REG2]    ADD REG1, REG2
Subroutine Permutation: Some metamorphic viruses make use of permutation
of subroutines. If the virus code consists of n subroutines, it is possible to
have n! generations (for example, 5 subroutines can be ordered in 5! = 120
ways). Figure 2.3 shows a few permutations of virus code consisting of 5
subroutines.
Figure 2.3: Subroutine Permutation
2.3.5 Independent Instruction Permutation
Transposition, or instruction permutation, modifies the execution order of
instructions that are not interdependent. Consider two instructions op R1, R2
followed by op R3, R4. These two instructions can be swapped provided R1, R2,
R3 and R4 are all different. For example, the instructions mov ecx, imm and
inc eax are not interdependent, hence they can be swapped.
...
mov ecx, imm
inc eax
.....
is equivalent to
...
inc eax
mov ecx, imm
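The swap condition described in this section (no register in common) can be sketched as a simple check. This is an illustrative simplification only: the register list and the names `registers_used` and `can_swap` are assumptions, and the check deliberately ignores flags, memory aliasing and implicit operands (for example, `mul` implicitly using `eax` and `edx`).

```python
# Assumed set of 32-bit general-purpose registers for this sketch.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}


def registers_used(instruction):
    """Return the set of registers named explicitly in the operands."""
    cleaned = instruction.lower()
    for ch in ",[]+":
        cleaned = cleaned.replace(ch, " ")
    tokens = cleaned.split()
    return {tok for tok in tokens[1:] if tok in REGISTERS}


def can_swap(first, second):
    """True when the two instructions share no explicit register operand."""
    return registers_used(first).isdisjoint(registers_used(second))
```

With these definitions, `can_swap("mov ecx, imm", "inc eax")` is true, while `can_swap("mov eax, ebx", "add ebx, 1")` is false because both instructions name `ebx`.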
2.4 Detection Techniques
Malware detection deals with the different mechanisms for filtering out
malicious programs. The detection mechanisms can be broadly classified as
static, dynamic and heuristic methods.
2.4.1 Static Detection
Static analysis deals with the detection of malcode without executing it on a
computer system. The disassembled code is scanned for maliciousness by examining
the import address table (IAT), opcode patterns or byte n-grams. Signatures in
the form of byte patterns are extracted from each malicious sample and checked
against a repository. Static detection mechanisms using control flow graphs as
signatures are also used to flag maliciousness.
The main advantage of static detection is that the system is not infected by
the malcode. The detection approach is fast, as only surface scanning of the malware
program is performed. This method fails to detect encrypted malware, as the
actual malicious payload is released only during execution.
2.4.2 Dynamic Detection
Dynamic analysis is used to mine maliciousness by executing malware samples in
a controlled environment. The controlled environment is used so as to keep the
host machine unaffected. Dynamic analysis is particularly useful when dealing
with encrypted malware. Code emulation can lead to correct detection, but when
used alone it may sometimes fail, because decryption can consume much of the
available scanning time. In order to thwart detection, some malware use
multiple jump instructions to defeat dynamic scanners.
2.4.3 Heuristic Detection
Heuristic detection mechanisms can be used along with static or dynamic
techniques. The scanner primarily uses heuristics for detecting unseen malware
samples. Some of the heuristics for detecting malicious code are (a) presence
of the entry point in the last section, (b) suspicious section names, (c) large
data sections or (d) a small import table. Heuristic detectors are prone to
many false alarms, where benign samples are incorrectly identified as malware.
Chapter 3
Bioinformatics Techniques
Bioinformatics is the application of computer science to biological data. In
bioinformatics, biological information is extracted to gain a better
understanding of different biological species. Sequence alignment is an
elementary method used in any biological study to compare two or more
biological sequences (protein or DNA). The alignment method attempts to find
regions of high similarity, over whole sequences or parts of them, to deduce
evolutionary relationships among sequences. Metamorphic malware, like proteins
or nucleotides, have some fragments of code which are inherited from their base
malware. These segments of code are only partially subject to change from one
generation to the next. Malcode is transformed by a metamorphic engine to
conceal the malicious payload so that maliciousness is not revealed.
Fundamentally, code obfuscation is performed by the metamorphic engine to
thwart detection.
The structure of metamorphic variants differs, but they share common
functionality. The difference between variants of the same base malware cannot
be too large; hence, techniques used in bioinformatics can be applied for their
detection. The opcode sequence in malware can be thought of as analogous to the
genes in DNA. The size of the metamorphic engine is usually small to hide it
from detection. Each malware sample is represented as a sequence of mnemonics
(opcode sequence) without considering the operands. Initially the approach
might appear to be trivial, but metamorphic malware variants cannot undergo
total transformation. Our assumption is that some opcode(s) may be replaced
with equivalent opcode(s), but a complete change is impossible if functionality
is to be preserved. It can be inferred that variants preserve some base malicious
code which is transformed by the engine to produce new variant(s). Thus,
opcode sequences are arranged using sequence alignment techniques in order to:
• determine similarity amongst malware samples;
• explore frequently occurring patterns in a family of malware (these patterns
depict maliciousness);
• store, retrieve and compare malicious opcode sequences.
The basic approaches to sequence alignment can be broadly categorized as:
1. Global Sequence Alignment
2. Local Sequence Alignment
The global alignment technique aligns sequences over their complete length.
This method is particularly useful when the sequences are of more or less
similar length. Local sequence alignment, on the other hand, attempts to
compare segments of all possible lengths to optimize the similarity measure.
Local alignment is mainly used when the query sequences have dissimilar sizes.
Multiple sequence alignment (MSA) is another form of alignment, used to align
three or more sequences. MSA is used to identify conserved sequence regions
across a group of sequences. In this work, progressive MSA is implemented using
the evolutionary relationship among sequences. The following sections introduce
the sequence alignment methods (global, local and MSA).
3.1 Global Alignment
Global alignment is used to align sequences end to end. Figure 3.1 shows the
global alignment of two DNA sequences X and Y, including the matches,
mismatches and gaps introduced by global alignment methods. Two well known
methods of global alignment are (a) Needleman-Wunsch and (b) Levenshtein or
edit distance. These methods are briefly discussed in the following
subsections.
Figure 3.1: Global Alignment for DNA Sequences
3.1.1 Needleman-Wunsch Method
The Needleman-Wunsch method [20] determines the globally optimal alignment
between two sequences X and Y. The basic steps involved in aligning opcode
sequences are:
• Initialization: A score matrix and a traceback matrix, each of size
(M + 1) × (N + 1), are created, where M and N are the lengths of the two
sequences. Let the score and traceback matrices be S(M + 1, N + 1) and
T(M + 1, N + 1). The cell S(0, 0) is set to 0 and the rest of the first row and
first column is filled using the boundary conditions given below.
• Populate Score Matrix: The score of each cell S(i, j) is determined by the
scores of its three neighbouring cells (top, diagonal and left). While the
score matrix is filled, the traceback matrix is populated with the directions
left (L), diagonal (D) and up (U). Each entry of the traceback matrix records
the direction of the neighbouring cell that yields the maximum value for the
new cell S(i, j). Thus, S(i, j) is computed as follows:
$$S(i, 0) = \delta \cdot i$$
$$S(0, j) = \delta \cdot j$$
$$S(i, j) = \max\big(S(i-1, j-1) + \sigma(X[i], Y[j]),\; S(i-1, j) + \delta,\; S(i, j-1) + \delta\big)$$
where $\sigma(X[i], Y[j])$ is the match/mismatch score for aligning the
characters X[i] and Y[j], and $\delta$ is the gap penalty.
• Traceback: The traceback step recovers the alignment from the traceback
matrix. Traceback starts at the bottom-right cell T(M + 1, N + 1) and continues
until the first row or column is encountered. Each cell with direction D
depicts an aligned pair of symbols, and cells with direction L or U depict a
gap introduced in one of the sequences.
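The three steps above (initialization, score matrix population and traceback) can be sketched in Python as follows. This is an illustrative implementation under stated assumptions rather than the code used in this work: the scoring values (`match=1`, `mismatch=-1`, `gap=-1`) are hypothetical defaults, the first row and column are filled with multiples of the gap penalty, and opcode sequences are represented as Python lists of mnemonics.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Globally align sequences x and y; return (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    trace = [[None] * (n + 1) for _ in range(m + 1)]
    # Boundary cells: aligning a prefix against an empty sequence costs gaps.
    for i in range(1, m + 1):
        score[i][0], trace[i][0] = gap * i, "U"
    for j in range(1, n + 1):
        score[0][j], trace[0][j] = gap * j, "L"
    # Fill the score matrix and remember which neighbour gave the maximum.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            best = max(diag, up, left)
            score[i][j] = best
            trace[i][j] = "D" if best == diag else ("U" if best == up else "L")
    # Traceback from the bottom-right cell to the top-left cell.
    aligned_x, aligned_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        direction = trace[i][j]
        if direction == "D":
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif direction == "U":
            aligned_x.append(x[i - 1]); aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-"); aligned_y.append(y[j - 1])
            j -= 1
    return score[m][n], aligned_x[::-1], aligned_y[::-1]
```

For example, aligning `["mov", "push", "add", "pop"]` with `["mov", "add", "pop"]` inserts a single gap opposite `push` in the shorter sequence.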
3.1.2 Levenshtein distance
The Levenshtein distance, also known as edit distance, is an approximate
string matching algorithm used to find the occurrence of a pattern, or a
substring of it, in a text. This method is used to determine the similarity
between two sequences. Edit distance is the minimum number of operations
required to transform one opcode sequence into the other. One of the common
ways of implementing the edit distance method is a dynamic programming
approach. The Levenshtein distance algorithm for two strings string1 and
string2, of length m and n, is shown below:
1. Create a distance matrix consisting of m + 1 rows and n + 1 columns.
2. Initialize the first column as [0 ... m] and the first row as [0 ... n].
3. For each pair of symbols string1[i] and string2[j]:
• If string1[i] = string2[j], the cost is 0.
• If string1[i] != string2[j], the cost is 1.
• The value of cell distanceMatrix[i, j] is the minimum of
distanceMatrix[i-1, j] + 1, distanceMatrix[i, j-1] + 1,
or distanceMatrix[i-1, j-1] + cost.
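The dynamic programming steps listed above can be written directly in Python. The sketch below is illustrative; it works on any two sequences, so opcode sequences can be passed as lists of mnemonics, and the function name `levenshtein` is chosen here for convenience.

```python
def levenshtein(string1, string2):
    """Minimum number of insertions, deletions and substitutions between two sequences."""
    m, n = len(string1), len(string2)
    # (m + 1) x (n + 1) matrix; row 0 and column 0 represent the empty prefix.
    distance = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        distance[i][0] = i
    for j in range(n + 1):
        distance[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if string1[i - 1] == string2[j - 1] else 1
            distance[i][j] = min(
                distance[i - 1][j] + 1,         # deletion
                distance[i][j - 1] + 1,         # insertion
                distance[i - 1][j - 1] + cost,  # match or substitution
            )
    return distance[m][n]
```

For instance, `levenshtein("kitten", "sitting")` is 3, and two opcode lists that differ by one inserted `push` have distance 1.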
3.2 Local Alignment
Smith-Waterman [22] is a local sequence alignment method which can be used to
align sequences of arbitrary length. The score and traceback matrices of the
Smith-Waterman method are computed in the same way as in the Needleman-Wunsch
method, except that zero is included in the maximum to prevent negative
similarity values. A cell value of zero indicates no similarity.
For any two sequences X and Y the score matrix is populated using the equation
given below:
$$S(i, j) = \max\big(S(i-1, j-1) + \sigma(X[i], Y[j]),\; S(i-1, j) + \delta,\; S(i, j-1) + \delta,\; 0\big)$$
where S is the score matrix, $\sigma$ is the match/mismatch score and $\delta$
represents the gap penalty. The region of highest similarity is found by
locating the maximum score in the score matrix. The aligned sequences are
retrieved by reading the traceback matrix, following the directions starting
from the cell with the maximum value. Figure 3.2 depicts the local alignment of
DNA sequences.
Figure 3.2: Local Alignment for DNA Sequences
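A minimal Python sketch of the local alignment just described is given below. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration, and instead of storing a separate traceback matrix the sketch recomputes which neighbouring cell produced each score while walking back from the maximum.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Locally align x and y; return (best_score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
                0,  # never let the similarity go negative
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Walk back from the best cell until a zero cell is reached.
    aligned_x, aligned_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_x.append(x[i - 1]); aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-"); aligned_y.append(y[j - 1])
            j -= 1
    return best, aligned_x[::-1], aligned_y[::-1]
```

Given `["nop", "mov", "push", "call", "ret"]` and `["xor", "mov", "push", "call", "jmp"]`, the sketch reports the shared region `mov push call`.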
3.3 Multiple Sequence Alignment Method
The multiple sequence alignment (MSA) method is used to align more than two
sequences at a time. An MSA can be built up by repeatedly applying global or
local alignment to two sequences and then aligning the resulting alignments
with further sequences. In the proposed methodology (MOMENTUM), MSA in
particular is used to determine related functional and structural aspects of
opcode sequences in terms of signature(s).
Given a set of k malware samples with opcode sequences M1, M2, ..., Mk, gaps
are inserted while aligning the opcode sequences so that all opcode sequences
have the same length. In this way similar opcode sequences are conserved and
the number of gaps is minimized. Figure 3.3 depicts the MSA of five malware
sequences. Two common methods of implementing MSA are:
1. Iterative method
2. Progressive alignment method
The iterative method repeatedly realigns the initial sequences while adding new
sequences to the growing MSA. The second and most widely used method of
building an MSA uses a heuristic-based progressive technique.
Figure 3.3: Multiple aligned opcode sequences corresponding to malware samples.
3.3.1 Iterative Method
The iterative alignment method builds an initial alignment of the sequences.
It is primarily used to improve the overall alignment score. A tree is created
which depicts the order in which nodes are aligned. The tree is read bottom-up,
repeatedly aligning sequences until the root node is visited, which gives the
complete alignment for a family. The main advantage of the iterative alignment
method is that it is fast and scales to a large number of sequences. Its
limitation is that a misalignment, once made, is preserved and propagated to
all sequences.
3.3.2 Progressive Alignment
Progressive alignment (also known as the hierarchical or tree method) builds up
a final MSA by combining pairwise alignments, beginning with the most similar
pair and progressing to the most distantly related.
The progressive alignment method identifies the most similar instances and
aligns them first. Successively less similar instances are then added to the
initial alignment. This process is repeated until the combined alignment of the
opcode sequences of a malware family is obtained. ClustalW [23] is a
progressive alignment technique based on a dynamic programming (DP) approach.
Figure 3.4 shows the aligned sequences obtained using the progressive alignment
method.
Figure 3.4: Progressive Alignment
The basic progressive alignment approach involves three steps:
• Compute Distance Matrix using pairwise alignment for all pairs of malware
sequences in a family.
• Construct Phylogenetic Tree using the distance matrix as a heuristic. A
phylogenetic tree illustrates the evolutionary relationship among various
biological species. Figure 3.5 depicts a phylogenetic tree for five different
sequences. In this figure, each set of closely related sequences has a common
root node. The Neighbour-Joining (NJ) [24] method is used to construct the
tree. The phylogenetic tree is used as a guide tree that defines the order in
which the sequences are aligned in the next step.
Figure 3.5: Phylogenetic tree.
• Construct MSA by traversing the guide tree bottom-up, aligning opcode
sequences according to their evolutionary relationship, with the most similar
ones aligned first, followed by the less similar instances.
Chapter 4
Metamorphic Malware
Exploration Technique Using
MSA (MOMENTUM)
Metamorphic malware has the ability to modify and replicate itself. It is
equipped with a metamorphic engine which generates variants using code
obfuscation techniques. The opcode sequence which represents maliciousness is
transformed by the metamorphic engine to obscure the infection mechanism.
Sequence alignment methods can be used to determine the conserved regions of
opcodes which are similar across opcode sequences. The mismatches can also be
analyzed to determine the semantic equivalence of instructions. In this
chapter, we discuss the applicability of various sequence alignment methods in
the different phases of the proposed Metamorphic Malware Exploration Technique
Using MSA (MOMENTUM) for detection and classification of malware executables.
Figure 4.1 briefly outlines the implemented method.
4.1 Data acquisition
Experiments are conducted on malware and benign samples in Portable Executable
(PE) [25] format. The malware samples are collected from varied sources, which
include synthetic malware created using virus kits such as NGVCK, MPCGEN, G2
and PSMPC, and real malware collected from VX Heavens and user agencies.
Gathered malware samples are scanned using 14
Figure 4.1: Brief Outline of Method for Metamorphic Malware Detection
antiviruses (trial versions) and classified into different families. Benign
samples are collected from the System 32 folder of a fresh installation of the
Windows XP operating system. Some benign samples are collected from different
sites and include games, browsers, media players etc. Each benign sample is
also scanned using the antiviruses.
Since most of the malware collected is packed, samples are unpacked using
signature based unpackers like PEiD and GUNPacker [3] and dynamic unpackers
like EtherUnpack. The details of unpacking are discussed in Appendix A.
Table 4.1 gives the description of the data set used in the experiment.
Table 4.1: Dataset Description
TYPE            SOURCE                           NO. FAMILIES    NO. SAMPLES
Synthetic       NGVCK, G2, PSMPC, MPCGEN         46              1051
Real Malware    User Agencies, VX Heavens        57              1330
Benign          System 32, Cygwin, games etc.    1               1064
4.2 Analysis of metamorphism in Tools/Real
malware
In the proposed work, the metamorphism amongst malware samples generated with
various constructors is analyzed. A similar experiment is conducted on real
malware samples collected from VX Heavens and user agencies. Initially,
pairwise alignment is computed for all opcode sequences of the malware samples
using global and local alignment methods. Two types of analysis are performed:
(a) intra family and (b) inter family analysis. From the intra family pairwise
alignment we obtain the distances between samples, a base file, and the opcode
sequence alignments between the malware samples. The average distance of
samples in a family is computed, which is useful for investigating the degree
of metamorphism in a family of malware. From the opcode sequence alignments we
can determine the types of instructions contributing to obfuscation. Inter
family pairwise alignment between the base
Figure 4.2: Method for Investigation of Metamorphism.
malware is performed to determine whether different malware families overlap.
Figure 4.2 depicts the method of identifying metamorphism in synthetic and real
malware. It is also observed that in most cases mov, push and pop instructions
are used.
4.2.1 Type of obfuscation
Metamorphic engines use instruction substitution or permutation as a means of
obfuscation. The affected opcode sequence appears as a mismatch or gap in the
alignment and depicts a point of mutation. In malware families, both single and
multiple instruction replacements are usually observed. These replacements are
incorporated by the metamorphic engine while maintaining the functionality of
the variants of a family, in order to evade detection. Table 4.2 lists some of
the instructions used for obfuscation in the collected malware samples.
4.2.2 Identification of Base Malware
The Sum of Pair (SOP) alignment method computes the pairwise alignment between
every pair of opcode sequences. Three sequences can be aligned at a time by
constructing a cube-like structure. This method imposes constraints on the
system with respect to memory and space utilization.
Table 4.2: Replacement of opcodes for malware generators (NGVCK, G2, PSMPC,
MPCGEN). For all generators mov, push, pop and jump instructions are replaced.
NGVCK       G2             PSMPC       MPCGEN
add mov     int call       jnz loop    mov pop
push mov    mov pop        -           cmp mov
mov pop     lea mov        -           int mov
call mov    xor cwd        -           mov lea
mov sub     mov movsb      -           jmp int
push add    rep movsb      -           call add
mov xor     xor mov        -           add movsw
and mov     cwd mov        -           lea jmp
mov jz      int inc        -           movsw mov
mov cmp     movsb movsw    -           push pop
To align three sequences the running time complexity is $(2^3 - 1)n^3$, or
$O(n^3)$. In general, for $k$ sequences the running time complexity is
$O((2^k - 1)n^k)$, or $O(2^k n^k)$. Thus, alignment between two sequences can
be extended to k sequences, but the running time increases exponentially.
The Star Sequence Alignment method is used to align multiple sequences. In this
method a malware sample Mc is selected as the central or base file. Then, the
optimal alignment of every instance Mi with Mc is computed, and each new sample
is aligned with the base file by inserting gaps, finally forming a multiple
aligned sequence. Figure 4.3 depicts the pairwise alignment of the samples and
the selection of the central file using the Sum of Pairs method.
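The selection of the central (base) file can be sketched as choosing the sample whose summed pairwise distance to all other samples is smallest. The sketch below assumes edit distance as the pairwise measure, which may differ from the scoring used in this work; the names `edit_distance` and `select_center` and the example sequences are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two opcode lists, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, op_a in enumerate(a, 1):
        cur = [i]
        for j, op_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (op_a != op_b)))
        prev = cur
    return prev[-1]


def select_center(sequences):
    """Return (center_name, totals) where totals[name] is the sum of distances to all others."""
    totals = {
        name: sum(
            edit_distance(seq, other)
            for other_name, other in sequences.items()
            if other_name != name
        )
        for name, seq in sequences.items()
    }
    return min(totals, key=totals.get), totals


# Hypothetical family of three variants.
family = {
    "M1": ["mov", "push", "call", "ret"],
    "M2": ["mov", "push", "add", "call", "ret"],
    "M3": ["mov", "push", "add", "add", "call", "ret"],
}
center, totals = select_center(family)
```

In this toy family M2 lies between the other two variants and is therefore selected as the center; each remaining sample would then be aligned pairwise against it.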
4.3 Signature Modeling and Testing
In this phase of the method, signature(s) are extracted from the data set.
The data set is initially partitioned into a training set and a test set.
Signatures for each family are extracted from the MSA of that family. Figure 4.4
depicts the phases involved in modeling the signatures.
4.3.1 Single Signature
The opcode sequences corresponding to each malware family are aligned using
MSA. From each row of the aligned MSA sequences, an opcode that appears
Figure 4.3: Malware samples arranged in star-like fashion, with M2 as the base
sample, M1 the closest and M5 the farthest sample from the base. The closest
sample is the most similar to the base malware sample.
Figure 4.4: Signature Modelling and Testing
in at least 60% of the samples in a row is preserved. The combination of all
such opcodes from all rows of the MSA is considered as the single signature for
a family. Figure 4.5 shows a single signature extracted from the MSA of opcode
sequences.
Figure 4.5: Extraction of single signature.
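The extraction of a single signature can be sketched as a consensus over aligned positions. In the sketch below each aligned sample is a list of equal length, `"-"` marks a gap, and each aligned position is visited in turn; the 60% cut-off comes from the text, while the function name `single_signature`, the gap symbol and the decision to skip positions where no opcode reaches the cut-off are assumptions.

```python
from collections import Counter


def single_signature(aligned, threshold=0.6, gap="-"):
    """Keep, per aligned position, the opcode present in at least `threshold` of the samples."""
    signature = []
    for position in zip(*aligned):  # one tuple per aligned position
        opcode, count = Counter(position).most_common(1)[0]
        if opcode != gap and count / len(position) >= threshold:
            signature.append(opcode)
    return signature


# Hypothetical MSA of five variants of one family.
msa = [
    ["mov", "push", "-",   "call", "ret"],
    ["mov", "push", "add", "call", "ret"],
    ["mov", "pop",  "add", "call", "-"],
    ["mov", "push", "-",   "jmp",  "ret"],
    ["xor", "push", "-",   "call", "ret"],
]
signature = single_signature(msa)
```

Here the third position is dominated by gaps and is dropped, so the resulting signature is `mov push call ret`.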
4.3.2 Group Signature
Each malware family is subdivided into a number of smaller groups based on the
phylogenetic tree. All samples which are close in terms of distance are grouped
to form a subgroup. A subgroup may contain two or more samples; its opcode
sequences are aligned using MSA and a single signature is extracted for each
subgroup. Thus, for k subgroups we obtain k signatures. An MSA of the k
signatures is then created and a wildcard based signature is retained. This
signature is also referred to as the group signature. The main advantage of
representing the group signature with wildcards is that it saves time during
the testing phase; otherwise a test sample would need to be checked against i
prominent signatures out of the k subgroup signatures, where i < k. Figure 4.6
shows the wildcard representation of a group signature, where Mt is the malware
test sample.
Figure 4.6: Wildcard based representation of Group signature.
4.3.3 Testing
The last module of MOMENTUM determines the family to which an unseen sample
(malware or benign) belongs. This is determined by aligning the test sample
against the single and group signatures of each family. The unseen sample is
said to belong to a family if aligning it with the family's signature(s) gives
a high score or a low distance.
A threshold is determined for each malware family, and samples in the test set
are detected using the extracted signature(s). To compute the threshold
corresponding to a family, both malware and benign samples in the training set
are considered. Each variant and each benign sample is matched with the
signature(s) and a score is determined. A higher score represents a closer
match with a signature. The threshold th for a family is determined as follows:
$$th = \frac{B_{max} + M_{min}}{2}$$
where $B_{min}$ and $B_{max}$ are the minimum and maximum scores of benign
samples against the signature(s). Similarly, $M_{min}$ and $M_{max}$ are the
lowest and highest scores of malware samples against the signature(s). A test
sample t is considered benign if the score obtained by aligning it with the
signature is less than the threshold th; otherwise the sample is flagged as
malware.
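The threshold rule above amounts to taking the midpoint between the highest benign score and the lowest malware score observed on the training set. A minimal sketch follows; the function names and the example scores are hypothetical and serve only to illustrate the arithmetic.

```python
def family_threshold(benign_scores, malware_scores):
    """th = (B_max + M_min) / 2 for one malware family."""
    return (max(benign_scores) + min(malware_scores)) / 2


def classify(score, threshold):
    """Flag a test sample as malware unless its score is below the threshold."""
    return "benign" if score < threshold else "malware"


# Hypothetical training scores against one family's signature.
benign_scores = [10, 25, 40]
malware_scores = [70, 85, 90]
th = family_threshold(benign_scores, malware_scores)  # (40 + 70) / 2 = 55.0
```

With these example scores a test sample scoring 30 would be labelled benign, while one scoring exactly 55 would be flagged as malware, since only scores strictly below th count as benign.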
Chapter 5
Result and Inferences
The experiments are performed on a machine with an Intel Core i7 870 processor
and 8GB RAM. Tools such as the IDA Pro disassembler, GUNPacker and Ether are
installed on the machine and used for (a) analysis of packed executables and
(b) disassembly of code. The data set consists of synthetic and real malware
families. Malware samples are collected from the VX Heavens repository and user
agencies, and some have been constructed using malware constructors such as
NGVCK (Next Generation Virus Construction Kit), G2, PSMPC and MPCGEN. The
different phases of the experiments are as follows.
1. Dataset preparation: The collected samples of malware and benign
executables are scanned using 14 antiviruses. Using the scan reports of
the antiviruses, the malware executables are separated into different
families. The entire data set is divided into two parts, one for training
and the other for testing. Executables are disassembled using the IDA Pro
disassembler to obtain their assembly code, and mnemonics are extracted
from the assembly representation of each malicious/benign file.
2. Validation of obfuscation: From each representative malware family
a central or base file is selected. Sequence alignment techniques are
applied within the family to obtain an alignment for each pair of samples.
The alignments depict points of match and mutation. The total number
of mutations in the malware dataset is estimated.
3. Metamorphism in Malware Tools: Inter-family pairwise analysis
is performed amongst the base samples selected for each family. If the
distance between any two base malware samples is very small, the families
are considered to overlap.
4. Signature Modelling: Two types of signature are extracted from
the MSA of each malware family. These signatures are referred to as (a)
single and (b) group. A training model is prepared with the malware and
benign samples in the dataset, and a threshold is determined for each
malware family. Unseen samples (of the test set) are tested using the
threshold determined during training, and the evaluation metrics are
computed.
5.1 Evaluation Metrics
Experimental results are evaluated using the metrics TPR, TNR, FPR, and
FNR. These metrics are computed from the True Positives (TP), True
Negatives (TN), False Positives (FP), and False Negatives (FN). TP is
the number of malware samples correctly classified as malware, TN is the
number of correctly classified benign instances, FP is the number of benign
samples incorrectly classified as malware, and FN is the number of malicious
samples classified as benign.
The performance of any detector/scanner can be measured primarily by
checking the True Positive Rate (TPR) and the True Negative Rate (TNR),
which are also known as sensitivity and specificity, respectively.
1. True Positive Rate (TPR):
$TPR = \frac{TP}{TP + FN}$
2. False Positive Rate (FPR):
$FPR = \frac{FP}{FP + TN}$
3. True Negative Rate (TNR):
$TNR = \frac{TN}{TN + FP}$
4. False Negative Rate (FNR):
$FNR = \frac{FN}{FN + TP}$
For a protection system, high values of TPR and TNR along with low values
of FPR and FNR are required. This ensures that the scanner is capable
of correctly identifying samples as malware or benign.
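The four rates follow directly from the confusion-matrix counts. The sketch below computes them for a set of made-up counts; the function name and the numbers are illustrative only and are not results from this work.

```python
def detection_rates(tp, tn, fp, fn):
    """Compute TPR, FNR, TNR and FPR from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # sensitivity
        "FNR": fn / (fn + tp),
        "TNR": tn / (tn + fp),  # specificity
        "FPR": fp / (fp + tn),
    }

# Hypothetical counts: 100 malware and 100 benign test samples.
rates = detection_rates(tp=90, tn=80, fp=20, fn=10)
for name, value in rates.items():
    print(f"{name} = {value:.2f}")
```

Since TPR + FNR = 1 and TNR + FPR = 1, reporting one rate from each pair is sufficient; all four are listed here only to mirror the definitions above.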
5.2 Intra Family Analysis
Figure 5.1 shows the intra-family analysis for malware constructors.
Figure 5.1: Intra Family Analysis of malware (Synthetic and Real).
From the graph we can observe the following:
Non-zero values indicate the presence of metamorphism in the synthetic data.
The Levenshtein distance is high due to junk code insertion.
In spite of high values of global distance, local distances are low in most
of the samples. This indicates the presence of similar regions in the code.
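The effect of junk code insertion on the Levenshtein distance can be seen with a small sketch. It assumes that each sample is represented as a list of opcode mnemonics and that insertions, deletions, and substitutions each cost 1; the opcode sequences are made up for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two opcode sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

# A base sample and a hypothetical variant with three junk "nop" opcodes.
base = ["push", "mov", "add", "pop", "ret"]
variant = ["push", "nop", "mov", "nop", "add", "pop", "nop", "ret"]

print(levenshtein(base, base))     # 0
print(levenshtein(base, variant))  # 3
```

Every inserted junk instruction adds one edit, so the distance grows with the amount of junk code even though the original instructions are all still present in the variant.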
5.3 Inter Family Analysis
Inter family analysis is performed by comparing the base samples of different
families. Figure 5.2 shows inter family analysis of malware families.
Figure 5.2: Inter Family Analysis of malware (Synthetic and Real).
The inter-family distance is less than the intra-family distance. This
indicates that most of the malware share some base code and could be
detected using a common signature.
The Levenshtein distance is relatively high in comparison with the local and
Needleman-Wunsch alignments because of the variable functionality of the
code, which increases the number of gaps in the alignment.
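The difference between global (Needleman-Wunsch) and local (Smith-Waterman) alignment scores can be sketched as follows. The scoring scheme used here (match +1, mismatch -1, gap -1) is an assumption made for illustration and is not necessarily the scheme used in the experiments; the opcode sequences are also made up.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

# A short routine and a longer sample that embeds it among other code.
routine = ["push", "mov", "add", "pop"]
sample = ["xor", "inc", "push", "mov", "add", "pop", "dec", "ret"]

print(global_score(routine, sample))  # 0: four matches offset by four gaps
print(local_score(routine, sample))   # 4: the shared region matches fully
```

The global score is pulled down by the gaps needed to cover the surrounding code, whereas the local score reflects only the shared region. This is why samples with differing functionality can look distant globally and still share a detectable common region.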
5.4 Comparative Analysis
This section presents a comparative analysis of the different types of
samples based on (a) alignments per sample, (b) average sum of distance,
and (c) degree of obfuscation (refer to Table 5.1).
Table 5.1: Comparative Analysis of Malware Samples
Virus Type    Replacement   Avg. SoD   Obfuscation   Alignment
NGVCK         47            1.03       Average       Simple
G2            3             1.45       Low           Simple
MPCGEN        31            0.61       Average       Simple
PSMPC         1             1.35       Low           Weak
Vx Heavens    122           8.3        Large         Complex
Viruses generated using the tools belong to the same family.
Families of real malware are distinct.
In PSMPC, loop and jump instructions contribute to obfuscation, which
increases the distance between samples.
NGVCK viruses overlap with real malware (Savior).
mov, add, sub, push and pop are the instructions most often replaced with
equivalent instructions.
Obfuscation is primarily single-instruction replacement rather than
multiple-instruction replacement. This is validated by observing the global
and local alignments of the samples. The types of mismatch in the global and
local alignments are the same, suggesting less complex obfuscation.
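Counting single-instruction replacements from a pairwise alignment can be sketched as below. The sketch assumes the two samples are already aligned to equal length with "-" marking a gap, and that a replacement is any column where both sides hold different opcodes. The aligned sequences are made up and the helper name is hypothetical.

```python
from collections import Counter

GAP = "-"

def replacement_counts(aligned_a, aligned_b):
    """Count (original, replacement) opcode pairs in mismatch columns.

    Columns containing a gap are insertions or deletions and are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    counts = Counter()
    for x, y in zip(aligned_a, aligned_b):
        if x != y and GAP not in (x, y):
            counts[(x, y)] += 1
    return counts

# Hypothetical alignment of a base sample with one variant.
aligned_base = ["push", "mov", "add", "-", "pop"]
aligned_variant = ["push", "lea", "sub", "nop", "pop"]

counts = replacement_counts(aligned_base, aligned_variant)
print(sum(counts.values()))  # 2 replacements in total
print(counts[("mov", "lea")], counts[("add", "sub")])  # 1 1
```

Running the same count on the global and on the local alignment of a pair, and comparing the resulting mismatch types, is one way to perform the validation described above.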
5.5 Testing with Signature
The malware samples are separated into a number of families using the
scanner reports. For each malware family two types of signature (single and
group) are extracted. The single signature is the maximally preserved opcode
sequence in the multiple sequence alignment of a family of malware. Each row
of the MSA depicts matches, mismatches, and gaps corresponding to the opcode
sequences. The group signature is the wildcard representation of the
signatures of the subfamilies in a family. Table 5.2 shows the values of the
evaluation metrics for the different types of signature.
Table 5.2: Evaluation Metrics for different types of signatures.
Types of Signatures   TPR    FNR     TNR    FPR
Single                0.95   0.046   0.48   0.52
Group                 0.73   0.27    0.99   0.01
It is observed that, with the single signature, the detection rate is
approximately 95% with an FPR of 52% (Table 5.2). This indicates that most
of the malware samples are detected but many benign samples are incorrectly
classified as malware. Since the single signature is constructed by
extracting the maximally preserved (55%) opcodes in the MSA, the opcodes
responsible for mutations are lost in the signature (they appear to be less
dominant). Thus, most of the benign samples in the test set score well
with the signature and are detected as malware.
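One possible reading of "maximally preserved (55%) opcodes" is a column-wise consensus over the MSA. The sketch below assumes that a column contributes its most frequent opcode to the signature only if that opcode occurs in at least 55% of the rows, and that columns dominated by gaps are dropped. This interpretation, the function name, and the example rows are assumptions, not the documented MOMENTUM procedure.

```python
from collections import Counter

GAP = "-"

def single_signature(msa_rows, min_fraction=0.55):
    """Consensus opcode sequence over the columns of an MSA.

    Assumption: a column is kept when its most frequent symbol is an
    opcode (not a gap) present in at least min_fraction of the rows.
    """
    signature = []
    for column in zip(*msa_rows):
        opcode, count = Counter(column).most_common(1)[0]
        if opcode != GAP and count / len(column) >= min_fraction:
            signature.append(opcode)
    return signature

# Made-up MSA of four variants of one family.
msa = [
    ["push", "mov", "add", "-",   "pop"],
    ["push", "mov", "sub", "nop", "pop"],
    ["push", "lea", "add", "-",   "pop"],
    ["push", "mov", "xor", "-",   "pop"],
]
print(single_signature(msa))  # ['push', 'mov', 'pop']
```

In this example the mutated column (add/sub/xor) falls below the 55% cut-off and disappears from the signature. What remains is a generic push/mov/pop pattern that ordinary benign code is also likely to contain, which is consistent with the high false positive rate reported for the single signature.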
In the case of the group signature, a detection rate of 73% is obtained with
a very low false positive rate (FPR = 0.01). This indicates that malware
samples in the test set are detected by the wildcard representation of the
signature. The group signature is the wildcard representation of the
signatures of the subfamilies of a family. The opcode sequences present in
this signature are absent in benign samples; thus, benign samples can be
discriminated from the malware samples.
5.6 Comparative Analysis with Antiviruses
The entire dataset was scanned using 14 antiviruses and the detection rate
was computed from their scan reports. Figure 5.3 depicts the detection rates
obtained from the antiviruses and from MOMENTUM. The top five detection
rates were obtained with Avast, Avira, AVG, GData, and Kaspersky
(arranged in ascending order of detection rate). It was observed that the
detection rate of MOMENTUM is close to that of the top three commercial
antivirus products. Some of the malicious files (37 malware samples in
total) were not detected by any of the antiviruses.
Figure 5.3: Detection rate of antiviruses compared with the different types
of constructed signature.
Out of the 37 malware executables that went undetected by the antiviruses,
our implementation (MOMENTUM) detected 30 with the single signature
and 20 with the group signature (wildcard signature). The effectiveness of
the method suggests that bioinformatics sequence alignment methods can be
used effectively to detect malware. These methods could also be used for
generating malware signatures and for assisting scanners in detection.
Chapter 6
Conclusions and Future Work
Malicious software (malware) is a major threat to computer systems.
Malware detection mechanisms are gaining prominence amongst researchers and
have become an active topic of research. The number of malware samples has
increased at an alarming rate because malware writers are deploying
obfuscation methods. Non-signature-based detection methods are important,
as malware writers are producing metamorphic or polymorphic malware.
Thus, a strong signature-based method is required to detect this modern
malware.
In this thesis, the problem of detecting metamorphic malware is addressed
using MSA methods. Signature(s) (single and group) for a malware family
are extracted and tested using unseen samples. Metamorphism amongst
malware constructors and real malware is explored. This investigation found
that the malware constructors used minimal obfuscation, mainly single and
multiple instruction replacement. Primarily, the obfuscation found was code
reordering.
The detection rate of the implemented method (MOMENTUM) is also
compared with that of antiviruses. It was observed that unseen samples
were detected using the signatures with low false positives. Also, the
detection rate of the implemented method is comparable with that of
antiviruses such as Avast, Avira, and AVG. Some of the malware executables
that went undetected by all antiviruses were detected by MOMENTUM. As a
continuation of the present work, a suitable scoring scheme could be devised
to identify unseen samples. This could be initiated by assigning weights to
the mnemonic pairs that are responsible for mutation. Also, the operands of
instructions could be considered to improve detection rates.
Appendix A
Executable Unpacking
A packer is a program used to compress or encrypt an executable, thereby
reducing its size and protecting it from reverse engineering. Most packers
depend on a specific file format such as Portable Executable (PE) or
Dynamic Link Library (DLL). A packed executable is restored to its
original form once it is loaded in memory. Malware authors use packers
to avoid detection by antivirus products, as the malicious code is hidden
from the scanners. Basically, a packer can be thought of as software that
places an executable inside another executable. The outer executable is
then responsible for unpacking the original executable hidden by the packer.
The basic function of packers is to encrypt the code, resources, and import
table. Executable packers insert a random number of jump instructions
in order to confuse disassemblers. Advanced packers also encrypt the
Portable Executable (PE) sections so that the antivirus effectively fails to
scan the actual malicious code. Static analysis of packed code is not
possible, as the malicious payload is unpacked only at runtime. An antivirus
that uses a sandbox environment can therefore unpack the executable by
executing each suspicious sample. However, unpacking executables is
computationally expensive. If the malware is analyzed for detection without
being unpacked, we essentially scan the packer code instead of the malicious
executable code. Unpacking can be performed using a generic unpacker such as
GUNPacker [3]. The basic problems with these signature-based unpackers are
(a) packer signatures need to be updated periodically and (b) multiple-layer
packed executables are difficult to detect.
Another way of unpacking software is to use Ether Unpack [4]. The main
problem with Ether is that it requires a dedicated operating system and
hardware. Initially, the sample to be unpacked is executed in the guest
operating system (Windows XP SP2), and Ether tries to locate all memory
writes performed by the executing process. Whenever a memory write
operation is performed, the process dump is stored under the images
directory. Ether considers each memory write operation as a candidate
Original Entry Point (OEP). Figure A.1 depicts the process of unpacking
executables (malware/benign) using signature-based unpackers and Ether
Unpack.
Figure A.1: Portable Executable Unpacking Procedure
A.1 Symptoms of Packed Malicious Executables
Packed PE files can be detected using signature-based, heuristics-based, or
dynamic unpackers. Native and packed malicious code differ in several ways,
which are listed below.
(i.) Nonstandard section names: Most compilers and linkers follow a
convention for naming sections. Executable packers introduce
nonstandard section names such as .upx0, .upx1, etc. in the
packed code.
(ii.) Small Code Section: Packed code contains a small code section with a
populated data section. Disassemblers also expose the code of the stub
instead of the actual code.
(iii.) Missing String Table: The string table or symbol table is used by
most compilers to store the addresses of symbols instead of maintaining
multiple strings in the table. The packer normally encrypts the strings and
inserts a garbage address for each string in the string table.
(iv.) Small Import Table size: A native executable has populated entries
in the Import Address Table (IAT), one for each API. Packed PE samples
have a small import table with few imports of common APIs such as
GetProcAddress or LoadLibrary.
(v.) Execution of Code starts at last section: The PE file is divided
into logical structures called sections, such as data, code, reloc,
etc. Some malware packers hide the original entry point and
add a new section, possibly at the end of all the sections.
(vi.) Section Characteristics: The characteristics are the flags of each
section that describe the permissions allotted to the section. A normal
code section has its characteristics flag set as executable but lacks write
permission. Sections of packed malware either have both execute and write
permissions set or leave the permissions as 0.
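Several of the symptoms listed above can be combined into a simple heuristic check. The sketch below does not parse a real PE file: it assumes that the section names, the section permission flags, the imported API names, and the index of the section containing the entry point have already been extracted into plain Python values. The function name, the dictionary layout, and the cut-off of five imports are all hypothetical choices made for illustration.

```python
# Section names commonly emitted by compilers and linkers for PE files.
STANDARD_SECTION_NAMES = {
    ".text", ".data", ".rdata", ".bss", ".idata",
    ".edata", ".rsrc", ".reloc", ".tls", ".pdata",
}

def packing_symptoms(sections, imports, entry_section_index, min_imports=5):
    """Return the packing symptoms shown by pre-extracted PE metadata.

    sections: list of dicts with keys "name", "executable", "writable".
    imports: list of imported API names.
    min_imports: illustrative cut-off for a "small" import table.
    """
    symptoms = []
    if any(s["name"] not in STANDARD_SECTION_NAMES for s in sections):
        symptoms.append("nonstandard section name")
    if len(imports) < min_imports:
        symptoms.append("small import table")
    if len(sections) > 1 and entry_section_index == len(sections) - 1:
        symptoms.append("entry point in last section")
    if any(s["executable"] and s["writable"] for s in sections):
        symptoms.append("section both writable and executable")
    return symptoms

# Hypothetical metadata resembling a packed sample.
packed_sections = [
    {"name": ".upx0", "executable": True, "writable": True},
    {"name": ".upx1", "executable": True, "writable": True},
]
packed_imports = ["LoadLibraryA", "GetProcAddress"]
print(packing_symptoms(packed_sections, packed_imports, entry_section_index=1))

# Hypothetical metadata resembling a native executable.
native_sections = [
    {"name": ".text", "executable": True, "writable": False},
    {"name": ".data", "executable": False, "writable": True},
]
native_imports = ["CreateFileA", "ReadFile", "WriteFile", "CloseHandle",
                  "ExitProcess", "GetLastError"]
print(packing_symptoms(native_sections, native_imports, entry_section_index=0))
```

No single symptom is conclusive on its own, so a practical check would treat the number of symptoms as evidence to be weighed rather than as a verdict.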
A.2 Manual Unpacking of Packed Executable
Packed PE files can be identified using signature-based packer detectors,
which try to match the packer signature of an executable against the known
packer signatures stored in a repository. Another way to determine whether
an executable is packed is to perform entropy analysis of the suspicious
file. The entropy of the complete file, or of a few bytes from the beginning
of the file, can indicate whether the file is packed. The following steps
are adopted to manually unpack malicious code (refer Figure
(i.) The preliminary step is to identify the type of packer used to pack the
executable. Once the packer is known, we need to locate the original
entry point of the executable by executing the suspicious sample.
(ii.) The executable is loaded in OllyDebugger, a breakpoint is set, and
the program is allowed to execute until it stops at the breakpoint. At this
point the memory dump is retrieved. The memory dump contains both
the unpacked code and the unpacking stub code.
(iii.) The entry point of the dumped executable still points to the starting
address of the packer. Since the unpacked data should be executed first,
followed by the unpacker code, the entry point is calculated as
RVA Entry Point = OEP - Base Address
(iv.) Finally, the import table is reconstructed by specifying the proper RVA Entry
point. This total would reconstruct the import address ta