
Software fingerprinting for automated assembly code analysis Project overview

P. Charland DRDC – Valcartier Research Centre

Defence Research and Development Canada Scientific Report DRDC-RDDC-2015-R027 March 2015

© Her Majesty the Queen in Right of Canada, as represented by the Minister of National Defence, 2015

© Sa Majesté la Reine (en droit du Canada), telle que représentée par le ministre de la Défense nationale, 2015

DRDC-RDDC-2015-R027 i

Abstract

With the revolution in information technology, the dependence of the Canadian Armed Forces (CAF) on their information systems continues to grow. While information systems-based assets confer a distinct advantage, they also make the CAF vulnerable if adversaries interfere with them. Unfortunately, the technology required to disrupt and damage an information system through malicious software (malware) is far less sophisticated and expensive than the investment required to create the system. To understand and mitigate this threat, reverse engineering has to be performed to analyze malware. However, software reverse engineering is a manually intensive and time-consuming process. The learning curve to master it is quite steep and, even once mastered, the process is hindered when anti-reverse engineering techniques are used. This results in the very few available reverse engineers being quickly saturated. This Scientific Report describes new approaches to accelerate the reverse engineering process of malware. The goal is to reduce redundant analysis efforts by automating the identification of code fragments which reuse (i) previously analyzed assembly code or (ii) publicly available open source code.

Significance to defence and security

With today’s proliferation of malware, new analysis techniques which go beyond existing best practices are needed to accelerate the reverse engineering process. The intended audience for this report is thus software reverse engineers and malware analysts, as well as managers and decision makers of organizations which have to reverse engineer third party code. The techniques presented in this report will contribute to providing the CAF with a core capability to respond more effectively and rapidly to cyber threats and lead to a change in their defence capabilities. To defend against cyber attacks, the CAF currently rely on antivirus and intrusion detection systems (IDS). These approaches consist of looking for symptoms such as increased network traffic or anomalous signals, and trying to fight the problems that arise. Although they can be effective in some cases, they do not help in understanding what the real problem is and are ineffective when attacks morph. Providing a new capability for malware reverse engineering, as proposed in this report, will allow the CAF to uncover vulnerabilities that malware exploits. This will also give them the required knowledge to generate defences against future malicious attacks and pre-empt them, even when they morph. This new capability will introduce a paradigm shift in the practices of the CAF in the cyber environment. They will move from a strictly defensive approach, which has shown its limits, to a proactive approach. This proposed approach also has security potential for the Royal Canadian Mounted Police (RCMP), the Communications Security Establishment Canada (CSEC), the Canadian Security Intelligence Service (CSIS), and Public Safety (PS) Canada.


Résumé

Avec la révolution des technologies de l'information, la dépendance des Forces armées canadiennes (FAC) sur leurs systèmes d'information ne cesse de croître. Bien que les systèmes d’information sont des actifs qui confèrent un net avantage aux FAC, ils les rendent également vulnérables si des adversaires interfèrent avec ceux-ci. Malheureusement, la technologie nécessaire pour perturber et endommager un système d'information par l’entremise de logiciels malveillants est beaucoup moins sophistiquée et onéreuse que le montant de l'investissement nécessaire à la création du système. Pour comprendre et atténuer cette menace, la rétro-ingénierie doit être effectuée afin d'analyser ces logiciels malveillants. Cependant, la rétro-ingénierie logicielle est un processus manuel et fastidieux. La courbe d'apprentissage est assez abrupte et une fois maîtrisé, le processus est entravé lorsque des techniques pour contrer la rétro-ingénierie sont utilisées. Cela se traduit par le fait que les très rares rétro-ingénieurs disponibles sont rapidement saturés. Le présent rapport scientifique décrit de nouvelles approches pour accélérer le processus de rétro-ingénierie de logiciels malveillants. L'objectif est de réduire les efforts d'analyse redondants en automatisant l'identification de fragments de code qui réutilisent (i) du code assembleur analysé précédemment ou (ii) du code source ouvert accessible à tous.

Importance pour la défense et la sécurité

Avec la prolifération actuelle des logiciels malveillants, de nouvelles techniques d'analyse allant au-delà des meilleures pratiques existantes de rétro-ingénierie sont nécessaires pour accélérer le processus. L’audience visée par ce rapport sont donc les rétro-ingénieurs logiciels et analystes de maliciels, ainsi que les gestionnaires et décideurs d’organisations qui ont à faire la rétro-ingénierie de code tiers. Les techniques présentées dans ce rapport contribueront à fournir aux FAC une capacité de base pour répondre plus efficacement et rapidement aux menaces cybernétiques et mènera à un changement dans leurs capacités de défense. Pour se prémunir contre les cyber-attaques, les FAC comptent actuellement sur des antivirus et des systèmes de détection d'intrusions. Ces approches consistent à chercher des symptômes, tels qu’une augmentation du trafic réseau ou des signes anormaux, et essayer de lutter contre les problèmes qui surgissent. Bien qu'elles puissent être efficaces dans certains cas, elles n'aident pas à comprendre ce qu’est le véritable problème et sont inefficaces lorsque les attaques se transforment. Fournir une nouvelle capacité de rétro-ingénierie de logiciels malveillants, tel que proposée dans le présent rapport, permettra aux FAC de découvrir les vulnérabilités que les logiciels malveillants exploitent. Ceci leur fournira aussi les connaissances requises afin de générer des mesures défensives contre de futures attaques malveillantes afin de les contrer, même lorsqu’elles se transforment. Cette nouvelle capacité introduira un changement de paradigme dans les pratiques des FAC dans le cyber-environnement. Celles-ci passeront d'une approche strictement défensive, dont les limites ont été exposées, à une approche proactive. 
L’approche proposée a aussi un potentiel pour la sécurité, notamment la Gendarmerie royale du Canada (GRC), le Centre de la sécurité des télécommunications Canada (CSTC), le Service canadien du renseignement de sécurité (SCRS) et la Sécurité publique (SP) Canada.


Table of contents

Abstract
Significance to defence and security
Résumé
Importance pour la défense et la sécurité
Table of contents
List of figures
List of tables
1 Introduction
2 Assembly to source code matching
    2.1 RE-Google
    2.2 Open source code repositories
        2.2.1 Black Duck Open Hub Code Search
        2.2.2 GitHub
    2.3 Exact matching
    2.4 Close matching
    2.5 Contextual matching
3 Code clone detection
    3.1 Assembly code clone detection
    3.2 Assembly code clone search
        3.2.1 Type I clones
        3.2.2 Type II clones
        3.2.3 Type III clones
        3.2.4 Type IV clones
    3.3 Exact and inexact code clones
    3.4 Code clone search
4 Conclusions and future work
References
List of symbols/abbreviations/acronyms/initialisms


List of figures

Figure 1: Black Duck Open Hub Code Search results
Figure 2: GitHub results
Figure 3: Citadel sub_42514F function in IDA Pro
Figure 4: Sert.cpp file excerpt from GitHub
Figure 5: Sert.cpp file excerpt on GitHub
Figure 6: Assembly code listing of a malicious executable
Figure 7: channels.c file excerpt on the Black Duck Open Hub Code Search
Figure 8: Comments in the channels.c file on the Black Duck Open Hub Code Search
Figure 9: Video capture capability discovered in Citadel
Figure 10: Type I clone example
Figure 11: Type II clone example
Figure 12: Type III clone example
Figure 13: Type IV clone example
Figure 14: Exact clone detected in both Citadel (left) and Zeus (right)
Figure 15: Inexact clone detected in Citadel (left) and Zeus (right)
Figure 16: Zeus code fragment
Figure 17: Citadel code fragment


List of tables

Table 1: Correspondence between assembly and source code
Table 2: Source vs. assembly code clones


1 Introduction

Canadians – individuals, industry and government – are embracing the many advantages that cyberspace offers, and the economy and quality of life are the better for it. But their increasing reliance on cyber technologies makes them more vulnerable to those who attack Canada’s digital infrastructure to undermine the national security, economic prosperity, and way of life [1]. The first decade of the 21st century saw a dramatic change in the nature of cybercrime. Once the province of teenage boys spreading graffiti for kicks and notoriety, hackers today are organized, financially motivated gangs [2]. In the past, virus writers displayed offensive images and bragged about the malware they had written; now hackers target companies and governments to steal intellectual property, build complex networks of compromised computers, and rob individuals of their identities [2]. It is a common scenario that the only piece of evidence of a targeted attack is the malicious executable code itself. Analyzing malicious code (malware) requires reverse engineering, as malware authors seldom do the favor of providing the source code of their creations.

Software reverse engineering is a manually intensive and time-consuming process, whose objective is to determine the functionality of a program. It consists of taking a program’s executable binary, translating it into assembly code, and then manually analyzing the resulting assembly code. Most of the steps involve translating assembly code into a series of abstractions that represent the overall flow of a program to determine its functionality. The learning curve to master reverse engineering is quite steep and, even once mastered, the process is hindered when a program is obfuscated by anti-reverse engineering techniques or actively tries to avoid detection, as most malware does.

During the last few years, the sophistication of malware has considerably evolved and has thus complicated the reverse engineering process. While malware used to consist of small programs written mostly in assembly, which spread by infecting other executable files, today’s malware programs are written using high-level languages, come in many forms (e.g., botnets, rootkits, malicious document files), and each new variant improves on the previous ones, by adding new capabilities and fixing bugs. Also, as developing stealthy and persistent malware requires a high degree of technical complexity, it is quite common for code fragments to be reused between different malware.

The fact that malware authors share source code among themselves [3, 4], have adopted a versioning approach, and use evasion techniques to bypass antivirus detection has resulted in a proliferation of malware. According to Panda Security, in 2013 alone, there were 30 million new malware strains in circulation, at an average of 82,000 per day [5]. This results in the very few available reverse engineers being quickly saturated, as in-depth malware analysis is a tedious and time-consuming task. Reverse engineers should thus leverage the code reuse in the production of malware and be able to correlate different malware programs to identify the similarities between them and, thereby, the code fragments they share. This would spare reverse engineers from reanalyzing code fragments of new malware that have already been analyzed in a previous context and would therefore reduce redundant analysis efforts.

Although software reverse engineering came of age in 1990 [6], it is only in recent years that its importance and visibility have risen. However, despite a growing community, it is still perceived by many as a dark art. This report will thus serve to illustrate our vision of how, by leveraging the open source code repositories publicly available and assembly code fragments previously analyzed, the entry barrier new analysts face could be reduced.

The remainder of this Scientific Report is organized as follows: Chapter 2 presents the assembly to source code matching process and its associated usage scenarios. Chapter 3 provides background information on code clone detection, the research area on which the identification of reused code fragments is built, as well as the different types of assembly code clones which should be detected by the proposed approach. Finally, Chapter 4 provides conclusions and future work.


2 Assembly to source code matching

Malware authors reuse code from all sources, including legitimate ones such as open source code. For example, both the Conficker worm and the Waledac bot used an open source implementation for their cryptographic functions [7, 8]. In the case of the Waledac bot, this represented 25% of its code [8]. Also, the much-discussed FLAME malware contained publicly available open source code packages such as SQLite and Lua [9].

Another example of source code reuse for malware creation is the Citadel Trojan. Citadel is based on the leaked source code of the Zeus crimeware kit [10]. Other malware (e.g., Gameover Zeus, Ice IX, LICAT, and Murofet) were also created after the source code of Zeus was made public [10].

2.1 RE-Google

Some people have compared reverse engineering to solving a jigsaw puzzle [11]. You first start by finding the corner pieces, then the frame, and after that, you work your way forward from there. Using this analogy, the corner pieces for reverse engineering are strings, constants, and function names. Strings contain human readable hints about a given functionality. Specific constants can give additional clues and can sometimes even be used to identify certain types of algorithms. Names of functions imported from shared libraries (e.g., DLLs) can reveal information about the performed actions. However, a lot of experience is needed to know, for example, the significance of a constant in a given context or what the combination of imported functions might result in.

The meaning of strings, constants, and function names could be obtained by searching them on publicly available open source code repositories. This was the idea behind the RE-Google IDA Pro plug-in [11], which has proved to be very valuable to find algorithms and code excerpts containing such information on Google Code. The ability to efficiently recognize the open source origin for a given assembly code fragment is desirable, both in order to enhance the productivity of a reverse engineer, as well as to reduce the odds of common libraries leading to false correlation between otherwise unrelated code bases.

RE-Google [11], a proof of concept plug-in for the IDA Pro disassembler [12], extracts constants, library names, and strings contained in a disassembled binary, and uses them to search for code on Google Code [13]. The links to the top ten source code files found are inserted as comments into the assembly code listing. Reviewing these files frequently provides enlightening insights into the functionality of the code fragment in question and saves considerable time. However, RE-Google uses the Google Code Search Data Application Programming Interface (API), which is no longer available, making this plug-in non-functional.
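The feature extraction step described above can be illustrated with a minimal sketch. It is not the RE-Google implementation: it works on raw bytes rather than on an IDA Pro database, handles only printable ASCII strings (no constants or library names), and the helper names `extract_strings` and `build_query` as well as the query format are assumptions made for illustration.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return printable ASCII runs of at least min_len bytes, in file order."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


def build_query(terms: list[str], max_terms: int = 5) -> str:
    """Quote the longest distinct terms to form a (hypothetical) search query."""
    distinct = sorted(set(terms), key=lambda term: (-len(term), term))
    return " ".join(f'"{term}"' for term in distinct[:max_terms])


# Made-up byte sequence standing in for a section of a binary.
sample = b"\x00\x01x11-req\x00\xff\xfeMY\x00CertOpenSystemStoreW\x00"
strings = extract_strings(sample)
query = build_query(strings)
print(strings)
print(query)
```

Short strings such as "MY" fall below the default length threshold here, which is one reason a real tool would combine strings with constants and imported function names rather than rely on strings alone.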

2.2 Open source code repositories

In order to be able to automatically match assembly with source code, the following alternatives to the Google Code Search service were evaluated: Antelink [14], CodePlex [15], GrepCode [16], GitHub [17], Krugle [18], the Open Hub Code Search [19], and searchcode [20]. The two selected options were the Open Hub Code Search and GitHub.

2.2.1 Black Duck Open Hub Code Search

The Open Hub Code Search, by Black Duck Software [21], with its 21,372,664,482 lines of open source code, claims to be the world's largest and most comprehensive code search engine [19]. It is the result of the 2012 merger with Koders, another open source code search engine, which Black Duck Software acquired in 2008.

Through the Open Hub Code Search, Black Duck Software wants to fill the gap left by the shutdown of Google Code Search in 2012. Figure 1 shows a screenshot of the results for the search term sort filtered for the C++ programming language.

Figure 1: Black Duck Open Hub Code Search results.

2.2.2 GitHub

GitHub is a web-based hosting service for software development projects using Git [22] at its heart. Git is a free and open source distributed version control system. GitHub claims to be the largest code host on the planet with over 18.9 million repositories [23]. One advantage that GitHub has over the Open Hub Code Search is its robust API, which allows the integration of third-party tools or applications [23]. Figure 2 displays a snapshot of the returned results by GitHub, considering only the C++ projects, for the search term sort.

Figure 2: GitHub results.

2.3 Exact matching

The perfect scenario in assembly to source code matching is when the source code of an assembly code function is found on a public repository. This is illustrated in the following example, where the corresponding source code of the Citadel Trojan function sub_42514F (Figure 3) has been found on GitHub (Figure 4).


Figure 3: Citadel sub_42514F function in IDA Pro.


Figure 4: Sert.cpp file excerpt from GitHub.

Figure 4 displays a subset of the file Sert.cpp found on the GitHub Carberp repository (see footnote 1). All the calls to the Windows API functions (coloured in pink) in Figure 3 are also present in Figure 4, as shown in Table 1. The letter W appended to the function name CertOpenSystemStoreW in the disassembly, but absent from the source code listing, is due to the fact that there are two versions of CertOpenSystemStore: one for ANSI strings (ending with an A) and one for Unicode strings (ending with a W). In the present case, the CertOpenSystemStore function was compiled for Unicode.

1 https://github.com/hzeroo/Carberp/blob/6d449afaa5fd0d0935255d2fac7c7f6689e8486b/source%20-%20absource/pro/all%20source/sert/sert.cpp


Table 1: Correspondence between assembly and source code.

Function name                      IDA Pro virtual address   Source code line number
CertOpenSystemStoreW               0042515A                  172
CertEnumCertificatesInStore        00425167                  176
CertDuplicateCertificateContext    00425173                  178
CertDeleteCertificateFromStore     0042517E                  180
CertCloseStore                     00425192                  183

Figure 3 shows that IDA Pro was also able to extract the string “MY” at the address 00425150, which is passed as a parameter to the function CertOpenSystemStoreW. This string is also present in the file Sert.cpp. It is initialized at line 51 (Figure 5) and its pointer is passed as a parameter to the CertOpenSystemStore function at line 172 (Figure 4).

This example illustrates how being able to match assembly with its corresponding source code greatly accelerates the reverse engineering process. The latter is at a higher level of abstraction and is thus easier to understand. In this case, it serves to clearly illustrate how the different Windows Certificate Store functions used for cryptography and present in Citadel are related.
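The kind of correspondence recorded in Table 1 can be approximated automatically. The following is a minimal sketch, assuming the imported API names have already been extracted from the disassembly; the function `find_api_lines` is hypothetical, and the source excerpt is invented for illustration and is not the actual content of Sert.cpp. A name ending in A or W is also tried without that suffix, which is a heuristic that can produce false matches.

```python
import re
from typing import Optional


def find_api_lines(source: str, api_names: list[str]) -> dict[str, Optional[int]]:
    """Map each API name to the first source line (1-based) that mentions it."""
    lines = source.splitlines()
    result: dict[str, Optional[int]] = {}
    for name in api_names:
        # Try the exact name first, then the name without its A/W suffix.
        candidates = [name]
        if name[-1] in "AW":
            candidates.append(name[:-1])
        patterns = [re.compile(rf"\b{re.escape(c)}\b") for c in candidates]
        result[name] = next(
            (number for number, line in enumerate(lines, start=1)
             if any(p.search(line) for p in patterns)),
            None,
        )
    return result


# Invented excerpt; identifiers other than the Windows API names are made up.
source = """\
store = CertOpenSystemStore(0, store_name);
while ((ctx = CertEnumCertificatesInStore(store, ctx)) != NULL) {
    dup = CertDuplicateCertificateContext(ctx);
    CertDeleteCertificateFromStore(dup);
}
CertCloseStore(store, 0);
"""
imports = [
    "CertOpenSystemStoreW",
    "CertEnumCertificatesInStore",
    "CertDuplicateCertificateContext",
    "CertDeleteCertificateFromStore",
    "CertCloseStore",
]
matches = find_api_lines(source, imports)
print(matches)
```

A candidate file in which every imported name is found, in the same relative order as in the disassembly, is a strong hint of an exact match worth reviewing manually.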


Figure 5: Sert.cpp file excerpt on GitHub.


2.4 Close matching

With close matching, although the exact corresponding source code cannot be found on a public repository, the fact that certain assembly code features (e.g., strings, constants) are matched can provide additional information. This can sometimes reveal the actions performed by a function. This is best illustrated with the following example. Figure 6 displays the assembly code listing of a malicious Secure Shell (SSH) client. IDA Pro was able to extract the string x11-req at the address 806721A. This string was also found on the Black Duck Open Hub Code Search, in the channels.c file of the MirOS Project (see footnote 2).

Figure 6: Assembly code listing of a malicious executable.

2 http://code.openhub.net/file?fid=h271Oe3rYxXkxgFY3ZUnlSzqaXc&cid=rWOJw-YTJwg&s=&pp=0&fp=371636&ff=1&filterChecked=true&mp=1&ml=1&me=1&md=1#L0


Figure 7: channels.c file excerpt on the Black Duck Open Hub Code Search.

The MirOS project is a secure operating system from the BSD family for 32-bit i386 and SPARC systems [24]. Its channels.c file comes from the OpenBSD project [25]. The fact that the string x11-req was retrieved on the Black Duck Open Hub Code Search in the channels.c file allows the reverse engineer to know that the disassembled code displayed in Figure 6 is used for initiating X11 connection forwarding. This is mentioned in the comments at the beginning of the file (Figure 8).
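Close matching amounts to looking up features extracted from the disassembly in an index of source code. The sketch below assumes a small local corpus held in memory instead of a remote code search service; the file paths, file contents, and the helpers `index_string_literals` and `lookup` are all hypothetical.

```python
import re
from collections import defaultdict

# Double-quoted C-style string literal, allowing escaped characters.
STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')


def index_string_literals(corpus: dict[str, str]) -> dict[str, set[str]]:
    """Build an inverted index: string literal -> paths of files containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for path, text in corpus.items():
        for literal in STRING_LITERAL.findall(text):
            index[literal].add(path)
    return index


def lookup(index: dict[str, set[str]], strings: list[str]) -> dict[str, list[str]]:
    """Return, for each string found in the index, the sorted list of files."""
    return {s: sorted(index[s]) for s in strings if s in index}


# Invented corpus; the code lines are illustrative, not real project code.
corpus = {
    "ssh/channels.c": 'send_request(id, "x11-req");',
    "util/log.c": 'log_message("starting");',
}
index = index_string_literals(corpus)
hits = lookup(index, ["x11-req", "not-present"])
print(hits)
```

The files returned for a string, together with their header comments, give the analyst the context in which that string is normally used.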


Figure 8: Comments in the channels.c file on the Black Duck Open Hub Code Search.

2.5 Contextual matching

Contextual matching is the process of characterizing the disassembled code under study by pairing it with some known source code. In this case, although the exact or closely corresponding source code cannot be found, searching on open source code repositories can still provide information about the actions performed by a function. This is what happened for Citadel when an approximate code matching identified a video-related capability. Although the matching process was not perfect, it was accurate enough to reveal the context of the function. The video capture capability of Citadel was uncovered through links to source code files on the Black Duck Open Hub Code Search such as MHRecordContol.h, stopRecord.c, trackerRecorder.h, signalRecorder.h, and waitRecord.c. The links found for the files were added as comments in the disassembly of Citadel, as shown in Figure 9. This observation was further supported by the fact that the API de-obfuscation of Citadel revealed the presence of strings such as _startRecord16. A video_start command was also found as part of the process.


Figure 9: Video capture capability discovered in Citadel.

Although contextual matching is far from being the ideal scenario, the high-level information it provides for a function saves the reverse engineer from manually analyzing it. With Citadel having approximately 800 functions, any function for which the reverse engineer will not have to deal with assembly code analysis is a time saver.


3 Code clone detection


The fact that malware authors share source code among themselves [3, 4], have adopted a versioning approach, and use evasion techniques to bypass antivirus detection has resulted in a proliferation of malware. Since retrieving the open source origin of a malware code fragment is not always possible, reverse engineers should leverage the code reuse in the production of malware and be able to correlate different malware programs to identify the similarities between them and, thereby, the code fragments they share. This would spare them from reanalyzing code fragments of new malware that have already been analyzed in a previous context.

The problem of correlating different code fragments is closely related to the research area of clone detection. Clone detection is a technique to identify duplicate code fragments in a code base. Traditionally, it has been used to decrease the code size by consolidating it and thus, facilitate program comprehension and software maintenance. This need stems from the fact that reusing code fragments by copying and pasting them, with or without modifications, is a common scenario in software development and can be detrimental to software maintenance and evolution. For example, if a bug is found in a code fragment, then all similar code fragments must also be verified for the presence of this bug.

As clone detection is an important problem, it has been studied extensively and numerous clone detection algorithms exist. Depending on the code level analysis used, they can be classified within the following categories: text-based [26, 27, 28], token-based [29, 30, 31, 32], tree-based [33, 34, 35], metrics-based [36, 37], and graph-based [38, 39, 40]. However, most existing clone detection algorithms operate on source code and these solutions are not directly applicable to assembly code. One important application of clone detection on binary code is the detection of copyright infringements. For example, closed source software should not contain open source code released under the GNU General Public License (GPL).

3.1 Assembly code clone detection

Applying clone detection to the problem of malware analysis is challenging, due to the evasion techniques used by malware authors to produce executable code that is syntactically different but semantically performs the same malicious functionality. This is further complicated by the fact that clone detection on assembly code is not yet a mature research area. While the research community has expended effort in different approach categories, namely text-based [41, 42], token-based [43, 44, 45], metrics-based [46, 47], structural-based [48, 49, 50, 51, 52, 53], behavioral-based [54], and hybrid-based approaches [55], there is a lack of experimental data to validate their precision, recall, and scalability. Also, unlike source code clone detection, no independent experimental study has been conducted to compare the performance of the assembly code clone detection methods.

3.2 Assembly code clone search

The objective of clone detection is to identify code fragments of high similarity from a large code base. The major challenge is that the clone detector usually does not know beforehand which code fragments may be repeated. Therefore, a naïve clone detection approach might need to compare every pair of code fragments. Such a comparison is prohibitively expensive in terms of computation and is infeasible to perform in many real-life scenarios. But given a collection of previously analyzed assembly files and a target assembly code fragment, such as in the case of malware analysis, the objective is not to identify all the duplicate code fragments. It is only to identify all the code fragments in the previously analyzed assembly files that are similar to the target fragment. This problem is known as assembly code clone search.
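The difference between clone detection and clone search can be made concrete with a small sketch. It assumes fragments are lists of instruction strings and uses Jaccard similarity over instruction sets as a stand-in similarity measure; that measure, the threshold, and the fragment names are assumptions for illustration, not the method of this report. Detection compares every pair of fragments, n(n-1)/2 comparisons for n fragments, while search performs one comparison per stored fragment.

```python
from itertools import combinations


def jaccard(a: list[str], b: list[str]) -> float:
    """Jaccard similarity between the sets of instructions of two fragments."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def detect_clones(fragments: dict[str, list[str]], threshold: float) -> list[tuple[str, str]]:
    """Clone detection: compare all pairs of fragments in the code base."""
    return [
        (x, y)
        for x, y in combinations(sorted(fragments), 2)
        if jaccard(fragments[x], fragments[y]) >= threshold
    ]


def search_clones(target: list[str], fragments: dict[str, list[str]], threshold: float) -> list[str]:
    """Clone search: compare one target fragment against each stored fragment."""
    return [
        name
        for name in sorted(fragments)
        if jaccard(target, fragments[name]) >= threshold
    ]


# Hypothetical repository of previously analyzed fragments.
repository = {
    "f1": ["push eax", "call ds:_aligned_free", "pop ecx"],
    "f2": ["push eax", "call ds:_aligned_free", "and dword ptr [esi], 0", "pop ecx"],
    "f3": ["xor eax, eax", "retn"],
}
target = ["push eax", "call ds:_aligned_free", "and dword ptr [esi], 0", "pop ecx"]

pairs = detect_clones(repository, threshold=0.7)
found = search_clones(target, repository, threshold=0.7)
print(pairs)
print(found)
```

Even the linear scan in `search_clones` becomes costly for large repositories, which is why practical clone search systems typically rely on indexing rather than exhaustive comparison.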

A code fragment is any sequence of assembly code instructions, with or without comments, at any granularity level (e.g., function, basic block). A code fragment is a clone of another code fragment if they are similar according to a given definition of similarity [56]. In clone detection (or search), code fragments can be similar based on their program text (textual similarity) or functionality (functional similarity). In the literature, code clones have been classified into Type I, II, III (textual similarity), and IV (functional similarity) [57]. The results of clone detection take the form of clone pairs. A pair of code fragments is called a clone pair if there exists a clone-relation between them (i.e., a clone pair is a pair of code fragments which are identical or similar to each other) [57].

3.2.1 Type I clones

Type I clones are two or more code fragments that are identical except for variations in whitespace, layout, and comments. In the example of Figure 10, the only difference between the two code fragments is the presence of the Memory comment at line 1 of the left fragment.

1  push eax ; Memory              push eax
2  call ds:_aligned_free          call ds:_aligned_free
3  and dword ptr [esi], 0         and dword ptr [esi], 0
4  pop ecx                        pop ecx

Figure 10: Type I clone example.
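One possible way to test for a Type I clone is to strip comments and normalize whitespace before comparing the fragments. The Python sketch below applies this idea to the fragments of Figure 10. The function name is hypothetical, and the sketch assumes that a semicolon always starts a comment (i.e., that no operand contains a semicolon inside a string literal).

```python
import re


def normalize_type1(fragment):
    """Strip comments and normalize whitespace, one instruction per list entry."""
    instructions = []
    for line in fragment.splitlines():
        line = line.split(";", 1)[0]            # drop the trailing comment, if any
        line = re.sub(r"\s*,\s*", ", ", line)   # uniform spacing after commas
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            instructions.append(line)
    return instructions


left = """push eax ; Memory
call ds:_aligned_free
and dword ptr [esi], 0
pop ecx"""

right = """push    eax
call    ds:_aligned_free
and     dword ptr [esi],0
pop     ecx"""

is_type1 = normalize_type1(left) == normalize_type1(right)
```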


3.2.2 Type II clones

Type II clones are structurally and syntactically identical code fragments except for variations in identifiers, literals, types, layout, and comments. In Figure 11, the only difference between the two code fragments is that for some instructions, they use different constants, variable names, and labels. For example, for the assembly code instruction at line 5, the instruction on the left uses the constant 24h and the variable name var_C, while its corresponding instruction on the right uses the constant 20h and the variable name InBuffer. Similar differences also apply for the instructions at line 15. For the jnz instruction at line 17, the loc_10001A97 label is used on the left, while the loc_10001493 label is used for the instruction on its right.

 1  push edi ; Size                push edi ; Size
 2  call _malloc                   call _malloc
 3  mov edx, eax                   mov edx, eax
 4  mov ecx, edi                   mov ecx, edi
 5  mov [esp+24h+var_C], edx       mov [esp+20h+InBuffer], edx
 6  mov edi, edx                   mov edi, edx
 7  mov edx, ecx                   mov edx, ecx
 8  xor eax, eax                   xor eax, eax
 9  shr ecx, 2                     shr ecx, 2
10  rep stosd                      rep stosd
11  mov ecx, edx                   mov ecx, edx
12  add esp, 4                     add esp, 4
13  and ecx, 3                     and ecx, 3
14  rep stosb                      rep stosb
15  mov eax, [esp+20h+var_C]       mov eax, [esp+1Ch+InBuffer]
16  test eax, eax                  test eax, eax
17  jnz loc_10001A97               jnz loc_10001493
18  mov eax, [ebx]                 mov eax, [ebx]
19  push eax                       push eax

Figure 11: Type II clone example.
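A Type II comparison can be approximated by replacing constants and identifiers with placeholders before comparing instructions. The following Python sketch is one such normalization, applied to the differing instructions of Figure 11. The register, keyword, and prefix sets are deliberately incomplete, the CONST and ID placeholders are hypothetical names, and numeric constants are assumed to start with a digit.

```python
import re

# Deliberately incomplete sets, sufficient for this illustration (assumption).
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp",
             "ax", "bx", "cx", "dx", "al", "bl", "cl", "dl",
             "cs", "ds", "es", "fs", "gs", "ss"}
KEYWORDS = {"byte", "word", "dword", "ptr", "short", "near", "far", "offset"}
PREFIXES = {"rep", "repe", "repz", "repne", "repnz", "lock"}

# Identifier, numeric constant (decimal or hexadecimal with an h suffix), or punctuation.
TOKEN = re.compile(r"[A-Za-z_@?$][\w@?$]*|\d[0-9A-Fa-f]*h?|\S")


def normalize_type2(instruction):
    """Replace constants with CONST and non-register identifiers with ID."""
    tokens = TOKEN.findall(instruction.split(";", 1)[0])
    out = []
    opcode_done = False
    for tok in tokens:
        low = tok.lower()
        if not opcode_done:
            out.append(low)                    # prefix or mnemonic, kept as-is
            opcode_done = low not in PREFIXES
        elif low in REGISTERS or low in KEYWORDS:
            out.append(low)
        elif tok[0].isdigit():
            out.append("CONST")
        elif tok[0].isalpha() or tok[0] in "_@?$":
            out.append("ID")
        else:
            out.append(tok)                    # punctuation such as [ ] + ,
    return " ".join(out)


left = ["mov [esp+24h+var_C], edx", "rep stosd",
        "mov eax, [esp+20h+var_C]", "jnz loc_10001A97"]
right = ["mov [esp+20h+InBuffer], edx", "rep stosd",
         "mov eax, [esp+1Ch+InBuffer]", "jnz loc_10001493"]

is_type2 = [normalize_type2(i) for i in left] == [normalize_type2(i) for i in right]
```

In this sketch, registers are kept as-is. Whether registers should also be abstracted is a design choice which depends on how strict the comparison should be.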

3.2.3 Type III clones

A Type III clone is a Type II clone with further modifications. Statements can be changed, added, or removed, in addition to variations in identifiers, literals, types, layout, and comments. In the example of Figure 12, the order of the two instructions at lines 3 and 4 was inverted.

1  mov esi, [ebp+arg_0]           mov esi, [ebp+arg_0]
2  mov edx, [esi+214h]            mov edx, [esi+214h]
3  mov edi, [esi+220h]            mov [ebp+var_4], edx
4  mov [ebp+var_4], edx           mov edi, [esi+220h]
5  cmp [esi+21Ch], edi            cmp [esi+21Ch], edi
6  jl short loc_76641044          jl short loc_76641044
7  lea ebx, [edx+edi*8]           lea ebx, [edx+edi*8]

Figure 12: Type III clone example.
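Since Type III clones are not identical, detecting them requires a similarity score rather than an equality test. The Python sketch below uses the standard library difflib.SequenceMatcher on the instruction sequences of Figure 12. This is only one possible measure, and the 0.8 threshold is an arbitrary assumption chosen for illustration.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # arbitrary value chosen for illustration (assumption)


def similarity(left, right):
    """Similarity in [0, 1] between two instruction sequences."""
    return SequenceMatcher(None, left, right, autojunk=False).ratio()


left = ["mov esi, [ebp+arg_0]", "mov edx, [esi+214h]", "mov edi, [esi+220h]",
        "mov [ebp+var_4], edx", "cmp [esi+21Ch], edi", "jl short loc_76641044",
        "lea ebx, [edx+edi*8]"]
right = ["mov esi, [ebp+arg_0]", "mov edx, [esi+214h]", "mov [ebp+var_4], edx",
         "mov edi, [esi+220h]", "cmp [esi+21Ch], edi", "jl short loc_76641044",
         "lea ebx, [edx+edi*8]"]

score = similarity(left, right)
# Similar enough to be a clone, but not identical.
is_inexact_clone = THRESHOLD <= score < 1.0
```

A sequence-based measure penalizes the reordering of lines 3 and 4. A measure based on the multiset of instructions would not, so the choice of measure affects which modifications are tolerated.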


3.2.4 Type IV clones

A Type IV clone, also known as a semantic clone, occurs when two or more code fragments perform the same computation, but using different syntactic variants. In the example of Figure 13, the two code fragments carry out the same function, i.e., compute the length of a string. However, their implementations differ significantly.

strlen1 proc near                 strlen3 proc near
arg_0 = dword ptr 4               arg_0 = dword ptr 4
mov eax, [esp+arg_0]              push edi
loc_401004:                       mov edi, [esp+4+arg_0]
cmp byte ptr [eax], 0             xor ecx, ecx
jz short done                     not ecx
inc eax                           xor al, al
jmp short loc_401004              cld
done:                             repne scasb
sub eax, [esp+arg_0]              not ecx
retn                              lea eax, [ecx-1]
strlen1 endp                      pop edi
                                  retn
                                  strlen3 endp

Figure 13: Type IV clone example.
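The functional equivalence of the two fragments of Figure 13 can be checked informally by modelling each one and comparing their outputs on sample inputs. The Python sketch below is a simplified model, not an emulator: memory is a bytes object, registers are plain integers masked to 32 bits, and the function names simply reuse the procedure names of the figure.

```python
MASK = 0xFFFFFFFF  # 32-bit register width


def strlen1(memory, ptr):
    """Model of strlen1: advance a pointer until the terminating zero byte."""
    eax = ptr                        # mov eax, [esp+arg_0]
    while memory[eax] != 0:          # cmp byte ptr [eax], 0 / jz short done
        eax += 1                     # inc eax
    return eax - ptr                 # sub eax, [esp+arg_0]


def strlen3(memory, ptr):
    """Model of strlen3: repne scasb with ecx used as a down-counter."""
    edi = ptr                        # mov edi, [esp+4+arg_0]
    ecx = ~0 & MASK                  # xor ecx, ecx / not ecx
    al = 0                           # xor al, al
    while ecx != 0:                  # repne scasb (direction flag cleared by cld)
        ecx = (ecx - 1) & MASK
        found = memory[edi] == al
        edi += 1
        if found:
            break
    ecx = ~ecx & MASK                # not ecx
    return (ecx - 1) & MASK          # lea eax, [ecx-1]


samples = [b"\x00", b"a\x00", b"malware\x00", b"reverse engineering\x00"]
results = [(strlen1(s, 0), strlen3(s, 0)) for s in samples]
```

Agreement on a few inputs only suggests equivalence. It does not prove it, which is part of what makes Type IV clones difficult to detect.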

3.3 Exact and inexact code clones

The above definitions for the different code clone types are commonly used in the literature for source code clone detection [57]. However, they are not directly applicable to assembly code. For example, although possible, Type I clones seldom occur in assembly code and are thus of little practical relevance. For this reason and to simplify matters, the notion of exact and inexact code clones is introduced. As illustrated in Table 2, an exact clone corresponds either to a Type I or Type II clone, while an inexact clone corresponds to a Type III or Type IV clone.

Table 2: Source vs. assembly code clones.

Source Code         Assembly Code
Type I Clone        Exact Clone
Type II Clone       Exact Clone
Type III Clone      Inexact Clone
Type IV Clone       Inexact Clone


The first usage scenario which should be supported is the detection of exact clones. Figure 14 displays an example of an exact clone detected in both Citadel (on the left) and Zeus (on the right), with their differences highlighted. When compared with their corresponding instructions in Zeus, some Citadel instructions use different registers (e.g., ebx and edi at address 40ADEE and 40ADEF), labels (e.g., loc_40B135 at address 40ADFA), and function names (e.g., sub_433D74 at address 40AE59).

Figure 14: Exact clone detected in both Citadel (left) and Zeus (right).


The second usage scenario to be supported is illustrated in Figure 15. In this example, an inexact clone was identified in both Citadel (on the left) and Zeus (on the right). This clone is related to the RC4 function used for encrypting the command and control (C&C) network traffic between the bot and the C&C server.

Figure 15: Inexact clone detected in Citadel (left) and Zeus (right).

3.4 Code clone search

Another scenario which should also be supported is the possibility to search for a specific code fragment. For example, an analyst might be interested in knowing whether the function at the virtual address 417018 in Zeus (Figure 16) is also present in Citadel (Figure 17).


Figure 16: Zeus code fragment.

Figure 17: Citadel code fragment.
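For exact clones, one possible way to support this search scenario is to index previously analyzed fragments by a hash of their normalized instructions. The Python sketch below illustrates the idea. The sample names, addresses, and instructions are hypothetical placeholders and are not taken from Zeus or Citadel, and the normalization only strips comments and extra whitespace.

```python
import hashlib
from collections import defaultdict


def fingerprint(instructions):
    """Hash of the instructions after stripping comments and extra whitespace."""
    canonical = "\n".join(
        " ".join(instruction.split(";", 1)[0].split())
        for instruction in instructions
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_index(analyzed):
    """Map each fingerprint to the locations of the fragments that produce it."""
    index = defaultdict(list)
    for location, instructions in analyzed.items():
        index[fingerprint(instructions)].append(location)
    return index


def search(index, target):
    """Return the locations of previously analyzed fragments matching the target."""
    return index.get(fingerprint(target), [])


# Hypothetical repository of previously analyzed fragments.
analyzed = {
    ("sample_a", 0x401000): ["push ebp", "mov ebp, esp", "xor eax, eax", "pop ebp", "retn"],
    ("sample_a", 0x401020): ["push eax", "call ds:_aligned_free", "pop ecx"],
    ("sample_b", 0x402000): ["push  eax ; Memory", "call ds:_aligned_free", "pop ecx"],
}
index = build_index(analyzed)
hits = search(index, ["push eax", "call ds:_aligned_free", "pop ecx"])
```

A lookup in such an index avoids comparing the target against every stored fragment, but it only finds fragments whose normalized form is identical. Inexact clones would require a similarity-based search instead.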


4 Conclusions and future work

As software reverse engineering is a manually intensive and time-consuming process, this report illustrates scenarios in which it can be accelerated by partially automating some of its aspects. The objective is to leverage available sources of information, namely (i) public open source code repositories and (ii) previously analyzed assembly code fragments. The first approach aims at saving time by providing the significance of a function’s strings, constants, and imported functions, without having the reverse engineer analyze the underlying assembly code. The second attempts to reduce redundant analysis efforts by detecting code clones of a target executable.

Future work will consist of developing prototypes to support the different scenarios explained in the present report. These prototypes will allow matching assembly with its corresponding source code, as well as detecting clones of assembly code fragments. Empirical studies will also be conducted to measure their precision, recall, scalability, and performance using malware samples. The results will be measured against the most representative approaches to compare assembly code.


References

[1] Public Safety Canada, Canada’s Cyber Security Strategy: For a Stronger and More Prosperous Canada, Government of Canada, 2010.

[2] Sophos, Security Threat Report: 2010, Sophos Group, Abingdon, United Kingdom, Jan. 2010.

[3] PSTP08-0107eSec Combating Robot Networks and Their Controllers: A Study for the Public Security and Technical Program (PSTP), CPQ0229.1.36.4, Bell Canada, 2010.

[4] D. Babic, D. Reynaud, and D. Song, “Malware Analysis with Tree Automata Inference,” Proc. of the 23rd Int’l Conf. on Computer Aided Verification (CAV ‘11), Snowbird, Utah, Jul. 2011, pp. 116-131.

[5] Panda Security, “Annual Report PandaLabs: 2013 Summary,” 2014; http://mediacenter.pandasecurity.com/mediacenter/wp-content/uploads/2014/07/Annual-Report-PandaLabs-2013.pdf.

[6] E. Eilam, Reversing: Secrets of Reverse Engineering, Wiley Publishing, 2005.

[7] P. Porras, H. Saidi, and V. Yegneswaran, “An Analysis of Conficker,” Mar. 2009; http://mtc.sri.com/Conficker/.

[8] B. Stock, et al., “Walowdac – Analysis of a Peer-to-Peer Botnet,” Proc. of the 2009 European Conf. on Computer Network Defense (EC2ND '09), Milan, Italy, Nov. 2009, pp. 13-20.

[9] sKyWIper Analysis Team, sKyWIper (a.k.a. Flame a.k.a. Flamer): A Complex Malware for Targeted Attacks, tech. report, Department of Telecommunications, Budapest Univ. of Technology and Economics, 2012.

[10] Analysis Team, Malware Analysis: Citadel, tech. report, AhnLab Security Emergency response Center, 2012.

[11] F. Leder, “RE-Google,” Accessed Jan. 2015; http://regoogle.carnivore.it/.

[12] Hex-Rays, “IDA: About,” Accessed Jan. 2015; https://www.hex-rays.com/products/ida/index.shtml.

[13] Google, “Google Code,” Accessed Jan. 2015; http://code.google.com/.

[14] Antelink, “Antepedia Open Source Search Engine - Stay aware of updates & security issues of open source projects,” Accessed Jan. 2015; http://www.antepedia.com/.

[15] Microsoft, “CodePlex - Open Source Project Hosting,” Accessed Jan. 2015; https://www.codeplex.com/.


[16] GrepCode, “GrepCode.com - Java Source Code Search 2.0,” Accessed Jan. 2015; http://grepcode.com/.

[17] GitHub, “GitHub · Build software better, together.,” Accessed Jan. 2015; https://github.com/.

[18] Aragon Consulting Group, “Home | krugle - software development productivity,” Accessed Jan. 2015; http://www.krugle.com/.

[19] Black Duck Software, “Open Hub Code Search,” Accessed Jan. 2015; http://code.openhub.net/.

[20] B.E. Boyter, “searchcode | source code search engine,” Accessed Jan. 2015; https://searchcode.com/.

[21] Black Duck Software, “Black Duck,” Accessed Jan. 2015; https://www.blackducksoftware.com/.

[22] Git, “Git,” Accessed Jan. 2015; http://git-scm.com/.

[23] GitHub, “Features · GitHub,” Accessed Jan. 2015; https://github.com/features.

[24] Black Duck Software, “The MirOS Project Open Source Project on Open Hub,” Accessed Jan. 2015; https://www.openhub.net/p/mirbsd.

[25] OpenBSD, “OpenBSD,” Accessed Jan. 2015; http://www.openbsd.org/.

[26] J.H. Johnson, “Identifying Redundancy in Source Code Using Fingerprints,” Proc. of the 1993 Conf. of the Centre for Advanced Studies on Collaborative Research: Software Eng., Toronto, Ont., Sept. 1993, pp. 171-183.

[27] A. Marcus and J.I. Maletic, “Identification of High-level Concept Clones in Source Code,” Proc. of the 16th IEEE Int’l Conf. on Automated Software Eng. (ASE ‘01), Coronado, Calif., Nov. 2001, pp. 107-114.

[28] J. Ji, et al., “Source Code Similarity Detection Using Adaptive Local Alignment of Keywords,” Proc. of the 8th Int’l Conf. on Parallel and Distributed Computing, Applications and Technologies (PDCAT ‘07), Adelaide, Australia, Dec. 2007, pp. 179-180.

[29] B.S. Baker, “On Finding Duplication and Near-Duplication in Large Software Systems,” Proc. of the 2nd Working Conf. on Reverse Eng. (WCRE ‘95), Toronto, Ont., Jul. 1995, pp. 86-95.

[30] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code,” IEEE Trans. on Software Eng., vol. 28, no. 7, Jul. 2002, pp. 654-670.


[31] H.A. Basit, et al., “Efficient Token Based Clone Detection with Flexible Tokenization,” Proc. of the 6th Joint Meeting of the European Software Eng. Conf. and the ACM SIGSOFT Symp. on the Foundations of Software Eng., Dubrovnik, Croatia, Sept. 2007, pp. 513-516.

[32] B. Hummel, et al., “Index-based Code Clone Detection: Incremental, Distributed, Scalable,” Proc. of the 2010 IEEE Int’l Conf. on Software Maintenance, Timisoara, Romania, Sept. 2010, pp. 1-9.

[33] I.D. Baxter, et al., “Clone Detection Using Abstract Syntax Trees,” Proc. of the Int’l Conf. on Software Maintenance (ICSM '98), Bethesda, Md., Nov. 1998, pp. 368-377.

[34] V. Wahler, et al., “Clone Detection in Source Code by Frequent Itemset Techniques,” Proc. of the 4th IEEE Int’l Workshop on Source Code Analysis and Manipulation (SCAM ‘04), Chicago, Ill., Sept. 2004, pp. 128-135.

[35] W.S. Evans, C.W. Fraser, and F. Ma, “Clone Detection via Structural Abstraction,” Proc. of the 14th Working Conf. on Reverse Eng. (WCRE ’07), Vancouver, B.C., Oct. 2007, pp. 150-159.

[36] K.A. Kontogiannis, et al., “Pattern Matching for Clone and Concept Detection,” J. of Automated Software Engineering, vol. 3, no. 1-2, Jun. 1996, pp. 77-108.

[37] J. Mayrand, C. Leblanc, and E. Merlo, “Experiment on the Automatic Detection of Function Clones in a Software System Using Metrics,” Proc. of the 1996 Int’l Conf. on Software Maintenance (ICSM ’96), Monterey, Calif., Nov. 1996, pp. 244-253.

[38] J. Krinke, “Identifying Similar Code with Program Dependence Graphs,” Proc. of the 8th Working Conf. on Reverse Eng. (WCRE ’01), Stuttgart, Germany, Oct. 2001, pp. 301-309.

[39] R. Komondoor and S. Horwitz, “Using Slicing to Identify Duplication in Source Code,” Proc. of the 8th Int’l Symp. on Static Analysis (SAS 2001), Paris, France, Jul. 2001, pp. 40-56.

[40] C. Liu, et al., “GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis,” Proc. of the 12th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining (KDD ’06), Philadelphia, Pa., Aug. 2006, pp. 872-881.

[41] J. Jang and D. Brumley, BitShred: Fast, Scalable Code Reuse Detection in Binary Code, tech. report CMU-CyLab-10-006, CyLab, Carnegie Mellon Univ., 2009.

[42] J. Jang, D. Brumley, and S. Venkataraman, BitShred: Fast, Scalable Malware Triage, tech. report CMU-CyLab-10-022, CyLab, Carnegie Mellon Univ., 2010.

[43] M.E. Karim, et al., “Malware Phylogeny Generation using Permutations of Code,” J. in Computer Virology, vol. 1, no. 1, 2005, pp. 13-23.

[44] A. Schulman, “Finding Binary Clones with Opstrings and Function Digests,” Doctor Dobb's J., vol. 30, no. 9, Sept. 2005, pp. 64-70.


[45] A. Walenstein, et al., “Exploiting Similarity Between Variants to Defeat Malware: “Vilo” Method for Comparing and Searching Binary Programs,” Proc. of BlackHat DC 2007.

[46] D. Bruschi, L. Martignoni, and M. Monga, “Detecting Self-mutating Malware Using Control-flow Graph Matching,” Proc. of the 3rd Int’l. Conf. on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA '06), Berlin, Germany, Jul. 2006, pp. 129-143.

[47] A. Saebjornsen, et al., “Detecting Code Clones in Binary Executables,” Proc. of the 18th Int’l Symp. on Software Testing and Analysis (ISSTA '09), Chicago, Ill., Jul. 2009, pp. 117-128.

[48] E. Carrera and G. Erdélyi, “Digital Genome Mapping - Advanced Binary Malware Analysis,” Proc. of the Virus Bulletin Conf., Chicago, Ill., Sept. 2004, pp. 187-197.

[49] Halvar Flake, “More Fun with Graphs,” Black Hat Federal, 2003.

[50] M.E. Karim, et al., “Malware Phylogeny Generation using Permutations of Code,” J. in Computer Virology, vol. 1, no. 1, 2005, pp. 13-23.

[51] I. Briones and A. Gomez, “Graphs, Entropy and Grid Computing: Automatic Comparison of Malware,” Proc. of the Virus Bulletin Conf., Ottawa, Ont., Oct. 2008, pp. 187-197.

[52] S.S. Anju, et al., “Malware Detection Using Assembly Code and Control Flow Graph Optimization,” Proc. of the 1st Amrita ACM-W Celebration on Women in Computing in India (A2CWiC '10), Tamilnadu, India, Sept. 2010, pp. 65:1-65:4.

[53] T. Dullien, et al., “Automated Attacker Correlation for Malicious Code,” Proc. of the NATO RTO Information Assurance and Cyber Defence, Tallinn, Estonia, Dec. 2010.

[54] P.M. Comparetti, et al., “Identifying Dormant Functionality in Malware Programs,” Proc. of the 2010 IEEE Symp. on Security and Privacy (SP ’10), Oakland, Calif., May 2010, pp. 61-76.

[55] Z. Wang, K. Pierce, and S. McFarling, “BMAT - A Binary Matching Tool for Stale Profile Propagation,” J. of Instruction-Level Parallelism, vol. 2, 2000.

[56] C.K. Roy, J.R. Cordy, and R. Koschke, “Comparison and Evaluation of code Clone Detection Techniques and Tools: A Qualitative Approach,” Science of Computer Programming, vol. 74, no. 7, May 2009, pp. 470-495.

[57] C.K. Roy and J.R. Cordy, A Survey on Software Clone Detection Research, tech. report 2007-541, School of Computing, Queen's Univ., Kingston, Ont., 2007.


List of symbols/abbreviations/acronyms/initialisms

API Application Programming Interface

BSD Berkeley Software Distribution

CAF Canadian Armed Forces

C&C Command and Control

CSEC Communications Security Establishment Canada

CSIS Canadian Security Intelligence Service

CSTC Centre de la sécurité des télécommunications Canada

DLL Dynamic Link Library

DRDC Defence Research and Development Canada

FAC Forces armées canadiennes

GNU GNU's Not Unix!

GPL General Public License

GRC Gendarmerie royale du Canada

IDS Intrusion Detection System

PS Public Safety

RCMP Royal Canadian Mounted Police

RDDC Recherche et développement pour la défense Canada

SPARC Scalable Processor Architecture

SCRS Service canadien du renseignement de sécurité

SP Sécurité publique

SSH Secure Shell


DOCUMENT CONTROL DATA (Security markings for the title, abstract and indexing annotation must be entered when the document is Classified or Designated)

1. ORIGINATOR (The name and address of the organization preparing the document. Organizations for whom the document was prepared, e.g. Centre sponsoring a contractor's report, or tasking agency, are entered in section 8.) Defence Research and Development Canada – Valcartier Research Centre 2459, Route de la Bravoure Quebec (Quebec) G3J 1X5 Canada

2a. SECURITY MARKING (Overall security marking of the document including special supplemental markings if applicable.)

UNCLASSIFIED

2b. CONTROLLED GOODS

(NON-CONTROLLED GOODS) DMC A REVIEW: GCEC DEC 2012

3. TITLE (The complete document title as indicated on the title page. Its classification should be indicated by the appropriate abbreviation (S, C or U) in parentheses after the title.) Software fingerprinting for automated assembly code analysis: Project overview (U)

4. AUTHORS (last name, followed by initials – ranks, titles, etc. not to be used) Charland, P

5. DATE OF PUBLICATION (Month and year of publication of document.) March 2015

6a. NO. OF PAGES (Total containing information, including Annexes, Appendices, etc.)

40

6b. NO. OF REFS (Total cited in document.)

57

7. DESCRIPTIVE NOTES (The category of the document, e.g. technical report, technical note or memorandum. If appropriate, enter the type of report, e.g. interim, progress, summary, annual or final. Give the inclusive dates when a specific reporting period is covered.) Scientific Report

8. SPONSORING ACTIVITY (The name of the department project office or laboratory sponsoring the research and development – include address.) Defence Research and Development Canada – Valcartier Research Centre 2459, de la Bravoure Road Quebec (Quebec) G3J 1X5 Canada

9a. PROJECT OR GRANT NO. (If appropriate, the applicable research and development project or grant number under which the document was written. Please specify whether project or grant.)

15bz18

9b. CONTRACT NO. (If appropriate, the applicable number under which the document was written.)

10a. ORIGINATOR'S DOCUMENT NUMBER (The official document number by which the document is identified by the originating activity. This number must be unique to this document.) DRDC-RDDC-2015-R027

10b. OTHER DOCUMENT NO(s). (Any other numbers which may be assigned this document either by the originator or by the sponsor.)

11. DOCUMENT AVAILABILITY (Any limitations on further dissemination of the document, other than those imposed by security classification.)

Unlimited

12. DOCUMENT ANNOUNCEMENT (Any limitation to the bibliographic announcement of this document. This will normally correspond to the Document Availability (11). However, where further distribution (beyond the audience specified in (11)) is possible, a wider announcement audience may be selected.) Unlimited


13. ABSTRACT (A brief and factual summary of the document. It may also appear elsewhere in the body of the document itself. It is highly desirable that the abstract of classified documents be unclassified. Each paragraph of the abstract shall begin with an indication of the security classification of the information in the paragraph (unless the document itself is unclassified) represented as (S), (C), (R), or (U). It is not necessary to include here abstracts in both official languages unless the text is bilingual.)

With the revolution in information technology, the dependence of the Canadian Armed Forces (CAF) on their information systems continues to grow. While information systems-based assets confer a distinct advantage, they also make the CAF vulnerable if adversaries interfere with those. Unfortunately, the technology required to disrupt and damage an information system through malicious software (malware) is far less sophisticated and expensive than the amount of investment required to create the system. To understand and mitigate this threat, reverse engineering has to be performed to analyze malware. However, software reverse engineering is a manually intensive and time-consuming process. The learning curve to master it is quite steep and once mastered, the process is hindered when anti-reverse engineering techniques are used. This results in the very few available reverse engineers being quickly saturated. This Scientific Report describes new approaches to accelerate the reverse engineering process of malware. The goal is to reduce redundant analysis efforts by automating the identification of code fragments which reuse (i) previously analyzed assembly code or (ii) open source code publicly available.

Avec la révolution des technologies de l'information, la dépendance des Forces armées canadiennes (FAC) sur leurs systèmes d'information ne cesse de croître. Bien que les systèmes d’information sont des actifs qui confèrent un net avantage aux FAC, ils les rendent également vulnérables si des adversaires interfèrent avec ceux-ci. Malheureusement, la technologie nécessaire pour perturber et endommager un système d'information par l’entremise de logiciels malveillants est beaucoup moins sophistiquée et onéreuse que le montant de l'investissement nécessaire à la création du système. Pour comprendre et atténuer cette menace, la rétro-ingénierie doit être effectuée afin d'analyser ces logiciels malveillants. Cependant, la rétro-ingénierie logicielle est un processus manuel et fastidieux. La courbe d'apprentissage est assez abrupte et une fois maîtrisé, le processus est entravé lorsque des techniques pour contrer la rétro-ingénierie sont utilisées. Cela se traduit par le fait que les très rares rétro-ingénieurs disponibles sont rapidement saturés. Le présent rapport scientifique décrit de nouvelles approches pour accélérer le processus de rétro-ingénierie de logiciels malveillants. L'objectif est de réduire les efforts d'analyse redondants en automatisant l'identification de fragments de code qui réutilisent (i) du code assembleur analysé précédemment ou (ii) du code source ouvert accessible à tous.

14. KEYWORDS, DESCRIPTORS or IDENTIFIERS (Technically meaningful terms or short phrases that characterize a document and could be helpful in cataloguing the document. They should be selected so that no security classification is required. Identifiers, such as equipment model designation, trade name, military project code name, geographic location may also be included. If possible keywords should be selected from a published thesaurus, e.g. Thesaurus of Engineering and Scientific Terms (TEST) and that thesaurus identified. If it is not possible to select indexing terms which are Unclassified, the classification of each should be indicated as with the title.) Software reverse engineering; malware analysis; assembly to source code matching; clone detection
