
LO-PHI: Low-Observable Physical Host Instrumentation for Malware Analysis
Chad Spensky, Hongyi Hu and Kevin Leach

Pietro De Nicolao

Politecnico di Milano

June 13, 2016

1 / 31

Open challenges in Dynamic Malware Analysis

LO-PHI framework implementation

Experiments

Critique

Related and future work

2 / 31

The paper and the code

LO-PHI: Low-Observable Physical Host Instrumentation for Malware Analysis

Chad Spensky*†, Hongyi Hu*§ and Kevin Leach*‡
*MIT Lincoln Laboratory [email protected]
†University of California, Santa Barbara [email protected]
§Dropbox [email protected]
‡University of Virginia [email protected]

Abstract—Dynamic-analysis techniques have become the linchpins of modern malware analysis. However, software-based methods have been shown to expose numerous artifacts, which can either be detected and subverted, or potentially interfere with the analysis altogether, making their results untrustworthy. The need for less-intrusive methods of analysis has led many researchers to utilize introspection in place of instrumenting the software itself. While most current introspection technologies have focused on virtual-machine introspection, we present a novel system, LO-PHI, which is capable of physical-machine introspection of both non-volatile and volatile memory, i.e., hard disk and system memory. We demonstrate that we are able to provide analysis capabilities comparable to existing solutions, whilst exposing zero software-based artifacts and minimal hardware artifacts. To demonstrate the usefulness of our system, we have developed a framework for performing automated binary analysis. We employ this framework to analyze numerous potentially malicious binaries using both traditional virtual-machine introspection and our new hardware-based instrumentation. Our results show that not only is our analysis on-par with existing software-based counterparts, but that our physical instrumentation is capable of successfully analyzing far more binaries, as it is not foiled by popular anti-analysis techniques.

I. INTRODUCTION

With the rapid advancement of malware, the capabilities of existing analysis techniques have become obsolete. Tools that exist within the operating system or hypervisor are prone to creating artifacts that are visible to the malicious code. Malware authors can leverage these artifacts to conceal their true intentions by halting execution or potentially subverting the analysis technique altogether. Even with clever designs, there remains no proven technique for developing low-artifact software-based analysis tools. Moreover, recent work by Kirat et al. [44] showed that at least 5% of the malware analyzed in their study employed anti-analysis techniques that successfully evade most existing analysis tools. Subsequently, numerous systems have been developed that attempt to detect and analyze these environment-aware malware (e.g., [13], [49]). Chen et al. [20] even provide a taxonomy of anti-analysis techniques and mitigations commonly employed. However, no bullet-proof solutions exist, and most existing solutions require continuous updating as they rely on emulation frameworks.

To address these problems we present LO-PHI (Low-Observable Physical Host Instrumentation), a novel system capable of analyzing software executing on commercial-off-the-shelf (COTS) bare-metal machines, without the need for any additional software on the machines. LO-PHI permits accurate monitoring and analysis of live-running physical hosts in real-time, with a minimal addition of "plug-and-play" components to an otherwise unmodified system under test (SUT). We have taken a two-pronged approach that is capable of instrumenting machines with actual hardware, to minimize artifacts, or with traditional software-based techniques utilizing hardware virtualization, to maximize scale. This permits the tradeoff between transparency, scale, and cost when appropriate, as well as the potential for parallel analysis. Our architecture uses physical hardware and software-based sensors to monitor a SUT's memory, network, and disk activity as well as actuate its keyboard, mouse, and power. The raw data collected from our sensors is then processed with modified open source tools, i.e., Volatility [9] and Sleuthkit [18], to bridge the semantic gap, i.e., convert raw data into human-readable, semantically rich output. Because LO-PHI is designed to collect raw low-level data, it is both operating system and file system agnostic. Our framework can be easily extended to support new or legacy operating systems and file systems as long as the hardware tap points are suitable for data acquisition.

LO-PHI also necessitated the development of novel introspection techniques. While numerous techniques exist for memory acquisition [19], [28], [72] of bare-metal systems, passively monitoring disk activity has only recently begun to be explored [50]. In this work we present a hardware sensor capable of passively sniffing the disk activity of a live machine while introducing minimal artifacts. Subsequently, we have also developed the required modules for parsing and reconstructing the underlying serial advanced technology attachment (SATA) protocol. All of the source code for LO-PHI is available under the Berkeley Software Distribution (BSD) license at http://github.com/mit-ll/LO-PHI.

While the potential applications for LO-PHI are vast, we focus on our ability to perform automated malware analysis on physical machines, and demonstrate its usefulness by showcasing the ability to analyze classes of malware that trivially [...]


Network and Distributed System Security Symposium (NDSS) '16, 21-24 February 2016, San Diego, CA, USA

- GitHub repository: https://github.com/mit-ll/LO-PHI

3 / 31

Open challenges in Dynamic Malware Analysis

4 / 31

Dynamic Malware Analysis

Idea: run the malware in a sandbox (VM, debugger, ...) and use tools to analyze its behavior.

- We are interested in observing:
  - Memory accesses
  - Network activity
  - Disk activity

- Observation tools must be placed outside the sandbox
  - At a lower level w.r.t. the malware
  - In theory, completely transparent to the malware
  - Not so simple...

5 / 31

Virtualization is like dreaming

- Unsure if you're living in a dream, or awake?

- Look for artifacts (i.e. anomalies) in your reality!

Malware can do the same...

Figure 1: A never-stopping spinning top: a possible artifact in dreams.

6 / 31

Artifacts and environment-aware malware

Observer effect
Executing software in a debugger or VM leaves artifacts.

- Artifacts are evidence of an "artificial" environment
- According to [Garfinkel, 2007], building a transparent VMM is fundamentally infeasible.

Malware resistance to dynamic analysis
If the malware is able to detect artifacts, it can resist traditional dynamic analysis tools:

1. Remain inactive (not triggering the payload)
2. Abort or crash its host
3. Disable defenses or tools

7 / 31

Artifacts: some examples

[Chen, 2008] and [Garfinkel, 2007] provide a taxonomy of artifacts:

- Hardware
  - Special devices or adapters in VMs
  - Specific manufacturer prefixes on device names
- Memory
  - Hypervisors placing the interrupt table in different positions
  - Too little RAM (typical of VMs)
- Software
  - Presence of installed tools on the system
  - The IsDebuggerPresent() Windows API (see the snippet below)
- Behavior
  - Timing differences (e.g. interception of privileged instructions in VMMs)
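To make the software-artifact check concrete, here is a minimal sketch of the kind of probe environment-aware malware performs, written in Python with ctypes against the documented Win32 API (Windows only; illustrative, not code from the paper):

import ctypes

# The classic software-artifact probe: ask Windows whether a
# user-mode debugger is attached to this process.
if ctypes.windll.kernel32.IsDebuggerPresent():
    print("Debugger artifact found -- stay dormant")
else:
    print("No debugger artifact -- run payload")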

8 / 31

Semantic Gap
What is the malware doing? We need to mine semantics from the extracted raw data.

- From raw data:
  - Disk read at sector 123
  - TCP packet ABC
  - Some bytes at memory address 0xDEADBEEF
- To concise, high-level event descriptions:
  - /etc/passwd has been read
  - A connection to badguy.com has been opened

Forensics to the rescue
LO-PHI uses well-known forensic tools, adapted for live analysis:

- Disk forensics: Sleuthkit
- Memory forensics: Volatility
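LO-PHI adapts these tools to live data streams; as a static-image illustration of what "bridging the semantic gap" means on disk, here is a minimal pytsk3 sketch (the Python bindings for Sleuthkit, i.e. PyTSK). The image path and partition offset are assumptions for illustration:

import pytsk3

# From raw disk bytes to file-level semantics: open a raw image
# and list the names in its root directory.
img = pytsk3.Img_Info("sut-disk.img")       # raw disk image (assumed path)
fs = pytsk3.FS_Info(img, offset=63 * 512)   # assumed first-partition offset
for entry in fs.open_dir("/"):
    print(entry.info.name.name.decode(errors="replace"))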

9 / 31

Malware Analysis Framework Evaluation

[Excerpt from the BareBox paper [Kirat, 2011], captured together with the embedded figure: BareBox's snapshot save/restore performance is compared against soft-booting bare metal (SSD and commodity drives) and VirtualBox/VMWare snapshot restores. In the required "minimum device" configuration, BareBox restores about as fast as VirtualBox (roughly four seconds on average, about three times faster than soft-booting from an SSD), while restoring on bare metal; alternative bare-metal restore mechanisms take at least three to four times as long.]

5.3 Dynamic Malware Analysis

A malware analysis system can be evaluated based on three main factors: quality, efficiency, and stealthiness. Quality refers to the richness and completeness of the analysis results produced. Efficiency measures the number of samples that can be processed per unit time. Stealthiness refers to the undetectability of the presence of the analysis environment.

[Triangle diagram with vertices Quality, Stealthiness, Efficiency -- "Figure 5: Malware analysis framework evaluation"]

There is a constant tension between these factors, which is represented as a triangular surface in Figure 5. That is, while trying to achieve better results for one factor, the analysis technique has to make compromises in the two remaining factors. For example, an analysis system with an in-guest agent tends to produce high-quality analysis results, but it is less stealthy and prone to subversion. VMM- and emulator-based out-of-OS analysis systems are stealthier against in-guest agent detection, but because of the semantic gap, some level of introspection is required, using the domain knowledge of the target OS [10, 20]. This makes analysis less efficient and limits the quality of the produced results.

Figure 2: From [Kirat, 2011]

Tradeoff between:

1. Low-artifact, semantically poor tools (Virtual Machine Introspection)

2. High-artifact, semantically-rich frameworks (debuggers)

10 / 31

LO-PHI framework implementation

11 / 31

Goals

- No virtualization: run malware on bare-metal machines!
  - Physical sensors and actuators
- Bridging the semantic gap
  - Physical sensors collect raw data
- Automated restore to pre-infection state
- Stealthiness: very few, undetectable artifacts
- Extendability: support new OSs and filesystems

12 / 31

Threat model

Assumptions on our model of malware (they are limitations of the approach):

1. Malicious modifications are evident either in memory or on disk
2. No infection is delivered to the hardware
3. Instrumentation is in place before the malware is executed

- Malware cannot analyze the system without LO-PHI in place
- Harder to compare and detect artifacts: no baseline

13 / 31

Sensors and actuators (1)

[Diagram: physical instrumentation of the SUT -- power, keyboard and mouse actuation; memory introspection; network tap; SATA introspection; all feeding semantic analysis]

Figure 3: Hardware instrumentation to inspect a bare-metal machine.

14 / 31

Sensors and actuators (2)

Sensors

- Memory: Xilinx ML507 board connected to PCIe; reads and writes arbitrary memory locations via DMA.
- Disk: ML507 board intercepting all the traffic over the SATA interface; sends SATA frames via Gigabit Ethernet and UDP.
  - Completely passive...
  - ...except when the SATA data rate exceeds the Ethernet bandwidth: throttling of frames.
- Network interface: mentioned in the paper, but the technology used is unclear.

Actuators

An Arduino Leonardo emulates a USB keyboard and mouse.
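To give a feel for the disk sensor's output path, here is a minimal sketch of a collector for its UDP stream. The port number and the frame-header layout are assumptions for illustration, not LO-PHI's actual wire format (which is defined in the mit-ll/LO-PHI repository):

import socket
import struct

SENSOR_PORT = 31337  # hypothetical port for the FPGA sensor
HEADER = "<BQI"      # assumed layout: op (read/write), sector, count
HEADER_LEN = struct.calcsize(HEADER)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", SENSOR_PORT))

while True:
    packet, _ = sock.recvfrom(65535)
    # Parse the assumed header, then treat the rest as raw sector data.
    op, sector, count = struct.unpack_from(HEADER, packet, 0)
    payload = packet[HEADER_LEN:]
    print(f"{'WRITE' if op else 'READ'} sector={sector} count={count}")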

15 / 31

Infrastructure

Restoring physical machines

- We cannot simply "restore a snapshot" like in VMs
- Preboot eXecution Environment (PXE) with CloneZilla
  - Allows restoring the disk to a previously saved state
  - No interaction with the OS

Scalable infrastructure

- Job submission system: jobs are sent to a scheduler
- The scheduler executes the routine on an appropriate machine
- Python interface to control the machine and run malware and analysis
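The paper does not spell out the job interface; purely as a hypothetical sketch of the shape such a scheduler client could take (all names here are invented; the real interface lives in the mit-ll/LO-PHI repository):

from dataclasses import dataclass

@dataclass
class AnalysisJob:
    binary_path: str        # sample to download and execute
    machine_type: str       # "physical" or "virtual"
    runtime_sec: int = 180  # the paper lets samples run >= 3 minutes

def submit(scheduler, job):
    # Hypothetical: the scheduler picks an idle machine of the
    # requested type, restores it (PXE + CloneZilla for physical
    # hosts), runs the analysis routine, and stores the raw captures.
    return scheduler.enqueue(job)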

16 / 31

Bridging the semantic gap via forensic analysis

(a) Disk Reconstruction: Raw SATA Capture -> Disk Reconstruction (Custom Module) -> File System Reconstruction (PyTSK + Custom Code) -> Filter Noise -> FS Modifications

(b) Memory Reconstruction: Memory Image (Clean) -> Semantic Reconstruction (Volatility) -> OS Information; Memory Image (Dirty) -> Semantic Reconstruction (Volatility) -> OS Information; both -> Extract Differences -> Filter Noise -> Memory Modifications

Fig. 5: Binary analysis workflow. (Rounded nodes represent data and rectangles represent data manipulation.)

VII. EVALUATION AND ANALYSIS

In this section, we explain our methodology for semantic gap reconstruction (Section VII-A) and demonstrate the practicality of LO-PHI with three targeted experiments. The experiments were constructed to demonstrate the following:

- The ability of LO-PHI to detect the behaviors elicited by real malware, confirmed with ground truth (Section VII-C)
- The ability to scale and extract meaningful results from unknown malware samples (Section VII-D)
- The ability to analyze malware samples that employ anti-analysis and evasion techniques (Section VII-E)

For each binary, we determine the system changes that occurred during execution by forensically comparing the resulting clean and dirty states. Each such pair of datasets contains a clean and a dirty raw memory snapshot respectively, as well as a log of raw disk and network activity that occurred between clean and dirty states. We exclude the network trace analysis from much of our discussion since it is a well-known technique and not the focus of our work. Our analysis of a binary's execution involves four steps: 1) bridging the semantic gap for both the clean and dirty states, 2) computing the delta between the two states, 3) filtering out actions that are not attributed to the binary, and 4) comparing the delta for physical execution and virtual execution to determine if the sample employs VM-detection techniques (if applicable). The process taken for each binary is illustrated in Figure 5. When appropriate, we also compare our results to those produced by Anubis [7] and Cuckoo Sandbox [36].

A. Semantic Gap Reconstruction

As previously mentioned, before any analysis can be conducted, we must first bridge the semantic gap, i.e., translate our memory snapshots and SATA captures, which contain low-level, raw data, into high-level, semantically rich information.

1) Memory: To extract operating-system-level modifications from our memory captures, we run a number of Volatility plugins on both clean and dirty memory snapshots to parse kernel structures and other objects. Some of the general-purpose plugins include psscan, ldrmodules, modscan, and sockets, which extract the running processes, loaded DLLs, kernel drivers, and open sockets resident in memory. Similarly, we also run more malware-focused plugins such as idt, gdt, ssdt, svcscan, and callbacks, which examine kernel descriptor tables, registered services, and kernel callbacks.
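The clean/dirty comparison is conceptually a per-plugin set difference. A minimal sketch, assuming each plugin's output rows have already been rendered to strings and collected into sets (an assumption for illustration, not the paper's code):

def memory_delta(clean, dirty):
    """clean, dirty: dict mapping plugin name -> set of rendered rows.
    Returns, per plugin, the rows that appear only in the dirty snapshot."""
    return {plugin: rows - clean.get(plugin, set())
            for plugin, rows in dirty.items()}

New processes, sockets, or hooks then surface as rows present only in the dirty snapshot.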

2) Disk: The first step in our disk analysis is to convert the raw capture of the SATA activity into a 4-tuple containing the disk operation (e.g., READ or WRITE), starting sector, total number of sectors, and data. Our physical drives, as with most modern drives, used an optimization in the SATA specification known as Native Command Queuing (NCQ) [23]. NCQ reorders SATA Frame Information Structure (FIS) requests to achieve better performance by reducing extraneous head movement, and then asynchronously replies based on the optimal path. Thus, to reconstruct the disk activity, our SATA reconstruction module must faithfully model the SATA protocol in order to track and restore the semantic ordering of FIS packets before translating them to disk operations. Upon reconstructing the disk operations, these read/write transactions are then translated into disk events (e.g., filesystem operations, Master Boot Record modification, slack space modification) using our analysis code, which is built upon Sleuthkit and PyTSK [22]. Since Sleuthkit only operates on static disk images, our module required numerous modifications to keep system state while processing a stream of disk operations. Intuitively, we build a model of our SUT's disk drive and step through each read and write transaction, updating the state at each iteration and reporting appropriately. This entire process is visualized in Figure 5a. Unlike previous work [50], which was designed for NTFS, our approach is generalizable to any filesystem supported by Sleuthkit. A sample output from creating the file LO-PHI.txt on the desktop can be seen below:

MFT modification (Sector: 6321319)
Filename: /WINDOWS/.../drivetable.txt -> /.../Desktop/New Text Document.txt
Allocated: 0 -> 1    Unallocated: 1 -> 0    Size: 132 -> 0
Modified: 2014-11-07 20:07:06 (1406250) -> 2015-02-19 15:47:17 (3281250)
Accessed: 2014-11-07 20:07:06 (1406250) -> 2015-02-19 15:47:17 (3281250)
Changed:  2014-11-07 20:07:06 (1406250) -> 2015-02-19 15:47:17 (3281250)
Created:  2014-11-07 20:07:06 (1406250) -> 2015-02-19 15:47:17 (3281250)

...

MFT modification (Sector: 6321319)
Filename: /.../Desktop/New Text Document.txt -> /.../Desktop/LO-PHI.txt
Changed: 2015-02-19 15:47:17 (3281250) -> 2015-02-19 15:47:25 (3437500)

Note that we can infer from this output that the filesystem reused an old MFT entry for drivetable.txt and updated the filename, allocation flags, size, and timestamps upon file creation. A subsequent filename and timestamp update was then observed once the new filename, LO-PHI.txt, was entered.
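The core of the NCQ bookkeeping described above can be sketched in a few lines. This is a simplified model, not the actual LO-PHI module; real FIS parsing is considerably more involved:

from collections import namedtuple

DiskOp = namedtuple("DiskOp", "op sector count data")

class SATAReconstructor:
    """Toy NCQ-aware reconstruction: queued command FISes are
    remembered by tag; an operation is emitted only when the device
    completes that tag, restoring the semantic ordering."""

    def __init__(self):
        self.pending = {}  # NCQ tag -> (op, sector, count)

    def on_command(self, tag, op, sector, count):
        # Host queued a READ/WRITE FIS under an NCQ tag.
        self.pending[tag] = (op, sector, count)

    def on_completion(self, tag, data=b""):
        # Device answered (possibly out of order): emit the ordered op.
        op, sector, count = self.pending.pop(tag)
        return DiskOp(op, sector, count, data)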


Figure 4: Binary analysis workflow. Rounded nodes represent data and rectangles represent data manipulation.

- Background noise was removed by analyzing non-malicious software.

17 / 31

Example of output of the analysis toolbox

Offset      Name                      PID   PPID
0x86292438  AcroRd32.exe              1340  1048
0x86458818  AcroRd32.exe              1048  1008
0x86282be0  AdobeARM.exe              1480  1048
0x864562a0  $$_rk_sketchy_server.exe  1044  1008

(a) New Processes (pslist)

PID   Port  Protocol  Address
1048  1038  UDP       127.0.0.1
1044  21    TCP       0.0.0.0

(b) New Sockets (sockets)

Selector  Base        Limit       Type        DPL  Gr  Pr
0x320     0x8003b6da  0x00000000  CallGate32  3    -   P

(c) GDT Hooks (gdt)

Name          Base        Size    File
hookssdt.sys  0xf7c5b000  0x1000  C:\...\lophi\hookssdt.sys

(d) Loaded Kernel Modules (modscan)

Table  Entry Index  Address     Name                      Module
0      0x0000f7     0xf7c5b406  NtSetValueKey             hookssdt.sys
0      0x0000ad     0xf7c5b44c  NtQuerySystemInformation  hookssdt.sys
0      0x000091     0xf7c5b554  NtQueryDirectoryFile      hookssdt.sys

(e) SSDT Hooks (ssdt)

Created Filename:
/.../lophi/$$_rk_sketchy_server.exe
/.../lophi/hookssdt.sys
/.../lophi/sample_0742475e94904c41de1397af5c53dff8e.exe

(f) Disk Event Log (81 Entries Truncated)

Fig. 6: Post-filtered semantic output from rootkit experiment (Section VII-C1).

B. Filtering Background Noise

While the ability to provide a complete log of modifications to the entire system is useful in its own right, it is likely more relevant to extract only those events that are attributed to the binary in question. To filter out the activity not attributed to a sample's execution, we first build a controlled baseline for both our physical and virtual SUTs by creating a dataset (10 independent executions in our case) using a benign binary (rundll32.exe with no arguments). We then use our analysis framework to extract all of the system events for those trials and create a filter based on the events that frequently occurred in this benign dataset. Two of our memory analysis modules, i.e., filescan and timers, had particularly high false positives and proved less useful for our automated analysis. To reduce false positives in our disk analysis, we decouple the filenames from their respective master file table (MFT) record numbers.
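As a sketch of the filtering idea, assuming system events are hashable values such as (module, description) tuples; the frequency threshold is an illustrative choice, not the paper's:

from collections import Counter

def build_noise_filter(baseline_runs, threshold=0.5):
    """baseline_runs: list of event lists from benign executions.
    Events seen in >= threshold of the runs are treated as noise."""
    counts = Counter(e for run in baseline_runs for e in set(run))
    cutoff = threshold * len(baseline_runs)
    return {e for e, c in counts.items() if c >= cutoff}

def filter_events(events, noise):
    # Keep only events not explained by the benign baseline.
    return [e for e in events if e not in noise]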

C. Experiment 1: High-fidelity Output

To verify that LO-PHI is, in fact, capable of extracting behaviors of malware, we first evaluated our system with known malware samples, for which we have ground truth. In our first case study, we evaluated a rootkit that we developed utilizing techniques from The Rootkit Arsenal [15] (Section VII-C1). Similarly, we were able to obtain a set of 213 malware samples that were constructed in a cleanroom environment and were accompanied by their source code with detailed annotations. All the binaries in this experiment were executed on both physical and virtual machines that were running Windows XP (32-bit, Service Pack 3) as their operating system.

1) Homemade Rootkit: Our rootkit stealths itself by adding hooks to the Windows Global Descriptor Table (GDT) and System Service Dispatch Table (SSDT) that hide any directory or running executable with the prefix $$_rk, and then opens a malicious FTP server. The rootkit module is embedded inside a malicious PDF file that drops and loads a malicious driver file (hookssdt.sys) and the FTP server executable ($$_rk_sketchy_server.exe). Figure 6 shows the complete post-filtered results obtained when running this rootkit through our framework. Note that we received identical results for both virtual and physical machines, which exactly matches what we would expect given our ground truth. We clearly see our rootkit drop the files to disk (Figure 6f), load the kernel module (Figure 6d), hook the kernel (Figure 6e and Figure 6c), and then execute our FTP server (Figure 6a and Figure 6b). We have omitted the creation of numerous temporary files by Adobe Acrobat Reader and Windows, as well as accesses to existing files (81 total events), from Figure 6f to save space; however, all disk activity was successfully reconstructed. Note that we can trivially detect the presence of the new process, as we are examining physical memory and are not foiled by execution-level hooks.

We also ran our rootkit on the Anubis and Cuckoo analysis frameworks. Anubis failed to execute the binary, likely due to the lack of Acrobat Reader or some other dependencies. Cuckoo produced an analysis with very similar file-system-level output to ours, reporting 156 file events, compared to our 81 post-filtered. However, we were unable to find our listening socket, or our GDT and SSDT hooks, from analyzing their output. While our FTP server was definitely executed, and thus created a listening socket on port 21, it is possible that our kernel module may not have executed properly on their analysis framework. Nevertheless, we feel that our ability to introspect memory to find these obvious tells of malware is a notable distinction. Subsequently, the lack of execution for such a simple rootkit also emphasizes the importance of having a realistic software environment as well as a hardware environment. We attempt to address this issue for our analysis in Section VII-E.

2) Labeled Malware: For the analysis of our 213 well-annotated malware samples, we first performed a blind analysis, and then later verified our findings with the labels. Note that there were samples that exhibited more behaviors than those listed here; only the most interesting findings are shown.

a) VM-detection: We found that 66 of these samples were labeled as employing either anti-VM or anti-debugging capabilities. However, none of the 66 anti-VM samples performed QEMU-KVM detection; instead they focused on VMWare, VirtualPC, and other virtualization suites. As expected, all of the samples executed their full payload in both our virtual and physical analysis environments.

b) New Processes: We found that 79 of the samples created new long-running processes detected by our memory analysis. The most commonly created process was named svchost.exe, which occurred in 15 samples. In addition, 2 [...]

Figure 5: New processes.


Figure 6: New sockets.


Figure 7: Disk event log.

18 / 31

Experiments

19 / 31

Experiment on evasive malware

Malware previously labeled as "evasive" was executed on Windows 7 and analyzed using LO-PHI.

- Many samples clearly exhibited typical malware behavior.
- Some samples failed because of no network connection, or the wrong Windows version.

"[...] we feel that our findings are more than sufficient to showcase LO-PHI's ability to analyze evasive malware, without being subverted, and subsequently produce high-fidelity results for further analysis."

20 / 31

Timing

 1 # Reset our disk using PXE
 2 machine.machine_reset()
 3 machine.power_on()
 4 # Wait for OS to appear on network
 5 while not machine.network_get_status():
 6     time.sleep(1)
 7 # Allow time for OS to continue loading
 8 time.sleep(OS_BOOT_WAIT)
 9 # Start disk capture
10 disk_tap.start()
11 # Send key presses to download binary
12 machine.keypress_send(ftp_script)
13 # Dump memory (clean)
14 machine.memory_dump(memory_file_clean)
15 # Start collecting network traffic
16 network_tap.start()
17 # Get a list of currently visible buttons
18 button_clicker.update_buttons()
19 # Start our binary and click any buttons
20 machine.keypress_send('SPECIAL:RETURN')
21 # Move our mouse to imitate a human
22 machine.mouse_wiggle(True)
23 # Allow binary to execute (initial)
24 time.sleep(MALWARE_START_TIME)
25 # Dump memory (interim)
26 machine.memory_dump(memory_file_interim)
27 # Take a screenshot (before clicking buttons)
28 machine.screenshot(screenshot_one)
29 # Click any new buttons that appeared
30 button_clicker.click_buttons(new_only=True)
31 # Allow binary to execute (3 min total)
32 time.sleep(MALWARE_EXECUTION_TIME - elapsed_time)
33 # Take a final screenshot
34 machine.screenshot(screenshot_two)
35 # Dump memory (dirty)
36 machine.memory_dump(memory_file_dirty)
37 # Shutdown machine
38 machine.power_shutdown()

Fig. 3: Python script for running a malware sample and collecting the appropriate raw data for analysis.

Because of the duality of our framework, we were able to write one simple script (see Figure 3) that will: 1) reset our machine to a clean state, 2) take a memory image before and after execution, 3) attempt to click any graphical buttons, 4) capture screenshots, and 5) capture all disk and network activity throughout the execution. To download and execute an arbitrary binary (Figure 3, line 12), our implementation uses hotkeys to open a command line interface, executes a recursive file-transfer protocol (FTP) download to retrieve the files to be analyzed, and then runs a batch file to execute the binary. From this data, we reconstruct the changes in system memory, in addition to a complete capture of disk and network activity generated by the binary. To identify any graphical buttons that the malware may present, we use the Volatility "windows" module to identify all visible windows that have an atom class of 0xc061 or an atom superclass of 0xc017, which indicate a button, and then use our actuator to move the mouse to the appropriate location and click it. Our analysis framework also attempts to remove any typical analysis-based artifacts by using a random file name and continuously moving the mouse during the execution of the binary. Similarly, when possible, i.e., when the system is not hung, we also properly shut down the system at the end of the analysis to force any cached disk activity to be flushed.
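A sketch of that button-finding step, assuming the plugin output has already been parsed into dicts (the field names are illustrative assumptions, not Volatility's output schema):

# Find clickable buttons among visible windows parsed from
# Volatility's "windows" module output.
BUTTON_ATOM = 0xC061        # atom class indicating a button
BUTTON_SUPERATOM = 0xC017   # atom superclass indicating a button

def visible_buttons(windows):
    """windows: iterable of dicts with assumed keys."""
    return [w for w in windows
            if w.get("atom_class") == BUTTON_ATOM
            or w.get("super_atom") == BUTTON_SUPERATOM]

The coordinates of each match are then handed to the mouse actuator for clicking.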

In our analysis setup, both the physical and virtual environments had a 10 GB partition for the operating system and 1 GB of volatile memory. The operating system was placed into a "hibernate" state to minimize the variance between executions and also reduce the time required to boot the system. To minimize the space requirements of our system, we compress our memory images before storing them in our databases. While this adds a significant amount of time to our analysis (approximately 2 minutes), it significantly reduces the storage requirement. Finally, the virtual machines' networks were logically divided to ensure that samples did not interfere with each other, and the physical environment consisted of only one machine.

Fig. 4: Time spent in each step of binary analysis. Both environments were booting a 10 GB Windows 7 (64-bit) hibernate image and were running on a system with 1 GB of volatile memory.

The respective runtimes for each portion of our analysis can be seen in Figure 4. We ensured that every binary executed for at least 3 minutes before retrieving our final memory image and resetting the system. Screenshots were obtained using Volatility's screenshot module on physical machines and were extracted from the captured memory images. Note that most of the time taken in the physical case is due to our resetting of the system state using Clonezilla, waiting for the system to boot, and memory acquisition. The resetting and boot process could be decreased significantly by writing a custom PXE loader, or completely mitigated by implementing copy-on-write into our FPGA. Similarly, the memory acquisition times could be more comparable to the virtual case, if not faster, by optimizing our PCIe implementation. Finally, system snapshots could reduce the time spent setting up the environment to mere seconds. While snapshots are trivial with virtual machines, it is still an open problem for physical machines.

We note that LO-PHI may miss any temporal memory modifications made by the binary between our clean and dirty memory images. To analyze the transient behavior of a binary, LO-PHI could be used to continuously poll the system's memory during execution. However, while this has the potential to produce much more fidelity, we do not feel that our current polling rates are fast enough to warrant the tradeoff between the produced DMA artifacts and the usefulness of the output. We hope to explore this area of research in more detail in the future as we improve our memory capture capabilities.


Figure 8: Time spent in each step of binary analysis. Both environments were booting a 10 GB Windows 7 (64-bit) hibernate image and were running on a system with 1 GB of volatile memory.

21 / 31

Critique

22 / 31

Known limitations

- Newer chipsets use IOMMUs, disabling DMA from peripherals
  - The current memory acquisition technique will become unusable
- Smearing: the memory can change during the acquisition
  - Inconsistent states
  - Faster polling rates can help
- Filesystem caching: some data will not pass through the SATA interface
  - Malware could write a file to the disk cache, execute it and delete it before the cache is flushed to disk.
  - However, the effects would be visible in memory.

23 / 31

Issues with the technique and the experiments

- The malware is left to run for only 3 minutes.
  - Many malware samples need much more time to fully uncover their effects (e.g. ransomware).
- No memory polling during the execution of the malware
  - Only snapshots before and after the execution
  - Temporary data used by the malware is never seen
- Assumption: malware does not modify the BIOS or firmware.
  - But if it does, the physical machine may not be recoverable.
  - Costly!
- No Internet access: the authors always run the malware on disconnected machines.
  - Most malware becomes useless without its Command & Control infrastructure.
  - Network access could expose further, unseen artifacts.

24 / 31

Methodological issues

- The article claims that the artifacts from LO-PHI are unusable by malware because there is no baseline (i.e. the malware cannot see the machine before the installation of LO-PHI).
  - This can also be true for traditional approaches.
- No statistical test is used to discover whether the difference in disk/memory throughput is statistically significant (presence of artifacts); see the sketch after this list.
  - A very simple and standard procedure, which should really be done in a scientific paper.
- The network analysis technique is not described and unclear:
  - "We exclude the network trace analysis from much of our discussion since it is a well-known technique and not the focus of our work."
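For illustration, the kind of test the critique asks for takes only a few lines. The data below is synthetic, standing in for the paper's 500 RAMSpeed samples per condition; the real test would use the raw measurements:

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic stand-ins for the 500 throughput samples per condition.
uninstrumented = rng.normal(5600.0, 40.0, 500)         # MB/sec
instrumented = rng.normal(5600.0 * 0.996, 40.0, 500)   # 0.4% worst-case shift

stat, p = mannwhitneyu(uninstrumented, instrumented,
                       alternative="two-sided")
print(f"U = {stat:.0f}, p = {p:.3g}")  # small p -> throughput differs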

25 / 31

Memory throughput with and without LO-PHI

[Box plots: (a) Physical machine (polling at 14 MB/sec); (b) Virtual machine (polling at 160 MB/sec)]

Fig. 1: Average memory throughput comparison as reported by RAMSpeed, with and without instrumentation on both physical and virtual machines. (500 samples for each box plot)

At first glance, Figure 1 may seem to indicate that our instrumentation has a discernible effect on the system; however, the deviation from the uninstrumented median is only 0.4% in the worst case (SSE in Figure 1a). Despite our best efforts to create a controlled experiment, i.e., running RAMSpeed on a fresh install of Windows with no other processes, we were unable to definitively attribute any deviation to our polling of memory. While our memory instrumentation certainly has some impact on the system, the rates at which we are polling memory do not appear to be frequent enough to predictably degrade performance. This result appears to indicate that systems like ours could poll at significantly higher rates while still remaining undetectable. For example, PCIe implementations can achieve DMA read speeds of 3 GB/sec [6], which could permit an entirely new class of introspection capabilities. To this end, we have successfully achieved rates as fast as 60 MB/sec using SLOTSCREAMER [32]; however, the implementation is not yet stable enough to incorporate into our framework. Nevertheless, in order to detect any deviation in performance, the software being analyzed would need to have the ability to baseline our system, which is not feasible in our experiments.

While performance concerns are a universal problem with instrumentation, adding hardware to a physical configuration has numerous additional artifacts that must also be addressed. To utilize PCIe, we must enumerate our card on the bus, which means that the BIOS and operating system are able to see our hardware. This inevitably reveals our presence on the machine; nevertheless, mitigations do exist (e.g., masquerading as a different device). To avoid detection, our card could trivially use a different hardware identifier every time to evade signature-based detection. Even with a masked hardware identifier, however, Stewin et al. [66] demonstrated that all of these DMA-based approaches will reveal some artifacts that are exposed in the CPU performance counters. Similar techniques could be employed by malware authors in both physical and virtual environments to detect the presence of a polling-based memory acquisition system such as ours. These anti-analysis techniques could necessitate more sophisticated acquisition techniques, some of which are proposed in Section IX.

B. Disk Artifacts

To quantify the performance impact of our disk instrumentation, we similarly employed a popular disk benchmarking utility, IOZone [1]. While IOZone's primary purpose is to benchmark the higher-level filesystem, any performance impacts on disk throughput should nonetheless be visible to the tool. We used the same setup as the previous memory experiments and ran IOZone 50 times for each case, i.e., with and without our instrumentation, monitoring the read and write throughput with a record size of 16 MB and file sizes ranging from 16 MB to 2 GB (the total amount of RAM on the SUT).

Our hardware should only be visible when we intentionally delay SATA frames to meet the constraints of our gigabit Ethernet link. We designed the system with this delay to minimize our packet loss, since UDP does not guarantee delivery of packets. In practice, we rarely observed the system cross this threshold; however, IOZone is explicitly made to stress the limits of a file system. For smaller files, caching masks most of our impact, as these cache hits limit the accesses that actually reach the disk. The caching effect is more prevalent when looking at the raw data rates (e.g., the median uninstrumented read rate was 2.2 GB/sec for the 16 MB experiment and 46.2 MB/sec for the 2 GB case).

The discrepancies between the read and write distributions are attributed to the underlying New Technology File System (NTFS) and optimizations in the hard drive firmware. Figure 2b shows that our instrumentation is essentially indistinguishable from the base case when reading, the worst case being a degradation of 3.7% for 2 GB files. With writes, however, where caching offers no benefit, the effects of our instrumentation are clearly visible, with a maximum performance degradation of 14.5%. Under typical operating conditions, throughputs that reveal our degradation are quite rare. In these experiments, the UDP data rates observed from our sensor averaged 2.4 MB/sec with burst speeds reaching as high as 82.5 MB/sec, which directly coincide with the rates observed in Figure 2a, confirming that we are only visible when throttling SATA to meet the constraints of the Ethernet connection.

In the case of virtual machines, we would expect to have no detectable artifacts on a properly provisioned host [...]


Figure 9: Average memory throughput comparison as reported by RAMSpeed, with and without instrumentation. Deviation from the uninstrumented trial is only 0.4% in the worst case. No statistical test was used.

26 / 31

Another point...

- Virtualization is increasingly used in production contexts.
  - Just think about cloud computing!
- Attackers will want to target those VMs [Garfinkel, 2007]
- Malware that deactivates in VMs will be less common

27 / 31

Related and future work

28 / 31

Related work

Traditional dynamic analysis
Many dynamic malware analysis tools rely on virtualization: Ether, BitBlaze, Anubis, V2E, HyperDbg, SPIDER.

- We already saw the limitations of VM approaches: artifacts

Bare-metal dynamic analysis
BareBox [Kirat, 2011]: a malware analysis framework based on a bare-metal machine, without virtualization or emulation

- Only targets user-mode malware
- Only disk analysis (no memory tools)

29 / 31

Future evolutions

Automated, repeated analyses
- The disk restore phase is the lengthiest (> 6 min)
- "The resetting and boot process could be decreased significantly by writing a custom PXE loader, or completely mitigated by implementing copy-on-write into our FPGA."
- "While snapshots are trivial with virtual machines, it is still an open problem for physical machines."

Extensions
- Analyze the transient behavior of a binary
  - Continuous memory polling
  - Need to deal with DMA artifacts
- Cover malware that infects hardware (BIOS, firmware)

30 / 31

References

Chen X., Andersen J., Mao Z.M., Bailey M., Nazario J.
Towards an Understanding of Anti-virtualization and Anti-debugging Behavior in Modern Malware.
In Proceedings of the International Conference on Dependable Systems and Networks (2008).
doi: 10.1109/DSN.2008.4630086

Kirat D., Vigna G., Kruegel C.
BareBox: Efficient Malware Analysis on Bare-metal.
In Proceedings of the 27th Annual Computer Security Applications Conference (2011).
doi: 10.1145/2076732.2076790

Garfinkel T., Adams K., Warfield A., Franklin J.
Compatibility is Not Transparency: VMM Detection Myths and Realities.
In Proceedings of the 11th USENIX Workshop on Hot Topics in Operating Systems (2007).

31 / 31