Design of x86 Emulator for Generic Unpacking
Chandra Prakash (chandrap@sunbelt-software.com)
The problem
A large number of detections are still based on some static signature, e.g., MD5 or CRC32.
Malware has evolved to evade signature-based detection through the use of packers.
The problem, contd…
It is possible to write custom unpacking routines for each packer.
Cryptanalysis or X-Ray can also be used.
But the number of packers, and of variations within each packer type, is too large; e.g., the current version range for UPX is 1.x-3.x and for FSG is 1.x-2.x.
Moreover, there can be recursive layers of packing.
A Solution - Emulation
Due to the nature of the problem, a general-purpose solution is desirable.
Emulation provides a “fairly” general-purpose solution, which gives rise to the term Generic Unpacking.
What is Emulation?
The Wikipedia definition is pretty clear:
“An emulator duplicates (provides an emulation of) the functions of one system using a different system, so that the second system behaves like (and appears to be) the first system. This focus on exact reproduction of external behavior is in contrast to simulation, which can concern an abstract model of the system being simulated, often considering internal state.”
Emulation – where else is it used?
Supporting cross-platform applications
Controlled and secure execution of untrusted applications
And, of course, dynamic behavioral analysis of malware and packed-malware detection via generic unpacking
Etc.
Emulation – to what degree?
Full emulation: emulate everything, the application as well as the operating system, e.g., VMware and Virtual PC
Application only: emulate the application-level instruction set and the system call interface, e.g., Wow64 (Win32 emulation on 64-bit Windows)
Our emulator for Generic Unpacking is application only
Emulator Components
A software implementation of the subset of hardware, operating system and application environment needed for running an application.
The hardware components include: the CPU, registers, interrupt vector table.
The operating system components include: PE loader, virtual memory manager, structured exception handling (SEH).
The application environment includes: input parameter and environment variable support, heap, stack, process environment block (PEB), thread information block (TIB), function hooks for spoofing execution references into system DLLs.
Emulator Components
[Figure: UML class diagram of the emulator components. Classes and members shown:]
System: +initialize(), +loadPEImage(), +createProcess()
Process: +run()
Thread
X86CPU: +fetch(), +executeOneInstruction()
MemoryManager: +readByte(), +writeByte(), +virtualAlloc(), +virtualFree(), +virtualProtect()
MemoryRegion: +readByte(), +writeByte(); -base, -size
MemoryBlockDescriptor: +readByte(), +writeByte(), +generateAccessViolationException(); -allocationType, -protectionType, -startPage, -pageCount
PELoader: +parseImage(), +loadImage()
Stack: +resize(); -top, -bottom
Heap: +heapCreate(), +heapDestroy()
HookList: +addHook(), +findHook()
SEHHandler
Structs: X86Registers, PEB, PEB_LDR_DATA, TIB
Emu Components - PE Loader
The very first step in a target’s emulation
Create a memory-mapped image as per the Windows PE specification.
Calculate the virtual mapped size
Allocate a contiguous buffer based on the virtual mapped size, then copy the PE headers and section data into aligned sections
Fix imports from the primary module
Fix relocations
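The size calculation and aligned copy can be sketched as follows. This is only an illustration under stated assumptions, not the emulator's actual loader: the `Section` tuple and the functions `align_up`, `section_span`, `virtual_mapped_size` and `map_image` are hypothetical names, the headers are assumed to be parsed already, and the rule of falling back to the raw size when the virtual size is zero is an assumption of this sketch.

```python
from collections import namedtuple

# Hypothetical, already-parsed view of one section header.
Section = namedtuple("Section", "virtual_address virtual_size raw_offset raw_size")

def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def section_span(section):
    """Bytes the section occupies in memory (assumption: raw size if virtual size is 0)."""
    return section.virtual_size or section.raw_size

def virtual_mapped_size(size_of_headers, sections, section_alignment):
    """Size of the contiguous buffer needed to hold the mapped image."""
    end = align_up(size_of_headers, section_alignment)
    for s in sections:
        end = max(end, align_up(s.virtual_address + section_span(s), section_alignment))
    return end

def map_image(file_bytes, size_of_headers, sections, section_alignment):
    """Copy headers and section data from the file layout to their virtual offsets."""
    image = bytearray(virtual_mapped_size(size_of_headers, sections, section_alignment))
    headers = file_bytes[:size_of_headers]
    image[:len(headers)] = headers
    for s in sections:
        count = min(s.raw_size, section_span(s))
        data = file_bytes[s.raw_offset:s.raw_offset + count]
        image[s.virtual_address:s.virtual_address + len(data)] = data
    return image
```

Import and relocation fixing would then operate on the returned buffer; they are left out of the sketch.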
Emu Components - Registers
There are eight 32-bit general purpose registers (EAX, EBX, ECX, EDX, EBP, ESP, ESI, EDI)
Six 16-bit segment registers (CS, SS, DS, ES, FS, GS)
DR0-DR3, DR6, DR7 hardware debug registers
EFLAGS and EIP registers
It is an added benefit to also support FPU instructions and extensions to the x86 architecture, such as MMX, SSE, SSE2, SSE3 [10] and 3DNow! instructions
Emu Components - CPU
Fetch instructions from the virtual memory address space of the target
Decode the instruction; find the instruction type, get the operands
Execute the instruction; calculate results and store them
Move on to the next instruction as indicated by EIP
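The fetch, decode, execute cycle above can be illustrated with a toy interpreter. This is a minimal sketch, not the emulator's CPU: `TinyX86CPU`, `EmulationStopped` and `run` are hypothetical names, memory is a flat `bytearray`, only EAX is modeled, and just four opcodes (NOP, INC EAX, MOV EAX,imm32, JMP rel8) are decoded.

```python
import struct

class EmulationStopped(Exception):
    """Raised for an un-emulated opcode or a fetch outside memory."""

class TinyX86CPU:
    def __init__(self, memory, eip=0):
        self.memory = memory        # flat bytearray standing in for virtual memory
        self.eip = eip
        self.eax = 0

    def fetch(self, count=1):
        """Read count bytes at EIP and advance EIP past them."""
        data = bytes(self.memory[self.eip:self.eip + count])
        if len(data) != count:
            raise EmulationStopped("fetch outside memory")
        self.eip += count
        return data

    def execute_one_instruction(self):
        opcode = self.fetch()[0]
        if opcode == 0x90:                      # NOP
            pass
        elif opcode == 0x40:                    # INC EAX
            self.eax = (self.eax + 1) & 0xFFFFFFFF
        elif opcode == 0xB8:                    # MOV EAX, imm32
            (self.eax,) = struct.unpack("<I", self.fetch(4))
        elif opcode == 0xEB:                    # JMP rel8 (relative to the next instruction)
            (rel,) = struct.unpack("<b", self.fetch(1))
            self.eip = (self.eip + rel) & 0xFFFFFFFF
        else:
            raise EmulationStopped("un-emulated opcode " + hex(opcode))

def run(cpu, max_instructions):
    """Execute until the limit is reached or emulation stops; return the count executed."""
    executed = 0
    while executed < max_instructions:
        try:
            cpu.execute_one_instruction()
        except EmulationStopped:
            break
        executed += 1
    return executed
```

A real decoder also has to handle prefixes, ModR/M and SIB bytes, and flag updates, all of which are omitted here.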
Emu Components – Interrupt Handling
INT N generates an interrupt, with N in the range 0-255
Execution of INT N results in a software exception in the application
From user mode only a subset of these is allowed; all others result in an access violation exception
Emu Components – Interrupt Handling…contd
User-mode exceptions for INT N as observed on Windows XP SP2
Interrupt Number (N)    Exception thrown
3, 2d                   Breakpoint
4                       Integer Overflow
2a, 2b, 2c, 2e          None
All others              Access Violation
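The table above translates directly into a lookup. The sketch below assumes the interrupt numbers in the table are hexadecimal; the function name `exception_for_int` and the string constants are hypothetical.

```python
BREAKPOINT = "Breakpoint"
INTEGER_OVERFLOW = "Integer Overflow"
ACCESS_VIOLATION = "Access Violation"

# Interrupt numbers taken from the table (read as hexadecimal).
_INT_EXCEPTIONS = {0x03: BREAKPOINT, 0x2D: BREAKPOINT, 0x04: INTEGER_OVERFLOW}
_INT_NO_EXCEPTION = {0x2A, 0x2B, 0x2C, 0x2E}

def exception_for_int(n):
    """Return the exception an emulated INT n should raise, or None for no exception."""
    if not 0 <= n <= 0xFF:
        raise ValueError("interrupt number must be in the range 0-255")
    if n in _INT_NO_EXCEPTION:
        return None
    return _INT_EXCEPTIONS.get(n, ACCESS_VIOLATION)
```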
Emu Components – Virtual Memory Manager
Manages the virtual memory used by the target at the very lowest level
Maintains memory regions
Each region consists of a contiguous sequence of pages, e.g., the PE image region
Each page has its own allocation and protection characteristics
Allocation types include reserved, committed and free
Protection types include read, write, execute, etc.
An access violation is generated when a memory reference is not compatible with the allocation and protection type of the memory being referenced
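A per-page access check of the kind described above might look like this. It is a sketch under assumptions: the class and method names are hypothetical (loosely following the `MemoryRegion` name from the class diagram), protections are modeled as a string drawn from "rwx" rather than Win32 PAGE_* constants, and a 4 KB page size is assumed.

```python
PAGE_SIZE = 0x1000  # assumed page size

class AccessViolation(Exception):
    """Raised when an access is incompatible with a page's state or protection."""

class Page:
    def __init__(self):
        self.state = "free"     # "free", "reserved" or "committed"
        self.protection = ""    # subset of "rwx"; empty means no access

class MemoryRegion:
    def __init__(self, base, page_count):
        self.base = base
        self.pages = [Page() for _ in range(page_count)]

    def _pages(self, address, size):
        """Return the pages touched by [address, address + size)."""
        first = (address - self.base) // PAGE_SIZE
        last = (address + size - 1 - self.base) // PAGE_SIZE
        if first < 0 or last >= len(self.pages):
            raise AccessViolation(hex(address))
        return self.pages[first:last + 1]

    def commit(self, address, size, protection):
        for page in self._pages(address, size):
            page.state, page.protection = "committed", protection

    def check_access(self, address, size, kind):
        """kind is 'r', 'w' or 'x'; raises AccessViolation if any touched page forbids it."""
        for page in self._pages(address, size):
            if page.state != "committed" or kind not in page.protection:
                raise AccessViolation(hex(address))
```

Every emulated read, write and instruction fetch would call `check_access` before touching memory, so an access that straddles two pages is validated against both.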
SEH handling
Most commonly used to obfuscate the execution path by deliberately generating and handling software exceptions.
Typically used instructions are:
Single step (INT1) and breakpoint (INT3) instructions
Arithmetic divide or integer overflow exceptions generated by the DIV/IDIV and INTO instructions.
Stack
The stack is a contiguous memory region that serves, among other things, as a work area for parameters passed in function calls and for the SEH chain.
There is one stack for each thread.
It is implemented in an inverted manner, so that it grows in the direction of decreasing memory addresses.
The stack parameters, e.g., base, limit, and address of the top-level exception handler frame, should be set appropriately in the TIB.
Heap
The heap enables efficient memory allocations of much finer granularity than the page-granular allocations of the VirtualAlloc call.
To support Win32 heap-related calls made by the target, e.g., HeapAlloc, HeapFree, etc., a simulation of them needs to be provided.
The heap is implemented as a wrapper around page-granular memory allocation calls.
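One way to picture a heap layered over page-granular allocation is the sketch below. Everything in it is an assumption for illustration: `PageAllocator` is a hypothetical stand-in for the emulator's VirtualAlloc, the base address is arbitrary, and the 8-byte granularity and exact-size free lists are simplifications, not the Windows heap's behavior.

```python
PAGE_SIZE = 0x1000

class PageAllocator:
    """Hypothetical stand-in for the emulator's VirtualAlloc: hands out whole pages."""
    def __init__(self, base=0x00140000):    # arbitrary illustrative base address
        self.next_free = base

    def virtual_alloc(self, size):
        pages = (size + PAGE_SIZE - 1) // PAGE_SIZE
        address = self.next_free
        self.next_free += pages * PAGE_SIZE
        return address, pages * PAGE_SIZE

class Heap:
    GRANULARITY = 8                          # assumed allocation granularity

    def __init__(self, allocator):
        self.allocator = allocator
        self.cursor = self.limit = 0         # unused tail of the current page run
        self.free_blocks = {}                # block size -> list of freed addresses
        self.sizes = {}                      # live block address -> block size

    def heap_alloc(self, size):
        size = max(self.GRANULARITY, (size + 7) & ~7)
        if self.free_blocks.get(size):
            address = self.free_blocks[size].pop()   # reuse a freed block of this size
        else:
            if self.cursor + size > self.limit:      # current pages exhausted
                self.cursor, got = self.allocator.virtual_alloc(size)
                self.limit = self.cursor + got
            address = self.cursor
            self.cursor += size
        self.sizes[address] = size
        return address

    def heap_free(self, address):
        size = self.sizes.pop(address)
        self.free_blocks.setdefault(size, []).append(address)
```

Many small allocations are served from a single page run, so the target's HeapAlloc calls cost far fewer page-level operations than one VirtualAlloc per request would.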
Thread Information Block (TIB)
For each thread there is a TIB structure stored at the address indicated by FS:[18h].
+0x000 ExceptionList : Ptr32 _EXCEPTION_REGISTRATION_RECORD
+0x004 StackBase : Ptr32 Void
+0x008 StackLimit : Ptr32 Void
+0x00c SubSystemTib : Ptr32 Void
+0x010 FiberData : Ptr32 Void
+0x010 Version : Uint4B
+0x014 ArbitraryUserPointer : Ptr32 Void
+0x018 Self : Ptr32 _NT_TIB
The first field, ExceptionList, contains the address of the top-level exception handler frame, represented by an EXCEPTION_REGISTRATION_RECORD structure.
StackBase and StackLimit contain the upper bound and lower bound of the thread’s stack, respectively (the stack grows toward lower addresses).
The address of the PEB can be obtained from FS:[30h].
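An emulator has to materialize this structure in the target's memory before the first instruction runs. The sketch below packs the fields at the offsets listed above; `build_tib` and `read_fs_dword` are hypothetical helpers, the addresses in the usage lines are illustrative, and using 0xFFFFFFFF to mark an empty exception chain is stated here as an assumption of the sketch.

```python
import struct

TIB_SIZE = 0x34     # enough to cover the PEB pointer read through FS:[30h]

def build_tib(tib_address, peb_address, stack_base, stack_limit,
              exception_list=0xFFFFFFFF):    # assumed end-of-chain marker
    """Return the bytes of a minimal 32-bit TIB for one emulated thread."""
    tib = bytearray(TIB_SIZE)
    struct.pack_into("<3I", tib, 0x00, exception_list, stack_base, stack_limit)
    struct.pack_into("<I", tib, 0x18, tib_address)   # Self
    struct.pack_into("<I", tib, 0x30, peb_address)   # PEB pointer
    return tib

def read_fs_dword(tib, offset):
    """Emulate a DWORD read of FS:[offset], with FS based at the TIB."""
    (value,) = struct.unpack_from("<I", tib, offset)
    return value

# Illustrative addresses only.
tib = build_tib(tib_address=0x7FFDE000, peb_address=0x7FFDF000,
                stack_base=0x00130000, stack_limit=0x0012E000)
```

With this in place, `read_fs_dword(tib, 0x18)` yields the TIB's own address and `read_fs_dword(tib, 0x30)` yields the PEB address, which is what packed code typically reads.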
Process Environment Block (PEB)
For each user-mode process there is one PEB.
Some of the important fields accessed by malware are: BeingDebugged, ImageBaseAddress, and the InLoadOrderModuleList, InMemoryOrderModuleList and InInitializationOrderModuleList fields of PEB_LDR_DATA.
The IsDebuggerPresent Win32 API simply returns the value of the BeingDebugged field of the PEB. Malware uses this to detect a debugger’s presence as one of its anti-debugging tricks.
+0x002 BeingDebugged : UChar //In PEB
The list of loaded modules is maintained in three differently sorted LIST_ENTRY type data structures in PEB_LDR_DATA:
+0x00c InLoadOrderModuleList : _LIST_ENTRY
+0x014 InMemoryOrderModuleList : _LIST_ENTRY
+0x01c InInitializationOrderModuleList : _LIST_ENTRY
Function hooks
In an application-only emulator, any system call made by malware into a dependent system module such as kernel32.dll is intercepted, and a corresponding spoofed implementation is provided.
Some of the functions include: LoadLibraryA/W, GetProcAddress, GetModuleHandleA/W, VirtualAlloc, VirtualFree, HeapAlloc, HeapFree, GetVersionExA/W, etc.
A default un-emulated function hook should also be provided; it gets called when an un-implemented import function is encountered.
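A hook table with a default fallback could be organized as below. This is a sketch, not the emulator's implementation: the Python names mirror `addHook()` and `findHook()` from the class diagram, while `UnemulatedCall`, `default_hook`, `LOADED_MODULES` and the spoofed function are hypothetical, and the module base address is made up.

```python
class UnemulatedCall(Exception):
    """Raised by the default hook; the emulator can treat it as a stop condition."""

def default_hook(module, function, *args):
    raise UnemulatedCall(module + "!" + function)

class HookList:
    def __init__(self, default=default_hook):
        self._hooks = {}
        self._default = default

    def add_hook(self, module, function, handler):
        self._hooks[(module.lower(), function)] = handler

    def find_hook(self, module, function):
        """Return the spoofed implementation, or a wrapper around the default hook."""
        handler = self._hooks.get((module.lower(), function))
        if handler is None:
            return lambda *args: self._default(module, function, *args)
        return handler

# Illustrative spoofed implementation; the module table and base address are made up.
LOADED_MODULES = {"kernel32.dll": 0x77E80000}

def spoofed_get_module_handle_a(name):
    return LOADED_MODULES.get(name.lower(), 0)

hooks = HookList()
hooks.add_hook("kernel32.dll", "GetModuleHandleA", spoofed_get_module_handle_a)
```

When emulated execution reaches an imported function's address, the emulator would look up the hook, run it, and place the result in EAX instead of executing the real DLL code.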
Stop Conditions
Ideally the emulator should be stopped at the original entry point (OEP).
Finding the exact OEP in a generic way is non-trivial.
Typical conditions, other than explicit termination initiated by the target, are:
Encountering an un-emulated system call in a dependent module.
Unhandled exception for which no SEH handler was found. These exceptions include invalid memory read, write or execute, divide by zero, and integer overflow.
Stop Conditions…Contd
Encountering an un-emulated or illegal instruction.
A configured timeout.
Maximum number of instructions being reached.
Attempt to load a DLL that could not be located.
Too many DLLs being explicitly loaded by the target.
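The stop conditions listed above can be folded into the emulator's main loop roughly as follows. The `StopReason` members, `run_until_stop`, and the `step` callback (which executes one instruction and returns either `None` or a `StopReason`) are all hypothetical names chosen for this sketch.

```python
import enum
import time

class StopReason(enum.Enum):
    TARGET_EXITED = "target initiated termination"
    UNEMULATED_CALL = "un-emulated system call"
    UNHANDLED_EXCEPTION = "no SEH handler found"
    ILLEGAL_INSTRUCTION = "un-emulated or illegal instruction"
    TIMEOUT = "configured timeout"
    INSTRUCTION_LIMIT = "maximum number of instructions"
    DLL_NOT_FOUND = "DLL could not be located"
    TOO_MANY_DLLS = "too many DLLs loaded"

def run_until_stop(step, max_instructions, timeout_seconds):
    """Call step() repeatedly; return (reason, number of instructions executed)."""
    deadline = time.monotonic() + timeout_seconds
    for executed in range(max_instructions):
        if time.monotonic() >= deadline:
            return StopReason.TIMEOUT, executed
        reason = step()
        if reason is not None:
            return reason, executed + 1
    return StopReason.INSTRUCTION_LIMIT, max_instructions
```

Reading the clock on every instruction is wasteful; a practical loop would more likely check the deadline every few thousand instructions.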
Emulator fine tuning due to malware unique characteristics
Practical constraints, such as performance optimizations and undocumented features, allow only a limited implementation of the emulator.
Once the core emulator system is ready, developing a robust emulator is an iterative process driven by minor fine tuning for the unique characteristics of supported packers and the symptoms exhibited by the malware test-bed.
The examples that follow describe some of the cases experienced with malware samples that led to improvements in our emulator.
The cases described in these examples are in no way complete!
Example 1 – Setting Initial Stack
0041C25A CALL 0041C25F
0041C25F PUSH EBP
0041C260 MOV EBX,DWORD PTR SS:[ESP+8]
0041C264 MOV EBP,DWORD PTR SS:[ESP+4]
0041C268 SUB DWORD PTR SS:[ESP+4],1A4AF
At address 0041C260, the MOV instruction references the value at [ESP+8], which was the top of the initial stack when the entry point was reached.
This value is the return address following the CALL instruction in kernel32.dll that “calls” the malware entry point.
Returning to that address eventually ends up calling ExitProcess.
The emulator's initial stack should therefore hold a plausible return address into kernel32.dll.
Example 2 – Module load address alignment
004A1584 MOV EBX,DWORD PTR SS:[ESP+24] ; EBX=77E8141A
004A1588 AND EBX,FFE00000 ; EBX=77E00000
. . .
004A16C4 ADD EBX,10000
004A16CA JE SHORT 004A16F7
004A16CC CMP WORD PTR DS:[EBX],5A4D
004A16D1 JNZ SHORT 004A16C4
At 004A16C4 the EBX value is incremented by the system allocation granularity (10000h, i.e., 64 KB).
At 004A16CC it compares the WORD located at the address in EBX with 5A4D (ASCII ‘MZ’), which is the startup marker of a PE image.
If the startup marker is not found at the address pointed to by EBX, execution loops back to location 004A16C4; the loop ends once the marker is found.
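The scan above only ever probes addresses that are multiples of 64 KB, which is why emulated modules must be loaded on that alignment. The sketch below mimics the loop in Python to show the effect; `find_module_base` and the `read_word` callback are hypothetical, the instructions elided by ". . ." in the listing are not modeled, and the module base used in the usage note is an illustrative assumption.

```python
ALLOCATION_GRANULARITY = 0x10000

def find_module_base(read_word, address_inside_module, mask=0xFFE00000):
    """Mimic the scan: step EBX by 64 KB until the WORD at EBX is 5A4Dh ('MZ')."""
    ebx = address_inside_module & mask              # AND EBX,FFE00000
    while True:
        ebx = (ebx + ALLOCATION_GRANULARITY) & 0xFFFFFFFF   # ADD EBX,10000
        if ebx == 0:                                # JE taken: EBX wrapped around to zero
            return None
        if read_word(ebx) == 0x5A4D:                # CMP WORD PTR DS:[EBX],5A4D
            return ebx
```

With a module header placed at an aligned address such as 77E80000 the scan finds it; with the same header at an unaligned address such as 77E80800 the scan never reads that location and gives up.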
Example 3 – Startup Register Values
31428200 PUSH ED01C390
31428205 MOV EAX,ESP
31428207 CALL EAX
0012FFC0 NOP
0012FFC1 RETN
31428209 XCHG EAX,EBX ; EAX=7FFDF000, EBX=0012FFC0
3142820A POP EBX
At 31428209 EBX is referenced; its startup value (7FFDF000) is equal to the PEB address of the program, so the emulator should initialize EBX to the PEB address at startup.
Example 4 – Handling DLL emulation
For the correct emulation of a DLL, the input parameters of its entry point function DllMain must be set up on the stack, as in Windows, before DllMain gets called.
BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved);
Example 5 – Setting register values before calling SEH handler
0048C093 CMP AL,4
0048C095 JNZ SHORT 0048C09B
0048C097 NOP
0048C098 NOP
0048C099 RETN
The SEH handler’s second instruction at 0048C095 is a conditional jump whose outcome depends on the value of AL at handler entry (the preceding CMP compares AL with 4).
In real Windows, EAX is set to zero just before the SEH handler gets control.
Therefore, before the SEH handler gets control, EAX and the other registers should be set up as they are in Windows.
Example 6 – Setting top level exception handler in SEH
004141EF SUB EDX,EDX
004141F1 MOV EAX,DWORD PTR FS:[EDX]
004141F4 MOV ESP,DWORD PTR DS:[EAX]
004141F6 POP DWORD PTR FS:[EDX]
004141F9 POP EAX
004141FA POP EBP
004141FB RETN
Windows also registers another handler on top before the application handler gets control.
The malware had already placed a return address on the stack, which gets executed after the RETN at 004141FB.
At 004141F4 it skips over the top-level SEH handler and positions ESP at the SEH frame for this handler.
At 004141F6 the two top SEH handlers are torn down, and after 004141FB execution resumes at the location specified by ESP, which was last updated at 004141FA.
Example 7 – Check for BeingDebugged field in PEB
3142821B MOV EAX, DWORD PTR FS:[18]
31428220 MOV EAX, DWORD PTR DS:[EAX+30]
31428223 MOVZX EAX, BYTE PTR DS:[EAX+2]
31428227 CMP EAX, 0
3142822A JNZ SHORT 3142826E
3142822C CALL 31428231
31428231 POP EBP
At 3142821B the address of the TIB is obtained, which is used to get the address of the PEB at 31428220.
At 31428223 the BeingDebugged field of the PEB is read; its value decides the branch instruction at 3142822A.
Example 8 – Check for loader lists in PEB
0044D0A5 MOV EAX,DWORD PTR FS:[30]
0044D0AB TEST EAX,EAX
0044D0AD JS SHORT 0044D0BB
0044D0AF MOV EAX,DWORD PTR DS:[EAX+C]
0044D0B2 MOV ESI,DWORD PTR DS:[EAX+1C]
0044D0B5 LODS DWORD PTR DS:[ESI]
At 0044D0A5 the PEB is referenced.
At 0044D0AF, 0044D0B2 and 0044D0B5, PEB_LDR_DATA, InInitializationOrderModuleList and InInitializationOrderModuleList.Flink, respectively, are referenced.
The malware happens to be referencing the kernel32.dll load information in its dependent module list sorted by initialization order.
Example 9 – Reference to Thread Local Storage
004033FA MOV EAX,DWORD PTR DS:[4503D4]
00403400 TEST CL,CL
00403402 JNZ SHORT 0040341A
00403404 MOV EDX,DWORD PTR FS:[2C]
0040340B MOV EAX,DWORD PTR DS:[EDX+EAX*4]
0040340E RETN
At 00403404 the address of the thread local storage pointer array, stored at FS:[2Ch], is copied into EDX.
The next instruction at 0040340B returns in EAX the TLS pointer indexed by the previous value of EAX.
Example 10 – Normalizing malformed PEs in loader
All Win32 PE executables are expected to follow the PE format specification in the strictest sense.
Yet many malware samples do not conform to these formal guidelines and are still allowed to run by the Windows loader.
In general, the emulator should load a malware sample as long as the Windows loader accepts it, relaxing its constraints for these kinds of aberrations.
Example 10 – Normalizing malformed PEs in loader…contd
Structure          Field                    Value
DOS Header         e_lfanew                 0x10
Optional Header    SizeOfCode               0x4c454e52
Optional Header    SizeOfInitializedData    0x442e3233
Optional Header    AddressOfEntryPoint      0x11a4
Section Header     PointerToRawData         0x10
Emulator Performance Optimizations
In plain emulation, instructions are executed in software.
Plain emulation is hundreds of times slower than native execution.
It is not well suited to malware that requires emulation of hundreds of millions of instructions.
Emulator Performance Optimizations – Dynamic Binary Translation (DBT)
Frequently executed instructions, e.g., a decryption loop, are translated into native instructions.
Once the same set of instructions has been executed more than a certain threshold number of times, its translated counterpart is executed instead.
DBT is only about ten times slower than native execution.
Some more DBT details
Code is partitioned into a sequence of Basic Blocks (BB).
Each BB is self-contained and does not contain any branch instructions.
For each BB, the corresponding translation into native instructions is obtained.
There is a performance hit at the time of translation, but it is incurred only once.
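The threshold-and-cache policy behind DBT can be sketched without generating any machine code. In the sketch below a Python callable stands in for the translated native block, which is an assumption made purely for illustration; `BlockCache`, `interpret` and `translate` are hypothetical names, and the default threshold value is arbitrary.

```python
class BlockCache:
    def __init__(self, interpret, translate, threshold=50):
        self.interpret = interpret    # slow path: emulate the block instruction by instruction
        self.translate = translate    # one-time cost: build a fast callable for the block
        self.threshold = threshold
        self.hits = {}                # block start address -> interpreted execution count
        self.translated = {}          # block start address -> translated callable

    def execute_block(self, address, state):
        """Run the basic block at address and return whatever the block runner returns."""
        fast = self.translated.get(address)
        if fast is not None:
            return fast(state)
        count = self.hits.get(address, 0) + 1
        self.hits[address] = count
        if count >= self.threshold:
            # The block is hot: pay the translation cost once, use it from the next run on.
            self.translated[address] = self.translate(address)
        return self.interpret(address, state)
```

A block that runs only a few times is never translated, so the one-time translation cost is spent only where it pays off, for example in a decryption loop. Because packers overwrite their own code, a real cache would also have to invalidate translations when the underlying bytes change.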
Page fault handler based unpacking
All memory writes from a packed program are monitored from the kernel until an execute is issued in one of the modified, monitored memory regions.
The page fault handler based unpacking system yields the maximum speed improvement, as the malware is allowed to run natively on the host machine and in that sense does not require any kind of emulation.
But its implementation is discouraged, as it requires unconventional modification of the page fault interrupt handler in the kernel and may not even work on 64-bit Vista because of PatchGuard protection.
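The bookkeeping behind this approach, leaving aside the kernel-mode hooking itself, amounts to remembering which pages were written and flagging the first execution inside one of them. The class and method names below are hypothetical and a 4 KB page size is assumed.

```python
PAGE_SHIFT = 12     # assumed 4 KB pages

class WriteThenExecuteMonitor:
    def __init__(self):
        self.written_pages = set()

    def on_write(self, address):
        """Record that the page containing address was modified."""
        self.written_pages.add(address >> PAGE_SHIFT)

    def on_execute(self, address):
        """Return True if execution is entering a page that was written earlier."""
        return (address >> PAGE_SHIFT) in self.written_pages
```

The same check can also serve inside an emulator as a heuristic signal that unpacked code is about to run.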
Thank You