Fast Dynamic Binary Translation for the Kernel, Piyus Kedia and Sorav Bansal, IIT Delhi
TRANSCRIPT
![Page 1: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/1.jpg)
Fast Dynamic Binary Translation for the Kernel
Piyus Kedia and Sorav Bansal, IIT Delhi
![Page 2: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/2.jpg)
Applications of Dynamic Binary Translation (DBT)
• OS Virtualization
• Testing and Verification of Compiled Programs
• Profiling and Debugging
• Software Fault Isolation
• Dynamic Optimizations
• Program Shepherding
• … and more
![Page 3: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/3.jpg)
A Short Introduction to Dynamic Binary Translation (DBT)
[Figure: DBT control flow. Start → Dispatcher → Translate Block (from native code) → Execute Block → back to the Dispatcher. Each translated block terminates with a branch-to-dispatcher instruction.]
![Page 4: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/4.jpg)
Code Cache
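[Figure: DBT control flow with a code cache. Start → Dispatcher → "cached?" check. If yes, execute from the code cache. If no, translate the block from native code, store it in the code cache, then execute it.]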
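The dispatcher loop in the figure can be sketched as a toy model. This is a minimal illustration under stated assumptions, not BTKernel code: the guest program is a dictionary of made-up basic blocks, and `GuestBlock`, `translate_block`, and `dispatcher` are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

State = Dict[str, int]
Translated = Callable[[State], Optional[int]]


@dataclass(frozen=True)
class GuestBlock:
    """Toy guest basic block: add to 'acc', then branch."""
    add: int
    loop_until: Optional[int]  # if set, re-run this block while acc < loop_until
    next_pc: Optional[int]     # fall-through target; None ends the program


def translate_block(pc: int, block: GuestBlock) -> Translated:
    """Stand-in for a real translator: returns a host callable for the block."""
    def run(state: State) -> Optional[int]:
        state["acc"] += block.add
        if block.loop_until is not None and state["acc"] < block.loop_until:
            return pc          # branch back to the same guest block
        return block.next_pc   # hand the next guest PC back to the dispatcher
    return run


def dispatcher(program: Dict[int, GuestBlock], start_pc: int,
               state: State) -> Tuple[int, int]:
    """Run the program; return (blocks translated, blocks executed)."""
    code_cache: Dict[int, Translated] = {}
    translations = executions = 0
    pc: Optional[int] = start_pc
    while pc is not None:
        if pc not in code_cache:            # cached? -> no
            code_cache[pc] = translate_block(pc, program[pc])
            translations += 1               # store in code cache
        pc = code_cache[pc](state)          # execute from code cache
        executions += 1
    return translations, executions
```

With a two-block program in which block 0 loops five times before falling through to block 1, the dispatcher translates each block once and executes six blocks in total; the repeated executions are served from the cache.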
![Page 5: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/5.jpg)
DBT Overheads
• User-level DBT well understood
  • Near-native performance for application-level workloads
• DBT for the Kernel requires more mechanisms
  • Efficiently handling exceptions and interrupts
  • Case studies:
    • VMware’s Software Virtualization
    • DynamoRIO Kernel (DRK) [ASPLOS ’12]
![Page 6: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/6.jpg)
Interposition on Starting (Entry) Points
[Figure: the same dispatcher and code cache loop (cached? → execute from code cache, or translate block → store in code cache), with every kernel entry point ("Start") routed through the dispatcher.]
![Page 7: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/7.jpg)
IDT now points to the dispatcher
[Figure: the Interrupt Descriptor Table entries point to the dispatcher, which then follows the usual loop (cached? → execute from code cache, or translate block → store in code cache).]
![Page 9: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/9.jpg)
What does the dispatcher do?
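Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)
[Figure: the interrupt frame on the guest stack (CS register, PC, Flags, with SP pointing at the frame), shown before and after conversion. After conversion the PC slot holds the native PC.]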
![Page 10: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/10.jpg)
What does the dispatcher do?
Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)
2. Emulates Precise Exceptions
![Page 11: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/11.jpg)
What does the dispatcher do?
Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)
2. Emulates Precise Exceptions
Precise Exceptions: before an exception handler executes, all instructions up to the faulting instruction must have executed, and nothing after it may have executed.
![Page 12: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/12.jpg)
What does the dispatcher do?
Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)
2. Emulates Precise Exceptions
  • Rolls back partially executed translations
[Figure: Precise Exceptions. An instruction sequence (push, store, add, sub, load, mov, pop) in which the instructions before the faulting one are marked "Executed", after which the exception handler executes.]
![Page 13: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/13.jpg)
What does the dispatcher do?
Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)
2. Emulates Precise Exceptions
  • Rolls back partially executed translations
3. Emulates Precise Interrupts
![Page 14: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/14.jpg)
What does the dispatcher do?
Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on stack to native values (e.g., PC)
2. Emulates Precise Exceptions
  • Rolls back partially executed translations
3. Emulates Precise Interrupts
  • Delays interrupt delivery till start of next native instruction
[Figure: Precise Interrupts. An instruction sequence (push, store, add, sub, load, mov, pop) in which the instructions up to an instruction boundary are marked "Executed", after which the interrupt handler executes.]
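Steps 1 and 3 both rely on knowing which native instruction a code-cache address belongs to. The sketch below shows one way such a lookup could work, assuming the translator keeps a sorted map from code-cache addresses to native PCs. The map contents, `TRANSLATION_MAP`, `native_pc_for`, and `interrupt_delivery_point` are all hypothetical and exist only for illustration.

```python
import bisect
from typing import List, Tuple

# Hypothetical translation map: (code-cache start address, native PC) for each
# native instruction, sorted by code-cache address. One native instruction may
# expand to several translated instructions.
TRANSLATION_MAP: List[Tuple[int, int]] = [
    (0x9000, 0x1000),
    (0x9008, 0x1002),
    (0x9014, 0x1005),
]
CACHE_START = 0x9000
CACHE_END = 0x9018  # one past the last translated byte


def native_pc_for(cache_pc: int) -> int:
    """Step 1: native PC of the instruction whose translation contains cache_pc."""
    if not CACHE_START <= cache_pc < CACHE_END:
        raise ValueError("address is outside the translated range")
    starts = [start for start, _ in TRANSLATION_MAP]
    index = bisect.bisect_right(starts, cache_pc) - 1
    return TRANSLATION_MAP[index][1]


def interrupt_delivery_point(cache_pc: int) -> int:
    """Step 3: code-cache address at which a precise interrupt may be delivered.

    Returns cache_pc if it already sits on a native instruction boundary,
    otherwise the start of the next native instruction's translation.
    """
    if not CACHE_START <= cache_pc <= CACHE_END:
        raise ValueError("address is outside the translated range")
    boundaries = [start for start, _ in TRANSLATION_MAP] + [CACHE_END]
    return boundaries[bisect.bisect_left(boundaries, cache_pc)]
```

For example, an interrupt arriving at `0x900C`, in the middle of the translation of native instruction `0x1002`, would be delayed until `0x9014`. This per-interrupt bookkeeping is the kind of work that a fully transparent dispatcher has to perform on every entry.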
![Page 15: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/15.jpg)
Effect on Performance
Applications with high interrupt and exception activity exhibit large DBT overheads
![Page 16: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/16.jpg)
VMware’s Software Virtualization Overheads
[Chart: percentage overhead over native, per benchmark]

| Benchmark | Overhead over native (%) |
|---|---|
| SpecInt | 2.9 |
| kernel-compile | 27.11 |
| apache | 123.48 |

Data from “Comparison of Software and Hardware Techniques for x86 Virtualization”, K. Adams, O. Agesen, VMware, ASPLOS 2006.
![Page 17: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/17.jpg)
VMware’s Software Virtualization Overheads
[Chart: percentage overhead over native, per benchmark]

| Benchmark | Overhead over native (%) |
|---|---|
| SpecInt | 2.9 |
| kernel-compile | 27.11 |
| apache | 123.48 |
| 2D-graphics | 57.81 |
| large-RAM | 91.68 |
| forkwait | 603.44 |

Data from “Comparison of Software and Hardware Techniques for x86 Virtualization”, K. Adams, O. Agesen, VMware, ASPLOS 2006.
![Page 18: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/18.jpg)
VMware’s Software Virtualization Overheads
[Chart: percentage overhead over native, per benchmark; divzero and syscall are nano-benchmarks]

| Benchmark | Overhead over native (%) |
|---|---|
| SpecInt | 2.9 |
| kernel-compile | 27.11 |
| apache | 123.48 |
| 2D-graphics | 57.81 |
| large-RAM | 91.68 |
| forkwait | 603.44 |
| divzero | 262.54 |
| syscall | 853.72 |

Data from “Comparison of Software and Hardware Techniques for x86 Virtualization”, K. Adams, O. Agesen, VMware, ASPLOS 2006.
![Page 19: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/19.jpg)
DynamoRIO Kernel (DRK) Overheads
[Chart: percentage overhead over native, per benchmark]

| Benchmark | DRK overhead over native (%) |
|---|---|
| fileserver | 212.3 |
| webserver | 351.85 |
| webproxy | 325.37 |
| varmail | 44.44 |
| apachebench | 184.13 |

Data from “Comprehensive Kernel Instrumentation via Dynamic Binary Translation”, P. Feiner, A.D. Brown, A. Goel, U. Toronto, ASPLOS 2012.
![Page 20: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/20.jpg)
DRK vs BTKernel
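[Chart: percentage overhead over native, DRK vs BTKernel]

| Benchmark | DRK overhead (%) | BTKernel overhead (%) |
|---|---|---|
| fileserver | 212.3 | 0.36 |
| webserver | 351.85 | 2.19 |
| webproxy | 325.37 | 2.44 |
| varmail | 44.44 | 10.6 |
| apachebench | 184.13 | 0.42 |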
![Page 21: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/21.jpg)
Fully Transparent Execution is not required
• The OS kernel rarely relies on precise exceptions
• The OS kernel rarely relies on precise interrupts
• The OS kernel seldom inspects the PC address pushed on the stack. It is used only when returning from the interrupt with the iret instruction.
![Page 22: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/22.jpg)
Faster Execution is Possible
• Leave code cache addresses in kernel stacks.
  • An interrupt/exception directly jumps into the code cache, bypassing the dispatcher.
• Allow imprecise interrupts and exceptions.
• Handle special cases specially.
![Page 23: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/23.jpg)
IDT now points to the code cache
[Figure: the Interrupt Descriptor Table entries now point directly into the code cache. The dispatcher is still part of the loop for blocks that are not yet cached (translate block → store in code cache).]
![Page 25: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/25.jpg)
Correctness Concerns
1. Read / Write of the interrupted PC address on stack will return incorrect values.
  • Fortunately, this is rare in practice and can be handled specially
![Page 26: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/26.jpg)
Read of an interrupted PC address
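[Figure: the interrupt frame on the guest stack (CS register, translated PC, Flags, with SP pointing at the frame); a "load addr" instruction reads the translated PC slot.]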
Examples:
1. Exception Tables in Linux page fault handler
![Page 27: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/27.jpg)
Exception Tables in Linux
• Page faults are allowed in certain functions
  • e.g., copy_from_user(), copy_to_user().
• An exception table is constructed at compile time
  • contains the range of PC addresses that are allowed to page fault.
• At runtime, the faulting PC value is compared against the exception table
  • Panic only if PC not present in exception table
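The runtime check described above amounts to a lookup in a sorted table. The following is a simplified model, not the kernel's actual code: it assumes each entry pairs one instruction address that may fault with a fixup address, and the addresses, `EXCEPTION_TABLE`, `lookup_fixup`, and `handle_kernel_page_fault` are made up for this sketch.

```python
import bisect
from typing import List, Optional, Tuple

# Simplified model: sorted (instruction address allowed to fault, fixup address).
# The addresses are illustrative only.
EXCEPTION_TABLE: List[Tuple[int, int]] = [
    (0xC0100010, 0xC0100F00),
    (0xC0100024, 0xC0100F10),
]


def lookup_fixup(table: List[Tuple[int, int]], pc: int) -> Optional[int]:
    """Binary-search the table; return the fixup address or None if absent."""
    keys = [insn for insn, _ in table]
    index = bisect.bisect_left(keys, pc)
    if index < len(keys) and keys[index] == pc:
        return table[index][1]
    return None


def handle_kernel_page_fault(faulting_pc: int) -> int:
    """Return where execution should resume, or 'panic' if the PC is unknown."""
    fixup = lookup_fixup(EXCEPTION_TABLE, faulting_pc)
    if fixup is None:
        raise RuntimeError(f"kernel panic: unexpected fault at {faulting_pc:#x}")
    return fixup
```

The lookup key is the faulting PC read from the interrupt frame, which is why the value stored in that frame matters to the kernel in this case.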
![Page 28: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/28.jpg)
Read of an Interrupted PC address
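[Figure: the interrupt frame on the guest stack (CS register, translated PC, Flags, with SP pointing at the frame); a "load addr" instruction reads the translated PC slot.]
Problem: The faulting PC value is now a code-cache address.
Solution: Dispatcher adds potentially faulting code cache addresses to the exception table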
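One way to picture the solution: whenever the translator emits the translation of a native instruction that appears in the exception table, it registers the corresponding code-cache address as well. The sketch below assumes a table of sorted (instruction address, fixup address) pairs. `lookup_fixup` and `register_translation` are hypothetical names, and how the fixup target itself is translated is outside the scope of this sketch.

```python
import bisect
from typing import List, Optional, Tuple

Entry = Tuple[int, int]  # (instruction address allowed to fault, fixup address)


def lookup_fixup(table: List[Entry], pc: int) -> Optional[int]:
    """Return the fixup address registered for pc, or None."""
    keys = [insn for insn, _ in table]
    index = bisect.bisect_left(keys, pc)
    if index < len(keys) and keys[index] == pc:
        return table[index][1]
    return None


def register_translation(table: List[Entry], native_pc: int, cache_pc: int) -> bool:
    """Called when native_pc is translated to cache_pc.

    If native_pc is allowed to fault, add cache_pc with the same fixup so a
    fault at the code-cache address is also found. Returns True if added.
    """
    fixup = lookup_fixup(table, native_pc)
    if fixup is None:
        return False
    bisect.insort(table, (cache_pc, fixup))  # keep the table sorted
    return True
```

After `register_translation(table, 0x1000, 0x9000)` on a table that contains `0x1000`, a later fault reported at `0x9000` resolves to the same fixup, so the unmodified page fault handler does not panic.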
![Page 29: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/29.jpg)
Read of an Interrupted PC address
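[Figure: the interrupt frame on the guest stack (CS register, translated PC, Flags, with SP pointing at the frame); a "load addr" instruction reads the translated PC slot.]
Examples:
1. Exception Tables in Linux
2. MS Windows NT Structured Exception Handling: __try / __except constructs in C/C++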
![Page 30: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/30.jpg)
__try / __except blocks in MS Windows NT
Syntax:
    __try { <potentially faulting code> } __except { <fault handler> }
Example Usage:
    __try { copy_from_user(); } __except { signal_process(); }
Also implemented using exception tables in the Windows kernel
![Page 31: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/31.jpg)
More examples in paper
In our experience, all such cases can be nicely handled!
![Page 32: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/32.jpg)
Correctness Concerns
1. Read / Write of the faulting PC address on stack will return incorrect values.
2. Code-cache addresses will now live in kernel stacks.
  • What if code-cache addresses become invalid?
![Page 33: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/33.jpg)
Code Cache Addresses can now live in Kernel Data Structures
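[Figure: the stacks of Thread 1 and Thread 2, each with its own SP. Thread 1's stack holds an interrupt frame (CS register, translated PC, Flags) whose translated PC points into the code cache, and it stays there across a context switch to Thread 2.]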
![Page 34: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/34.jpg)
Code Cache Addresses can now live in Kernel Data Structures
• Disallow Cache Replacement
  • Code Cache of around 10MB suffices for Linux
• Do not move or modify code cache blocks, once they are created
  • Ensures that a code cache address remains valid for the execution lifetime
• If the code cache gets full, switch the translator off and back on
  • Switchoff is implemented by reverting to the original IDT and other entry points.
  • This effectively flushes the code cache and starts afresh
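The policy above can be sketched as an append-only allocator that never evicts or moves a block and starts over only when it runs out of space. `AppendOnlyCodeCache` and its methods are hypothetical names, and the base address and sizes are illustrative. A real translator reboot would also have to fix up code-cache addresses held in thread stacks before discarding the cache.

```python
from typing import Dict, Optional


class AppendOnlyCodeCache:
    """Toy code cache: blocks are never replaced, moved, or modified."""

    def __init__(self, capacity: int, base: int = 0x9000_0000) -> None:
        self.capacity = capacity
        self.base = base
        self.used = 0
        self.reboots = 0
        self.blocks: Dict[int, int] = {}  # native PC -> code-cache address

    def lookup(self, native_pc: int) -> Optional[int]:
        return self.blocks.get(native_pc)

    def reboot(self) -> None:
        """Model of switchoff followed by switchon: flush and start afresh."""
        self.blocks.clear()
        self.used = 0
        self.reboots += 1

    def insert(self, native_pc: int, size: int) -> int:
        """Append a translated block; its address stays fixed until a reboot."""
        if size > self.capacity:
            raise ValueError("block larger than the whole code cache")
        if self.used + size > self.capacity:
            self.reboot()  # cache full: no replacement, only a full flush
        address = self.base + self.used
        self.used += size
        self.blocks[native_pc] = address
        return address
```

Because a block's address never changes between reboots, any copy of that address sitting in a kernel stack stays valid, which is the property the slide relies on.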
![Page 35: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/35.jpg)
Dynamic Switchon / Switchoff
• Replace all entry points with shadow / original values
  • e.g., for switchoff, replace shadow interrupt descriptor table with original
• Iterate over the kernel’s list of threads
  • Identify PC values in thread stacks and convert them to code cache / native values
• Translator reboot (switchoff followed by switchon) flushes the code cache
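A minimal sketch of the switchoff direction, under strong simplifying assumptions: thread stacks are plain lists of words, and any word equal to a known code-cache address is treated as a saved PC. A real implementation has to identify which stack slots actually hold PC values. `ToyKernel` and `switch_off` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyKernel:
    idt: List[int]                        # currently installed entry points
    original_idt: List[int]               # native handler addresses
    thread_stacks: Dict[str, List[int]]   # per-thread stack contents


def switch_off(kernel: ToyKernel, cache_to_native: Dict[int, int]) -> int:
    """Revert entry points and rewrite saved code-cache PCs to native PCs.

    Returns the number of stack words that were converted.
    """
    # Step 1: replace the shadow entry points with the original ones.
    kernel.idt = list(kernel.original_idt)

    # Step 2: walk every thread's stack and convert saved PCs.
    converted = 0
    for stack in kernel.thread_stacks.values():
        for index, word in enumerate(stack):
            if word in cache_to_native:
                stack[index] = cache_to_native[word]
                converted += 1
    return converted
```

Switchon would run the same walk with the mapping inverted, after which the old code cache contents are no longer referenced and can be discarded.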
![Page 36: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/36.jpg)
Correctness Concerns
1. Read / Write of the faulting PC address on stack will return incorrect values.
2. Code-cache addresses will now live in kernel stacks. What if code-cache addresses become invalid?
3. Imprecise Interrupts and Exceptions.
![Page 37: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/37.jpg)
Imprecise Exceptions and Interrupts
Interestingly, an OS kernel typically never depends on precise exceptions and interrupts.
![Page 38: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/38.jpg)
Reentrancy and Concurrency
Direct entries into the code cache introduce new reentrancy and concurrency issues
Detailed discussion in the paper.
![Page 39: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/39.jpg)
Optimizations that worked
• L1 cache-aware Code Cache Layout
• Function call/return optimization
![Page 40: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/40.jpg)
Code Cache Layout for Direct Branch Chaining
[Figure: the Dispatcher, the Code Cache, and a separate Edge Cache that holds the edge code for branching to the dispatcher.]
Edge code:
• executed only once, on the first execution of the block.
• However, shares the same cache lines as all other code.
Allocate edge code from a separate memory pool for better cache locality.
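The locality argument can be illustrated by counting how many cache lines the frequently executed block bodies occupy under each layout. The sizes here (24-byte block bodies, 40-byte edge stubs, 64-byte cache lines) are invented for the example, and the function names are hypothetical.

```python
from typing import List, Tuple

LINE_SIZE = 64  # assumed cache line size in bytes


def lines_touched(ranges: List[Tuple[int, int]]) -> int:
    """Number of distinct cache lines covered by (start, size) byte ranges."""
    lines = set()
    for start, size in ranges:
        first = start // LINE_SIZE
        last = (start + size - 1) // LINE_SIZE
        lines.update(range(first, last + 1))
    return len(lines)


def body_ranges_interleaved(body_sizes: List[int], stub_size: int) -> List[Tuple[int, int]]:
    """Layout 1: each block body is followed immediately by its edge stub."""
    ranges, address = [], 0
    for size in body_sizes:
        ranges.append((address, size))
        address += size + stub_size  # the stub sits between hot bodies
    return ranges


def body_ranges_separate_pool(body_sizes: List[int]) -> List[Tuple[int, int]]:
    """Layout 2: bodies are packed together; stubs live in another pool."""
    ranges, address = [], 0
    for size in body_sizes:
        ranges.append((address, size))
        address += size
    return ranges
```

With eight 24-byte bodies and 40-byte stubs, the interleaved layout spreads the hot code over eight cache lines, while the separate-pool layout packs the same code into three.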
![Page 41: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/41.jpg)
Function call/return optimization
Use identity translations for ‘call’ and ‘ret’ instructions instead of treating ‘ret’ as another indirect branch.
Involves careful handling of call instructions with indirect targets (discussed in the paper)
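A toy model of the difference, under the assumption that a non-identity translation pushes the native return address and must map it back to a code-cache address on every ‘ret’, whereas an identity translation pushes the code-cache return address so ‘ret’ can jump to it directly. The addresses and the `run_calls` function are hypothetical, and indirect call targets are not modeled.

```python
from typing import Dict, List

NATIVE_RETURN = 0x1005   # native address of the instruction after the call
CACHE_RETURN = 0x9014    # its translation in the code cache


def run_calls(n_calls: int, identity: bool) -> int:
    """Simulate n call/ret pairs; return the number of address lookups needed."""
    jump_table: Dict[int, int] = {NATIVE_RETURN: CACHE_RETURN}
    stack: List[int] = []
    lookups = 0
    for _ in range(n_calls):
        # Translated 'call': push a return address.
        stack.append(CACHE_RETURN if identity else NATIVE_RETURN)
        # Translated 'ret': pop the return address and branch to it.
        target = stack.pop()
        if not identity:
            # 'ret' behaves like any other indirect branch: the native
            # target must be mapped to its code-cache address first.
            lookups += 1
            target = jump_table[target]
        if target != CACHE_RETURN:
            raise AssertionError("return landed outside the translated code")
    return lookups
```

In this model every return costs one lookup without the optimization and none with it. The price is that the stack now holds code-cache return addresses rather than native ones.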
![Page 42: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/42.jpg)
Experiments
• BTKernel Performance vs. Native
• BTKernel Statistics
• Experience with some applications
![Page 43: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/43.jpg)
Apache 1, 2, 4, 8 and 12 processors
[Figure: Apache throughput (MBps, axis from 0 to 14000) against number of processors (1, 2, 4, 8, 12) for Native, BTKernel, and BTKernel-no-callret. Higher is better.]
![Page 44: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/44.jpg)
Fileserver 1, 4, 8, 12 processors
[Figure: Fileserver throughput (ops/s, axis from 0 to 40000) against number of processors (1, 4, 8, 12) for Native, BTKernel, and BTKernel-no-callret. Higher is better.]
![Page 45: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/45.jpg)
lmbench fork operations
[Figure: time in microseconds (axis from 0 to 1400) for the lmbench microbenchmarks execve, exit, and sh, for Native, BTKernel, and BTKernel-no-callret. Lower is better.]
![Page 46: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/46.jpg)
Number of Dispatcher Exits
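| Workload | Instructions (without call/ret optimization) | Dispatcher Exits (without call/ret optimization) | Instructions (with call/ret optimization) | Dispatcher Exits (with call/ret optimization) |
|---|---|---|---|---|
| Apache | 56 b | 7 m | 59 b | 125 |
| Linux Build | 570 b | 72 m | 590 b | 33059 |

m = million, b = billion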
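To put these counts in perspective, the snippet below computes how many instructions execute per dispatcher exit in each configuration. It uses only the figures shown on the slide, which are rounded, so the results are approximate. `COUNTS` and `instructions_per_exit` are names introduced for this sketch.

```python
from typing import Dict, Tuple

# (instructions, dispatcher exits), taken from the slide's rounded values.
COUNTS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "Apache": {
        "without": (56_000_000_000, 7_000_000),
        "with": (59_000_000_000, 125),
    },
    "Linux Build": {
        "without": (570_000_000_000, 72_000_000),
        "with": (590_000_000_000, 33_059),
    },
}


def instructions_per_exit(instructions: int, exits: int) -> float:
    """Average number of instructions executed between dispatcher exits."""
    return instructions / exits


if __name__ == "__main__":
    for workload, configs in COUNTS.items():
        for config, (instructions, exits) in configs.items():
            ratio = instructions_per_exit(instructions, exits)
            print(f"{workload:12s} {config:8s} call/ret opt: "
                  f"one exit per {ratio:,.0f} instructions")
```

For Apache this works out to one dispatcher exit per 8,000 instructions without the optimization and one per 472 million with it. For the Linux build, it is roughly one per 7,900 instructions without and one per 17.8 million with.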
![Page 47: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/47.jpg)
Applications
• We implemented Shadow Memory for a Linux guest
  • Identifies the CPU-private (read/write) and CPU-shared (read/write) bytes in kernel address space
• Overheads range from 20% to 300%
• Significant improvement over the 10x overheads reported in DRK
![Page 48: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/48.jpg)
Summary and Conclusion
• Avoid back-and-forth translation between native and translated values of interrupted PC
• Relax precision requirements on exceptions and interrupts
• Use cache-aware layout for the code cache
• Use identity translations for the function call/ret instructions
Near-native performance DBT implementation for unmodified Linux
Availability: https://github.com/piyus/btkernel
![Page 49: Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649dbc5503460f94aada57/html5/thumbnails/49.jpg)
Thank You.