A Dynamic Binary-Rewriting Approach to Software Transactional Memory
Marek Olszewski, Jeremy Cutler, Greg Steffan
University of Toronto
Appeared in PACT 2007, Brasov, Romania
2
The Parallel Programming Challenge
Coarse-grained locking
  Easy to program
  Scales poorly
Fine-grained locking
  Scales well
  Hard to get right (e.g., deadlock, priority inversion)
The promise of Transactional Memory
  As easy to program as coarse-grained locking
  Performance/scalability of fine-grained locking
3
Transactional Memory (TM)
Source code:
  ...atomic { ... access_shared_data(); ...}...
Programmer: specifies threads/transactions in source code
TM system: executes transactions optimistically in parallel
  1) Checkpoints execution
  2) Detects conflicts
  3) Commits or aborts and re-executes
4
TM Implementations
Flavors of TM: Hardware (HTM), Software (STM), Hybrid (HyTM)
STM is especially compelling
  Exploit current commodity hardware (multicores)
  Learn about real TM systems and apps
Current STM systems:
  Java: DSTM, ASTM
  C or C++: McRT icc, TL2, RSTM, OSTM
  object-based or programmer intensive (or both)
Our focus: arbitrary C/C++, realistic environment
5
Programming with STM
Source code (my_app):
  #include <glib.h>
  GTree *tree;
  ...atomic { g_tree_insert(tree, &key, &val); }...
[Diagram: the STM compiler turns the source code into the my_app executable; the running application comprises my_app, the glib shared library, the loader, and the kernel.]
Not handled by current compiler/library-based STMs: “legacy locks”, pre-compiled binaries, system calls
6
JudoSTM: An Overview
Key design choices:
  1) Dynamic Binary Rewriting (DBR): insert instrumentation to implement STM
  2) Value-based conflict detection
Resulting key features:
  1) Privileged transactions (support system calls)
  2) Legacy lock elision
  3) Efficient invisible readers
7
JudoSTM Design Choice 1
Dynamic Binary Rewriting (DBR)
Judo DBR Framework (user-space version of JIFL†)
† JIT Instrumentation - A Novel Approach To Dynamically Instrument Operating Systems, SIGOPS EuroSys 2007
8
Dynamic Binary Rewriting
[Animation, slides 8-10: the original code consists of basic blocks bb1, bb2, bb3, and bb4. Judo copies blocks into the code cache as execution reaches them: first bb1, then bb2, then bb4.]
11
Judo - Performance
[Bar chart: normalized runtime overhead (y-axis 0.80 to 2.40) of Judo and DynamoRIO on mcf, gcc, vpr, gzip, bzip2, vortex, twolf, crafty, eon, and their geometric mean. Bar labels in extracted order, mapping to benchmarks not shown: 0.93, 1.26, 1.06, 1.04, 1.03, 1.41, 0.95, 1.25, 1.53, 1.15, 1.03, 2.20, 1.11, 1.05, 1.07, 1.50, 1.05, 1.41, 1.31, 1.26.]
Overhead low enough to implement STM?
12
DBR-Based STM
Goal: perform these efficiently
For all non-stack write instructions:
  Track write addresses and values (write-set)
  Write-buffer the values from regular memory
For all non-stack read instructions:
  Redirect to the write-buffer
  If miss: track read addresses and values (read-set)
When a transaction completes:
  1) Acquire commit lock(s)
  2) Validate read-set (value-based conflict detection)
  3) Commit write-set to memory
  4) Release commit lock(s)
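
The read, write, and commit steps listed above can be sketched in plain C. This is a minimal single-threaded illustration under stated assumptions, not JudoSTM's implementation: the names `txn_t`, `txn_read`, `txn_write`, and `txn_commit` are hypothetical, the sets are small arrays searched linearly, and the commit lock is only marked by comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; this is not JudoSTM's implementation. */
#define MAX_ENTRIES 64

typedef struct { uint32_t *addr; uint32_t value; } entry_t;

typedef struct {
    entry_t reads[MAX_ENTRIES];   /* read-set: address and value first seen */
    entry_t writes[MAX_ENTRIES];  /* write-set: address and buffered value  */
    size_t n_reads, n_writes;
} txn_t;

static void txn_begin(txn_t *t) { t->n_reads = 0; t->n_writes = 0; }

/* Write: record address and value in the write buffer; memory is untouched. */
static void txn_write(txn_t *t, uint32_t *addr, uint32_t value) {
    for (size_t i = 0; i < t->n_writes; i++) {
        if (t->writes[i].addr == addr) { t->writes[i].value = value; return; }
    }
    assert(t->n_writes < MAX_ENTRIES);
    t->writes[t->n_writes].addr = addr;
    t->writes[t->n_writes].value = value;
    t->n_writes++;
}

/* Read: redirect to the write buffer; on a miss, read memory and log it. */
static uint32_t txn_read(txn_t *t, uint32_t *addr) {
    for (size_t i = 0; i < t->n_writes; i++) {
        if (t->writes[i].addr == addr) return t->writes[i].value;
    }
    assert(t->n_reads < MAX_ENTRIES);
    uint32_t v = *addr;
    t->reads[t->n_reads].addr = addr;
    t->reads[t->n_reads].value = v;
    t->n_reads++;
    return v;
}

/* Commit: returns 1 on success, 0 if the caller must abort and re-execute. */
static int txn_commit(txn_t *t) {
    /* 1) A real system would acquire the commit lock(s) here. */
    for (size_t i = 0; i < t->n_reads; i++) {       /* 2) validate by value */
        if (*t->reads[i].addr != t->reads[i].value) return 0;
    }
    for (size_t i = 0; i < t->n_writes; i++) {      /* 3) publish writes */
        *t->writes[i].addr = t->writes[i].value;
    }
    /* 4) ...and release the commit lock(s) here. */
    return 1;
}
```

A caller would wrap each transaction in `txn_begin` and `txn_commit`, retrying from `txn_begin` whenever `txn_commit` returns 0.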
13
DBR: Attractive Properties for STM
Performance: overheads are amortized
  code cache
Can handle arbitrary code and shared libraries
  any/all code is transactionalized as it executes
Sandboxed transactions
  Typical STM: inconsistent values could cause execution to stray
    i.e., stray to non-transactionalized code (very bad!)
    solution: frequent & costly read-set validation
  DBR-based STM: any/all code is transactionalized as it executes
Tough problems for conventional STMs addressed by DBR
14
JudoSTM Design Choice 2
Value-Based Conflict Detection (as opposed to location-based)
15
Location-Based Conflict Detection
[Animation, slides 15-19: main memory is divided into strips, each with a version number that starts at 0. Transaction 1 reads values from memory. Transaction 2 reads, writes the value 9, and commits: step 1) validate read set, step 2) publish writes (and increment version numbers), so one strip version goes from 0 to 1. Transaction 1 then reaches commit step 1) validate read set, sees the changed strip version, and aborts.]
Note: all transactions must maintain strip version #s
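
As a rough illustration of strip versioning, here is a minimal single-threaded sketch in C. The layout (four words, two words per strip) and every name are assumptions made for the example; real location-based STMs differ in granularity and locking.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: 4 words of shared memory, 2 words per strip. */
#define N_WORDS 4
#define WORDS_PER_STRIP 2
#define N_STRIPS (N_WORDS / WORDS_PER_STRIP)

static uint32_t memory[N_WORDS] = { 2, 3, 5, 6 };
static uint32_t strip_version[N_STRIPS];   /* all versions start at 0 */

typedef struct {
    int      strip_read[N_STRIPS];     /* did we read from this strip?   */
    uint32_t version_seen[N_STRIPS];   /* version at the first such read */
    int      word_written[N_WORDS];
    uint32_t buffered[N_WORDS];        /* write buffer, one slot per word */
} loc_txn_t;

static void loc_begin(loc_txn_t *t) {
    for (int s = 0; s < N_STRIPS; s++) t->strip_read[s] = 0;
    for (int w = 0; w < N_WORDS; w++) t->word_written[w] = 0;
}

static uint32_t loc_read(loc_txn_t *t, int w) {
    if (t->word_written[w]) return t->buffered[w];
    int s = w / WORDS_PER_STRIP;
    if (!t->strip_read[s]) {           /* remember the version, not the value */
        t->strip_read[s] = 1;
        t->version_seen[s] = strip_version[s];
    }
    return memory[w];
}

static void loc_write(loc_txn_t *t, int w, uint32_t value) {
    t->word_written[w] = 1;
    t->buffered[w] = value;
}

/* Returns 1 on commit, 0 on abort. Locking is omitted in this sketch. */
static int loc_commit(loc_txn_t *t) {
    for (int s = 0; s < N_STRIPS; s++) {
        if (t->strip_read[s] && t->version_seen[s] != strip_version[s]) return 0;
    }
    for (int w = 0; w < N_WORDS; w++) {
        if (t->word_written[w]) {
            memory[w] = t->buffered[w];
            strip_version[w / WORDS_PER_STRIP]++;   /* publish and bump version */
        }
    }
    return 1;
}
```

In this sketch a reader aborts whenever any word in a strip it read has been republished, even if the word it actually read is unchanged, which is the bookkeeping cost the note above refers to.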
20
Value-Based Conflict Detection
[Animation, slides 20-23: the same scenario without strip versions. Transaction 1 reads values from memory and records them. Transaction 2 reads, writes the value 9, and commits: step 1) validate read set, step 2) publish writes. Transaction 1 then reaches commit step 1) validate read set, finds that a value it read has changed, and aborts.]
Note: no version information to maintain
24
JudoSTM Feature 1: Privileged Transactions
Can execute (but not roll back) system calls
Grab commit lock(s) when about to make a syscall
  Release when the transaction completes
Only one privileged transaction exists at a time
25
Privileged Transactions
[Animation, slides 25-27: Transaction 1 reads values from memory. Transaction 2 is privileged and writes the value 9 directly to memory. Transaction 1 then reaches commit step 1) validate read set, finds the changed value, and aborts.]
Privileged: can write directly to memory (privileged, syscalls); may be uninstrumented
Value-based conflict detection facilitates system calls within transactions!
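
One way to picture the switch to privileged mode is the following single-threaded C sketch. It rests on assumptions the slides do not spell out: that a transaction validates and publishes its buffered state at the moment it takes the commit lock, and that a plain flag (`commit_lock_held`) can stand in for the lock. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP 16
typedef struct { uint32_t *addr; uint32_t value; } slot_t;
typedef struct {
    slot_t reads[CAP], writes[CAP];
    size_t nr, nw;
    int privileged;
} ptx_t;

static int commit_lock_held;   /* stand-in for a real mutex (assumption) */

static void ptx_begin(ptx_t *t) { t->nr = 0; t->nw = 0; t->privileged = 0; }

static uint32_t ptx_read(ptx_t *t, uint32_t *a) {
    if (t->privileged) return *a;              /* direct access */
    for (size_t i = 0; i < t->nw; i++)
        if (t->writes[i].addr == a) return t->writes[i].value;
    assert(t->nr < CAP);
    t->reads[t->nr].addr = a;
    t->reads[t->nr].value = *a;
    return t->reads[t->nr++].value;
}

static void ptx_write(ptx_t *t, uint32_t *a, uint32_t v) {
    if (t->privileged) { *a = v; return; }     /* direct access */
    for (size_t i = 0; i < t->nw; i++)
        if (t->writes[i].addr == a) { t->writes[i].value = v; return; }
    assert(t->nw < CAP);
    t->writes[t->nw].addr = a;
    t->writes[t->nw].value = v;
    t->nw++;
}

static int ptx_validate(const ptx_t *t) {
    for (size_t i = 0; i < t->nr; i++)
        if (*t->reads[i].addr != t->reads[i].value) return 0;
    return 1;
}

static void ptx_publish(const ptx_t *t) {
    for (size_t i = 0; i < t->nw; i++) *t->writes[i].addr = t->writes[i].value;
}

/* Call just before a system call. Returns 0 if the transaction cannot
   become privileged now (a real system would block or abort instead). */
static int ptx_before_syscall(ptx_t *t) {
    if (t->privileged) return 1;
    if (commit_lock_held) return 0;            /* only one privileged txn */
    commit_lock_held = 1;
    if (!ptx_validate(t)) { commit_lock_held = 0; return 0; }
    ptx_publish(t);                            /* assumption: flush buffer */
    t->nr = 0; t->nw = 0;
    t->privileged = 1;
    return 1;
}

/* Returns 1 on commit, 0 on abort; the lock is released on completion. */
static int ptx_commit(ptx_t *t) {
    if (t->privileged) { t->privileged = 0; commit_lock_held = 0; return 1; }
    if (!ptx_validate(t)) return 0;
    ptx_publish(t);
    return 1;
}
```

Because other transactions validate by value, they need no notification of the privileged transaction's direct writes; they simply fail validation if a value they read has changed.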
28
JudoSTM Feature 2: Legacy Lock Elision
Safely ignore locks within legacy code
29
Legacy Lock Elision
[Animation, slides 29-35: memory holds a lock word (initially 0) and data. Each transaction's lock acquire reads the lock as 0 and buffers a write of 1; its lock release buffers a write of 0. The first transaction to commit validates its read set and publishes its writes (data value 9 and lock value 0); writing 0 over 0 is a silent store. The other transaction then releases the lock in its buffer, passes commit step 1) validate read set, and publishes its writes (data value 7 and lock value 0) in commit step 2).]
Value-based conflict detection facilitates the elision of legacy locks!
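
The effect can be reproduced with a small single-threaded C sketch. It is illustrative only: `tx_t` and its helpers are hypothetical, and `legacy_locked_add` stands in for pre-compiled code that takes a lock, updates data, and releases the lock, with every access redirected through the transaction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP 16
typedef struct { uint32_t *addr; uint32_t value; } pair_t;
typedef struct { pair_t reads[CAP], writes[CAP]; size_t nr, nw; } tx_t;

static void tx_begin(tx_t *t) { t->nr = 0; t->nw = 0; }

static uint32_t tx_read(tx_t *t, uint32_t *a) {
    for (size_t i = 0; i < t->nw; i++)
        if (t->writes[i].addr == a) return t->writes[i].value;
    assert(t->nr < CAP);
    t->reads[t->nr].addr = a;
    t->reads[t->nr].value = *a;
    return t->reads[t->nr++].value;
}

static void tx_write(tx_t *t, uint32_t *a, uint32_t v) {
    for (size_t i = 0; i < t->nw; i++)
        if (t->writes[i].addr == a) { t->writes[i].value = v; return; }
    assert(t->nw < CAP);
    t->writes[t->nw].addr = a;
    t->writes[t->nw].value = v;
    t->nw++;
}

/* Value-based commit: 1 on success, 0 on abort (locking omitted). */
static int tx_commit(tx_t *t) {
    for (size_t i = 0; i < t->nr; i++)
        if (*t->reads[i].addr != t->reads[i].value) return 0;
    for (size_t i = 0; i < t->nw; i++) *t->writes[i].addr = t->writes[i].value;
    return 1;
}

/* Stand-in for legacy code: take a lock, update data, release the lock. */
static int legacy_locked_add(tx_t *t, uint32_t *lock, uint32_t *data, uint32_t n) {
    if (tx_read(t, lock) != 0) return 0;       /* lock looks taken: give up */
    tx_write(t, lock, 1);                      /* acquire: buffered only */
    tx_write(t, data, tx_read(t, data) + n);
    tx_write(t, lock, 0);                      /* release: buffer holds 0 again */
    return 1;
}
```

Two transactions that use the same lock but touch different data can both commit here, because the lock word in memory is 0 before and after each commit. Under a version-based scheme the published lock write would typically bump a version and force the other transaction to abort.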
36
JudoSTM Feature 3:
Efficient Invisible Readers
37
Supporting Invisible Readers
Invisible readers: don’t report reads to others
  good performance, but can lead to inconsistent read data: errors!
Data errors: segfault, divide by zero
  Cheap solution: catch with trap/signal handlers
Control errors: jump to non-instrumented code
  Typical solution: verify read-set after every load
    Expensive! O(N^2)
  DBR solution: prevented by sandboxing
    DBR instruments all code as it executes
38
JudoSTM Details
Implementation
39
(reminder) Goal: perform these efficiently
For all non-stack write instructions:
  Track write addresses and values (write-set)
  Buffer the values from regular memory
For all non-stack read instructions:
  Redirect to the write-buffer
  If miss: track read addresses and values (read-set)
When a transaction completes:
  1) Acquire commit lock(s)
  2) Validate read-set (value-based conflict detection)
  3) Commit write-set to memory
  4) Release commit lock(s)
40
Read/Write Buffer Implementation
[Diagram: a read hashtable alongside the read buffer and a write hashtable alongside the write buffer; both hashtables are keyed by address.]
Linear-probed, open-addressed hashtables
Efficient lookup: 5 insts for a hit (+ state-saving?)
Efficient validate and commit?
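
For readers unfamiliar with the data structure, a linear-probed, open-addressed table keyed by address can be sketched as follows. The table size, the hash function, and the use of address 0 as the empty marker are assumptions chosen for the example, not details from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256                 /* power of two (hypothetical size) */

typedef struct {
    uintptr_t addr;                    /* key; 0 marks an empty bucket */
    uint32_t  value;                   /* buffered value for that address */
} bucket_t;

typedef struct {
    bucket_t b[TABLE_SIZE];
    size_t used;
} addr_table_t;

/* Drop the low two bits (word alignment) and mask into the table. */
static size_t hash_addr(uintptr_t a) {
    return (size_t)(a >> 2) & (TABLE_SIZE - 1);
}

/* Find the bucket for addr. With insert != 0, claim the first empty bucket
   on a miss. Returns NULL on a miss without insert, or when the table is full. */
static bucket_t *table_find(addr_table_t *t, uintptr_t addr, int insert) {
    size_t i = hash_addr(addr);
    for (size_t probes = 0; probes < TABLE_SIZE; probes++) {
        bucket_t *b = &t->b[i];
        if (b->addr == addr) return b;             /* hit */
        if (b->addr == 0) {                        /* end of probe chain */
            if (!insert) return NULL;
            b->addr = addr;
            t->used++;
            return b;
        }
        i = (i + 1) & (TABLE_SIZE - 1);            /* linear probe */
    }
    return NULL;
}
```

A hit on the first probe costs one hash, one load, and one compare, which is the kind of short path the instruction count above is aiming for; colliding addresses simply land in the next free bucket.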
41
Efficient Commit: Executable Write-Buffer
Pre-allocated buffer of move instructions
Emit value-address pairs as the transaction executes
[Animation, slides 41-45: the write hashtable is shown beside the write buffer; each new write fills the next slot above the previous one, and the top pointer moves up with it. Buffer contents after three writes:]
  movl $0x00000000,0x00000000   (five unused slots)
  movl $0x80B10CFC,0x80B10CA4   (third write)
  movl $0x0000ab42,0x80B10BCC   (second write)
  movl $0x00000025,0x80B10BB8   (first write)
  ret
Execute the write-buffer to commit!
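
To make the idea concrete, the sketch below fills such a buffer with machine code but never executes it. It assumes 32-bit x86, where `movl $imm32, addr32` encodes as `C7 05` followed by the little-endian address and then the little-endian immediate, and `ret` is `C3`; in 64-bit mode the same ModRM byte means RIP-relative addressing, so the bytes would differ. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_BYTES 10      /* C7 05 <addr32> <imm32> */
#define MAX_WRITES 8       /* hypothetical capacity */

typedef struct {
    uint8_t code[MAX_WRITES * SLOT_BYTES + 1];   /* slots, then one ret */
    size_t  top;                                 /* offset of newest slot */
} exec_wbuf_t;

static void put_le32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Empty buffer: top points at the trailing ret, so "executing" it is a no-op. */
static void wbuf_reset(exec_wbuf_t *w) {
    memset(w->code, 0, sizeof w->code);
    w->top = MAX_WRITES * SLOT_BYTES;
    w->code[w->top] = 0xC3;                      /* ret */
}

/* Emit "movl $value, addr" just above the previous entry. Returns 0 if full. */
static int wbuf_emit(exec_wbuf_t *w, uint32_t addr, uint32_t value) {
    if (w->top < SLOT_BYTES) return 0;
    w->top -= SLOT_BYTES;
    uint8_t *p = &w->code[w->top];
    p[0] = 0xC7;                                 /* MOV r/m32, imm32 */
    p[1] = 0x05;                                 /* ModRM: absolute disp32 */
    put_le32(p + 2, addr);
    put_le32(p + 6, value);
    return 1;
}
```

On a real 32-bit system the buffer would live in executable memory, and committing would amount to calling the address `code + top`; the stores then run straight through to the final `ret` with no loop or lookup.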
46
Efficient Validation: Executable Read-Buffer
Pre-allocated buffer of compare & jump instructions
Emit value-address pairs as the transaction executes
[Animation, slides 46-50: the read hashtable is shown beside the read buffer; each new read fills the next slot above the previous one, and the top pointer moves up with it. Buffer contents after three reads:]
  cmp $0x00000000, 0x00000000   (unused slot)
  jne,pn judostm_trans_abort
  cmp $0x00000100, 0x80B10BCC   (third read)
  jne,pn judostm_trans_abort
  cmp $0x00000005, 0x80B10BB8   (second read)
  jne,pn judostm_trans_abort
  cmp $0x00000a34, 0x80B10CA4   (first read)
  jne,pn judostm_trans_abort
  ret
Execute the read-buffer to validate the read-set!
51
Evaluation
JudoSTM performance
Comparison with Rochester’s RSTM†
† http://www.cs.rochester.edu/research/synchronization/rstm
52
RSTM vs JudoSTM: Design
                     RSTM                                 JudoSTM
Language             C++                                  C/C++
Programming model    Library API, rewrite code            atomic{…}
Conflict detection   Object-level, location-based         Value-based
Memory allocation    Custom                               “Hoard” scalable parallel allocator
Fast commit          Object-cloning & pointer-switching   Executable write-buffer
JudoSTM more flexible, less intrusive; but performance?
53
Experimental Framework
RSTM micro-benchmarks
  Linked List, Hash Table, RBTree
  Equal mix of insert, remove, and lookup
  Measure throughput (transactions/sec)
Test platform
  4-way SMP Intel Pentium 4 Xeon, 2.8GHz
  L1d/L2/L3 cache sizes: 8KB/512KB/2MB
  Linux 2.6.17.13 with per-thread signal handler support
54
Linked List
[Line chart: throughput in transactions/second (millions), y-axis 0.0 to 4.0, against 1 to 4 processors. Series: Coarse-Grained Locking, Fine-Grained Locking, RSTM, Judo (Single Lock), Judo (Distributed Lock).]
Coarse-grained locking best, but not scaling
55
Linked List – Zoomed in
[Line chart: throughput in transactions/second (millions), y-axis 0.0 to 1.0, against 1 to 4 processors. Series: Coarse-Grained Locking, Fine-Grained Locking, RSTM, Judo (Single Lock), Judo (Distributed Lock).]
Single-lock JudoSTM scaling nicely; RSTM flatlined
56
Hash Table
[Line chart: throughput in transactions/second (millions), y-axis 0.0 to 7.0, against 1 to 4 processors. Series: Coarse-Grained Locking, Fine-Grained Locking, RSTM, Judo (Single Lock), Judo (Distributed Lock).]
Distributed-lock JudoSTM beats CG-locking, tracks RSTM
57
RBTree
[Line chart: throughput in transactions/second (millions), y-axis 0.0 to 6.0, against 1 to 4 processors. Series: Coarse-Grained Locking, RSTM, Judo (Single Lock), Judo (Distributed Lock).]
JudoSTM on track to scale past CG-locking; RSTM flatlined
58
Conclusions
Judo: highly efficient DBR framework
  Beats DynamoRIO on SPEC benchmarks
JudoSTM: first STM based on DBR
  Value-based conflict detection
  Executable read/write buffers
Desirable features:
  Efficient invisible readers (sandboxing)
  Legacy lock elision
  Privileged transactions (system call support)
  Performance comparable to RSTM
Facilitates STM for real programs & environments!
59
Backups
60
JudoSTM Details
Programming with JudoSTM
61
Programming with JudoSTM
Source code (my_app):
  #include <glib.h>
  #include <judostm.h>
  GTree *tree;
  ...
  atomic { g_tree_insert(tree, &key, &val); }
  ...
After the atomic macro is expanded (compiled with plain gcc):
  ...
  judostm_start(); g_tree_insert(tree, &key, &val); judostm_stop();
  ...
[Diagram: gcc builds the my_app executable, which links against the judoSTM library and the glib shared library. In the running application, the instrumented my_app + glib code lives in the code cache, alongside the loader and the kernel.]
Easy to use, with no compiler support!
#ifndef JUDOSTM_H
#define JUDOSTM_H

extern void judostm_start(void);
extern void judostm_stop(void);

#define atomic \
    asm __volatile__ ("":::"eax", "ecx", "edx", "ebx", "edi", \
                      "esi", "flags", "memory"); \
    int __count = 0; \
    judostm_start(); \
    for (; __count < 1; judostm_stop(), __count++)

#endif
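
The for-loop trick in the macro can be tried in isolation. The sketch below keeps the same shape but swaps in hypothetical stubs (`demo_start`, `demo_stop`) that only count calls, and it leaves out the register-clobbering asm statement so that it compiles on any architecture.

```c
#include <assert.h>

/* Counters so the effect of the macro can be observed. */
static int starts, stops, body_runs;

static void demo_start(void) { starts++; }   /* stand-in for judostm_start() */
static void demo_stop(void)  { stops++; }    /* stand-in for judostm_stop()  */

/* Same shape as the atomic macro: the loop body runs exactly once, and the
   stop call sits in the loop's increment expression, so it runs right after
   the body. The macro declares a variable, so use it once per scope. */
#define demo_atomic \
    int demo_count = 0; \
    demo_start(); \
    for (; demo_count < 1; demo_stop(), demo_count++)

static int run_once(void) {
    int shared = 0;
    demo_atomic {
        shared += 42;
        body_runs++;
    }
    return shared;
}
```

One consequence of this design is that leaving the block early with `break` or `return` skips the loop's increment expression and therefore the stop call, so the block should be allowed to run to its closing brace.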