TRANSCRIPT
Slide 1: Optimizing memory transactions
Tim Harris, Maurice Herlihy, Jim Larus, Virendra Marathe, Simon Marlow, Simon Peyton Jones, Mark Plesko, Avi Shinnar, Satnam Singh, David Tarditi
Slide 2: Moore's law
[Chart: transistor counts per processor, from roughly 1,000 in 1970 to 1,000,000,000 by 2005, for the Intel 4004, 8086, 286, 386, 486, Pentium, Pentium II, Pentium 4, and Itanium 2. Source: table from http://www.intel.com/technology/mooreslaw/index.htm]
Annotations: pipelined execution; superscalar processors; data-parallel extensions; increasing ILP through speculation; increases in clock frequency; increases in ILP; parallelism.
Slide 3: [Image slide. Source: viewed 15 Nov 2006.]
Slide 4: Making use of parallel hardware is hard
• How do we share data between concurrent threads?
• How do we find things to run in parallel?
Both of these are difficult problems; in this talk I'll focus on the first.
Server workloads often provide a natural source of parallelism: deal with different clients concurrently.
Traditional HPC is reasonably well understood too: problems can be partitioned, or parallel building blocks composed.
Programmers aren't good at thinking about thread interleavings, and locks provide a complicated, non-composable solution.
Slide 5: Overview
• Introduction
• Problems with locks
• Controlling concurrent access
• Research challenges
• Conclusions
Slide 6: A deceptively simple problem
[Diagram: a double-ended queue holding values such as 11, 14, 3, 5, 6, 7, 12, 30.]
PushRight / PopRight work on the right-hand end; PushLeft / PopLeft work on the left-hand end.
• Make it thread-safe by adding a lock: needlessly serializes operations when the queue is long
• One lock for each end? Much more complex when the queue is short (0, 1, or 2 items)
Slide 7: Locks are broken
• Races: due to forgotten locks
• Deadlock: locks acquired in the "wrong" order
• Lost wakeups: a forgotten notify to a condition variable
• Error recovery is tricky: need to restore invariants and release locks in exception handlers
• Tension between simplicity and scalability
• Lack of progress guarantees
• ...but worst of all, locks do not compose
Slide 8: Atomic memory transactions
• To a first approximation, just write the sequential code and wrap atomic around it
• All-or-nothing semantics: atomic commit
• An atomic block executes in isolation
• Cannot deadlock (there are no locks!)
• Atomicity makes error recovery easy (e.g. an exception thrown inside the Get code)

Item Get() {
  atomic { ... sequential get code ... }
}
Slide 9: Atomic blocks compose (locks do not)
• Guarantees to get two consecutive items
• Get() is unchanged (the nested atomic is a no-op)
• Cannot be achieved with locks (except by breaking the Get() abstraction)

void GetTwo() {
  atomic {
    i1 = Get();
    i2 = Get();
  }
  DoSomething(i1, i2);
}
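The composition guarantee can be modeled executably. The sketch below is not an STM: it simulates atomic with a single process-wide re-entrant lock (the "single-lock semantics" discussed later in the talk), which is enough to show that Get() drops into GetTwo() unchanged. The Python names are illustrative assumptions.

```python
import threading
from collections import deque

# One process-wide re-entrant lock models 'atomic': re-entrancy makes a
# nested atomic block a no-op, which is what lets Get() compose unchanged.
_tx_lock = threading.RLock()

class atomic:
    def __enter__(self):
        _tx_lock.acquire()
    def __exit__(self, *exc_info):
        _tx_lock.release()
        return False

queue = deque([11, 14, 3])

def get():
    with atomic():                 # just the sequential code, wrapped
        return queue.popleft()

def get_two():
    with atomic():                 # guarantees two *consecutive* items:
        return get(), get()        # no thread runs between the two gets

print(get_two())  # (11, 14)
```

With per-end locks, GetTwo could only make the two Gets consecutive by reaching inside Get's locking discipline; here the nested acquire simply succeeds.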
Slide 10: Blocking: how does Get() wait for data?
• retry means "abandon execution of the atomic block and re-run it (when there is a chance it'll complete)"
• No lost wake-ups
• No consequential change to GetTwo(), even though GetTwo must wait for there to be two items in the queue

Item Get() {
  atomic {
    if (size == 0) { retry; }
    else { ...remove item from queue... }
  }
}
Slide 11: Choice: waiting for either of two queues
• do {...this...} orelse {...that...} tries to run "this"
• If "this" retries, it runs "that" instead
• If both retry, the do-block retries. GetEither() will thereby wait for there to be an item in either queue

[Diagram: queues Q1 and Q2 feeding a result queue R.]

void GetEither() {
  atomic {
    do { i = Q1.Get(); }
    orelse { i = Q2.Get(); }
    R.Put(i);
  }
}
Slide 12: Why is 'orelse' left-biased? Blocking v polling
• No need to write separate blocking / polling methods
• Can extend to deal with timeouts by using a 'wakeup' message on a shared memory buffer

Item TryGet() {
  atomic {
    do { return Get(); }
    orelse { return null; }
  }
}
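retry and orelse can be given a small executable model too. The sketch below is a deliberate simplification, not the paper's implementation: a retried block sleeps until any transaction commits and then re-runs, using one global condition variable, whereas a real STM would wake only on changes to the locations the block actually read. All names are invented for illustration.

```python
import threading

class Retry(Exception):
    """Abandon execution of the atomic block; re-run when something changes."""

_cond = threading.Condition()

def atomic(block):
    # Single-lock model: run the block; on Retry, sleep until some
    # transaction commits, then re-run the whole block from scratch.
    with _cond:
        while True:
            try:
                result = block()
                _cond.notify_all()     # commit wakes all waiters: no lost wake-ups
                return result
            except Retry:
                _cond.wait()

def orelse(this, that):
    # Left-biased choice: try 'this'; if it retries, run 'that' instead.
    # If 'that' also retries, the Retry propagates: the whole do-block retries.
    def block():
        try:
            return this()
        except Retry:
            return that()
    return block

queue = []

def get():                    # blocking: retries while the queue is empty
    if not queue:
        raise Retry()
    return queue.pop(0)

def try_get():                # polling, derived from the blocking method
    return atomic(orelse(get, lambda: None))

def put(item):
    atomic(lambda: queue.append(item))

put(42)
print(try_get())   # 42
print(try_get())   # None: get() retries on the empty queue, orelse falls through
```

The left bias is what makes try_get work: the blocking get() is attempted first, and only if it would block does control fall through to the polling default.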
Slide 13: Choice: a server waiting on two clients

atomic {
  Item i;
  do {
    i = q1.PopLeft(3);
  } orelse {
    i = q2.PopLeft(3);
  }
  // Process item
}

atomic {
  do {
    WaitForShutdownRequest();
    // Process shutdown request
  } orelse {
  }
}

Annotations:
• Try this first: if it gets an item without hitting retry then we're done
• If q1.PopLeft can't finish, then undo its side effects and try q2.PopLeft instead
• Unlike WaitAny or Select, we can compose code using retry, e.g. wait for a shutdown request, or for client requests
Slide 14: Invariants
• Suppose a queue should always have at least one element in it; how can we check this?
  – Find all the places where queues are modified and put in a check?
  – Define 'NonEmptyQueue', which wraps the ordinary queue operations?
• The check is duplicated (and possibly forgotten)
• The invariant may be associated with multiple data structures ("queue X contains fewer elements than queue Y")
• What if different queues have different constraints?
Slide 15: Invariants
• Define invariants that must be preserved by successful atomic blocks

Queue NewNeverEmptyQueue(Item i) {
  atomic {
    Queue q = new Queue();
    q.Put(i);
    always { !q.IsEmpty(); }
    return q;
  }
}

The always block is a computation to check the invariant: it must be true when proposed, and preserved by top-level atomic blocks.

Queue q = NewNeverEmptyQueue(i);
atomic { q.Get(); }    // Invariant failure: would leave q empty
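A hedged sketch of always: the model below registers invariant checks during a transaction, verifies them when proposed and again at the end of every top-level atomic block, and rolls the queues back if a check fails. It is a single-lock simplification with invented names, not the implementation behind the slide.

```python
import threading

_tx_lock = threading.RLock()
_invariants = []   # predicates re-checked at the end of each top-level block
_queues = []       # every queue, so an aborted transaction can be rolled back
_depth = 0

class InvariantFailure(Exception):
    pass

class Queue:
    def __init__(self):
        self.items = []
        _queues.append(self)
    def put(self, x): self.items.append(x)
    def get(self): return self.items.pop(0)
    def is_empty(self): return not self.items

def always(check):
    if not check():
        raise InvariantFailure("invariant false when proposed")
    _invariants.append(check)

def atomic(block):
    global _depth
    with _tx_lock:
        if _depth == 0:                      # top level: snapshot for rollback
            snapshot = [list(q.items) for q in _queues]
        _depth += 1
        try:
            result = block()
            if _depth == 1:                  # committing a top-level block:
                for check in _invariants:    # every invariant must still hold
                    if not check():
                        raise InvariantFailure("block would break an invariant")
            return result
        except BaseException:
            if _depth == 1:                  # abort: undo the block's updates
                for q, saved in zip(_queues, snapshot):
                    q.items = saved
            raise
        finally:
            _depth -= 1

def new_never_empty_queue(item):
    def block():
        q = Queue()
        q.put(item)
        always(lambda: not q.is_empty())     # 'always { !q.IsEmpty(); }'
        return q
    return atomic(block)

q = new_never_empty_queue("x")
try:
    atomic(q.get)          # would leave q empty
except InvariantFailure:
    print("aborted")       # the Get is rolled back; q keeps its item
```

Because the check runs at every top-level commit, the invariant stays associated with the data rather than being duplicated at each update site.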
Slide 16: Summary so far
With locks, you think about:
• Which lock protects which data? What data can be mutated when by other threads? Which condition variables must be notified when?
• None of this is explicit in the source code
With atomic blocks, you think about:
• Purely sequential reasoning within a block, which is dramatically easier
• What are the invariants (e.g. the tree is balanced)?
• Each atomic block maintains the invariants
• A much easier setting for static analysis tools
Slide 17: Summary so far
• Atomic blocks (atomic, retry, orElse, always) are a real step forward
• It's like using a high-level language instead of assembly code: whole classes of low-level errors are eliminated
• Not a silver bullet:
  – you can still write buggy programs;
  – concurrent programs are still harder to write than sequential ones;
  – aimed at shared memory
• But the improvement remains substantial
Slide 18: Can they be implemented with acceptable performance?
• Direct-update STM (but see the programming-model implications later)
  – Locking to allow transactions to make updates in place in the heap
  – Log overwritten values
  – Reads naturally see earlier writes that the transaction has made (with either optimistic or pessimistic concurrency control)
  – Makes successful commit operations faster, at the cost of extra work on contention or when a transaction aborts
• Compiler integration
  – Decompose the transactional memory operations into primitives
  – Expose the primitives to compiler optimization (e.g. to hoist concurrency-control operations out of a loop)
• Runtime system integration
  – Integration with the garbage collector or runtime system components to scale to atomic blocks containing 100M accesses
Slide 19: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (ver = 100, val = 10); c2 (ver = 200, val = 40). T1's log: empty. T2's log: empty.]
Slide 20: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (ver = 100, val = 10); c2 (ver = 200, val = 40). T1's log: c1.ver = 100. T2's log: empty.]
T1 reads from c1: logs that it saw version 100.
Slide 21: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (ver = 100, val = 10); c2 (ver = 200, val = 40). T1's log: c1.ver = 100. T2's log: c1.ver = 100.]
T2 also reads from c1: logs that it saw version 100.
Slide 22: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (ver = 100, val = 10); c2 (ver = 200, val = 40). T1's log: c1.ver = 100, c2.ver = 200. T2's log: c1.ver = 100.]
Suppose T1 now reads from c2, and sees it at version 200.
Slide 23: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (locked: T2, val = 10); c2 (ver = 200, val = 40). T1's log: c1.ver = 100, c2.ver = 200. T2's log: c1.ver = 100; lock: c1, 100.]
Before updating c1, thread T2 must lock it, recording the old version number.
Slide 24: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (locked: T2, val = 11); c2 (ver = 200, val = 40). T1's log: c1.ver = 100, c2.ver = 200. T2's log: c1.ver = 100; lock: c1, 100; c1.val = 10.]
(1) Before updating c1.val, thread T2 must log the data it's going to overwrite. (2) After logging the old value, T2 makes its update in place to c1.
Slide 25: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (ver = 101, val = 11); c2 (ver = 200, val = 40). T1's log: c1.ver = 100, c2.ver = 200. T2's log: c1.ver = 100; lock: c1, 100; c1.val = 10.]
(1) Check that the version we locked matches the version we previously read. (2) T2's transaction commits successfully: unlock the object, installing the new version number.
Slide 26: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (ver = 101, val = 11); c2 (ver = 200, val = 40). T1's log: c1.ver = 100, c2.ver = 200. T2's log: empty.]
(1) T1 attempts to commit: check that the versions it read are still up to date. (2) Object c1 was updated from version 100 to 101, so T1's transaction is aborted and re-run.
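The walkthrough above can be replayed in code. The sketch below is a loose, single-threaded reconstruction of the direct-update scheme (per-object version numbers, read logs validated at commit, undo logs for in-place writes); the class and method names are assumptions, and real contention handling (waiting or aborting on a held lock) is elided.

```python
class TxObject:
    def __init__(self, val, ver):
        self.val = val
        self.ver = ver            # version number, bumped on each commit
        self.owner = None         # transaction holding this object's lock

class Abort(Exception):
    pass

class Tx:
    def __init__(self, name):
        self.name = name
        self.read_log = []        # (object, version seen at first read)
        self.undo_log = []        # (object, value about to be overwritten)
        self.locked = {}          # object -> version recorded when locked

    def open_for_read(self, obj):
        self.read_log.append((obj, obj.ver))
        return obj.val

    def open_for_update(self, obj):
        assert obj.owner is None      # a real STM would wait or abort here
        obj.owner = self              # lock the object,
        self.locked[obj] = obj.ver    # recording the old version number

    def write(self, obj, new_val):
        self.undo_log.append((obj, obj.val))  # (1) log the old value
        obj.val = new_val                     # (2) update in place

    def commit(self):
        for obj, seen in self.read_log:        # validate every read:
            if obj.owner is self:              # we hold the lock: compare
                ok = self.locked[obj] == seen  #   the version we locked at
            else:                              # unlocked: version unchanged?
                ok = obj.owner is None and obj.ver == seen
            if not ok:
                self.abort()
                raise Abort(self.name)         # caller re-runs the block
        for obj, old_ver in self.locked.items():
            obj.ver = old_ver + 1              # install new version, unlock
            obj.owner = None

    def abort(self):
        for obj, old_val in reversed(self.undo_log):
            obj.val = old_val                  # undo in-place updates
        for obj in self.locked:
            obj.owner = None

c1 = TxObject(val=10, ver=100)
c2 = TxObject(val=40, ver=200)
t1, t2 = Tx("T1"), Tx("T2")

a = t1.open_for_read(c1)           # T1 logs c1 @ version 100
b = t2.open_for_read(c1)           # T2 logs c1 @ version 100
a += t1.open_for_read(c2)          # T1 logs c2 @ version 200
t2.open_for_update(c1)             # T2 locks c1, recording version 100
t2.write(c1, b + 1)                # old value 10 logged; c1.val becomes 11
t2.commit()                        # validation passes; c1 now at version 101

try:
    t1.commit()
except Abort:
    print("T1 aborted")            # c1 moved from 100 to 101: T1 must re-run
```

Note how T2's commit is cheap (it already holds the lock and the update is in place), while the cost of the conflict falls on T1's abort and re-run, matching the trade-off described on Slide 18.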
Slide 27: Compiler integration
• We expose decomposed log-writing operations in the compiler's internal intermediate code (no change to MSIL)
  – OpenForRead: before the first time we read from an object (e.g. c1 or c2 in the examples)
  – OpenForUpdate: before the first time we update an object
  – LogOldValue: before the first time we write to a given field

Source code:
  atomic { ... t += n.value; n = n.next; ... }
Basic intermediate code:
  OpenForRead(n); t = n.value;
  OpenForRead(n); n = n.next;
Optimized intermediate code:
  OpenForRead(n); t = n.value;
  n = n.next;
Slide 28: Compiler integration: avoiding upgrades

Source code:
  atomic { ... c1.val++; ... }
Compiler's intermediate code:
  OpenForRead(c1);
  temp1 = c1.val;
  temp1++;
  OpenForUpdate(c1);
  LogOldValue(&c1.val);
  c1.val = temp1;
Optimized intermediate code:
  OpenForUpdate(c1);
  temp1 = c1.val;
  temp1++;
  LogOldValue(&c1.val);
  c1.val = temp1;
Slide 29: Compiler integration: other optimizations
• Hoist OpenFor* and Log* operations out from loops
• Avoid OpenFor* and Log* operations on objects allocated inside atomic blocks (these objects must be thread-local)
• Move OpenFor* operations from methods to their callers
• Further decompose operations to allow logging-space checks to be combined
• Expose the implementations of OpenFor* and Log* to inlining and further optimization
Slide 30: Results: concurrency control overhead
[Bar chart: normalized execution time. Sequential baseline (1.00x); coarse-grained locking (1.13x); fine-grained locking (2.57x); direct-update STM (2.04x); direct-update STM + compiler integration (1.46x); buffered memory accesses, non-blocking commit (5.69x).]
Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535.
[Annotation: scalable to multicore.]
Slide 31: Results: scalability
[Line chart: microseconds per operation (y-axis, 0 to 8) against #threads (x-axis, 1 to 12), for fine-grained locking, direct-update STM + compiler integration, buffered-update non-blocking STM, and coarse-grained locking.]
Slide 32: Results: long-running tests
[Bar chart: normalized execution time on five benchmarks (tree, go, skip, merge-sort, xlisp) for direct-update STM, compile-time optimizations, and run-time filtering, against the original application without transactions (1.00x on each benchmark). Visible values include 2.04, 2.11, 6.05, 5.84, 3.91, 2.08, 2.05, 4.32, 3.16, 2.61, 1.72, and 1.68, with off-scale bars at 10.8, 73, and 162.]
Slide 33: Research challenges

Slide 34: The transactional elephant
How do all these issues fit into the bigger picture (just the trunk, as it were, of the concurrency elephant)?
• Semantics of atomic blocks: progress guarantees, forms of nesting, interaction outside the transaction
• Other applications of transactional memory, e.g. speculation
• Software transactional memory algorithms and APIs
• Compiling and optimizing atomic blocks to STM APIs
• Hardware transactional memory
• Hardware acceleration of software transactional memory
• Integration with other resources: transactional IO, etc.
Slide 35: Semantics of atomic blocks
• What does it mean for a program using atomic blocks to be correctly synchronized?
• What guarantees are lost / retained in programs which are not correctly synchronized?
• What tools and analysis techniques can we develop to identify problems, or to prove correct synchronization?
Slide 36: Tempting goal: "strong" atomicity
• "Atomic means atomic": as if the block runs with all other threads suspended, outside any atomic blocks
• Problems:
  – Is it practical to implement?
  – Is it actually useful?

atomic { x++; }     // one thread
x++;                // another thread, outside any atomic block

The atomic action can run at any point during the non-atomic 'x++'.
Slide 37: One approach: single-lock semantics
• Think of atomic blocks as acquiring and releasing a single process-wide lock
• If the resulting single-lock program is correctly synchronized, then so is the original

atomic { x++; }    =>    lock (TxLock) { x++; }
atomic { y++; }    =>    lock (TxLock) { y++; }
Slide 38: One approach: single-lock semantics

Thread 1: atomic { x++; }      Thread 2: atomic { x++; }
  Correct: consistently protected by atomic blocks

Thread 1: atomic { x++; }      Thread 2: x++;
  Not correctly synchronized: race on x

Thread 1: lock (x) { x++; }    Thread 2: lock (x) { x++; }
  Correct: consistently protected by x's lock

Thread 1: atomic { x++; }      Thread 2: lock (x) { x++; }
  Not correctly synchronized: no interaction between TxLock and x's lock
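The lock-based reading of atomic can be tried directly. The Python sketch below is a model of the slide's mapping, not an STM: every atomic block takes the same process-wide TxLock, so two atomic increments exclude each other, and the closing comment notes why mixing atomic with a per-object lock is a race. The names are assumptions.

```python
import threading

TxLock = threading.Lock()      # the single process-wide transaction lock

x = 0

def atomic_incs(n):
    global x
    for _ in range(n):
        with TxLock:           # atomic { x++; }  =>  lock (TxLock) { x++; }
            x += 1

threads = [threading.Thread(target=atomic_incs, args=(25000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(x)   # 100000: every access is consistently protected by TxLock

# By contrast, a thread doing 'lock (x) { x++; }' with a per-object lock
# would not exclude these atomic blocks: the two sides acquire different
# locks, so neither keeps the other out, and the program races on x.
```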
Slide 39: Single-lock semantics and privatization
• This is correctly synchronized under single-lock semantics (initially x = 0, x_shared = true):

// Thread 1
atomic { // Tx1
  x_shared = false;
}
x++;

// Thread 2
atomic { // Tx2
  if (x_shared) {
    x++;
  }
}

Single-lock semantics => x = 1 or x = 2 after both threads' actions.

• There are problems with optimistic concurrency control (either in-place or deferred updates), and also with non-strict pessimistic concurrency control
• Q: is this a problem in the semantics or the implementation?
Slide 40: Segregated-heap semantics
• Many problems are simplified if we distinguish tx- and non-tx-accessible data
  – tx-accessible data is always and only accessed within tx
  – non-tx-accessible data is never accessed within tx
• This is the model we provide in STM-Haskell
  – A TVar a is a tx-mutable cell holding a value of type 'a'
  – A computation of type STM a is only allowed to update TVars, not other mutable cells
  – A computation of type IO a can mutate other cells, and TVars only by entering atomic blocks
• ...but there is a vast annotation or expressiveness burden in trying to control effects to the same extent in other languages
Slide 41: Conclusions

Slide 42: Conclusions
• Atomic blocks and transactional memory provide a fundamental improvement to the abstractions for building shared-memory concurrent data structures
  – Only one part of the concurrency landscape
  – But we believe no one solution will solve all the challenges there
• Our experience with compiler integration shows that a pure software implementation can perform well for some workloads
  – We still need a better understanding of realistic workloads
  – We need this to inform the design of the semantics of atomic blocks, as well as to guide the optimization of their implementation