TRANSCRIPT
Slide 1: Optimizing memory transactions
Tim Harris, Maurice Herlihy, Jim Larus, Virendra Marathe, Simon Marlow, Simon Peyton Jones, Mark Plesko, Avi Shinnar, Satnam Singh, David Tarditi
Slide 2: Moore's law
[Chart: transistor counts per processor, from roughly 1,000 in 1970 to 1,000,000,000 by 2005, for the Intel 4004, 8086, 286, 386, 486, Pentium, Pentium II, Pentium 4, and Itanium 2. Source: table from http://www.intel.com/technology/mooreslaw/index.htm]
Annotations: pipelined execution; superscalar processors; data-parallel extensions; increasing ILP through speculation; increases in clock frequency; increases in ILP; parallelism.
Slide 3: [Image slide. Source: viewed 15 Nov 2006.]
Slide 4: Making use of parallel hardware is hard
• How do we share data between concurrent threads?
• How do we find things to run in parallel?
Both of these are difficult problems; in this talk I'll focus on the first.
Server workloads often provide a natural source of parallelism: deal with different clients concurrently.
Traditional HPC is reasonably well understood too: problems can be partitioned, or parallel building blocks composed.
Programmers aren't good at thinking about thread interleavings, and locks provide a complicated, non-composable solution.
Slide 5: Overview
• Introduction
• Problems with locks
• Controlling concurrent access
• Research challenges
• Conclusions
Slide 6: A deceptively simple problem
[Diagram: a double-ended queue holding values such as 11, 14, 3, 5, 6, 7, 12, 30.]
PushRight / PopRight work on the right-hand end; PushLeft / PopLeft work on the left-hand end.
• Make it thread-safe by adding a lock: needlessly serializes operations when the queue is long
• One lock for each end? Much more complex when the queue is short (0, 1, or 2 items)
Slide 7: Locks are broken
• Races: due to forgotten locks
• Deadlock: locks acquired in the "wrong" order
• Lost wakeups: a forgotten notify to a condition variable
• Error recovery is tricky: need to restore invariants and release locks in exception handlers
• Tension between simplicity and scalability
• Lack of progress guarantees
• ...but worst of all, locks do not compose
Slide 8: Atomic memory transactions
• To a first approximation, just write the sequential code and wrap atomic around it
• All-or-nothing semantics: atomic commit
• An atomic block executes in isolation
• Cannot deadlock (there are no locks!)
• Atomicity makes error recovery easy (e.g. an exception thrown inside the Get code)

Item Get() {
  atomic { ... sequential get code ... }
}
Slide 9: Atomic blocks compose (locks do not)
• Guarantees to get two consecutive items
• Get() is unchanged (the nested atomic is a no-op)
• Cannot be achieved with locks (except by breaking the Get() abstraction)

void GetTwo() {
  atomic {
    i1 = Get();
    i2 = Get();
  }
  DoSomething(i1, i2);
}
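The composition guarantee can be modeled executably. The sketch below is not an STM: it simulates atomic with a single process-wide re-entrant lock (the "single-lock semantics" discussed later in the talk), which is enough to show that Get() drops into GetTwo() unchanged. The Python names are illustrative assumptions.

```python
import threading
from collections import deque

# One process-wide re-entrant lock models 'atomic': re-entrancy makes a
# nested atomic block a no-op, which is what lets Get() compose unchanged.
_tx_lock = threading.RLock()

class atomic:
    def __enter__(self):
        _tx_lock.acquire()
    def __exit__(self, *exc_info):
        _tx_lock.release()
        return False

queue = deque([11, 14, 3])

def get():
    with atomic():                 # just the sequential code, wrapped
        return queue.popleft()

def get_two():
    with atomic():                 # guarantees two *consecutive* items:
        return get(), get()        # no thread runs between the two gets

print(get_two())  # (11, 14)
```

With per-end locks, GetTwo could only make the two Gets consecutive by reaching inside Get's locking discipline; here the nested acquire simply succeeds.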
Slide 10: Blocking: how does Get() wait for data?
• retry means "abandon execution of the atomic block and re-run it (when there is a chance it'll complete)"
• No lost wake-ups
• No consequential change to GetTwo(), even though GetTwo must wait for there to be two items in the queue

Item Get() {
  atomic {
    if (size == 0) { retry; }
    else { ...remove item from queue... }
  }
}
Slide 11: Choice: waiting for either of two queues
• do {...this...} orelse {...that...} tries to run "this"
• If "this" retries, it runs "that" instead
• If both retry, the do-block retries. GetEither() will thereby wait for there to be an item in either queue

[Diagram: queues Q1 and Q2 feeding a result queue R.]

void GetEither() {
  atomic {
    do { i = Q1.Get(); }
    orelse { i = Q2.Get(); }
    R.Put(i);
  }
}
Slide 12: Why is 'orelse' left-biased? Blocking v polling
• No need to write separate blocking / polling methods
• Can extend to deal with timeouts by using a 'wakeup' message on a shared memory buffer

Item TryGet() {
  atomic {
    do { return Get(); }
    orelse { return null; }
  }
}
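retry and orelse can be given a small executable model too. The sketch below is a deliberate simplification, not the paper's implementation: a retried block sleeps until any transaction commits and then re-runs, using one global condition variable, whereas a real STM would wake only on changes to the locations the block actually read. All names are invented for illustration.

```python
import threading

class Retry(Exception):
    """Abandon execution of the atomic block; re-run when something changes."""

_cond = threading.Condition()

def atomic(block):
    # Single-lock model: run the block; on Retry, sleep until some
    # transaction commits, then re-run the whole block from scratch.
    with _cond:
        while True:
            try:
                result = block()
                _cond.notify_all()     # commit wakes all waiters: no lost wake-ups
                return result
            except Retry:
                _cond.wait()

def orelse(this, that):
    # Left-biased choice: try 'this'; if it retries, run 'that' instead.
    # If 'that' also retries, the Retry propagates: the whole do-block retries.
    def block():
        try:
            return this()
        except Retry:
            return that()
    return block

queue = []

def get():                    # blocking: retries while the queue is empty
    if not queue:
        raise Retry()
    return queue.pop(0)

def try_get():                # polling, derived from the blocking method
    return atomic(orelse(get, lambda: None))

def put(item):
    atomic(lambda: queue.append(item))

put(42)
print(try_get())   # 42
print(try_get())   # None: get() retries on the empty queue, orelse falls through
```

The left bias is what makes try_get work: the blocking get() is attempted first, and only if it would block does control fall through to the polling default.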
Slide 13: Choice: a server waiting on two clients

atomic {
  Item i;
  do {
    i = q1.PopLeft(3);
  } orelse {
    i = q2.PopLeft(3);
  }
  // Process item
}

atomic {
  do {
    WaitForShutdownRequest();
    // Process shutdown request
  } orelse {
  }
}

Annotations:
• Try this first: if it gets an item without hitting retry then we're done
• If q1.PopLeft can't finish, then undo its side effects and try q2.PopLeft instead
• Unlike WaitAny or Select, we can compose code using retry, e.g. wait for a shutdown request, or for client requests
Slide 14: Invariants
• Suppose a queue should always have at least one element in it; how can we check this?
  – Find all the places where queues are modified and put in a check?
  – Define 'NonEmptyQueue', which wraps the ordinary queue operations?
• The check is duplicated (and possibly forgotten)
• The invariant may be associated with multiple data structures ("queue X contains fewer elements than queue Y")
• What if different queues have different constraints?
Slide 15: Invariants
• Define invariants that must be preserved by successful atomic blocks

Queue NewNeverEmptyQueue(Item i) {
  atomic {
    Queue q = new Queue();
    q.Put(i);
    always { !q.IsEmpty(); }
    return q;
  }
}

The always block is a computation to check the invariant: it must be true when proposed, and preserved by top-level atomic blocks.

Queue q = NewNeverEmptyQueue(i);
atomic { q.Get(); }    // Invariant failure: would leave q empty
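A hedged sketch of always: the model below registers invariant checks during a transaction, verifies them when proposed and again at the end of every top-level atomic block, and rolls the queues back if a check fails. It is a single-lock simplification with invented names, not the implementation behind the slide.

```python
import threading

_tx_lock = threading.RLock()
_invariants = []   # predicates re-checked at the end of each top-level block
_queues = []       # every queue, so an aborted transaction can be rolled back
_depth = 0

class InvariantFailure(Exception):
    pass

class Queue:
    def __init__(self):
        self.items = []
        _queues.append(self)
    def put(self, x): self.items.append(x)
    def get(self): return self.items.pop(0)
    def is_empty(self): return not self.items

def always(check):
    if not check():
        raise InvariantFailure("invariant false when proposed")
    _invariants.append(check)

def atomic(block):
    global _depth
    with _tx_lock:
        if _depth == 0:                      # top level: snapshot for rollback
            snapshot = [list(q.items) for q in _queues]
        _depth += 1
        try:
            result = block()
            if _depth == 1:                  # committing a top-level block:
                for check in _invariants:    # every invariant must still hold
                    if not check():
                        raise InvariantFailure("block would break an invariant")
            return result
        except BaseException:
            if _depth == 1:                  # abort: undo the block's updates
                for q, saved in zip(_queues, snapshot):
                    q.items = saved
            raise
        finally:
            _depth -= 1

def new_never_empty_queue(item):
    def block():
        q = Queue()
        q.put(item)
        always(lambda: not q.is_empty())     # 'always { !q.IsEmpty(); }'
        return q
    return atomic(block)

q = new_never_empty_queue("x")
try:
    atomic(q.get)          # would leave q empty
except InvariantFailure:
    print("aborted")       # the Get is rolled back; q keeps its item
```

Because the check runs at every top-level commit, the invariant stays associated with the data rather than being duplicated at each update site.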
Slide 16: Summary so far
With locks, you think about:
• Which lock protects which data? What data can be mutated when by other threads? Which condition variables must be notified when?
• None of this is explicit in the source code
With atomic blocks, you think about:
• Purely sequential reasoning within a block, which is dramatically easier
• What are the invariants (e.g. the tree is balanced)?
• Each atomic block maintains the invariants
• A much easier setting for static analysis tools
Slide 17: Summary so far
• Atomic blocks (atomic, retry, orElse, always) are a real step forward
• It's like using a high-level language instead of assembly code: whole classes of low-level errors are eliminated
• Not a silver bullet:
  – you can still write buggy programs;
  – concurrent programs are still harder to write than sequential ones;
  – aimed at shared memory
• But the improvement remains substantial
Slide 18: Can they be implemented with acceptable performance?
• Direct-update STM (but see the programming-model implications later)
  – Locking to allow transactions to make updates in place in the heap
  – Log overwritten values
  – Reads naturally see earlier writes that the transaction has made (with either optimistic or pessimistic concurrency control)
  – Makes successful commit operations faster, at the cost of extra work on contention or when a transaction aborts
• Compiler integration
  – Decompose the transactional memory operations into primitives
  – Expose the primitives to compiler optimization (e.g. to hoist concurrency-control operations out of a loop)
• Runtime system integration
  – Integration with the garbage collector or runtime system components to scale to atomic blocks containing 100M accesses
Slide 19: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (ver = 100, val = 10); c2 (ver = 200, val = 40). T1's log: empty. T2's log: empty.]
Slide 20: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (ver = 100, val = 10); c2 (ver = 200, val = 40). T1's log: c1.ver = 100. T2's log: empty.]
T1 reads from c1: logs that it saw version 100.
Slide 21: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (ver = 100, val = 10); c2 (ver = 200, val = 40). T1's log: c1.ver = 100. T2's log: c1.ver = 100.]
T2 also reads from c1: logs that it saw version 100.
Slide 22: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (ver = 100, val = 10); c2 (ver = 200, val = 40). T1's log: c1.ver = 100, c2.ver = 200. T2's log: c1.ver = 100.]
Suppose T1 now reads from c2, and sees it at version 200.
Slide 23: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (locked: T2, val = 10); c2 (ver = 200, val = 40). T1's log: c1.ver = 100, c2.ver = 200. T2's log: c1.ver = 100; lock: c1, 100.]
Before updating c1, thread T2 must lock it, recording the old version number.
Slide 24: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (locked: T2, val = 11); c2 (ver = 200, val = 40). T1's log: c1.ver = 100, c2.ver = 200. T2's log: c1.ver = 100; lock: c1, 100; c1.val = 10.]
(1) Before updating c1.val, thread T2 must log the data it's going to overwrite. (2) After logging the old value, T2 makes its update in place to c1.
Slide 25: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (ver = 101, val = 11); c2 (ver = 200, val = 40). T1's log: c1.ver = 100, c2.ver = 200. T2's log: c1.ver = 100; lock: c1, 100; c1.val = 10.]
(1) Check that the version we locked matches the version we previously read. (2) T2's transaction commits successfully: unlock the object, installing the new version number.
Slide 26: Example: contention between transactions

Thread T1:  int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2:  atomic { t = c1.val; t++; c1.val = t; }

[Heap: c1 (ver = 101, val = 11); c2 (ver = 200, val = 40). T1's log: c1.ver = 100, c2.ver = 200. T2's log: empty.]
(1) T1 attempts to commit: check that the versions it read are still up to date. (2) Object c1 was updated from version 100 to 101, so T1's transaction is aborted and re-run.
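The walkthrough above can be replayed in code. The sketch below is a loose, single-threaded reconstruction of the direct-update scheme (per-object version numbers, read logs validated at commit, undo logs for in-place writes); the class and method names are assumptions, and real contention handling (waiting or aborting on a held lock) is elided.

```python
class TxObject:
    def __init__(self, val, ver):
        self.val = val
        self.ver = ver            # version number, bumped on each commit
        self.owner = None         # transaction holding this object's lock

class Abort(Exception):
    pass

class Tx:
    def __init__(self, name):
        self.name = name
        self.read_log = []        # (object, version seen at first read)
        self.undo_log = []        # (object, value about to be overwritten)
        self.locked = {}          # object -> version recorded when locked

    def open_for_read(self, obj):
        self.read_log.append((obj, obj.ver))
        return obj.val

    def open_for_update(self, obj):
        assert obj.owner is None      # a real STM would wait or abort here
        obj.owner = self              # lock the object,
        self.locked[obj] = obj.ver    # recording the old version number

    def write(self, obj, new_val):
        self.undo_log.append((obj, obj.val))  # (1) log the old value
        obj.val = new_val                     # (2) update in place

    def commit(self):
        for obj, seen in self.read_log:        # validate every read:
            if obj.owner is self:              # we hold the lock: compare
                ok = self.locked[obj] == seen  #   the version we locked at
            else:                              # unlocked: version unchanged?
                ok = obj.owner is None and obj.ver == seen
            if not ok:
                self.abort()
                raise Abort(self.name)         # caller re-runs the block
        for obj, old_ver in self.locked.items():
            obj.ver = old_ver + 1              # install new version, unlock
            obj.owner = None

    def abort(self):
        for obj, old_val in reversed(self.undo_log):
            obj.val = old_val                  # undo in-place updates
        for obj in self.locked:
            obj.owner = None

c1 = TxObject(val=10, ver=100)
c2 = TxObject(val=40, ver=200)
t1, t2 = Tx("T1"), Tx("T2")

a = t1.open_for_read(c1)           # T1 logs c1 @ version 100
b = t2.open_for_read(c1)           # T2 logs c1 @ version 100
a += t1.open_for_read(c2)          # T1 logs c2 @ version 200
t2.open_for_update(c1)             # T2 locks c1, recording version 100
t2.write(c1, b + 1)                # old value 10 logged; c1.val becomes 11
t2.commit()                        # validation passes; c1 now at version 101

try:
    t1.commit()
except Abort:
    print("T1 aborted")            # c1 moved from 100 to 101: T1 must re-run
```

Note how T2's commit is cheap (it already holds the lock and the update is in place), while the cost of the conflict falls on T1's abort and re-run, matching the trade-off described on Slide 18.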
Slide 27: Compiler integration
• We expose decomposed log-writing operations in the compiler's internal intermediate code (no change to MSIL)
  – OpenForRead: before the first time we read from an object (e.g. c1 or c2 in the examples)
  – OpenForUpdate: before the first time we update an object
  – LogOldValue: before the first time we write to a given field

Source code:
  atomic { ... t += n.value; n = n.next; ... }
Basic intermediate code:
  OpenForRead(n); t = n.value;
  OpenForRead(n); n = n.next;
Optimized intermediate code:
  OpenForRead(n); t = n.value;
  n = n.next;
Slide 28: Compiler integration: avoiding upgrades

Source code:
  atomic { ... c1.val++; ... }
Compiler's intermediate code:
  OpenForRead(c1);
  temp1 = c1.val;
  temp1++;
  OpenForUpdate(c1);
  LogOldValue(&c1.val);
  c1.val = temp1;
Optimized intermediate code:
  OpenForUpdate(c1);
  temp1 = c1.val;
  temp1++;
  LogOldValue(&c1.val);
  c1.val = temp1;
Slide 29: Compiler integration: other optimizations
• Hoist OpenFor* and Log* operations out from loops
• Avoid OpenFor* and Log* operations on objects allocated inside atomic blocks (these objects must be thread-local)
• Move OpenFor* operations from methods to their callers
• Further decompose operations to allow logging-space checks to be combined
• Expose the implementations of OpenFor* and Log* to inlining and further optimization
Slide 30: Results: concurrency control overhead
[Bar chart: normalized execution time. Sequential baseline (1.00x); coarse-grained locking (1.13x); fine-grained locking (2.57x); direct-update STM (2.04x); direct-update STM + compiler integration (1.46x); buffered memory accesses, non-blocking commit (5.69x).]
Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535.
[Annotation: scalable to multicore.]
Slide 31: Results: scalability
[Line chart: microseconds per operation (y-axis, 0 to 8) against #threads (x-axis, 1 to 12), for fine-grained locking, direct-update STM + compiler integration, buffered-update non-blocking STM, and coarse-grained locking.]
Slide 32: Results: long-running tests
[Bar chart: normalized execution time on five benchmarks (tree, go, skip, merge-sort, xlisp) for direct-update STM, compile-time optimizations, and run-time filtering, against the original application without transactions (1.00x on each benchmark). Visible values include 2.04, 2.11, 6.05, 5.84, 3.91, 2.08, 2.05, 4.32, 3.16, 2.61, 1.72, and 1.68, with off-scale bars at 10.8, 73, and 162.]
Slide 33: Research challenges

Slide 34: The transactional elephant
How do all these issues fit into the bigger picture (just the trunk, as it were, of the concurrency elephant)?
• Semantics of atomic blocks: progress guarantees, forms of nesting, interaction outside the transaction
• Other applications of transactional memory, e.g. speculation
• Software transactional memory algorithms and APIs
• Compiling and optimizing atomic blocks to STM APIs
• Hardware transactional memory
• Hardware acceleration of software transactional memory
• Integration with other resources: transactional IO, etc.
Slide 35: Semantics of atomic blocks
• What does it mean for a program using atomic blocks to be correctly synchronized?
• What guarantees are lost / retained in programs which are not correctly synchronized?
• What tools and analysis techniques can we develop to identify problems, or to prove correct synchronization?
Slide 36: Tempting goal: "strong" atomicity
• "Atomic means atomic": as if the block runs with all other threads suspended, outside any atomic blocks
• Problems:
  – Is it practical to implement?
  – Is it actually useful?

atomic { x++; }     // one thread
x++;                // another thread, outside any atomic block

The atomic action can run at any point during the non-atomic 'x++'.
Slide 37: One approach: single-lock semantics
• Think of atomic blocks as acquiring and releasing a single process-wide lock
• If the resulting single-lock program is correctly synchronized, then so is the original

atomic { x++; }    =>    lock (TxLock) { x++; }
atomic { y++; }    =>    lock (TxLock) { y++; }
Slide 38: One approach: single-lock semantics

Thread 1: atomic { x++; }      Thread 2: atomic { x++; }
  Correct: consistently protected by atomic blocks

Thread 1: atomic { x++; }      Thread 2: x++;
  Not correctly synchronized: race on x

Thread 1: lock (x) { x++; }    Thread 2: lock (x) { x++; }
  Correct: consistently protected by x's lock

Thread 1: atomic { x++; }      Thread 2: lock (x) { x++; }
  Not correctly synchronized: no interaction between TxLock and x's lock
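The lock-based reading of atomic can be tried directly. The Python sketch below is a model of the slide's mapping, not an STM: every atomic block takes the same process-wide TxLock, so two atomic increments exclude each other, and the closing comment notes why mixing atomic with a per-object lock is a race. The names are assumptions.

```python
import threading

TxLock = threading.Lock()      # the single process-wide transaction lock

x = 0

def atomic_incs(n):
    global x
    for _ in range(n):
        with TxLock:           # atomic { x++; }  =>  lock (TxLock) { x++; }
            x += 1

threads = [threading.Thread(target=atomic_incs, args=(25000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(x)   # 100000: every access is consistently protected by TxLock

# By contrast, a thread doing 'lock (x) { x++; }' with a per-object lock
# would not exclude these atomic blocks: the two sides acquire different
# locks, so neither keeps the other out, and the program races on x.
```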
Slide 39: Single-lock semantics and privatization
• This is correctly synchronized under single-lock semantics (initially x = 0, x_shared = true):

// Thread 1
atomic { // Tx1
  x_shared = false;
}
x++;

// Thread 2
atomic { // Tx2
  if (x_shared) {
    x++;
  }
}

Single-lock semantics => x = 1 or x = 2 after both threads' actions.

• There are problems with optimistic concurrency control (either in-place or deferred updates), and also with non-strict pessimistic concurrency control
• Q: is this a problem in the semantics or the implementation?
Slide 40: Segregated-heap semantics
• Many problems are simplified if we distinguish tx- and non-tx-accessible data
  – tx-accessible data is always and only accessed within tx
  – non-tx-accessible data is never accessed within tx
• This is the model we provide in STM-Haskell
  – A TVar a is a tx-mutable cell holding a value of type 'a'
  – A computation of type STM a is only allowed to update TVars, not other mutable cells
  – A computation of type IO a can mutate other cells, and TVars only by entering atomic blocks
• ...but there is a vast annotation or expressiveness burden in trying to control effects to the same extent in other languages
Slide 41: Conclusions

Slide 42: Conclusions
• Atomic blocks and transactional memory provide a fundamental improvement to the abstractions for building shared-memory concurrent data structures
  – Only one part of the concurrency landscape
  – But we believe no one solution will solve all the challenges there
• Our experience with compiler integration shows that a pure software implementation can perform well for some workloads
  – We still need a better understanding of realistic workloads
  – We need this to inform the design of the semantics of atomic blocks, as well as to guide the optimization of their implementation