Transcript
Page 1: TM performance: seeing the whole picture or Looking back over the first 500 papers

TM performance: seeing the whole picture

or

Looking back over the first 500 papers

Tim Harris (MSR Cambridge)

Page 2

Page 3

How might we compare TM systems?

Where might TM be most useful?

Page 4

Extending Dan’s GC analogy

Concurrent GC algorithm (run GC in small steps in amongst the mutators)

A: “Here’s a way to reduce the pause times...”

B: “Here’s a way to support pinned objects...”

C: “Here’s a way to improve the throughput (total app runtime)...”

Page 5

Min mutator utilization

[Chart: min fraction of interval running mutator (0.0–0.9) against time interval / ms (0–12), for Algorithm A and Algorithm B.]
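The metric on this slide can be made concrete: minimum mutator utilization (MMU) for a window size w is the smallest fraction of any w-length interval during which the mutator, rather than the collector, was running. A minimal sketch, assuming a log of GC pause intervals; the function name and interval representation are illustrative, not from the talk:

```python
def min_mutator_utilization(pauses, total, window):
    """MMU over [0, total] for a given window length.

    pauses: list of (start, end) GC pause intervals within [0, total].
    """
    def paused_in(lo):
        # Total pause time overlapping the window [lo, lo + window].
        hi = lo + window
        return sum(max(0.0, min(hi, e) - max(lo, s)) for s, e in pauses)

    # Pause overlap is piecewise linear in the window's start position,
    # so its maximum occurs with a window edge aligned to a pause edge.
    candidates = {0.0, total - window}
    for s, e in pauses:
        for lo in (s - window, s, e - window, e):
            candidates.add(min(max(lo, 0.0), total - window))
    return min(1.0 - paused_in(lo) / window for lo in candidates)
```

Plotting this over a range of window sizes gives curves like the Algorithm A / Algorithm B comparison on the slide: a collector with short, well-spread pauses keeps utilization high even for small windows.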

Page 6

Five dimensions to TM behavior:
– Sequential overhead
– Scalability (to longer transactions)
– Scalability (to more cores)
– Tx-supported operations
– Semantics

Page 7

Scaling to large transactions

[Chart: normalized execution time (0.0–5.0) against tx size (0–10), for Algorithm A and Algorithm B; 1.0 = optimized sequential code (no tx, no locks).]

Page 8

Scaling: n*1-core copies

[Chart: normalized execution time (0–6) against #cores (0–10), for Algorithm A and Algorithm B; 1.0 = optimized sequential code (no tx, no locks).]

Page 9

Scaling: 1*n-core copy

[Chart: speedup over sequential (0–2.5) against #cores (0–10), for Algorithm A and Algorithm B; 1.0 = optimized sequential code (no tx, no locks).]

Page 10

How might we compare TM systems?

Where might TM be most useful?

Page 11

Application model #1

Sequential Parallelizable

f = fraction of original program that is parallelizable

Page 12

Application model #1

Sequential

Parallel

Parallel

Parallel

...

f = fraction of original program that is parallelizable
n = num parallel threads

Page 13

Application model #1

Sequential

Parallel, transactional

Parallel, transactional

Parallel, transactional

...

f = fraction of original program that is parallelizable
n = num parallel threads
x = straight-line transactional slow-down

Page 14

Conflict model

f = fraction of original program that is parallelizable
n = num parallel threads
x = straight-line transactional slow-down
c = mean number of attempts per transaction (1 => no conflicts)

[Diagram: transactions 1–6; either execute conflicting operations in series, or take a fixed number of alternatives and execute the different alternatives in parallel.]
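With these four variables the model can be written down as an Amdahl-style formula. The following is a reconstruction from the slide’s definitions, not an equation stated in the deck: the sequential fraction (1 − f) runs unchanged, while the parallel fraction f is divided over n threads and inflated by the slow-down x and the mean attempt count c.

```python
def speedup(f, n, x=1.0, c=1.0):
    # f: fraction of the program that is parallelizable
    # n: number of parallel threads
    # x: straight-line transactional slow-down
    # c: mean attempts per transaction (1 => no conflicts)
    # Amdahl-style: sequential part at native speed, parallel part
    # split n ways but paying the x and c penalties.
    return 1.0 / ((1.0 - f) + f * x * c / n)
```

For example, f = 0.95 on n = 16 threads with x = c = 1 gives about a 9.1x speedup, and a modest slow-down of x ≈ 1.2 pulls that down to roughly 8x, consistent with the “8x on 16 threads => 95% parallelizable” annotation later in the deck.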

Page 15

n=16, c=1.0, vary f, vary x

[Heat map: f (parallel proportion, 75%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56).]

f = fraction of original program that is parallelizable
n = num parallel threads
x = straight-line transactional slow-down
c = mean number of attempts per transaction (1 => no conflicts)

Page 16

n=16, c=1.0

[Heat map: f (parallel proportion, 75%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56).]

8x on 16 threads => 95% parallelizable

Page 17

n=16, c=1.0

[Heat map: f (parallel proportion, 75%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56).]

Straight-line slow-down bites quickly

Page 18

n=16, c=1.1 (1..1024)

[Heat map: f (parallel proportion, 75%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56).]

Page 19

n=16, c=1.4 (1..256)

[Heat map: f (parallel proportion, 75%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56).]

Page 20

n=16, c=2.0 (1..64)

[Heat map: f (parallel proportion, 75%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56).]

Page 21

n=16, c=3.1 (1..16)

[Heat map: f (parallel proportion, 75%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56).]

If Amdahl and overheads don’t get you, then conflicts still can...

Page 22

n=16, c=1.0, scaling of large tx

[Heat map: f (parallel proportion, 75%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56). Inset plot: x*f, for x from 0.0 to 4.0, values up to 10.0.]

Page 23

n=16, c=1.0, x*(f+(f^1.25)/4)

[Heat map: f (parallel proportion, 75%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56). Inset plot: x*f compared with x*(f+(f^1.25)/4), for x from 0.0 to 4.0, values up to 10.0.]

Page 24

n=16, c=1.0, x*(f+(f^2)/4)

[Heat map: f (parallel proportion, 75%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56). Inset plot: x*f compared with x*(f+(f^2)/4), for x from 0.0 to 4.0, values up to 10.0.]

Page 25

Application model #2: 100% parallel

[Diagram: each of n threads alternates Tx and Non-tx work.]

t = fraction of original program that is transactional
n = num parallel threads
x = straight-line transactional slow-down
c = mean number of attempts per transaction (1 => no conflicts)
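Because this model is 100% parallel, only the transactional fraction t pays any penalty. Again a reconstruction from the variable definitions rather than a formula stated in the deck, and the function name is illustrative: each thread’s unit of work is inflated from 1 to (1 − t) + t·x·c, so the speedup over one sequential copy is n divided by that factor.

```python
def speedup_all_parallel(t, n, x=1.0, c=1.0):
    # t: fraction of the work that runs inside transactions
    # n: number of parallel threads
    # x: straight-line transactional slow-down
    # c: mean attempts per transaction (1 => no conflicts)
    # All n threads run; only the transactional fraction is inflated.
    return n / ((1.0 - t) + t * x * c)
```

With t = 1 and x = 2, sixteen threads deliver only an 8x speedup; this is the effect the later slides summarize as overheads rapidly reducing the amount that transactions can be used.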

Page 26

Workloads (ASPLOS ’10)

[Scatter plot: t (transactional proportion, 0%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56), marking the Labyrinth, Genome, JBBAtomic, Vacation, and MaxFlow workloads.]

t = fraction of original program that is transactional
n = num parallel threads
x = straight-line transactional slow-down
c = mean number of attempts per transaction (1 => no conflicts)

Page 27

Workloads (ASPLOS ’10)

[Scatter plot repeated from the previous slide: t (transactional proportion, 0%–100%) against x (straight-line transactional slow-down), marking the Labyrinth, Genome, JBBAtomic, Vacation, and MaxFlow workloads.]

Page 28

n=16, c=1.0 (no conflicts)

[Heat map: t (transactional proportion, 0%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56).]

Page 29

n=16, c=1.0 (no conflicts)

[Heat map: t (transactional proportion, 0%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56).]

Overheads rapidly reduce the amount that transactions can be used

Page 30

n=16, c=1.1 (1..1024)

[Heat map: t (transactional proportion, 0%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56).]

Page 31

n=16, c=1.4 (1..256)

[Heat map: t (transactional proportion, 0%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56).]

Page 32

n=16, c=2.0 (1..64)

[Heat map: t (transactional proportion, 0%–100%) against x (straight-line transactional slow-down, powers of 1.1 from 1 to ≈5.56).]

Page 33

Conclusions

• Bad things come in threes...
– Amdahl’s law
– Sequential overhead
– Conflicts

• When developing TM systems we need to be careful about tradeoffs between these

• There’s a risk of “chasing around the TM design space”:
– Sequential overhead
– Scaling without conflicts
– Scaling with conflicts

