donald e. porter and emmett witchel

35
The University of Texas at Aust UNDERSTANDING TRANSACTIONAL MEMORY PERFORMANCE Donald E. Porter and Emmett Witchel 1

Upload: makoto

Post on 22-Feb-2016

56 views

Category:

Documents


0 download

DESCRIPTION

Understanding Transactional Memory Performance. Donald E. Porter and Emmett Witchel. The University of Texas at Austin. Multicore is here. Only concurrent applications will perform better on new hardware. This laptop 2 Intel cores. Intel Single-chip Cloud Computer 48 cores. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Donald E. Porter and Emmett  Witchel

1

The University of Texas at Austin

UNDERSTANDING TRANSACTIONAL MEMORY

PERFORMANCE

Donald E. Porter and Emmett Witchel

Page 2: Donald E. Porter and Emmett  Witchel

2

Multicore is here

Only concurrent applications will perform better on new hardware

Intel Single-chip Cloud Computer48 cores

Tilera Tile GX100 cores

This laptop2 Intel cores

Page 3: Donald E. Porter and Emmett  Witchel

3

Concurrent programming is hard Locks are the state of the art

Correctness problems: deadlock, priority inversion, etc.

Scaling performance requires more complexity Transactional memory makes correctness easy

Trade correctness problems for performance problems

Key challenge: performance tuning transactions This work:

Develops a TM performance model and tool Systems integration challenges for TM

Page 4: Donald E. Porter and Emmett  Witchel

4

Simple microbenchmark

Intuition: Transactions execute optimistically TM should scale at low contention threshold Locks always execute serially

lock();if(rand() < threshold)

shared_var = new_value;unlock();

xbegin();if(rand() < threshold)

shared_var = new_value;xend();

Page 5: Donald E. Porter and Emmett  Witchel

5

Ideal TM performance

0 20 40 60 80 100

00.5

11.5

22.5

3

Locks 32 CPUs

Probability of Conflict (%)

Exec

utio

n Ti

me

(s)

Performance win at low contention

Higher contention degrades gracefully

xbegin();if(rand() < threshold)

shared_var = new_value;xend();

Lower is betterIdeal, not real data

Page 6: Donald E. Porter and Emmett  Witchel

6

Actual performance under contention

0 20 40 60 80 100

00.5

11.5

22.5

3

Locks 32 CPUs

Probability of Conflict (%)

Exec

utio

n Ti

me

(s)

Comparable performance at modest contention

40% worse at 100% contention

xbegin();if(rand() < threshold)

shared_var = new_value;xend();

Lower is betterActual data

Page 7: Donald E. Porter and Emmett  Witchel

7

First attempt at microbenchmark

0 20 40 60 80 100

00.5

11.5

22.5

3

Locks 32 CPUs

Probability of Conflict (%)

Exec

utio

n Ti

me

(s)

xbegin();if(rand() < threshold)

shared_var = new_value;xend();

Lower is betterApproximate data

Page 8: Donald E. Porter and Emmett  Witchel

8

Subtle sources of contention

if(a < threshold) shared_var = new_value;

eax = shared_var;if(edx < threshold) eax = new_value;shared_var = eax;

Microbenchmark code

gcc optimized code

Compiler optimization to avoid branches Optimization causes 100% restart rate Can’t identify problem with source inspection +

reason

Page 9: Donald E. Porter and Emmett  Witchel

9

Developers need TM tuning tools Transactional memory can perform pathologically

Contention Poor integration with system components HTM “best effort” not good enough

Causes can be subtle and counterintuitive Syncchar: Model that predicts TM performance

Predicts poor performance remove contention Predicts good performance + poor performance system issue

Page 10: Donald E. Porter and Emmett  Witchel

10

This talk Motivating example Syncchar performance model Experiences with transactional memory

Performance tuning case study System integration challenges

Page 11: Donald E. Porter and Emmett  Witchel

11

The Syncchar model Approximate transaction performance model Intuition: scalability limited by serialized length

of critical regions Introduce two key metrics for critical regions:

Data Independence: Likelihood executions do not conflict

Conflict Density: How many threads must execute serially to resolve a conflict

Model inputs: samples critical region executions Memory accesses and execution times

Page 12: Donald E. Porter and Emmett  Witchel

12

Data independence (In) Expected number of non-conflicting, concurrent

executions of a critical region. Formally:In = n - |Cn|

n =thread countCn = set of conflicting critical region executions

Linear speedup when all critical regions are data independent (In = n ) Example: thread-private data structures

Serialized execution when (In = 0 ) Example: concurrent updates to a shared variable

Page 13: Donald E. Porter and Emmett  Witchel

13

Example:

Write a

Read a

Write a

Read a Write a

Write a

Time

Same data independence (0) Different serialization

Thread 1

Thread 2

Thread 3

Page 14: Donald E. Porter and Emmett  Witchel

14

Intuition: Low density High density

How many threads must be serialized to eliminate a conflict?

Similar to dependence density introduced by von Praun et al. [PPoPP ‘07]

Conflict density (Dn)

Write a

Read a

Write a

Read a Write a

Write a

Time

Thread 1

Thread 2

Thread 3

Page 15: Donald E. Porter and Emmett  Witchel

15

Syncchar metrics in STAMP

8 16 32 8 16 32 8 16 32 8 16 320

2

4

6

8

10

12Conflict DensityData Inde-pendence

Proj

ecte

d Sp

eedu

p ov

er

Lock

ing

intruder

kmeans

bayes ssca2Higher is better

Page 16: Donald E. Porter and Emmett  Witchel

16

Predicting execution time Speedup limited by conflict density Amdahl’s law: Transaction speedup limited

to time executing transactions concurrently

cs_cycles = time executing a critical regionother = remaining execution time

Dn = Conflict density

otherDncyclescsTimeExecution

n

)1,max(__

Page 17: Donald E. Porter and Emmett  Witchel

17

Syncchar tool Implemented as Simics machine simulator

module Samples lock-based application behavior Predicts TM performance Features:

Identifies contention “hot spot” addresses Sorts by time spent in critical region Identifies potential asymmetric conflicts

between transactions and non-transactional threads

Page 18: Donald E. Porter and Emmett  Witchel

18

Syncchar validation: microbenchmark

0 10 20 30 40 50 60 70 80 90 1000

0.51

1.52

2.53

Locks 8 CPUsTM 8 CPUs

Probability of Conflict (%)

Exec

utio

n Ti

me

(s)

Tracks trends, does not model pathologies

Balances accuracy with generality

Lower is better

Page 19: Donald E. Porter and Emmett  Witchel

19

Syncchar validation: STAMP

ssca2 32CPUssca2 16CPUssca2 8CPU

intruder 32CPUintruder 16CPUintruder 8CPU

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2

Predicted

Execution Time (s) Coarse predictions track scaling trend

Mean error 25% Additional benchmarks in paper

Page 20: Donald E. Porter and Emmett  Witchel

20

Syncchar summary Model: data independence and conflict

density Both contribute to transactional speedup

Syncchar tool predicts scaling trends Predicts poor performance remove

contention Predicts good performance + poor

performance system issue

Distinguishing high contention from system issues is key step in performance tuning

Page 21: Donald E. Porter and Emmett  Witchel

21

This talk Motivating example Syncchar performance model Experiences with transactional memory

Performance tuning case study System integration challenges

Page 22: Donald E. Porter and Emmett  Witchel

22

TxLinux case study TxLinux – modifies Linux synchronization

primitives to use hardware transactions [SOSP 2007]

Linux

TxLinux-xs

TxLinux-cx

Linux

TxLinux-xs

TxLinux-cx

Linux

TxLinux-xs

TxLinux-cx

Linux

TxLinux-xs

TxLinux-cx

Linux

TxLinux-xs

TxLinux-cx

Linux

TxLinux-xs

TxLinux-cx

pmake bonnie++ mab find config dpunish

0

2

4

6

8

10

12

14

aborts spins

% K

erne

l Tim

e Sp

ent S

ynch

roni

zing

16 CPUs – graph taken from SOSP talkLower is better

Page 23: Donald E. Porter and Emmett  Witchel

23

Bonnie++ pathology Simple execution profiling indicated ext3 file

system journaling code was the culprit Code inspection yielded no clear culprit What information missing?

What variable causing the contention What other code is contending with the

transaction Syncchar tool showed:

Contended variable High probability (88-92%) of asymmetric conflict

Page 24: Donald E. Porter and Emmett  Witchel

24

Bonnie++ pathology, explained

False asymmetric conflicts for unrelated bits Tuned by moving state lock to dedicated

cache line

lock(buffer->state);... xbegin(); ... assert(locked(buffer->state)); ... xend();...unlock(buffer->state);

struct bufferhead{ … bit state; bit dirty; bit free; …};

Tx RW

Page 25: Donald E. Porter and Emmett  Witchel

25

Tuned performance – 16 CPUs

bonnie++ MAB pmake radix-0.2-1.66533453693773E-16

0.20.40.60.8

11.2

TxLinuxTxLinux Tuned

Exec

utio

n Ti

me

(s)

>10 s

Tuned performance strictly dominates TxLinux

Lower is better

Page 26: Donald E. Porter and Emmett  Witchel

26

This talk Motivating example Syncchar performance model Experiences with transactional memory

Performance tuning case study System integration challenges

Compiler (motivation) Architecture Operating system

Page 27: Donald E. Porter and Emmett  Witchel

27

HTM designs must handle TLB misses

Some best effort HTM designs cannot handle TLB misses Sun Rock

What percent of STAMP txns would abort for TLB misses? 2% for kmeans 50-100%

How many times will these transactions restart? 3 (ssca2) 908 (bayes)

Practical HTM designs must handle TLB misses

Page 28: Donald E. Porter and Emmett  Witchel

28

Input size Simulation studies need scaled inputs

Simulating 1 second takes hours to weeks STAMP comes with parameters for real

and simulated environments

Page 29: Donald E. Porter and Emmett  Witchel

29

Input size

8 16 32 8 16 32 8 16 3205

1015202530 Speedup normalized to 1 CPU – Higher is

better BigSim

Spee

dup

genome ssca2 yada Simulator inputs too small to amortize

costs of scheduling threads

Page 30: Donald E. Porter and Emmett  Witchel

30

System calls – memory allocation

xbegin();malloc();xend();

Thread 1

Common case behavior:Rollback of transaction rolls back heap bookkeeping

HeapPages: 2

Allocated FreeLegend

Page 31: Donald E. Porter and Emmett  Witchel

31

System calls – memory allocation

xbegin();malloc();xend();

Thread 1

Heap

Uncommon case behavior:Allocator adds pages to heapRolls back bookkeeping, leaking pages

Pages: 2Pages: 3

Pathological memory leaks in STAMP genome and labyrinth benchmark

Allocated FreeLegend

Page 32: Donald E. Porter and Emmett  Witchel

32

System integration issues Developers need tools to identify these

subtle issues Indicated by poor performance despite

good predictions from Syncchar Pain for early adopters, important for

designers System call support evolving in OS

community xCalls [Volos et al. – Eurosys 2009]

Userspace compensation built on transactional pause

TxOS [Porter et al. – SOSP 2009] Kernel support for transactional system calls

Page 33: Donald E. Porter and Emmett  Witchel

33

Related work TM performance models

von Praun et al. [PPoPP ’07] – Dependence density

Heindl and Pokam [Computer Networks 2009] – analytic model of STM performance

HTM conflict behavior Bobba et al. [ISCA 2007] Ramadan et al. [MICRO 2008] Pant and Byrd [ICS 2009] Shriraman and Dwarkadas [ICS 2009]

Page 34: Donald E. Porter and Emmett  Witchel

34

Conclusion Developers need tools for tuning TM

performance Syncchar provides practical techniques Identified system integration challenges

for TM

Code available at: http://syncchar.code.csres.utexas.edu

[email protected]

Page 35: Donald E. Porter and Emmett  Witchel

35

Backup slides