does better throughput require worse latency?

Does Better Throughput Require Worse Latency? David Ungar, Doug Kimelman, Sam Adams, Mark Wegman IBM T. J. Watson Research Center Monday 7 January 2013

Upload: racesworkshop

Post on 01-Jul-2015

DESCRIPTION

Presentation by David Ungar. Paper and more information: http://soft.vub.ac.be/races/paper/does-better-throughput-require-worse-latency/

TRANSCRIPT

Page 1: Does Better Throughput Require Worse Latency?

Does Better Throughput Require Worse Latency?

David Ungar, Doug Kimelman, Sam Adams, Mark Wegman

IBM T. J. Watson Research Center

Monday 7 January 2013

Page 2: Does Better Throughput Require Worse Latency?

Example: On-Line Transaction Processing

✦ Large “database” (100 GB) of information

✦ Constant stream of incoming updates & queries

✦ Need many cores to handle the work

✦ Cores need to communicate updates

✦ roll-ups sum over many variables

✦ Trick:

✦ Caching - updates must sync with invalidates

✦ Replication - updates must propagate


Page 3: Does Better Throughput Require Worse Latency?

Assumptions

✦ Too much computation for one core

✦ Not trivially scalable;

✦ needs communication

✦ Inputs constantly changing

✦ No sub-space radio:

✦ communication finite and limiting


Page 4: Does Better Throughput Require Worse Latency?

Throughput vs Latency


Page 5: Does Better Throughput Require Worse Latency?

Throughput ~ Scaling

[Chart: vertical axis 0 to 100; horizontal axis 1 core to 100 cores; two lines labeled throughput = 1.0 and throughput = 0.25]
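One way to read the chart, offered as an assumption because the transcript does not define the vertical axis: if that axis is speedup over a single core, then relative throughput is speedup divided by core count, so the 1.0 line is perfect scaling and the 0.25 line reaches 25x on 100 cores. The function names below are hypothetical.

```c
/* Relative throughput, assuming it means achieved speedup per core.
   1.0 is perfect linear scaling; 0.25 means each added core
   contributes a quarter of a core's worth of work. */
double relative_throughput(double speedup, int cores) {
    return cores > 0 ? speedup / cores : 0.0;
}

/* Inverse view: the speedup expected at a given relative throughput. */
double speedup_at(double throughput, int cores) {
    return throughput * cores;
}
```

For example, relative_throughput(25.0, 100) gives 0.25, the lower line at the right-hand edge of the chart.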


Page 6: Does Better Throughput Require Worse Latency?

Latency

✦ Inter-core

✦ Data structure/algorithm level

✦ Time needed for cause (input, computation result) on one core to affect another

Δt

What is best possible latency (on a given platform)?


Page 7: Does Better Throughput Require Worse Latency?

Measure w/ Ring Counter

Ring counter: Cores 1 to 4 each spin on one of these loops

while (1) A = D;

while (1) B = A;

while (1) C = B;

while (1) D = C + 1;

Latency Baseline ≣ Time / Count / Number-of-Cores
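The four loops can be sketched as a small pthreads program. This is a minimal sketch under stated assumptions, not the authors' benchmark: relaxed C11 atomics stand in for "normal loads & stores" (so the compiler cannot hoist the loads out of the loops), threads are not pinned to cores, and the names ring_stage and ring_latency_ns are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime and nanosleep */
#endif
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define RING_SIZE 4               /* A, B, C, D on the slide */

static _Atomic long ring[RING_SIZE];
static atomic_bool stop_flag;

/* Stage i copies its predecessor; the last stage adds one (D = C + 1). */
static void *ring_stage(void *arg) {
    int i = (int)(intptr_t)arg;
    int prev = (i + RING_SIZE - 1) % RING_SIZE;
    long bump = (i == RING_SIZE - 1) ? 1 : 0;
    while (!atomic_load_explicit(&stop_flag, memory_order_relaxed)) {
        long v = atomic_load_explicit(&ring[prev], memory_order_relaxed);
        atomic_store_explicit(&ring[i], v + bump, memory_order_relaxed);
    }
    return NULL;
}

static double seconds_between(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Runs the ring for roughly run_ms milliseconds and returns
   Time / Count / Number-of-Cores in nanoseconds, or -1.0 if the
   counter never advanced or a thread could not be started. */
double ring_latency_ns(long run_ms) {
    pthread_t threads[RING_SIZE];
    struct timespec start, end;
    struct timespec nap = { run_ms / 1000, (run_ms % 1000) * 1000000L };
    int started = 0;

    for (int i = 0; i < RING_SIZE; i++) atomic_store(&ring[i], 0);
    atomic_store(&stop_flag, false);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < RING_SIZE; i++) {
        if (pthread_create(&threads[i], NULL, ring_stage,
                           (void *)(intptr_t)i) != 0) break;
        started++;
    }
    nanosleep(&nap, NULL);
    atomic_store(&stop_flag, true);
    for (int i = 0; i < started; i++) pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    long count = atomic_load(&ring[RING_SIZE - 1]);   /* final value of D */
    if (started < RING_SIZE || count <= 0) return -1.0;
    return seconds_between(start, end) * 1e9 / (double)count / RING_SIZE;
}
```

Calling ring_latency_ns(1000) runs the ring for about a second. The measured time includes thread start-up, so longer runs give a cleaner baseline, and on a machine with fewer than four hardware threads the figure mostly reflects scheduling rather than inter-core latency.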


Page 8: Does Better Throughput Require Worse Latency?

Ring Counter Latency Baselines

[Two charts, each plotting min and max latency (ns), 0 to 100, against # threads, 1 to 8 (4 cores, 2-way SMT): one for normal loads & stores, one for normal loads & stores + memory barrier]

Other platforms? Signals? Atomics?


Page 9: Does Better Throughput Require Worse Latency?

The Intuition

✦ After you have optimized:

✦ Suppose relative latency is 10

✦ Relative throughput is 1/4

✦ If you then raise throughput to 1/2

✦ Latency will increase to 20

Space of best algorithms exhibits this trade-off
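The numbers in this example fit a simple proportional model: doubling relative throughput from 1/4 to 1/2 doubles relative latency from 10 to 20. The slide gives only these two points, so the linear form below is an assumption, and projected_latency is a hypothetical name.

```c
/* Projects latency at a new throughput, assuming latency grows in
   proportion to throughput along the space of best algorithms.
   Only the two points on the slide support this; the true curve
   need not be linear. */
double projected_latency(double latency, double throughput,
                         double new_throughput) {
    return latency * (new_throughput / throughput);
}
```

With the slide's values, projected_latency(10.0, 0.25, 0.5) returns 20.0.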


Page 10: Does Better Throughput Require Worse Latency?

Variables

# readers

# writers

contention

reading/writing

Which Instructions

Normal loads & stores

Atomic loads & stores

Signals

Memory barriers


Page 11: Does Better Throughput Require Worse Latency?

Shared Counter

From McKenney’s PerfBook


Page 12: Does Better Throughput Require Worse Latency?

Shared Counter

From McKenney’s PerfBook

Technique | Write code | Read code | Latency | Throughput
Serial | C += delta | C | tiny | single-core
Mutex | lock, C += delta, unlock | C | small, unless writers convoy | higher, but writers have locking & contention overhead
Lock-Free | C += delta (atomic) | C | if contention, writers can starve | higher for low-contention writers
Per-thread | per-thread-C += delta | sum(all C’s) | high if many cores | higher for writers, lower for readers
Per-thread + cache | per-thread-C += delta | another thread maintains sum; read sum | higher: summing thread may be idle | high for both readers and writers
Race & Repair | C += delta | C | higher under contention: lost counts | high for both readers and writers
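Four of these counter designs can be sketched in C with pthreads and C11 atomics. This is an illustrative sketch, not code from McKenney's book: every function name is hypothetical, the 64-byte alignment is an assumed cache-line size, and the Race & Repair variant shows only the racy update, because the slides do not describe the repair step.

```c
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 64

/* Mutex: lock, C += delta, unlock. The read also locks here so the
   C program has no data race; the slide's reader loads C directly. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long mutex_count;

void mutex_add(long delta) {
    pthread_mutex_lock(&counter_lock);
    mutex_count += delta;
    pthread_mutex_unlock(&counter_lock);
}

long mutex_read(void) {
    pthread_mutex_lock(&counter_lock);
    long v = mutex_count;
    pthread_mutex_unlock(&counter_lock);
    return v;
}

/* Lock-Free: one atomic read-modify-write per update. */
static _Atomic long lockfree_count;

void lockfree_add(long delta) { atomic_fetch_add(&lockfree_count, delta); }
long lockfree_read(void) { return atomic_load(&lockfree_count); }

/* Per-thread: each writer owns one padded slot; readers sum every slot. */
struct padded_count { _Alignas(64) _Atomic long value; };
static struct padded_count per_thread[MAX_THREADS];

void per_thread_add(int tid, long delta) {
    long v = atomic_load_explicit(&per_thread[tid].value, memory_order_relaxed);
    atomic_store_explicit(&per_thread[tid].value, v + delta, memory_order_relaxed);
}

long per_thread_read(void) {
    long sum = 0;
    for (int i = 0; i < MAX_THREADS; i++)
        sum += atomic_load_explicit(&per_thread[i].value, memory_order_relaxed);
    return sum;
}

/* Race & Repair, race half only: a separate load and store on shared C,
   so concurrent updates can overwrite each other and counts can be lost. */
static _Atomic long racy_count;

void racy_add(long delta) {
    long v = atomic_load_explicit(&racy_count, memory_order_relaxed);
    atomic_store_explicit(&racy_count, v + delta, memory_order_relaxed);
}

long racy_read(void) {
    return atomic_load_explicit(&racy_count, memory_order_relaxed);
}

/* Hypothetical driver: each writer adds 1 to every counter iters times. */
struct job { int tid; long iters; };

static void *writer(void *arg) {
    struct job *j = arg;
    for (long i = 0; i < j->iters; i++) {
        mutex_add(1);
        lockfree_add(1);
        per_thread_add(j->tid, 1);
        racy_add(1);
    }
    return NULL;
}

/* Starts up to nthreads writers, waits for them, returns how many ran. */
int run_writers(int nthreads, long iters) {
    pthread_t threads[MAX_THREADS];
    struct job jobs[MAX_THREADS];
    int started = 0;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
    for (int i = 0; i < nthreads; i++) {
        jobs[i].tid = i;
        jobs[i].iters = iters;
        if (pthread_create(&threads[i], NULL, writer, &jobs[i]) != 0) break;
        started++;
    }
    for (int i = 0; i < started; i++) pthread_join(threads[i], NULL);
    return started;
}
```

After run_writers(4, 10000) returns 4, the mutex, lock-free, and per-thread reads all report 40000, while racy_read() can report less whenever two writers overlapped. That gap is the "lost counts" entry in the table.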


Page 19: Does Better Throughput Require Worse Latency?

Conclusions

✦ Throughput: how well parallelism gets work done

✦ Latency: how fast one core responds to another

✦ Lots of dimensions: # readers, # writers, contention

✦ Throughput vs Latency:

✦ throughput -> parallel -> distributed/replicated -> more latency
