
Page 1

Improving Database Performance on Simultaneous Multithreading Processors

Jingren Zhou, Microsoft Research
jrzhou@microsoft.com

John Cieslewicz, Columbia University
[email protected]

Kenneth A. Ross, Columbia University
[email protected]

Mihir Shah, Columbia University
[email protected]

Page 2

Simultaneous Multithreading (SMT)

• Available on modern CPUs:
  • “Hyperthreading” on Pentium 4 and Xeon
  • IBM POWER5
  • Sun UltraSPARC IV
• Challenge: design software to efficiently utilize SMT.
• This talk: database software on an Intel Pentium 4 with Hyperthreading.

Page 3

Superscalar Processor (no SMT)

[Diagram: one instruction stream issuing through a superscalar pipeline (up to 2 instructions/cycle) over time.]

• Improved instruction-level parallelism.
• CPI = 3/4.

Page 4

SMT Processor

[Diagram: two instruction streams sharing the pipeline over time.]

• Improved thread-level parallelism.
• More opportunities to keep the processor busy.
• But sometimes SMT does not work so well.
• CPI = 5/8.

Page 5

Stalls

[Diagram: instruction stream 1 stalls while instruction stream 2 continues to issue. CPI = 3/4: progress despite the stalled thread.]

• Stalls are due to cache misses (200-300 cycles for an L2 miss), branch mispredictions (20-30 cycles), etc.

Page 6

Memory Consistency

[Diagram: instruction streams 1 and 2 access the same cache line over time.]

• Conflicting accesses to a common cache line are detected: flush the pipeline + sync the cache with RAM.
• This is a “MOMC Event” on the Pentium 4 (300-350 cycles). (A false-sharing sketch follows.)
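To make the hazard concrete, here is a minimal sketch (mine, not from the talk) of two threads bumping adjacent counters. In the unpadded layout both counters share one cache line, so the two hyperthreads' writes conflict on it; padding each counter to the 128-byte Pentium 4 L2 line size keeps them on separate lines, so the conflict detection never fires.

    #include <atomic>
    #include <functional>
    #include <thread>

    // Unpadded: both counters share a cache line, so concurrent writes from
    // the two hyperthreads conflict on it and trigger MOMC events.
    struct SharedCounters {
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };

    // Padded: each counter gets its own 128-byte line (the Pentium 4 L2
    // line size quoted later in the talk), so the threads never touch a
    // common line.
    struct PaddedCounters {
        alignas(128) std::atomic<long> a{0};
        alignas(128) std::atomic<long> b{0};
    };

    void bump(std::atomic<long>& c) {
        for (int i = 0; i < 10000000; ++i)
            c.fetch_add(1, std::memory_order_relaxed);
    }

    int main() {
        PaddedCounters p;   // swap in SharedCounters to see the MOMC penalty
        std::thread t1(bump, std::ref(p.a));
        std::thread t2(bump, std::ref(p.b));
        t1.join();
        t2.join();
    }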

Page 7

SMT Processor

• Exposes multiple “logical” CPUs, one per instruction stream.
• One physical CPU (~5% extra silicon to duplicate thread state information).
• Better than single threading:
  • Increased thread-level parallelism.
  • Improved processor utilization when one thread blocks.
• Not as good as two physical CPUs:
  • CPU resources are shared, not replicated.

Page 8

SMT Challenges

• Resource competition: shared execution units, shared cache.
• Thread coordination: locking, etc. has high overhead.
• False sharing: MOMC events.

Page 9

Approaches to using SMT

• Ignore it, and write single-threaded code.
• Naïve parallelism: pretend the logical CPUs are physical CPUs.
• SMT-aware parallelism: parallel threads designed to avoid SMT-related interference.
• Use one thread for the algorithm and another to manage resources, e.g., to avoid stalls on cache misses.

Page 10

Naïve Parallelism

• Treat the SMT processor as if it were multi-core.
• Databases are already designed to utilize multiple processors, so no code modification is needed.
• Uses shared processor resources inefficiently:
  • Cache pollution / interference.
  • Competition for execution units.

Page 11

SMT-Aware Parallelism

• Exploit intra-operator parallelism: divide the input and use a separate thread to process each part.
  • E.g., one thread for even tuples, one for odd tuples.
  • No explicit partitioning step is required.
• Sharing the input just means multiple readers: no MOMC events, because two reads don't conflict. (A sketch follows.)
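A minimal sketch of this even/odd split; the Tuple layout and the per-tuple work are hypothetical stand-ins, not the paper's operator code:

    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Tuple { int key; int payload; };

    // Hypothetical per-tuple work standing in for the operator's inner loop.
    long process(const Tuple& t) { return t.key + t.payload; }

    // Thread 0 takes even positions, thread 1 odd ones: no partitioning
    // pass, and both threads only read the shared input, so no MOMC events.
    void probe_interleaved(const std::vector<Tuple>& input, long results[2]) {
        auto worker = [&input](std::size_t start, long* out) {
            long sum = 0;   // thread-private: no shared writes in the loop
            for (std::size_t i = start; i < input.size(); i += 2)
                sum += process(input[i]);
            *out = sum;     // a single write at the end
        };
        std::thread even(worker, 0, &results[0]);
        std::thread odd(worker, 1, &results[1]);
        even.join();
        odd.join();
    }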

Page 12

SMT-Aware Parallelism (cont.)

• Sharing output is challenging:
  • Thread coordination for output.
  • Read/write and write/write conflicts on common cache lines (MOMC events).
• “Solution:” partition the output.
  • Each thread writes to a separate memory buffer to avoid memory conflicts.
  • Needs an extra merge step in the consumer of the output stream.
  • Difficult to maintain input order in the output. (A sketch follows.)
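Continuing the sketch above (same Tuple and includes), a hedged illustration of the output side: each thread appends to its own buffer, so writes never land on a common cache line, at the price of a later merge. The match predicate is a hypothetical stand-in.

    // Each thread owns one output vector: writes never share a cache line.
    // The consumer must merge out[0] and out[1] afterwards, and the original
    // interleaved input order is hard to restore.
    void probe_partitioned(const std::vector<Tuple>& input,
                           std::vector<Tuple> out[2]) {
        auto worker = [&input](std::size_t start, std::vector<Tuple>* dst) {
            for (std::size_t i = start; i < input.size(); i += 2)
                if (input[i].key % 2 == 0)    // hypothetical match predicate
                    dst->push_back(input[i]);
        };
        std::thread even(worker, 0, &out[0]);
        std::thread odd(worker, 1, &out[1]);
        even.join();
        odd.join();
    }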

Page 13

Managing Resources for SMT

• Cache misses are a well-known performance bottleneck for modern database systems.
  • Mainly L2 data cache misses, but also L1 instruction cache misses [Ailamaki et al. 98].
• Goal: use a “helper” thread to avoid cache misses in the “main” thread.
  • Load future memory references into the cache.
  • An explicit load, not a prefetch.

Page 14

Data Dependency

• Memory references that depend upon a previous memory access exhibit a data dependency.
• E.g., a hash table lookup: hash bucket → overflow cells → tuple, where each step dereferences a pointer produced by the previous one. (A sketch follows.)

[Diagram: hash buckets pointing to a chain of overflow cells, ending at the tuple.]
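A minimal sketch of the dependent chase in a chained hash table (the layouts and hash function are hypothetical): each load's address comes out of the previous load, so the misses cannot overlap.

    #include <cstddef>

    struct Tuple  { int key; /* payload ... */ };
    struct Cell   { Tuple* tuple; Cell* next; };   // overflow chain
    struct Bucket { Cell* head; };

    // Hypothetical multiplicative hash.
    std::size_t hash(int k) { return static_cast<std::size_t>(k) * 2654435761u; }

    // bucket -> cell -> tuple: each dereference waits on the one before it.
    // If every node sits on a cold line, each hop pays a full L2 miss.
    Tuple* lookup(Bucket* buckets, std::size_t nbuckets, int key) {
        Bucket& b = buckets[hash(key) % nbuckets];         // miss 1: the bucket
        for (Cell* c = b.head; c != nullptr; c = c->next)  // miss per overflow cell
            if (c->tuple->key == key)                      // miss: the tuple itself
                return c->tuple;
        return nullptr;
    }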

Page 15

Data Dependency (cont.)

• Data dependencies make instruction-level parallelism harder.
• Modern architectures provide prefetch instructions:
  • Request that data be brought into the cache.
  • Non-blocking.
• Pitfalls (illustrated below):
  • Prefetch instructions are frequently dropped.
  • Difficult to tune.
  • Too much prefetching can pollute the cache.
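For concreteness, a hedged sketch of prefetching over the lookup above, using GCC/Clang's __builtin_prefetch (on the Pentium 4 this lowers to an SSE prefetch); the lookahead distance of 8 is an assumed value, and the slide's "difficult to tune" warning is exactly about picking it.

    // Hint the cache about the bucket we will probe DIST iterations from
    // now. The hint is non-blocking and advisory: the hardware may drop it,
    // and too large a DIST evicts still-needed data (cache pollution).
    constexpr std::size_t DIST = 8;   // assumed lookahead; must be tuned
    void probe_all(Bucket* buckets, std::size_t nbuckets,
                   const int* keys, std::size_t nkeys, Tuple** results) {
        for (std::size_t i = 0; i < nkeys; ++i) {
            if (i + DIST < nkeys)
                __builtin_prefetch(&buckets[hash(keys[i + DIST]) % nbuckets]);
            results[i] = lookup(buckets, nbuckets, keys[i]);
        }
    }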

Page 16

Staging Computation

[Diagram: hash bucket A, overflow cells B and C, and the target tuple, each on its own cache line.]

1. Preload A.
2. (other work)
3. Process A.
4. Preload B.
5. (other work)
6. Process B.
7. Preload C.
8. (other work)
9. Process C.
10. Preload tuple.
11. (other work)
12. Process tuple.

(Assumes each element is a cache line.)

Page 17

Staging Computation (cont.)

• By overlapping memory latency with other work, some cache miss latency can be hidden.
• Many probes are “in flight” at the same time.
• Algorithms need to be rewritten, e.g., Chen et al. [2004], Harizopoulos et al. [2004]. (A sketch follows.)
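In that spirit, a simplified two-stage sketch of staging the bucket chase (reusing the structs from the earlier lookup sketch). This is a rough rendering of group prefetching in the style of Chen et al., not their code: a whole group of probes advances one stage per pass, so the preload issued for one probe overlaps work on the others.

    enum class Stage { Bucket, Chain, Done };

    struct Probe {                        // one in-flight lookup
        int    key;
        Stage  stage  = Stage::Bucket;
        Cell*  cell   = nullptr;
        Tuple* result = nullptr;
    };

    void run_group(Bucket* buckets, std::size_t nbuckets, Probe* group, int n) {
        int done = 0;
        while (done < n) {
            done = 0;
            for (int i = 0; i < n; ++i) {   // one stage per probe per pass
                Probe& p = group[i];
                switch (p.stage) {
                case Stage::Bucket:         // find and preload the chain head
                    p.cell = buckets[hash(p.key) % nbuckets].head;
                    if (p.cell) { __builtin_prefetch(p.cell); p.stage = Stage::Chain; }
                    else p.stage = Stage::Done;
                    break;
                case Stage::Chain:          // line was (hopefully) preloaded
                    if (p.cell->tuple->key == p.key) {
                        p.result = p.cell->tuple;
                        p.stage = Stage::Done;
                    } else if ((p.cell = p.cell->next)) {
                        __builtin_prefetch(p.cell);   // preload the next hop
                    } else {
                        p.stage = Stage::Done;
                    }
                    break;
                case Stage::Done:
                    ++done;
                    break;
                }
            }
        }
    }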

Page 18

Work-Ahead Set: Main Thread

• Writes memory address + computation state to the work-ahead set.
• Retrieves a previous address + state.
• The hope is that the helper thread preloads the data before the main thread retrieves it.
• Correct whether or not the helper thread succeeds at preloading, because the helper thread is read-only. (A sketch follows.)
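A hedged sketch of the main-thread side, modeling the work-ahead set as a fixed ring of (state, address) slots. The slot layout and resume_probe are assumptions; 128 is the size the talk arrives at later.

    #include <cstddef>

    constexpr std::size_t kSlots = 128;   // size chosen experimentally (see below)

    struct Slot {
        void* address;   // memory the main thread will need soon
        int   state;     // enough computation state to resume that probe
    };

    Slot work_ahead[kSlots];

    // Hypothetical: continue a staged probe at `state` now that *address
    // should be cache resident.
    void resume_probe(int state, void* address) { /* ... */ }

    // Post the current probe's next address + state into slot i and pick up
    // whatever was posted there kSlots steps ago. If the helper preloaded
    // it, the resume hits in cache; if not, we simply take the miss. Either
    // way the result is correct, because the helper only reads.
    void main_thread_step(std::size_t& i, void* next_addr, int next_state) {
        Slot prev = work_ahead[i];
        work_ahead[i] = Slot{next_addr, next_state};
        if (prev.address != nullptr)
            resume_probe(prev.state, prev.address);
        i = (i + 1) % kSlots;
    }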

Page 19

Work-ahead Set Data Structure

[Diagram: a ring of (state, address) slots; the main thread has filled entries A through F, all in state 1.]

Page 20

Work-ahead Set Data Structure

[Diagram: the main thread continues around the ring: entry G in state 1, then entries H through L in state 2.]

Page 21

Work-Ahead Set: Helper Thread

• Reads memory addresses from the work-ahead set and loads their contents.
• The data becomes cache resident.
• Tries to preload data before the main thread cycles around; if successful, the main thread experiences cache hits.

Page 22

Work-ahead Set Data Structure

[Diagram: the helper thread walks the (state, address) slots, loading each posted address.]

Helper thread: “temp += *slot[i]” (A fuller sketch follows.)
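A hedged sketch of the helper loop around the slide's “temp += *slot[i]” trick, reusing the work-ahead ring above; the shutdown flag and the volatile read are my assumptions. The reads race with the main thread's writes by design: a stale address only costs a useless load, never a wrong answer.

    // Walk the ring backwards (next slide) and touch each posted address.
    // The volatile read is an explicit, blocking load - unlike a prefetch it
    // cannot be dropped - and the value is discarded. The helper never
    // writes shared state, so it cannot affect the main thread's results.
    void helper_thread(const volatile bool& stop) {
        std::size_t i = 0;
        long temp = 0;
        while (!stop) {
            i = (i + kSlots - 1) % kSlots;   // i = (i - 1) mod size
            const volatile long* p =
                static_cast<const volatile long*>(work_ahead[i].address);
            if (p) temp += *p;               // "temp += *slot[i]"
        }
        (void)temp;
    }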

Page 23

Iterate Backwards!

[Diagram: the helper thread scans the slots in decreasing index order.]

• i = (i - 1) mod size
• Why? See paper.

Page 24

Helper Thread Speed

• If the helper thread is faster than the main thread:
  • There is more computation than memory latency.
  • The helper thread should not preload a slot twice (wasted CPU cycles); see the paper for how redundant loads are stopped. (An illustrative guard follows.)
• If the helper thread is slower:
  • No special tuning is necessary; the main thread will absorb some cache misses.
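The talk defers the redundant-load mechanism to the paper. Purely as an illustration of the problem, one plausible guard (my assumption, not the paper's method) tags each slot with a generation counter that the helper remembers, so a fast helper skips slots it has already preloaded.

    // Hypothetical guard, not the paper's scheme: the main thread bumps
    // `gen` when it overwrites a slot; the helper preloads a slot only when
    // `gen` has changed since its last visit.
    struct TaggedSlot {
        void*    address;
        int      state;
        unsigned gen;             // incremented by the main thread on write
    };

    TaggedSlot tagged[kSlots];
    unsigned   last_seen[kSlots]; // private to the helper thread

    bool should_preload(std::size_t i) {
        unsigned g = tagged[i].gen;
        if (g == last_seen[i]) return false;  // already loaded this entry
        last_seen[i] = g;
        return true;
    }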

Page 25

Work-Ahead Set Size

• Too large: cache pollution. Preloaded data evicts other preloaded data before it can be used.
• Too small: thread contention. Many MOMC events, because the work-ahead set spans few cache lines.
• Just right: experimentally determined. Use the smallest size within the acceptable range (performance plateaus there), so that cache space remains available for other purposes (for us, 128 entries).
• The data structure itself is much smaller than the L2 cache.

Page 26

Experimental Workload

• Two operators: the probe phase of a hash join, and a CSB+ tree index join.
• Operators run in isolation and in parallel.
• Intel VTune used to measure hardware events.

CPU:                 Pentium 4, 3.4 GHz
Memory:              2 GB DDR
L1, L2 size:         8 KB, 512 KB
L1, L2 cache line:   64 B, 128 B
L1 miss latency:     18 cycles
L2 miss latency:     276 cycles
MOMC latency:        ~300+ cycles

Page 27

Experimental Outline

1. Hash join
2. Index lookup
3. Mixed: hash join and index lookup

Page 28

Hash Join: Comparative Performance

[Chart]

Page 29

Hash Join: L2 Cache Misses per Tuple

[Chart]

Page 30

CSB+ Tree Index Join: Comparative Performance

[Chart]

Page 31

CSB+ Tree Index Join: L2 Cache Misses per Tuple

[Chart]

Page 32

Parallel Operator Performance

[Chart; labeled gains: 52%, 55%, 20%]

Page 33

Parallel Operator Performance

[Chart; labeled gains: 26%, 29%]

Page 34

Conclusion

                     Naïve parallel   SMT-aware      Work-ahead
Impl. effort         Small            Moderate       Moderate
Data format          Unchanged        Split output   Unchanged
Data order           Unchanged        Changed        Unchanged*
Performance (row)    Moderate         High           High
Performance (col)    Moderate         High           Moderate
Control of cache     No               No             Yes