
Page 1

Improving Database Performance on Simultaneous Multithreading Processors

Jingren Zhou, Microsoft Research
jrzhou@microsoft.com

John Cieslewicz, Columbia University
[email protected]

Kenneth A. Ross, Columbia University
[email protected]

Mihir Shah, Columbia University
[email protected]

Page 2

Simultaneous Multithreading (SMT)

• Available on modern CPUs:
  • “Hyperthreading” on Pentium 4 and Xeon
  • IBM POWER5
  • Sun UltraSPARC IV
• Challenge: design software to efficiently utilize SMT.
• This talk: database software on an Intel Pentium 4 with Hyperthreading.

Page 3

Superscalar Processor (no SMT)

[Diagram: one instruction stream issuing through a superscalar pipeline (up to 2 instructions/cycle) over time.]

• Improved instruction-level parallelism.
• CPI = 3/4.

Page 4

SMT Processor

[Diagram: two instruction streams sharing the pipeline over time.]

• Improved thread-level parallelism.
• More opportunities to keep the processor busy.
• But sometimes SMT does not work so well.
• CPI = 5/8.

Page 5

Stalls

[Diagram: instruction stream 1 stalls while instruction stream 2 continues to issue. CPI = 3/4: progress despite the stalled thread.]

• Stalls are due to cache misses (200-300 cycles for an L2 miss), branch mispredictions (20-30 cycles), etc.

Page 6

Memory Consistency

[Diagram: instruction streams 1 and 2 access the same cache line over time.]

• Conflicting accesses to a common cache line are detected: flush the pipeline + sync the cache with RAM.
• This is a “MOMC Event” on the Pentium 4 (300-350 cycles). (A false-sharing sketch follows.)
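To make the hazard concrete, here is a minimal sketch (mine, not from the talk) of two threads bumping adjacent counters. In the unpadded layout both counters share one cache line, so the two hyperthreads' writes conflict on it; padding each counter to the 128-byte Pentium 4 L2 line size keeps them on separate lines, so the conflict detection never fires.

    #include <atomic>
    #include <functional>
    #include <thread>

    // Unpadded: both counters share a cache line, so concurrent writes from
    // the two hyperthreads conflict on it and trigger MOMC events.
    struct SharedCounters {
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };

    // Padded: each counter gets its own 128-byte line (the Pentium 4 L2
    // line size quoted later in the talk), so the threads never touch a
    // common line.
    struct PaddedCounters {
        alignas(128) std::atomic<long> a{0};
        alignas(128) std::atomic<long> b{0};
    };

    void bump(std::atomic<long>& c) {
        for (int i = 0; i < 10000000; ++i)
            c.fetch_add(1, std::memory_order_relaxed);
    }

    int main() {
        PaddedCounters p;   // swap in SharedCounters to see the MOMC penalty
        std::thread t1(bump, std::ref(p.a));
        std::thread t2(bump, std::ref(p.b));
        t1.join();
        t2.join();
    }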

Page 7

SMT Processor

• Exposes multiple “logical” CPUs, one per instruction stream.
• One physical CPU (~5% extra silicon to duplicate thread state information).
• Better than single threading:
  • Increased thread-level parallelism.
  • Improved processor utilization when one thread blocks.
• Not as good as two physical CPUs:
  • CPU resources are shared, not replicated.

Page 8

SMT Challenges

• Resource competition: shared execution units, shared cache.
• Thread coordination: locking, etc. has high overhead.
• False sharing: MOMC events.

Page 9

Approaches to using SMT

• Ignore it, and write single-threaded code.
• Naïve parallelism: pretend the logical CPUs are physical CPUs.
• SMT-aware parallelism: parallel threads designed to avoid SMT-related interference.
• Use one thread for the algorithm and another to manage resources, e.g., to avoid stalls on cache misses.

Page 10

Naïve Parallelism

• Treat the SMT processor as if it were multi-core.
• Databases are already designed to utilize multiple processors, so no code modification is needed.
• Uses shared processor resources inefficiently:
  • Cache pollution / interference.
  • Competition for execution units.

Page 11

SMT-Aware Parallelism

• Exploit intra-operator parallelism: divide the input and use a separate thread to process each part.
  • E.g., one thread for even tuples, one for odd tuples.
  • No explicit partitioning step is required.
• Sharing the input just means multiple readers: no MOMC events, because two reads don't conflict. (A sketch follows.)
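A minimal sketch of this even/odd split; the Tuple layout and the per-tuple work are hypothetical stand-ins, not the paper's operator code:

    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Tuple { int key; int payload; };

    // Hypothetical per-tuple work standing in for the operator's inner loop.
    long process(const Tuple& t) { return t.key + t.payload; }

    // Thread 0 takes even positions, thread 1 odd ones: no partitioning
    // pass, and both threads only read the shared input, so no MOMC events.
    void probe_interleaved(const std::vector<Tuple>& input, long results[2]) {
        auto worker = [&input](std::size_t start, long* out) {
            long sum = 0;   // thread-private: no shared writes in the loop
            for (std::size_t i = start; i < input.size(); i += 2)
                sum += process(input[i]);
            *out = sum;     // a single write at the end
        };
        std::thread even(worker, 0, &results[0]);
        std::thread odd(worker, 1, &results[1]);
        even.join();
        odd.join();
    }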

Page 12

SMT-Aware Parallelism (cont.)

• Sharing output is challenging:
  • Thread coordination for output.
  • Read/write and write/write conflicts on common cache lines (MOMC events).
• “Solution:” partition the output.
  • Each thread writes to a separate memory buffer to avoid memory conflicts.
  • Needs an extra merge step in the consumer of the output stream.
  • Difficult to maintain input order in the output. (A sketch follows.)
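Continuing the sketch above (same Tuple and includes), a hedged illustration of the output side: each thread appends to its own buffer, so writes never land on a common cache line, at the price of a later merge. The match predicate is a hypothetical stand-in.

    // Each thread owns one output vector: writes never share a cache line.
    // The consumer must merge out[0] and out[1] afterwards, and the original
    // interleaved input order is hard to restore.
    void probe_partitioned(const std::vector<Tuple>& input,
                           std::vector<Tuple> out[2]) {
        auto worker = [&input](std::size_t start, std::vector<Tuple>* dst) {
            for (std::size_t i = start; i < input.size(); i += 2)
                if (input[i].key % 2 == 0)    // hypothetical match predicate
                    dst->push_back(input[i]);
        };
        std::thread even(worker, 0, &out[0]);
        std::thread odd(worker, 1, &out[1]);
        even.join();
        odd.join();
    }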

Page 13

Managing Resources for SMT

• Cache misses are a well-known performance bottleneck for modern database systems.
  • Mainly L2 data cache misses, but also L1 instruction cache misses [Ailamaki et al. 98].
• Goal: use a “helper” thread to avoid cache misses in the “main” thread.
  • Load future memory references into the cache.
  • An explicit load, not a prefetch.

Page 14

Data Dependency

• Memory references that depend upon a previous memory access exhibit a data dependency.
• E.g., a hash table lookup: hash bucket → overflow cells → tuple, where each step dereferences a pointer produced by the previous one. (A sketch follows.)

[Diagram: hash buckets pointing to a chain of overflow cells, ending at the tuple.]
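A minimal sketch of the dependent chase in a chained hash table (the layouts and hash function are hypothetical): each load's address comes out of the previous load, so the misses cannot overlap.

    #include <cstddef>

    struct Tuple  { int key; /* payload ... */ };
    struct Cell   { Tuple* tuple; Cell* next; };   // overflow chain
    struct Bucket { Cell* head; };

    // Hypothetical multiplicative hash.
    std::size_t hash(int k) { return static_cast<std::size_t>(k) * 2654435761u; }

    // bucket -> cell -> tuple: each dereference waits on the one before it.
    // If every node sits on a cold line, each hop pays a full L2 miss.
    Tuple* lookup(Bucket* buckets, std::size_t nbuckets, int key) {
        Bucket& b = buckets[hash(key) % nbuckets];         // miss 1: the bucket
        for (Cell* c = b.head; c != nullptr; c = c->next)  // miss per overflow cell
            if (c->tuple->key == key)                      // miss: the tuple itself
                return c->tuple;
        return nullptr;
    }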

Page 15

Data Dependency (cont.)

• Data dependencies make instruction-level parallelism harder.
• Modern architectures provide prefetch instructions:
  • Request that data be brought into the cache.
  • Non-blocking.
• Pitfalls (illustrated below):
  • Prefetch instructions are frequently dropped.
  • Difficult to tune.
  • Too much prefetching can pollute the cache.
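For concreteness, a hedged sketch of prefetching over the lookup above, using GCC/Clang's __builtin_prefetch (on the Pentium 4 this lowers to an SSE prefetch); the lookahead distance of 8 is an assumed value, and the slide's "difficult to tune" warning is exactly about picking it.

    // Hint the cache about the bucket we will probe DIST iterations from
    // now. The hint is non-blocking and advisory: the hardware may drop it,
    // and too large a DIST evicts still-needed data (cache pollution).
    constexpr std::size_t DIST = 8;   // assumed lookahead; must be tuned
    void probe_all(Bucket* buckets, std::size_t nbuckets,
                   const int* keys, std::size_t nkeys, Tuple** results) {
        for (std::size_t i = 0; i < nkeys; ++i) {
            if (i + DIST < nkeys)
                __builtin_prefetch(&buckets[hash(keys[i + DIST]) % nbuckets]);
            results[i] = lookup(buckets, nbuckets, keys[i]);
        }
    }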

Page 16

Staging Computation

[Diagram: hash bucket A, overflow cells B and C, and the target tuple, each on its own cache line.]

1. Preload A.
2. (other work)
3. Process A.
4. Preload B.
5. (other work)
6. Process B.
7. Preload C.
8. (other work)
9. Process C.
10. Preload tuple.
11. (other work)
12. Process tuple.

(Assumes each element is a cache line.)

Page 17

Staging Computation (cont.)

• By overlapping memory latency with other work, some cache miss latency can be hidden.
• Many probes are “in flight” at the same time.
• Algorithms need to be rewritten, e.g., Chen et al. [2004], Harizopoulos et al. [2004]. (A sketch follows.)
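In that spirit, a simplified two-stage sketch of staging the bucket chase (reusing the structs from the earlier lookup sketch). This is a rough rendering of group prefetching in the style of Chen et al., not their code: a whole group of probes advances one stage per pass, so the preload issued for one probe overlaps work on the others.

    enum class Stage { Bucket, Chain, Done };

    struct Probe {                        // one in-flight lookup
        int    key;
        Stage  stage  = Stage::Bucket;
        Cell*  cell   = nullptr;
        Tuple* result = nullptr;
    };

    void run_group(Bucket* buckets, std::size_t nbuckets, Probe* group, int n) {
        int done = 0;
        while (done < n) {
            done = 0;
            for (int i = 0; i < n; ++i) {   // one stage per probe per pass
                Probe& p = group[i];
                switch (p.stage) {
                case Stage::Bucket:         // find and preload the chain head
                    p.cell = buckets[hash(p.key) % nbuckets].head;
                    if (p.cell) { __builtin_prefetch(p.cell); p.stage = Stage::Chain; }
                    else p.stage = Stage::Done;
                    break;
                case Stage::Chain:          // line was (hopefully) preloaded
                    if (p.cell->tuple->key == p.key) {
                        p.result = p.cell->tuple;
                        p.stage = Stage::Done;
                    } else if ((p.cell = p.cell->next)) {
                        __builtin_prefetch(p.cell);   // preload the next hop
                    } else {
                        p.stage = Stage::Done;
                    }
                    break;
                case Stage::Done:
                    ++done;
                    break;
                }
            }
        }
    }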

Page 18

Work-Ahead Set: Main Thread

• Writes memory address + computation state to the work-ahead set.
• Retrieves a previous address + state.
• The hope is that the helper thread preloads the data before the main thread retrieves it.
• Correct whether or not the helper thread succeeds at preloading, because the helper thread is read-only. (A sketch follows.)
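A hedged sketch of the main-thread side, modeling the work-ahead set as a fixed ring of (state, address) slots. The slot layout and resume_probe are assumptions; 128 is the size the talk arrives at later.

    #include <cstddef>

    constexpr std::size_t kSlots = 128;   // size chosen experimentally (see below)

    struct Slot {
        void* address;   // memory the main thread will need soon
        int   state;     // enough computation state to resume that probe
    };

    Slot work_ahead[kSlots];

    // Hypothetical: continue a staged probe at `state` now that *address
    // should be cache resident.
    void resume_probe(int state, void* address) { /* ... */ }

    // Post the current probe's next address + state into slot i and pick up
    // whatever was posted there kSlots steps ago. If the helper preloaded
    // it, the resume hits in cache; if not, we simply take the miss. Either
    // way the result is correct, because the helper only reads.
    void main_thread_step(std::size_t& i, void* next_addr, int next_state) {
        Slot prev = work_ahead[i];
        work_ahead[i] = Slot{next_addr, next_state};
        if (prev.address != nullptr)
            resume_probe(prev.state, prev.address);
        i = (i + 1) % kSlots;
    }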

Page 19

Work-ahead Set Data Structure

[Diagram: a ring of (state, address) slots; the main thread has filled entries A through F, all in state 1.]

Page 20

Work-ahead Set Data Structure

[Diagram: the main thread continues around the ring: entry G in state 1, then entries H through L in state 2.]

Page 21

Work-Ahead Set: Helper Thread

• Reads memory addresses from the work-ahead set and loads their contents.
• The data becomes cache resident.
• Tries to preload data before the main thread cycles around; if successful, the main thread experiences cache hits.

Page 22

Work-ahead Set Data Structure

[Diagram: the helper thread walks the (state, address) slots, loading each posted address.]

Helper thread: “temp += *slot[i]” (A fuller sketch follows.)
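A hedged sketch of the helper loop around the slide's “temp += *slot[i]” trick, reusing the work-ahead ring above; the shutdown flag and the volatile read are my assumptions. The reads race with the main thread's writes by design: a stale address only costs a useless load, never a wrong answer.

    // Walk the ring backwards (next slide) and touch each posted address.
    // The volatile read is an explicit, blocking load - unlike a prefetch it
    // cannot be dropped - and the value is discarded. The helper never
    // writes shared state, so it cannot affect the main thread's results.
    void helper_thread(const volatile bool& stop) {
        std::size_t i = 0;
        long temp = 0;
        while (!stop) {
            i = (i + kSlots - 1) % kSlots;   // i = (i - 1) mod size
            const volatile long* p =
                static_cast<const volatile long*>(work_ahead[i].address);
            if (p) temp += *p;               // "temp += *slot[i]"
        }
        (void)temp;
    }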

Page 23

Iterate Backwards!

[Diagram: the helper thread scans the slots in decreasing index order.]

• i = (i - 1) mod size
• Why? See paper.

Page 24

Helper Thread Speed

• If the helper thread is faster than the main thread:
  • There is more computation than memory latency.
  • The helper thread should not preload a slot twice (wasted CPU cycles); see the paper for how redundant loads are stopped. (An illustrative guard follows.)
• If the helper thread is slower:
  • No special tuning is necessary; the main thread will absorb some cache misses.
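The talk defers the redundant-load mechanism to the paper. Purely as an illustration of the problem, one plausible guard (my assumption, not the paper's method) tags each slot with a generation counter that the helper remembers, so a fast helper skips slots it has already preloaded.

    // Hypothetical guard, not the paper's scheme: the main thread bumps
    // `gen` when it overwrites a slot; the helper preloads a slot only when
    // `gen` has changed since its last visit.
    struct TaggedSlot {
        void*    address;
        int      state;
        unsigned gen;             // incremented by the main thread on write
    };

    TaggedSlot tagged[kSlots];
    unsigned   last_seen[kSlots]; // private to the helper thread

    bool should_preload(std::size_t i) {
        unsigned g = tagged[i].gen;
        if (g == last_seen[i]) return false;  // already loaded this entry
        last_seen[i] = g;
        return true;
    }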

Page 25

Work-Ahead Set Size

• Too large: cache pollution. Preloaded data evicts other preloaded data before it can be used.
• Too small: thread contention. Many MOMC events, because the work-ahead set spans few cache lines.
• Just right: experimentally determined. Use the smallest size within the acceptable range (performance plateaus there), so that cache space remains available for other purposes (for us, 128 entries).
• The data structure itself is much smaller than the L2 cache.

Page 26

Experimental Workload

• Two operators: the probe phase of a hash join, and a CSB+ tree index join.
• Operators run in isolation and in parallel.
• Intel VTune used to measure hardware events.

CPU:                 Pentium 4, 3.4 GHz
Memory:              2 GB DDR
L1, L2 size:         8 KB, 512 KB
L1, L2 cache line:   64 B, 128 B
L1 miss latency:     18 cycles
L2 miss latency:     276 cycles
MOMC latency:        ~300+ cycles

Page 27

Experimental Outline

1. Hash join
2. Index lookup
3. Mixed: hash join and index lookup

Page 28

Hash Join: Comparative Performance

[Chart]

Page 29

Hash Join: L2 Cache Misses per Tuple

[Chart]

Page 30

CSB+ Tree Index Join: Comparative Performance

[Chart]

Page 31

CSB+ Tree Index Join: L2 Cache Misses per Tuple

[Chart]

Page 32

Parallel Operator Performance

[Chart; labeled gains: 52%, 55%, 20%]

Page 33

Parallel Operator Performance

[Chart; labeled gains: 26%, 29%]

Page 34

Conclusion

                     Naïve parallel   SMT-aware      Work-ahead
Impl. effort         Small            Moderate       Moderate
Data format          Unchanged        Split output   Unchanged
Data order           Unchanged        Changed        Unchanged*
Performance (row)    Moderate         High           High
Performance (col)    Moderate         High           Moderate
Control of cache     No               No             Yes