Simple Criticality Predictors for Dynamic Performance and Power Management in CMPs Group Talk: Dec 10, 2008 Abhishek Bhattacharjee


Page 1:

Simple Criticality Predictors for Dynamic Performance and Power Management in CMPs

Group Talk: Dec 10, 2008

Abhishek Bhattacharjee

Page 2:

Why Thread Criticality Prediction (TCP)?

Significant load imbalance in typical parallel programs: threads of a parallel program don't finish at the same time

Worse with heterogeneity (process variation, thermal emergencies, etc.)

Relative thread speed, or criticality, is difficult to predict

If we can deduce that thread 2 is critical, and by how much, what can we do with this?

Page 3:

Performance Improvements from TCP

TCP to improve dynamic parallelism management performance (e.g., TBB): steal tasks from critical threads

Page 4:

Energy Efficiency from TCP

TCP-guided DVFS for barriers: slow down non-critical threads without affecting runtime

Page 5:

Our Goals

Develop low-overhead hardware TCP schemes: harness counters and metrics available on-chip, and aim for high accuracy across architectures

TCP should be general for a variety of applications: improve TBB performance with TCP-guided task stealing, and improve barrier energy efficiency with TCP-guided DVFS

Page 6:

Related Work

Thrifty Barrier [Li, Martinez, Huang, HPCA '05]: transition fast, non-critical threads into low-power sleep modes to save energy at barriers. Predicts barrier stall time based purely on history. Still wastes energy in the compute phase…

Meeting Points [Cai et al., PACT '08]: DVFS threads to save energy without performance penalty. Only applicable to parallel loops. Inserts meeting points to track loop iterations executed per core, broadcasts iteration counts to all cores, and uses software to calculate appropriate DVFS settings.

1. Unlike thrifty barrier, predict before the barrier is reached
2. Unlike meeting points, avoid specialized instructions & broadcasts
3. Unlike meeting points, broader applicability than parallel loops
4. Unlike both approaches, use software-independent criticality calculation
5. Target versatility: TBB performance, barrier energy waste, SMT priority schemes, memory priority schemes, etc.

Page 7:

Outline

Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs

Page 8:

Methodology: Simulators

TCP accuracy is harder with in-order pipelines

Assess energy savings accurately, and with the OS, on an emulator; assumes a fixed leakage cost of 50% of baseline energy
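The 50% fixed-leakage assumption caps how much total energy DVFS can recover, since only the dynamic half shrinks. A minimal accounting sketch (the 30% dynamic saving below is an arbitrary illustration, not a result from this talk):

```python
# Energy accounting under the slide's assumption: leakage is a fixed cost
# equal to 50% of baseline energy, so DVFS only shrinks the dynamic half.
BASELINE = 1.0
LEAKAGE = 0.5 * BASELINE   # fixed, per the methodology slide
DYNAMIC = 0.5 * BASELINE

def total_energy(dynamic_fraction_saved):
    """Total energy after DVFS removes a fraction of the dynamic component."""
    return LEAKAGE + DYNAMIC * (1.0 - dynamic_fraction_saved)

# A 30% dynamic saving (illustrative) is only a 15% total saving:
print(total_energy(0.30))  # 0.85
```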

Page 9:

Methodology: Benchmarks

Use larger, realistic data sets for Splash-2 [Bienia et al. PACT ’08]

Page 10:

Outline

Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs

Page 11:

Per-Core Fully-Local History

If behavior is sufficiently repetitive, can use fully-local history (thrifty barrier); difficult to achieve on in-order pipelines

Solution: target thread-comparative info
Rationale: the criticality of a single thread is determined by the others

Page 12:

Comparative Metric: Instruction Count

Normalize compute times and the metric against the critical thread

Page 13:

Comparative Metric: Instruction Count

Avg. error from barrier iterations and 10% execution snapshots of non-barrier apps (Swaptions, Fluidanimate)

Poor accuracy across all tested benchmarks

Page 14:

Comparative Metric: Cache Statistics

But Ocean and LU still suffer from over 25% error

Include L1 I-Cache Misses …

Page 15:

Comparative Metric: Cache Statistics

Instruction counts and control flow affect LU, Water-Nsq, Water-Sp

But Ocean still has over 22% error

Include L2 Cache Misses …

Page 16:

Comparative Metric: Cache Statistics

Memory-intensive Ocean, Volrend, Radix, PARSEC particularly improved

Now check weighted cache misses metric on out-of-order machine
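The weighted cache-miss metric built up over the last few slides can be sketched as follows. The exact per-miss weights are not given in this deck; the L2 weight below (an L2 miss costing ten L1 misses) is an assumption chosen only to illustrate the structure:

```python
# Weighted cache-miss criticality metric: L1 I-cache and D-cache misses
# count with a small weight, L2 misses (which stall far longer) with a
# larger one. Both weights here are illustrative assumptions.
L1_WEIGHT = 1
L2_WEIGHT = 10

def criticality_counter(l1i_misses, l1d_misses, l2_misses):
    """Per-core criticality estimate: more weighted misses => more critical."""
    return L1_WEIGHT * (l1i_misses + l1d_misses) + L2_WEIGHT * l2_misses

def most_critical(core_stats):
    """core_stats: {core_id: (l1i, l1d, l2)}. Returns the predicted critical core."""
    return max(core_stats, key=lambda c: criticality_counter(*core_stats[c]))

stats = {0: (100, 200, 5), 1: (80, 150, 40), 2: (120, 180, 2)}
print(most_critical(stats))  # 1: fewer L1 misses, but many costly L2 misses
```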

Page 17:

Comparative Metric: Cache Statistics

Memory-intensive Ocean, Volrend, Radix, PARSEC least impacted

Weighted cache misses is the most accurate metric

Page 18:

Comparative Metric: Control Flow and TLB Misses

Control flow (branch misprediction) penalties and instruction count effects are tracked with L1 I-cache misses

TLB misses: little variation among multiple threads (they usually access closely spaced data); trivial to include a weighted TLB component if necessary

Page 19:

Outline

Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs

Page 20:

Basic TCP Hardware

Criticality counters placed with the shared, unified L2 cache

Simple, scalable hardware

Eliminates broadcasts, since the L2 controller sees all cache misses

Can accommodate more cache levels, a split L2, or distributed LLCs with trivial hardware and messages

Page 21:

Outline

Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs

Page 22:

The TBB Task Scheduler

Concurrency is expressed in tasks
The TBB dynamic scheduler stores and distributes tasks
The scheduler controls each worker thread via a per-thread software queue
Threads try to extract tasks from their local queue
If their queue is empty, threads steal tasks from remote queues
If the steal is unsuccessful, they back off for a pre-determined time before retrying

Problem: the steal victim is chosen randomly, giving poor performance at higher core counts and high load imbalance

Page 23:

Our TCP-Guided TBB Stealing Algorithm

1. If (Cache miss from Core P) {
2.   Update criticality counter for Core P based on cache miss type
3. }
4. If (Steal request from a Core) {
5.   Scan all criticality counters to find the maximum value
6.   Report core with highest criticality counter value as steal victim
7. }
8. If (Message indicating steal from victim Core P unsuccessful) {
9.   Reset criticality counter for Core P
10. }
11. If ((Number of Cycles % Interval Bound) == 0)
12.   Reset all criticality counters

Page 24:

Hardware Details

Interval Bound = 100K cycles

Simple and scalable hardware: even at 64 cores with 14 bits per criticality counter, only 114 bytes of storage

Minimal message overhead

TCP access takes the same latency as an L2 cache miss

Page 25:

Steal Rate Improvements with TCP

• TCP-guided task stealing limits false negatives to under 7%
• Now, at higher core counts, greater steal success rates

Page 26:

Performance Improvements with TCP

• Up to 32% performance gains over random stealing
• Regularly outperforms occupancy-based stealing
• Streamcluster is highly load imbalanced, so TCP and occupancy benefits are similar

Page 27:

Outline

Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs

Page 28:

TCP-Driven Energy Efficiency in Barrier-Based Parallel Programs

Goals recap:
Multiple threads reach barriers at different times
Use TCP to predict thread speeds
DVFS fast threads to low frequencies; assume f0, 0.85f0, 0.70f0, 0.55f0
Aim: energy efficiency with no performance penalty

But: TCP mispredictions occur due to spurious program behavior, and DVFS transition overhead should be minimized
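One way to read "DVFS fast threads to low frequencies" with the four levels above: pick the slowest level that should still get a non-critical thread to the barrier on time. This sketch assumes runtime scales inversely with frequency (a simplification — real scaling depends on memory behavior) and that the criticality ratio approximates remaining relative work:

```python
# The four DVFS levels from the slide above, as fractions of f0.
F_LEVELS = [1.00, 0.85, 0.70, 0.55]

def pick_level(my_criticality, max_criticality):
    """Slow a non-critical thread to the lowest level that should still let
    it arrive at the barrier no later than the critical thread."""
    # A thread predicted to have a fraction r of the critical thread's
    # remaining work only needs ~r of full frequency (assumed model).
    ratio = my_criticality / max_criticality
    for f in reversed(F_LEVELS):   # try the slowest level first
        if f >= ratio:
            return f
    return 1.00                    # the critical thread stays at f0

print(pick_level(60, 100))   # 0.7: ~60% of the work needs >= 0.60 of f0
print(pick_level(100, 100))  # 1.0: the critical thread is never slowed
```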

Page 29:

TCP-Driven DVFS Hardware

Cache miss from core P: update criticality counter P

Is P running at f0, and is its counter above threshold T?

Page 30:

TCP-Driven DVFS Hardware

For all cores, find closest SST match to Criticality Counter

Is SST suggesting freq. switch?

Page 31:

TCP-Driven DVFS Hardware

If the SST suggests a freq. switch, increment the suggested setting's SCT counter and decrement the others

If the new DVFS setting now holds the max counter, perform the actual DVFS switch

Page 32:

TCP-Driven DVFS Hardware

If the SST does not suggest a freq. switch, increment the current setting's SCT counter and decrement the others

Page 33:

TCP-Driven DVFS Hardware

Reset criticality counters
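The SST/SCT flow on the last few slides can be sketched as follows. This is a simplification assuming 2-bit saturating counters per setting (the width the accuracy slide later evaluates); the SST lookup itself and reset timing are hardware details omitted here:

```python
# Confidence filter for DVFS switches: the SST suggests a setting each
# evaluation, and per-setting saturating counters (the SCT) must agree
# across several evaluations before the core actually switches. This
# damps mispredictions from spurious program behavior.
SAT_MAX = 3  # 2-bit saturating counters (assumed width)

class ConfidenceTable:
    def __init__(self, levels):
        self.sct = {f: 0 for f in levels}  # one counter per DVFS setting
        self.current = max(levels)         # start at full frequency f0

    def suggest(self, suggested):
        # Increment the suggested setting's counter, decrement the others
        for f in self.sct:
            if f == suggested:
                self.sct[f] = min(SAT_MAX, self.sct[f] + 1)
            else:
                self.sct[f] = max(0, self.sct[f] - 1)
        # Only switch once the suggestion has saturated its counter
        if self.sct[suggested] == SAT_MAX:
            self.current = suggested
        return self.current

ct = ConfidenceTable([1.00, 0.85, 0.70, 0.55])
print(ct.suggest(0.70))  # 1.0 (one suggestion is not enough to switch)
print(ct.suggest(0.70))  # 1.0
print(ct.suggest(0.70))  # 0.7 (saturated: the switch actually happens)
```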

Page 34:

Criticality Counter Threshold

In-order pipeline results (the harder case) – 16 cores; balance between TCP speed and accuracy

Avg. Accuracy @ 1024 = 78.19%

Page 35:

Integrating the Suggestion Conf. Table

Avg. Accuracy @ 2 bits = 92.68 %

Page 36:

Case Study: Impact of SCT on Streamcluster

Page 37:

Impact of Memory Parallelism

Page 38:

Gradual DVFS

Might not save as much energy as direct DVFS at low MSHR counts, but at high MSHR counts, much higher accuracy than direct DVFS

Page 39:

TCP-Guided DVFS Scheme Performance

Page 40:

Energy Savings

FPGA platform with 4 cores, 50% fixed leakage cost

Even higher savings expected with greater core counts, core complexity, realistic modeling of leakage (temperature impact), on-chip switching regulators

Page 41:

Hardware Characteristics

Based on readily available on-chip cache stats

All thread criticality calculation done in hardware with the SST

Minimal network messages

Low overhead and scalable:
16-core CMP – 71 bytes of storage
64-core CMP – 215 bytes of storage

Page 42:

Conclusion

Low-overhead TCPs help manage parallelism for energy and performance

Accurate TCPs can be based on simple cache statistics

TCP-based TBB task stealer offers 12.9% to 32% performance improvements on 32-core CMP

TCP-based DVFS offers 15% energy savings on 4-core CMP

Future work: TLB prefetching, DRAM scheduling