Dynamic Power Redistribution in Failure-Prone CMPs
Paula Petrica, Jonathan A. Winter* and David H. Albonesi
Cornell University
*Google, Inc.
Paula Petrica WEED2010 2
Motivation

Hardware failures are expected to become prominent in future generations.

[Diagram: core with Front End (FE), Back End (BE), and Load-Store Queue (LSQ)]
Motivation

- Deconfiguration tolerates defects at the expense of performance
- Pipeline imbalance: units correlated with a deconfigured one might become overprovisioned
- Power inefficiencies are application-specific

[Diagram: core with Front End (FE), Back End (BE), and Load-Store Queue (LSQ)]
Research Goal

Given a CMP with a set of failures and a power budget:
- Eliminate power inefficiencies
- Improve performance
Outline

- Motivation
- Architecture
- Power Harnessing
- Performance Boosting
- Power Transfer Runtime Manager
- Conclusions and future work
Architecture

Two-step approach:
- Transfer power
- Harness power

[Diagram: Core 1 and Core 2, each with a Front End (FE), Back End (BE), and Load-Store Queue (LSQ)]
Power Harnessing

[Pipeline diagram: I-Cache, BPred, FQ, Decode/Rename, Dispatch, ROB, IQ, Select, RF, D-Cache, grouped into FE, BE, and LSQ]
Pipeline Imbalance

[Chart: performance loss vs. power saved]
Performance Boosting

- Distribute the accumulated power margin to boost performance
- Temporarily enable a previously dormant feature
- Requirements: small area, fast power-up, and a small PPR (Power-Performance Ratio)
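The PPR criterion suggests a simple greedy allocator: spend the harvested margin on dormant features in ascending power-performance-ratio order. A minimal sketch — the feature names, power costs, and speedups below are illustrative placeholders, not measurements from the talk:

```python
# Toy allocator: spend a harvested power margin on dormant boosting
# features in ascending PPR (power per unit of performance) order.
# Feature names and numbers are illustrative, not measured values.

def allocate_boosts(margin_watts, features):
    """Greedily enable features with the smallest power-performance ratio."""
    enabled = []
    # PPR = power cost / expected speedup; smaller is better.
    for name, power, speedup in sorted(features, key=lambda f: f[1] / f[2]):
        if power <= margin_watts:
            margin_watts -= power
            enabled.append(name)
    return enabled

features = [
    ("speculative-L2", 0.5, 0.04),   # watts, fractional speedup
    ("CLEAR",          1.0, 0.06),
    ("DVFS-step",      3.0, 0.05),   # cubic cost makes DVFS's PPR worst
]
print(allocate_boosts(2.0, features))  # ['speculative-L2', 'CLEAR']
```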
Performance Boosting Techniques

Speculative Cache Access
- Speculatively send L1 requests to the L2 cache
- Speculatively access both tag and data in the L2 cache at the same time (rather than serially)
- Can be turned on independently or in combination
- Approximately linear power-performance relationship
- Benefits applications limited by L1 cache capacity

[Diagram: load path through the L1 cache, the L2 cache (tag and data arrays), and the lower hierarchy level, comparing serial and speculative access]
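The tag/data trade-off can be made concrete with a toy latency/power model. The cycle counts below are hypothetical assumptions, not measured figures; serial access skips the data-array read on misses, while parallel access shortens hits at the cost of always reading the data array:

```python
# Serial vs. parallel L2 tag/data access (cycle counts are illustrative).
TAG, DATA = 2, 4   # hypothetical access latencies in cycles

def serial_access(hit):
    # Check the tag first; read data only on a hit (lower power, longer hits).
    cycles = TAG + (DATA if hit else 0)
    data_reads = 1 if hit else 0
    return cycles, data_reads

def parallel_access(hit):
    # Tag and data overlap (faster hits), but data is read even on a miss.
    cycles = max(TAG, DATA) if hit else TAG
    data_reads = 1
    return cycles, data_reads

print(serial_access(True), parallel_access(True))    # (6, 1) (4, 1)
print(serial_access(False), parallel_access(False))  # (2, 0) (2, 1)
```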
Performance Boosting Techniques

Boosting main memory performance: CLEAR [N. Kirman et al., HPCA 2005]
- Predict and speculatively retire long-latency loads
- Supply predicted values to destination registers
- Free processor resources for non-dependent instructions
- Linear power-performance relationship
- Benefits memory-bound applications
Performance Boosting Techniques

DVFS
- Scale up voltage and frequency
- Already built into the processor
- Cubic power cost for a linear performance benefit
- Benefits high-IPC applications
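The cubic cost follows from the dynamic power relation P ≈ C·V²·f, with voltage scaled roughly in proportion to frequency (a standard simplifying assumption). A quick check of the arithmetic:

```python
# Why DVFS boosting is power-hungry: dynamic power P = C * V^2 * f,
# and V must scale roughly with f, so P grows ~cubically with frequency.
def relative_power(freq_scale, volt_scale=None):
    v = volt_scale if volt_scale is not None else freq_scale  # V tracks f
    return v * v * freq_scale

# A 10% frequency boost (~10% speedup for high-IPC code) costs ~33% power:
print(round(relative_power(1.10), 3))  # 1.331
```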
Comparison of Boosting Techniques

[Chart: performance improvement of each boosting technique]
Architecture

Two-step approach:
- Transfer power
- Harness power

[Diagram: Core 1 and Core 2, each with a Front End (FE), Back End (BE), and Load-Store Queue (LSQ)]
Power Transfer Runtime Manager

Periodically coordinate a chip-wide effort to relocate power among cores:
- Obtain the current local hardware deconfiguration status (due to faults)
- Determine additional components to be deconfigured
- Transfer power to one or more mechanisms that make the best use of it
Power Transfer Runtime Manager

Sampling phase (local decisions):
- Sample deconfigurations
- Choose additional deconfiguration
- Sample performance boosting

Steady phase (global decisions):
- Compute global throughput with fairness
- Choose the best 4-core configuration
- Apply DVFS (greedy)
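The steady phase's global decision can be sketched as a search over per-core boost assignments under a chip-wide power budget. This is a toy model: the Core class, the boost tables, and the geometric-mean fairness metric are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of the steady-phase global decision. Core models, power numbers,
# and the fairness metric (geometric-mean speedup) are illustrative.
from itertools import product

class Core:
    def __init__(self, name, boosts):
        self.name = name
        self.boosts = boosts  # {boost_name: (power_cost, fractional_speedup)}

def fair_throughput(speedups):
    # Geometric mean: rewards total throughput, penalizes starving one core.
    prod = 1.0
    for s in speedups:
        prod *= s
    return prod ** (1.0 / len(speedups))

def steady_phase(cores, budget):
    """Pick the per-core boost combination that maximizes fair throughput
    while staying within the chip-wide power budget."""
    best, best_score = None, 0.0
    for combo in product(*(c.boosts.items() for c in cores)):
        power = sum(cost for _, (cost, _) in combo)
        if power > budget:
            continue  # infeasible: exceeds the chip-wide budget
        score = fair_throughput([1.0 + sp for _, (_, sp) in combo])
        if score > best_score:
            best, best_score = combo, score
    return {c.name: b for c, (b, _) in zip(cores, best)}

cores = [
    Core("core0", {"none": (0.0, 0.0), "spec-L2": (0.5, 0.04)}),
    Core("core1", {"none": (0.0, 0.0), "CLEAR": (1.0, 0.06)}),
]
print(steady_phase(cores, budget=1.2))  # {'core0': 'none', 'core1': 'CLEAR'}
```

With a tighter budget only the highest-payoff boost fits; with a larger one, both are enabled.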
Global vs. Local Optimization

100 4-core configurations, random errors, and random SPEC CPU2000 benchmarks

[Chart: speedups of 22.2% and 10.0%]
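Why a global search can beat independent local choices: with a shared power margin, a core asked first may spend the budget on a low-payoff boost. A toy two-core example with made-up power costs and speedups:

```python
# Global vs. local allocation of a shared power margin (numbers illustrative).
budget = 1.0
options = {  # per-core boost -> (power cost, speedup multiplier)
    "core0": {"none": (0.0, 1.00), "boost": (0.8, 1.05)},
    "core1": {"none": (0.0, 1.00), "boost": (0.8, 1.20)},
}

# Local: each core greedily takes its best affordable boost in turn.
local_speedup, remaining = 1.0, budget
for core, opts in options.items():          # core0 is asked first
    affordable = [(p, s) for p, s in opts.values() if p <= remaining]
    power, speedup = max(affordable, key=lambda o: o[1])
    remaining -= power
    local_speedup *= speedup

# Global: search all combinations and keep the best feasible one.
best = 0.0
for p0, s0 in options["core0"].values():
    for p1, s1 in options["core1"].values():
        if p0 + p1 <= budget:
            best = max(best, s0 * s1)

print(local_speedup, best)  # 1.05 1.2 -- core0's greedy grab wastes the margin
```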
Diversity of Boosting Techniques

100 4-core configurations, random errors, and random SPEC CPU2000 benchmarks

[Chart: speedups of 22.2% and 6.3%]
Power Transfer Runtime Manager

100 4-core configurations, random errors, and random SPEC CPU2000 benchmarks

[Chart: speedups of 22.2%, 15.3%, 10.0%, and 6.3%]
Conclusions

- We proposed a technique to increase performance under a given power budget in the presence of hard faults
- Exploited the deconfiguration capabilities already built into microprocessors
- Demonstrated that pipeline imbalances and additional deconfiguration are application-dependent
- Proposed several boosting techniques
- Demonstrated the potential for substantial performance gains on a 4-core CMP
Future Work

- Heuristic approaches to scale this problem to many cores: Simulated Annealing, Genetic Algorithms
- Pareto-optimal fronts to reduce the number of combinations
- Hierarchical optimization
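As a generic illustration of the simulated-annealing direction (not the authors' implementation), a minimal annealer over per-core boost assignments might look like the sketch below; the option table, power budget, and cooling schedule are arbitrary assumptions:

```python
# Minimal simulated annealing over per-core boost assignments.
# Objective: product of speedups, subject to a chip-wide power budget.
import math
import random

def anneal(num_cores, options, budget, steps=5000):
    random.seed(0)
    def evaluate(state):
        power = sum(options[i][0] for i in state)
        if power > budget:
            return 0.0                       # infeasible assignment
        speedup = 1.0
        for i in state:
            speedup *= options[i][1]
        return speedup
    state = [0] * num_cores                  # option index per core
    cur = best = evaluate(state)
    best_state, temp = state[:], 1.0
    for _ in range(steps):
        cand = state[:]                      # mutate one core's choice
        cand[random.randrange(num_cores)] = random.randrange(len(options))
        c = evaluate(cand)
        # Accept improvements always; worse moves with probability e^(dE/T).
        if c >= cur or random.random() < math.exp((c - cur) / temp):
            state, cur = cand, c
            if cur > best:
                best, best_state = cur, state[:]
        temp *= 0.999                        # geometric cooling
    return best_state, best

# options: (power cost, speedup multiplier) -- illustrative numbers
options = [(0.0, 1.00), (0.5, 1.04), (3.0, 1.10)]
state, speedup = anneal(num_cores=16, options=options, budget=4.0)
print(speedup)
```

Unlike the exhaustive 4-core search, the annealer scales to many cores because each step only re-evaluates one candidate assignment.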
Questions?