trace&based&switching&for&a&tightly& … · •...
TRANSCRIPT
Trace Based Switching For A Tightly Coupled Heterogeneous Core
Shru% Padmanabha, Andrew Lukefahr, Reetuparna Das, Sco@ Mahlke
Micro-‐46
December 2013
University of Michigan Electrical Engineering and Computer
Science 1
• Fine-‐grained heterogeneity
• Don’t React – Predict!
• Trace-‐Based Switching Controller
• Results
• Conclusion
2
Outline
Single-‐ISA Heterogeneity
Big
Li@le
Memory substrate
ü Fast ü Good for high performance phases
û Smaller/Slower û 1/2 the performance of big
û Power hungry structures û Wastes energy on low
performance phases
ü Simpler, low power structures ü 3X more energy efficient
Out-‐of-‐order
In-‐order
Big Backend
Big Frontend
Li@le Frontend
Li@le Backend
3
L2$ L1$
L2$ L1$
Energy savings with minimal performance loss
TradiYonal Heterogeneous Architectures
Big Backend
Li@le Frontend
Memory
ARM’s big.LITTLE
Big Frontend
Li@le Backend
4
L2$
L2$
L1$
L1$
Transfer overhead = ~20K Cycles Minimum switching interval = ~1M instrucYons
What about low performance phases at finer-‐granularity?
Performance Change In GCC
• Average IPC over a 1M instrucYon window (Quantum) • Average IPC over 2K Quanta Huge performance changes within a coarse quantum!
0
0.5
1
1.5
2
2.5
3
100K 200K 300K 400K 500K 600K 700K 800K 900K 1M
InstrucYon
s / Cycle
InstrucYons
Big Core Li@le Core
0
0.5
1
1.5
2
2.5
3
100K 200K 300K 400K 500K 600K 700K 800K 900K 1M
InstrucYon
s / Cycle
InstrucYons
Big Core Li@le Core
5
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
100%
100 1K 10K 100K 1M
Time Spen
t on Li5le
Quantum Size in Instruc%ons
Fine-‐grain Heterogeneity Has PotenYal
TradiYonal systems
Coarse Grain Fine Grain Finer is be@er! 6
(Subject to a maximum 5% performance loss target)
Oracle – theoreYcal maximum per quantum size
Fine-‐grained Heterogeneous Architectures
Composite Cores *
*Composite Cores: Pushing Heterogeneity into a Core, Lukefahr et al, Micro 2012 7
Big Backend
Li@le Frontend
Memory
Big Frontend
Li@le Backend
L2$
L2$
L1$
L1$ RF
RF Shared Frontend L2$ L1$ Transfer overhead = ~35 Cycles
Minimum switching interval = ~1000 instrucYons
• Fine-‐grained heterogeneity
• Don’t React – Predict!
• Trace-‐Based Switching Controller
• Results
• Conclusion
8
Outline
10
ReacYve Controller at 300 InstrucYon Quanta, gcc
ReacYve Follies at Fine-‐Granularity
Pick Big
Pick Li@le
Quanta chosen on Big Quanta chose on Li@le
Oracle Threshold
RelaYve Pe
rformance of B
ig vs L
i@le
Quanta in Program Order
Incorrectly picked li@le à lost performance, on 29% of quanta
Incorrectly picked big à lost opportuniYes for energy savings on 26% of quanta
0%
10%
20%
30%
40%
50%
60%
100 1K 10K 100K 1M
% Tim
e Spen
t On Li5le
# Instruc%ons per Quantum
Oracle Composite Cores
Fine-‐granularity “Reacts” Badly
11
Don’t React – Predict! ReacYve Controller Oracle
• Fine-‐grained heterogeneity
• Don’t React – Predict!
• Trace-‐Based Switching Controller
• Results
• Conclusion
12
Outline
History Based PredicYon ObservaYon
Code repeats (loops, funcYons)
Behavior repeats in the same program context
13
Super-‐Trace Exploit this inherent repeatable nature of code
placed in program context
Our AssumpYon A Super-‐Trace’s behavior history follows a
pa@ern
ObjecYve 1. Predict oncoming Super-‐traces 2. Predict its backend preference
14
BB1
BB2 BB3
BB4
BB5
BB7
BB6
BB1
BB2 BB3
BB4
Backedge1
Backedge2
Backedge3
Backedge0
Super-‐trace ConstrucYon
Backedge: Branch with PCtarget < PCbranch
Observe dynamic execuYon
FuncYon call
BB5
BB6
BB7
BB1
BB2 BB3
BB4
FuncYon call
Super-‐trace ConstrucYon
Trace =
BE1 -‐> BE2
15
BB1
BB2 BB3
BB4
Backedge1
Backedge2 BB5
Backedge3
BB7
BB6
Combine traces to meet minimum length requirement
(300 instrucYons)
BB1
BB2 BB3
BB4
Backedge0
~ 50 instrucYons…
Performance highly dependent on context
Trace: Block of code defined at backedge
boundaries
Super-‐trace ConstrucYon
16
BB1
BB2 BB3
BB4
Backedge1
Backedge2 BB5
Backedge3
BB8
BB7
BB1
BB2 BB3
BB4
Backedge0
~ 300 instrucYons Super-‐Trace!
ID = BE1 + BE2 + BE3
Count instrucYons between backedges
Trace-‐Based Controller Overview
Super-‐Trace Constructor
Next Super-‐Trace
Predictor
Backend Predictor
Feedback to Predictors*
Big Backend
Commi@ed InstrucYons & Backedge PCs
Super-‐Trace ID Next Super-‐Trace ID
Run on Big/Li5le?
Update Update
1
2 3
17
Li@le Backend
4
*Composite Cores: Pushing Heterogeneity into a Core, Lukefahr et al, Micro 2012
Two-‐level predictor tables
2. Hash backedge PCs to provide a super-‐trace ID
18
BE12 BE11 BE10 BE9 BE8 BE7 BE6 BE5 BE4 BE3 BE2 BE1
More Recently seen
Super-‐Trace Constructor 11. Check for minimum instrucYon length constraint
Super-‐Trace Constructor 1
19
More Recently seen
BE12 BE11 BE10 BE9 BE8 BE7 BE6 BE5 BE4 BE3 BE2 BE1
3 3 3 3 3 3 3 3 3 3 6
9
9 9 9 9
9 9
9
9
9
Hash backedge PCs to provide a super-‐trace ID
9 bit STrace ID
HeadPC
(To ) 2
Next Super-‐Trace Predictor
1.Super-‐Trace ID
Head Replaceability 2.Super-‐Trace ID
Head Replaceability
(9 bits) (3 bits) (2 bits) (9 bits) (3 bits) (2 bits)
Strace IDi
Strace Idi+1 STrace Idi+1 10
2
20
(From ) 1
S-‐Trace Idm+1 11 Headi+1 Headm+1
Headi+1
(To ) 3
Next Backend Predictor
Big/ Li5le Confidence
(2 bits) Strace IDi+1
10
3
21
Big Backend Li@le
Backend
(From ) 2
Run on Big!
EvaluaYon Architectural Feature Parameters
Big Core 3 wide O3 @ 1.2GHz 12 stage pipeline 128 ROB Entries 128 entry register file
Li@le Core 2 wide InOrder @ 1.2GHz 8 stage pipeline 32 entry register file
Memory System 32 KB L1 i/d cache, 2 cycle access 1MB L2 cache, 15 cycle access 1GB Main Mem, 80 cycle access
Simulator Gem5, Full system
Energy Model McPAT
Benchmarks SPEC 2k6, compiled for Alpha ISA Fast Forward 2 billion instrucYons, simulate 100 million instrucYons
22
• Fine-‐grained heterogeneity
• Don’t React – Predict!
• Trace-‐Based Switching Controller
• Results
• Conclusion
23
Outline
Oracle
ReacYve Controller
SuperTrace-‐Based Controller
Big Quantums = Red, Li@le Quantums = Blue
RelaYve Pe
rformance of B
ig vs L
i@le
0 50000 100000 150000 200000 250000 300000 Super-‐traces in Increasing Time Order
Easy to predict: DisYnct and repeatable phase behavior
h264ref
24
Oracle
ReacYve Controller
SuperTrace-‐Based Controller
RelaYve Pe
rformance of B
ig vs L
i@le
Big Quantums = Red, Li@le Quantums = Blue
0 50000 100000 150000 200000 250000 300000 350000 Super-‐traces in Increasing Time Order
Hard to predict: Highly variable data & control flow
gcc
25
Time Spent on Li@le
26
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
100%
Time Spen
t on Li5le
Benchmarks
Oracle ReacYve SuperTrace
Time spent on the Li@le backend increases by 46% on average
50% 55% 60% 65% 70% 75% 80% 85% 90% 95%
100%
Performan
ce Com
pared To
Baseline (Big)
Benchmarks
Oracle ReacYve SuperTrace
Performance RelaYve To Standalone OoO
27
The controllers all honor the 95% performance target
Energy ConservaYon
28
76.8%
76.9%
69.6%
0%
10%
20%
30%
40%
50%
60%
Energy Con
served
Rela%
ve to
Baseline
(Big)
Benchmarks
ReacYve SuperTrace
On average, we conserve 43% more energy than a reacYve controller
Conclusion • Don’t React – Predict! – UYlize Li@le 46% more than current work
• 15% energy savings over the baseline (43% more than exisYng work) – small hardware overheads (1.9kB)
29
Trace Based Switching For A Tightly Coupled Heterogeneous Core
Shru% Padmanabha, Andrew Lukefahr, Reetuparna Das, Sco@ Mahlke
Micro-‐46 December 2013
University of Michigan Electrical Engineering and Computer
Science
Thank you!
30
Case for Heterogeneous Cores
Performance
Energy
Time
High energy usage Yields high performance
SYll consumes high energy
Low performance phase
High performance cores waste energy on low performance phases
32
General purpose applicaYon execuYng on a high performance core
Run low performance phases on slower, but energy-‐efficient cores
• Large structures (ROB, Rename Table, LSQ) • Higher issue width
Core Energy Comparison
Brooks, ISCA’00
Dally, IEEE Computer’08
Out-‐of-‐Order In-‐Order
• Out-‐Of-‐Order contains performance enhancing hardware • Not necessary for correctness Do we always need the extra hardware?
33
ReacYve Online Controller
Li@le uEngine True
Big uEngine False
𝐶𝑃𝐼↓𝑙𝑖𝑡𝑡𝑙𝑒
Δ𝐶𝑃𝐼↓𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑
𝐶𝑃𝐼↓𝑏𝑖𝑔
Switching Controlle
r
𝐶𝑃𝐼↓𝑙𝑖𝑡𝑡𝑙𝑒
Big Model
𝐶𝑃𝐼↓𝑏𝑖𝑔
∑↑▒𝐶𝑃𝐼↓𝑂𝑏𝑠𝑒𝑟𝑣𝑒𝑑
𝑆∗∑↑▒𝐶𝑃𝐼↓𝐵𝑖𝑔
+
𝐶𝑃𝐼↓𝑎𝑐𝑡𝑢𝑎𝑙
𝐶𝑃𝐼↓𝑡𝑎𝑟𝑔𝑒𝑡
Threshold
Controller
𝐶𝑃𝐼↓𝑒𝑟𝑟𝑜𝑟
Li@le Model
𝐶𝑃𝐼↓𝑙𝑖𝑡𝑡𝑙𝑒 Δ𝐶𝑃𝐼↓𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 + 𝐶𝑃𝐼↓𝑙𝑖𝑡𝑡𝑙𝑒 ≤
𝐶𝑃𝐼↓𝑏𝑖𝑔
Δ𝐶𝑃𝐼↓𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 = 𝐾↓𝑝 𝐶𝑃𝐼↓𝑒𝑟𝑟𝑜𝑟 + 𝐾↓𝑖 ∑↑▒𝐶𝑃𝐼↓𝑒𝑟𝑟𝑜𝑟
𝐶𝑃𝐼↓𝑏𝑖𝑔
User-‐Selected Performance
Accuracy
35
60
65
70
75
80
85
100 1,000 10,000 100,000 1,000,000 % Qua
ntum
s Accurately Map
ped
Quantum Size In Instruc%ons
ReacYve PerfectNextTraceKnowledge SuperTrace BackedgeFrequency
SensiYvity to Other Schemes
36
0 10 20 30 40 50 60 70 80 90
100
% Qua
ntum
s Accurately Map
ped
Benchmarks
2LevelAdapYve-‐Local 2LevelAdapYve-‐Global Our Scheme
Composite Cores – Controller Overview
Core Performance Models
(RegressionModel)
Performance Monitor (Feedback Controller)
Li5le backend
Big backend
Performance Loss <
Threshold Yes No
Backend selector
Observed perform
ance metrics
Threshold
37
Li5le backend
Es%mated target performance
Es%mated big & li5le
performance difference
1.S-‐Trace ID Head Confidence 2.S-‐Trace ID
Head Confidence
(9 bits) (3 bits) (2 bits) (9 bits) (3 bits) (2 bits)
Next Super-‐Trace Predictor
Strace IDj Headj+1 Strace Idi+1 Strace Idj+1
Update – Incorrect Decision
00 S-‐Trace Idl+1 Headl+1 11 S-‐Trace Idk+1 Headk+1 10
2
38
Predict! Update!
(From ) 1 (From AcYve Backend)
Performance Change In GCC
• Average IPC over a 1M instrucYon window (Quantum) • Average IPC over 2K Quanta Huge performance changes within a coarse quantum!
0
0.5
1
1.5
2
2.5
3
100K 200K 300K 400K 500K 600K 700K 800K 900K 1M
InstrucYon
s / Cycle
InstrucYons
Big Core Li@le Core
0
0.5
1
1.5
2
2.5
3
100K 200K 300K 400K 500K 600K 700K 800K 900K 1M
InstrucYon
s / Cycle
InstrucYons
Big Core Li@le Core
39
Fine-‐grain Performance Phases Exist
40
0
0.5
1
1.5
2
2.5
3
160K 170K 180K
InstrucYon
s / Cycle
InstrucYons
Big Core Li@le Core
• 20K instrucYon window from GCC • Average IPC over 100 instrucYon quanta
*Composite Cores: Pushing Heterogeneity into a Core, Lukefahr et al, Micro 2012
Fine grain offers more opportunity to save energy by exploiYng:
• Dependent compute • Dependent load misses • Branch mispredicts
Next Super-‐Trace Predictor
1.Super-‐Trace ID
Head Replacability 2.Super-‐Trace ID
Head Replacability
(9 bits) (3 bits) (2 bits) (9 bits) (3 bits) (2 bits)
Strace IDi
Strace Idi+1
Strace Idi+1
11 STrace Idi+1 10
2
41
Predict! Update!
(From ) 1
S-‐Trace Idm+1 11
(From AcYve Backend)
Headi+1 Headm+1
Headi+1 Update: Correct Decision
(To ) 3
Next Backend Predictor
Big/ Li5le
(2 bits) Strace IDi+1
Performance Counters
Feedback Generator
Observed Strace IDi+1
10
Observed Performance
Metrics
Predict! Update!
3
42
11
Big Backend Li@le
Backend
(From ) 2
Update: Correct Decision