trace&based&switching&for&a&tightly& … · •...

43
Trace Based Switching For A Tightly Coupled Heterogeneous Core Shru% Padmanabha, Andrew Lukefahr, Reetuparna Das, Sco@ Mahlke Micro46 December 2013 University of Michigan Electrical Engineering and Computer Science 1

Upload: dinhhanh

Post on 20-Apr-2018

218 views

Category:

Documents


3 download

TRANSCRIPT

Trace  Based  Switching  For  A  Tightly  Coupled  Heterogeneous  Core  

Shru%  Padmanabha,  Andrew  Lukefahr,  Reetuparna  Das,  Sco@  Mahlke  

 Micro-­‐46  

December  2013    

University of Michigan Electrical Engineering and Computer

Science 1  

•  Fine-­‐grained  heterogeneity  

•  Don’t  React  –  Predict!  

•  Trace-­‐Based  Switching  Controller  

•  Results  

•  Conclusion    

 

2  

Outline  

Single-­‐ISA  Heterogeneity  

Big  

Li@le  

Memory  substrate  

ü  Fast  ü  Good  for  high  performance  phases  

û  Smaller/Slower  û  1/2  the  performance  of  big    

û  Power  hungry  structures  û  Wastes  energy  on  low  

performance  phases  

ü  Simpler,  low  power  structures  ü  3X  more  energy  efficient  

Out-­‐of-­‐order  

In-­‐order  

Big    Backend  

Big  Frontend  

Li@le  Frontend  

Li@le  Backend  

3  

L2$   L1$  

L2$   L1$  

Energy  savings  with  minimal  performance  loss  

TradiYonal  Heterogeneous  Architectures  

Big    Backend  

Li@le  Frontend  

Memory  

ARM’s  big.LITTLE  

Big  Frontend  

Li@le  Backend  

4  

L2$  

L2$  

L1$  

L1$  

Transfer  overhead  =  ~20K  Cycles  Minimum  switching  interval  =  ~1M  instrucYons  

What  about  low  performance  phases  at  finer-­‐granularity?  

Performance  Change  In  GCC  

•  Average  IPC  over  a  1M  instrucYon  window  (Quantum)    •  Average  IPC  over  2K  Quanta  Huge  performance  changes  within  a  coarse  quantum!  

0  

0.5  

1  

1.5  

2  

2.5  

3  

100K   200K   300K   400K   500K   600K   700K   800K   900K   1M  

InstrucYon

s  /  Cycle  

InstrucYons  

Big  Core   Li@le  Core  

0  

0.5  

1  

1.5  

2  

2.5  

3  

100K   200K   300K   400K   500K   600K   700K   800K   900K   1M  

InstrucYon

s  /  Cycle  

InstrucYons  

Big  Core   Li@le  Core  

5  

0%  10%  20%  30%  40%  50%  60%  70%  80%  90%  

100%  

100   1K   10K   100K   1M  

Time  Spen

t  on  Li5le  

Quantum  Size  in  Instruc%ons  

Fine-­‐grain  Heterogeneity  Has  PotenYal  

TradiYonal  systems  

Coarse  Grain  Fine  Grain  Finer  is  be@er!  6  

(Subject  to  a  maximum  5%  performance  loss  target)  

Oracle  –  theoreYcal  maximum  per  quantum  size  

Fine-­‐grained  Heterogeneous  Architectures  

Composite  Cores  *  

*Composite  Cores:  Pushing  Heterogeneity  into  a  Core,  Lukefahr  et  al,  Micro  2012  7  

Big    Backend  

Li@le  Frontend  

Memory  

Big  Frontend  

Li@le  Backend  

L2$  

L2$  

L1$  

L1$  RF  

RF  Shared  Frontend  L2$   L1$  Transfer  overhead  =  ~35  Cycles  

Minimum  switching  interval  =  ~1000  instrucYons  

•  Fine-­‐grained  heterogeneity  

•  Don’t  React  –  Predict!  

•  Trace-­‐Based  Switching  Controller  

•  Results  

•  Conclusion    

 

8  

Outline  

TradiYonal  Switching  Controller  

9  

Assump%on:    A  quantum’s  behavior  is  indicated  by  the  recent  past  

10  

ReacYve  Controller  at  300  InstrucYon  Quanta,  gcc  

ReacYve  Follies  at  Fine-­‐Granularity  

Pick  Big  

Pick  Li@le  

Quanta  chosen  on  Big  Quanta  chose  on  Li@le  

 Oracle  Threshold      

RelaYve  Pe

rformance  of  B

ig  vs  L

i@le  

Quanta  in  Program  Order  

Incorrectly  picked  li@le  à  lost  performance,  on  29%  of  quanta    

Incorrectly  picked  big  à  lost  opportuniYes  for  energy  savings  on  26%  of  quanta  

0%  

10%  

20%  

30%  

40%  

50%  

60%  

100   1K   10K   100K   1M  

%  Tim

e  Spen

t  On  Li5le  

#  Instruc%ons  per  Quantum  

Oracle   Composite  Cores  

Fine-­‐granularity  “Reacts”  Badly  

 

11  

Don’t  React  –  Predict!  ReacYve  Controller  Oracle  

•  Fine-­‐grained  heterogeneity  

•  Don’t  React  –  Predict!  

•  Trace-­‐Based  Switching  Controller  

•  Results  

•  Conclusion    

 

12  

Outline  

History  Based  PredicYon  ObservaYon    

Code  repeats  (loops,  funcYons)    

Behavior  repeats  in  the  same  program  context    

13  

Super-­‐Trace  Exploit  this  inherent  repeatable  nature  of  code  

placed  in  program  context  

Our  AssumpYon  A  Super-­‐Trace’s  behavior  history  follows  a  

pa@ern  

ObjecYve  1.  Predict  oncoming  Super-­‐traces  2.  Predict  its  backend  preference  

14  

BB1  

BB2   BB3  

BB4  

BB5  

BB7  

BB6  

BB1  

BB2   BB3  

BB4  

Backedge1  

Backedge2  

Backedge3  

Backedge0  

Super-­‐trace  ConstrucYon  

Backedge:  Branch  with  PCtarget  <  PCbranch  

Observe  dynamic  execuYon  

FuncYon  call  

BB5  

BB6  

BB7  

BB1  

BB2   BB3  

BB4  

FuncYon  call  

Super-­‐trace  ConstrucYon  

Trace    =    

BE1  -­‐>  BE2  

15  

BB1  

BB2   BB3  

BB4  

Backedge1  

Backedge2  BB5  

Backedge3  

BB7  

BB6  

Combine  traces  to  meet  minimum  length  requirement  

(300  instrucYons)  

BB1  

BB2   BB3  

BB4  

Backedge0  

~  50  instrucYons…    

Performance  highly  dependent  on  context  

Trace:  Block  of  code  defined  at  backedge  

boundaries  

Super-­‐trace  ConstrucYon  

16  

BB1  

BB2   BB3  

BB4  

Backedge1  

Backedge2  BB5  

Backedge3  

BB8  

BB7  

BB1  

BB2   BB3  

BB4  

Backedge0  

~  300  instrucYons  Super-­‐Trace!  

 ID  =  BE1  +  BE2  +  BE3  

Count  instrucYons  between  backedges  

Trace-­‐Based  Controller  Overview  

   Super-­‐Trace  Constructor  

Next  Super-­‐Trace  

Predictor  

Backend  Predictor  

Feedback  to    Predictors*  

Big  Backend  

Commi@ed  InstrucYons    &    Backedge  PCs  

Super-­‐Trace  ID  Next  Super-­‐Trace  ID  

Run  on  Big/Li5le?  

Update     Update    

1

2 3

17  

Li@le  Backend  

4

*Composite  Cores:  Pushing  Heterogeneity  into  a  Core,  Lukefahr  et  al,  Micro  2012  

Two-­‐level  predictor  tables  

2.  Hash  backedge  PCs  to  provide  a  super-­‐trace  ID  

18  

BE12   BE11   BE10   BE9   BE8   BE7   BE6   BE5   BE4   BE3   BE2   BE1  

More  Recently  seen  

 Super-­‐Trace  Constructor  11.  Check  for  minimum  instrucYon  length  constraint  

 Super-­‐Trace  Constructor  1

19  

More  Recently  seen  

BE12   BE11   BE10   BE9   BE8   BE7   BE6   BE5   BE4   BE3   BE2   BE1  

3   3   3   3   3   3   3   3   3   3   6  

9  

9   9   9   9  

9   9  

9  

9  

9  

Hash  backedge  PCs  to  provide  a  super-­‐trace  ID  

9  bit  STrace  ID  

HeadPC  

(To                      )    2

Next  Super-­‐Trace  Predictor  

1.Super-­‐Trace  ID  

Head   Replaceability   2.Super-­‐Trace  ID  

Head   Replaceability  

(9  bits)   (3  bits)   (2  bits)   (9  bits)   (3  bits)   (2  bits)  

Strace  IDi  

Strace  Idi+1  STrace  Idi+1   10  

2

20  

(From                      )    1

S-­‐Trace  Idm+1   11  Headi+1   Headm+1  

Headi+1    

(To                      )    3

Next  Backend  Predictor  

Big/  Li5le  Confidence  

(2  bits)  Strace  IDi+1  

10  

3

21  

Big  Backend  Li@le  

Backend  

(From                      )    2

Run  on  Big!  

EvaluaYon  Architectural  Feature   Parameters  

Big  Core   3  wide  O3  @  1.2GHz  12  stage  pipeline  128  ROB  Entries  128  entry  register  file  

Li@le  Core   2  wide  InOrder  @  1.2GHz  8  stage  pipeline  32  entry  register  file  

Memory  System   32  KB  L1  i/d  cache,  2  cycle  access  1MB  L2  cache,  15  cycle  access  1GB  Main  Mem,  80  cycle  access    

Simulator     Gem5,  Full  system  

Energy  Model   McPAT  

Benchmarks   SPEC  2k6,  compiled  for  Alpha  ISA  Fast  Forward  2  billion  instrucYons,  simulate  100  million  instrucYons  

22  

•  Fine-­‐grained  heterogeneity  

•  Don’t  React  –  Predict!  

•  Trace-­‐Based  Switching  Controller  

•  Results  

•  Conclusion    

 

23  

Outline  

Oracle  

ReacYve  Controller  

SuperTrace-­‐Based  Controller  

Big  Quantums  =  Red,    Li@le  Quantums  =  Blue  

RelaYve  Pe

rformance  of  B

ig  vs  L

i@le  

0   50000   100000   150000   200000   250000   300000  Super-­‐traces  in  Increasing  Time  Order  

Easy  to  predict:  DisYnct  and  repeatable  phase  behavior  

h264ref  

24  

Oracle  

ReacYve  Controller  

SuperTrace-­‐Based  Controller  

RelaYve  Pe

rformance  of  B

ig  vs  L

i@le  

Big  Quantums  =  Red,    Li@le  Quantums  =  Blue  

0   50000   100000   150000   200000   250000   300000   350000  Super-­‐traces  in  Increasing  Time  Order  

Hard  to  predict:  Highly  variable  data  &  control  flow  

gcc  

25  

Time  Spent  on  Li@le    

26  

0%  10%  20%  30%  40%  50%  60%  70%  80%  90%  

100%  

Time  Spen

t  on  Li5le  

Benchmarks  

Oracle   ReacYve   SuperTrace  

Time  spent  on  the  Li@le  backend  increases  by  46%  on  average  

50%  55%  60%  65%  70%  75%  80%  85%  90%  95%  

100%  

Performan

ce  Com

pared  To

 Baseline  (Big)    

Benchmarks  

Oracle   ReacYve   SuperTrace  

Performance  RelaYve  To  Standalone  OoO  

27  

The  controllers  all  honor  the  95%  performance  target  

Energy  ConservaYon  

28  

76.8%  

76.9%  

69.6%  

0%  

10%  

20%  

30%  

40%  

50%  

60%  

Energy  Con

served

 Rela%

ve  to

 Baseline  

(Big)  

Benchmarks  

ReacYve     SuperTrace  

 On  average,  we  conserve  43%  more  energy  than  a  reacYve  controller

   

Conclusion  •  Don’t  React  –  Predict!  – UYlize  Li@le  46%  more  than  current  work  

•  15%  energy  savings  over  the  baseline  (43%  more  than  exisYng  work)  – small  hardware  overheads  (1.9kB)  

   

29  

Trace  Based  Switching  For  A  Tightly  Coupled  Heterogeneous  Core  

Shru%  Padmanabha,  Andrew  Lukefahr,  Reetuparna  Das,  Sco@  Mahlke  

   

Micro-­‐46  December  2013  

University of Michigan Electrical Engineering and Computer

Science

Thank  you!  

30  

[email protected]  

Backup  

31  

Case  for  Heterogeneous  Cores  

Performance    

Energy  

Time  

High  energy  usage     Yields  high  performance  

SYll  consumes  high  energy  

Low  performance  phase  

High  performance  cores  waste  energy  on  low  performance  phases  

32  

General  purpose  applicaYon  execuYng  on  a  high  performance  core  

Run  low  performance  phases  on  slower,  but  energy-­‐efficient  cores  

•  Large  structures  (ROB,  Rename  Table,  LSQ)  •  Higher  issue  width  

Core  Energy  Comparison  

Brooks,  ISCA’00  

Dally,  IEEE  Computer’08  

Out-­‐of-­‐Order   In-­‐Order  

•  Out-­‐Of-­‐Order  contains  performance  enhancing  hardware  •  Not  necessary  for  correctness  Do  we  always  need  the  extra  hardware?  

33  

ReacYve  Online  Controller  

Li@le  uEngine  True  

Big  uEngine  False  

𝐶𝑃𝐼↓𝑙𝑖𝑡𝑡𝑙𝑒   

Δ𝐶𝑃𝐼↓𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑   

𝐶𝑃𝐼↓𝑏𝑖𝑔   

Switching  Controlle

r  

𝐶𝑃𝐼↓𝑙𝑖𝑡𝑡𝑙𝑒   

Big  Model  

𝐶𝑃𝐼↓𝑏𝑖𝑔   

∑↑▒𝐶𝑃𝐼↓𝑂𝑏𝑠𝑒𝑟𝑣𝑒𝑑    

𝑆∗∑↑▒𝐶𝑃𝐼↓𝐵𝑖𝑔    

+

𝐶𝑃𝐼↓𝑎𝑐𝑡𝑢𝑎𝑙   

𝐶𝑃𝐼↓𝑡𝑎𝑟𝑔𝑒𝑡   

Threshold  

Controller  

𝐶𝑃𝐼↓𝑒𝑟𝑟𝑜𝑟   

Li@le  Model  

𝐶𝑃𝐼↓𝑙𝑖𝑡𝑡𝑙𝑒   Δ𝐶𝑃𝐼↓𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 + 𝐶𝑃𝐼↓𝑙𝑖𝑡𝑡𝑙𝑒 ≤

𝐶𝑃𝐼↓𝑏𝑖𝑔   

Δ𝐶𝑃𝐼↓𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 = 𝐾↓𝑝 𝐶𝑃𝐼↓𝑒𝑟𝑟𝑜𝑟 + 𝐾↓𝑖 ∑↑▒𝐶𝑃𝐼↓𝑒𝑟𝑟𝑜𝑟    

𝐶𝑃𝐼↓𝑏𝑖𝑔   

User-­‐Selected  Performance  

Accuracy  

35  

60  

65  

70  

75  

80  

85  

100   1,000   10,000   100,000   1,000,000  %  Qua

ntum

s  Accurately  Map

ped  

Quantum  Size  In  Instruc%ons  

ReacYve   PerfectNextTraceKnowledge  SuperTrace   BackedgeFrequency  

SensiYvity  to  Other  Schemes  

36  

0  10  20  30  40  50  60  70  80  90  

100  

%  Qua

ntum

s  Accurately  Map

ped  

Benchmarks  

2LevelAdapYve-­‐Local   2LevelAdapYve-­‐Global   Our  Scheme  

Composite  Cores  –  Controller  Overview  

Core  Performance  Models  

(RegressionModel)  

Performance  Monitor  (Feedback  Controller)    

Li5le  backend  

Big  backend  

Performance  Loss  <  

Threshold  Yes   No  

Backend  selector  

Observed  perform

ance  metrics  

Threshold  

37  

Li5le  backend  

Es%mated  target  performance  

Es%mated  big  &  li5le  

performance  difference    

1.S-­‐Trace  ID   Head   Confidence     2.S-­‐Trace  ID  

Head   Confidence  

(9  bits)   (3  bits)   (2  bits)   (9  bits)   (3  bits)   (2  bits)  

Next  Super-­‐Trace  Predictor  

Strace  IDj   Headj+1   Strace  Idi+1  Strace  Idj+1  

Update  –  Incorrect  Decision  

00  S-­‐Trace  Idl+1   Headl+1   11   S-­‐Trace  Idk+1  Headk+1   10  

2

38  

Predict!  Update!  

(From                      )    1 (From    AcYve  Backend)    

Performance  Change  In  GCC  

•  Average  IPC  over  a  1M  instrucYon  window  (Quantum)    •  Average  IPC  over  2K  Quanta  Huge  performance  changes  within  a  coarse  quantum!  

0  

0.5  

1  

1.5  

2  

2.5  

3  

100K   200K   300K   400K   500K   600K   700K   800K   900K   1M  

InstrucYon

s  /  Cycle  

InstrucYons  

Big  Core   Li@le  Core  

0  

0.5  

1  

1.5  

2  

2.5  

3  

100K   200K   300K   400K   500K   600K   700K   800K   900K   1M  

InstrucYon

s  /  Cycle  

InstrucYons  

Big  Core   Li@le  Core  

39  

Fine-­‐grain  Performance  Phases  Exist  

40  

0  

0.5  

1  

1.5  

2  

2.5  

3  

160K   170K   180K  

InstrucYon

s  /  Cycle  

InstrucYons  

Big  Core   Li@le  Core  

•  20K  instrucYon  window  from  GCC  •  Average  IPC  over  100  instrucYon  quanta  

*Composite  Cores:  Pushing  Heterogeneity  into  a  Core,  Lukefahr  et  al,  Micro  2012  

Fine  grain  offers  more  opportunity  to  save  energy  by  exploiYng:  

•  Dependent  compute  •  Dependent  load  misses  •  Branch  mispredicts  

Next  Super-­‐Trace  Predictor  

1.Super-­‐Trace  ID  

Head   Replacability   2.Super-­‐Trace  ID  

Head   Replacability  

(9  bits)   (3  bits)   (2  bits)   (9  bits)   (3  bits)   (2  bits)  

Strace  IDi  

Strace  Idi+1  

Strace  Idi+1  

11  STrace  Idi+1   10  

2

41  

Predict!  Update!  

(From                      )    1

S-­‐Trace  Idm+1   11  

(From    AcYve  Backend)    

Headi+1   Headm+1  

Headi+1     Update:  Correct  Decision  

(To                      )    3

Next  Backend  Predictor  

Big/  Li5le  

(2  bits)  Strace  IDi+1  

Performance  Counters  

Feedback  Generator  

Observed  Strace  IDi+1  

10  

Observed  Performance  

Metrics  

Predict!  Update!  

3

42  

11  

Big  Backend  Li@le  

Backend  

(From                      )    2

Update:  Correct  Decision  

Fine-­‐grained  Heterogeneous  Architectures  

(B)  H3  *  

*RaBonale  for  a  3D  Heterogeneous  MulB-­‐core  Processor,  Rotenberg,  ICCD  2013  43