Computer Architecture ELE 475 / COS 475

Slide Deck 12: Multithreading
David Wentzlaff

Department of Electrical Engineering
Princeton University

1  

Agenda  

• Multithreading Motivation
• Coarse-Grain Multithreading
• Simultaneous Multithreading

2  

Multithreading

• Difficult to continue to extract instruction-level parallelism (ILP) or data-level parallelism (DLP) from a single sequential thread of control

• Many workloads can make use of thread-level parallelism (TLP)
  – TLP from multiprogramming (run independent sequential jobs)
  – TLP from multithreaded applications (run one job faster using parallel threads)

• Multithreading uses TLP to improve utilization of a single processor

3  

Pipeline Hazards

• Each instruction may depend on the next

    LW   r1, 0(r2)
    LW   r5, 12(r1)
    ADDI r5, r5, #12
    SW   12(r1), r5

[Pipeline diagram: F D X M W stages over cycles t0-t14; each dependent instruction stalls in Decode until the producing instruction completes write-back]

4  

Pipeline Hazards

• Each instruction may depend on the next (same example and pipeline diagram as the previous slide)

What is usually done to cope with this?

5  

Pipeline Hazards

• Each instruction may depend on the next (same example as above)

What is usually done to cope with this?
– interlocks (slow)
– or bypassing (needs hardware, doesn't help all hazards)

6  
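
As a minimal illustration of the interlock option (not from the original slides), the Python sketch below stalls a dependent instruction until its producer has written back. The instruction encoding, the 4-cycle write-back distance, and the one-instruction-at-a-time issue model are simplifying assumptions for illustration only.

    # Sketch: interlock logic stalls a dependent instruction until its
    # producing instruction has written back.  Timing model is simplified:
    # we track only the cycle at which each issued instruction's result
    # becomes readable (assumed 4 cycles after issue, i.e. after W).
    from collections import namedtuple

    Instr = namedtuple("Instr", "text dst srcs")

    program = [
        Instr("LW   r1, 0(r2)",   "r1", ["r2"]),
        Instr("LW   r5, 12(r1)",  "r5", ["r1"]),
        Instr("ADDI r5, r5, #12", "r5", ["r5"]),
        Instr("SW   12(r1), r5",  None, ["r1", "r5"]),
    ]

    WRITEBACK_DELAY = 4            # stages after issue: D X M W (assumed)
    ready_at = {}                  # register -> cycle its new value is readable

    cycle, pc = 0, 0
    while pc < len(program):
        instr = program[pc]
        # Interlock check: stall if any source register is still in flight.
        if any(ready_at.get(src, 0) > cycle for src in instr.srcs):
            print(f"cycle {cycle:2d}: {instr.text:18s} stalled (interlock)")
        else:
            print(f"cycle {cycle:2d}: {instr.text:18s} issued")
            if instr.dst:
                ready_at[instr.dst] = cycle + WRITEBACK_DELAY
            pc += 1
        cycle += 1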

Multithreading

How can we guarantee no dependencies between instructions in a pipeline?

– One way is to interleave execution of instructions from different program threads on the same pipeline

7  

Multithreading

How can we guarantee no dependencies between instructions in a pipeline?
– One way is to interleave execution of instructions from different program threads on the same pipeline

    T1: LW   r1, 0(r2)
    T2: ADD  r7, r1, r4
    T3: XORI r5, r4, #12
    T4: SW   0(r7), r5
    T1: LW   r5, 12(r1)

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe
[Pipeline diagram: successive cycles carry F D X M W for instructions from different threads, so no stalls occur]

8  

Multithreading

(Same interleaving example as the previous slide: 4 threads, T1-T4, on a non-bypassed 5-stage pipe)

Prior instruction in a thread always completes write-back before next instruction in same thread reads register file

9  
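
To make the fixed-interleave idea above concrete, here is a small Python sketch (mine, not from the deck): with four threads rotated round-robin on a non-bypassed 5-stage pipe, a thread gets a new issue slot only after its previous instruction has reached write-back, so no interlocks or bypasses are needed. The per-thread instruction lists come from the slide; the cycle model is a simplification.

    # Sketch: fixed round-robin interleave of 4 threads on a 5-stage pipe.
    # Each thread issues at most once every 4 cycles, so its previous
    # instruction has written back before the next one reads registers.
    threads = {
        "T1": ["LW   r1, 0(r2)", "LW   r5, 12(r1)"],
        "T2": ["ADD  r7, r1, r4"],
        "T3": ["XORI r5, r4, #12"],
        "T4": ["SW   0(r7), r5"],
    }
    order = list(threads)                 # fixed slot order: T1 T2 T3 T4 T1 ...
    pcs = {t: 0 for t in threads}         # per-thread program counter

    cycle = 0
    while any(pcs[t] < len(threads[t]) for t in threads):
        t = order[cycle % len(order)]     # whose slot is it this cycle?
        if pcs[t] < len(threads[t]):
            print(f"cycle {cycle}: {t} issues {threads[t][pcs[t]]}")
            pcs[t] += 1
        else:
            print(f"cycle {cycle}: bubble ({t} has nothing to issue)")
        cycle += 1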

Simple Multithreaded Pipeline

• Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
• Appears to software (including OS) as multiple, albeit slower, CPUs

[Datapath diagram: four PCs and four register files, selected by a 2-bit thread-select signal, feed a shared pipeline with I$, IR, execute stages X and Y, and D$]

10  

Multithreading Costs

• Each thread requires its own user state
  – PC
  – GPRs

• Also, needs its own system state
  – virtual memory page table base register
  – exception handling registers
  – other system state

• Other overheads:
  – Additional cache/TLB conflicts from competing threads
  – (or add larger cache/TLB capacity)
  – More OS overhead to schedule more threads (where do all these threads come from?)

11  
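
The per-thread state listed above can be pictured as a small record that the hardware replicates once per thread. The Python sketch below is illustrative only; the field names, 32 GPRs, and four contexts are assumptions, not any particular ISA.

    # Illustrative per-thread context: the user and system state the slide
    # says each hardware thread needs its own copy of.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ThreadContext:
        # user state
        pc: int = 0
        gprs: List[int] = field(default_factory=lambda: [0] * 32)  # 32 GPRs assumed
        # system state
        page_table_base: int = 0       # virtual memory page table base register
        exception_regs: List[int] = field(default_factory=lambda: [0] * 4)  # illustrative

    # A 4-way multithreaded core replicates the context and adds a selector
    # ("thread select") that is carried down the pipeline.
    contexts = [ThreadContext() for _ in range(4)]
    active = 0
    contexts[active].pc = 0x1000
    print(contexts[active].pc)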

Thread Scheduling Policies

• Fixed interleave (CDC 6600 PPUs, 1964)
  – Each of N threads executes one instruction every N cycles
  – If thread not ready to go in its slot, insert pipeline bubble
  – Can potentially remove bypassing and interlocking logic

• Software-controlled interleave (TI ASC PPUs, 1971)
  – OS allocates S pipeline slots amongst N threads
  – Hardware performs fixed interleave over S slots, executing whichever thread is in that slot

• Hardware-controlled thread scheduling (HEP, 1982)
  – Hardware keeps track of which threads are ready to go
  – Picks next thread to execute based on hardware priority scheme

12  
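
Here is a rough Python sketch (not from the slides) contrasting two of the policies above: fixed interleave gives each thread its own slot and inserts a bubble if that thread is not ready, while hardware-controlled scheduling picks any ready thread each cycle. The readiness flags and the lowest-index-first priority rule are invented for illustration.

    # Sketch of two thread-scheduling policies from the slide.
    def fixed_interleave(ready, n_threads, cycle):
        """Each of N threads owns one slot every N cycles; bubble if not ready."""
        t = cycle % n_threads
        return t if ready[t] else None          # None = pipeline bubble

    def hw_scheduled(ready, n_threads, cycle):
        """Hardware picks a ready thread; here, lowest index = highest priority."""
        for t in range(n_threads):
            if ready[t]:
                return t
        return None                             # bubble only if nobody is ready

    ready = [True, False, True, True]           # thread 1 not ready (e.g. waiting on memory)
    for c in range(4):
        print(c, "fixed:", fixed_interleave(ready, 4, c),
                 "hw:", hw_scheduled(ready, 4, c))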

Coarse-Grain Hardware Multithreading

• Some architectures do not have many low-latency bubbles

• Add support for a few threads to hide occasional cache miss latency

• Swap threads in hardware on cache miss

13  
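
The following Python sketch (mine, not from the deck) shows the coarse-grain policy in its simplest form: run one thread until it misses in the cache, then swap to the other thread while the miss is outstanding. The miss pattern, miss penalty, and switch cost are made-up numbers.

    # Sketch: coarse-grain multithreading swaps threads on a cache miss.
    MISS_PENALTY = 20     # cycles until a missing thread is ready again (assumed)
    SWITCH_COST = 2       # pipeline refill cost on a swap (assumed)

    traces = {            # True = this access misses in the cache (made up)
        0: [False, False, True, False],
        1: [False, True, False, False],
    }
    ready_at = {t: 0 for t in traces}   # cycle when a missing thread becomes ready
    pos = {t: 0 for t in traces}
    current, cycle = 0, 0

    while any(pos[t] < len(traces[t]) for t in traces):
        if pos[current] < len(traces[current]) and ready_at[current] <= cycle:
            miss = traces[current][pos[current]]
            pos[current] += 1
            if miss:
                ready_at[current] = cycle + MISS_PENALTY
                current = 1 - current           # hardware swaps to the other thread
                cycle += SWITCH_COST
                print(f"cycle {cycle}: miss -> switch to thread {current}")
        else:
            current = 1 - current               # try the other thread
        cycle += 1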

Denelcor HEP (Burton Smith, 1982)

First commercial machine to use hardware threading in main CPU
– 120 threads per processor
– 10 MHz clock rate
– Up to 8 processors
– precursor to Tera MTA / Cray XMT / YARC Data / Cray URIKA

BRL HEP Machine
Image Credit: Denelcor
http://ftp.arl.army.mil/ftp/historic-computers/png/hep2.png

14  

Tera (Cray) MTA (1990)

• Up to 256 processors
• Up to 128 active threads per processor
• Processors and memory modules populate a sparse 3D torus interconnection fabric
• Flat, shared main memory
  – No data cache
  – Sustains one main memory access per cycle per processor
• GaAs logic in prototype, 1 kW/processor @ 260 MHz
  – Second version CMOS, MTA-2, 50 W/processor
  – Newest version, Urika-GD

Image Credit: Tera Computer Company

15  

MTA Pipeline

• Every cycle, one VLIW instruction from one active thread is launched into the pipeline
• Instruction pipeline is 21 cycles long
• Memory operations incur ~150 cycles of latency

[Pipeline diagram labels: issue pool, instruction fetch, stages A, C, M, W; memory pipeline with write pool and retry pool; interconnection network; memory pool]

16  
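
A back-of-the-envelope calculation (not on the slide) of why the MTA needs so many active threads: under the simplifying assumption that each thread has at most one instruction in flight, hiding a latency of L cycles needs roughly L threads ready to issue in the meantime.

    # Rough arithmetic, assuming one instruction in flight per thread.
    pipeline_depth = 21      # cycles, from the slide
    memory_latency = 150     # cycles, approximate, from the slide

    threads_to_cover_pipeline = pipeline_depth   # ~21 threads for ALU-only code
    threads_to_cover_memory = memory_latency     # ~150 threads if every op goes to memory

    print(f"~{threads_to_cover_pipeline} ready threads keep the pipeline full")
    print(f"~{threads_to_cover_memory} threads would hide back-to-back memory ops")
    # The MTA's 128 active threads per processor sit between these two extremes.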

MIT Alewife (1990)

• Modified SPARC chips
  – register windows hold different thread contexts
• Up to four threads per node
• Thread switch on local cache miss

Image Credit: MIT

17

Oracle/Sun Niagara processors

• Target is datacenters running web servers and databases, with many concurrent requests

• Provide multiple simple cores, each with multiple hardware threads; reduced energy/operation, though much lower single-thread performance

• Niagara-1 [2004], 8 cores, 4 threads/core
• Niagara-2 [2007], 8 cores, 8 threads/core
• Niagara-3 [2009], 16 cores, 8 threads/core

18  

Oracle/Sun Niagara-3, "Rainbow Falls", 2009

Image Credit: Oracle/Sun

Image Credit: Oracle/Sun; from Hot Chips 2009 presentation by Sanjay Patel

19

Simultaneous Multithreading (SMT) for OOO Superscalars

• Techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time

• SMT uses fine-grain control already present inside an OOO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. Gives better utilization of machine resources.

20  

Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]

• Interleave multiple threads to multiple issue slots with no restrictions

[Figure: issue-slot grid (issue width × time) with slots in the same cycle filled by different threads]

21  

For most apps, most execution units lie idle in an OOO superscalar

Image From: Tullsen, Eggers, and Levy,“Simultaneous Multithreading: Maximizing On-chip Parallelism”, ISCA 1995.

For an 8-way superscalar.

22  

Superscalar Machine Efficiency

[Figure: issue-slot grid (issue width × time) showing instruction issue; a completely idle cycle is vertical waste, a partially filled cycle (IPC < 4) is horizontal waste]

23  
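
To make vertical and horizontal waste concrete, the short Python sketch below (not from the deck) counts both over a made-up issue trace for a 4-wide machine: completely idle cycles are vertical waste; unused slots in cycles that issued something are horizontal waste.

    # Illustrative only: classify wasted issue slots in a 4-wide machine.
    ISSUE_WIDTH = 4
    issued_per_cycle = [3, 0, 1, 4, 0, 2, 1, 0]   # instructions issued each cycle (made up)

    vertical = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
    horizontal = sum(ISSUE_WIDTH - n for n in issued_per_cycle if 0 < n < ISSUE_WIDTH)
    total = ISSUE_WIDTH * len(issued_per_cycle)

    print(f"total slots      : {total}")
    print(f"vertical waste   : {vertical} slots (completely idle cycles)")
    print(f"horizontal waste : {horizontal} slots (partially filled cycles)")
    print(f"slot utilization : {sum(issued_per_cycle) / total:.0%}")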

Vertical Multithreading

• What is the effect of cycle-by-cycle interleaving?

[Figure: issue-slot grid with a second thread interleaved cycle-by-cycle; partially filled cycles (IPC < 4) remain as horizontal waste]

Vertical Multithreading

• What is the effect of cycle-by-cycle interleaving?
  – removes vertical waste, but leaves some horizontal waste

[Same figure as the previous slide]

25  

Chip Multiprocessing (CMP)

• What is the effect of splitting into multiple processors?

[Figure: issue-slot grid (issue width × time) divided among multiple processors]

26  

Chip Multiprocessing (CMP)

• What is the effect of splitting into multiple processors?
  – reduces horizontal waste,
  – leaves some vertical waste, and
  – puts upper limit on peak throughput of each thread.

[Same figure as the previous slide]

27  

Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]

• Interleave multiple threads to multiple issue slots with no restrictions

[Same figure as slide 21]

28  

OOO Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]

• Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously

• Utilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads

• OOO instruction window already has most of the circuitry required to schedule from multiple threads

• Any single thread can utilize whole machine

29  
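
A highly simplified Python sketch (not DEC/UW's actual design) of the key point above: because the out-of-order issue queue already knows which instructions are ready, issue selection can ignore thread boundaries and pick ready instructions from any thread in the same cycle. The entry format and oldest-first selection rule are assumptions.

    # Sketch: SMT issue stage selects ready instructions regardless of thread.
    ISSUE_WIDTH = 4
    issue_queue = [                                   # oldest first (made-up entries)
        {"thread": 0, "op": "add", "ready": True},
        {"thread": 1, "op": "lw",  "ready": True},
        {"thread": 0, "op": "mul", "ready": False},   # still waiting on an operand
        {"thread": 2, "op": "sub", "ready": True},
        {"thread": 1, "op": "sw",  "ready": True},
        {"thread": 3, "op": "xor", "ready": True},
    ]

    selected = [e for e in issue_queue if e["ready"]][:ISSUE_WIDTH]
    for e in selected:
        print(f"issue T{e['thread']}: {e['op']}")     # threads 0, 1 and 2 issue together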

SMT adaptation to parallelism type

For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads

For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP)

[Figures: two issue-slot grids (issue width × time), one filled by instructions from many threads, one filled by a single thread]

30  

Power 4 and Power 5

[Pipeline diagrams for POWER4 and POWER5; annotations: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets)]

[POWER4 system microarchitecture, Tendler et al., IBM J. Res. & Dev., Jan 2002]
Image Credit: IBM. Courtesy of International Business Machines, © International Business Machines.

[POWER5 system microarchitecture, Sinharoy et al., IBM J. Res. & Dev., Jul/Sept 2005]
Image Credit: IBM. Courtesy of International Business Machines, © International Business Machines.

31  

Power 5 data flow

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

Image Credit: Carsten Schulz

[POWER5 system microarchitecture, Sinharoy et al., IBM J. Res. & Dev., Jul/Sept 2005]
Image Credit: IBM. Courtesy of International Business Machines, © International Business Machines.

32  

Changes in Power 5 to support SMT

• Increased associativity of L1 instruction cache and the instruction address translation buffers
• Added per-thread load and store queues
• Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
• Added separate instruction prefetch and buffering per thread
• Increased the number of virtual registers from 152 to 240
• Increased the size of several issue queues
• The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

33  

Pentium-4 Hyperthreading (2002)

• First commercial SMT design (2-way SMT)
  – Hyperthreading == SMT
• Logical processors share nearly all resources of the physical processor
  – Caches, execution units, branch predictors
• Die area overhead of hyperthreading ~5%
• When one logical processor is stalled, the other can make progress
  – No logical processor can use all entries in queues when two threads are active
• Processor running only one active software thread runs at approximately the same speed with or without hyperthreading
• Hyperthreading dropped on OOO P6-based follow-ons to Pentium-4 (Pentium-M, Core Duo, Core 2 Duo), until revived with Nehalem-generation machines in 2008
• Intel Atom (in-order x86 core) has two-way vertical multithreading

34  

Initial Performance of SMT

• Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
  – Pentium 4 is dual-threaded SMT
  – SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
• Running on Pentium 4 each of 26 SPEC benchmarks paired with every other (26² runs): speedups from 0.90 to 1.58; average was 1.20
• Power 5, 8-processor server: 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
• Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  – Most gained some
  – Floating-point apps had most cache conflicts and least gains

35  

Icount Choosing Policy

Fetch from thread with the least instructions in issue queue.

Why does this enhance throughput?

36  
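
A minimal Python sketch (not the actual Tullsen et al. hardware) of the ICOUNT heuristic: each cycle, fetch from the thread that currently has the fewest instructions in the pre-issue stages. The counts below are made up. One answer to the slide's question: the policy throttles threads whose instructions are piling up (likely stalled) and favors threads that are draining quickly, so issue-queue entries stay spread across threads that can actually use them.

    # Sketch of the ICOUNT fetch-choosing policy.
    in_flight = {"T1": 12, "T2": 3, "T3": 7, "T4": 9}   # instrs in decode/rename/queue (made up)

    def icount_choose(counts):
        """Fetch from the thread with the least instructions waiting to issue."""
        return min(counts, key=counts.get)

    print("fetch from", icount_choose(in_flight))       # -> T2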

Summary: Multithreaded Categories

[Figure: issue slots over time (processor cycles) for five organizations (Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, Simultaneous Multithreading), with slots filled by Thread 1 through Thread 5 or left as idle slots]

37  

Acknowledgements

• These slides contain material developed and copyright by:
  – Arvind (MIT)
  – Krste Asanovic (MIT/UCB)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
  – David Patterson (UCB)
  – Christopher Batten (Cornell)

• MIT material derived from course 6.823
• UCB material derived from course CS252 & CS152
• Cornell material derived from course ECE 4750

38  

Copyright © 2015 David Wentzlaff

39