Page 1:

Lecture XX: Midterm Review

CSE 564 Computer Architecture, Fall 2016

Department of Computer Science and Engineering
Yonghong Yan

[email protected]
www.secs.oakland.edu/~yan

Page 2:

Lecture 01: Introduction

Page 3:

The Instruction Set: a Critical Interface

  software  <->  instruction set  <->  hardware

• Properties of a good abstraction
 – Lasts through many generations (portability)
 – Used in many different ways (generality)
 – Provides convenient functionality to higher levels
 – Permits an efficient implementation at lower levels

Page 4:

Great Ideas in Computer Architectures

1.  Design for Moore’s Law

2.  Use abstraction to simplify design

3.  Make the common case fast

4.  Performance via parallelism

5.  Performance via pipelining

6.  Performance via prediction

7.  Hierarchy of memories

8.  Dependability via redundancy

Page 5:

Great Idea: "Moore's Law"

Gordon Moore, Founder of Intel
• 1965: since the integrated circuit was invented, the number of transistors/inch² in these circuits roughly doubled every year; this trend would continue for the foreseeable future
• 1975: revised - circuit complexity doubles every two years

Image credit: Intel

Page 6:

Moore's Law trends

• More transistors = ↑ opportunities for exploiting parallelism at the instruction level (ILP)
 – Pipeline, superscalar, VLIW (Very Long Instruction Word), SIMD (Single Instruction Multiple Data) or vector, speculation, branch prediction
• General path of scaling
 – Wider instruction issue, longer pipeline
 – More speculation
 – More and larger registers and caches
• Increasing circuit density ~= increasing frequency ~= increasing performance
• Transparent to users
 – An easy job of getting better performance: buying faster processors (higher frequency)
• We have enjoyed this free lunch for several decades; however (TBD) ...

Page 7:

Problems of traditional ILP scaling

• Fundamental circuit limitations¹
 – Delays grow as issue queues and multi-port register files grow
 – Increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism¹
 – Inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies


[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.

Page 8:

Power/heat density limits frequency

• Some fundamental physical limits are being reached

Page 9:

Revolution is happening now

• Chip density is continuing to increase ~2x every 2 years
 – Clock speed is not
 – The number of processor cores may double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software
 – No free lunch

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Page 10:

Architectural Challenges

• Massive (ca. 4X) increase in concurrency
 – Multicore (4 - <100) → Manycores (100s - 1ks)
• Heterogeneity
 – System-level (accelerators) vs chip level (embedded)
• Compute power and memory speed challenges (two walls)
 – 500x compute power and 30x memory of 2PF HW
 – Memory access time lags further behind


[Figure: "Three Eras of Processor Performance" (AMD). Single-Core Era: single-thread performance over time; enabled by Moore's Law, voltage scaling, and microarchitecture; constrained by power and complexity. Multi-Core Era: throughput performance over time (# of processors); enabled by Moore's Law, the desire for throughput, and 20 years of SMP architecture; constrained by power, parallel SW availability, and scalability. Heterogeneous Systems Era: targeted application performance over time (data-parallel exploitation); enabled by Moore's Law, abundant data parallelism, and power-efficient GPUs; currently constrained by programming models and communication overheads. A "we are here" marker sits at the start of the heterogeneous era.]

Source: Chuck Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

Page 11:

Lecture 02: Performance

Page 12:

Dynamic Energy and Power

• Dynamic energy
 – Consumed when a transistor switches from 0 -> 1 or 1 -> 0
• Dynamic power
• Reducing the clock rate reduces power, not energy
• The capacitive load:
 – a function of the number of transistors connected to an output and the technology, which determines the capacitance of the wires and the transistors.
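The energy and power formulas on this slide did not survive extraction; these are the standard textbook forms:

  $\text{Energy}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2$

  $\text{Power}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$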

Page 13:

An Example from Textbook page #21
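The worked numbers on this slide did not survive extraction. Assuming it is the textbook's power-scaling example from that page (a new processor with 85% of the capacitive load of the old one, with voltage and frequency each reduced by 15%), the calculation is:

  $\frac{P_{\text{new}}}{P_{\text{old}}} = 0.85 \times 0.85^2 \times 0.85 = 0.85^4 \approx 0.52$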

Page 14:

Instruction Count and CPI

• Instruction count for a program
 – Determined by program, ISA, and compiler
• Average cycles per instruction (CPI)
 – Determined by CPU hardware
 – If different instructions have different CPIs
   • Average CPI is affected by the instruction mix

  Clock Cycles = Instruction Count × Cycles per Instruction

  CPU Time = Instruction Count × CPI × Clock Cycle Time
           = Instruction Count × CPI / Clock Rate

Page 15:

CPI Example

• Computer A: Cycle Time = 250 ps, CPI = 2.0
• Computer B: Cycle Time = 500 ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

  CPU Time_A = Instruction Count × CPI_A × Cycle Time_A
             = I × 2.0 × 250 ps = 500 ps × I

  CPU Time_B = Instruction Count × CPI_B × Cycle Time_B
             = I × 1.2 × 500 ps = 600 ps × I

A is faster...

  CPU Time_B / CPU Time_A = (600 ps × I) / (500 ps × I) = 1.2

...by this much

Page 16:

CPI in More Detail

•  If different instruction classes take different numbers of cycles

  Clock Cycles = Σ (i = 1 to n) of (CPI_i × Instruction Count_i)

• Weighted average CPI:

  CPI = Clock Cycles / Instruction Count
      = Σ (i = 1 to n) of ( CPI_i × (Instruction Count_i / Instruction Count) )

  where Instruction Count_i / Instruction Count is the relative frequency of class i

Page 17:

CPI Example

• Alternative compiled code sequences using instructions in classes A, B, and C

  Class               A   B   C
  CPI for class       1   2   3
  IC in sequence #1   2   1   2
  IC in sequence #2   4   1   1

• Sequence #1: IC = 5
 – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
 – Avg. CPI = 10/5 = 2.0
• Sequence #2: IC = 6
 – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
 – Avg. CPI = 9/6 = 1.5
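A small C sketch (mine, not the slides') that reproduces the weighted-CPI arithmetic above; the arrays hold the table's per-class CPIs and instruction counts:

  #include <stdio.h>

  /* Weighted average CPI = total clock cycles / total instruction count. */
  static double avg_cpi(const int cpi[], const int ic[], int nclasses) {
      int cycles = 0, count = 0;
      for (int i = 0; i < nclasses; i++) {
          cycles += cpi[i] * ic[i];   /* Clock Cycles = sum of CPI_i x IC_i */
          count  += ic[i];
      }
      return (double)cycles / count;
  }

  int main(void) {
      int cpi[]  = {1, 2, 3};   /* CPI for classes A, B, C */
      int seq1[] = {2, 1, 2};   /* IC in sequence #1 */
      int seq2[] = {4, 1, 1};   /* IC in sequence #2 */
      printf("seq #1 avg CPI = %.1f\n", avg_cpi(cpi, seq1, 3));  /* 2.0 */
      printf("seq #2 avg CPI = %.1f\n", avg_cpi(cpi, seq2, 3));  /* 1.5 */
      return 0;
  }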

Page 18:

Principles of Computer Design

• The Processor Performance Equation

Page 19:

Principles of Computer Design

• Different instruction types have different CPIs

Page 20:

Impacts by Components

                 Inst Count   CPI   Clock Rate
  Program            X
  Compiler           X        (X)
  Inst. Set          X         X
  Architecture                 X        X
  Technology                            X

Page 21:

Principles of Computer Design

• Take advantage of parallelism
 – e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
• Principle of locality
 – Reuse of data and instructions
• Focus on the common case
 – Amdahl's Law

Page 22:

Amdahl's Law

  ExTime_new = ExTime_old × [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

  Speedup_overall = ExTime_old / ExTime_new
                  = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Best you could ever hope to do:

  Speedup_maximum = 1 / (1 − Fraction_enhanced)

Page 23:

Using Amdahl's Law
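The worked example on this slide did not survive extraction; as a stand-in, a minimal C sketch of the formula with made-up numbers (90% of execution time enhanced by a factor of 10):

  #include <stdio.h>

  /* Amdahl's Law: overall speedup when fraction f of execution time
     is enhanced by a factor s. */
  static double amdahl(double f, double s) {
      return 1.0 / ((1.0 - f) + f / s);
  }

  int main(void) {
      /* Hypothetical numbers: f = 0.9, s = 10. */
      printf("overall speedup = %.2f\n", amdahl(0.9, 10.0));  /* 1/(0.1+0.09) = 5.26 */
      return 0;
  }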

Page 24:

Amdahl's Law for Parallelism

• The enhanced fraction F is sped up through parallelism; assume perfect parallelism with linear speedup
 – The speedup for F is N on N processors
• Overall speedup:

  Speedup = 1 / [ (1 − F) + F / N ]

• Speedup upper bound (when N → ∞):

  Speedup_max = 1 / (1 − F)

 – 1 − F: the sequential portion of a program

Page 25:

Lecture 03: ISA

Page 26:

Iron-code Summary

• Section A.2—Use general-purpose registers with a load-store architecture.
• Section A.3—Support these addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register indirect.
• Section A.4—Support these data sizes and types: 8-, 16-, 32-, and 64-bit integers and 64-bit IEEE 754 floating-point numbers.
 – Now we see 16-bit FP for deep learning in GPUs
   • http://www.nextplatform.com/2016/09/13/nvidia-pushes-deep-learning-inference-new-pascal-gpus/
• Section A.5—Support these simple instructions, since they will dominate the number of instructions executed: load, store, add, subtract, move register-register, and shift.
• Section A.6—Compare equal, compare not equal, compare less, branch (with a PC-relative address at least 8 bits long), jump, call, and return.
• Section A.7—Use fixed instruction encoding if interested in performance, and use variable instruction encoding if interested in code size.
• Section A.8—Provide at least 16 general-purpose registers, be sure all addressing modes apply to all data transfer instructions, and aim for a minimalist instruction set.
 – Often use separate floating-point registers.
 – The justification is to increase the total number of registers without raising problems in the instruction format or in the speed of the general-purpose register file. This compromise, however, is not orthogonal.

Page 27:

Lecture 05: Pipeline

Page 28:

RISC Instruction Set

•  Every instruction can be implemented in at most 5 clock cycles
 – Instruction fetch cycle (IF): send the PC to memory, fetch the current instruction from memory, and update the PC to the next sequential PC by adding 4 to the PC.
 – Instruction decode/register fetch cycle (ID): decode the instruction, and read the registers corresponding to the register source specifiers from the register file.
 – Execution/effective address cycle (EX): perform the memory-reference address calculation, register-register ALU operation, or register-immediate ALU operation.
 – Memory access (MEM): perform load/store instructions.
 – Write-back cycle (WB): write the result of a register-register ALU instruction or a load instruction.

Page 29:

Making RISC Pipelining Real

•  Function units are used in different cycles
 – Hence we can overlap the execution of multiple instructions (quantified below)
•  Important things to make it real
 – Separate instruction and data memories, e.g. I-cache and D-cache, or banking
   • Eliminates a conflict for accessing a single memory.
 – The register file is used in two stages (two reads and one write every cycle)
   • Read from the register file in ID (second half of CC), and write to it in WB (first half of CC).
 – PC
   • Increment and store the PC every clock; this is done during the IF stage.
   • A branch does not change the PC until the ID stage (an adder computes the potential branch target).
 – Staging data between pipeline stages
   • Pipeline registers

Page 30:

Pipeline Datapath

• Register file is used in the ID and WB stages
 – Read from the register file in ID (second half of CC), and write to it in WB (first half of CC).
• IM and DM

Page 31:

Pipeline Registers

[Figure: five-stage pipeline datapath (Instruction Fetch; Instr. Decode/Reg. Fetch; Execute/Addr. Calc; Memory Access; Write Back) with pipeline registers for data staging between pipeline stages, named as IF/ID, ID/EX, EX/MEM, and MEM/WB.]

Per-stage register transfers:

  IF:  IR <= mem[PC]; PC <= PC + 4
  ID:  A <= Reg[IRrs]; B <= Reg[IRrt]
  EX:  rslt <= A opIRop B
  MEM: WB <= rslt
  WB:  Reg[IRrd] <= WB

Page 32:

Pipeline Registers

• The edge-triggered property of the registers is critical

Page 33:

Inst. Set Processor Controller

[Figure: multicycle controller state diagram. From Ifetch (IR <= mem[PC]; PC <= PC + 4) through opFetch-DCD (A <= Reg[IRrs]; B <= Reg[IRrt]), control branches by instruction class:
  br:  if bop(A,B) PC <= PC + IRim
  jmp: PC <= IRjaddr
  RR:  r <= A opIRop B;    WB <= r;      Reg[IRrd] <= WB
  RI:  r <= A opIRop IRim; WB <= r;      Reg[IRrd] <= WB
  LD:  r <= A + IRim;      WB <= Mem[r]; Reg[IRrd] <= WB
  plus ST, JSR, and JR paths.]

A branch requires 3 cycles, a store requires 4 cycles, and all other instructions require 5 cycles.

Page 34:

Processor Performance

• Instructions per program depends on source code, compiler technology, and ISA
• Cycles per instruction (CPI) depends on ISA and µarchitecture
• Time per cycle depends upon the µarchitecture and base technology

  CPU Time = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)

Page 35:

RISC-V ISA and Implementations

Page 36:

User Level ISA

• Defines the normal instructions needed for computation
 – A mandatory base integer ISA
   • I: Integer instructions: ALU, branches/jumps, and loads/stores
   • Support for misaligned memory access is mandatory
 – Standard extensions
   • M: Integer Multiplication and Division
   • A: Atomic Instructions
   • F: Single-Precision Floating-Point
   • D: Double-Precision Floating-Point
   • C: Compressed Instructions (16 bit)
   • G = IMAFD: integer base + four standard extensions
 – Optional extensions

Page 37:

Purpose of a Specific Control Signal

Page 38:

Datapath for ALU Instructions

[Figure: datapath for ALU instructions. The PC with a +0x4 adder feeds the Inst. Memory; the GPRs are read at rs1 <19:15> and rs2 <24:20> and written at rd <11:7> (RegWriteEn); Imm Select (ImmSel, immediate from Inst<31:20>) and Op2Sel (Reg / Imm) pick the second ALU operand; ALU Control is driven by func3 <14:12> and the opcode <6:0>.]

  R-type:  func7 (7) | rs2 (5) | rs1 (5) | func3 (3) | rd (5) | opcode (7)    rd ← (rs1) func (rs2)
  I-type:  immediate12 (12)    | rs1 (5) | func3 (3) | rd (5) | opcode (7)    rd ← (rs1) op immediate
  Bit positions: 31..20 | 19..15 | 14..12 | 11..7 | 6..0

Page 39:

Datapath for Load/Store Instructions

rs1 is the base register; rd is the destination of a Load; rs2 is the data source for a Store.

[Figure: datapath for load/store instructions. The ALU adds the "base" register to the displacement (Op2Sel, ImmSel) to form the address for the Data Memory (MemWrite, wdata/rdata); WBSel selects ALU / Mem for the register write-back.]

  Store:  imm (7) | rs2 (5) | rs1 (5) | func3 (3) | imm (5) | opcode (7)    address = (rs1) + displacement
  Load:   immediate12 (12)  | rs1 (5) | func3 (3) | rd (5)  | opcode (7)
  Bit positions: 31..20 | 19..15 | 14..12 | 11..7 | 6..0

Page 40:

Datapath for Conditional Branches (BEQ/BNE/BLT/BGE/BLTU/BGEU)

[Figure: branch datapath. A second adder computes the branch target PC + immediate (Imm Select); the branch comparator (Bcomp? / Br Logic) on the two register operands drives PCSel, which selects between pc+4 and the branch target.]

Page 41:

Data Hazards Summary

• Stall cycles without bypassing
 – 3, 2, or 1, depending on the distance between the two instructions
• RAW dependency between ALU instructions
 – Full bypassing will eliminate all stalls
• Load-use RAW
 – Full bypassing could still have at most a 1-cycle stall for two adjacent instructions:
   • Ld x5, 16(x4)
   • Add x6, x5, x1
• Load-store or store-store
 – No stalls with full bypassing
• Interlock control logic for RAW hazard detection and stall insertion
• Bypassing data path
 – Need to deal with three situations (sketched below)
   • ALU → ALU
   • MEM → ALU
   • WB → ALU
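A pipeline diagram for the distance-1 case (my sketch, consistent with the counts above, assuming no bypassing and that a register is readable only after the writer's WB):

  add x5, x1, x2    IF ID EX MEM WB
  sub x6, x5, x3       IF -- -- --  ID EX MEM WB    (3 stall cycles)

With one independent instruction between them the stall shrinks to 2 cycles, with two to 1 cycle; full bypassing forwards the ALU result straight from EX/MEM and removes all three.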

Page 42:

Interlock Control Logic (ignoring jumps & branches)

[Figure: five-stage datapath with interlock control. The decode-stage source register fields rs1/rs2 (with read enables re1/re2 from Cre) are compared against the destination register fields wa1/wa2/wa3 and write enables we1/we2/we3 (from Cdest) of the instructions in the later stages; on a match, Cstall asserts stall and a bubble is inserted into the pipeline.]

Page 43:

Fully Bypassed Datapath

[Figure: fully bypassed datapath. ASrc and BSrc muxes select the ALU operands among the values in the D, E, M, and W stages, i.e. bypass paths from EX, MEM, and WB back to the ALU inputs; stall/bubble logic remains (for the load-use case), and the PC is passed along for JAL, ...]

Page 44:

Control Hazards Summary

Each instruction fetch depends on one or two pieces of information from the preceding instruction:
 1) Is the preceding instruction a taken branch?
 2) If so, what is the target address?

• JAL: unconditional jump to PC+immediate
• JALR: indirect jump to rs1+immediate
• Branch: if (rs1 conds rs2), branch to PC+immediate

  Instruction   Taken known?         Target known?
  JAL           After Inst. Decode   After Inst. Decode
  JALR          After Inst. Decode   After Reg. Fetch
  B<cond.>      After Execute        After Inst. Decode

Page 45:

Control Hazards Summary

• JAL: unconditional jump to PC+immediate
 – 1 cycle delay in the pipeline
• JALR: indirect jump to rs1+immediate
 – 1 cycle delay
• Branch: if (rs1 conds rs2), branch to PC+immediate
 – 2 cycles delay
• Solutions:
 – Delay slot (not a solution to remove the bubble)
 – BHT and BTB (not a solution either)

Page 46:

Lecture 10/11: Memory Tech, Cache Organization and Performance

Page 47:

Review: Memory Technology and Hierarchy

• Relationships
 – Technology challenge: Memory Wall
 – Program behavior: Principle of Locality
 – Architecture approach: Memory Hierarchy

[Figure: probability of reference across the address space (0 to 2^n − 1), illustrating locality.]

Page 48:

Technology Challenge: Memory Wall

• Growing disparity of processor and memory speed
• DRAM: slow, cheap, and dense
 – Good choice for presenting the user with a BIG memory system
 – Used for main memory
• SRAM: fast, expensive, and not very dense
 – Good choice for providing the user FAST access time
 – Used for cache
• Speed:
 – Latency
 – Bandwidth
 – Memory interleaving

Page 49:

Program Behavior: Principle of Locality

• Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves
 – Spatial locality: items with nearby addresses tend to be referenced close together in time
 – Temporal locality: recently referenced items are likely to be referenced in the near future

Locality example:

  sum = 0;
  for (i = 0; i < n; i++)
      sum += a[i];
  return sum;

• Data
 – Reference array elements in succession (stride-1 reference pattern): spatial locality
 – Reference sum each iteration: temporal locality
• Instructions
 – Reference instructions in sequence: spatial locality
 – Cycle through loop repeatedly: temporal locality

Page 50:

Architecture Approach: Memory Hierarchy

• Keep the most recently accessed data and its adjacent data in the smaller/faster caches that are closer to the processor
• Mechanisms for replacing data

[Figure: memory hierarchy. Processor (registers, datapath, control) → on-chip cache → 2nd/3rd-level cache (SRAM) → main memory (DRAM) → secondary storage (disk) → tertiary storage (tape). Speed (ns): 1s, 10s, 100s, 10,000,000s (10s ms), 10,000,000,000s (10s sec). Size (bytes): 100s, Ks, Ms, Gs, Ts.]

Page 51:

4 Questions for Cache Organization Review

• Q1: Where can a block be placed in the upper level?
 – Block placement
• Q2: How is a block found if it is in the upper level?
 – Block identification
• Q3: Which block should be replaced on a miss?
 – Block replacement
• Q4: What happens on a write?
 – Write strategy

Page 52:

Q1: Where Can a Block be Placed in The Upper Level?

• Block placement
 – Direct mapped, fully associative, set associative
• Direct mapped: (Block number) mod (Number of blocks in cache)
• Set associative: (Block number) mod (Number of sets in cache)
 – # of sets ≤ # of blocks
 – n-way: n blocks in a set
 – 1-way = direct mapped
• Fully associative: # of sets = 1

[Figure: placing block 12 in an 8-block cache. Direct mapped: block 12 can go only into block 4 (12 mod 8). Set associative (4 sets): block 12 can go anywhere in set 0 (12 mod 4). Fully associative: block 12 can go anywhere.]

Page 53:

1 KB Direct Mapped Cache, 32B blocks

• For a 2^N byte cache:
 – The uppermost (32 - N) bits are always the Cache Tag
 – The lowest M bits are the Byte Select (Block Size = 2^M)

[Figure: 1 KB direct-mapped cache with 32B blocks. The 32-bit address splits into Cache Tag (bits 31..10, example 0x50), Cache Index (bits 9..5, example 0x01), and Byte Select (bits 4..0, example 0x00); each of the 32 cache lines stores a Valid bit and the Cache Tag as part of the cache "state", alongside 32 bytes of data (Byte 0..Byte 31, ..., Byte 992..Byte 1023).]
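A C sketch (mine, not the slides') of the tag/index/byte-select split for this 1 KB, 32-byte-block direct-mapped cache; the address 0x00014020 is a hypothetical value chosen to reproduce the slide's example fields (Tag 0x50, Index 0x01, Byte Select 0x00):

  #include <stdint.h>
  #include <stdio.h>

  /* 1 KB direct-mapped cache, 32-byte blocks:
     32-byte block -> 5 byte-select bits; 32 blocks -> 5 index bits; 22 tag bits. */
  #define BYTE_BITS  5u
  #define INDEX_BITS 5u

  int main(void) {
      uint32_t addr  = 0x00014020u;  /* hypothetical address */
      uint32_t byte  = addr & ((1u << BYTE_BITS) - 1);
      uint32_t index = (addr >> BYTE_BITS) & ((1u << INDEX_BITS) - 1);
      uint32_t tag   = addr >> (BYTE_BITS + INDEX_BITS);
      printf("tag=0x%x index=0x%x byte=0x%x\n", tag, index, byte);  /* 0x50 0x1 0x0 */
      return 0;
  }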

Page 54:

Set Associative Cache

• N-way set associative: N entries for each Cache Index
 – N direct mapped caches operate in parallel
• Example: two-way set associative cache
 – Cache Index selects a "set" from the cache;
 – The two tags in the set are compared to the input in parallel;
 – Data is selected based on the tag result.

[Figure: two-way set associative cache. The Cache Index selects one entry (Valid, Cache Tag, Cache Data) from each of the two ways; both tags are compared against the Adr Tag in parallel, the compare results are ORed to form Hit, and Sel1/Sel0 drive a mux that selects the matching way's Cache Block.]

Page 55:

Disadvantage of Set Associative Cache

• N-way set associative cache versus direct mapped cache:
 – N comparators vs. 1
 – Extra MUX delay for the data
 – Data comes AFTER the Hit/Miss decision and set selection
• In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
 – Possible to assume a hit and continue; recover later if it was a miss.

[Figure: the same two-way set associative datapath as on the previous slide.]

Page 56:

Q2: Block Identification

• Tag on each block
 – No need to check index or block offset
• Increasing associativity shrinks the index, expands the tag

  Block Address = | Tag | Index | Block Offset |
                    (Index: set select; Block Offset: data select)

  Cache size = Associativity × 2^index_size × 2^offset_size

Page 57:

Q3: Which block should be replaced on a miss?

• Easy for direct mapped
• Set associative or fully associative:
 – Random
 – LRU (Least Recently Used)
 – First in, first out (FIFO)

Data cache misses per 1000 instructions, by associativity and replacement policy:

           |        2-way        |        4-way        |        8-way
  Size     |  LRU   Ran.  FIFO   |  LRU   Ran.  FIFO   |  LRU   Ran.  FIFO
  16KB     | 114.1  117.3 115.5  | 111.7  115.1 113.3  | 109.0  111.8 110.4
  64KB     | 103.4  104.3 103.9  | 102.4  102.3 103.1  |  99.7  100.5 100.3
  256KB    |  92.2   92.1  92.5  |  92.1   92.1  92.5  |  92.1   92.1  92.5

Page 58:

Q4: What Happens on a Write?

                                                Write-Through              Write-Back
  Policy                                        Data written to cache      1. Write data only to the cache
                                                block, also written to     2. Update lower level when a
                                                lower-level memory            block falls out of the cache
  Debug                                         Easy                       Hard
  Do read misses produce writes?                No                         Yes
  Do repeated writes make it to lower level?    Yes                        No

Additional option -- let writes to an un-cached address allocate a new cache line ("write-allocate").
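A compact C sketch (mine, not the lecture's) of how write-through/write-back combine with the allocate decision on a store; lookup, fill_line, write_cache, write_lower, and mark_dirty are hypothetical helpers of a cache model:

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical helpers, assumed provided by a cache model. */
  bool lookup(uint32_t addr);                   /* true on cache hit */
  void fill_line(uint32_t addr);                /* allocate and fill a line */
  void write_cache(uint32_t addr, uint32_t v);  /* update the cached copy */
  void write_lower(uint32_t addr, uint32_t v);  /* update lower-level memory */
  void mark_dirty(uint32_t addr);               /* defer update to eviction */

  void store(uint32_t addr, uint32_t v, bool write_back, bool write_allocate) {
      if (!lookup(addr)) {               /* write miss */
          if (!write_allocate) {         /* no-write allocate: cache untouched */
              write_lower(addr, v);
              return;
          }
          fill_line(addr);               /* write allocate: act like a read miss */
      }
      write_cache(addr, v);              /* write-hit actions */
      if (write_back)
          mark_dirty(addr);              /* lower level updated when block falls out */
      else
          write_lower(addr, v);          /* write-through: also write lower level */
  }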

Page 59:

Write Buffers for Write-Through Caches

• Q. Why a write buffer?
 – A. So the CPU doesn't stall on writes.
• Q. Why a buffer, why not just one register?
 – A. Bursts of writes are common.
• Q. Are Read After Write (RAW) hazards an issue for the write buffer?
 – A. Yes! Drain the buffer before the next read, or send the read first after checking the write buffer.

[Figure: the processor writes to the cache and a write buffer; the write buffer drains to DRAM.]

Page 60:

Write-Miss Policy

• Two options on a write miss:
 – Write allocate: the block is allocated on a write miss, followed by the write-hit actions.
   • Write misses act like read misses.
 – No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory.
   • Blocks stay out of the cache in no-write allocate until the program tries to read them, but with write allocate even blocks that are only written will still be in the cache.

Page 61:

Write-Miss Policy Example

• Example: assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets): Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100]. What are the number of hits and misses (inclusive of reads and writes) when using no-write allocate versus write allocate?

• Answer:

  No-write allocate:                 Write allocate:
  Write Mem[100];  1 write miss      Write Mem[100];  1 write miss
  Write Mem[100];  1 write miss      Write Mem[100];  1 write hit
  Read  Mem[200];  1 read miss       Read  Mem[200];  1 read miss
  Write Mem[200];  1 write hit       Write Mem[200];  1 write hit
  Write Mem[100];  1 write miss      Write Mem[100];  1 write hit
  4 misses; 1 hit                    2 misses; 3 hits

Page 62:

Cache Performance (1/3)

• Memory stall cycles: the number of cycles during which the processor is stalled waiting for a memory access.
• Rewriting the CPU performance time:

  CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time

• The number of memory stall cycles depends on both the number of misses and the cost per miss, which is called the miss penalty:

  Memory stall cycles = Number of misses × Miss penalty
                      = IC × (Misses / Instruction) × Miss penalty
                      = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty

† The advantage of the last form is that its components can be easily measured.
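A small C sketch (mine) of the last form of the formula; the numbers are from the Example (C-5) two slides below (1.5 memory accesses per instruction, 2% miss rate, 25-cycle penalty):

  #include <stdio.h>

  /* Memory stall cycles = IC x (Memory accesses / Instruction)
                              x Miss rate x Miss penalty */
  static double mem_stall_cycles(double ic, double accesses_per_instr,
                                 double miss_rate, double miss_penalty) {
      return ic * accesses_per_instr * miss_rate * miss_penalty;
  }

  int main(void) {
      /* Per instruction (IC = 1): 1.5 accesses, 2% miss rate, 25-cycle penalty. */
      printf("stall cycles per instruction = %.2f\n",
             mem_stall_cycles(1.0, 1.5, 0.02, 25.0));   /* 0.75 */
      return 0;
  }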

Page 63:

Cache Performance (2/3)

• Miss penalty depends on
 – Prior memory requests or memory refresh;
 – Different clocks of the processor, bus, and memory;
 – Thus, treating the miss penalty as a constant is a simplification.
• Miss rate: the fraction of cache accesses that result in a miss (i.e., the number of accesses that miss divided by the number of accesses).
• Exact formula for reads and writes:

  Memory stall cycles = IC × Reads per instruction × Read miss rate × Read miss penalty
                      + IC × Writes per instruction × Write miss rate × Write miss penalty

† Simplify the complete formula by combining reads and writes:

  Memory stall cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty

Page 64:

Example (C-5)

• Assume we have a computer where the clocks per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?

• Answer:

1. Compute the performance for the computer that always hits:

  CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
                     = (IC × CPI + 0) × Clock cycle
                     = IC × 1.0 × Clock cycle

2. For the computer with the real cache, compute the memory stall cycles:

  Memory stall cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
                      = IC × (1 + 0.5) × 0.02 × 25
                      = IC × 0.75

3. Compute the total performance:

  CPU execution time_cache = (IC × 1.0 + IC × 0.75) × Clock cycle = 1.75 × IC × Clock cycle

4. Compute the performance ratio, which is the inverse of the execution times:

  CPU execution time_cache / CPU execution time = (1.75 × IC × Clock cycle) / (1.0 × IC × Clock cycle) = 1.75

Page 65:

Cache Performance (3/3)

• Usually, miss rate is measured as misses per instruction rather than misses per memory reference:

  Misses / Instruction = (Miss rate × Memory accesses) / Instruction count
                       = Miss rate × (Memory accesses / Instruction)

† The latter formula is useful when you know the average number of memory accesses per instruction.

• For example, converting the miss rate in the previous example into misses per instruction:

  Misses / Instruction = Miss rate × (Memory accesses / Instruction) = 0.02 × 1.5 = 0.030

Page 66:

Example (C-6)

• To show the equivalency between the two miss rate equations, let's redo the example above, this time assuming a miss rate per 1000 instructions of 30. What is the memory stall time in terms of instruction count?

• Answer: recomputing the memory stall cycles:

  Memory stall cycles = Number of misses × Miss penalty
                      = IC × (Misses / Instruction) × Miss penalty
                      = (IC / 1000) × (Misses per 1000 instructions) × Miss penalty
                      = (IC / 1000) × 30 × 25
                      = (IC / 1000) × 750
                      = IC × 0.75