ELEC 669 Low Power Design Techniques Lecture 1
Amirali Baniasadi [email protected]
![Page 2: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/2.jpg)
ELEC 669: Low Power Design Techniques
Instructor: Amirali Baniasadi
Office: EOW 441, only by appointment. Call or email with your schedule.
Email: [email protected]
Office Tel: 721-8613
Web page for this class: http://www.ece.uvic.ca/~amirali/courses/ELEC669/elec669.html
Will use paper reprints
Lecture notes will be posted on the course web page.
![Page 3: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/3.jpg)
Course Structure
Lectures:
• 1-2 weeks on processor review
• 5 weeks on low power techniques
• 6 weeks: discussion, presentation, meetings
Reading: a paper is posted on the web for each week. Bring a 1-page review of the papers.
Presentations: Each student should give two presentations in class.
![Page 4: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/4.jpg)
Course Philosophy
Papers are to be used as a supplement for lectures. (If a topic is not covered in class, or a detail is not presented in class, I expect you to read on your own to learn those details.)
• One Project (50%)
• Presentation (30%): will be announced in advance
• Final Exam: take home (20%)
IMPORTANT NOTE: Must get passing grade in all components to pass the course. Failing any of the three components will result in failing the course.
![Page 5: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/5.jpg)
Project
More on project later
![Page 6: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/6.jpg)
Topics
• High Performance Processors?
• Low-Power Design
• Low-Power Branch Prediction
• Low-Power Register Renaming
• Low-Power SRAMs
• Low-Power Front-End
• Low-Power Back-End
• Low-Power Issue Logic
• Low-Power Commit
• AND more…
![Page 7: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/7.jpg)
A Modern Processor
Fetch → Decode → Issue → Complete → Commit
Front-end | Back-end
1- What does each do?
2- Possible power optimizations?
![Page 8: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/8.jpg)
Power Breakdown
| Processor | Front-end | Back-end | Rest |
| --- | --- | --- | --- |
| PentiumPro | 28% | 35% | 37% |
| Alpha 21464 | 6% | 68% | 26% |
![Page 9: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/9.jpg)
Instruction Set Architecture (ISA)
Instruction Execution Cycle:
• Fetch instruction from memory
• Decode instruction: determine its size & action
• Fetch operand data
• Execute instruction & compute results or status
• Store result in memory
• Determine next instruction’s address
![Page 10: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/10.jpg)
What Should we Know?
A specific ISA (MIPS)
Performance issues - vocabulary and motivation
Instruction-Level Parallelism
How to Use Pipelining to improve performance
Exploiting Instruction-Level Parallelism w/ Dynamic Approach
Memory: caches and virtual memory
![Page 11: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/11.jpg)
What is Expected From You?
• Read papers!
• Be up-to-date!
• Come back with your input & questions for discussion!
![Page 12: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/12.jpg)
Power?
Everything is done by tiny switches
• Their charge represents logic values
• Changing charge → energy
• Power → energy over time
• Devices are non-ideal → power → heat
• Excess heat → circuits break down
Need to keep power within acceptable limits
![Page 13: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/13.jpg)
POWER in the real world
[Figure: power density in W/cm2 on a logarithmic axis from 1 to 1000]
![Page 14: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/14.jpg)
Power as a Performance Limiter
Conventional Performance Scaling:
Goal: Max. performance w/ min. cost/complexity
How:
– More and faster transistors
– More complex structures
Power: Don’t fix it if it ain’t broken
Not true anymore: power has increased rapidly
Power-aware architecture is a necessity
![Page 15: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/15.jpg)
Power-Aware Architecture
Conventional Architecture:
Goal: Max. performance
How: Do as much as you can.
This Work: Power-Aware Architecture
Goal: Min. Power and Maintain Performance
How: Do as little as you can, while maintaining performance
Challenging and new area
![Page 16: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/16.jpg)
Why is this challenging?
• Identify actions that can be delayed/eliminated
• Don’t touch those that boost performance
• Cost/power of doing so must not outweigh the benefits
![Page 17: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/17.jpg)
Definitions
Performance is in units of things-per-second: bigger is better
If we are primarily concerned with response time:
$$\text{performance}(x) = \frac{1}{\text{execution\_time}(x)}$$
"X is n times faster than Y" means
$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$
![Page 18: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/18.jpg)
Amdahl's Law
Speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
$$\text{ExTime}(\text{with } E) = \left((1-F) + \frac{F}{S}\right) \times \text{ExTime}(\text{without } E)$$
$$\text{Speedup}(\text{with } E) = \frac{\text{ExTime}(\text{without } E)}{\left((1-F) + \frac{F}{S}\right) \times \text{ExTime}(\text{without } E)} = \frac{1}{(1-F) + \frac{F}{S}}$$
![Page 19: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/19.jpg)
Amdahl's Law-example
A new CPU makes the computation in Web serving 10 times faster. The old CPU spent 40% of the time on computation and 60% on waiting for I/O. What is the overall speedup?
Fraction enhanced = 0.4
Speedup enhanced = 10
$$\text{Speedup}_{\text{overall}} = \frac{1}{0.6 + \frac{0.4}{10}} = \frac{1}{0.64} \approx 1.56$$
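The arithmetic in this example can be checked with a short Python sketch. The function names `amdahl_speedup` and `amdahl_limit` are illustrative choices, not part of the lecture; the second function shows the bound that the unimproved part of the task imposes when S grows without limit.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction F of the task is accelerated by a factor S."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def amdahl_limit(fraction_enhanced: float) -> float:
    """Upper bound on the overall speedup as S grows without bound (requires F < 1)."""
    return 1.0 / (1.0 - fraction_enhanced)


# Web serving example from the slide: F = 0.4, S = 10.
web_serving = amdahl_speedup(0.4, 10)   # 1 / (0.6 + 0.04)
best_case = amdahl_limit(0.4)           # 1 / 0.6, reached only with an infinitely fast CPU
print(f"overall speedup: {web_serving:.4f}, upper bound: {best_case:.4f}")
```

Even an infinitely fast CPU could not push this workload past about 1.67x, because the 60% spent waiting for I/O is untouched.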
![Page 20: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/20.jpg)
Why Do Benchmarks?
• How we evaluate differences
  – Different systems
  – Changes to a single system
• Provide a target
  – Benchmarks should represent a large class of important programs
  – Improving benchmark performance should help many programs
• For better or worse, benchmarks shape a field
• Good ones accelerate progress
  – good target for development
• Bad benchmarks hurt progress
  – help real programs v. sell machines/papers?
  – Inventions that help real programs don’t help benchmark
![Page 21: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/21.jpg)
SPEC first round
• First round 1989; 10 programs, single number to summarize performance
• One program: 99% of time in single line of code
• New front-end compiler could improve dramatically
[Figure: bar chart of SPEC Perf (axis 0 to 800) by benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]
![Page 22: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/22.jpg)
SPEC95
• Eighteen application benchmarks (with inputs) reflecting a technical computing workload
• Eight integer: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
• Ten floating-point intensive: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
• Must run with standard compiler flags
  – eliminate special undocumented incantations that may not even generate working code for real programs
![Page 23: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/23.jpg)
Summary
• Time is the measure of computer performance!
• Remember Amdahl’s Law: improvement is limited by the unimproved part of the program
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
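As a quick illustration of the CPU time equation, the sketch below multiplies the three factors. The function name and the example numbers (one billion instructions, a CPI of 2, a 500 MHz clock) are assumptions chosen for illustration and do not come from the lecture.

```python
def cpu_time_seconds(instruction_count: float,
                     cycles_per_instruction: float,
                     clock_rate_hz: float) -> float:
    """CPU time = Instructions/Program x Cycles/Instruction x Seconds/Cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cycles_per_instruction * seconds_per_cycle


# Hypothetical workload: 1e9 instructions, CPI = 2.0, 500 MHz clock.
baseline = cpu_time_seconds(1e9, 2.0, 500e6)
# Halving the CPI (for example through multiple issue) halves the CPU time.
improved = cpu_time_seconds(1e9, 1.0, 500e6)
print(f"baseline: {baseline:.2f} s, improved: {improved:.2f} s")
```

Each factor can be attacked separately: the compiler and ISA affect the instruction count, the microarchitecture affects CPI, and the circuit technology affects the cycle time.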
![Page 24: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/24.jpg)
Execution Cycle
• Instruction Fetch: Obtain instruction from program storage
• Instruction Decode: Determine required actions and instruction size
• Operand Fetch: Locate and obtain operand data
• Execute: Compute result value or status
• Result Store: Deposit results in storage for later use
• Next Instruction: Determine successor instruction
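The six steps of the execution cycle can be sketched as a toy interpreter. Everything about this machine is an assumption made for illustration: the tuple encoding, the five opcodes (`LOAD`, `ADD`, `ADDI`, `STORE`, `BNEZ`), and the register names are hypothetical and are not the MIPS ISA used in the course.

```python
from collections import defaultdict


def run(program, memory):
    """Interpret a toy program; each instruction is a tuple (opcode, a, b, c)."""
    registers = defaultdict(int)        # all registers start at zero
    pc = 0
    while pc < len(program):
        instruction = program[pc]       # instruction fetch
        opcode, a, b, c = instruction   # instruction decode
        next_pc = pc + 1                # default successor instruction
        if opcode == "LOAD":            # operand fetch from memory, result store in a register
            registers[a] = memory[b]
        elif opcode == "ADD":           # operand fetch from registers, execute, result store
            registers[a] = registers[b] + registers[c]
        elif opcode == "ADDI":          # same as ADD, with an immediate operand
            registers[a] = registers[b] + c
        elif opcode == "STORE":         # result store to memory
            memory[a] = registers[b]
        elif opcode == "BNEZ":          # next instruction: branch to index a if register b != 0
            if registers[b] != 0:
                next_pc = a
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
        pc = next_pc
    return memory


# Sum n + (n-1) + ... + 1 with a countdown loop.
SUM_PROGRAM = [
    ("LOAD", "r1", "n", None),    # r1 = memory["n"]
    ("ADD", "r2", "r2", "r1"),    # r2 += r1
    ("ADDI", "r1", "r1", -1),     # r1 -= 1
    ("BNEZ", 1, "r1", None),      # loop back to the ADD while r1 != 0
    ("STORE", "sum", "r2", None), # memory["sum"] = r2
]
result = run(SUM_PROGRAM, {"n": 4})
print(result["sum"])
```

A real ISA has to pin down exactly the choices this sketch makes informally: the encoding, where operands live, and how the successor instruction is determined.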
![Page 25: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/25.jpg)
What Must be Specified?
[Figure: execution cycle (Instruction Fetch, Instruction Decode, Operand Fetch, Execute, Result Store, Next Instruction)]
° Instruction Format or Encoding
– how is it decoded?
° Location of operands and result
– where other than memory?
– how many explicit operands?
– how are memory operands located?
– which can or cannot be in memory?
° Data type and Size
° Operations
– what are supported
° Successor instruction
– jumps, conditions, branches
![Page 26: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/26.jpg)
What Is ILP?
• Principle: Many instructions in the code do not depend on each other
• Result: Possible to execute them in parallel
• ILP: Potential overlap among instructions (so they can be evaluated in parallel)
• Issues:
  – Building compilers to analyze the code
  – Building special/smarter hardware to handle the code
• ILP: Increase the amount of parallelism exploited among instructions
• Seeks good results out of pipelining
![Page 27: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/27.jpg)
What Is ILP?
| CODE A | CODE B |
| --- | --- |
| LD R1, (R2)100 | LD R1, (R2)100 |
| ADD R4, R1 | ADD R4, R1 |
| SUB R5, R1 | SUB R5, R4 |
| CMP R1, R2 | SW R5, (R2)100 |
| ADD R3, R1 | LD R1, (R2)100 |
Code A: Possible to execute 4 instructions in parallel.
Code B: Can’t execute more than one instruction per cycle.
Code A has higher ILP
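The difference between Code A and Code B can be made concrete with a small dependence analysis. This is a sketch under stated assumptions: the two-operand instructions are taken to read and write their first operand, `CMP` is taken to write a `FLAGS` pseudo-register, memory is modelled as a single pseudo-register `MEM`, and only true (read-after-write) dependences are tracked. The read and write sets below were written by hand from the listings.

```python
from collections import Counter

# Each entry: (text, registers read, registers written).
CODE_A = [
    ("LD  R1,(R2)100", {"R2", "MEM"}, {"R1"}),
    ("ADD R4,R1",      {"R4", "R1"},  {"R4"}),
    ("SUB R5,R1",      {"R5", "R1"},  {"R5"}),
    ("CMP R1,R2",      {"R1", "R2"},  {"FLAGS"}),
    ("ADD R3,R1",      {"R3", "R1"},  {"R3"}),
]
CODE_B = [
    ("LD  R1,(R2)100", {"R2", "MEM"}, {"R1"}),
    ("ADD R4,R1",      {"R4", "R1"},  {"R4"}),
    ("SUB R5,R4",      {"R5", "R4"},  {"R5"}),
    ("SW  R5,(R2)100", {"R5", "R2"},  {"MEM"}),
    ("LD  R1,(R2)100", {"R2", "MEM"}, {"R1"}),
]


def dependence_levels(program):
    """Earliest step for each instruction when only RAW dependences constrain it."""
    last_writer_level = {}
    levels = []
    for _text, reads, writes in program:
        level = 1 + max(
            (last_writer_level[name] for name in reads if name in last_writer_level),
            default=-1,
        )
        levels.append(level)
        for name in writes:
            last_writer_level[name] = level
    return levels


def max_parallel(program):
    """Largest number of instructions that share the same dependence level."""
    return max(Counter(dependence_levels(program)).values())


print(dependence_levels(CODE_A), max_parallel(CODE_A))
print(dependence_levels(CODE_B), max_parallel(CODE_B))
```

In Code A every instruction after the load depends only on R1, so four of them land on the same level; in Code B each instruction consumes the result of the one before it, so the levels form a chain.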
![Page 28: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/28.jpg)
Out of Order Execution
Programmer: Instructions execute in-order
Processor: Instructions may execute in any order if results remain the same at the end
In-Order:
A: LD R1, (R2)
B: ADD R3, R4
C: ADD R3, R5
D: CMP R3, R1
Out-of-Order:
B: ADD R3, R4
C: ADD R3, R5
A: LD R1, (R2)
D: CMP R3, R1
[Figure: dependence graph over A, B, C, D]
![Page 29: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/29.jpg)
Assumptions
• Five-stage integer pipeline
• Branches have a delay of one clock cycle
  – ID stage: comparisons done, decisions made and PC loaded
• No structural hazards
  – Functional units are fully pipelined or replicated (as many times as the pipeline depth)
• FP latencies:
| Source instruction | Dependent instruction | Latency (clock cycles) |
| --- | --- | --- |
| FP ALU op | Another FP ALU op | 3 |
| FP ALU op | Store double | 2 |
| Load double | FP ALU op | 1 |
| Load double | Store double | 0 |
• Integer load latency: 1; integer ALU operation latency: 0
![Page 30: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/30.jpg)
Simple Loop & Assembler Equivalent
for (i=1000; i>0; i--) x[i] = x[i] + s;
Loop: LD F0, 0(R1) ;F0=array element
ADDD F4, F0, F2 ;add scalar in F2
SD F4, 0(R1) ;store result
SUBI R1, R1, #8 ;decrement pointer 8 bytes (DW)
BNE R1, R2, Loop ;branch R1!=R2
• x[i] & s are double/floating point type
• R1 initially address of array element with the highest address
• F2 contains the scalar value s
• Register R2 is pre-computed so that 8(R2) is the last element to operate on
![Page 31: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/31.jpg)
Where are the stalls?
Unscheduled:
Loop: LD F0, 0(R1)
stall
ADDD F4, F0, F2
stall
stall
SD F4, 0(R1)
SUBI R1, R1, #8
stall
BNE R1, R2, Loop
stall
10 clock cycles. Can we minimize?
Scheduled:
Loop: LD F0, 0(R1)
SUBI R1, R1, #8
ADDD F4, F0, F2
stall
BNE R1, R2, Loop
SD F4, 8(R1)
6 clock cycles (3 cycles: actual work; 3 cycles: overhead). Can we minimize further?
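The 10-cycle and 6-cycle counts can be reproduced with a small in-order issue model. This is a sketch, not the lecture's method: it assumes one instruction issues per cycle, a dependent instruction waits for the latency listed in the FP latency table, an unfilled branch delay slot costs one cycle, and (an assumption inferred from the stall shown between SUBI and BNE) a branch waits one extra cycle for an integer ALU result because it is resolved in ID. The instruction kinds and tuple encoding are hypothetical.

```python
# Stall cycles between a producer kind and a dependent consumer kind.
LATENCY = {
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("intalu", "branch"): 1,   # assumption inferred from the SUBI/BNE stall on the slide
}


def cycles(program):
    """Cycles for one loop body; each instruction is (kind, destination, sources)."""
    producer = {}              # register -> (issue cycle, kind) of its latest writer
    issue = 0
    for kind, dest, sources in program:
        earliest = issue + 1   # in-order, at most one instruction per cycle
        for reg in sources:
            if reg in producer:
                p_cycle, p_kind = producer[reg]
                earliest = max(earliest, p_cycle + 1 + LATENCY.get((p_kind, kind), 0))
        issue = earliest
        if dest is not None:
            producer[dest] = (issue, kind)
    # A branch at the very end leaves its delay slot empty, which costs one cycle.
    return issue + 1 if program[-1][0] == "branch" else issue


UNSCHEDULED = [
    ("load", "F0", ["R1"]),            # LD   F0, 0(R1)
    ("fpalu", "F4", ["F0", "F2"]),     # ADDD F4, F0, F2
    ("store", None, ["F4", "R1"]),     # SD   F4, 0(R1)
    ("intalu", "R1", ["R1"]),          # SUBI R1, R1, #8
    ("branch", None, ["R1", "R2"]),    # BNE  R1, R2, Loop
]
SCHEDULED = [
    ("load", "F0", ["R1"]),            # LD   F0, 0(R1)
    ("intalu", "R1", ["R1"]),          # SUBI R1, R1, #8
    ("fpalu", "F4", ["F0", "F2"]),     # ADDD F4, F0, F2
    ("branch", None, ["R1", "R2"]),    # BNE  R1, R2, Loop
    ("store", None, ["F4", "R1"]),     # SD   F4, 8(R1), in the branch delay slot
]
print(cycles(UNSCHEDULED), cycles(SCHEDULED))
```

The model places the remaining stall of the scheduled loop just before the store rather than before the branch, but the total per iteration is the same.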
![Page 32: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/32.jpg)
Loop Unrolling
Four copies of loop (the original loop body, repeated four times):
LD F0, 0(R1)
ADDD F4, F0, F2
SD F4, 0(R1)
SUBI R1, R1, #8
BNE R1, R2, Loop
Four iteration code:
LD F0, 0(R1)
ADDD F4, F0, F2
SD F4, 0(R1)
SUBI R1, R1, #8
BNE R1, R2, Loop
LD F0, -8(R1)
ADDD F4, F0, F2
SD F4, -8(R1)
SUBI R1, R1, #8
BNE R1, R2, Loop
LD F0, -16(R1)
ADDD F4, F0, F2
SD F4, -16(R1)
SUBI R1, R1, #8
BNE R1, R2, Loop
LD F0, -24(R1)
ADDD F4, F0, F2
SD F4, -24(R1)
SUBI R1, R1, #32
BNE R1, R2, Loop
Eliminate Incr, Branch:
Loop: LD F0, 0(R1)
ADDD F4, F0, F2
SD F4, 0(R1)
LD F6, -8(R1)
ADDD F8, F6, F2
SD F8, -8(R1)
LD F10, -16(R1)
ADDD F12, F10, F2
SD F12, -16(R1)
LD F14, -24(R1)
ADDD F16, F14, F2
SD F16, -24(R1)
SUBI R1, R1, #32
BNE R1, R2, Loop
Assumption: R1 is initially a multiple of 32 or number of loop iterations is a multiple of 4
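The same transformation can be shown at source level. The sketch below is an illustration in Python rather than the assembly used on the slides; the function names are hypothetical. It performs four element updates per trip and relies on the same assumption as the slide, that the iteration count is a multiple of 4.

```python
def add_scalar_rolled(x, s):
    """Original loop: one element, one index decrement and one test per iteration."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1


def add_scalar_unrolled(x, s):
    """Loop unrolled four times: one index decrement and one test per four elements."""
    if len(x) % 4 != 0:
        raise ValueError("element count must be a multiple of 4")
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s          # offset 0
        x[i - 1] = x[i - 1] + s  # offset -1 element (-8 bytes in the assembly)
        x[i - 2] = x[i - 2] + s  # offset -2 elements (-16 bytes)
        x[i - 3] = x[i - 3] + s  # offset -3 elements (-24 bytes)
        i -= 4                   # single decrement, like SUBI R1, R1, #32


rolled = [float(v) for v in range(8)]
unrolled = list(rolled)
add_scalar_rolled(rolled, 2.5)
add_scalar_unrolled(unrolled, 2.5)
print(rolled == unrolled)
```

Handling iteration counts that are not a multiple of 4 would need a short clean-up loop, which is omitted here.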
![Page 33: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/33.jpg)
Loop Unroll & Schedule
Unscheduled:
Loop: LD F0, 0(R1)
stall
ADDD F4, F0, F2
stall
stall
SD F4, 0(R1)
LD F6, -8(R1)
stall
ADDD F8, F6, F2
stall
stall
SD F8, -8(R1)
LD F10, -16(R1)
stall
ADDD F12, F10, F2
stall
stall
SD F12, -16(R1)
LD F14, -24(R1)
stall
ADDD F16, F14, F2
stall
stall
SD F16, -24(R1)
SUBI R1, R1, #32
stall
BNE R1, R2, Loop
stall
28 clock cycles or 7 per iteration. Can we minimize further?
Scheduled:
Loop: LD F0, 0(R1)
LD F6, -8(R1)
LD F10, -16(R1)
LD F14, -24(R1)
ADDD F4, F0, F2
ADDD F8, F6, F2
ADDD F12, F10, F2
ADDD F16, F14, F2
SD F4, 0(R1)
SD F8, -8(R1)
SD F12, -16(R1)
SUBI R1, R1, #32
BNE R1, R2, Loop
SD F16, 8(R1)
No stalls! 14 clock cycles or 3.5 per iteration. Can we minimize further?
![Page 34: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/34.jpg)
Summary
Cycles per iteration:
• Original loop: 10 cycles
• Scheduling: 6 cycles
• Unrolling: 7 cycles
• Unrolling + scheduling: 3.5 cycles (no stalls)
![Page 35: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/35.jpg)
Multiple Issue
• Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.
• Superscalar processors
• Very Long Instruction Word (VLIW) processors
![Page 36: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/36.jpg)
A Modern Processor
Fetch → Decode → Issue → Complete → Commit
Front-end | Back-end
Multiple Issue
![Page 37: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/37.jpg)
1990’s: Superscalar Processors
• Bottleneck: CPI >= 1
  – Limit on scalar performance (single instruction issue)
  – Hazards
  – Superpipelining? Diminishing returns (hazards + overhead)
• How can we make the CPI = 0.5?
  – Multiple instructions in every pipeline stage (super-scalar)
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Inst0 | IF | ID | EX | MEM | WB | | |
| Inst1 | IF | ID | EX | MEM | WB | | |
| Inst2 | | IF | ID | EX | MEM | WB | |
| Inst3 | | IF | ID | EX | MEM | WB | |
| Inst4 | | | IF | ID | EX | MEM | WB |
| Inst5 | | | IF | ID | EX | MEM | WB |
![Page 38: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/38.jpg)
Elements of Advanced Superscalars
• High performance instruction fetching
  – Good dynamic branch and jump prediction
  – Multiple instructions per cycle, multiple branches per cycle?
• Scheduling and hazard elimination
  – Dynamic scheduling
  – Not necessarily: Alpha 21064 & Pentium were statically scheduled
  – Register renaming to eliminate WAR and WAW
• Parallel functional units, paths/buses/multiple register ports
• High performance memory systems
• Speculative execution
![Page 39: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/39.jpg)
SS + DS + Speculation
Superscalar + Dynamic scheduling + Speculation: three great tastes that taste great together
• CPI >= 1?
  – Overcome with superscalar
• Superscalar increases hazards
  – Overcome with dynamic scheduling
• RAW dependences still a problem?
  – Overcome with a large window
• Branches a problem for filling large window?
  – Overcome with speculation
![Page 40: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/40.jpg)
The Big Picture
[Figure: static program → fetch & branch predict → issue → execution → reorder & commit]
![Page 41: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/41.jpg)
Superscalar Microarchitecture
[Figure: block diagram with pre-decode, inst. cache and inst. buffer; decode, rename, dispatch; floating point inst. buffer and integer/address inst. buffer; functional units; functional units and data cache; memory interface; integer register file and floating point register file; reorder and commit]
![Page 42: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/42.jpg)
Register renaming methods
• First method: physical register file vs. logical (architectural) register file
  – Mapping table used to associate a physical register with the current value of a logical register
  – Use a free list of physical registers
  – Physical register file is bigger than the logical register file
• Second method: physical register file same size as logical
  – Also, use a buffer with one entry per instruction: the reorder buffer
![Page 43: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/43.jpg)
Register Renaming Example
Unscheduled:
Loop: LD F0, 0(R1)
stall
ADDD F4, F0, F2
stall
stall
SD F4, 0(R1)
LD F6, -8(R1)
stall
ADDD F8, F6, F2
stall
stall
SD F8, -8(R1)
LD F10, -16(R1)
stall
ADDD F12, F10, F2
stall
stall
SD F12, -16(R1)
LD F14, -24(R1)
stall
ADDD F16, F14, F2
stall
stall
SD F16, -24(R1)
SUBI R1, R1, #32
stall
BNE R1, R2, Loop
stall
28 clock cycles or 7 per iteration. Can we minimize further?
Scheduled:
Loop: LD F0, 0(R1)
LD F6, -8(R1)
LD F10, -16(R1)
LD F14, -24(R1)
ADDD F4, F0, F2
ADDD F8, F6, F2
ADDD F12, F10, F2
ADDD F16, F14, F2
SD F4, 0(R1)
SD F8, -8(R1)
SD F12, -16(R1)
SUBI R1, R1, #32
BNE R1, R2, Loop
SD F16, 8(R1)
No stalls! 14 clock cycles or 3.5 per iteration. Can we minimize further?
![Page 44: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/44.jpg)
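Register renaming: first method
Instruction: add r3, r3, 4
| Logical register | Mapping table (before) | Mapping table (after) |
| --- | --- | --- |
| r0 | R8 | R8 |
| r1 | R7 | R7 |
| r2 | R5 | R5 |
| r3 | R1 | R2 |
| r4 | R9 | R9 |
Free list (before): R2, R6, R13
Free list (after): R6, R13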
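The mapping-table and free-list update can be sketched in a few lines of Python. The `Renamer` class and its method names are hypothetical, and the initial mapping follows one reading of the figure in which r3 maps to R1 before the instruction is renamed. Freeing the old physical register once it is no longer needed is not modelled.

```python
from collections import deque


class Renamer:
    """Rename logical registers with a mapping table and a free list."""

    def __init__(self, mapping, free_list):
        self.mapping = dict(mapping)        # logical register -> physical register
        self.free_list = deque(free_list)   # unused physical registers

    def rename(self, dest, sources):
        # Sources are looked up first, so a source equal to the destination
        # still refers to the old physical register. Immediates pass through.
        renamed_sources = [self.mapping.get(src, src) for src in sources]
        if not self.free_list:
            raise RuntimeError("no free physical register: rename must stall")
        new_dest = self.free_list.popleft()
        self.mapping[dest] = new_dest
        return new_dest, renamed_sources


renamer = Renamer(
    {"r0": "R8", "r1": "R7", "r2": "R5", "r3": "R1", "r4": "R9"},
    ["R2", "R6", "R13"],
)
new_dest, new_sources = renamer.rename("r3", ["r3", 4])   # add r3, r3, 4
print(f"add {new_dest}, {new_sources[0]}, {new_sources[1]}")
print(list(renamer.free_list))
```

Because the destination always receives a fresh physical register, later writers of r3 cannot clobber a value that earlier readers still need, which is how renaming removes WAR and WAW hazards.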
![Page 45: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/45.jpg)
Superscalar Processors
• Issues varying number of instructions per clock
• Scheduling: Static (by the compiler) or dynamic(by the hardware)
• Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo).
• IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
![Page 46: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/46.jpg)
More Realistic HW: Register Impact
Effect of limiting the number of renaming registers
[Figure: bar chart of instruction issues per cycle (IPC, axis 0 to 60) for gcc, espresso, li, fpppp, doduc, tomcatv, with Infinite, 256, 128, 64, 32, and no renaming registers]
Integer: 5 - 15
FP: 11 - 45
![Page 47: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/47.jpg)
Reorder Buffer
Place data in entry when execution finished
Reserve entry at tail when dispatched
Remove from head when complete
Bypass to other instructions when needed
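A minimal sketch of these reorder buffer operations, assuming a simple FIFO of entries: the class and method names are hypothetical, and bypassing values to other instructions is left out to keep the example short.

```python
from collections import deque


class ReorderBuffer:
    """FIFO of in-flight instructions that retires them in program order."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()

    def dispatch(self, name):
        """Reserve an entry at the tail when an instruction is dispatched."""
        if len(self.entries) >= self.size:
            raise RuntimeError("reorder buffer full: dispatch must stall")
        entry = {"name": name, "done": False, "value": None}
        self.entries.append(entry)
        return entry

    def finish(self, entry, value):
        """Place the result in the entry when execution finishes (any order)."""
        entry["done"] = True
        entry["value"] = value

    def commit(self):
        """Remove completed entries from the head, stopping at the first unfinished one."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["name"])
        return committed


rob = ReorderBuffer(size=4)
a, b, c = rob.dispatch("A"), rob.dispatch("B"), rob.dispatch("C")
rob.finish(b, 7)             # B finishes first, out of order
first = rob.commit()         # nothing retires: A at the head is not done
rob.finish(a, 3)
second = rob.commit()        # A and B retire together, in program order
rob.finish(c, 9)
third = rob.commit()
print(first, second, third)
```

Execution may complete in any order, but results leave the buffer strictly from the head, which preserves the in-order view the programmer expects.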
![Page 48: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/48.jpg)
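Register renaming: reorder buffer
Before: add r3, r3, 4
After: add r3, rob6, 4 (renamed: add rob8, rob6, 4)
| Logical register | Mapping table (before) | Mapping table (after) |
| --- | --- | --- |
| r0 | R8 | R8 |
| r1 | R7 | R7 |
| r2 | R5 | R5 |
| r3 | rob6 | rob8 |
| r4 | R9 | R9 |
Reorder buffer (before): entries 7, 6, …, 0; entry 6 holds r3
Reorder buffer (after): entries 8, 7, 6, …, 0; entries 8 and 6 hold r3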
![Page 49: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/49.jpg)
Instruction Buffers
[Figure: superscalar block diagram with pre-decode, inst. cache and inst. buffer; decode, rename, dispatch; floating point inst. buffer and integer/address inst. buffer; functional units; functional units and data cache; memory interface; integer register file and floating point register file; reorder and commit]
![Page 50: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/50.jpg)
Issue Buffer Organization
a) Single, shared queue
  – No out-of-order
  – No renaming
b) Multiple queues; one per inst. type
  – No out-of-order inside queues
  – Queues issue out of order
![Page 51: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/51.jpg)
Issue Buffer Organization
c) Multiple reservation stations (one per instruction type or big pool)
  – No FIFO ordering
  – Ready operands, hardware available → execution starts
  – Proposed by Tomasulo
[Figure: reservation stations fed from instruction dispatch]
![Page 52: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/52.jpg)
Typical reservation station
Operation | source 1 | data 1 | valid 1 | source 2 | data 2 | valid 2 | destination
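The fields listed above map naturally onto a small record type. The sketch below is one possible reading, not the lecture's design: the `capture` and `ready` methods and the `rob6`/`rob7`/`rob8` tags are hypothetical, and a source field is assumed to hold the tag of the instruction that will produce the operand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    operation: str
    source1: str                  # tag of the producer of operand 1
    source2: str                  # tag of the producer of operand 2
    destination: str              # tag under which the result will be broadcast
    data1: Optional[int] = None
    valid1: bool = False
    data2: Optional[int] = None
    valid2: bool = False

    def capture(self, tag: str, value: int) -> None:
        """Latch a broadcast result into every operand still waiting on this tag."""
        if not self.valid1 and self.source1 == tag:
            self.data1, self.valid1 = value, True
        if not self.valid2 and self.source2 == tag:
            self.data2, self.valid2 = value, True

    def ready(self) -> bool:
        """The instruction may start once both operands are valid."""
        return self.valid1 and self.valid2


station = ReservationStation("ADD", source1="rob6", source2="rob7", destination="rob8")
station.capture("rob6", 10)
ready_after_one = station.ready()
station.capture("rob7", 5)
ready_after_two = station.ready()
print(ready_after_one, ready_after_two)
```

Because readiness depends only on the valid bits, entries can start in any order, which is the "no FIFO ordering" property of reservation stations.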
![Page 53: ELEC 669 Low Power Design Techniques Lecture 1](https://reader033.vdocuments.mx/reader033/viewer/2022052913/56813ac0550346895da2cd19/html5/thumbnails/53.jpg)
Memory Hazard Detection Logic
[Figure: loads and stores from instruction issue pass through address add & translation; addresses are held in the load address buffer and the store address buffer; address compare feeds hazard control; accesses then go to memory]
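The address-compare step can be sketched as a single check of a load's address against the store address buffer. This is a conservative illustration under stated assumptions, not the logic in the figure: the function name is hypothetical, the buffer is assumed to hold only stores that are earlier in program order, and a store whose address is not yet computed is represented by `None`.

```python
def load_may_issue(load_address, store_address_buffer):
    """Return True if the load conflicts with no earlier, still-pending store."""
    for store_address in store_address_buffer:
        if store_address is None:
            return False   # unknown store address: a conflict cannot be ruled out
        if store_address == load_address:
            return False   # same address: the load must wait for (or forward from) the store
    return True


pending_stores = [0x1000, 0x2008]
print(load_may_issue(0x3000, pending_stores))           # no match, the load may go ahead
print(load_may_issue(0x2008, pending_stores))           # matches an earlier store
print(load_may_issue(0x3000, pending_stores + [None]))  # unresolved store address
```

A less conservative design could let the load proceed speculatively past unresolved stores and recover on a later mismatch; that option is not modelled here.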