elec 669 low power design techniques lecture 1

ELEC 669: Low Power Design Techniques

Lecture 1

Amirali Baniasadi

[email protected]

2

ELEC 669: Low Power Design Techniques

Instructor: Amirali Baniasadi
Office: EOW 441, only by appointment. Call or email with your schedule.
Email: [email protected]
Office Tel: 721-8613
Web page for this class: http://www.ece.uvic.ca/~amirali/courses/ELEC669/elec669.html

Will use paper reprints

Lecture notes will be posted on the course web page.

3

Course

Structure

Lectures:
 - 1-2 weeks on processor review
 - 5 weeks on low power techniques
 - 6 weeks: discussion, presentation, meetings

Reading: papers are posted on the web for each week. Bring a one-page review of the papers.

Presentations: Each student should give two presentations in class.

4

Course Philosophy

Papers to be used as supplement for lectures (If a topic is not covered in the class, or a detail not presented in the class, that means I expect you to read on your own to learn those details)

One Project (50%)
Presentation (30%): will be announced in advance
Final Exam: take home (20%)

IMPORTANT NOTE: Must get passing grade in all components to pass the course. Failing any of the three components will result in failing the course.

5

Project

More on project later

6

Topics

High Performance Processors?
Low-Power Design
Low-Power Branch Prediction
Low-Power Register Renaming
Low-Power SRAMs
Low-Power Front-End
Low-Power Back-End
Low-Power Issue Logic
Low-Power Commit
AND more…

7

A Modern Processor

Fetch -> Decode -> Issue -> Complete -> Commit

Front-end | Back-end

1. What does each do?
2. Possible power optimizations?

8

Power Breakdown

Processor      Front-end   Back-end   Rest
PentiumPro     28%         35%        37%
Alpha 21464    6%          68%        26%

9

Instruction Set Architecture (ISA)

Fetch Instruction From Memory

Decode Instruction determine its size & action

Fetch Operand data

Execute instruction & compute results or status

Store Result in memory

Determine Next Instruction’s address

•Instruction Execution Cycle

10

What Should we Know?

A specific ISA (MIPS)

Performance issues - vocabulary and motivation

Instruction-Level Parallelism

How to Use Pipelining to improve performance

Exploiting Instruction-Level Parallelism w/ Dynamic Approach

Memory: caches and virtual memory

11

What is Expected From You?

• Read papers!
• Be up-to-date!
• Come back with your input & questions for discussion!

12

Power?

Everything is done by tiny switches

Their charge represents logic values
Changing charge -> energy
Power = energy over time
Devices are non-ideal: power -> heat
Excess heat -> circuits break down

Need to keep power within acceptable limits

13

POWER in the real world

[Figure: power density in W/cm2 on a log scale, 1 to 1000]

14

Power as a Performance Limiter

Conventional Performance Scaling:

Goal: Max. performance w/ min cost/complexity

How:
 - More and faster transistors.
 - More complex structures.

Power: Don’t fix if it ain’t broken

Not True Anymore: Power has increased rapidly

Power-Aware Architecture a Necessity

Speaker note: Dealing with power was viewed as an additional complexity. At the end, make the point that power-aware architecture is one approach; others, especially at the circuit level, are also necessary and probably more important.

15

Power-Aware Architecture

Conventional Architecture:

Goal: Max. performance

How: Do as much as you can.

This Work: Power-Aware Architecture

Goal: Min. Power and Maintain Performance

How: Do as little as you can, while maintaining performance

Challenging and new area


16

Why is this challenging

Identify actions that can be delayed/eliminated

Don’t touch those that boost performance

Cost/Power of doing so must not outweigh the benefits

17

Definitions

Performance is in units of things-per-second: bigger is better.

If we are primarily concerned with response time:

    performance(x) = 1 / execution_time(x)

"X is n times faster than Y" means

    n = Performance(X) / Performance(Y)

04/20/23

Amdahl's Law

Speedup due to enhancement E:

    Speedup(E) = (ExTime w/o E) / (ExTime w/ E) = (Performance w/ E) / (Performance w/o E)

Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected. Then:

    ExTime(with E) = ((1-F) + F/S) × ExTime(without E)

    Speedup(with E) = ExTime(without E) ÷ (((1-F) + F/S) × ExTime(without E))

    Speedup(with E) = 1 / ((1-F) + F/S)


Amdahl's Law-example

A new CPU is 10 times faster on the computation in a Web-serving workload. The old CPU spent 40% of the time on computation and 60% on waiting for I/O. What is the overall speedup?

Fraction enhanced = 0.4

Speedup enhanced = 10

Speedup overall = 1 / (0.6 + 0.4/10) = 1 / 0.64 ≈ 1.56
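The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the function name `amdahl_speedup` and its argument names are chosen here for illustration and do not come from the slides.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction F of the task is accelerated by a factor S."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Web-serving example: F = 0.4 (computation), S = 10
print(round(amdahl_speedup(0.4, 10), 2))  # 1.56

# Upper bound if computation took no time at all: 1 / (1 - F)
print(round(1.0 / (1.0 - 0.4), 2))  # 1.67
```

The second line of output shows why the unimproved 60% dominates: even an infinitely fast CPU could not exceed a speedup of about 1.67 on this workload.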


Why Do Benchmarks?

How we evaluate differences
 - Different systems
 - Changes to a single system
Provide a target
 - Benchmarks should represent large class of important programs
 - Improving benchmark performance should help many programs
For better or worse, benchmarks shape a field
Good ones accelerate progress
 - good target for development
Bad benchmarks hurt progress
 - help real programs v. sell machines/papers?
 - Inventions that help real programs don’t help benchmark


SPEC first round

First round 1989; 10 programs, single number to summarize performance

One program: 99% of time in single line of code
New front-end compiler could improve dramatically

[Figure: SPEC Perf (axis 0 to 800) per benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]

23

SPEC95

Eighteen application benchmarks (with inputs) reflecting a technical computing workload

Eight integer: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
Ten floating-point intensive: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
Must run with standard compiler flags
 - eliminate special undocumented incantations that may not even generate working code for real programs


Summary

Time is the measure of computer performance!
Remember Amdahl’s Law: improvement is limited by the unimproved part of the program.

    CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
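As a rough illustration of the CPU time equation, the sketch below multiplies the three factors. The function name `cpu_time` and all the numbers are hypothetical, picked only to make the arithmetic easy to follow.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instructions x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cpi * seconds_per_cycle


# Hypothetical program: 2 billion instructions on a 1 GHz clock
base = cpu_time(instruction_count=2_000_000_000, cpi=1.5, clock_rate_hz=1_000_000_000)
better_cpi = cpu_time(instruction_count=2_000_000_000, cpi=1.0, clock_rate_hz=1_000_000_000)

# Lowering CPI from 1.5 to 1.0 at the same clock shortens CPU time by the same ratio
print(base, better_cpi, base / better_cpi)
```

Any one factor can be traded against the others, which is why a change that cuts instruction count but raises CPI has to be judged on the product, not on either term alone.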

25

Execution Cycle

Instruction Fetch: Obtain instruction from program storage
Instruction Decode: Determine required actions and instruction size
Operand Fetch: Locate and obtain operand data
Execute: Compute result value or status
Result Store: Deposit results in storage for later use
Next Instruction: Determine successor instruction

26

What Must be Specified?

[Figure: execution cycle: Instruction Fetch, Instruction Decode, Operand Fetch, Execute, Result Store, Next Instruction]

° Instruction Format or Encoding

– how is it decoded?

° Location of operands and result

– where other than memory?

– how many explicit operands?

– how are memory operands located?

– which can or cannot be in memory?

° Data type and Size

° Operations

– what are supported

° Successor instruction

– jumps, conditions, branches

27

What Is ILP?

Principle: Many instructions in the code do not depend on each other
Result: Possible to execute them in parallel
ILP: Potential overlap among instructions (so they can be evaluated in parallel)
Issues:
 - Building compilers to analyze the code
 - Building special/smarter hardware to handle the code
ILP: Increase the amount of parallelism exploited among instructions

Seeks Good Results out of Pipelining

28

What Is ILP?

CODE A:                CODE B:
LD  R1, (R2)100        LD  R1, (R2)100
ADD R4, R1             ADD R4, R1
SUB R5, R1             SUB R5, R4
CMP R1, R2             SW  R5, (R2)100
ADD R3, R1             LD  R1, (R2)100

Code A: Possible to execute 4 instructions in parallel.
Code B: Can’t execute more than one instruction per cycle.

Code A has Higher ILP
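One way to see the difference between Code A and Code B is to group instructions into levels, where every instruction in a level depends only on results from earlier levels. The sketch below does this under stated assumptions: only true (read-after-write) dependences are tracked, two-operand instructions are assumed to read and write their first operand, CMP is assumed to write a flags location, and memory is modelled as a single location named MEM. The function name `schedule_levels` is made up for this example.

```python
def schedule_levels(instrs):
    """Group instructions into dependence levels (RAW dependences only).

    instrs: list of (text, writes, reads), where writes/reads are sets of locations.
    """
    ready = {}   # location -> level at which its latest value is produced
    levels = []
    for text, writes, reads in instrs:
        level = 1 + max((ready.get(loc, 0) for loc in reads), default=0)
        for loc in writes:
            ready[loc] = level
        while len(levels) < level:
            levels.append([])
        levels[level - 1].append(text)
    return levels


code_a = [
    ("LD R1,(R2)100", {"R1"}, {"R2", "MEM"}),
    ("ADD R4,R1", {"R4"}, {"R4", "R1"}),
    ("SUB R5,R1", {"R5"}, {"R5", "R1"}),
    ("CMP R1,R2", {"FLAGS"}, {"R1", "R2"}),
    ("ADD R3,R1", {"R3"}, {"R3", "R1"}),
]
code_b = [
    ("LD R1,(R2)100", {"R1"}, {"R2", "MEM"}),
    ("ADD R4,R1", {"R4"}, {"R4", "R1"}),
    ("SUB R5,R4", {"R5"}, {"R5", "R4"}),
    ("SW R5,(R2)100", {"MEM"}, {"R5", "R2"}),
    ("LD R1,(R2)100", {"R1"}, {"R2", "MEM"}),
]

widths_a = [len(level) for level in schedule_levels(code_a)]
widths_b = [len(level) for level in schedule_levels(code_b)]
print(widths_a)  # [1, 4]
print(widths_b)  # [1, 1, 1, 1, 1]
```

In Code A the four instructions after the load all wait only on R1, so they share one level. In Code B each instruction consumes the result of the one before it, so every level holds a single instruction.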

29

Out of Order Execution

Programmer: Instructions execute in-order

Processor: Instructions may execute in any order if results remain the same at the end

[Figure: dependence graph over instructions A, B, C, D]

In-Order:
 A: LD R1, (R2)
 B: ADD R3, R4
 C: ADD R3, R5
 D: CMP R3, R1

Out-of-Order:
 B: ADD R3, R4
 C: ADD R3, R5
 A: LD R1, (R2)
 D: CMP R3, R1

Speaker note: here you talk about ordering only. GOAL: I can execute instructions in any order I like so long as I produce the right result at the end.

30

Assumptions

Five-stage integer pipeline
Branches have delay of one clock cycle
 - ID stage: comparisons done, decisions made and PC loaded
No structural hazards
 - Functional units are fully pipelined or replicated (as many times as the pipeline depth)

FP Latencies

Source instruction   Dependent instruction   Latency (clock cycles)
FP ALU op            Another FP ALU op       3
FP ALU op            Store double            2
Load double          FP ALU op               1
Load double          Store double            0

Integer load latency: 1; Integer ALU operation latency: 0

31

Simple Loop & Assembler Equivalent

for (i=1000; i>0; i--)
    x[i] = x[i] + s;

Loop: LD   F0, 0(R1)     ;F0=array element
      ADDD F4, F0, F2    ;add scalar in F2
      SD   F4, 0(R1)     ;store result
      SUBI R1, R1, #8    ;decrement pointer 8 bytes (DW)
      BNE  R1, R2, Loop  ;branch R1!=R2

• x[i] & s are double/floating point type
• R1 initially address of array element with the highest address
• F2 contains the scalar value s
• Register R2 is pre-computed so that 8(R2) is the last element to operate on

32

Where are the stalls?

Unscheduled:
Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      stall
      stall
      SD   F4, 0(R1)
      SUBI R1, R1, #8
      stall
      BNE  R1, R2, Loop
      stall

10 clock cycles. Can we minimize?

Scheduled:
Loop: LD   F0, 0(R1)
      SUBI R1, R1, #8
      ADDD F4, F0, F2
      stall
      BNE  R1, R2, Loop
      SD   F4, 8(R1)

6 clock cycles: 3 cycles of actual work, 3 cycles of overhead. Can we minimize further?
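The cycle counts for the unscheduled and scheduled loop can be reproduced with a small in-order, single-issue model. This is a sketch under assumptions read off the slides rather than a real pipeline simulator: the stall counts between producer and consumer come from the FP latency table, a one-cycle stall is assumed between SUBI and a dependent BNE (as the unscheduled listing shows), and an unfilled branch delay slot costs one cycle. The names `LATENCY` and `count_cycles` are made up for this example.

```python
# Stall cycles between a producer and a dependent consumer, by instruction kind
LATENCY = {
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("fpalu", "store"): 2,
    ("fpalu", "fpalu"): 3,
    ("intalu", "branch"): 1,  # assumption: branch is resolved early, in ID
}


def count_cycles(instrs):
    """instrs: list of (kind, dest, sources). Returns cycles for one pass of the loop body."""
    produced = {}  # register -> (issue cycle, kind) of its latest producer
    cycle = 0
    for kind, dest, sources in instrs:
        earliest = cycle + 1  # in-order, one instruction per cycle
        for reg in sources:
            if reg in produced:
                issue, producer_kind = produced[reg]
                stalls = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, issue + 1 + stalls)
        cycle = earliest
        if dest is not None:
            produced[dest] = (cycle, kind)
    if instrs and instrs[-1][0] == "branch":
        cycle += 1  # nothing useful in the branch delay slot
    return cycle


unscheduled = [
    ("load", "F0", ["R1"]),          # LD   F0, 0(R1)
    ("fpalu", "F4", ["F0", "F2"]),   # ADDD F4, F0, F2
    ("store", None, ["F4", "R1"]),   # SD   F4, 0(R1)
    ("intalu", "R1", ["R1"]),        # SUBI R1, R1, #8
    ("branch", None, ["R1", "R2"]),  # BNE  R1, R2, Loop
]
scheduled = [
    ("load", "F0", ["R1"]),          # LD   F0, 0(R1)
    ("intalu", "R1", ["R1"]),        # SUBI R1, R1, #8
    ("fpalu", "F4", ["F0", "F2"]),   # ADDD F4, F0, F2
    ("branch", None, ["R1", "R2"]),  # BNE  R1, R2, Loop
    ("store", None, ["F4", "R1"]),   # SD   F4, 8(R1), in the delay slot
]

print(count_cycles(unscheduled))  # 10
print(count_cycles(scheduled))    # 6
```

In the scheduled case the model issues BNE one cycle earlier than the slide and lets the store wait instead; the total of 6 cycles is the same either way.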


33


Loop Unrolling

Four copies of loop:
      LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   F4, 0(R1)
      SUBI R1, R1, #8
      BNE  R1, R2, Loop
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   F4, -8(R1)
      SUBI R1, R1, #8
      BNE  R1, R2, Loop
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   F4, -16(R1)
      SUBI R1, R1, #8
      BNE  R1, R2, Loop
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   F4, -24(R1)
      SUBI R1, R1, #32
      BNE  R1, R2, Loop

Eliminate Incr, Branch

Four iteration code:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   F4, 0(R1)
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   F8, -8(R1)
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   F12, -16(R1)
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   F16, -24(R1)
      SUBI R1, R1, #32
      BNE  R1, R2, Loop

Assumption: R1 is initially a multiple of 32 or number of loop iterations is a multiple of 4
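The same transformation can be written at source level. The sketch below is an illustration in Python rather than the assembly on the slide: both functions walk the array from the highest index down, and the unrolled one relies on the assumption above that the element count is a multiple of 4. The function names are made up for this example.

```python
def add_scalar(x, s):
    """Rolled loop: one decrement and one branch test per element."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1
    return x


def add_scalar_unrolled4(x, s):
    """Unrolled by 4: one decrement and one branch test per four elements."""
    if len(x) % 4 != 0:
        raise ValueError("element count must be a multiple of 4")
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4
    return x


data = [float(v) for v in range(8)]
print(add_scalar(list(data), 2.5) == add_scalar_unrolled4(list(data), 2.5))  # True
```

The loop overhead (index update and test) is paid twice for eight elements instead of eight times, which is the saving the unrolled assembly gets from dropping three SUBI/BNE pairs.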

34

Loop Unroll & Schedule

Unrolled, unscheduled:
Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      stall
      stall
      SD   F4, 0(R1)
      LD   F6, -8(R1)
      stall
      ADDD F8, F6, F2
      stall
      stall
      SD   F8, -8(R1)
      LD   F10, -16(R1)
      stall
      ADDD F12, F10, F2
      stall
      stall
      SD   F12, -16(R1)
      LD   F14, -24(R1)
      stall
      ADDD F16, F14, F2
      stall
      stall
      SD   F16, -24(R1)
      SUBI R1, R1, #32
      stall
      BNE  R1, R2, Loop
      stall

28 clock cycles, or 7 per iteration. Can we minimize further?

Schedule:
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   F4, 0(R1)
      SD   F8, -8(R1)
      SUBI R1, R1, #32
      SD   F12, 16(R1)   ;16-32 = -16
      BNE  R1, R2, Loop
      SD   F16, 8(R1)    ;8-32 = -24

SUBI is placed before the third store so that BNE does not have to wait for R1.

No stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize further?

35

Summary

Cycles per iteration:
 - Original loop: 10 cycles
 - After scheduling: 6 cycles
 - After unrolling: 7 cycles
 - After unrolling and scheduling: 3.5 cycles (no stalls)

36

Multiple Issue

• Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.

• Superscalar processors

• Very Long Instruction Word (VLIW) processors

37

A Modern Processor

Fetch -> Decode -> Issue -> Complete -> Commit

Front-end | Back-end

Multiple Issue

38

1990’s: Superscalar Processors

Bottleneck: CPI >= 1
 - Limit on scalar performance (single instruction issue)
 - Hazards
 - Superpipelining? Diminishing returns (hazards + overhead)

How can we make the CPI = 0.5?
 - Multiple instructions in every pipeline stage (super-scalar)

Cycle   1    2    3    4    5    6    7
Inst0   IF   ID   EX   MEM  WB
Inst1   IF   ID   EX   MEM  WB
Inst2        IF   ID   EX   MEM  WB
Inst3        IF   ID   EX   MEM  WB
Inst4             IF   ID   EX   MEM  WB
Inst5             IF   ID   EX   MEM  WB

39

Elements of Advanced Superscalars

High performance instruction fetching
 - Good dynamic branch and jump prediction
 - Multiple instructions per cycle, multiple branches per cycle?
Scheduling and hazard elimination
 - Dynamic scheduling
 - Not necessarily: Alpha 21064 & Pentium were statically scheduled
 - Register renaming to eliminate WAR and WAW
Parallel functional units, paths/buses/multiple register ports
High performance memory systems
Speculative execution

40

SS + DS + Speculation

Superscalar + Dynamic scheduling + Speculation: three great tastes that taste great together

CPI >= 1?
 - Overcome with superscalar
Superscalar increases hazards
 - Overcome with dynamic scheduling
RAW dependences still a problem?
 - Overcome with a large window
Branches a problem for filling large window?
 - Overcome with speculation

41

The Big Picture

[Figure: Static program -> Fetch & branch predict -> issue -> execution -> Reorder & commit]

42

Superscalar Microarchitecture

[Figure: block diagram with Pre-decode, Inst. cache, Inst. buffer, Decode/rename/dispatch, Floating point inst. buffer, Integer/address inst. buffer, Integer register file, Floating point register file, Functional units, Functional units and data cache, Memory interface, Reorder and commit]

43

Register renaming methods

First Method:
 - Physical register file vs. logical (architectural) register file
 - Mapping table used to associate physical reg w/ current value of log. reg
 - Use a free list of physical registers
 - Physical register file bigger than log. register file

Second Method:
 - Physical register file same size as logical
 - Also, use a buffer w/ one entry per inst.: the reorder buffer

44

Register Renaming Example

Unrolled, unscheduled:
Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      stall
      stall
      SD   F4, 0(R1)
      LD   F6, -8(R1)
      stall
      ADDD F8, F6, F2
      stall
      stall
      SD   F8, -8(R1)
      LD   F10, -16(R1)
      stall
      ADDD F12, F10, F2
      stall
      stall
      SD   F12, -16(R1)
      LD   F14, -24(R1)
      stall
      ADDD F16, F14, F2
      stall
      stall
      SD   F16, -24(R1)
      SUBI R1, R1, #32
      stall
      BNE  R1, R2, Loop
      stall

28 clock cycles, or 7 per iteration. Can we minimize further?

Schedule:
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   F4, 0(R1)
      SD   F8, -8(R1)
      SUBI R1, R1, #32
      SD   F12, 16(R1)   ;16-32 = -16
      BNE  R1, R2, Loop
      SD   F16, 8(R1)    ;8-32 = -24

SUBI is placed before the third store so that BNE does not have to wait for R1.

No stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize further?

45

Register renaming: first method

Instruction: add r3, r3, 4

Before:
 Mapping table: r0 -> R8, r1 -> R7, r2 -> R5, r3 -> R1, r4 -> R9
 Free list: R2, R6, R13

After:
 Mapping table: r0 -> R8, r1 -> R7, r2 -> R5, r3 -> R2, r4 -> R9
 Free list: R6, R13
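The mapping table and free list can be sketched in a few lines. This is an illustrative model only, with made-up names (`Renamer`, `rename`): sources are looked up in the current mapping first, then the destination gets the next register from the free list. Returning the old physical register to the free list, which real hardware does once the old value is no longer needed, is left out.

```python
from collections import deque


class Renamer:
    def __init__(self, mapping, free_list):
        self.mapping = dict(mapping)   # logical register -> physical register
        self.free = deque(free_list)   # physical registers not in use

    def rename(self, op, dest, sources):
        # Read sources through the current mapping BEFORE remapping the destination;
        # anything that is not a mapped register (e.g. an immediate) passes through.
        phys_sources = [self.mapping.get(src, src) for src in sources]
        if not self.free:
            raise RuntimeError("no free physical register: dispatch must stall")
        phys_dest = self.free.popleft()
        self.mapping[dest] = phys_dest
        return (op, phys_dest, phys_sources)


renamer = Renamer(
    {"r0": "R8", "r1": "R7", "r2": "R5", "r3": "R1", "r4": "R9"},
    ["R2", "R6", "R13"],
)
print(renamer.rename("add", "r3", ["r3", 4]))  # ('add', 'R2', ['R1', 4])
print(renamer.mapping["r3"], list(renamer.free))  # R2 ['R6', 'R13']
```

The order of the two steps matters: `add r3, r3, 4` must read the old r3 (R1) and write the new r3 (R2), so the source lookup has to happen before the table is updated.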

46

Superscalar Processors

• Issues varying number of instructions per clock

• Scheduling: Static (by the compiler) or dynamic(by the hardware)

• Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo).

• IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000

47

More Realistic HW: Register Impact

Effect of limiting the number of renaming registers

Instruction issues per cycle (IPC), by program and number of renaming registers:

Registers   gcc   espresso   li   fpppp   doduc   tomcatv
Infinite    11    15         12   59      29      54
256         10    15         12   49      16      45
128         10    13         12   35      15      44
64          9     10         11   20      11      28
32          5     5          6    5       5       7
None        4     4          5    4       5       5

Integer: 5 - 15

FP: 11 - 45

48

Reorder Buffer

Place data in entry when execution finished

Reserve entry at tail when dispatched

Remove from head when complete

Bypass to other instructions when needed
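The four operations listed above map onto a simple queue. The sketch below is a toy model with made-up names (`ReorderBuffer`, `dispatch`, `complete`, `bypass`, `commit`); it ignores exceptions and branch misprediction recovery and only shows that results may arrive out of order while removal from the head stays in program order.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()  # head is entries[0], tail is entries[-1]
        self.next_tag = 0

    def dispatch(self, dest):
        """Reserve an entry at the tail when an instruction is dispatched."""
        if len(self.entries) >= self.size:
            raise RuntimeError("reorder buffer full: dispatch must stall")
        entry = {"tag": self.next_tag, "dest": dest, "value": None, "done": False}
        self.next_tag += 1
        self.entries.append(entry)
        return entry["tag"]

    def complete(self, tag, value):
        """Place the result in the entry when execution finishes."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True
                return
        raise KeyError(tag)

    def bypass(self, tag):
        """Forward a finished but uncommitted result to another instruction."""
        for entry in self.entries:
            if entry["tag"] == tag and entry["done"]:
                return entry["value"]
        return None

    def commit(self):
        """Remove finished entries from the head, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            committed.append((entry["dest"], entry["value"]))
        return committed


rob = ReorderBuffer(size=4)
first = rob.dispatch("r1")
second = rob.dispatch("r3")
rob.complete(second, 7)        # the younger instruction finishes first
print(rob.commit())            # [] because the head is not finished yet
print(rob.bypass(second))      # 7
rob.complete(first, 5)
print(rob.commit())            # [('r1', 5), ('r3', 7)]
```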

49

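Register renaming: reorder buffer

Before: add r3, r3, 4
 Mapping table: r0 -> R8, r1 -> R7, r2 -> R5, r3 -> rob6, r4 -> R9
 Reorder buffer (entries 7, 6, ..., 0): entry 6 holds r3

After: add r3, rob6, 4 (renamed: add rob8, rob6, 4)
 Mapping table: r0 -> R8, r1 -> R7, r2 -> R5, r3 -> rob8, r4 -> R9
 Reorder buffer (entries 8, 7, 6, ..., 0): entries 8 and 6 hold r3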

50

Instruction Buffers

[Figure: block diagram with Pre-decode, Inst. cache, Inst. buffer, Decode/rename/dispatch, Floating point inst. buffer, Integer/address inst. buffer, Integer register file, Floating point register file, Functional units, Functional units and data cache, Memory interface, Reorder and commit]

51

Issue Buffer Organization

a) Single, shared queue
 - No out-of-order
 - No renaming

b) Multiple queues; one per inst. type
 - No out-of-order inside queues
 - Queues issue out of order

52

Issue Buffer Organization

c) Multiple reservation stations; (one per instruction type or big pool)

 - No FIFO ordering
 - Ready operands, hardware available -> execution starts
 - Proposed by Tomasulo

[Figure: reservation stations fed from instruction dispatch]

53

Typical reservation station

Fields: operation | source 1 | data 1 | valid 1 | source 2 | data 2 | valid 2 | destination
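The fields above can be modelled as a small record. The sketch below is an assumption-laden illustration, not a description of any particular processor: source fields are taken to hold the tag of the producing instruction, the `capture` method stands in for snooping a result broadcast, and the tags rob6 and rob8 are borrowed from the renaming example purely as sample values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    operation: str
    source1: str            # tag of the producer of operand 1
    source2: str            # tag of the producer of operand 2
    destination: str        # tag under which the result will be broadcast
    data1: Optional[int] = None
    valid1: bool = False
    data2: Optional[int] = None
    valid2: bool = False

    def capture(self, tag, value):
        """Grab a broadcast result if it matches a source that is still waiting."""
        if not self.valid1 and self.source1 == tag:
            self.data1, self.valid1 = value, True
        if not self.valid2 and self.source2 == tag:
            self.data2, self.valid2 = value, True

    def ready(self):
        """The instruction may start executing once both operands are valid."""
        return self.valid1 and self.valid2


# add rob8, rob6, 4: the immediate is already valid, operand 1 is still in flight
station = ReservationStation("add", "rob6", "imm", "rob8", data2=4, valid2=True)
print(station.ready())       # False
station.capture("rob6", 10)  # the producer of rob6 broadcasts its result
print(station.ready())       # True
print(station.data1 + station.data2)  # 14
```

Because readiness depends only on the valid bits, entries can start in any order, which is what removes the FIFO restriction of the queue-based organizations.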

54

Memory Hazard Detection Logic

[Figure: memory hazard detection logic. Loads and stores arrive from instruction issue; blocks: address add & translation, load address buffer, store address buffer, address compare, hazard control; output to memory]
