cs152 – computer architecture and engineering lecture 9...

26
CS 152 L09 Performance () UC Regents Fall 2004 © UCB 2004-09-28 Dave Patterson (www.cs.berkeley.edu/~patterson) John Lazzaro (www.cs.berkeley.edu/~lazzaro) www-inst.eecs.berkeley.edu/~cs152/ CS152 – Computer Architecture and Engineering Lecture 9 – Performance 1

Upload: others

Post on 12-May-2020

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

2004-09-28 Dave Patterson

(www.cs.berkeley.edu/~patterson)

John Lazzaro (www.cs.berkeley.edu/~lazzaro)

www-inst.eecs.berkeley.edu/~cs152/

CS152 – Computer Architecture andEngineering

Lecture 9 – Performance

1

Page 2: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Last Time: Microcode, Multi-Cycle

5

CS 152 L09 Multicycle (25) Fall 2004 © UC Regents

Controller Design• The state diagrams that arise define the controller for an

instruction set processor are highly structured• Use this structure to construct a simple “microsequencer” • Control reduces to programming this very simple device

microprogramming

op-code

Map ROM

Micro-PC

Z I Ldatapath control

taken

CS 152 L09 Multicycle (26) Fall 2004 © UC Regents

Our Microsequencer

op-code

Map ROM

Micro-PC

Z I Ldatapath control

taken

CS 152 L09 Multicycle (27) Fall 2004 © UC Regents

Adding the Dispatch ROM

•Sequencer-based control– Called “microPC” or “µPC” vs. state register

Control Value Effect00 Next µaddress = 001 Next µaddress = dispatch ROM 10 Next µaddress = µaddress + 1

ROM:

Opcode

microPC

1

µAddressSelectLogic

Adder

ROM

Mux

0012

R-type 000000 0100BEQ 000100 0011ori 001101 0110LW 100011 1000SW 101011 1011

CS 152 L09 Multicycle (28) Fall 2004 © UC Regents

sequencercontrol

micro-PCµ-sequencer:fetch,dispatch,sequential

DispatchROM

Opcode

Inputs

Microprogramming

µ-Code ROM

To DataPath

DecodeDecode

datapath control

microinstruction (µ)

CS 152 L09 Multicycle (29) Fall 2004 © UC Regents

Microprogramming• Microprogramming is a convenient method for

implementing structured control state diagrams:– Random logic replaced by microPC sequencer and ROM– Each line of ROM called a microinstruction:

contains sequencer control + values for control points– To reduce confusion, normal instruction

(e.g., MIPS addu) called “macroinstruction”– limited state transitions:

branch to zero, next sequential,branch to µinstruction address from dispatch ROM

• Control design reduces to Microprogramming– Part of the design process is to develop a “language” that

describes control and is easy for humans to understand

CS 152 L09 Multicycle (30) Fall 2004 © UC Regents

“Macroinstruction” Interpretation

MainMemory

executionunit

controlmemory

CPU

ADDSUBAND

DATA

.

.

.

User program plus Data

this can change!

AND microsequence

e.g., FetchCalc Operand AddrFetch Operand(s)CalculateSave Answer(s)

one of these ismapped into oneof these

2

Page 3: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Today’s Lecture - Performance

Measurement: what, why, how

The performance equation

How energy limits performance

Amdahl’s law

3

Page 4: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

UC Regents Fall 2004 © UCBCS 152 L09 Performance ()

Performance Measurement(as seen by the customer)

4

Page 5: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Who (sensibly) upgrades CPUs often?A professional who turns CPU cycles into money, and who is cycle-limited.

Artist tool: animation,

video special effects.

5

Page 6: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

How to decide to buy a new machine?

Measure After Effects “execution time” on a representative render “workload”

“Night flight”

City map and cloudscomputed

“on the fly” with fractals

CPU intensive Trivial I/O

6

Page 7: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Interpreting Execution Time

Performance 1Execution Time

= = 2.85 renders/hour

1.5 GHz PB (Y) is N times faster than 1.25 GHz PB (X). N is ?

N =Performance (Y)

Execution Time (Y)

Execution Time (X)Performance (X)

= = 1. 19

PB 1.5 Ghz : 3. 4 renders/hour. PB 1.25 : 2.85 renders/hour.Does artist productivity really increase?

Execution Time: 1265 seconds

PowerBookG41.25 GHz

7

Page 8: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Execution Time: Time for 1 job to complete

2 CPUs: Execution Time vs Throughput

Throughput: # jobs/hour completed (not serialized)

Could G5 and Opteron have similar Throughput? Why?

Assume G5 MP executiontime faster because AE doesnot use both Opteron CPUs.

1.8xfaster.What does this imply?

2 CPUs vs1 CPU,otherwisesimilar

8

Page 9: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

UC Regents Fall 2004 © UCBCS 152 L09 Performance ()

Performance Measurement(as seen by a CPU designer)

Q. Why do we care about After Effect’s performance?

A. We want the CPU we are designing to run it well !

9

Page 10: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Step 1: Analyze the right measurement!CPU Time:

Time the CPU spends running program under measurement.

Response Time:

Total time: CPU Time + time spent waiting (for disk, I/O, ...).

Guides CPU design

Guides system design

How to measure CPU time?% time <program name>25.77u 0.72s 0:29.17 90.8%

How do designersuse these two numbers?

10

Page 11: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Administrivia - Adjust Class Time ?We have permission to stay in thisroom past 12:30.

A: Lecture from 11:10 to 12:30B: Lecture from 11:15 to 12:35C: Lecture from 11:20 to 12:40

Does anyone have a class that starts 12:40?

Class time options (all “sharp” time)

11

Page 12: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Administrivia - Mid-Term is Coming!

Mid-term review session: Sunday 10/10, 7-9 PM, 306 Soda.

Mid-term: Tuesday 10/12, 5:30-8:30 PM, 101 Morgan. No class on Tuesday.

After exam: Pizza at LaVal’s, on us!

12

Page 13: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Administrivia - This Week’s Deadlines

Homework 2 due 9/29 (tomorrow)!283 Soda, in CS 152 box at 5 PM

Lab 2 Xilinx demo on Friday 10/1Lab 2 due Monday 10/4, 11:59 PMOn Tuesday 10/5, onto the Pipelining Lab!

13

Page 14: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

CPU time: Proportional to Instruction Count

CPU timeProgram

Machine InstructionsProgram∝

Q. Static count?(lines of program printout)Or dynamic count? (trace of execution)

Rationale: Every additional

instruction you execute takes time.

Q. What type of computer architect influences the number of instructions a given program needs?A. Instruction set architect.

A. Dynamic.

Q. Once ISA is set, who can influence instructioncount?A. Compiler writer,application developer.

14

Page 15: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

CPU time: Proportional to Clock PeriodQ. What ultimately limitsan architect’s ability to reduce clock period ?

TimeProgram

TimeOne Clock Period∝

A. Clock-to-Q, setup times.

Q. How can architects (not technologists) reduce clock period?A. Shorten the machine critical path.

Rationale: We measure each instruction’s

execution time in “number of cycles”. By shortening the period for

each cycle, we shorten execution time.15

Page 16: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Completing the performance equation

SecondsProgram

InstructionsProgram

= SecondsCycle

We need all three terms, and only these terms, to

compute CPU Time!When is it OK to compare clock rates?

What factors make the CPI for a program differfrom the underlying CPIof a CPU implementation?

Instruction mix varies

Cache behavior varies.

Branch prediction varies.

“CPI” -- The Average Number of Clock

Cycles Per Instruction For the Program

InstructionCycles

16

Page 17: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

CPI as an analytical tool to guide design

Multiply

Other A

LU Load

Store

Branch

2221

5

Machine CPI

5 x 30 + 1 x 20 + 2 x 20 + 2 x 10 + 2 x 20

100= 2.7 cycles/instruction

20%Branch

10%Store

20%Load

20%Other ALU

30%Multiply

ProgramInstruction Mix

15%Branch

7%

15%Load

7%

56%Multiply

Where program spends its time

Q. We lower machine multiply CPI, but program runs slower!What mistake(s) did we make?

17

Page 18: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Amdahl’s Law (of Diminishing Returns)

If enhancement “E” speeds up multiply, but other instructions

are unchanged, what is the maximum speedup S?

17%Branch

8%

17%Load

8%

50%Multiply

Where programspends its time

Smax =1

1 - (% affected / 100 %)1

1 - (50/100)= 2=

Attributed to Gene Amdahl -- “Amdahl’s Law”

What is the lesson of Amdahl’s Law?

Must enhance computers in a balanced way!

18

Page 19: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Peer Instruction: Amdahl’s Law

ProgramWeWishTo RunOn N CPUs

30%Serial

70%Parallel

The program spends 30%of its time running code that can not be recoded to run in parallel.

CPUs 2 3 4 5 ∞

Speedup

Compute speedup for N = 2, 3, 4, 5, and ∞.

19

Page 20: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Peer Instruction: Amdahl’s Law

ProgramWeWishTo RunOn N CPUs

S =1

1 - (30 % + (70% / N) ) / 100 %)

30%Serial

70%Parallel

The program spends 30%of its time in serial code.

CPUs 2 3 4 5 ∞

Speedup 1.54 1.85 2.1 2.3 3.3

Compute speedup for N = 2, 3, 4, 5, and ∞.

S(∞)

2 3 # CPUs

20

Page 21: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Final thoughts: Performance Equation

SecondsProgram

InstructionsProgram= Seconds

Cycle InstructionCycles

Goal is to optimize execution time, notindividualequation

terms.

The CPI of the

program.Reflects

the program’s instruction

mix.

Machinesare

optimizedwith

respect toprogram

workloads.

Clockperiod.

Optimizejointlywith

machineCPI.

21

Page 22: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

UC Regents Fall 2004 © UCBCS 152 L09 Performance ()

Energy and Performance

1 Joule = 0.24 calories. 1 calorie raises 1 gram of water 1℃

1 Joule of energy is dissipated by a 1 Amp current flowing through a 1 Ohm resistor for 1 second.

Sad fact: computers turn electrical energy into heat. Computation is a byproduct.

Air or water carries heat away, or chip melts.

1 Watt: 1 Amp flowing through 1 Ohm.

Also, 1 Watt for 1 second.

22

Page 23: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

IBM Power 4: How does die heat up?

4 dies on amulti-chip

module

2 CPUsper die

23

Page 24: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

IBM Power 4: Dissipating 115 Watts

Fixedpointunits

Hot spots

Cachelogic

24

Page 25: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Switching energy: Fundamental Physics

!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-)

!"#$%&'(#)*(+,%-$*".(/0

1 2+.$0#$03

1 4546%,"#$3

Vdd

C

1

2C VddE1->0=

21

2C VddE0->1=

2

Vdd

Every logic transition dissipates energy.

Strong result: Independent of technology.State-of-the-art CPUs (90 nm):

Switching energy is 70% of total energy. Remainder: at 90nm, “switches” are “dimmers”!

“leakage” currents

How can we limit switching energy?

65nm: 50/50! 25

Page 26: CS152 – Computer Architecture and Engineering Lecture 9 ...Cs152/Fa04/Lecnotes/Lec5-1.pdfController Design ¥ The state diagrams that arise define the controller for an instruction

CS 152 L09 Performance () UC Regents Fall 2004 © UCB

Conclusions

Customers: measure to buyArchitects: measure for design

Tools: Performance Equation, CPI

Energy:

Amdahl’s Law’s lesson: Balance

1

2C VddE0->1=

2 1

2C VddE1->0=

2

26