Lecture 02: Technology Trends and Quantitative Design and Analysis for Performance

CSE 564 Computer Architecture, Summer 2017
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
www.secs.oakland.edu/~yan

Post on 02-Jun-2020



Contents

•  Computers and computer components
•  Computer architectures and great ideas in history and now
•  Trends, Cost and Performance

Understanding Performance

•  Algorithm
   –  Determines number of operations executed
•  Programming language, compiler, architecture
   –  Determine number of machine instructions executed per operation
•  Processor and memory system
   –  Determine how fast instructions are executed
•  I/O system (including OS)
   –  Determines how fast I/O operations are executed

Below Your Program

•  Application software
   –  Written in high-level language
•  System software
   –  Compiler: translates HLL code to machine code
   –  Operating System: service code
      •  Handling input/output
      •  Managing memory and storage
      •  Scheduling tasks & sharing resources
•  Hardware
   –  Processor, memory, I/O controllers

Levels of Program Code

•  High-level language
   –  Level of abstraction closer to problem domain
   –  Provides for productivity and portability
•  Assembly language
   –  Textual representation of instructions
•  Hardware representation
   –  Binary digits (bits)
   –  Encoded instructions and data

Trends in Technology

•  Integrated circuit technology
   –  Transistor density: 35%/year
   –  Die size: 10-20%/year
   –  Integration overall: 40-55%/year
•  DRAM capacity: 25-40%/year (slowing)
•  Flash capacity: 50-60%/year
   –  15-20X cheaper/bit than DRAM
•  Magnetic disk technology: 40%/year
   –  15-25X cheaper/bit than Flash
   –  300-500X cheaper/bit than DRAM

Bandwidth and Latency

•  Bandwidth or throughput
   –  Total work done in a given time
   –  10,000-25,000X improvement for processors
   –  300-1200X improvement for memory and disks
•  Latency or response time
   –  Time between start and completion of an event
   –  30-80X improvement for processors
   –  6-8X improvement for memory and disks

End of Moore’s Law?

•  Cost per transistor is rising as transistor size continues to shrink

Power and Energy

•  Problem:
   –  Get power in and distribute it around
   –  Get power out: dissipate heat
•  Three primary concerns:
   –  Max power requirement for a processor
   –  Thermal Design Power (TDP)
      •  Characterizes sustained power consumption
      •  Used as target for power supply and cooling system
      •  Lower than peak power, higher than average power consumption
   –  Energy and energy efficiency
•  Clock rate can be reduced dynamically to limit power consumption

Energy and Energy Efficiency

•  Power: energy per unit time
   –  1 watt = 1 joule per second
   –  Energy per task is often a better measurement
•  Processor A has 20% higher average power consumption than processor B. A executes the task in only 70% of the time needed by B.
   –  So the energy consumption of A will be 1.2 × 0.7 = 0.84 of B
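The A-versus-B comparison can be checked with a few lines of Python. This is a minimal sketch; the function name `relative_energy` and its parameter names are illustrative and not from the slides.

```python
def relative_energy(power_ratio: float, time_ratio: float) -> float:
    """Energy of processor A relative to B.

    power_ratio: average power of A divided by average power of B
    time_ratio:  execution time of A divided by execution time of B
    """
    # Energy = power x time, so the two ratios simply multiply.
    return power_ratio * time_ratio


# Values from the slide: A draws 20% more power but needs only 70% of the time.
ratio = relative_energy(power_ratio=1.2, time_ratio=0.7)
print(f"Energy of A relative to B: {ratio:.2f}")  # prints 0.84
```

Although A has the higher power draw, it finishes sooner, so it uses less energy for the task.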

Dynamic Energy and Power

•  Dynamic energy
   –  Transistor switch from 0 -> 1 or 1 -> 0
•  Dynamic power
•  Reducing clock rate reduces power, not energy
•  The capacitive load:
   –  A function of the number of transistors connected to an output and the technology, which determines the capacitance of the wires and the transistors.
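The formulas for this slide did not survive as text. As a sketch, assuming the standard CMOS relations used in the course textbook (the factor 1/2 and the name "Frequency switched" are assumptions here, chosen to agree with the Power formula on the Power Trends slide):

```latex
\begin{aligned}
\text{Energy}_{\text{dynamic}} &\propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \\
\text{Power}_{\text{dynamic}}  &\propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}
\end{aligned}
```

Frequency appears only in the power relation, which is why lowering the clock rate lowers power while the energy per transition stays the same.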

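An Example from Textbook (page 21)

•  Suppose a new CPU has
   –  85% of capacitive load of old CPU
   –  15% voltage and 15% frequency reduction

$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^2 \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^2 \times F_{\text{old}}} = 0.85^4 = 0.52$$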
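The ratio in this example can be reproduced numerically. A minimal sketch, assuming power scales as capacitive load × voltage² × frequency; the helper name `power_ratio` is illustrative.

```python
def power_ratio(cap_scale: float, volt_scale: float, freq_scale: float) -> float:
    """New power divided by old power when C, V and F are each scaled."""
    # Power is proportional to C * V^2 * F, so voltage scaling counts twice.
    return cap_scale * volt_scale ** 2 * freq_scale


# 85% capacitive load, 15% lower voltage, 15% lower frequency.
ratio = power_ratio(cap_scale=0.85, volt_scale=0.85, freq_scale=0.85)
print(f"P_new / P_old = {ratio:.2f}")  # prints 0.52
```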

Power Trends

•  In CMOS IC technology

$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$

   –  Power: ×30
   –  Voltage: 5V → 1V
   –  Frequency: ×1000

Power

•  Intel 80386 consumed ~2 W
•  3.3 GHz Intel Core i7 consumes 130 W
•  Heat must be dissipated from a 1.5 x 1.5 cm chip
•  This is the limit of what can be cooled by air

The Power Wall

•  We can’t reduce voltage further
•  We can’t remove more heat
•  Techniques for reducing power:
   –  Do nothing well
      •  Turn off clock of inactive module
   –  Dynamic Voltage-Frequency Scaling
   –  Low power state for DRAM, disks
   –  Overclocking, turning off cores

Static Power

•  Caused by leakage: current flows even when a transistor is off
•  Scales with number of transistors
•  Leakage can be as high as 50%
   –  In part because of large SRAM caches
•  To reduce: power gating
   –  Turn off power of inactive modules

Measuring Performance

•  Typical performance metrics:
   –  Response time
   –  Throughput
•  Speedup of X relative to Y
   –  $\text{Execution time}_Y / \text{Execution time}_X$
•  Execution time
   –  Wall clock time: includes all system overheads
   –  CPU time: only computation time
•  Benchmarks
   –  Kernels (e.g. matrix multiply)
   –  Toy programs (e.g. sorting)
   –  Synthetic benchmarks (e.g. Dhrystone)
   –  Benchmark suites (e.g. SPEC06fp, TPC-C)

Response Time and Throughput

•  Response time
   –  How long it takes to do a task
•  Throughput
   –  Total work done per unit time
      •  e.g., tasks/transactions/… per hour
•  How are response time and throughput affected by
   –  Replacing the processor with a faster version?
   –  Adding more processors?
•  We’ll focus on response time for now…

Relative Performance: Speedup

•  Define Performance = 1 / Execution Time
•  “X is n times faster than Y”

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$

•  Example: time taken to run a program
   –  10s on A, 15s on B
   –  $\text{Execution Time}_B / \text{Execution Time}_A$ = 15s / 10s = 1.5
   –  So A is 1.5 times faster than B
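The speedup definition maps directly to code. A minimal sketch; the function name `speedup` and its parameter names are illustrative.

```python
def speedup(time_x: float, time_y: float) -> float:
    """How many times faster X is than Y, given their execution times."""
    # Performance = 1 / execution time, so the performance ratio X/Y
    # equals the execution time ratio Y/X.
    return time_y / time_x


# Example from the slide: 10 s on A, 15 s on B.
n = speedup(time_x=10.0, time_y=15.0)
print(f"A is {n} times faster than B")  # prints 1.5
```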

Measuring Execution Time

•  Elapsed time
   –  Total response time, including all aspects
      •  Processing, I/O, OS overhead, idle time
   –  Determines system performance
•  CPU time
   –  Time spent processing a given job
      •  Discounts I/O time, other jobs’ shares
   –  Comprises user CPU time and system CPU time
   –  Different programs are affected differently by CPU and system performance
   –  “time” command in Linux
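The difference between elapsed time and CPU time can also be observed from inside a program. The sketch below uses Python's standard `time.perf_counter` (wall clock) and `time.process_time` (user plus system CPU time of the current process); the helper `measure` and the task `mostly_idle` are hypothetical names made up for this illustration.

```python
import time


def measure(task):
    """Run task() once and return (elapsed_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    task()
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return elapsed, cpu


def mostly_idle():
    time.sleep(0.1)        # idle time: counts toward elapsed time only
    sum(range(100_000))    # a little real computation


elapsed, cpu = measure(mostly_idle)
print(f"elapsed: {elapsed:.3f} s, CPU: {cpu:.3f} s")
```

The sleep shows up in the elapsed time but not in the CPU time, which is the same distinction the Linux `time` command reports as real versus user and sys.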

CPU Clocking

•  Operation of digital hardware governed by a constant-rate clock

[Figure: clock waveform marking one clock period; within each cycle, data transfer and computation are followed by a state update]

•  Clock period: duration of a clock cycle
   –  e.g., 250 ps = 0.25 ns = 250 × 10^-12 s
•  Clock frequency (rate): cycles per second
   –  e.g., 4.0 GHz = 4000 MHz = 4.0 × 10^9 Hz
•  Clock period: 1 / (4.0 × 10^9) s = 0.25 ns

CPU Time

$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$

•  Performance improved by
   –  Reducing number of clock cycles
   –  Increasing clock rate
   –  Hardware designer must often trade off clock rate against cycle count

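CPU Time Example

•  Computer A: 2GHz clock, 10s CPU time
•  Designing Computer B
   –  Aim for 6s CPU time
   –  Can do faster clock, but causes 1.2 × clock cycles of A
•  How fast must Computer B clock be?

$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}}$$

$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^9$$

$$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^9}{6\,\text{s}} = \frac{24 \times 10^9}{6\,\text{s}} = 4\,\text{GHz}$$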
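The same calculation as a small script. This is a sketch only; `required_clock_rate` and its parameter names are illustrative and not from the slides.

```python
def required_clock_rate(cpu_time_a: float, clock_rate_a: float,
                        cycle_factor: float, target_time_b: float) -> float:
    """Clock rate (Hz) that B needs to reach target_time_b seconds."""
    cycles_a = cpu_time_a * clock_rate_a   # Clock Cycles = CPU Time x Clock Rate
    cycles_b = cycle_factor * cycles_a     # B needs 1.2x the cycles of A
    return cycles_b / target_time_b        # Clock Rate = Clock Cycles / CPU Time


rate_b = required_clock_rate(cpu_time_a=10.0, clock_rate_a=2e9,
                             cycle_factor=1.2, target_time_b=6.0)
print(f"Computer B needs a {rate_b / 1e9:.1f} GHz clock")  # prints 4.0 GHz
```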

Instruction Count and CPI

•  Instruction Count for a program
   –  Determined by program, ISA and compiler
•  Average cycles per instruction
   –  Determined by CPU hardware
   –  If different instructions have different CPI
      •  Average CPI affected by instruction mix

$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$$

$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

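CPI Example

•  Computer A: Cycle Time = 250ps, CPI = 2.0
•  Computer B: Cycle Time = 500ps, CPI = 1.2
•  Same ISA
•  Which is faster, and by how much?

$$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$$

$$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$$

   –  A is faster…

$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$$

   –  …by this much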
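A short script makes the comparison concrete. It is a sketch under one assumption: the instruction count `I` below is a hypothetical value, since the slide leaves it symbolic and it cancels out of the ratio anyway.

```python
def cpu_time(instruction_count: float, cpi: float, cycle_time: float) -> float:
    """CPU Time = Instruction Count x CPI x Clock Cycle Time (seconds)."""
    return instruction_count * cpi * cycle_time


I = 1_000_000  # hypothetical instruction count; same ISA, so same for A and B

time_a = cpu_time(I, cpi=2.0, cycle_time=250e-12)  # I x 500 ps
time_b = cpu_time(I, cpi=1.2, cycle_time=500e-12)  # I x 600 ps

print(f"A: {time_a:.2e} s, B: {time_b:.2e} s")
print(f"A is faster by a factor of {time_b / time_a:.1f}")  # prints 1.2
```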

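CPI in More Detail

•  If different instruction classes take different numbers of cycles

$$\text{Clock Cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{Instruction Count}_i)$$

•  Weighted average CPI

$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$

   –  $\text{Instruction Count}_i / \text{Instruction Count}$ is the relative frequency of class $i$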

CPI Example

•  Alternative compiled code sequences using instructions in classes A, B, C

Class                A    B    C
CPI for class        1    2    3
IC in sequence #1    2    1    2
IC in sequence #2    4    1    1

•  Sequence #1: IC = 5
   –  Clock Cycles = 2×1 + 1×2 + 2×3 = 10
   –  Avg. CPI = 10/5 = 2.0
•  Sequence #2: IC = 6
   –  Clock Cycles = 4×1 + 1×2 + 1×3 = 9
   –  Avg. CPI = 9/6 = 1.5
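The two sequences can be evaluated with the weighted-CPI formula. A minimal sketch; the dictionary layout and the function names `clock_cycles` and `average_cpi` are illustrative choices.

```python
# CPI for each instruction class, from the table above.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}


def clock_cycles(counts: dict) -> int:
    """Sum of CPI_i x Instruction Count_i over all classes."""
    return sum(CPI_BY_CLASS[cls] * n for cls, n in counts.items())


def average_cpi(counts: dict) -> float:
    """Weighted average CPI = Clock Cycles / Instruction Count."""
    return clock_cycles(counts) / sum(counts.values())


sequence_1 = {"A": 2, "B": 1, "C": 2}
sequence_2 = {"A": 4, "B": 1, "C": 1}

for name, seq in (("#1", sequence_1), ("#2", sequence_2)):
    print(f"Sequence {name}: cycles = {clock_cycles(seq)}, "
          f"avg CPI = {average_cpi(seq)}")
```

Sequence #2 executes more instructions but needs fewer cycles, because its mix leans toward the cheap class A.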

Performance Summary

•  Performance depends on
   –  Algorithm: affects IC, possibly CPI
   –  Programming language: affects IC, CPI
   –  Compiler: affects IC, CPI
   –  Instruction set architecture: affects IC, CPI, $T_c$

The BIG Picture

$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

SPEC CPU Benchmark

•  Programs used to measure performance
   –  Supposedly typical of actual workload
•  Standard Performance Evaluation Corp (SPEC)
   –  Develops benchmarks for CPU, I/O, Web, …
•  SPEC CPU2006
   –  Elapsed time to execute a selection of programs
      •  Negligible I/O, so focuses on CPU performance
   –  Normalize relative to reference machine
   –  Summarize as geometric mean of performance ratios
      •  CINT2006 (integer) and CFP2006 (floating-point)

$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
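The geometric-mean summary can be sketched in a few lines. The timings below are hypothetical placeholders, not SPEC results, and the helper names `spec_ratio` and `geometric_mean` are made up for this illustration.

```python
import math


def spec_ratio(reference_time: float, measured_time: float) -> float:
    """Execution time ratio: reference machine time / measured time."""
    return reference_time / measured_time


def geometric_mean(ratios) -> float:
    """n-th root of the product of n ratios, computed via logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical timings in seconds for two programs (not real SPEC data).
reference_times = [1000.0, 2000.0]
measured_times = [500.0, 250.0]

ratios = [spec_ratio(ref, t) for ref, t in zip(reference_times, measured_times)]
print(f"ratios = {ratios}, geometric mean = {geometric_mean(ratios):.2f}")
```

Summing logarithms instead of multiplying the ratios directly is a design choice that avoids overflow when many ratios are combined.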

Principles of Computer Design

•  The Processor Performance Equation
•  Different instruction types having different CPIs

Metrics of Performance

[Figure: system layers paired with typical performance metrics]

•  Layers: Application; Programming Language; Compiler; ISA; Datapath; Control; Function Units; Transistors, Wires, Pins
•  Metrics:
   –  Answers per day/month
   –  (millions) of Instructions per second: MIPS
   –  (millions) of (FP) operations per second: MFLOP/s
   –  Megabytes per second
   –  Cycles per second (clock rate)

Impacts by Components

                Inst Count    CPI    Clock Rate
Program             X
Compiler            X         (X)
Inst. Set           X          X
Architecture                   X         X
Technology                               X

[Figure labels: inst count, CPI, cycle time]

Principles of Computer Design

•  Take Advantage of Parallelism
   –  e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
•  Principle of Locality
   –  Reuse of data and instructions
•  Focus on the Common Case
   –  Amdahl’s Law

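Amdahl’s Law

$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$

$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

Best you could ever hope to do:

$$\text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$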
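Amdahl's Law is easy to experiment with in code. This is a minimal sketch; the function names are illustrative, and the 40% / 10× figures are hypothetical values chosen only to show the arithmetic.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the time is sped up by a factor."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


def max_speedup(fraction_enhanced: float) -> float:
    """Upper bound: the enhanced part takes no time at all."""
    return 1.0 / (1.0 - fraction_enhanced)


# Hypothetical case: 40% of execution time is made 10x faster.
overall = amdahl_speedup(fraction_enhanced=0.4, speedup_enhanced=10.0)
print(f"overall speedup = {overall:.4f}")        # 1 / (0.6 + 0.04) = 1.5625
print(f"best possible   = {max_speedup(0.4):.4f}")
```

Even a 10× improvement on 40% of the run time yields well under 2× overall, because the untouched 60% dominates.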

Using Amdahl’s Law

Amdahl’s Law for Parallelism

•  The enhanced fraction F is through parallelism: perfect parallelism with linear speedup
   –  The speedup for F is N for N processors
•  Overall speedup: $\dfrac{1}{(1 - F) + F/N}$
•  Speedup upper bound (when N → ∞): $\dfrac{1}{1 - F}$
   –  1 - F: the sequential portion of a program
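The effect of adding processors can be tabulated with a short script. It is a sketch assuming perfect linear speedup on the parallel fraction, as the slide states; the value F = 0.9 and the processor counts are hypothetical.

```python
def parallel_speedup(f: float, n: int) -> float:
    """Overall speedup with parallel fraction f on n processors."""
    # The sequential part (1 - f) is unchanged; the parallel part shrinks to f / n.
    return 1.0 / ((1.0 - f) + f / n)


F = 0.9                     # hypothetical: 90% of the program is parallel
bound = 1.0 / (1.0 - F)     # limit as n goes to infinity

for n in (2, 8, 64, 1024):
    print(f"N = {n:5d}: speedup = {parallel_speedup(F, n):.2f}")
print(f"upper bound: {bound:.2f}")
```

The speedup keeps rising with N but never reaches the bound set by the sequential portion.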


Pitfall: Amdahl’s Law

•  Improving an aspect of a computer and expecting a proportional improvement in overall performance

$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$

•  Example: multiply accounts for 80s/100s
   –  How much improvement in multiply performance to get 5× overall?

$$20 = \frac{80}{n} + 20$$

   –  Can’t be done!
•  Corollary: make the common case fast
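The pitfall can be checked by solving the equation for n. A minimal sketch; the function name `required_improvement` is illustrative, and returning `None` for an unreachable target is a choice made here, not something from the slides.

```python
def required_improvement(t_affected: float, t_unaffected: float,
                         target_speedup: float):
    """Factor n the affected part must improve by, or None if unreachable."""
    t_target = (t_affected + t_unaffected) / target_speedup
    budget = t_target - t_unaffected   # time left over for the affected part
    if budget <= 0:
        return None                    # even an infinite n is not enough
    return t_affected / budget


# Slide example: multiply takes 80 s of 100 s, and we want 5x overall.
print(required_improvement(80, 20, 5))   # prints None: can't be done
# A 4x target is reachable: 25 = 80/n + 20, so n = 16.
print(required_improvement(80, 20, 4))   # prints 16.0
```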

Exercise #1: Amdahl’s Law

Exercise #1: Amdahl’s Law solution

•  Textbook page #47

Exercise #2: CPU time and speedup

Exercise #2: solution, textbook page 51