TRANSCRIPT
Lecture 02: Technology Trends and Quantitative Design and Analysis for Performance
CSE 564 Computer Architecture, Summer 2017
Department of Computer Science and Engineering
Yonghong Yan
www.secs.oakland.edu/~yan
Contents
• Computers and computer components
• Computer architectures and great ideas in history and now
• Trends, Cost and Performance
Understanding Performance
• Algorithm
  – Determines number of operations executed
• Programming language, compiler, architecture
  – Determine number of machine instructions executed per operation
• Processor and memory system
  – Determine how fast instructions are executed
• I/O system (including OS)
  – Determines how fast I/O operations are executed
BelowYourProgram
• ApplicaConsoLware– WriNeninhigh-levellanguage
• SystemsoLware
– Compiler:translatesHLLcodetomachinecode
– OperaCngSystem:servicecode
• Handlinginput/output• Managingmemoryandstorage
• Schedulingtasks&sharingresources
• Hardware– Processor,memory,I/Ocontrollers
4
Levels of Program Code
• High-level language
  – Level of abstraction closer to problem domain
  – Provides for productivity and portability
• Assembly language
  – Textual representation of instructions
• Hardware representation
  – Binary digits (bits)
  – Encoded instructions and data
Trends in Technology
• Integrated circuit technology
  – Transistor density: 35%/year
  – Die size: 10-20%/year
  – Integration overall: 40-55%/year
• DRAM capacity: 25-40%/year (slowing)
• Flash capacity: 50-60%/year
  – 15-20X cheaper/bit than DRAM
• Magnetic disk technology: 40%/year
  – 15-25X cheaper/bit than Flash
  – 300-500X cheaper/bit than DRAM
Bandwidth and Latency
• Bandwidth or throughput
  – Total work done in a given time
  – 10,000-25,000X improvement for processors
  – 300-1200X improvement for memory and disks
• Latency or response time
  – Time between start and completion of an event
  – 30-80X improvement for processors
  – 6-8X improvement for memory and disks
Power and Energy
• Problem:
  – Get power in and distribute it around the chip
  – Get power out: dissipate heat
• Three primary concerns:
  – Max power requirement for a processor
  – Thermal Design Power (TDP)
    • Characterizes sustained power consumption
    • Used as target for power supply and cooling system
    • Lower than peak power, higher than average power consumption
  – Energy and energy efficiency
    • Clock rate can be reduced dynamically to limit power consumption
Energy and Energy Efficiency
• Power: energy per unit time
  – 1 watt = 1 joule per second
  – Energy per task is often a better measurement
• Processor A has 20% higher average power consumption than processor B. A executes the task in only 70% of the time needed by B.
  – So energy consumption of A will be 1.2 × 0.7 = 0.84 of B
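The energy comparison above follows directly from Energy = Power × Time; a minimal Python sketch checks the arithmetic (the 1.2 and 0.7 factors are from the slide, the baseline values are arbitrary units):

```python
# Energy = average power x execution time.
# Processor A: 20% higher average power than B, but only 70% of B's time.
power_b, time_b = 1.0, 1.0                 # B as baseline (arbitrary units)
power_a, time_a = 1.2 * power_b, 0.7 * time_b

energy_ratio = (power_a * time_a) / (power_b * time_b)
print(round(energy_ratio, 2))  # 0.84 -> A uses 16% less energy per task
```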
Dynamic Energy and Power
• Dynamic energy
  – Consumed when a transistor switches from 0 -> 1 or 1 -> 0
  – Energy_dynamic ∝ Capacitive load × Voltage^2
• Dynamic power
  – Power_dynamic ∝ 1/2 × Capacitive load × Voltage^2 × Frequency switched
• Reducing clock rate reduces power, not energy
• The capacitive load:
  – a function of the number of transistors connected to an output and the technology, which determines the capacitance of the wires and the transistors
An Example from Textbook
• Suppose a new CPU has
  – 85% of capacitive load of old CPU
  – 15% voltage and 15% frequency reduction

  P_new / P_old = ((0.85 × C_old) × (0.85 × V_old)^2 × (0.85 × F_old))
                / (C_old × V_old^2 × F_old)
                = 0.85^4 ≈ 0.52
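The textbook example scales each term of the dynamic-power relation; a short Python check of the 0.85^4 result:

```python
# Dynamic power ~ capacitive load x voltage^2 x frequency.
# New CPU: 85% of the capacitive load, 15% lower voltage and frequency.
c_scale, v_scale, f_scale = 0.85, 0.85, 0.85

power_ratio = c_scale * v_scale**2 * f_scale   # = 0.85^4
print(round(power_ratio, 2))  # 0.52 -> new CPU uses about half the power
```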
Power
• Intel 80386 consumed ~2 W
• 3.3 GHz Intel Core i7 consumes 130 W
• Heat must be dissipated from a 1.5 x 1.5 cm chip
• This is the limit of what can be cooled by air
The Power Wall
• We can't reduce voltage further
• We can't remove more heat
• Techniques for reducing power:
  – Do nothing well
    • Turn off clock of inactive modules
  – Dynamic Voltage-Frequency Scaling (DVFS)
  – Low power state for DRAM, disks
  – Overclocking, turning off cores
Static Power
• Leakage current flows even when a transistor is off
• Scales with number of transistors
• Leakage can be as high as 50% of total power
  – In part because of large SRAM caches
• To reduce: power gating
  – Turn off power of inactive modules
Measuring Performance
• Typical performance metrics:
  – Response time
  – Throughput
• Speedup of X relative to Y
  – Execution Time_Y / Execution Time_X
• Execution time
  – Wall clock time: includes all system overheads
  – CPU time: only computation time
• Benchmarks
  – Kernels (e.g. matrix multiply)
  – Toy programs (e.g. sorting)
  – Synthetic benchmarks (e.g. Dhrystone)
  – Benchmark suites (e.g. SPEC06fp, TPC-C)
Response Time and Throughput
• Response time
  – How long it takes to do a task
• Throughput
  – Total work done per unit time
    • e.g., tasks/transactions/… per hour
• How are response time and throughput affected by
  – Replacing the processor with a faster version?
  – Adding more processors?
• We'll focus on response time for now…
Relative Performance: Speedup
• Define Performance = 1 / Execution Time
• "X is n times faster than Y"

  Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n

• Example: time taken to run a program
  – 10 s on A, 15 s on B
  – Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
  – So A is 1.5 times faster than B
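The speedup definition above is easy to express as a tiny Python helper; the 15 s / 10 s figures are the slide's example:

```python
def speedup(time_y, time_x):
    """Speedup of X relative to Y = Performance_X / Performance_Y
    = Execution Time_Y / Execution Time_X."""
    return time_y / time_x

# A runs the program in 10 s, B in 15 s:
print(speedup(15.0, 10.0))  # 1.5 -> A is 1.5 times faster than B
```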
Measuring Execution Time
• Elapsed time
  – Total response time, including all aspects
    • Processing, I/O, OS overhead, idle time
  – Determines system performance
• CPU time
  – Time spent processing a given job
    • Discounts I/O time, other jobs' shares
  – Comprises user CPU time and system CPU time
  – Different programs are affected differently by CPU and system performance
  – "time" command in Linux
CPU Clocking
• Operation of digital hardware governed by a constant-rate clock
[Figure: clock waveform; each clock period covers data transfer and computation, then a state update]
  – Clock period: duration of a clock cycle
    • e.g., 250 ps = 0.25 ns = 250 × 10^-12 s
  – Clock frequency (rate): cycles per second
    • e.g., 4.0 GHz = 4000 MHz = 4.0 × 10^9 Hz
    • Clock period = 1 / (4.0 × 10^9) s = 0.25 ns
CPU Time
• Performance improved by
  – Reducing number of clock cycles
  – Increasing clock rate
  – Hardware designer must often trade off clock rate against cycle count

  CPU Time = CPU Clock Cycles × Clock Cycle Time
           = CPU Clock Cycles / Clock Rate
CPU Time Example
• Computer A: 2 GHz clock, 10 s CPU time
• Designing Computer B
  – Aim for 6 s CPU time
  – Can do faster clock, but causes 1.2 × clock cycles of A
• How fast must Computer B's clock be?

  Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20 × 10^9
  Clock Rate_B = Clock Cycles_B / CPU Time_B = 1.2 × Clock Cycles_A / 6 s
               = 1.2 × 20 × 10^9 / 6 s = 24 × 10^9 / 6 s = 4 GHz
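The same derivation in Python, using the slide's numbers (2 GHz, 10 s, 1.2× cycle count, 6 s target):

```python
# Computer A: 2 GHz clock, 10 s CPU time.
clock_rate_a = 2e9                           # Hz
cpu_time_a = 10.0                            # s
clock_cycles_a = cpu_time_a * clock_rate_a   # 20e9 cycles

# Computer B: needs 1.2x A's cycle count, target 6 s CPU time.
clock_cycles_b = 1.2 * clock_cycles_a
cpu_time_b = 6.0
clock_rate_b = clock_cycles_b / cpu_time_b
print(round(clock_rate_b / 1e9, 2))  # 4.0 -> B must run at 4 GHz
```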
Instruction Count and CPI
• Instruction Count (IC) for a program
  – Determined by program, ISA and compiler
• Average cycles per instruction (CPI)
  – Determined by CPU hardware
  – If different instructions have different CPI
    • Average CPI affected by instruction mix

  Clock Cycles = Instruction Count × Cycles per Instruction
  CPU Time = Instruction Count × CPI × Clock Cycle Time
           = Instruction Count × CPI / Clock Rate
CPI Example
• Computer A: Cycle Time = 250 ps, CPI = 2.0
• Computer B: Cycle Time = 500 ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

  CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250 ps = I × 500 ps
  CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500 ps = I × 600 ps
  CPU Time_B / CPU Time_A = (I × 600 ps) / (I × 500 ps) = 1.2

A is faster… by this much (1.2×)
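Since the instruction count I is the same on both machines (same ISA), it cancels in the ratio; a short Python check of the slide's numbers:

```python
# CPU Time = Instruction Count x CPI x Cycle Time.
# The instruction count cancels in the ratio, so compare per-instruction time.
cpi_a, cycle_a = 2.0, 250e-12   # Computer A
cpi_b, cycle_b = 1.2, 500e-12   # Computer B

time_a = cpi_a * cycle_a        # 500 ps per instruction
time_b = cpi_b * cycle_b        # 600 ps per instruction
print(round(time_b / time_a, 2))  # 1.2 -> A is 1.2 times faster
```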
CPI in More Detail
• If different instruction classes take different numbers of cycles

  Clock Cycles = Σ (i=1..n) CPI_i × Instruction Count_i

• Weighted average CPI

  CPI = Clock Cycles / Instruction Count
      = Σ (i=1..n) CPI_i × (Instruction Count_i / Instruction Count)

  where Instruction Count_i / Instruction Count is the relative frequency of class i
CPI Example
• Alternative compiled code sequences using instructions in classes A, B, C

  Class              A   B   C
  CPI for class      1   2   3
  IC in sequence #1  2   1   2
  IC in sequence #2  4   1   1

• Sequence #1: IC = 5
  – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence #2: IC = 6
  – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  – Avg. CPI = 9/6 = 1.5
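The weighted-average CPI computation above generalizes to any instruction mix; a small Python sketch using the table's values:

```python
# CPI per class and instruction counts, taken from the table above.
cpi = {"A": 1, "B": 2, "C": 3}
seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}

def avg_cpi(counts):
    """Weighted average CPI = total clock cycles / total instruction count."""
    cycles = sum(cpi[c] * n for c, n in counts.items())
    return cycles / sum(counts.values())

print(avg_cpi(seq1))  # 2.0  (10 cycles / 5 instructions)
print(avg_cpi(seq2))  # 1.5  (9 cycles / 6 instructions)
```

Note that sequence #2 executes more instructions yet takes fewer cycles, which is why CPI alone is not a performance metric.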
Performance Summary
• Performance depends on
  – Algorithm: affects IC, possibly CPI
  – Programming language: affects IC, CPI
  – Compiler: affects IC, CPI
  – Instruction set architecture: affects IC, CPI, Tc

The BIG Picture:

  CPU Time = Instructions/Program × Clock cycles/Instruction × Seconds/Clock cycle
SPEC CPU Benchmark
• Programs used to measure performance
  – Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
  – Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
  – Elapsed time to execute a selection of programs
    • Negligible I/O, so focuses on CPU performance
  – Normalize relative to reference machine
  – Summarize as geometric mean of performance ratios
    • CINT2006 (integer) and CFP2006 (floating-point)

  Geometric mean = (Π (i=1..n) Execution time ratio_i)^(1/n)
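The geometric-mean summary can be sketched in a few lines of Python; the four ratios below are made-up placeholders, not actual SPEC results:

```python
import math

def spec_ratio_summary(ratios):
    """Geometric mean of per-benchmark execution-time ratios
    (benchmark time on the reference machine / time on the tested machine)."""
    return math.prod(ratios) ** (1.0 / len(ratios))

# Hypothetical ratios for a 4-benchmark run (illustrative numbers only):
print(round(spec_ratio_summary([2.0, 8.0, 4.0, 1.0]), 2))  # 2.83
```

The geometric mean is used because it makes the summary independent of which machine is chosen as the reference.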
Metrics of Performance
[Figure: levels of the machine (Application, Programming Language, Compiler, ISA, Datapath/Control, Function Units, Transistors/Wires/Pins), each matched with its natural metric: answers per day/month; (millions of) instructions per second: MIPS; (millions of) (FP) operations per second: MFLOP/s; megabytes per second; cycles per second (clock rate)]

Impacts by Components

  Component     Inst Count   CPI   Clock Rate
  Program           X
  Compiler          X        (X)
  Inst. Set         X         X
  Architecture                X        X
  Technology                           X
Principles of Computer Design
• Take Advantage of Parallelism
  – e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
• Principle of Locality
  – Reuse of data and instructions
• Focus on the Common Case
  – Amdahl's Law
Amdahl's Law

  ExTime_new = ExTime_old × [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

  Speedup_overall = ExTime_old / ExTime_new
                  = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Best you could ever hope to do:

  Speedup_maximum = 1 / (1 - Fraction_enhanced)
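Amdahl's Law as stated above translates directly into code; the 80% / 10× figures in the usage line are illustrative, not from the slide:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when fraction_enhanced of the old execution time
    is sped up by speedup_enhanced (Amdahl's Law)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Illustrative: enhance 80% of the execution time by 10x.
print(round(amdahl_speedup(0.8, 10.0), 2))  # 3.57, far below 10
```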
Amdahl's Law for Parallelism
• The enhanced fraction F is through parallelism; assume perfect parallelism with linear speedup
  – The speedup for F is N for N processors
• Overall speedup: 1 / ((1 - F) + F/N)
• Speedup upper bound (when N -> ∞): 1 / (1 - F)
  – 1 - F: the sequential portion of a program
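The parallel special case and its upper bound can be checked numerically; the 95% parallel fraction below is an illustrative choice, not from the slide:

```python
def parallel_speedup(f, n):
    """Amdahl's Law with the enhanced fraction f parallelized
    with linear speedup over n processors."""
    return 1.0 / ((1.0 - f) + f / n)

# Illustrative: 95% parallel fraction. Even with unlimited processors,
# the sequential 5% caps the speedup at 1 / (1 - 0.95) = 20.
print(round(parallel_speedup(0.95, 100), 1))   # 16.8 with 100 processors
print(round(1.0 / (1.0 - 0.95), 1))            # 20.0 upper bound
```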
Pitfall: Amdahl's Law
• Improving an aspect of a computer and expecting a proportional improvement in overall performance

  T_improved = T_affected / improvement factor + T_unaffected

• Example: multiply accounts for 80 s out of 100 s total
  – How much improvement in multiply performance to get 5× overall?

    20 = 80/n + 20

  – Can't be done!
• Corollary: make the common case fast
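The pitfall is easy to see numerically: a 5× overall speedup requires the total to drop from 100 s to 20 s, but the 20 s not spent in multiply is untouched, so 80/n + 20 = 20 has no finite solution:

```python
# Multiply takes 80 s of a 100 s total. Speeding up only multiply by a
# factor n leaves T = 80/n + 20, which approaches but never reaches 20 s.
def total_time(multiply_speedup):
    return 80.0 / multiply_speedup + 20.0

for n in (2, 10, 100, 1_000_000):
    print(n, total_time(n))   # 60.0, 28.0, 20.8, ... -> asymptote at 20 s
```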