10/17/2016 CS152, Fall 2016
CS152 Computer Architecture and Engineering
Lecture 13 - VLIW Machines and Statically Scheduled ILP
John Wawrzynek
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~johnw
http://inst.eecs.berkeley.edu/~cs152

CS152 Administrivia
§ Quiz 3, Thursday Oct 21: L10-L12, PS3, Lab 3 directed portion
§ Make sure to understand PS3!
§ Lab 3 due Friday Oct 28

Last time in Lecture 12
§ Unified physical register file machines remove data values from ROB
  – All values only read and written during execution
  – Only register tags held in ROB
  – Allocate resources (ROB slot, destination physical register, memory reorder queue location) during decode
  – Issue window can be separated from ROB and made smaller than ROB (allocate in decode, free after instruction completes)
  – Free resources on commit
§ Speculative store buffer holds store values before commit to allow load-store forwarding
§ Can execute later loads past earlier stores when addresses known, or predicted no dependence

Superscalar Control Logic Scaling
§ Each issued instruction must somehow check against W*L instructions, i.e., growth in hardware ∝ W*(W*L)
§ For in-order machines, L is related to pipeline latencies and check is done during issue (interlocks or scoreboard)
§ For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB), and check is done by broadcasting tags to waiting instructions at writeback (completion)
§ As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L
=> Out-of-order control logic grows faster than W^2 (~W^3)
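
As a rough illustration of the scaling argument, the sketch below counts dependence checks as W*(W*L). The function names and the assumption that L grows linearly with W (L = k*W, with a made-up constant k) are for illustration only and are not from the slides.

```python
def dependence_checks(width: int, lifetime: int) -> int:
    """Each of `width` issuing instructions checks against width * lifetime others."""
    return width * (width * lifetime)


def checks_with_growing_window(width: int, k: int = 4) -> int:
    # Assumption for illustration: lifetime L grows linearly with W (L = k * W),
    # which turns the W^2 * L cost into roughly W^3.
    return dependence_checks(width, k * width)


if __name__ == "__main__":
    for w in (1, 2, 4, 8):
        print(w, dependence_checks(w, 8), checks_with_growing_window(w))
```

Under this assumption, doubling W multiplies the check count by eight, which is the ~W^3 growth quoted above.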
[Figure: an issue group of issue width W is checked against previously issued instructions over lifetime L]

Out-of-Order Control Complexity: MIPS R10000
[Figure: MIPS R10000 with the control logic labeled. SGI/MIPS Technologies Inc., 1995]

Sequential ISA Bottleneck
[Figure: sequential source code (a = foo(b); for (i=0, i<) passes through a superscalar compiler (find independent operations, schedule operations) to produce sequential machine code, which a superscalar processor must process again (check instruction dependencies, schedule execution)]

VLIW: Very Long Instruction Word
§ Multiple operations packed into one instruction
§ Each operation slot is for a fixed function
§ Constant operation latencies are specified
§ Architecture requires guarantee of:
  – Parallelism within an instruction => no cross-operation RAW check
  – No data use before data ready => no data interlocks

Instruction format: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
  – Two integer units, single-cycle latency
  – Two load/store units, three-cycle latency
  – Two floating-point units, four-cycle latency
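
A minimal sketch of what "no data use before data ready" means for a compiler-produced schedule, using the three latencies listed above. The `Op` record and `schedule_is_legal` are hypothetical names, and the check is deliberately conservative: any read of a register whose latest write is still in flight counts as a violation.

```python
from typing import NamedTuple, Optional, Tuple

LATENCY = {"int": 1, "mem": 3, "fp": 4}  # cycles, as listed on the slide


class Op(NamedTuple):
    unit: str                 # "int", "mem", or "fp"
    dest: Optional[str]       # register written, or None (stores, branches)
    srcs: Tuple[str, ...]     # registers read


def schedule_is_legal(program) -> bool:
    """program[c] is the list of Ops issued together in cycle c."""
    ready_at = {}  # register -> first cycle in which its newest value may be read
    for cycle, instruction in enumerate(program):
        # All operations in one instruction read their sources before any writes land.
        for op in instruction:
            for src in op.srcs:
                if ready_at.get(src, 0) > cycle:
                    return False  # hardware has no interlock to catch this
        for op in instruction:
            if op.dest is not None:
                ready_at[op.dest] = cycle + LATENCY[op.unit]
    return True
```

A schedule that issues a load in cycle 0 and its dependent FP add in cycle 3 passes; moving the add to cycle 2 fails, and on a real VLIW it would silently read a stale value.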

Early VLIW Machines
§ FPS AP120B (1976)
  – scientific attached array processor
  – first commercial wide instruction machine
  – hand-coded vector math libraries using software pipelining and loop unrolling
§ Multiflow Trace (1987)
  – commercialization of ideas from Fisher’s Yale group including “trace scheduling”
  – available in configurations with 7, 14, or 28 operations/instruction
  – 28 operations packed into a 1024-bit instruction word
§ Cydrome Cydra-5 (1987)
  – 7 operations encoded in 256-bit instruction word
  – rotating register file

VLIW Compiler Responsibilities
§ Schedule operations to maximize parallel execution
§ Guarantees intra-instruction parallelism
§ Schedule to avoid data hazards (no interlocks)
  – Typically separates operations with explicit NOPs
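
To make the cost of explicit NOPs concrete, the sketch below packs already-placed operations into fixed six-slot instruction words and fills every unused slot with a NOP. The slot names follow the six-unit machine described earlier; the function names and the placement used in the usage note are hypothetical.

```python
SLOTS = ("Int1", "Int2", "M1", "M2", "FP+", "FPx")


def pack_with_nops(placed_ops, n_cycles):
    """placed_ops maps (cycle, slot) -> operation text; returns one word per cycle."""
    return [
        [placed_ops.get((cycle, slot), "nop") for slot in SLOTS]
        for cycle in range(n_cycles)
    ]


def nop_fraction(words):
    """Fraction of all slots in the packed words that hold a NOP."""
    total = sum(len(word) for word in words)
    nops = sum(op == "nop" for word in words for op in word)
    return nops / total
```

For example, six operations spread over eight cycles occupy 6 of 48 slots, so 87.5% of the encoded instruction bits are NOPs.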

Loop Execution

for (i=0; i<N; i++)
    B[i] = A[i] + C;

Compile:

loop: fld  f1, 0(x1)
      add  x1, 8
      fadd f2, f0, f1
      fsd  f2, 0(x2)
      add  x2, 8
      bne  x1, x3, loop

Schedule:

Cycle  Int1    Int2  M1   M2  FP+   FPx
1      add x1        fld
2
3
4                             fadd
5
6
7
8      add x2  bne   fsd

How many FP ops/cycle?
1 fadd / 8 cycles = 0.125
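
The 8-cycle figure follows directly from the operation latencies. A minimal sketch, assuming the 3-cycle load and 4-cycle FP add latencies given earlier and ignoring resource limits (one iteration never needs more than one unit of each kind); `asap_schedule` is a hypothetical helper name.

```python
LATENCY = {"fld": 3, "fadd": 4, "fsd": 1}  # cycles; the fsd value is unused here


def asap_schedule(ops, deps):
    """Return 1-based issue cycles, placing each op as soon as its producers finish.

    ops:  operation names in program order
    deps: op index -> list of producer indices (must come earlier in ops)
    """
    issue = []
    for i, _ in enumerate(ops):
        earliest = 1
        for p in deps.get(i, []):
            earliest = max(earliest, issue[p] + LATENCY[ops[p]])
        issue.append(earliest)
    return issue


chain = ["fld", "fadd", "fsd"]
issue_cycles = asap_schedule(chain, {1: [0], 2: [1]})
loop_length = issue_cycles[-1]          # the store issues in the last cycle
fp_ops_per_cycle = 1 / loop_length
```

The load issues in cycle 1, the add in cycle 4, and the store in cycle 8, giving one FP operation per eight cycles.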

Loop Unrolling
for (i=0; i<N; i++)
B[i] = A[i] + C;
for (i=0; i<N; i+=4)
{
B[i] = A[i] + C;
B[i+1] = A[i+1] + C;
B[i+2] = A[i+2] + C;
B[i+3] = A[i+3] + C;
}
Unroll inner loop to perform 4 iterations at once
Need to handle values of N that are not multiples of unrolling factor with final cleanup loop
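
A small sketch of the cleanup loop, written in Python for brevity. The function name is hypothetical; the unrolled body covers the largest multiple of 4 not exceeding N, and the remaining 0 to 3 elements are handled one at a time.

```python
def add_constant_unrolled(a, c):
    """Compute b[i] = a[i] + c with a 4-way unrolled main loop and a cleanup loop."""
    n = len(a)
    b = [0] * n
    main_end = n - (n % 4)            # largest multiple of 4 that fits in n
    for i in range(0, main_end, 4):   # unrolled body: 4 iterations per trip
        b[i] = a[i] + c
        b[i + 1] = a[i + 1] + c
        b[i + 2] = a[i + 2] + c
        b[i + 3] = a[i + 3] + c
    for i in range(main_end, n):      # cleanup loop for the leftover elements
        b[i] = a[i] + c
    return b
```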

Scheduling Loop Unrolled Code

Unroll 4 ways:

loop: fld  f1, 0(x1)
      fld  f2, 8(x1)
      fld  f3, 16(x1)
      fld  f4, 24(x1)
      add  x1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      fsd  f5, 0(x2)
      fsd  f6, 8(x2)
      fsd  f7, 16(x2)
      fsd  f8, 24(x2)
      add  x2, 32
      bne  x1, x3, loop

Schedule:

Cycle  Int1    Int2  M1      M2  FP+      FPx
1                    fld f1
2                    fld f2
3                    fld f3
4      add x1        fld f4      fadd f5
5                                fadd f6
6                                fadd f7
7                                fadd f8
8                    fsd f5
9                    fsd f6
10                   fsd f7
11     add x2  bne   fsd f8

How many FLOPS/cycle?
4 fadds / 11 cycles = 0.36
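
The 11-cycle count can be reproduced with a small model. This is a sketch under stated assumptions: one load, one FP add, and one store may issue per cycle, loads and stores never compete for the same memory unit, and latencies are 3 cycles for loads and 4 for FP adds. The function name is hypothetical.

```python
LOAD_LATENCY = 3   # cycles
FADD_LATENCY = 4   # cycles


def unrolled_loop_cycles(unroll: int) -> int:
    """Cycles for one trip through a loop body unrolled `unroll` (>= 1) ways."""
    load_issue = [1 + k for k in range(unroll)]       # one fld per cycle
    fadd_issue = []
    for k in range(unroll):
        earliest = load_issue[k] + LOAD_LATENCY       # wait for the loaded value
        if fadd_issue:
            earliest = max(earliest, fadd_issue[-1] + 1)   # one fadd per cycle
        fadd_issue.append(earliest)
    store_issue = []
    for k in range(unroll):
        earliest = fadd_issue[k] + FADD_LATENCY       # wait for the sum
        if store_issue:
            earliest = max(earliest, store_issue[-1] + 1)  # one fsd per cycle
        store_issue.append(earliest)
    return store_issue[-1]                            # last store ends the trip
```

With `unroll=4` the model returns 11 cycles, so 4 fadds / 11 cycles is about 0.36 FLOPS/cycle; with `unroll=1` it returns the 8 cycles of the original loop.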

Software Pipelining

Unroll 4 ways first:

loop: fld  f1, 0(x1)
      fld  f2, 8(x1)
      fld  f3, 16(x1)
      fld  f4, 24(x1)
      add  x1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      fsd  f5, 0(x2)
      fsd  f6, 8(x2)
      fsd  f7, 16(x2)
      add  x2, 32
      fsd  f8, -8(x2)
      bne  x1, x3, loop

[Schedule table: columns Int1, Int2, M1, M2, FP+, FPx. Successive unrolled iterations (each consisting of fld f1-f4, fadd f5-f8, fsd f5-f8, add x1, add x2, bne) are overlapped in time, forming a prolog, a steady-state loop body (labeled "loop: iterate"), and an epilog.]

How many FLOPS/cycle?
4 fadds / 4 cycles = 1
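
One way to picture the prolog, steady state, and epilog is to treat each unrolled iteration as three stages (four loads, four adds, four stores) and start a new iteration every 4 cycles. The stage split and the 4-cycle block size are simplifying assumptions, not the exact slide schedule, and `pipeline_blocks` is a hypothetical name.

```python
STAGES = ("fld x4", "fadd x4", "fsd x4")   # assumed: one stage per 4-cycle block


def pipeline_blocks(n_iters):
    """For each 4-cycle block, list the (iteration, stage) pairs active in it."""
    n_blocks = n_iters + len(STAGES) - 1
    blocks = []
    for b in range(n_blocks):
        active = [
            (it, STAGES[b - it])
            for it in range(n_iters)
            if 0 <= b - it < len(STAGES)
        ]
        blocks.append(active)
    return blocks


if __name__ == "__main__":
    for b, active in enumerate(pipeline_blocks(3)):
        print(f"cycles {4 * b + 1}-{4 * b + 4}: {active}")
```

Blocks that contain all three stages form the steady-state loop body; each issues four fadds in four cycles, which is where 1 FLOP/cycle comes from. The partially filled blocks before and after are the prolog and epilog.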

Software Pipelining vs. Loop Unrolling
[Figure: two plots of performance vs. time. Loop unrolled: each loop iteration has its own startup overhead and wind-down overhead. Software pipelined: startup and wind-down overhead appear once, at the beginning and end of the whole loop.]
Software pipelining pays startup/wind-down costs only once per loop, not once per iteration
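
A back-of-the-envelope comparison, assuming 11 cycles per trip for the 4-way unrolled loop and, for the software-pipelined loop, a 4-cycle initiation interval with three pipeline stages. The stage count and both function names are illustrative assumptions.

```python
def unrolled_cycles(n_trips, cycles_per_trip=11):
    # Every trip pays the full fill and drain latency of the body.
    return n_trips * cycles_per_trip


def pipelined_cycles(n_trips, initiation_interval=4, n_stages=3):
    # Fill and drain are paid once: (n_stages - 1) extra intervals in total.
    return (n_trips + n_stages - 1) * initiation_interval
```

For 100 trips this gives 1100 cycles unrolled versus 408 cycles software pipelined; the 8 cycles of fill and drain are amortized over the whole loop.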

What if there are no loops?
§ Branches limit basic block size in control-flow intensive irregular code
§ Difficult to find ILP in individual basic blocks
[Figure: control-flow graph; label: Basic block]

Trace Scheduling [Fisher, Ellis]
§ Pick string of basic blocks, a trace, that represents most frequent branch path
§ Use profiling feedback or compiler heuristics to find common branch paths
§ Schedule whole “trace” at once
§ Add fixup code to cope with branches jumping out of trace
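
Trace selection from profile data can be sketched as a greedy walk that always follows the most frequently taken outgoing edge. This is a simplified illustration, not Fisher's full algorithm (which also grows traces backward and generates the fixup code); the CFG representation and `pick_trace` are hypothetical.

```python
def pick_trace(cfg, edge_counts, entry):
    """Greedily follow the hottest successor edge starting from `entry`.

    cfg:         block name -> list of successor block names
    edge_counts: (src, dst) -> profiled execution count
    """
    trace = [entry]
    seen = {entry}
    block = entry
    while cfg.get(block):
        hottest = max(cfg[block], key=lambda s: edge_counts.get((block, s), 0))
        if hottest in seen:      # stop at a loop back edge
            break
        trace.append(hottest)
        seen.add(hottest)
        block = hottest
    return trace
```

Blocks left off the trace (the rarely taken side of each branch) are where the fixup code has to go.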

Problems with “Classic” VLIW
§ Object-code compatibility
  – have to recompile all code for every machine, even for two machines in same generation
§ Object code size
  – instruction padding wastes instruction memory/cache
  – loop unrolling/software pipelining replicates code
§ Scheduling variable latency memory operations
  – caches and/or memory bank conflicts impose statically unpredictable variability
§ Knowing branch probabilities
  – optimal schedule varies with branch path
  – profiling requires a significant extra step in build process

VLIW Instruction Encoding
§ Schemes to reduce effect of unused fields
  – Compressed format in memory, expand on I-cache refill
    • used in Multiflow Trace
    • introduces instruction addressing challenge
  – Mark parallel groups
    • used in TMS320C6x DSPs, Intel IA-64
  – Provide a single-op VLIW instruction
    • Cydra-5 UniOp instructions
[Figure: instruction stream divided into Group 1, Group 2, Group 3]
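
The "mark parallel groups" scheme can be sketched with a per-operation stop flag: operations are stored back to back with no NOP padding, and the flag marks the last operation of each group. The flag layout here is an illustrative assumption and is not the exact TMS320C6x or IA-64 encoding; `split_into_groups` is a hypothetical name.

```python
def split_into_groups(ops):
    """ops: list of (operation, stop) pairs; stop=True ends a parallel group."""
    groups, current = [], []
    for op, stop in ops:
        current.append(op)
        if stop:
            groups.append(current)
            current = []
    if current:                 # trailing operations with no stop flag
        groups.append(current)
    return groups
```

A stream `[("a", False), ("b", True), ("c", True), ("d", False), ("e", False)]` decodes into the groups `[a, b]`, `[c]`, and `[d, e]`, and only five operations are stored.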

Intel Itanium, EPIC IA-64
§ EPIC is the style of architecture (cf. CISC, RISC)
  – Explicitly Parallel Instruction Computing (really just VLIW)
§ IA-64 is Intel’s chosen ISA (cf. x86, MIPS)
  – IA-64 = Intel Architecture 64-bit
  – An object-code-compatible VLIW
§ Merced was first Itanium implementation (cf. 8086)
  – First customer shipment expected 1997 (actually 2001)
  – McKinley, second implementation, shipped in 2002
  – Recent version, Poulson, eight cores, 32nm, announced 2011

Eight Core Itanium “Poulson” [Intel 2011]
§ 8 cores
§ 1-cycle 16KB L1 I&D caches
§ 9-cycle 512KB L2 I-cache
§ 8-cycle 256KB L2 D-cache
§ 32 MB shared L3 cache
§ 544 mm^2 in 32nm CMOS
§ Over 3 billion transistors
§ Cores are 2-way multithreaded
§ 6 instruction/cycle fetch
  – Two 128-bit bundles
§ Up to 12 insts/cycle execute

IA-64 Instruction Format

128-bit instruction bundle:
Instruction 2 | Instruction 1 | Instruction 0 | Template

§ Template bits describe grouping of these instructions with others in adjacent bundles
§ Each group contains instructions that can execute in parallel

[Figure: groups i-1, i, i+1, i+2 laid over bundles j-1, j, j+1, j+2; group boundaries need not coincide with bundle boundaries]
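
A sketch of packing and unpacking a bundle. The field widths (a 5-bit template in the low bits followed by three 41-bit instruction slots) come from the IA-64 architecture rather than from the slide, and the function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41          # 5 + 3 * 41 = 128


def pack_bundle(template, slots):
    """Combine a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Return (template, [instruction 0, instruction 1, instruction 2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(3)
    ]
    return template, slots
```

Decoding which slots go to which unit type, and where the group boundaries fall, would then be a table lookup on the template value; that table is omitted here.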

IA-64 Registers
§ 128 General Purpose 64-bit Integer Registers
§ 128 General Purpose 64/80-bit Floating Point Registers
§ 64 1-bit Predicate Registers
§ GPRs “rotate” to reduce code size for software pipelined loops
  – Rotation is a simple form of register renaming allowing one instruction to address different physical registers on each iteration
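
A simplified model of rotation: a rotating register base is added to every register number, modulo the size of the rotating region, and is decremented once per loop iteration. The class name, the 8-entry size, and zero-based numbering are illustrative assumptions; real IA-64 rotates a region starting at r32.

```python
class RotatingFile:
    """Toy rotating register file: virtual name + base (mod size) = physical."""

    def __init__(self, size=8):
        self.regs = [None] * size
        self.rrb = 0                      # rotating register base

    def _phys(self, virt):
        return (virt + self.rrb) % len(self.regs)

    def write(self, virt, value):
        self.regs[self._phys(virt)] = value

    def read(self, virt):
        return self.regs[self._phys(virt)]

    def rotate(self):
        # Done once per iteration: a value written as register v
        # becomes visible as register v + 1 in the next iteration.
        self.rrb = (self.rrb - 1) % len(self.regs)
```

The same instruction "write register 0" therefore targets a different physical register on every iteration, so a software-pipelined loop body does not have to be unrolled just to give each in-flight iteration its own register names.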

IA-64 Predicated Execution
Problem: Mispredicted branches limit ILP
  – 20-30% of performance goes to branch mispredictions [Intel98]
Solution: Eliminate hard-to-predict branches with predicated execution
  – Almost all IA-64 instructions can be executed conditionally under predicate
  – Instruction becomes NOP if predicate register false

Four basic blocks:

b0: Inst 1              (if)
    Inst 2
    br a==b, b2
b1: Inst 3              (else)
    Inst 4
    br b3
b2: Inst 5              (then)
    Inst 6
b3: Inst 7
    Inst 8

Predication, single basic block:

    Inst 1
    Inst 2
    p1,p2 <- cmp(a==b)
    (p1) Inst 3 || (p2) Inst 5
    (p1) Inst 4 || (p2) Inst 6
    Inst 7
    Inst 8

§ Simplifies scheduling
  – Turn “control flow” into “data flow”
§ Less code is needed
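
The effect of predication can be modeled by walking one straight-line instruction list and skipping any instruction whose predicate is false. To match the branch version, where a==b jumps to the block holding Inst 5 and Inst 6, this sketch assumes p2 holds the compare result and p1 its complement; the function name is hypothetical, and the Python `if` stands in for the hardware turning the instruction into a NOP.

```python
def run_predicated(a, b):
    """Return the instructions that take effect in the single predicated block."""
    executed = ["Inst 1", "Inst 2"]
    p2 = (a == b)        # assumed polarity: p2 guards the "then" instructions
    p1 = not p2          # p1 guards the "else" instructions
    block = (
        (p1, "Inst 3"), (p2, "Inst 5"),
        (p1, "Inst 4"), (p2, "Inst 6"),
    )
    for predicate, inst in block:
        if predicate:    # false predicate: the instruction becomes a NOP
            executed.append(inst)
    executed += ["Inst 7", "Inst 8"]
    return executed
```

All eight instructions are fetched and issued on every pass; only the set that takes effect changes, so there is no branch left to mispredict.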

IA-64 Data Speculation
Problem: Possible memory hazards limit code scheduling

    Inst 1
    Inst 2
    Store
    Load r1
    Use r1
    Inst 3

Can’t move load above store because store might be to same address

Solution: Hardware to check pointer hazards

    Load.a r1     <- data speculative load adds address to address check table
    Inst 1
    Inst 2
    Store         <- store invalidates any matching loads in address check table
    Load.c        <- check if load invalid (or missing), jump to fixup code if so
    Use r1
    Inst 3

Requires associative hardware in address check table
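
The Load.a / Store / Load.c protocol can be sketched with a small table keyed by destination register. This is a simplified model: it matches whole addresses exactly, ignores access sizes and table capacity, and uses hypothetical class and method names.

```python
class AddressCheckTable:
    """Toy model of the address check table used by data-speculative loads."""

    def __init__(self):
        self.entries = {}                 # destination register -> load address

    def advanced_load(self, reg, addr, memory):
        """Load.a: perform the load early and remember its address."""
        self.entries[reg] = addr
        return memory.get(addr, 0)

    def store(self, addr, value, memory):
        """Store: write memory and invalidate any matching advanced loads."""
        memory[addr] = value
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def check(self, reg):
        """Load.c: True if the early value is still valid, False if fixup is needed."""
        if reg in self.entries:
            del self.entries[reg]
            return True
        return False
```

If the intervening store hits a different address, `check` succeeds and the early value is used; if it hits the same address, `check` fails and the fixup code reloads the value. The scan over all entries in `store` is the part that needs associative hardware.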

Limits of Static Scheduling
§ Unpredictable branches
§ Variable memory latency (unpredictable cache misses)
§ Code size explosion
§ Compiler complexity

Despite several attempts, VLIW has failed in general-purpose computing arena (so far).
  – More complex VLIW architectures close to in-order superscalar in complexity, no real advantage on large complex apps.
Successful in embedded DSP market
  – Simpler VLIWs with more constrained environment, friendlier code.

Acknowledgements
§ These slides contain material developed and copyrighted by:
  – Arvind (MIT)
  – Krste Asanovic (MIT/UCB)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
  – David Patterson (UCB)
§ MIT material derived from course 6.823
§ UCB material derived from course CS252