Name: ID:
Page 1 of 14
McGill University
ECSE 425 Computer Organization and Architecture, Fall 2009
FINAL EXAMINATION SOLUTIONS
9:00 am - 12:00 pm, December 11, 2009
Duration: 180 minutes
Question 1. Short Answers (35 points). There are 2 parts to this question.
1) Part 1: There are 10 sub-questions in this part (3 points each).
For each question below, provide a short answer in 1-2 sentences.
a) What is the difference between multithreading and simultaneous multithreading (SMT)?
Answer: Multithreading exploits TLP; it can run on any machine (single or multiple issue, uni- or multiprocessor), with one thread in each clock cycle. SMT exploits both TLP and ILP; it must run on a multiple-issue machine with dynamic scheduling, with multiple threads in each clock cycle.
b) What is the shared-memory multiprocessor model? Can it be applied to distributed-memory multiprocessor systems?
Answer: The shared-memory multiprocessor model uses a shared address space among all processors. It can be applied to physically distributed memory multiprocessor systems.
c) Name two major challenges in parallel processing using multiprocessors.
Answer: Programs must be parallelized, and the serial portion of the program becomes the bottleneck. The latency to remote memory is longer.
d) Why is cache coherency not an issue in a uniprocessor system but an issue in shared-memory multiprocessor systems?
Answer: In a uniprocessor, only one processor has access to the data and no other processor can modify it. In shared-memory multiprocessor systems, other processors can access and modify the data.
e) The bus-based broadcast snooping protocol serializes all the coherence traffic. Name an advantage and a disadvantage of this serialization.
Answer: Serializing preserves memory access order (RAW, WAW, WAR) among all processors. However, the latency can be longer, and the bus can become the bottleneck as the number of processors increases.
f) A challenge of multiprocessor systems is to build operations that appear atomic. Give an example of a sequence of instructions that can be used for this purpose.
Answer:
    try: ll   R1, 0(R2)
         <some operation on R1>
         sc   R1, 0(R2)
         beqz R1, try
g) What is the difference between write allocate and no-write allocate?
Answer: In write allocate, the block must be transferred to the cache on a write miss, followed by a write-hit action (write-back or write-through). In no-write allocate, there is no block transfer to the cache on a write miss and the data is written directly to memory.
h) Why can caches with a virtual index and physical tag help reduce the hit time compared to physically addressed caches?
Answer: With a virtual index and physical tag, the cache read using the virtual index and the address translation can be done in parallel. In a fully physically addressed cache, the hardware must translate the address first, then read the cache using the translated (physical) index.
i) Give an example of a technique to reduce cache miss penalty.
Answer:
1. Use multilevel caches.
2. Fetch the critical word first and then the rest of the block, or use early restart (fetch in order but resume the CPU instruction as soon as the needed word arrives, while fetching the rest of the block).
3. Prioritize read misses over writes by using a write buffer.
j) What is the difference between AMAT and CPU time? Which one is a more accurate performance measure for a computer system?
Answer: AMAT is the average time it takes to access memory. CPU time is the time it takes to run a sequence of instructions; its formula includes the AMAT. CPU time is the more accurate performance measure, since it gives a more realistic picture of performance: not all instructions access memory.
Part 2 (5 points)
The current focus in computer design has shifted from getting more instructions per clock cycle on a single CPU (by having multiple pipelines with multiple issues) to having multiple CPUs (each CPU is either single issue or multiple issue). Does it mean that exploiting instruction-level parallelism is no longer useful? What factors do you think will influence the success of multiple-CPU architecture? Provide your answer in a short paragraph (no more than half a page).
Answer:
Exploiting ILP is still useful. What the shift means is that current techniques for exploiting ILP on a single processor are good enough, and it is not worthwhile to invent new techniques to exploit ILP because of diminishing returns (at least with the current technology).
The current focus is on new techniques to exploit TLP, for which multiprocessors seem most suitable. (Each processor, however, can implement techniques for exploiting ILP within each thread.)
In current and future systems, both ILP and TLP are going to exist. The balance, however, is unclear and depends on the application. The application is a major factor that influences the success of an architecture.
Another important factor is the software. Multiple processors provide the platform for exploiting TLP, but the software must also be parallelized to take advantage of this platform before we see major multiprocessor successes.
Technology that affects the speed of interprocessor communication is also a factor.
Power consumption is likely to increase for multiple processors, so efficient power management is another factor.
Question 2. Multiprocessors (30 points). There are 3 parts to this question.
Part a) (10 points) Consider the write-back invalidate snooping protocol with 3 states: Invalid, Shared and Exclusive. List all bus requests to a shared block in a cache and show the corresponding state transitions in the finite state machine for this protocol.
Answer:
Shared, on read miss  -> Shared
Shared, on write miss -> Invalid
Shared, on invalidate -> Invalid
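The three transitions above can be written as a small lookup table. This is only a sketch of the Shared-state row of the protocol; the names `SHARED_TRANSITIONS` and `on_bus_request` are hypothetical, and transitions out of Invalid and Exclusive are deliberately left out because the answer does not list them.

```python
# Hypothetical encoding of the Shared-state transitions listed in the answer.
SHARED_TRANSITIONS = {
    "read miss": "Shared",    # another cache reads the block: stay Shared
    "write miss": "Invalid",  # another cache wants to write: drop our copy
    "invalidate": "Invalid",  # another cache upgrades its copy: drop ours
}


def on_bus_request(state: str, request: str) -> str:
    """Return the next state of a block after snooping a bus request."""
    if state != "Shared":
        # Only the Shared row is covered by this sketch.
        raise ValueError("only the Shared state is modelled here")
    return SHARED_TRANSITIONS[request]
```

For example, `on_bus_request("Shared", "write miss")` gives `"Invalid"`.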
Part b) (10 points) Assume that words x1 and x2 are in the same cache block, which is in the Shared state in the caches of both processors P1 and P2. Assuming the following sequence of events, identify whether each event is a hit or a miss, and classify each miss as a true sharing miss or a false sharing miss. Any miss that would occur if the block size were one word is designated a true sharing miss.

Time  P1        P2
1     Read x1
2               Write x2
3     Write x1
4               Read x2
5     Write x2

Answers:

Time  P1        P2        Hit/miss? True or false sharing miss? Why?
1     Read x1             Hit, since x1 is in the Shared state.
2               Write x2  True sharing miss, since P1 also has x2 in the Shared state.
3     Write x1            False sharing miss, since P1 also has x1 in the Shared state.
4               Read x2   False sharing miss, since P1 has the block containing x2 in the Exclusive state even though it did not modify x2.
5     Write x2            True sharing miss, since P2 also has x1 in the Shared state.
Part c) (10 points)
Consider a 4-processor distributed shared-memory system. Each processor has a single direct-mapped cache that holds four blocks, each containing two words with addresses separated by 4. To simplify the illustration, the cache address tag contains the full address and each word shows only two hexadecimal characters, with the least significant word on the right. The cache states are denoted M, S, and I for Modified, Shared, and Invalid. The directory states are denoted DM, DS, and DI for Directory Modified, Directory Shared, and Directory Invalid. This simple directory protocol uses the messages given in Table 2c on the next page. Assume the cache contents of the four processors and the content of the main memory as shown in Figure 2c below. A CPU operation is of the form
P#:[
each read operation? For each operation, what is the sequence of messages passed on the bus? You can use the table on the following page to help you with the bus messages.
Note: The tags are in hexadecimal.

P0                  P1                  P2                  P3
state  tag  data    state  tag  data    state  tag  data    state  tag  data
I      100  26 10   I      100  26 10   S      120  02 20   S      120  02 20
S      108  15 08   M      128  2D 68   S      108  15 08   I      128  43 30
M      110  F7 30   I      110  6F 10   I      110  6F 10   M      130  64 00
I      118  C2 10   S      118  3E 18   I      118  C2 10   I      118  40 28

Memory address  state  Sharers  Data
100             DI              20 00
108             DS     P0,P2    15 08
110             DM     P0       6F 10
118             DS     P1       3E 18
120             DS     P2,P3    02 20
128             DM     P1       3D 28
130             DM     P3       01 30

P0: read 130
P3: write 130
P = requesting processor number, A = requested address, and D = data contents
Table 2c. Messages for a simple directory protocol.
To show the bus messages, use the following format:
Bus{message type, requesting processor, address, data}
Example: Bus{read miss, P0, 100, }
To show the contents in the cache of a processor, use the following format:
P#{state, tag, data}
Example: P3{S, 120, 02 20}
To show the contents in the memory, use the following format:
M{state, [sharers], data}
Example: M{DS, [P0,P3], 02 20}
Answers:
P0: read 130
Bus{data write back, 110, F7 30} sent by P0 to directory
M.110{DI, , F7 30}
Bus{read miss, P0, 130} sent by P0 to directory
Bus{fetch, 130} sent by directory to P3
P3.B2{S, 130, 64 00}
Bus{data write back, 130, 64 00} sent by P3 to directory
Bus{data value reply, 64 00} sent by directory to P0
P0.B2{S, 130, 64 00}; returns 00
M.130{DS, {P0,P3}, 64 00}
P3: write 130
Question 3. Memory Hierarchy (30 points). There are 3 parts to this question.
Part a) (5 points) Draw the finite state machine for a 2-bit local predictor using the saturating counter in the space below.
T0 = 11 (taken)
T1 = 10 (taken)
N1 = 01 (not taken)
N0 = 00 (not taken)
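Since the drawn state machine did not survive in this transcript, the following is a minimal sketch of how a 2-bit saturating counter behaves, assuming the textbook scheme: a taken branch increments the counter up to 11, a not-taken branch decrements it down to 00, and the upper two states predict taken. The function names are hypothetical.

```python
# State names follow the answer: T0 = 11, T1 = 10, N1 = 01, N0 = 00.
STATE_NAMES = {0b11: "T0", 0b10: "T1", 0b01: "N1", 0b00: "N0"}


def next_state(counter: int, taken: bool) -> int:
    """Saturating update: count up on taken, down on not taken."""
    if taken:
        return min(counter + 1, 0b11)
    return max(counter - 1, 0b00)


def predict_taken(counter: int) -> bool:
    """States 11 and 10 predict taken; 01 and 00 predict not taken."""
    return counter >= 0b10


def run(outcomes, counter=0b00):
    """Return the list of state names visited for a sequence of outcomes."""
    visited = [STATE_NAMES[counter]]
    for taken in outcomes:
        counter = next_state(counter, taken)
        visited.append(STATE_NAMES[counter])
    return visited
```

Starting in N0, two taken branches in a row are needed before the prediction flips to taken, which is the point of using two bits instead of one.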
Part b) (15 points) Consider a Virtual Memory/Cache system with the following properties:

Virtual address size    64 bits, byte addressable
Physical address size   30 bits, byte addressable
Block size              32 bytes
Page size               64 kbytes
Total cache data size   32 kbytes
Cache associativity     4-way set associative
TLB associativity       1-way set associative
TLB size                1024 entries in total

Name the address fields and calculate the bit size of each field in the following figure.

Field  Name                 Bit size
a      TLB tag              38
b      TLB index            10
c      Page offset          16
d      Physical page table  14
e      Cache tag            17
f      Cache index          8
g      Block offset         5
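The bit sizes in the answer can be re-derived from the system properties. The sketch below assumes a physically indexed, physically tagged cache and reads field d as the physical page number (physical address bits minus page offset bits); the function and key names are hypothetical.

```python
def log2_exact(n: int) -> int:
    """Return log2(n) for a power of two, rejecting anything else."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def address_fields(va_bits=64, pa_bits=30, block_bytes=32,
                   page_bytes=64 * 1024, cache_bytes=32 * 1024,
                   cache_ways=4, tlb_entries=1024, tlb_ways=1):
    page_offset = log2_exact(page_bytes)
    tlb_index = log2_exact(tlb_entries // tlb_ways)
    tlb_tag = va_bits - page_offset - tlb_index   # rest of the virtual page number
    physical_page = pa_bits - page_offset
    block_offset = log2_exact(block_bytes)
    cache_sets = cache_bytes // (block_bytes * cache_ways)
    cache_index = log2_exact(cache_sets)
    cache_tag = pa_bits - cache_index - block_offset
    return {
        "a_tlb_tag": tlb_tag,
        "b_tlb_index": tlb_index,
        "c_page_offset": page_offset,
        "d_physical_page": physical_page,
        "e_cache_tag": cache_tag,
        "f_cache_index": cache_index,
        "g_block_offset": block_offset,
    }
```

With the default arguments (the values from the table of properties) the function returns 38, 10, 16, 14, 17, 8 and 5 for fields a to g.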
[Figure for Part a: state diagram of the 2-bit saturating counter, with edges labelled Taken and Not Taken.]
Part c) (10 points) Consider a MIPS machine with a byte-addressable main memory and the following specifications:

Data cache size   1 kB
Block size        64 B

The following C program representing a dot product (with no optimizations) is executed on this computer.

    int i;
    int a[256], b[256];
    int c;
    for ( i = 0; i < 256; i++ ){
        c = a[i] * b[i] + c;
    }

Assume that the size of each array element is one word of 4 bytes and the elements are stored in consecutive memory locations in array index order. Array a starts at address 0x0000, b at 0x0400. What is the miss rate given a 2-way set-associative cache? Show your calculations.
Answer:
Given that a[i] and b[i] will never replace each other, and given that we read every element once and in order, the number of misses corresponds to the number of blocks required to hold arrays a and b. Array a requires 16 blocks to hold its 256 words, so 16 blocks will be transferred from memory to the cache throughout the loop. The same applies to array b.
Throughout the loop, 32 blocks will be transferred from memory to the cache, and there will be 512 memory accesses. That makes a total of 32 misses out of 512 memory accesses and gives a miss rate of 6.25%.
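The count above can be checked with a small cache model. This is a sketch that assumes LRU replacement (the question does not name a policy) and, like the answer, counts only the accesses to a[i] and b[i]; the function names are hypothetical.

```python
from collections import OrderedDict


def dot_product_trace(n=256, a_base=0x0000, b_base=0x0400, word_bytes=4):
    """Yield the byte addresses read by the loop: a[i], then b[i]."""
    for i in range(n):
        yield a_base + i * word_bytes
        yield b_base + i * word_bytes


def miss_rate(addresses, cache_bytes=1024, block_bytes=64, ways=2):
    """Simulate a set-associative cache with LRU replacement (assumed)."""
    num_sets = cache_bytes // (block_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    accesses = misses = 0
    for addr in addresses:
        accesses += 1
        block = addr // block_bytes
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return misses / accesses
```

Calling `miss_rate(dot_product_trace())` reproduces the 32 misses in 512 accesses, that is 0.0625.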
Question 4. Pipelining and Instruction-Level Parallelism (55 points). There are 4 parts to this question.
For all 4 parts, use the following snippet of code:

    loop: L.D    F0,0(R1)
          ADD.D  F0,F0,F4
          L.D    F2,0(R2)
          MUL.D  F2,F0,F2
          S.D    F2,0(R2)
          DADDUI R1,R1,#-8
          DADDUI R2,R2,#-8
          BNEZ   R1,loop

Also, use the following execution time for each unit:

Functional unit   Cycles to execute
FP add            3
FP mult           6
Load/store        2
Int ALU           1

Part a) (10 points) Identify all hazards in the snippet of code.
Potential data hazards:
RAW:
    L.D F0,0(R1)      -> ADD.D F0,F0,F4
    ADD.D F0,F0,F4    -> MUL.D F2,F0,F2
    L.D F2,0(R2)      -> MUL.D F2,F0,F2
    MUL.D F2,F0,F2    -> S.D F2,0(R2)
    DADDUI R1,R1,#-8  -> BNEZ R1,loop
WAW:
    L.D F2,0(R2)      -> MUL.D F2,F0,F2
    L.D F0,0(R1)      -> ADD.D F0,F0,F4
WAR:
    L.D F0,0(R1)      -> DADDUI R1,R1,#-8
    L.D F2,0(R2)      -> DADDUI R2,R2,#-8
    S.D F2,0(R2)      -> DADDUI R2,R2,#-8
Part b) (15 points)
Part b.i) Unroll the loop twice (2 iterations per new loop) and schedule it on a 5-issue VLIW machine using the provided table.

Clock  Memory reference 1  Memory reference 2  FP operation 1   FP operation 2   Integer operation/branch
1      L.D F0,0(R1)        L.D F6,-8(R1)
2      L.D F2,0(R2)        L.D F8,-8(R2)
3
4                                              ADD.D F0,F0,F4   ADD.D F6,F6,F4
5
6
7                                              MUL.D F2,F0,F2   MUL.D F8,F6,F8
8
9                                                                                DADDUI R1,R1,#-16
10                                                                               DADDUI R2,R2,#-16
11                                                                               BNEZ R1,loop (with delay slot)
12     S.D F2,16(R2)       S.D F8,8(R2)                                          BNEZ R1,loop (without delay slot)
Part b.ii) In this VLIW machine, at least how many times do you need to unroll the loop to get the maximum efficiency?
Answer: Without getting into too many complexities, we can easily unroll 6 times to get 6 iterations per 14 CC. We can also unroll 10 times to get 10 iterations per 19 CC by performing another 4 iterations while idling. The true maximum efficiency is reached when one of the units is always busy. Given enough registers, unrolling 32 times will give the best efficiency, where every memory reference slot will always be busy.
Assume an infinite loop. What is the average number of clock cycles per iteration?
Answer: 6 clock cycles per iteration for part b.i (12 cycles for 2 iterations). 1.5 clock cycles per iteration for maximum efficiency (3 memory references per iteration over 2 memory reference slots).
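The cycles-per-iteration figures quoted in this answer are simple ratios. The sketch below recomputes them with exact fractions; the function names are hypothetical, and the memory-bound limit assumes that the two memory reference slots are the only saturated resource.

```python
from fractions import Fraction


def cycles_per_iteration(schedule_cycles: int, iterations: int) -> Fraction:
    """Average cycles per original loop iteration for an unrolled schedule."""
    return Fraction(schedule_cycles, iterations)


def memory_bound(mem_refs_per_iteration: int, mem_slots: int) -> Fraction:
    """Lower bound when the memory reference slots are always busy."""
    return Fraction(mem_refs_per_iteration, mem_slots)


# Schedules quoted in the answer: (cycles, iterations).
unrolled = {"2x": (12, 2), "6x": (14, 6), "10x": (19, 10)}
averages = {k: cycles_per_iteration(*v) for k, v in unrolled.items()}

# Each iteration has 2 loads and 1 store; the machine has 2 memory slots.
limit = memory_bound(3, 2)
```

Here `averages["2x"]` is 6 and `limit` is 3/2, matching the two numbers in the answer; the 6x and 10x schedules fall in between at 7/3 and 19/10.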
Part c) (15 points)
Part c.i) Consider a single-issue dynamically scheduled machine without hardware speculation. Assume that the functional units are pipelined and that all memory accesses hit the cache. There is a memory unit with 5 load buffers and 5 store buffers. Each load or store takes 2 cycles to execute: 1 to calculate the address and 1 to load/store the data. There are dedicated integer functional units for effective address calculation and branch condition evaluation. The other functional units are described in the following table.

Func. unit type   Number of func. units   Number of reservation stations
Integer ALU       1                       5
FP adder          1                       3
FP multiplier     1                       2
Load              1                       5
Store             1                       5

Now dynamically schedule 2 iterations of the original loop on this machine without speculation. Show the clock cycle number of each stage of the dynamically scheduled code in the table below. Assume at least one cycle delay between successive steps of every instruction execution sequence (issue, execution start, write back).

Instruction  Operands      Issue   Execution start  Write back
L.D          F0,0(R1)      1       2                4
ADD.D        F0,F0,F4      2       5                8
L.D          F2,0(R2)      3       4                6
MUL.D        F2,F0,F2      4       9                15
S.D          F2,0(R2)      5       16*              17
DADDUI       R1,R1,#-8     6       7                9
DADDUI       R2,R2,#-8     7       8                10
BNEZ         R1,loop       8       10               11
L.D          F0,0(R1)      12(1)   13               16
ADD.D        F0,F0,F4      13      17               20
L.D          F2,0(R2)      14      16               18
MUL.D        F2,F0,F2      15      21               27
S.D          F2,0(R2)      16      28               29
DADDUI       R1,R1,#-8     17      18               19
DADDUI       R2,R2,#-8     18      19               21
BNEZ         R1,loop       19      20               22

* S.D here can calculate the address while waiting for F2.
(1) L.D must wait for the branch to return the branch decision.
Part c.ii) In this dynamically scheduled code, assuming an infinite loop, what is the average number of clock cycles per old iteration?
Answer: 11 CC per iteration, given that the second iteration can only start at cycle 12.
Part d) (15 points) Now we use the dynamic scheduling hardware to build a speculative machine that can issue and commit 2 instructions per cycle. Again assume that the functional units are pipelined and that all memory accesses hit the cache. There is a memory unit with 8 load buffers. The reorder buffer has 50 entries. The reorder buffer can function as a store buffer, so there are no separate store buffers. Each load or store takes 2 cycles to execute: 1 to calculate the address and 1 to load/store the data. Assume a branch predictor with 0% misprediction rate. Assume there are dedicated integer functional units for effective address calculation and branch condition evaluation. The other functional units are described in the following table.

Functional unit type   Number of functional units   Number of reservation stations per functional unit
Integer ALU            2                            4
FP adder               2                            3
FP multiplier          2                            2
Load                   2                            4

Part d.i) Schedule 2 iterations of the original code on this speculative machine in the table below. Assume at least one cycle delay between successive steps of every instruction execution sequence (issue, execution start, write back, commit). Assume two common data buses.

Instruction  Operands      Issue  Execution start  Write back  Commit
L.D          F0,0(R1)      1      2                4           5
ADD.D        F0,F0,F4      1      5                8           9
L.D          F2,0(R2)      2      3                5           9
MUL.D        F2,F0,F2      2      9                15          16
S.D          F2,0(R2)      3      4                5           16
DADDUI       R1,R1,#-8     3      4                6           17
DADDUI       R2,R2,#-8     4      5                6           17
BNEZ         R1,loop       4      7                8           18
L.D          F0,0(R1)      5      6                9           18
ADD.D        F0,F0,F4      5      10               12          19
L.D          F2,0(R2)      6      7                9           19
MUL.D        F2,F0,F2      6      13               19          20
S.D          F2,0(R2)      7      8                10          20
DADDUI       R1,R1,#-8     7      8                10          21
DADDUI       R2,R2,#-8     8      9                11          21
BNEZ         R1,loop       8      11               12          22
Part d.ii) Assuming an infinite loop and no ROB overflow, what is the average number of clock cycles per iteration on this speculative machine? Compare with parts b and c.
Answer: It takes 4 CC per old iteration. The VLIW performs best; however, it requires software scheduling. Note also that the VLIW is 5-issue whereas the speculative machine is double-issue. Also, this speculative machine with double issue performs two loops in less than half the cycles required by the non-speculative machine in part c.
Question 5. Performance (30 points). There are 3 parts to this question.
Part a) Reliability and Amdahl's law. (10 points) Consider a system in which the components have the following MTTF (in hours):

CPU            1,000,000
Hard disk      200,000
Memory         500,000
Power supply   100,000

Part a.i) Assume that if any component fails, then the system fails. What is the system MTTF?
Part a.ii) You buy an additional hard drive and bring the total hard disk MTTF to 600,000 hours, which is a 3-times improvement. Using Amdahl's law, compute the improvement in the whole system reliability.
The hard disk contributes [...]% of the total MTTF.
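The worked numbers for Part a did not survive in this transcript, so the following is only a sketch of one standard way to compute them. It assumes independent components with constant failure rates (so the system failure rate is the sum of the component rates) and reads the hard disk's "contribution" as its share of the total failure rate; the function names are hypothetical.

```python
MTTF_HOURS = {
    "CPU": 1_000_000,
    "Hard disk": 200_000,
    "Memory": 500_000,
    "Power supply": 100_000,
}


def system_mttf(mttf_hours: dict) -> float:
    """System MTTF when any single component failure fails the system."""
    total_failure_rate = sum(1.0 / m for m in mttf_hours.values())
    return 1.0 / total_failure_rate


def failure_share(mttf_hours: dict, name: str) -> float:
    """Fraction of the total failure rate due to one component."""
    return (1.0 / mttf_hours[name]) * system_mttf(mttf_hours)


def amdahl(fraction: float, factor: float) -> float:
    """Overall improvement when `fraction` is improved by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


base = system_mttf(MTTF_HOURS)                    # about 55,556 hours
disk_share = failure_share(MTTF_HOURS, "Hard disk")  # 5/18, about 27.8%
improvement = amdahl(disk_share, 3.0)             # about 1.23
```

As a cross-check, recomputing the system MTTF directly with a 600,000-hour disk and dividing by `base` gives the same improvement factor as the Amdahl's law expression.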
Part b) Cache performance (10 points) Consider a memory system with a latency of 60 clocks. The transfer rate is 4 bytes per clock cycle and 30% of the transfers are dirty. There are 32 bytes per block and 25% of the instructions are data transfer instructions. There is no write buffer. In addition, the TLB takes 40 clock cycles on a TLB miss. A TLB does not slow down a cache hit. For the TLB, make the simplifying assumption that 0.5% of all references are not found in the TLB, either when addresses come directly from the CPU or when addresses come from cache misses.
If the base CPI with a perfect memory system is 1.5, what is the CPI for a 16 KB two-way set-associative unified cache using write-back with a cache miss rate of 1.6%?
Compute the effective CPI for this cache with the real TLB.
Answers:
Since this is a unified cache, both instructions and data share the same cache and have the same transfer rate on block replacements.
Virtually addressed cache: an address from the CPU goes through the cache first, and only on a cache miss does it go through the TLB.
Physically addressed cache: an address from the CPU goes through the TLB first, then through the cache.
Part c) Branch prediction performance (10 points)
Suppose we have a deeply pipelined processor, for which we implement a branch target buffer for the conditional branches and branch folding for the unconditional branches.
For the conditional branches, assume that the misprediction penalty is always 4 cycles and the buffer miss penalty is always 3 cycles. Assume a 90% branch target buffer hit rate, 90% target address accuracy, and a 15% conditional branch frequency.
For branch folding that stores the target instructions of the unconditional branches, assume also a 90% hit rate and a 5% unconditional branch frequency. Assume also that the hit target instruction can bypass the fetch stage and start immediately in the decode stage.
How much faster is this processor versus a processor that has a fixed 2-cycle branch penalty for both unconditional and conditional branches? Assume a base CPI without branch stalls of 1.
Answer:
CPI of the deeply pipelined processor, assuming that the bypassing only happens for unconditional branches
The deeply pipelined processor is 1.31 times faster than the fixed 2-cycle branch processor.
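The CPI expressions themselves are missing from this transcript. The sketch below is one reconstruction that reproduces the stated factor of 1.31; it rests on two assumptions that are not confirmed by the text: a folding hit saves one cycle (the bypassed fetch stage), and a folding miss costs the same 3 cycles as a buffer miss. All function and parameter names are hypothetical.

```python
def deep_pipeline_cpi(base_cpi=1.0,
                      cond_freq=0.15, btb_hit=0.90, accuracy=0.90,
                      mispredict_penalty=4, buffer_miss_penalty=3,
                      uncond_freq=0.05, fold_hit=0.90,
                      fold_hit_gain=1,          # assumed: one fetch cycle saved
                      fold_miss_penalty=3):     # assumed: same as a buffer miss
    # Conditional branches: pay on a buffer hit that mispredicts, or on a miss.
    cond_stalls = cond_freq * (btb_hit * (1 - accuracy) * mispredict_penalty
                               + (1 - btb_hit) * buffer_miss_penalty)
    # Unconditional branches: a folding hit gains a cycle, a miss stalls.
    uncond_stalls = uncond_freq * (fold_hit * -fold_hit_gain
                                   + (1 - fold_hit) * fold_miss_penalty)
    return base_cpi + cond_stalls + uncond_stalls


def fixed_penalty_cpi(base_cpi=1.0, branch_freq=0.20, penalty=2):
    """Reference processor: every branch costs a fixed 2 cycles."""
    return base_cpi + branch_freq * penalty


speedup = fixed_penalty_cpi() / deep_pipeline_cpi()
```

Under these assumptions the deeply pipelined CPI is about 1.069, the fixed-penalty CPI is 1.4, and the ratio rounds to 1.31. Other choices for the folding miss penalty give ratios that no longer round to 1.31 (for example, 4 cycles gives about 1.30).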