NVIDIA's Fermi: The First Complete GPU Computing Architecture
A white paper by Peter N. Glaskowsky
Prepared under contract with NVIDIA Corporation
September 29, 2009
Copyright © September 2009, Peter N. Glaskowsky
Peter N. Glaskowsky is a consulting computer architect, technology analyst, and professional blogger in Silicon Valley. Glaskowsky was the principal system architect of chip startup Montalvo Systems. Earlier, he was Editor in Chief of the award-winning industry newsletter Microprocessor Report.

Glaskowsky writes the Speeds and Feeds blog for the CNET Blog Network:
http://www.speedsnfeeds.com/

This document is licensed under the Creative Commons Attribution-ShareAlike 3.0 License. In short: you are free to share and make derivative works of the file under the conditions that you appropriately attribute it, and that you distribute it only under a license identical to this one.
http://creativecommons.org/licenses/by-sa/3.0/

Company and product names may be trademarks of the respective companies with which they are associated.
Executive Summary

After 38 years of rapid progress, conventional microprocessor technology is beginning to see diminishing returns. The pace of improvement in clock speeds and architectural sophistication is slowing, and while single-threaded performance continues to improve, the focus has shifted to multicore designs.

These too are reaching practical limits for personal computing; a quad-core CPU isn't worth twice the price of a dual-core, and chips with even higher core counts aren't likely to be a major driver of value in future PCs.

CPUs will never go away, but GPUs are assuming a more prominent role in PC system architecture. GPUs deliver more cost-effective and energy-efficient performance for applications that need it.

The rapidly growing popularity of GPUs also makes them a natural choice for high-performance computing (HPC). Gaming and other consumer applications create a demand for millions of high-end GPUs each year, and these high sales volumes make it possible for companies like NVIDIA to provide the HPC market with fast, affordable GPU computing products.

NVIDIA's next-generation CUDA architecture (code-named Fermi) is the latest and greatest expression of this trend. With many times the performance of any conventional CPU on parallel software, and new features to make it easier for software developers to realize the full potential of the hardware, Fermi-based GPUs will bring supercomputer performance to more users than ever before.

Fermi is the first architecture of any kind to deliver all of the features required for the most demanding HPC applications: unmatched double-precision floating-point performance, IEEE 754-2008 compliance including fused multiply-add operations, ECC protection from the registers to DRAM, a straightforward linear addressing model with caching at all levels, and support for languages including C, C++, FORTRAN, Java, Matlab, and Python.

With these features, plus many other performance and usability enhancements, Fermi is the first complete architecture for GPU computing.
CPU Computing — the Great Tradition

The history of the microprocessor over the last 38 years describes the greatest period of sustained technical progress the world has ever seen. Moore's Law, which describes the rate of this progress, has no equivalent in transportation, agriculture, or mechanical engineering. Think how different the Industrial Revolution would have been 300 years ago if, for example, the strength of structural materials had doubled every 18 months from 1771 to 1809. Never mind steam; the 19th century could have been powered by pea-sized internal-combustion engines compressing hydrogen to produce nuclear fusion.

CPU performance is the product of many related advances:

• Increased transistor density
• Increased transistor performance
• Wider data paths
• Pipelining
• Superscalar execution
• Speculative execution
• Caching
• Chip- and system-level integration

The first thirty years of the microprocessor focused almost exclusively on serial workloads: compilers, managing serial communication links, user-interface code, and so on. More recently, CPUs have evolved to meet the needs of parallel workloads in markets from financial transaction processing to computational fluid dynamics.

CPUs are great things. They're easy to program, because compilers evolved right along with the hardware they run on. Software developers can ignore most of the complexity in modern CPUs; microarchitecture is almost invisible, and compiler magic hides the rest. Multicore chips have the same software architecture as older multiprocessor systems: a simple coherent memory model and a sea of identical computing engines.

But CPU cores continue to be optimized for single-threaded performance at the expense of parallel execution. This fact is most apparent when one considers that integer and floating-point execution units occupy only a tiny fraction of the die area in a modern CPU.

Figure 1 shows the portion of the die area used by ALUs in the Core i7 processor (the chip code-named Bloomfield) based on Intel's Nehalem microarchitecture.
Figure 1. Intel's Core i7 processor (the chip code-named Bloomfield, based on the Nehalem microarchitecture) includes four CPU cores with simultaneous multithreading, 8 MB of L3 cache, and on-chip DRAM controllers. Made with 45 nm process technology, each chip has 731 million transistors and consumes up to 130 W of thermal design power. Red outlines highlight the portion of each core occupied by execution units. (Source: Intel Corporation except red highlighting)
With such a small part of the chip devoted to performing direct calculations, it's no surprise that CPUs are relatively inefficient for high-performance computing applications. Most of the circuitry on a CPU, and therefore most of the heat it generates, is devoted to invisible complexity: those caches, instruction decoders, branch predictors, and other features that are not architecturally visible but which enhance single-threaded performance.
Speculation

At the heart of this focus on single-threaded performance is a concept known as speculation. At a high level, speculation encompasses not only speculative execution (in which instructions begin executing even before it is possible to know their results will be needed), but many other elements of CPU design.
Caches, for example, are fundamentally speculative: storing data in a cache represents a bet that the data will be needed again soon. Caches consume die area and power that could otherwise be used to implement and operate more execution units. Whether the bet pays off depends on the nature of each workload.

Similarly, multiple execution units, out-of-order processing, and branch prediction also represent speculative optimizations. All of these choices tend to pay off for code with high data locality (where the same data items, or those nearby in memory, are frequently accessed), a mix of different operations, and a high percentage of conditional branches.

But when executing code consisting of many sequential operations of the same type — like scientific workloads — these speculative elements can sit unused, consuming die area and power.
The effect of process technology

The need for CPU designers to maximize single-threaded performance is also behind the use of aggressive process technology to achieve the highest possible clock rates. But this decision also comes with significant costs. Faster transistors run hotter, leak more power even when they aren't switching, and cost more to manufacture.

Companies that make high-end CPUs spend staggering amounts of money on process technology just to improve single-threaded performance. Between them, IBM and Intel have invested tens of billions of dollars on R&D for process technology and transistor design. The results are impressive when measured in gigahertz, but less so from the perspective of GFLOPS per dollar or per watt.

Processor microarchitecture also contributes to performance. Within the PC and server markets, the extremes of microarchitectural optimization are represented by two classes of CPU design: relatively simple dual-issue cores and more complex multi-issue cores.
Dual-issue CPUs

The simplest CPU microarchitecture used in the PC market today is the dual-issue superscalar core. Such designs can execute up to two operations in each clock cycle, sometimes with special "pairing rules" that define which instructions can be executed together. For example, some early dual-issue CPUs could issue two simple integer operations at the same time, or one integer and one floating-point operation, but not two floating-point operations.
Dual-issue cores generally process instructions in program order. They deliver improved performance by exploiting the natural instruction-level parallelism (ILP) in most programs. The amount of available ILP varies from one program to another, but there's almost always enough to take advantage of a second pipeline.

Intel's Atom processor is a good example of a fully evolved dual-issue processor. Like other advanced x86 chips, Atom translates x86 instructions into internal "micro-ops" that are more like the instructions in old RISC (reduced instruction set computing) processors. In Atom, each micro-op can typically perform one ALU operation plus one or more supporting operations such as a memory load or store.

Dual-issue processors like Atom usually occupy the low end of the market, where cost-efficiency is paramount. For this reason, Atom has fewer performance-oriented optimizations than more expensive Intel chips. Atom executes in order, with no speculative execution. Much of the new engineering work in Atom went into improving its power efficiency when not operating at full speed.

Atom has six execution pipelines (two for floating-point operations, two for integer operations, and two for address calculations; the latter are common in the x86 architecture because instruction operands can specify memory locations). Only two instructions, however, can be issued to these pipelines in a single clock period. This low utilization means that some execution units will always go unused in each cycle.

Like any x86 processor, a large part of Atom is dedicated to instruction caching, decoding (in this case, translating to micro-ops), and a microcode store to implement the more complex x86 instructions. It also supports Atom's two-way simultaneous multithreading (SMT) feature. This circuitry, which Intel calls the "front-end cluster," occupies more die area than the chip's floating-point unit.

SMT is basically a way to work around cases that further limit utilization of the execution units. Sometimes a single thread is stalled waiting for data from the cache, or has multiple instructions pending for a single pipeline. In these cases, the second thread may be able to issue an instruction or two. The net performance benefit is usually low, only 10%–20% on some applications, but SMT adds only a few percent to the size of the chip.
As a result, the Atom core is suitable for low-end consumer systems, but provides very low net performance, well below what is available from other Intel processors.
Intel's Larrabee

Larrabee is Intel's code name for a future graphics-processing architecture based on the x86 architecture. The first Larrabee chip is said to use dual-issue cores derived from the original Pentium design, but modified to include support for 64-bit x86 operations and a new 512-bit vector-processing unit.

Apart from the vector unit, the Larrabee core is simpler than Atom's. It doesn't support Intel's MMX or SSE extensions, instead relying solely on the new vector unit, which has its own new instructions. The vector unit is wide enough to perform 16 single-precision FP operations per clock, and also provides double-precision FP support at a lower rate.

Several features in Larrabee's vector unit are new to the x86 architecture, including scatter-gather loads and stores (forming a vector from 16 different locations in memory — a convenient feature, though one that must be used judiciously), fused multiply-add, predicated execution, and three-operand floating-point instructions.

Larrabee also supports four-way multithreading, but not in the same way as Atom. Where Atom can simultaneously execute instructions from two threads (hence the SMT name), Larrabee simply maintains the state of multiple threads to speed the process of switching to a new thread when the current thread stalls.

Larrabee's x86 compatibility reduces its performance and efficiency without delivering much benefit for graphics. As with Atom, a significant (if not huge) part of the Larrabee die area and power budget will be consumed by instruction decoders. As a graphics chip, Larrabee will be impaired by its lack of optimized fixed-function logic for rasterization, interpolation, and alpha blending. Lacking cost-effective performance for 3D games, it will be difficult for Larrabee to achieve the kind of sales volumes and profit margins Intel expects of its major product lines.

Larrabee will be Intel's second attempt to enter the PC graphics-chip market, after the i740 program of 1998, which was commercially unsuccessful but laid the foundation for Intel's later integrated-graphics chipsets. (Intel made an even earlier run at the video-controller business with the i750, and before that, the company's i860 RISC processor was used as a graphics accelerator in some workstations.)
Intel's Nehalem microarchitecture

Nehalem is the most sophisticated microarchitecture in any x86 processor. Its features are like a laundry list of high-performance CPU design: four-wide superscalar, out-of-order, speculative execution, simultaneous multithreading, multiple branch predictors, on-die power gating, on-die memory controllers, large caches, and multiple interprocessor interconnects. Figure 2 shows the Nehalem microarchitecture.
Figure 2. The Nehalem core includes multiple x86 instruction decoders, queues, reordering buffers, and six execution pipelines to support speculative out-of-order multithreaded execution. (Source: "File:Intel Nehalem arch.svg." Wikimedia Commons)
Four instruction decoders are provided in each Nehalem core; these run in parallel wherever possible, though only one can decode complex x86 instructions, which are relatively infrequent. The micro-ops generated by the decoders are queued and dispatched out of order through six ports to 12 computational execution units. There is also one load unit and two store units for data and address values.

Nehalem's 128-bit SIMD floating-point units are similar to those found on previous-generation Intel processors: one for FMUL and FDIV (floating-point multiply and divide), one for FADD, and one for FP shuffle operations. The "shuffle" unit is used to rearrange data values within the SIMD registers, and does not contribute to the performance of multiply-add-intensive algorithms.

The peak single-precision floating-point performance of a four-core Nehalem processor (not counting the shuffle unit) can be calculated as:

4 cores * 2 SIMD ops/clock * 4 values/op * clock rate

Also, while Nehalem processors provide 32 GB/s of peak DRAM bandwidth — a commendable figure for a PC processor — this figure represents a little less than one byte of DRAM I/O for every three floating-point operations. As a result, many high-performance computing applications will be bottlenecked by DRAM performance before they saturate the chip's floating-point ALUs.

The Xeon W5590 is Intel's high-end quad-core workstation processor based on the Nehalem-EP "Gainestown" chip. The W5590 is priced at $1,600 each when purchased in 1,000-unit quantities (as of August 2009).

At its 3.33 GHz clock, the W5590 delivers a peak single-precision floating-point rate of 106.56 GFLOPS. The W5590 has a 130 W thermal design power (TDP) rating, or 1.22 watts/GFLOPS — not including the necessary core-logic chipset.
Nehalem has been optimized for single-threaded performance and clock speed at the expense of sustained throughput. This is a desirable tradeoff for a chip intended to be a market-leading PC desktop and server processor, but it makes Nehalem an expensive, power-hungry choice for high-performance computing.
"The Wall"

The market demands general-purpose processors that deliver high single-threaded performance as well as multicore throughput for a wide variety of workloads on client, server, and high-performance computing (HPC) systems. This pressure has given us almost three decades of progress toward higher complexity and higher clock rates.

This progress hasn't always been steady. Intel cancelled its "Tejas" processor, which was rumored to have a 40-stage pipeline, and later killed off the entire Pentium 4 "NetBurst" product family because of its relative inefficiency. The Pentium 4 ultimately reached a clock rate of 3.8 GHz in the 2004 "Prescott" model, a speed that Intel has been unable to match since.

In the more recent Core 2 (Conroe/Penryn) and Core i7 (Nehalem) processors, Intel uses increased complexity to deliver substantial performance improvements over the Pentium 4 line, but the pace of these improvements is slowing. Each new generation of process technology requires ever more heroic measures to improve transistor characteristics; each new core microarchitecture must work disproportionately harder to find and exploit instruction-level parallelism (ILP).

As these challenges became more apparent in the 1990s, CPU architects began referring to the "power wall," the "memory wall," and the "ILP wall" as obstacles to the kind of rapid progress seen up until that time. It may be better to think of these issues as mountains rather than walls — mountains that begin as mild slopes and become steeper with each step, making further progress increasingly difficult.

Nevertheless, the inexorable advance of process technology provided CPU designers with more transistors in each generation. By 2005, the competitive pressure to use these additional transistors to deliver improved performance (at the chip level, if not at the core level) drove AMD and Intel to introduce dual-core processors. Since then, the primary focus of PC processor design has been continuing to increase the core count on these chips.

That approach, however, has reached a point of diminishing returns. Dual-core CPUs provide noticeable benefits for most PC users, but are rarely fully utilized except when working with multimedia content or multiple performance-hungry applications. Quad-core CPUs are only a slight improvement, most of the time. By 2010, there will be eight-core CPUs in desktops, but it will likely be difficult to sell most customers on the value of the additional cores. Selling further increases will be even more problematic.
Once the increase in core count stalls, the focus will return to single-threaded performance, but with all the low-hanging fruit long gone, further improvements will be hard to find. In the near term, AMD and Intel are expected to emphasize vector floating-point improvements with the forthcoming Advanced Vector Extensions (AVX). Like SSE, AVX's primary value will be for applications where vectorizable floating-point computations need to be closely coupled with the kind of control-flow code for which modern x86 processors have been optimized.

CPU core design will continue to progress. There will continue to be further improvements in process technology, faster memory interfaces, and wider superscalar cores. But about ten years ago, NVIDIA's processor architects realized that CPUs were no longer the preferred solution for certain problems, and started from a clean sheet of paper to create a better answer.
The History of the GPU

It's one thing to recognize the future potential of a new processing architecture. It's another to build a market before that potential can be achieved. There were attempts to build chip-scale parallel processors in the 1990s, but the limited transistor budgets in those days favored more sophisticated single-core designs.

The real path toward GPU computing began, not with GPUs, but with non-programmable 3D-graphics accelerators. Multi-chip 3D rendering engines were developed by multiple companies starting in the 1980s, but by the mid-1990s it became possible to integrate all the essential elements onto a single chip. From 1994 to 2001, these chips progressed from the simplest pixel-drawing functions to implementing the full 3D pipeline: transforms, lighting, rasterization, texturing, depth testing, and display.

NVIDIA's GeForce 3 in 2001 introduced programmable pixel shading to the consumer market. The programmability of this chip was very limited, but later GeForce products became more flexible and faster, adding separate programmable engines for vertex and geometry shading. This evolution culminated in the GeForce 7800, shown in Figure 3.
Figure 3. The GeForce 7800 had three kinds of programmable engines for different stages of the 3D pipeline plus several additional stages of configurable and fixed-function logic. (Source: NVIDIA)
So-called general-purpose GPU (GPGPU) programming evolved as a way to perform non-graphics processing on these graphics-optimized architectures, typically by running carefully crafted shader code against data presented as vertex or texture information and retrieving the results from a later stage in the pipeline. Though sometimes awkward, GPGPU programming showed great promise.

Managing three different programmable engines in a single 3D pipeline led to unpredictable bottlenecks; too much effort went into balancing the throughput of each stage. In 2006, NVIDIA introduced the GeForce 8800, as Figure 4 shows. This design featured a "unified shader architecture" with 128 processing elements distributed among eight shader cores. Each shader core could be assigned to any shader task, eliminating the need for stage-by-stage balancing and greatly improving overall performance.
Figure 4. The GeForce 8800 introduced a unified shader architecture with just one kind of programmable processing element that could be used for multiple purposes. Some simple graphics operations still used special-purpose logic. (Source: NVIDIA)
The 8800 also introduced CUDA, the industry's first C-based development environment for GPUs. (CUDA originally stood for "Compute Unified Device Architecture," but the longer name is no longer spelled out.) CUDA delivered an easier and more effective programming model than earlier GPGPU approaches.
To bring the advantages of the 8800 architecture and CUDA to new markets such as HPC, NVIDIA introduced the Tesla product line. Current Tesla products use the more recent GT200 architecture.

The Tesla line begins with PCI Express add-in boards — essentially graphics cards without display outputs — and with drivers optimized for GPU computing instead of 3D rendering. With Tesla, programmers don't have to worry about making tasks look like graphics operations; the GPU can be treated like a many-core processor.

Unlike the early attempts at chip-scale multiprocessing back in the '90s, Tesla was a high-volume hardware platform right from the beginning. This is due in part to NVIDIA's strategy of supporting the CUDA software development platform on the company's GeForce and Quadro products, making it available to a much wider audience of developers. NVIDIA says it has shipped over 100 million CUDA-capable chips.

At the time of this writing, the price for the entry-level Tesla C1060 add-in board is under $1,500 from some Internet mail-order vendors. That's lower than the price of a single Intel Xeon W5590 processor — and the Tesla card has a peak GFLOPS rating more than eight times higher than the Xeon processor.

The Tesla line also includes the S1070, a 1U-height rackmount server that includes four GT200-series GPUs running at a higher speed than that in the C1060 (up to 1.5 GHz core clock vs. 1.3 GHz), so the S1070's peak performance is over 4.6 times higher than a single C1060 card. The S1070 connects to a separate host computer via a PCI Express add-in card.
This widespread availability of high-performance hardware provides a natural draw for software developers. Just as the high-volume x86 architecture attracts more developers than the IA-64 architecture of Intel's Itanium processors, the high sales volumes of GPUs — although driven primarily by the gaming market — make GPUs more attractive for developers of high-performance computing applications than dedicated supercomputers from companies like Cray, Fujitsu, IBM, NEC, and SGI.

Although GPU computing is only a few years old now, it's likely there are already more programmers with direct GPU computing experience than have ever used a Cray. Academic support for GPU computing is also growing quickly. NVIDIA says over 200 colleges and universities are teaching classes in CUDA programming; the availability of OpenCL (such as in the new "Snow Leopard" version of Apple's Mac OS X) will drive that number even higher.
Introducing Fermi

GPU computing isn't meant to replace CPU computing. Each approach has advantages for certain kinds of software. As explained earlier, CPUs are optimized for applications where most of the work is being done by a limited number of threads, especially where the threads exhibit high data locality, a mix of different operations, and a high percentage of conditional branches.

GPU design aims at the other end of the spectrum: applications with multiple threads that are dominated by longer sequences of computational instructions. Over the last few years, GPUs have become much better at thread handling, data caching, virtual memory management, flow control, and other CPU-like features, but the distinction between computationally intensive software and control-flow-intensive software is fundamental.

The state of the art in GPU design is represented by NVIDIA's next-generation CUDA architecture, code-named Fermi. Figure 5 shows a high-level block diagram of the first Fermi chip.
Figure 5. NVIDIA's Fermi GPU architecture consists of multiple streaming multiprocessors (SMs), each consisting of 32 cores, each of which can execute one floating-point or integer instruction per clock. The SMs are supported by a second-level cache, host interface, GigaThread scheduler, and multiple DRAM interfaces. (Source: NVIDIA)
At this level of abstraction, the GPU looks like a sea of computational units with only a few support elements — an illustration of the key GPU design goal, which is to maximize floating-point throughput.

Since most of the circuitry within each core is dedicated to computation, rather than to speculative features meant to enhance single-threaded performance, most of the die area and power consumed by Fermi goes into the application's actual algorithmic work.
The Programming Model

The complexity of the Fermi architecture is managed by a multi-level programming model that allows software developers to focus on algorithm design rather than the details of how to map the algorithm to the hardware, thus improving productivity. This is a concern that conventional CPUs have yet to address, because their structures are simple and regular: a small number of cores presented as logical peers on a virtual bus.

In NVIDIA's CUDA software platform, as well as in the industry-standard OpenCL framework, the computational elements of algorithms are known as kernels (a term here adapted from its use in signal processing rather than from operating systems). An application or library function may consist of one or more kernels.

Kernels can be written in the C language (specifically, the ANSI-standard C99 dialect) extended with additional keywords to express parallelism directly rather than through the usual looping constructs.

Once compiled, kernels consist of many threads that execute the same program in parallel: one thread is like one iteration of a loop. In an image-processing algorithm, for example, one thread may operate on one pixel, while all the threads together — the kernel — may operate on a whole image.

Multiple threads are grouped into thread blocks containing up to 1,536 threads. All of the threads in a thread block will run on a single SM, so within the thread block, threads can cooperate and share memory. Thread blocks can coordinate the use of global shared memory among themselves but may execute in any order, concurrently or sequentially.

Thread blocks are divided into warps of 32 threads. The warp is the fundamental unit of dispatch within a single SM. In Fermi, two warps from different thread blocks (even different kernels) can be issued and executed concurrently, increasing hardware utilization and energy efficiency.
Thread blocks are grouped into grids, each of which executes a unique kernel.

Thread blocks and threads each have identifiers (IDs) that specify their relationship to the kernel. These IDs are used within each thread as indexes to their respective input and output data, shared memory locations, and so on.
At any one time, the entire Fermi device is dedicated to a single application. As mentioned above, an application may include multiple kernels. Fermi supports simultaneous execution of multiple kernels from the same application, each kernel being distributed to one or more SMs on the device. This capability avoids the situation where a kernel is only able to use part of the device and the rest goes unused.

Switching from one application to another is about 20 times faster on Fermi (just 25 microseconds) than on previous-generation GPUs. This time is short enough that a Fermi GPU can still maintain high utilization even when running multiple applications, like a mix of compute code and graphics code. Efficient multitasking is important for consumers (e.g., for video games using physics-based effects) and professional users (who often need to run computationally intensive simulations and simultaneously visualize the results).

This switching is managed by the chip-level GigaThread hardware thread scheduler, which manages 1,536 simultaneously active threads for each streaming multiprocessor across 16 kernels.

This centralized scheduler is another point of departure from conventional CPU design. In a multicore or multiprocessor server, no one CPU is "in charge". All tasks, including the operating system's kernel itself, may be run on any available CPU. This approach allows each operating system to follow a different philosophy in kernel design, from large monolithic kernels like Linux's to the microkernel design of QNX and hybrid designs like Windows 7. But the generality of this approach is also its weakness, because it requires complex CPUs to spend time and energy performing functions that could also be handled by much simpler hardware.

With Fermi, the intended applications, the principles of stream processing, and the kernel and thread model were all known in advance, so a more efficient scheduling method could be implemented in the GigaThread engine.
In addition to C-language support, Fermi can also accelerate all the same languages as the GT200, including FORTRAN (with independent solutions from The Portland Group and NOAA, the National Oceanic and Atmospheric Administration), Java, Matlab, and Python. Supported software platforms include NVIDIA's own CUDA development environment, the OpenCL standard managed by the Khronos Group, and Microsoft's DirectCompute API.
The Portland Group (PGI) supports two ways to use GPUs to accelerate FORTRAN programs: the PGI Accelerator programming model, in which regions of code within a FORTRAN program can be offloaded to a GPU, and CUDA FORTRAN, which allows the programmer direct control over the operation of attached GPUs, including managing local and shared memory, thread synchronization, and so on. NOAA provides a language translator that converts FORTRAN code into CUDA C.

Fermi brings an important new capability to the market with new instruction-level support for C++, including instructions for C++ virtual functions, function pointers, dynamic object allocation, and the C++ exception-handling operations "try" and "catch". The popularity of the C++ language, previously unsupported on GPUs, will make GPU computing more widely available than ever.
The Streaming Multiprocessor

Fermi's streaming multiprocessors, shown in Figure 6, comprise 32 cores, each of which can perform floating-point and integer operations, along with 16 load/store units for memory operations, four special-function units, and 64K of local SRAM split between cache and local memory.
Figure 6. Each Fermi SM includes 32 cores, 16 load/store units, four special-function units, a 4K-word register file, 64K of configurable RAM, and thread control logic. Each core has both floating-point and integer execution units. (Source: NVIDIA)
Floating-point operations follow the IEEE 754-2008 floating-point standard. Each core can perform one single-precision fused multiply-add operation in each clock period and one double-precision FMA in two clock periods. At the chip level, Fermi performs more than 8× as many double-precision operations per clock as the previous GT200 generation, where double-precision processing was handled by a dedicated unit per SM with much lower throughput.

IEEE floating-point compliance includes all four rounding modes, and subnormal numbers (numbers closer to zero than a normalized format can represent) are handled correctly by the Fermi hardware rather than being flushed to zero or requiring additional processing in a software exception handler.
Fermi's support for fused multiply-add (FMA) also follows the IEEE 754-2008 standard, improving the accuracy of the commonly used multiply-add sequence by not rounding off the intermediate result, as otherwise happens between the multiply and add operations. In Fermi, this intermediate result carries a full 106-bit mantissa; in fact, 161 bits of precision are maintained during the add operation to handle worst-case denormalized numbers before the final double-precision result is computed. The GT200 supported FMA for double-precision operations only; Fermi brings the benefits of FMA to single precision as well.
FMA support also increases the accuracy and performance of other mathematical operations, such as division and square root, and more complex functions such as extended-precision arithmetic, interval arithmetic, and linear algebra.

The integer ALU supports the usual mathematical and logical operations, including multiplication, on both 32-bit and 64-bit values.

Memory operations are handled by a set of 16 load/store units in each SM. The load/store instructions can now refer to memory in terms of two-dimensional arrays, providing addresses in terms of x and y values. Data can be converted from one format to another (for example, from integer to floating point or vice versa) as it passes between DRAM and the core registers at the full rate. These formatting and converting features are further examples of optimizations unique to GPUs — not worthwhile in general-purpose CPUs, but here they will be used sufficiently often to justify their inclusion.

A set of four Special Function Units (SFUs) is also available to handle transcendental and other special operations such as sin, cos, exp, and rcp (reciprocal). Four of these operations can be issued per cycle in each SM.

Within the SM, cores are divided into two execution blocks of 16 cores each. Along with the group of 16 load/store units and the four SFUs, there are four execution blocks per SM. In each cycle, a total of 32 instructions can be dispatched from one or two warps to these blocks. It takes two cycles for the 32 instructions in each warp to execute on the cores or load/store units. A warp of 32 special-function instructions is issued in a single cycle but takes eight cycles to complete on the four SFUs. Figure 7 shows a sequence of instructions being distributed among the available execution blocks.
Figure 7. A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of 16 load/store units. This figure shows how instructions are issued to the execution blocks. (Source: NVIDIA)
ISA improvements

Fermi debuts the Parallel Thread eXecution (PTX) 2.0 instruction-set architecture (ISA). PTX 2.0 defines an instruction set and a new virtual machine architecture that amounts to an idealized processor designed for parallel thread operation.

Because this virtual machine model doesn't literally model the Fermi hardware, it can be portable from one generation to the next. NVIDIA intends PTX 2.0 to span multiple generations of GPU hardware and multiple GPU sizes within each generation, just as PTX 1.0 did.

Compilers supporting NVIDIA GPUs provide PTX-compliant binaries that act as a hardware-neutral distribution format for GPU computing applications and middleware. When applications are installed on a target machine, the GPU driver translates the PTX binaries into the low-level machine instructions that are directly executed by the hardware. (PTX 1.0 binaries can also be translated by Fermi GPU drivers into native instructions.)
This final translation step imposes no further performance penalties. Kernels and libraries for even the most performance-sensitive applications can be hand-coded to the PTX 2.0 ISA, making them portable across GPU generations and implementations.

All of the architecturally visible improvements in Fermi are represented in PTX 2.0. Predication is one of the more significant enhancements in the new ISA.

All instructions support predication. Each instruction can be executed or skipped based on condition codes. Predication allows each thread — each core — to perform different operations as needed while execution continues at full speed. Where predication isn't sufficient, Fermi also supports the usual if-then-else structure with branch statements.

Most CPUs rely exclusively on conditional branches and incorporate branch-prediction hardware to allow speculation along the likely path. That's a reasonable solution for branch-intensive serial code, but less efficient than predication for streaming applications.
Another major improvement in Fermi and PTX 2.0 is a new unified addressing model. All addresses in the GPU are allocated from a continuous 40-bit (one terabyte) address space. Global, shared, and local addresses are defined as ranges within this address space and can be accessed by common load/store instructions. (The load/store instructions support 64-bit addresses to allow for future growth.)
The Cache and Memory Hierarchy
Like earlier GPUs, the Fermi architecture provides for local memory in each SM. New to Fermi is the ability to use some of this local memory as a first-level (L1) cache for global memory references. The local memory is 64K in size, and can be split 16K/48K or 48K/16K between L1 cache and shared memory.
Shared memory, the traditional use for local SM memory, provides low-latency access to moderate amounts of data (such as intermediate results in a series of calculations, one row or column of data for matrix operations, a line of video, etc.). Because the access latency to this memory is also completely predictable, algorithms can be written to interleave loads, calculations, and stores with maximum efficiency.
The decision to allocate 16K or 48K of the local memory as cache usually depends on two factors: how much shared memory is needed, and how predictable the kernel's accesses to global memory (usually the off-chip DRAM) are likely to be.
A larger shared-memory requirement argues for less cache; more frequent or unpredictable accesses to larger regions of DRAM argue for more cache.
Some embedded processors support local memory in a similar way, but this feature is almost never available on a PC or server processor because mainstream operating systems have no way to manage local memory; there is no support for it in their programming models. This is one of the reasons why high-performance computing applications running on general-purpose processors are so frequently bottlenecked by memory bandwidth; the application has no way to manage where memory is allocated, and algorithms can't be fully optimized for access latency.
Each Fermi GPU is also equipped with an L2 cache (768KB in size for a 512-core chip). The L2 cache covers GPU local DRAM as well as system memory.
The L2 cache subsystem also implements another feature not found on CPUs: a set of memory read-modify-write operations that are atomic (that is, uninterruptible) and thus ideal for managing access to data that must be shared across thread blocks or even kernels. Normally this functionality is provided through a two-step process: a CPU uses an atomic test-and-set instruction to manage a semaphore, and the semaphore manages access to a predefined location or region in memory.
Fermi can implement that same solution when needed, but it's much simpler from the software perspective to be able to issue a standard integer ALU operation that performs the atomic operation directly rather than having to wait until a semaphore becomes available.
Fermi's atomic operations are implemented by a set of integer ALUs that can logically lock access to a single memory address while the read-modify-write sequence is completed. This memory address can be in system memory, in the GPU's locally connected DRAM, or even in the memory spaces of other PCI Express-connected devices. During the brief lock interval, the rest of memory continues to operate normally. Locks in system memory are atomic with respect to the operations of the GPU performing the atomic operation; software synchronization is ordinarily used to assign regions of memory to GPU control, thus avoiding conflicting writes from the CPU or other devices.
Consider a kernel designed to calculate a histogram for an image, where the histogram consists of one counter for each brightness level in the image. A CPU might loop through the whole image and increment the appropriate counter value based on the brightness of each pixel. A GPU without atomic operations might assign one SM to each part of the image, let them run until they're all done (by imposing a synchronization barrier), and then run a second short program to add up all the results.
With the atomic operations in Fermi, once the regional histograms are computed, those results can be combined into the final histogram using atomic add operations; no second pass is required.
Similar improvements can be made in other applications such as ray tracing, pattern recognition, and linear algebra routines such as matrix multiplication as implemented in the commonly used Basic Linear Algebra Subprograms (BLAS). According to NVIDIA, atomic operations on Fermi are 5× to 20× faster than on previous GPUs using conventional synchronization methods.
The final stage of the local memory hierarchy is the GPU's directly connected DRAM. Fermi provides six 64-bit DRAM channels that support SDDR3 and GDDR5 DRAMs. Up to 6GB of GDDR5 DRAM can be connected to the chip for a significant boost in capacity and bandwidth over NVIDIA's previous products.
Fermi is the first GPU to provide ECC (error correcting code) protection for DRAM; the chip's register files, shared memories, and L1 and L2 caches are also ECC protected. The level of protection is known as SECDED: single (bit) error correction, double error detection. SECDED is the usual level of protection in most ECC-equipped systems.
Fermi's ECC protection for DRAM is unique among GPUs; so is its implementation. Instead of each 64-bit memory channel carrying eight extra bits for ECC information, NVIDIA has a proprietary (and undisclosed) solution for packing the ECC bits into reserved lines of memory.
The GigaThread controller that manages application context switching (described earlier) also provides a pair of streaming data-transfer engines, each of which can fully saturate Fermi's PCI Express host interface. Typically, one will be used to move data from system memory to GPU memory when setting up a GPU computation, while the other will be used to move result data from GPU memory to system memory.
Conclusions
In just a few years, NVIDIA has advanced the state of the art in GPU design from almost purely graphics-focused products like the 7800 series to the flexible Fermi architecture.
Fermi is still derived from NVIDIA's graphics products, which ensures that NVIDIA will sell millions of software-compatible chips to PC gamers. In the PC market, Fermi's capacity for GPU computing will deliver substantial improvements in gameplay, multimedia encoding and enhancement, and other popular PC applications.
Those sales to PC users generate an indirect benefit for customers interested primarily in high-performance computing: Fermi's affordability and availability will be unmatched by any other computing architecture in its performance range.
Although it may sometimes appear that GPUs are becoming more CPU-like with each new product generation, fundamental technical differences among computing applications lead to lasting differences in how CPUs and GPUs are designed and used.
CPUs will continue to be best for dynamic workloads marked by short sequences of computational operations and unpredictable control flow. Modern CPUs, which devote large portions of their silicon real estate to caches, large branch predictors, and complex instruction set decoders, will always be optimized for this kind of code.
At the other extreme, workloads that are dominated by computational work performed within a simpler control flow need a different kind of processor architecture, one optimized for streaming calculations but also equipped with the ability to support popular programming languages. Fermi is just such an architecture.
Fermi is the first computing architecture to deliver such a high level of double-precision floating-point performance from a single chip with a flexible, error-protected memory hierarchy and support for languages including C++ and FORTRAN. As such, Fermi is the world's first complete GPU computing architecture.