
The NUMAchine Multiprocessor

Z. Vranesic, S. Brown, M. Stumm, S. Caranci, A. Grbic, R. Grindley, M. Gusat, O. Krieger, G. Lemieux, K. Loveless, N. Manjikian, Z. Zilic,
T. Abdelrahman, B. Gamsa, P. Pereira, K. Sevcik, A. Elkateeb, S. Srbljic

Department of Electrical and Computer Engineering and Department of Computer Science
University of Toronto, Toronto, Ontario, Canada M5S 1A4

June 28, 1995

Abstract

NUMAchine is a cache-coherent shared-memory multiprocessor designed to have high performance, be cost-effective, modular, and easy to program for efficient parallel execution. Processors, caches, and memory are distributed across a number of stations interconnected by a hierarchy of unidirectional bit-parallel rings. The simplicity of the interconnection network permits the use of wide datapaths at each node, and a novel scheme for routing packets between stations enables high-speed operation of the rings in order to reduce latency. The ring hierarchy provides useful features, such as efficient multicasting and order-preserving message transfers, which are exploited by the cache coherence protocol for low-latency invalidation of shared data. The hardware is designed so that cache coherence traffic is restricted to localized sections of the machine whenever possible. NUMAchine is optimized for applications with good locality, and system software is designed to maximize locality. Results from detailed behavioral simulations to evaluate architectural tradeoffs indicate that a prototype implementation will perform well for a variety of parallel applications.

1 Introduction

Multiprocessors have existed for many years, but they have not achieved the level of success that many experts initially felt would be reached. The lack of stronger acceptance of multiprocessors is due in part to the following reasons: (1) an over-reliance on custom hardware solutions, making it difficult to track the rapid improvements in mainstream workstation technology, (2) a focus on scalability to thousands of processors, involving considerable up-front costs that preclude reasonably-priced small configurations, and (3) a lack of adequate system software, impeding development of application programs that can exploit the performance potential of the machines. These three factors have influenced our approach to multiprocessor architecture, as discussed below.

Multiprocessor systems designed using workstation technology can provide large computing capability at a reasonable cost. Future demand is likely to be the greatest for machines that give good performance and are modular, cost-effective, scalable to a reasonable size, and easy to use efficiently. A key requirement is that a multiprocessor system be viable and affordable in a relatively small configuration, which precludes a large up-front cost. However, it also must be easy to expand the system, necessitating a modular design. While scalability is an important issue, and has strongly influenced research in recent years, it is apparent that demand for huge machines (with thousands of processors) will continue to be low. Commercial interest is likely to be concentrated on designs that are scalable in the range of hundreds of processors.

From a user's perspective, it is desirable that a machine provide high performance and be easy to use, requiring little effort to structure programs for parallel execution. One way to facilitate ease of use is to provide a shared memory programming model with a single flat address space for all processors. This allows parallel programs to communicate by normal memory reads and writes, as opposed to communicating based on software message passing with its attendant overhead. In addition, by providing hardware-based cache coherence for the shared memory, the task of developing parallel programs is simplified, both because programmers are given a familiar abstraction for accessing memory, and because it is simpler to create compilers that can automatically parallelize programs.

In order for multiprocessor technology to reach a much greater level of commercial success than it presently holds, it is crucial that system software for multiprocessors evolve considerably beyond the current state of the art. In order for this to occur, it is necessary that multiprocessor machines become available for use as software research platforms. Such a machine should provide a large degree of flexibility, allowing software to control the hardware resources available in the machine.

This report presents the architecture of the NUMAchine multiprocessor and describes a 64-processor prototype that is being constructed. This hardware is part of a larger NUMAchine project that includes development of a new operating system, parallelizing compilers, a number of tools for aiding in correctness and parallel performance debugging, and a large set of applications. (Reports describing various aspects of the NUMAchine project can be obtained via the WWW.) The overall objectives of the NUMAchine project are to design a multiprocessor system that meets the criteria discussed above and is scalable in the range of hundreds of processors.

The NUMAchine architecture has many interesting features, the most important of which are listed below:

• Simplicity of interconnection network. NUMAchine has a hierarchical structure, with bus-based stations (containing processors and memory) interconnected by bit-parallel rings. The point-to-point nature of ring-based interconnection provides high-bandwidth communication paths that do not require complex wiring. Moreover, because of the simplicity of this structure, NUMAchine offers excellent modularity and ease of expandability.

• Cache coherence. A hardware-supported cache coherence scheme is used, which is efficient, inexpensive to implement, and has little or no impact on the time required by a memory module to service read and write requests. The protocol exploits the ordering and multicast properties of the interconnection network to optimize performance. It is scalable in that the number of state bits required per cache line grows only logarithmically with the number of processors in the system. This means that the cost of modularity is low: sufficient SRAM to allow for large configurations (of several hundred processors) involves a negligible extra cost compared to small configurations.

• Natural multicast mechanism. The ring hierarchy and a novel routing scheme allow efficient broadcasting. This is fully exploited by our coherence scheme and can also be exploited by software in several ways.

• Hardware support for enhancing locality. Each station includes a large network cache for data from remote memory modules. The network cache provides a number of attractive advantages that reduce the latency of accesses to remote memory, including: 1) serving as a shared tertiary cache for the processors on the station, 2) obviating the need for snooping at stations, 3) combining multiple requests for the same cache line, and 4) serving as a target for broadcasts.


• Extensive monitoring support. NUMAchine includes non-intrusive monitoring hardware in all of its main subsystems. Collected data is available to a running application so that adaptive run-time decisions can be made based upon the observed state of the machine.

• Flexibility. The NUMAchine hardware exposes many of its low-level capabilities to software, allowing system software to control low-level functions of the machine. This facilitates experimentation into the interaction between software and hardware in the machine. As an example, software has the ability to bypass the hardware cache coherence mechanisms to take advantage of application-specific semantics and reduce coherence overhead. We have found that the hardware support needed to enable such software control is simple, and does not adversely affect the speed of common-case operations.

The overall NUMAchine project is still in an early phase. The hardware for an initial prototype using MIPS R4400 processors is currently being fabricated. A detailed behavioral simulator is being used to evaluate architectural tradeoffs and the expected performance for a prototype implementation. The final version of the prototype system, targeted for completion in 1996, will consist of 64 processors, connected by a two-level hierarchy of rings. Initial implementations of much of the system software for NUMAchine have been developed on hardware simulators and existing multiprocessor platforms.

The rest of this document provides more details on the NUMAchine architecture (and prototype) and is organized as follows: Section 2 provides an overview of the NUMAchine architecture, Section 3 describes the prototype design in more detail, Section 4 presents the results of simulations to evaluate the architecture for a variety of parallel applications, Section 5 refers to some examples of related work, and Section 6 concludes.

2 Architectural Overview

NUMAchine is a shared memory multiprocessor with the memory distributed across the stations. A flat physical addressing scheme is used, with a specific address range assigned to each station. All processors access all memory locations in the same manner. The time needed by a processor to access a given memory location depends upon the distance between the processor and the memory. Thus, the architecture is of the NUMA (Non-Uniform Memory Access) type.

NUMAchine uses a ring-based hierarchical interconnection network. At the lowest level of the hierarchy it has stations that contain several processors. The stations are interconnected by bit-parallel rings, as shown in Figure 1. For simplicity, the figure shows only two levels of rings: local rings connected by a central ring. Our prototype machine will have 4 processors in each station, 4 stations per local ring, and 4 local rings connected by a central ring.

The use of ring-based interconnection networks provides numerous advantages, including: (1) there is a unique path between any two points on the network, so that the ordering of packets is always maintained, (2) information can be sent from one point in the network to one or more other points, providing a natural multicast mechanism, and (3) a simple routing scheme can be used, allowing for high-speed operation of the rings. One of the key design features of NUMAchine is that the above strengths of ring-based networks are fully exploited to provide an efficient implementation of our cache coherence protocol, as described later. Finally, rings engender a modular design that minimizes the cost of small machine configurations, while allowing for relatively large systems.

The hierarchical structure in Figure 1 supports high throughput when communicating nodes lie within a localized part of the hierarchy, because many concurrent transfers can take place. Such is the case when there is a high degree of locality in data accesses, so that most transfers are within a station or between stations on the same local ring. The longest transfers traverse all levels of the hierarchy, but these transfer times are considerably shorter than if all stations were connected by a single ring. An obvious drawback of the hierarchical structure is its limited bisection bandwidth, which means that software that does not exhibit locality may perform poorly. While there are some applications in which locality is inherently low, we believe that with sufficient operating system, compiler, and program development support, data locality can be high for a large class of applications.

Figure 1: The NUMAchine hierarchy.

Figure 2: Station Organization.

Within each station, modules are interconnected by a single bus, as shown in Figure 2. A processor module contains a processor with an on-chip primary cache and an external secondary cache. Each memory module includes DRAM to store data and SRAM to hold status information about each cache line for use by the cache coherence protocol. The network cache is relatively large in size and, unlike the secondary caches, it uses DRAM to store data to allow for larger cache sizes at a reasonable cost. It also includes SRAM to store the tags and status information needed for cache coherence. The local ring interface contains buffers and circuitry needed to handle packets flowing between the station and the ring. The I/O module contains standard interfaces for connecting disks and other I/O devices.

The following subsections provide additional details on various aspects of the NUMAchine architecture, including the memory hierarchy, communications scheme, cache coherence protocol, and the procedure by which flow control is maintained and deadlock avoided in NUMAchine.

2.1 Memory Hierarchy

The NUMAchine memory hierarchy consists of four levels with respect to a processor within a station. The primary on-chip processor cache is the closest level, followed by the external secondary SRAM cache. The next level consists of the DRAM memory located in the same station. This includes the memory module(s) for the physical address range assigned to the station, and the station's network cache, which is used as a cache for data whose home memory is in a remote station. The final level in the memory hierarchy consists of all memory modules that are in remote stations.

Within each station, processor modules share a centralized memory via the station bus. This arrangement has the advantage of centralizing cache coherence mechanisms within a station, which simplifies the memory system design. Furthermore, separating the processors from the memory permits the processor technology to be improved without affecting the rest of the system. (Although the first working NUMAchine implementation will use the MIPS R4400 processor, this decoupling of the processors and memory permits switching to the MIPS R10000 processor when it becomes available.)

Each station's network cache serves two related purposes: it caches data whose home memory location is in a remote station, and it confines cache coherence operations (as much as possible, according to the coherence protocol) for the remote data so that they are localized within the station. In addition, the network cache reduces network traffic by serving as a target for multicasts of remote data, and by combining multiple outstanding requests from the station for the same remote cache line. For simplicity, in our prototype machine the network cache is direct-mapped. Its design does not enforce inclusion of the data cached in the station's processor caches, but the size of the network cache, which is at least as large as the combined processor secondary caches, implies that inclusion in the network cache will usually exist.

2.2 Communication Scheme

The NUMAchine rings connect a number of nodes with unidirectional links that operate synchronously using a slotted-ring protocol. Each slot carries one packet and advances from node to node every clock cycle. The ring interface at each node contains a bidirectional link to a station or to another ring. To place a packet onto the ring, the ring interface waits for an empty slot. After removing a packet from the ring, the ring interface sends an empty slot to the next node.

Packets are used to transfer requests and responses between stations. A single transfer may consist of one or more packets, and may be of several types: cached and uncached reads and writes, multicasts, block transfers, invalidation and intervention requests, interrupts, and negative acknowledgements. All data transfers that do not include the contents of a cache line or a block require only a single packet. Cache line and block transfers require multiple packets. Since these packets are not necessarily in consecutive slots, they are assigned an identifier to enable reassembling the cache lines or blocks at the destination station.

The routing of packets through the NUMAchine ring hierarchy begins and ends at stations in the lowest level of the ring hierarchy. The unidirectional ring topology guarantees a unique routing path between any two stations. Station addresses are specified in packets by means of routing masks. Each level in the hierarchy has a corresponding bit field in the routing mask, and the number of bits in each field corresponds to the number of links to the lower level. For example, a two-level system consisting of a central ring connected to 4 local rings, with each local ring connected to 4 stations, requires two 4-bit fields in the routing mask; one field specifies a particular ring, and the other field indicates a specific station on that ring. The routing of packets through the levels of the hierarchy is determined by setting bits in the appropriate fields of the routing mask. Since a single field is used for each level of the hierarchy, the number of bits needed for routing grows logarithmically with the size of the system. In addition to specifying the path of packets through the ring hierarchy, the routing masks are also used in maintaining status information needed for the cache coherence protocol; the routing bits identify the locations which may have a copy of each cache line. The small size of the routing mask limits the storage cost for this status information.

Figure 3: An example of an inexact routing mask.

When only one bit is set in each field of the routing mask, it uniquely identifies a single station for point-to-point communication. Multicast communication to more than one station is enabled by OR-ing bit masks for multiple destinations. As a result, more than one bit may be set in each field. Since a single field is used for each level, rather than individual fields for each ring at a given level, setting more than one bit per field may specify more stations than actually required. This is illustrated in Figure 3, which shows that when the bit masks that specify station 0 on ring 0 and station 1 on ring 1 are OR'd, then station 1 on ring 0 and station 0 on ring 1 will also be sent the message. The imprecise nature of the routing bits results in some packets being routed to more stations than necessary, but the extra traffic generated under normal conditions (i.e. where data locality exists) is small and represents a good tradeoff for the savings involved (the significance of the savings is in both the number of bits needed per packet and, more importantly, in the number of coherence status bits needed per cache line).
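As a concrete illustration, the following minimal sketch shows one way the routing masks for the two-level prototype (4 local rings, 4 stations per ring) could be represented and combined. The bit layout, type names, and helper function are our own assumptions for clarity; the actual hardware encoding may differ.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed encoding: low 4 bits select station(s) on a ring,
 * high 4 bits select local ring(s) on the central ring. */
typedef uint8_t routing_mask_t;

routing_mask_t make_mask(unsigned ring, unsigned station)
{
    return (routing_mask_t)((1u << (4 + ring)) | (1u << station));
}

int main(void)
{
    routing_mask_t a = make_mask(0, 0);   /* station 0 on ring 0 */
    routing_mask_t b = make_mask(1, 1);   /* station 1 on ring 1 */

    /* Multicast destination: OR the two masks. Because one field is
     * shared by all rings at a level, the result also selects station 1
     * on ring 0 and station 0 on ring 1, which is the overspecification
     * illustrated in Figure 3. */
    routing_mask_t multicast = a | b;
    printf("a=0x%02x b=0x%02x multicast=0x%02x\n",
           (unsigned)a, (unsigned)b, (unsigned)multicast);
    return 0;
}
```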

The rules for routing packets in the ring hierarchy using the routing mask are simple. An ascending packet has at least one bit set in the field corresponding to the next higher level, and ring interfaces to higher-level rings always switch these packets up to the next level. Once the highest level specified by the routing mask is reached, the packet must descend. At each ring interface connected to a lower level of the hierarchy, the packet may potentially be switched down to the lower level if the bit corresponding to the downward link is set to one in the routing mask. A copy of the packet may also be passed to the next ring interface at the same level if more than one bit is set in the same field. When a packet is switched downward to a lower level, all bits in the higher-level field are cleared to zero. The simplicity of this scheme permits a high-speed implementation, since only one field of the routing mask is involved in the routing decision at each ring interface.
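The per-interface decision can be sketched as follows for the two-level case; only the field belonging to the interface's own level is examined, which is what keeps the logic fast. The field widths, macro names, and function names are illustrative assumptions, not the actual hardware logic.

```c
#include <stdint.h>
#include <stdbool.h>

#define STATION_FIELD(m)  ((m) & 0x0Fu)        /* bits 3..0: stations on a ring */
#define RING_FIELD(m)     (((m) >> 4) & 0x0Fu) /* bits 7..4: local rings        */

/* Decision at a local-ring interface to the central ring: a packet
 * ascends if any bit is set in the next-higher field. */
bool should_ascend(uint8_t mask)
{
    return RING_FIELD(mask) != 0;
}

/* Decision at the central-ring interface attached to local ring `ring_id`:
 * switch a copy down if that ring's bit is set, clearing the higher-level
 * field in the descending copy, as described in the text. */
bool switch_down(uint8_t mask, unsigned ring_id, uint8_t *descending)
{
    if (RING_FIELD(mask) & (1u << ring_id)) {
        *descending = (uint8_t)STATION_FIELD(mask); /* clear the ring field */
        return true;
    }
    return false;   /* packet continues around the central ring */
}
```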

2.3 Cache Coherence

This section describes details of the NUMAchine cache coherence protocol. Since cache coherence is highly complex, we cannot describe all possible coherence operations here, but enough examples are presented to enable a knowledgeable reader to understand how NUMAchine's cache coherence operates.

NUMAchine's cache coherence scheme is optimized specifically for the NUMAchine architecture. In particular, it leverages the natural multicast mechanism available via the rings, it utilizes the feature that the rings provide a unique path from one node to any other node, and it is designed to localize coherence traffic to within a single level of the ring hierarchy whenever possible. The protocol is enforced simultaneously at two levels, as illustrated in Figure 4. Network-level coherence is enforced between the home memory module for a given cache line and the network caches in other stations holding copies of this cache line. Station-level coherence for a given cache line is enforced between the processor caches and the home memory on the same station, or between the processor caches and the network cache on the same station if the home memory of the cache line is in a remote station.

Figure 4: Two-level NUMAchine cache coherence protocol.

To maintain cache coherence at both the network and station levels, a hierarchical, two-level directory exists. The directory is stored in SRAM located in the memory modules and network caches. At the network level, the home memory maintains a full directory of routing masks for each cache line. The routing mask can identify a single station or multiple stations, as described in Section 2.2. In the directory, the routing mask indicates which stations may have copies of a cache line. At the station level, the directory consists of a simple bit mask, or processor mask, for each cache line. Since there is only a small number of processors per station, each processor has a dedicated bit in the processor mask. These bits indicate which processors on the station have a copy of the cache line. Processor masks that are stored in a memory module indicate which processors within the local station have a copy of a cache line. The processor masks for copies of cache lines on remote stations are maintained in their respective network caches.

In addition to the directory, the memory and network caches contain a valid/invalid (V/I) bit per cache line, which indicates whether the copy they have is valid. The network caches also contain a local/global (L/G) bit, which indicates whether the only valid copies of the cache line are on the local station. In the memory module, a separate L/G bit is not needed because this information is provided by the routing mask in the directory.

While three basic states (dirty, shared and invalid) are defined for the secondary cache in the standard way for write-back invalidate protocols, four basic states are defined for a cache line in a memory module or a network cache. The L/G and V/I bits are used to indicate the state of the cache line and can have the following meanings: local valid (LV), local invalid (LI), global valid (GV) and global invalid (GI). The LV and LI states indicate that valid copies of the cache line exist only on this station. In the LV state, the memory (or network cache) as well as the secondary caches indicated by the processor mask have a valid copy. In the LI state, only one of the local secondary caches has a copy (which would be dirty), and the particular cache is identified by the processor mask. In GV, the memory (or network cache) has a valid copy of the cache line, and it is being shared by several stations, indicated by the routing mask in the directory. The meaning of the GI state differs slightly for the memory module and for the network cache. In both cases, the GI state means that there is no valid copy on this station. However, the GI state in the memory module also indicates that there exists a remote network cache (identified by the routing mask) with a copy in the LV or LI state. Each of the basic states also has a locked version. The locked versions are used to prevent accesses to a cache line while the line is undergoing some transition. Any requests for a cache line in a locked state are negatively acknowledged, and the requester will try again.
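A plausible C rendering of a directory entry and of how the four basic states follow from the L/G and V/I bits is given below. The structure layout and field widths are assumptions made for clarity, not the actual SRAM format used by the hardware.

```c
#include <stdint.h>
#include <stdbool.h>

/* One directory entry per cache line (memory module or network cache). */
typedef struct {
    uint8_t routing_mask;   /* network level: stations that may hold copies  */
    uint8_t processor_mask; /* station level: local processors with copies   */
    bool    local;          /* L/G bit (network cache only; the memory       */
                            /* derives it from the routing mask)             */
    bool    valid;          /* V/I bit                                       */
    bool    locked;         /* line is undergoing a transition               */
} dir_entry_t;

typedef enum { LV, LI, GV, GI } line_state_t;

line_state_t line_state(const dir_entry_t *e)
{
    if (e->local)
        return e->valid ? LV : LI;
    else
        return e->valid ? GV : GI;
}

/* Requests that find e->locked set are negatively acknowledged and retried. */
```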

The NUMAchine cache coherence protocol employs a write-back/invalidate scheme at both levels. The protocol is illustrated using four basic examples: local write, local read, remote read and remote write. The first three of these examples illustrate basic operation of the protocol by indicating how the directories and states are manipulated. The fourth example provides additional details by showing some of the actual steps taken in the hardware. For readers who are interested in the entire protocol, full state transition diagrams for a cache line in memory and for a cache line in a network cache are given in Figures 5 and 6.

Figure 5: State Transition Tables for memory.

Figure 6: State Transition Tables for network cache.

Let us first consider a local write request by a processor on station Y, for a cache line whose home location is also on station Y. Let us assume that there are valid copies of the cache line on station Y and that the cache line is shared on another station, Z; therefore, the cache line is in the GV state in both the memory on station Y and the network cache on station Z. After the processor issues a write to memory on station Y, the memory controller will send an invalidate request to the remote station Z indicated by the routing mask, and to the local processors indicated by the processor mask in the directory. All the bits in the processor mask are reset except for the bit corresponding to the processor requesting the write. Also, the routing mask bits in the directory are set to indicate the local station. The new state of the cache line will be LI, indicating that the memory no longer has a valid copy, but that the copy is in one of the secondary caches on the local station.
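The following fragment restates this local-write example as code, reusing the dir_entry_t sketch shown earlier. The helpers send_invalidate() and station_bit() are hypothetical stand-ins for hardware actions (multicasting invalidations and mapping a station id to its routing-mask bit); this is a sketch of the described behavior, not the controller's actual logic.

```c
/* Hypothetical helpers standing in for hardware actions. */
extern void    send_invalidate(uint8_t station_mask, uint8_t processor_mask);
extern uint8_t station_bit(unsigned station_id);

/* Home memory controller handling a local exclusive write to a GV line. */
void local_write_to_gv_line(dir_entry_t *e, unsigned writer_id,
                            unsigned home_station)
{
    /* Invalidate remote copies named by the routing mask and local copies
     * named by the processor mask, except the writer's own copy. */
    send_invalidate((uint8_t)(e->routing_mask & ~station_bit(home_station)),
                    (uint8_t)(e->processor_mask & ~(1u << writer_id)));

    /* Only the writer's secondary cache now holds the (dirty) line. */
    e->processor_mask = (uint8_t)(1u << writer_id);
    e->routing_mask   = station_bit(home_station);
    e->local = true;
    e->valid = false;   /* the memory copy is no longer valid: new state LI */
}
```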

Upon receiving an invalidation packet, the remote network cache controller on station Z invalidates copies on the station according to its processor mask (if the cache line has been ejected from the NC, then the invalidation message is broadcast to all four processors), which is then cleared. The state of the cache line is set to GI, indicating that neither the network cache nor any of the secondary caches contain a valid copy of the cache line.

Let us now consider a read by a processor on station Y for the same cache line, which is in the LI state in the memory module on station Y. The memory controller determines which processor has the dirty copy, and that processor then forwards a copy of the cache line to the requesting processor and to the memory module. Upon receiving the data, the memory controller writes it to DRAM and ORs the bit corresponding to the requesting processor into the processor mask in the directory. The new state of the cache line will be LV, indicating that copies of the cache line are located on this station only. The memory and the processors indicated by the processor mask have valid copies of the cache line.

Next we consider the case where a shared read request issued by a processor on station X arrives at a memory module on station Y, where the cache line is in the GI state. In this example, we assume that the cache line is dirty on another station Z. We also assume that on station Z, the network cache entry for this cache line is in the LI state. The home memory module sends a read request message (identifying the requesting processor on station X) to station Z using the routing mask. Using the information in its processor mask, the network cache on station Z obtains the dirty copy from the secondary cache, causing the state to change to GV in the network cache. The dirty data is forwarded to station X and a copy is also sent to the home memory module (in a separate transmission). When the data arrives at station X, a copy is written to both the network cache and the requesting processor. In the network cache the state of the cache line is changed to GV and the processor mask is set to indicate the requesting processor. When the data arrives at the home memory module, it is written into DRAM. The existing routing mask in the memory is OR'ed with the bits corresponding to stations X and Y, and the state of the cache line is changed to GV.

Figure 7: Coherence actions for a remote write.

As a final example, we consider a write request by a processor on station X for a cache line whose home location is on station Y. In this final example we would also like to describe the locking mechanism that allows cache coherence and provides support for different consistency models. Figure 7 illustrates the necessary actions. Let us assume that the network cache state on station X is GI (i.e., there is no valid copy in the NC or in any of the processor caches on the station), and that the cache line is in the GV state in the home memory. The processor's request goes first to the network cache on station X. The network cache locks this location and sends a write request packet to station Y (a write request means that the memory module should provide the data and give write permission). When the request reaches the home memory module, the data is sent to station X and all other copies are invalidated. The invalidation scheme is implemented as previously suggested in [11]. The home memory location is locked when the invalidate request packet is issued. The invalidate packet reaches the highest level of the (sub)hierarchy needed to multicast it to stations with copies; it is then distributed according to the routing mask, which identifies all stations with valid copies, plus station X. When the invalidate packet returns to station Y (where it originated), the memory location is unlocked and placed in the GI state, and the routing mask is updated to indicate station X as the only station with a copy. It is important to note that the invalidation requests do not have to be acknowledged by the caches that invalidate their copies of the cache line.

When the cache line reaches station X, the network cache writes it into its DRAM and waits for the invalidate packet to arrive. It is guaranteed that the data packet will arrive before the invalidate message, because the memory module sends the data first and the ring hierarchy preserves the ordering of messages. Upon arrival of the invalidate packet, the network cache sends the data from its DRAM to the requesting processor and puts the cache line into the LI state. Also, the processor mask is set to indicate which processor on the station has the copy.

Some further aspects of the NUMAchine coherence protocol are summarized below.

• The protocol exploits the natural support for multicasting in the ring hierarchy for low-overhead invalidation of shared data in remote stations. The routing mask in the network-level directory of the home memory specifies the stations that must receive an invalidation message. From the home station, a single invalidation message with the routing mask ascends the ring hierarchy to the first level from which all stations specified by the routing mask can be reached. After passing a sequencing point for ordering invalidation messages at that level, copies of the invalidation message then descend to all stations specified by the routing masks, including the home station and the station which issued the write request causing the invalidation. The sequencing point in each ring is the connection to a higher-level ring, except in the central ring, where one of the interfaces is designated as the sequencing point. The sequencing points and unique paths in the ring topology guarantee a global ordering of invalidations for different cache lines performed by different processors. These properties enable an efficient implementation of sequential consistency.

• The protocol supports a sequential consistency model by exploiting the order-preserving properties of the ring hierarchy [10]. The arrival of an invalidation message at the station that issued the corresponding write request serves as an acknowledgement and permits the write to proceed. As shown in Figure 7, the network cache receives the data and remains locked for that cache line until the invalidation returns. Upon receiving this invalidation, the data is sent to the requesting processor. Sequential consistency is ensured because the data response sent to the processor is globally ordered with all invalidations to other cache lines. The delay between receiving the data response and the invalidation can be very short if the data response and the invalidation message traverse the same path in the ring hierarchy. The delay can also be relatively long if the data comes from a nearby processor and the invalidation message must go to the top of the hierarchy. Simulation results show that the percentage of cases with large delays is insignificant to system performance; when compared to a system in which no locking mechanism for the purpose of consistency is used, only a 2% difference in overall performance was noted.

• The protocol combines routing masks for multiple stations to maintain a full directory with a cost that is bounded logarithmically by the number of processors in the system. This makes it possible to maintain a full directory entirely in hardware. When a station has exclusive access to a cache line, then the routing mask in the directory unambiguously indicates this station. When multiple stations have shared copies, the routing mask in the directory may specify more stations than actually have shared copies, due to the inexact nature of the routing masks. The protocol is designed to handle this imprecision when specifying multiple stations; the set of possible states in the network caches reduces the impact of this ambiguity on performance. For example, if an invalidation arrives at a network cache for a cache line in the GI state due to an ambiguous routing mask, then the invalidation will not be sent to any of the local processors.

• The protocol exploits the presence of the network cache to confine coherence operations within a remote station for both shared and exclusive data. In particular, the network cache avoids the latency of retrieving shared data from the remote home memory, which would otherwise be the case because contemporary microprocessor coherence protocols typically do not support cache-to-cache transfers of shared data (i.e., in response to a request for a cache line, the processor will respond with data only if the cache line is dirty). The LV and LI states in the network cache avoid sending write requests to the home memory because these states indicate that the only valid copies of data are located on this station.

• The protocol avoids unnecessary data transfers between stations for write permission requests whenever possible. When a processor requests write permission for a cache line for which it already has a shared copy, the home memory normally responds only with an acknowledgement, thereby avoiding the communication overhead of sending the data. However, the directory in the memory module may indicate that the shared copy was invalidated before the request arrived (the remote station had not yet seen the invalidation message when it sent the write permission request), in which case the home memory forwards the data to the requesting station, in order to avoid the latency of issuing a new request for the data. On the other hand, the directory may be ambiguous as to whether or not the requesting station still has a valid copy, due to the inexact nature of the routing masks. In this case, the protocol optimistically assumes that the data is still valid at the remote station and responds only with an acknowledgement. While the latency increases if the assumption proves to be wrong, overall performance improves if the assumption is correct. Simulation results in Section 4 indicate that the assumption is correct in most cases.

In summary, the basic coherence mechanism for the NUMAchine cache coherence protocol is write-back/invalidate. This mechanism is the best choice for today's applications, which are designed to exhibit as much locality in accessing data as possible. Since we efficiently enforce the strictest model of memory consistency (sequential consistency), our implementation also enables us to efficiently support any other memory consistency model that is supported by the processor/secondary cache subsystem. In order to make the protocol efficient, communication overhead must be minimized, which we have successfully achieved. At the same time, we have kept latency low for all cases except for the optimistic one described above, which makes a decision in favor of low communication overhead at the expense of a slight increase in latency. Since latency can be tolerated using techniques such as prefetching, we have concentrated more strongly on reducing the communication overhead. Finally, the NUMAchine cache coherence protocol is conducive to low-cost implementation in hardware, because the amount of memory required for the directories is small and the logic circuitry needed to manipulate those directories is reasonable.

2.4 Deadlock Avoidance and Flow Control

NUMAchine ensures that deadlock will not occur by dealing with messages that elicit responses from the target stations differently from those that do not elicit a response; we refer to the former as nonsinkable and the latter as sinkable. Sinkable messages include read responses, write-backs, multicasts, and invalidation commands, while nonsinkable messages include all types of read requests (including interventions). To avoid deadlock, certain functional modules in NUMAchine use separate queues to hold sinkable and nonsinkable messages. (A single queue would suffice if one could guarantee that a nonsinkable message is received only when it can be fully processed without blocking subsequent sinkable messages.)

The following rules govern the handling of sinkable and nonsinkable messages once they have entered the network:

• the ordering of sinkable messages is maintained,

• a downward path always exists for sinkable messages, and

• sinkable messages are given priority over nonsinkable messages.

For implementation purposes, an additional requirement is that the number of nonsinkable messages in the system is bounded. This is guaranteed because the number of nonsinkable messages that are issued from a station is limited by the local ring interface (up to 16 in our prototype). This bound implies that the size of some of the nonsinkable queues in NUMAchine grows as a linear function of the number of stations. Although this is not a scalable approach, it does not impose any practical limitations for the target system sizes. (An alternative scalable strategy is to negatively acknowledge nonsinkable messages when a queue becomes full, turning a nonsinkable message into a sinkable one.) For example, a queue size of 32 KBytes per ring interface would be sufficient to handle the nonsinkable messages in a system with one thousand processors.
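A minimal sketch of the queueing discipline implied by the rules above is shown below: separate queues for sinkable and nonsinkable messages, with sinkable traffic always served first so that responses are never blocked behind requests. The queue type, capacity, and function names are illustrative assumptions.

```c
#include <stddef.h>

#define QCAP 64                       /* assumed queue capacity */
typedef struct msg msg_t;             /* opaque message type    */

typedef struct {
    msg_t *items[QCAP];
    size_t head, count;
} queue_t;

static msg_t *dequeue(queue_t *q)
{
    if (q->count == 0)
        return NULL;
    msg_t *m = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return m;
}

/* Called when the interface may forward one message toward the station. */
msg_t *next_message(queue_t *sinkable, queue_t *nonsinkable)
{
    msg_t *m = dequeue(sinkable);     /* sinkable messages have priority */
    if (m == NULL)
        m = dequeue(nonsinkable);     /* requests go out only when no
                                         responses are waiting */
    return m;
}
```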

In addition to the deadlock avoidance scheme described above, NUMAchine has a simple flow control mechanism optimized for expected bandwidth requirements. Each ring interface has an input queue large enough to handle short-term traffic bursts. Whenever the input queue is close to capacity, the operation of the ring that is feeding the buffer is temporarily halted; other rings are unaffected and continue to operate. Meanwhile, packets at the head of the full queue are processed until it is empty enough to start up the ring again. The flow control ensures that the order of sinkable requests is maintained, and it can be shown that this allows for important common-case optimizations to the coherence protocol. However, it can result in poor performance under extreme conditions, such as when many processors simultaneously flush modified data from their caches to remote memory.

3 Prototype Design Details

The NUMAchine prototype (currently under construction) consists of a 64-processor system using two levels of rings, with the 4x4 geometry shown in Figure 8. The initial CPU boards will utilize 150 MHz MIPS R4400 processors, with 1 MB of L2 cache. The Network Cache on each station will contain at least 4 MB of memory.

Figure 8: Geometry for prototype machine. Each station contains 4 processors.

3.1 System Modules

In the following subsections, the various system modules in the prototype are discussed, followed by a description of the hardware support for monitoring. All modules are described at the block diagram level; we do not provide details of the actual hardware circuitry.

3.1.1 Processor Module

Figure 9 provides a block diagram of a processor module. It contains a MIPS R4400 processor (this will likely be changed to the MIPS R10000 processor when it becomes available), and a 1-MByte secondary cache. The R4400 requires that the user provide (complex) circuitry to handle signalling between the processor and the rest of the system; this circuitry is called the external agent in the figure. The external agent handles formatting of all data and commands in both directions: those that are generated by the processor, and those being sent to the processor. The normal path for information flowing between the processor and the NUMAchine bus is through the FIFOs shown in the figure. The FIFOs are included in the design because they allow for efficient communication over the bus, since the FIFOs allow the processor module to be ready to receive data even if the R4400 itself is not ready for an external request. The bypass register in Figure 9 allows the outgoing FIFO to be bypassed for certain operations, but we will not explain the details of its usage here.

The Bus Interface block handles the flow of data between the FIFOs and the NUMAchine bus (which uses the mechanical and electrical specifications of the FutureBus standard, but employs custom-designed control). The Bus Interface also performs arbitration when the processor module wishes to communicate with any other module over the NUMAchine bus. The other blocks in the processor module are for ancillary purposes, as explained below:

• Local Bus Interface. This allows for connection of miscellaneous devices that will be used for initial debugging. The local bus interface connects the external agent to a bus that is local to the processor card only. Connected to this bus are a ROM (for providing initialization code for the R4400 to execute), and an interface, called the Gizmo Interface, for connecting to a microcomputer board. The microcomputer board is employed as a simple and convenient way of providing access to I/O devices, like UARTs and Ethernet. Note that the Gizmo Interface will be used only during initial debugging, since each NUMAchine station will have a dedicated card for I/O devices.

• Monitoring, Interrupt and Barrier Registers. Various types of monitoring circuits for measuring performance with respect to the processor module are provided in this block. Also, for convenience, two registers that are not part of the monitoring circuitry are situated in this module. The Interrupt Register is used to send an interrupt to the R4400; it allows a number of modules in the NUMAchine prototype to interrupt the R4400 processor.

Figure 9: A NUMAchine Processor Module.

3.1.2 Memory Module

A block diagram of a NUMAchine memory module appears in Figure 10. Data and commands enter and leave the memory module through FIFOs, in a similar manner as described above for a processor module. The main control circuitry in the memory module is called the Master Controller, which controls reading and writing to the FIFOs (on the side opposite from the bus) and which controls the other functional blocks in the memory module. The DRAM block contains up to 2 GBytes of memory and a DRAM controller; the memory is split into two banks and is interleaved. The DRAM controller supports accesses by cache lines, and also allows access to individual bytes, words, etc.

Figure 10: A NUMAchine Memory Module.

The Hardware Cache Coherence block maintains the cache coherence directories in SRAM. Cache coherence actions take place in parallel with DRAM activity and are synchronized (via the Master Controller) whenever necessary. The cache coherence block implements all of the coherence actions and state transitions for cache-line status bits needed for the NUMAchine cache coherence protocol, as described in Section 2.3.

The Special Functions and Interrupts block provides operations in addition to normal memory access commands. Examples of special functions are block transfers of data from DRAM, kill operations for a range of cache lines, writes directly to SRAM (which allows system software to bypass the hardware cache coherence actions), etc. This block also contains circuitry for forming interrupt packets, so that the memory module can send an interrupt to a processor, either because of an error condition, or to indicate completion of a special function.

3.1.3 Ring Interfaces

Two types of ring interfaces are needed in the NUMAchine architecture. The local ring interface provides the functionality needed to transmit messages to and from a given station and its associated local ring. This includes the capability for formatting outgoing packets and interpreting incoming packets. The inter-ring interface is much simpler, since it merely acts as a switch between two levels of rings in the hierarchy.

Local Ring Interface A block diagram of the local ring interface is depicted in Figure 11. Its upward path (to the ring) consists of a packet generator, an output FIFO, and a latch. The packet generator transforms incoming bus transactions into one or more ring packets and places them into the output FIFO. If the message must be split into multiple packets, then a distinct tag is assigned to each outgoing packet to enable re-assembly at the destination. Packets in the output FIFO are placed onto the ring as slots become available.

Figure 11: Local Ring Interface.

The downward path consists of an input FIFO, a packet handler, a sinkable queue, and a nonsinkable queue. The input FIFO is used as a buffer between the high-speed ring and the packet handler. Since a slotted-ring protocol is used to transfer packets on rings, a cache line is not necessarily transferred in packets that occupy consecutive slots. As a result, the packet handler's primary task is to reassemble messages, such as cache lines, based on the tags assigned when the ring packets were sent.
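The reassembly step can be sketched as follows. Packets belonging to one transfer share a tag but need not arrive in consecutive slots; the handler gathers them by tag until the whole cache line is present. The line size, packet payload size, and table organization are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define LINE_BYTES    128                       /* assumed cache line size   */
#define PKT_BYTES     8                         /* assumed payload per packet */
#define PKTS_PER_LINE (LINE_BYTES / PKT_BYTES)
#define MAX_TAGS      16

typedef struct {
    uint8_t  data[LINE_BYTES];
    uint16_t received;          /* bitmap of packets seen so far */
} reassembly_t;

static reassembly_t table[MAX_TAGS];

/* Returns true (and copies out the full line) when the last packet arrives. */
bool packet_arrived(unsigned tag, unsigned seq, const uint8_t *payload,
                    uint8_t line_out[LINE_BYTES])
{
    reassembly_t *r = &table[tag];

    memcpy(&r->data[seq * PKT_BYTES], payload, PKT_BYTES);
    r->received |= (uint16_t)(1u << seq);

    if (r->received == (uint16_t)((1u << PKTS_PER_LINE) - 1)) {
        memcpy(line_out, r->data, LINE_BYTES);
        r->received = 0;        /* free the tag for the next transfer */
        return true;
    }
    return false;
}
```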

Inter-Ring Interface Both upward and downward paths of the inter-ring interface are implemented with simple FIFO buffers. They are needed because a packet that has to go from one ring to another can do so only when an empty slot is available on the target ring. These buffers must be large enough to accommodate bursts where many consecutive packets on one ring have to go to the next-level ring. In simulations of our prototype machine these buffers never contain more than 60 packets.

Routing decisions in the inter-ring interface are very simple in our communications protocol. Because of this simplicity it is feasible to operate the higher-level rings at higher speed, which might be a pragmatic approach if bisection bandwidth were to prove to be an issue in large systems.

3.1.4 Network Cache

The network cache (NC) is shared by all processors in a station and is used to cache data originating from other stations. The cache lines are maintained in DRAM so that very large network caches can be implemented at reasonable cost. The NC should be at least as large as the combined capacities of the secondary caches on the station, and can be made larger. SRAM is used to maintain status and control information for each cache line so that it can be accessed quickly.

The NC serves a number of useful purposes. It acts as a shared tertiary cache for the station, as a target for broadcasts, and as a target for prefetching when the processor does not support prefetching directly into its primary or secondary caches. It also performs a function akin to snooping, which is usually found in bus-based systems. In this section, uses of the NC are described. Section 4 will show the effectiveness of the NC, based on simulations of our NUMAchine prototype.

A read request to non-local memory is always directed to the local NC. If the cache line is present in the NC, then the NC responds with the data. If the NC knows that the cache line is dirty in a secondary cache on the station, it causes the data to be transferred to the requester. Otherwise, the NC sends a read request to the home memory module. When the data arrives from the remote station, it is forwarded to the requesting processor and a copy is kept in the NC. Subsequent read requests for the same cache line by another processor on the station are satisfied by the network cache, avoiding remote memory accesses. In effect, the NC replicates shared data from remote memories into the station. This feature is referred to as the migration effect of the NC.

The NC retains shared data that is overwritten in a processor's secondary cache, if the new cache line does not map into the same location in the NC as the cache line that is overwritten. Also, dirty data ejected from a processor's secondary cache due to limited capacity or conflicts is written back into the network cache, but not necessarily to the remote memory. If such data is needed again, it will be available from the network cache. This feature of the NC is referred to as the caching effect.

The NC "combines" concurrent read requests to the same remote memory location into a single request that propagates through the network to the remote home memory module. This occurs as a direct consequence of locking the location reserved for the desired cache line; subsequent requests for the locked line are negatively acknowledged, forcing the processors to try again. After the response to the initial request arrives, subsequent requests are satisfied by the NC. In this respect, the NC reduces network traffic and alleviates contention at the remote memory module. This feature is referred to as the combining effect of the NC.
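The combining behavior just described can be summarized in a short sketch of the NC's handling of a remote read. The status values and the helpers nc_lookup(), nc_lock(), reply(), nack(), and send_remote_read() are hypothetical stand-ins for controller actions, used here only to show the decision structure.

```c
#include <stdint.h>

typedef enum { NC_MISS, NC_VALID, NC_LOCKED } nc_status_t;

extern nc_status_t nc_lookup(uint64_t addr);
extern void nc_lock(uint64_t addr);
extern void reply(unsigned requester, uint64_t addr);
extern void nack(unsigned requester, uint64_t addr);
extern void send_remote_read(uint64_t addr, unsigned requester);

void nc_remote_read(uint64_t addr, unsigned requester)
{
    switch (nc_lookup(addr)) {
    case NC_VALID:
        reply(requester, addr);        /* satisfied locally by the NC */
        break;
    case NC_LOCKED:
        nack(requester, addr);         /* a request is already outstanding;
                                          the processor retries and is later
                                          satisfied by the NC */
        break;
    case NC_MISS:
        nc_lock(addr);                 /* reserve and lock the entry */
        send_remote_read(addr, requester);
        break;
    }
}
```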

The NC localizes coherence traffic for cache lines used only within a station but whose home location is in a remote memory module. Such lines exist in either the LV or LI state in the NC, and all coherence actions for these lines involve only the NC and not the remote home memory module. For example, assume that a dirty cache line exists in a secondary cache and that its state in the network cache is LI. If another processor on the same station reads this cache line, then the NC determines from its processor mask which processor has the dirty copy, and that processor sends the data to both the requesting processor and the NC. The state in the NC now becomes LV. If one of these two processors later requests exclusive access to the same cache line, the line becomes dirty again, and the NC invalidates the other copy. The state of the line in the NC becomes LI. All this is done locally, without having to send any messages to the home memory, which maintains the cache line in the GI state. This feature is referred to as the coherence localization effect of the NC.

The network cache is a convenient target for broadcasts (multicasts). Data produced by one processor and needed by other processors can be broadcast, to avoid hot-spot contention at memory modules and in the interconnection network. Other possibilities for broadcast targets are less attractive: broadcasting into secondary caches requires complicated hardware on each processor and can eject data in use by the processor; broadcasting into memory modules is impractical for addressing reasons.

The NC can also be used for prefetching data if the processor does not support prefetching directly. Prefetching can be implemented easily as a "write" request to a special memory address which causes the hardware to initiate the prefetch [16]. The disadvantage of prefetching into the network cache is that the data is not placed as close as possible to the requesting processor.

The use of the NC obviates the need for snooping on the station bus, saving cost and reducing hardware complexity. A processor can obtain remote data from the NC instead of obtaining it from another processor's secondary cache through snooping. In fact, the NC provides complete snooping functionality. It responds with shared data directly, and causes forwarding of dirty data as explained above. This functionality may not be easy to achieve using snooping, because many modern processors make it difficult to implement a mechanism that allows shared data to be obtained from a secondary cache.

3.2 Hardware/Software Interaction

We have chosen to give software access to the low-level capabilities of our hardware. This low-level control, in conjunction with the natural multicast capability of our interconnect, allows system software to provide applications with a rich set of features. We first describe some of the low-level control that is provided to system software, and then briefly describe some of the capabilities this control gives to applications and system software.

• System software can (bypassing the coherence protocol of the hardware) read and write the tags, state information, and data of: any memory module in the system, any network cache, and the local secondary cache. These accesses can be performed atomically with respect to coherence actions by the hardware and with respect to other such accesses by software. (Atomic access is performed using special locked read and write operations; the lock acquired is the per-cache-line lock used by the coherence protocol, allowing system software to manipulate cache lines even while they are being accessed by application software.) The data of the caches can be accessed either by index or by address.

• A processor can request any memory in the system to: invalidate shared copies of any of its cache lines, kill dirty copies, and obtain (at memory) a clean exclusive copy. Similarly, a processor can request any network cache to: invalidate any shared copies of cache lines local to the station, kill a dirty local copy, prefetch a cache line that will soon be accessed, or write a dirty cache line back to (remote) memory.

• Some of the above operations (e.g. requests to set the state of a cache line, invalidate a cache line, and write back a dirty cache line) can be done using a block operation that affects all cache lines in a range of physical memory. For example, a single request to a network cache can be used to invalidate the locally cached copies of data belonging to a sequence of physical pages. (The initiating processor receives an interrupt when the requested operation has completed.) Also, a block prefetch request can be made to the network cache, which will asynchronously prefetch the requested block of data from remote memory.

• NUMAchine supports efficient coherent memory-to-memory block copy operations, where the unit of transfer is a multiple, R, of the cache line size. The request to copy a region of memory is made to the target memory module, which for each group of R cache lines: kills any existing cached copies of the cache lines, and makes a request to the source memory module to transfer them. The source memory module collects any outstanding dirty copies of affected cache lines from secondary caches, and transfers the data to the target memory using a single large request of R cache lines. An efficient block transfer capability facilitates page migration and replication, which are used in NUMA systems to improve performance through improved locality [7, 14].

• System software can directly specify some of the fields of packets generated in the processor module. This can be used to implement special commands and to multicast packets to many targets by specifying the routing mask used for distributing the packet. For example, software can supply a routing mask to be used for write-back packets, causing subsequent write-backs of cache lines from the secondary cache to be multicast directly to the set of network caches specified by the routing mask (as well as to memory).

• Each processor module has two interrupt registers, one for cross-processor interrupts and one for device interrupts. Data written to an interrupt register is ORed with the current contents of the register, and causes an interrupt. When the processor reads an interrupt register, the register is automatically cleared. For cross-processor interrupts, the requesting processor identifies the location of the target register(s) using a routing mask and a per-station processor bit mask. This allows an interrupt to be multicast to any number of processors, which may be used for an efficient implementation of TLB shoot-down [2] or other coordinated OS activities [19]. On a request to an I/O device, system software can specify the processor to be interrupted as well as the bit pattern to be written to the processor's interrupt register when the request has completed. This flexibility allows a processor to efficiently handle concurrent requests (such as I/O requests and memory-to-memory transfers) involving multiple devices distributed across the system.

- The performance of SPMD applications that use barriers often depends on the efficiency of the barrier implementation [18]. To efficiently support barrier synchronization, each processor module has a barrier register that differs from the interrupt register only in that writes to the barrier register do not cause an interrupt. In a simple use of these registers, when a processor reaches a barrier it multicasts a request that sets a bit corresponding to its ID in the barrier register of each of the participating processors, and then spins on its local register until all participating processors have written their bit. (A minimal sketch of this usage appears after this list.)

- At system boot time, the latency and bandwidth of all components of the system (e.g., network components, processors) can be constrained. This will allow practical experimentation to determine the effect of different relative performance between system components, such as processor speed, network latency, and network bandwidth.
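As a rough illustration of the barrier-register usage described in the barrier item above, here is a minimal C sketch. The memory-mapped addresses (BARRIER_REG_LOCAL, BARRIER_REG_MULTICAST) and the caller-supplied participant mask are assumptions for illustration only; they do not come from the NUMAchine documentation.

    #include <stdint.h>

    /* Hypothetical memory-mapped register addresses; the actual NUMAchine
       register layout is not specified in this report.                     */
    #define BARRIER_REG_LOCAL     ((volatile uint64_t *)0xB0000000UL)
    /* Writes through this alias are assumed to be multicast, via a routing
       mask and per-station processor bit mask, to the barrier register of
       every participating processor (including the writer's own).          */
    #define BARRIER_REG_MULTICAST ((volatile uint64_t *)0xB0000008UL)

    /* Each participant announces its arrival by setting its ID bit in every
       participant's barrier register, then spins locally.  Bits read from
       the local register are accumulated in software, so the sketch works
       whether or not reads clear the register.                             */
    void barrier_wait(unsigned my_id, uint64_t participants)
    {
        uint64_t seen = 0;

        *BARRIER_REG_MULTICAST = 1ULL << my_id;      /* announce arrival   */
        while ((seen & participants) != participants)
            seen |= *BARRIER_REG_LOCAL;              /* spin on local copy */
    }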

While much of the functionality that results from the above control is obvious, sophisticated application and operating system software can make use of this control in a number of non-obvious ways. In the remainder of this section we give three non-trivial examples of how this control could be used.


Update of shared data

Consider the case where many processors are spinning on a data element (e.g., a eureka variable [18]) and some processor writes that data. With a write-invalidate protocol, when the processor modifies the data all the shared copies of the data are invalidated. Hence, data accessed in this fashion involves both a large latency to make the modification and contention at the memory module when the spinning processors obtain a new copy. With the above control, software can instead temporarily bypass the hardware coherence, modifying shared data and multicasting it to the affected network caches without first invalidating the shared copies.

In particular, the system software interacts with the hardware to: 1) obtain the routing mask of network caches at stations caching the data, 2) lock the cache line to ensure that additional stations are not granted access to it, 3) modify the state of the cache line in the secondary cache to dirty, 4) modify the contents of the cache line in the secondary cache, and 5) multicast the cache line using the routing mask obtained earlier. When the updates arrive at a network cache, the network cache invalidates any copies in local secondary caches. When the update arrives at memory, the cache line is unlocked.
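The five-step sequence above might look roughly as follows in system software. The helper names (nc_routing_mask, lock_cache_line, l2_set_line_dirty, multicast_cache_line) are invented for illustration and do not correspond to a documented NUMAchine programming interface; this is a minimal sketch, not the actual implementation.

    #include <stdint.h>

    /* Invented helpers standing in for the low-level accesses (bypassing
       the hardware coherence protocol) described earlier in this section. */
    extern uint32_t nc_routing_mask(void *addr);                    /* step 1 */
    extern void     lock_cache_line(void *addr);                    /* step 2 */
    extern void     l2_set_line_dirty(void *addr);                  /* step 3 */
    extern void     multicast_cache_line(void *addr, uint32_t mask);/* step 5 */

    /* Update a eureka-style variable in place and push the new value to the
       network caches that hold it, without first invalidating their copies. */
    void update_and_multicast(volatile long *var, long value)
    {
        uint32_t mask = nc_routing_mask((void *)var);  /* stations caching it */
        lock_cache_line((void *)var);                  /* no new sharers      */
        l2_set_line_dirty((void *)var);                /* own it in the L2    */
        *var = value;                                  /* step 4: modify data */
        multicast_cache_line((void *)var, mask);       /* push update; memory
                                                          unlocks the line on
                                                          arrival             */
    }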

Software managed caching

NUMAchine allows system software a fair bit of control over how data is cached and how coherence is maintained. At the simplest level, system software can specify on a per-page basis: (i) whether caching should be disabled or enabled, (ii) whether the coherence of cached data should be enforced by hardware, (iii) whether hardware should allow multiple processors to have data in a shared state (or only allow exclusive access by a single cache), and (iv) if the processor supports it, whether coherence should be maintained using an update or invalidate protocol.
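One plausible way for the operating system to record these per-page choices is a small set of attribute flags, sketched below; the encoding is hypothetical and is not taken from the NUMAchine system software.

    /* Hypothetical per-page attribute bits mirroring choices (i)-(iv). */
    enum page_cache_policy {
        PG_CACHEABLE    = 1 << 0,  /* (i)   caching enabled for this page     */
        PG_HW_COHERENT  = 1 << 1,  /* (ii)  hardware enforces coherence       */
        PG_ALLOW_SHARED = 1 << 2,  /* (iii) several caches may hold the line,
                                             rather than one exclusive owner  */
        PG_USE_UPDATE   = 1 << 3   /* (iv)  update rather than invalidate,
                                             where the processor supports it  */
    };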

We are currently evaluating supporting both sequential consistency and a weaker model (that doesn't quite fit any of the established weak-consistency definitions). The full overhead of this is not yet clear, and more importantly it is not clear what the performance advantages will be, since on our architecture the topology of the interconnect allows sequential consistency to be implemented at much lower overhead than on other architectures.

For cache-coherent pages, software can use some of the hardware control described above to improve performance. For example, multicasting data can be used by software to reduce latency, and data can be written back from any cache under software control to reduce the cost of coherence. Similarly, with a write-update hardware protocol, processors that are no longer using the data can explicitly invalidate it from their secondary and network caches in order to reduce the overhead of updates.

Cacheable but non-coherent pages can be used to enable software-controlled cache coherence. Such techniques can take advantage of application-specific semantics to reduce the overhead of coherence for many applications [21]. To make the implementation of these techniques more efficient, NUMAchine maintains state about cache lines (such as which processors have the data cached) that can be directly accessed by the software. We also expect that the support for multicast interrupts provided by our hardware will be useful for some of these techniques.

In-cache zeroing/copying

The operating system must, for security reasons, zero-fill memory pages when passing them between applications. Similarly, operating systems often have to copy data between different buffers. For both of these operations, the cost of reading the data that is to be overwritten can in many cases dominate performance.

NUMAchine minimizes the overhead for zeroing or copying data by allowing these operations to be done without loading the data that will be overwritten into the processor cache. To copy data between a source and target page, the operating system: (1) makes a single request to the affected memory module to invalidate any cached lines of the target page, mark the state as dirty, and set the routing mask (or processor mask) to the processor performing the copy, (2) creates the cache lines of the target page in the secondary cache by modifying the tag and state information of the secondary cache, and (3) copies data between the source and target page. Zero-filling pages is identical to copying pages, except for the final stage, where the created cache lines are instead zero-filled.
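A minimal sketch of the page-copy sequence follows, assuming a 64-byte cache line (as in Table 1), a 4 KB page, and invented helpers for the tag/state manipulations of steps (1) and (2); it is illustrative only, not the actual OS code.

    #include <stddef.h>
    #include <string.h>

    #define CACHE_LINE 64          /* NUMAchine cache line size (64 bytes)  */
    #define PAGE_SIZE  4096        /* page size assumed for this sketch     */

    /* Invented helpers for the direct tag/state manipulations described in
       this section; they are not a documented API.                         */
    extern void mem_invalidate_and_own_page(void *page, int cpu);  /* step 1 */
    extern void l2_create_dirty_line(void *line);                  /* step 2 */

    void page_copy_in_cache(void *dst_page, const void *src_page, int my_cpu)
    {
        /* (1) One request to the home memory module of the target page:
               invalidate cached lines, mark them dirty, and point the
               routing/processor mask at the copying processor.            */
        mem_invalidate_and_own_page(dst_page, my_cpu);

        for (size_t off = 0; off < PAGE_SIZE; off += CACHE_LINE) {
            char *dst = (char *)dst_page + off;

            /* (2) Materialise the line in the secondary cache by writing
                   its tag and state directly, so it is never fetched
                   from memory.                                            */
            l2_create_dirty_line(dst);

            /* (3) Ordinary copy; use memset(dst, 0, CACHE_LINE) instead
                   for the zero-fill variant.                              */
            memcpy(dst, (const char *)src_page + off, CACHE_LINE);
        }
    }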

3.3 Performance Monitoring Hardware

NUMAchine includes considerable hardware dedicated to monitoring system performance in a non-intrusive fashion. Monitoring hardware is distributed amongst all major sub-systems of the multiprocessor, including the memory system, the processor modules, the ring interfaces, and the I/O subsystem. For convenience, monitoring hardware is collectively referred to in this section as "the monitor".

The overall uses of the monitor are as follows: 1) investigate and evaluate architectural features of NUMAchine, 2) provide real-time feedback concerning utilization of system resources, to allow tuning of application programs, 3) accumulate trace history information to aid hardware and software debugging, 4) validate our NUMAchine hardware simulator, and 5) characterize application workloads for high-level performance modelling or network simulations.

A key feature of the monitor is that it is implemented in high-capacity programmable logic devices (PLDs). Because the PLDs being used (Altera MAX7000 complex PLDs and FLEX8000 FPGAs) are re-programmable, the same monitoring circuits can be re-configured to perform different functions. This offers tremendous flexibility because a wide variety of measurements can be made without incurring excessive cost.

In general, the monitor comprises a number of dedicated hardware counters, flexible SRAM-based counters, and trace memory. The dedicated hardware counters monitor critical resources such as FIFO buffer depths and network utilization. For example, bus and ring-link utilization are important overall performance metrics that can be monitored by dedicated counters. The SRAM-based counters are used to categorize and measure events in a table format. A good example of this is in the memory module, where transactions can be categorized based upon the type of transaction and its originator; a table counting each transaction from each originator would be monitored. This information can help identify resource "hogs", or even program bottlenecks. In addition to counters, trace memory (DRAM) is used to recall history information about bus traffic, for example. This allows non-intrusive probing into activity just before or after an important event such as a hardware error or software barrier.

A novel feature of the monitor is that information gathered can be correlated with execution of specific segments of code, by particular processors. This is implemented by a small register, called a phase identifier, at each processor. As executing code enters regions that should be distinguishable for monitoring purposes, the code writes into the phase identifier register; this information is appended to each transaction from the processor and is used by the monitor throughout the system.
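For example, application or OS code might bracket a region of interest as in the sketch below; the register address, width, and helper name are assumptions for illustration only.

    #include <stdint.h>

    /* Hypothetical address and width of the per-processor phase identifier
       register; neither is specified in this report.                       */
    #define PHASE_ID_REG ((volatile uint32_t *)0xB0001000UL)

    /* Tag all transactions issued while 'work' runs with 'phase_tag', so
       the distributed monitor can attribute them to this code region.      */
    void run_monitored_phase(uint32_t phase_tag, void (*work)(void))
    {
        *PHASE_ID_REG = phase_tag;  /* subsequent transactions carry this tag */
        work();
        *PHASE_ID_REG = 0;          /* revert to a default/idle phase         */
    }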

In this paper, we discuss in more detail only those monitoring circuits associated with the memory subsystem. The reason for this focus is that memory system performance is a key aspect of shared-memory multiprocessor design, and offers many opportunities for improving performance. A memory module in NUMAchine, as mentioned earlier, consists of an incoming FIFO, DRAM for data, SRAM for state information, and an outgoing FIFO. The monitor measures the way in which the memory is being used by all processors in the system; to accomplish this, it monitors the incoming and outgoing FIFOs, and some of the state information for accessed memory locations. There are two main types of monitoring circuits in the memory module: multipurpose counters and histogram tables. The purpose of each of these is discussed below.


3.3.1 Multipurpose Counters

Counters in the memory module count the occurrence of events that can provide software with an indication of how memory is being utilized. One usage of such information would be for software to recognize bottlenecks by measuring depths of the memory FIFOs. As another example, if the monitor shows a large number of invalidates at a memory location, then this might indicate that false sharing is occurring. The following are examples of events that can be counted:

- total number of transactions to the memory module

- number of transactions of a specific type (e.g., read request, write request, write, etc.)

- number of invalidates sent out of the memory module

- depth of incoming and outgoing FIFOs

3.3.2 Histogram Tables

The most interesting and useful monitoring circuits in the memory modules are histogram tables that allow accumulation of statistics concerning memory accesses. The hardware for generating these tables is of a general structure, and can be configured to collect different types of information. For each table, there are two halves: one that is currently being generated, and another that was already generated and has overflowed. The idea behind this is that once any entry in a table overflows, an interrupt is generated so that software can examine the information as desired, but in the meantime monitoring can still continue using the other half of the table. For brevity, we provide only a single example of such a table below, but there are several others that are available.

3.3.3 Cache Coherency Histogram Table

When designing a cache coherence scheme, or evaluating its effectiveness, it is important to know the typical access patterns that can be expected. To some extent, such information can be discovered through simulations, but, for practical reasons, simulated behaviour is always limited in scope. The cache coherency hit table provides a way for monitoring hardware to gather detailed information about the cache coherency state of accessed memory locations. More specifically, the following information is gathered in this table: for each type of memory transaction (e.g., read request, write permission request, etc.), the table accumulates a count of the number of times that each possible cache line state is encountered. In NUMAchine's cache coherence scheme, as outlined in Section 2.3, there are four possible cache line states: local valid, local invalid, global valid, and global invalid. In addition, a cache line can be locked or unlocked in each state. The histogram table would then contain eight rows for the cache line states, and enough columns for all transaction types of interest.
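The table can be pictured as a small two-dimensional array of counters, one row per (state, lock bit) combination and one column per transaction type. The sketch below is only a software schematic of what the PLD/SRAM hardware accumulates; the particular transaction types listed are illustrative, not the configured set.

    #include <stdint.h>

    /* Cache line states from Section 2.3, plus the lock bit. */
    enum line_state { LOCAL_VALID, LOCAL_INVALID, GLOBAL_VALID, GLOBAL_INVALID };

    /* Illustrative transaction types; the real columns are chosen when the
       PLDs are configured.                                                 */
    enum txn_type { TXN_READ_REQ, TXN_WRITE_PERM_REQ, TXN_WRITE_BACK,
                    NUM_TXN_TYPES };

    /* One row per (state, locked) pair gives the eight rows described
       above; one column per transaction type of interest.                  */
    static uint32_t histogram[4 * 2][NUM_TXN_TYPES];

    /* What the monitoring hardware does, in effect, for every access. */
    void count_access(enum line_state s, int locked, enum txn_type t)
    {
        histogram[2 * (int)s + (locked ? 1 : 0)][t]++;
        /* If any entry overflows, the hardware interrupts the processor and
           switches to the table's other half so monitoring can continue.   */
    }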

To generate this table, the monitoring hardware needs inputs connected to the data outputs of the incoming FIFO (the bits that specify the type of memory access, and the memory address involved, so that monitoring can be restricted to an address range), as well as the bits in the SRAM that specify the state of the cache line being accessed. In addition, some of the other tables (not described here) that can be generated require other signals. Since the same hardware (PLDs) is reconfigured for each type of table, there are in general some inputs and outputs connected to the monitoring chips that may not be used when generating any particular table.


4 Performance Results

4.1 Simulation Goals

This section describes simulation results for the NUMAchine prototype. There are three main reasons for developing a simulator for NUMAchine: to estimate the performance of the prototype machine, to locate any weaknesses in the NUMAchine architecture and/or the particular parameters of the prototype, and to investigate a number of tradeoffs. In this document, the only simulations shown are those that provide indicators of NUMAchine's performance; however, the simulator is being used on a continuing basis to investigate architectural parameters, to improve the NUMAchine architecture. A complete behavioral simulation of a "virtual" machine has been implemented in software, using state-of-the-art simulation techniques. The simulation environment described in the next section is one step removed from a circuit-level simulation; all behaviour affecting timing and functionality was modelled in detail for each system component that could act independently.

For design verification, our primary concerns were the efficiencies of the rings, the Network Cache, and the cache coherence protocol. For the rings, the obvious question is whether network contention causes serious performance degradation. For the Network Cache, we are interested in its effectiveness at reducing network traffic. Finally, as mentioned in Section 2.3, the coherence protocol was designed optimistically, assuming that certain cases would not happen frequently; thus we wish to determine the actual frequency of these cases, in order to assess whether the protocol will perform as efficiently as hoped.

Beyond the above goals, there are many other questions on enhanced functionality that can easily be asked in a simulation. For example, the benefits of prefetching, broadcasting and weaker consistency models are all of interest. The answers to these questions (and others) are currently under investigation, but for brevity will not be reported here.

4.2 Simulation Environment

The performance of the prototype has been investigated by means of an execution-driven simulation using the SPLASH-2 [3] benchmark suite as input. The simulator itself uses Mint [22] as a front-end to interpret the native MIPS binaries produced for the suite. The back-end does behavioral modelling of the system at a cycle level. This includes all timing details (e.g. bus arbitration, DRAM and SRAM access times) as well as functional details, such as L1 and L2 data and instruction caches, and a packet-by-packet model of the rings. Figure 12 illustrates the NUMAchine simulation environment. A single binary running on either SGI or SUN workstations simulates both the multi-threaded application and the NUMAchine configuration, all of whose details are specified in the text parameter file. Run-time is quite good given the level of detail, with native versus simulated execution slowdown ratios of 100-300 when running on an SGI Challenge machine. Although aspects such as instruction fetching and serial code execution can be modelled in the simulator, they are time consuming and do not significantly affect results.* For this reason the results in the rest of this report will assume that only data caches and fetches are implemented, and only the parallel section of the code is modelled in detail. (The serial section of code still executes, but does not generate events.) Results from more detailed simulations will be contained in [4].

Table 1 gives the contention-free latencies for different types of accesses as a yardstick for comparison with results in later sections. For this data, we manually calculate the number of clock cycles required in the hardware to perform the various types of accesses (i.e., these numbers do not reflect such architectural features as caches). The two types of remote accesses represent: requests that traverse only a single lower-level ring, and requests that span the whole network.

* The SPLASH-2 applications take anywhere from 5 minutes to 24 hours to run on an SGI Challenge for each different system configuration. These times can double if both effects are turned on.


Figure 12: Simulation environment. The parameter file is a text file containing all information on timing and geometry. (The figure shows MINT interpreting the parallel MIPS binary to form a virtual parallel machine, feeding events such as reads, writes, and barriers to the NUMAchine simulator model of the network, buses, caches, and memory, which returns timing such as synchronization time and request latency.)

Table 1: Contention-free request latencies in the simulated prototype. Reads and interventions involve 64-byte cache line fills. Upgrades contain no data, only permission to write.

Data Access Type               Latency (ns)   Latency (CPU cycles)

Local:
    Read                            668            100
    Upgrade                         284             43
    Intervention                    717            108

Remote, same ring:
    Read                           1652            248
    Upgrade                        1167            175
    Intervention                   1656            249

Remote, different ring:
    Read                           1908            286
    Upgrade                        1508            226
    Intervention                   1932            290

(Note that due to the single-path nature of a ring, the distance between any two stations that are not on the same ring is equal to the span of the network, regardless of the position of the two stations.) Even without the effect of the Network Cache, these numbers indicate that the prototype behaves as a mildly NUMA architecture.


Figure 13: Parallel speedup for SPLASH-2 kernels (Radix, LU Contiguous, LU Non-contiguous, FFT, and Cholesky) versus the number of processors, with the ideal speedup shown for reference.

4.3 Overall Performance

In order to gauge the overall performance of the NUMAchine prototype, the SPLASH-2 suite was run through the simulator to measure parallel speedup; for this data we consider only the parallel section of the code, and ignore the sequential section. In the SPLASH-2 suite, the parallel section is defined as the time from the creation of the master thread until the master thread has successfully completed a wait() call for all of its children. This is not a true speedup, but is in line with other performance measurements of this nature (e.g., see [5]). In order to be conservative, all fixed hardware latencies are set to their actual values in the hardware if those values are known, and to pessimistic values otherwise. In addition, the results shown use a simple round-robin page placement scheme, which is expected to perform more poorly than if intelligent placement were done. (For example, pages containing data used by only one processor, also called private pages, are not placed in the memory local to that processor, which would be simple to ensure in a real system.) For these reasons, we expect the actual prototype hardware to have equal or better performance than the results shown here indicate.
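Concretely, the quantity plotted in the following figures can be read as the ratio of parallel-section execution times (our restatement of the measurement just described):

    speedup(P) = T_parallel(1) / T_parallel(P)

where T_parallel(P) is the time from the creation of the master thread until its wait() for all children completes, when running on P processors.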

Figures 13 and 14 show the parallel speedups for the SPLASH-2 benchmarks. All benchmarks are unmodified, except for LU-Contig, which used a slightly modified block-allocation scheme to improve workload balance.* Table 2 gives the problem sizes used for generating the speedup curves.

Highly parallelizable applications such as Barnes and Water show excellent speedups, as high as 57. Of more interest is NUMAchine's performance for code that has a higher degree of data sharing. For FFT and LU, examples of such code, the speedups are still good, especially given the small problem sizes. These results compare favorably with measurements of the SPLASH-2 suite in [5] using a perfect memory system. This leads us to believe that with suitable tuning of both hardware and software, performance will be on par with the existing state-of-the-art.

* For the sake of other SPLASH-2 experimenters, the BlockOwner routine was changed.


Figure 14: Parallel speedup for SPLASH-2 applications (Water Spatial, Radiosity, Barnes, Water Nsquared, Ocean, FMM, and Raytrace) versus the number of processors, with the ideal speedup shown for reference.

Table 2: Problem sizes used for the SPLASH-2 benchmarks.

Benchmark                          Problem Size / Input
LU Contiguous & Non-contiguous     512x512 Matrix, 16x16 Blocks
FFT                                65536 Complex Doubles (M=16)
Radix Sort                         262144 Keys, Radix 1024
Cholesky                           tk18.O Input File
Water Spatial & Nsquared           512 Molecules, 3 Steps
Ocean                              258x258 Grid
Barnes                             16384 Particles
FMM                                16384 Particles
Raytrace                           Teapot Geometry
Radiosity                          Room in batch mode

4.4 Performance of Communication Paths

The efficiency of NUMAchine's interconnection network can be shown using a number of performance metrics. Figure 17 depicts the utilization of the station buses, local rings and central ring. It indicates that none of these components is likely to become a performance bottleneck. Figure 18 shows the delays in ring interfaces. Each vertical bar shows two components of the delay. The lower portion of each bar corresponds to the minimum delay (in the absence of network traffic) and the upper portion indicates additional delay due to traffic contention. The average packet delays in the upward and downward paths in the local ring interfaces are shown in Figure 18(a). The upward path delay is small for all applications. The larger delays for the downward paths are due to the way in which we have implemented the packet handler and the queues, which we are currently redesigning to reduce these delays.


Figure 15: Network cache total hit rate for Barnes, Radix, FFT, LU, Ocean, and Water, broken down into the migration effect and the caching effect.

Figure 16: Network cache combining rate for Barnes, Radix, FFT, LU, Ocean, and Water.

Packet delays from the central ring to a local ring have the same small delays as for the upward path in Figure 18(a). The average packet delays from a local ring to the central ring are only slightly larger, as shown in Figure 18(b). This indicates that for our targeted system size the rings are expected to perform well.

4.5 Performance of the Network Cache

The simplest measure of Network Cache performance is the hit rate, defined as

    hit rate = (number of requests satisfied locally) / (total number of requests, not counting retries)

and shown in Figure 15. (Note that local interventions are counted in the numerator.) Retries are generated locally when a cache line is locked in the NC due to a pending remote request. This locking could be due to a request to the same or a different cache line (cache conflict) from another processor.


Figure 17: Average utilization of communication paths (bus, local ring, and central ring) for Barnes, Radix, FFT, LU, Ocean, and Water.

Figure 18: Local and central ring interface delays. (a) Delays in the local ring interface (send path, and downward path for sinkable and non-sinkable packets). (b) Delay in the upward path of the central ring interface.

However, given the large size of the NC and the fact that there can only be 4 outstanding requests at one time (because each R4400 processor can generate only one request at a time), the chances of such a conflict are slim. Most retries are due to concurrent requests to the same cache line. When the pending request returns through the network and unlocks the line, the next retry will succeed (assuming the line is not ejected in the interim, which is unlikely). This masking-out of simultaneous requests is termed the 'combining' effect, since multiple requests result in only a single network access. This effect is displayed in Figure 16.

Another reduction in network traffic is gained from what is termed the 'migration' effect. In essence, when data brought onto a station is then accessed by another processor, a remote access is potentially saved. This is true both for data that is dirty, as well as data that is shared. It is worthy of note that a system utilizing bus snooping would also see this benefit, but only for dirty data.


Table 3: Percentage of local requests to the NC that result in a false remote request being sent to memory.

Application    Percentage of false remotes
Cholesky       < 0.5
FMM            < 1.0
Ocean          < 0.3
Radiosity      < 0.2
Radix          < 0.5
All others     << 0.01

4.6 Performance of the Coherence Protocol

As mentioned at the beginning of this section, the coherence protocol was designed under the assumption that certain cases would occur only infrequently. To assess the validity of that assumption, we measure the frequency of those cases here.

The first case involves the inexactness of the filter mask. It is possible that an "old" write permission request that has been travelling through the network for some time can reach memory after previous requests have invalidated the requester's shared copy. Since the filter mask is not precise, it is possible for the memory module to erroneously believe that the requester still has a shared copy, in which case it will respond with only an acknowledgement, granting ownership to the requester. The requester will see the ownership grant, but will not have valid data. In this case, the requester must send a special write request to memory, indicating that data must be returned. The above scheme is an optimistic design, in that a memory module always assumes that a requester has correct data, in spite of the ambiguous directory information. The alternative would be for the memory module to assume that the requester does not have valid data when such ambiguity arises; this implies that data would always be sent, and this would be wasteful unless that data is almost always needed. The simulation results shown below indicate that the optimistic choice is the right one. Across all the applications and for all system geometries (representing hundreds of millions of requests to memory) only 4 special requests were ever sent. This result is a manifestation of the well-known property of multiprocessor systems that a given cache line is almost always shared by 1, 2 or all processors, and very rarely by some number in between; the chances that three stations share a line in just the right way for the optimistic assumption to fail are small.
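In rough schematic form, the optimistic decision and its rare fallback look as follows; the real logic is a hardware state machine, and every helper name here is invented for illustration only.

    /* Schematic of the optimistic choice described above; not actual code. */
    struct request;

    extern int  filter_mask_shows_shared_copy(const struct request *req);
    extern void send_ownership_ack(const struct request *req);       /* no data   */
    extern void send_ownership_with_data(const struct request *req); /* with data */

    /* At the memory module, on a write-permission (upgrade) request: */
    void memory_handle_upgrade(const struct request *req)
    {
        if (filter_mask_shows_shared_copy(req))
            send_ownership_ack(req);        /* optimistic: assume the requester
                                               still holds a valid copy        */
        else
            send_ownership_with_data(req);
    }

    extern int  line_valid_locally(const void *line);
    extern void send_special_request_for_data(const void *line);

    /* At the requester, when the ownership grant arrives without data: */
    void requester_handle_grant(const void *line)
    {
        if (!line_valid_locally(line))      /* rare: the shared copy was
                                               invalidated while the request
                                               was in flight                  */
            send_special_request_for_data(line);
    }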

The second case of interest arises due to the direct-mapped nature of the network caches. It is possible for the network cache to lose directory information due to replacements by other requests. The most costly effect of this choice is when data has been made dirty locally on a station, but this information is subsequently thrown out of the NC. A request for this line now misses, and is sent to memory, which sends the request back, indicating that its filter mask shows the local station already has that data, in LV state. At this point the NC does the intervention that it could have done immediately if the directory information had not been lost. We call these types of misses false remote requests. Again the simulations show that this case happens very infrequently. Table 3 indicates the percentage of all local requests that end up generating false remote requests. Only for one application, FMM, does the percentage approach 1%.

Both of the above cases arise due to a loss of information in the coherence protocol. (In one case it is imprecision in the directory bits; in the other it is the wholesale loss of all local directory information.) The conclusion is that full state/directory information is not necessary for the efficiency of the cache coherence protocol. The cases for which the protocol chooses simplicity over efficiency are those that happen rarely enough that overall performance is not affected.


5 Related Work

Over the past few years, a number of scalable multiprocessors that support a single coherent view of memory have been designed and/or built. In this section, some of the features of recent machines are considered, in order to show how NUMAchine compares to other approaches.

The Stanford DASH multiprocessor [15] uses clusters of processors that share a single bus, with clusters interconnected by a mesh. It uses a directory-based hardware cache coherence protocol that, on a write to a shared cache line, requires separate invalidates to be sent for each of the copies, and requires acknowledgments for each invalidate. In the NUMAchine protocol, only a single invalidate message is used, and no acknowledgements are required. DASH employs a small cache in each cluster called a Remote Access Cache (RAC). NUMAchine's network cache includes the functionality of the RAC; however, the key to the effectiveness of NUMAchine's network cache is its large capacity, being at least as large as the combined capacities of the secondary caches on a station.

The FLASH multiprocessor [13], under development at Stanford University, will provide a single address space with integrated message passing support. A programmable co-processor, called MAGIC, serves as a memory and I/O controller, a network interface, and a communication and coherence protocol processor. Through this programmable co-processor, FLASH provides a high degree of flexibility. NUMAchine uses a different approach to providing flexibility. The basic protocols, such as coherence, are implemented in hardware to ensure good performance, but software has the ability to override the hardware when different protocols are desirable.

The Alewife machine from MIT [1] shares the FLASH approach of integrating a single address space with message passing. Its approach to achieving flexibility is to implement common-case critical path operations in hardware, letting software handle exceptional or unusual conditions. For example, it uses limited directories [8] to implement cache coherence, where hardware directly supports a small number of pointers, and software must handle the case when cache lines are shared by a larger number of processors. An important difference between Alewife and NUMAchine is that Alewife relies on a great deal of custom hardware. As a result, it is harder for Alewife to track the rapid improvements in workstation technology.

The KSR multiprocessors [12] from Kendall Square Research use a ring-based hierarchical network to interconnect up to 1088 processing cells. These systems implement a Cache Only Memory Architecture (COMA), which automatically replicates data to requesting cells. Although NUMAchine uses a similar interconnection topology, there are a number of fundamental differences between the two networks. In the KSR systems, each processing cell must snoop on ring traffic to maintain cache coherence. This effectively involves a directory lookup and slows the speed of operation. Furthermore, a combined cache directory is needed at each level in the interconnect hierarchy, containing all the directory information in the levels below, which severely limits the scalability of the architecture. The replication of data in the COMA memory is effective in reducing memory and network contention [6]. NUMAchine captures most of these benefits with its network caches, but without affecting scalability and at a considerably reduced cost.

Other interesting multiprocessor projects include the ASURA [17] multiprocessor being developed at Kyoto University in Japan, Typhoon [20] from the University of Wisconsin, the Cray T3D system [18] from Cray Research, and the Exemplar from Convex [9]. ASURA has many similarities with NUMAchine, but its equivalent of the network cache uses very long cache line sizes (1 Kbyte), which may lead to considerable false sharing. Typhoon has similar flexibility goals to FLASH, and also depends on a programmable co-processor to implement its coherence policy. The T3D does not support cache coherence in hardware. The Exemplar uses a crossbar to interconnect processors in a cluster and uses SCI rings to interconnect clusters and maintain inter-cluster coherence. The distributed directory-based protocol implemented by SCI, using linked lists, can introduce considerable cache coherence latency overhead.


6 Concluding remarks

In order to be successful, future multiprocessor systems must be cost-effective, modular, and easy to program for efficient parallel execution. The NUMAchine project seeks to address these issues by developing a cost-effective high-performance hardware platform supported by software to ease the task of developing parallel applications and maximizing parallel performance. In this report we have provided an overview of the NUMAchine hardware architecture and presented simulation results to demonstrate some of the implications of the architecture on performance.

The NUMAchine ring hierarchy gives the desired simplicity of implementation. Since there are only three connections to each node, it is possible to use wide datapaths. We have developed a simple routing mechanism that allows the rings to be clocked at high speed. As shown in the evaluation section, the bisection bandwidth of our network is sufficient for typical applications running on the target system size. In addition, the high-speed operation results in low latency for remote accesses.

The hierarchical nature of the NUMAchine rings allows for a natural implementation of multicasts. This feature is exploited by the coherence mechanism to invalidate multiple cache lines using a single packet. It is also exploited to implement an efficient multicast interrupt mechanism and to implement, in hardware, support for efficient barrier synchronization.

The cache coherence support in NUMAchine is highly optimized for applications where most sharing is localized within a single station, in which case coherence is controlled by the local memory or network cache and no remote interactions are required. A two-level directory structure is used, where the number of bits per cache line grows only logarithmically with the number of processors in the system.

In addition to localizing coherence traffic, the network cache serves as a larger shared tertiary cache for the processors on the station. It is implemented in DRAM, which will allow us to experiment with very large cache sizes in order to avoid remote accesses. Also, the network cache serves as a target for such operations as multicast writes; system software can cause cache lines to be multicast to a set of stations where it is expected that the data will soon be required.

The NUMAchine architecture is one component of the larger NUMAchine project, which involves development of a new operating system, parallelizing compilers, a number of tools for aiding in correctness and parallel performance debugging, and a large set of applications. For this reason, our prototype will include extensive monitoring support. Also, it will allow system software to take control of the low-level features of the hardware, facilitating experimentation into hardware-software interaction.

References

[1] A. Agarwal, D. Chaiken, G. D'Souza, et al. The MIT Alewife machine: A large-scale distributed-memory multiprocessor. Technical Report MIT/LCS Memo TM-454, Laboratory for Computer Science, Massachusetts Institute of Technology, 1991.

[2] R. Balan and K. Gollhardt. A scalable implementation of virtual memory HAT layer for shared memory multiprocessor machines. In Summer '92 USENIX, pages 107-115, San Antonio, TX, June 1992.

[3] To be filled in later, 1995.

[4] To be filled in later, 1995.

[5] To be filled in later, 1995.

[6] R. Bianchini, M. E. Crovella, L. Kontoothanassis, and T. J. LeBlanc. Memory contention in scalable cache-coherent multiprocessors. Technical Report 448, Computer Science Department, University of Rochester, 1993.

[7] W. J. Bolosky, R. P. Fitzgerald, and M. L. Scott. Simple but effective techniques for NUMA memory management. In Proc. of the 12th ACM Symp. on Operating System Principles, pages 19-31, 1989.

[8] D. Chaiken, J. Kubiatowicz, and A. Agarwal. LimitLESS directories: A scalable cache coherence scheme. In Proc. of the Fourth Int'l Conf. on ASPLOS, pages 224-234, New York, April 1991.

[9] Convex Computer Corporation. Convex Exemplar Systems Overview, 1994.

[10] K. Farkas, Z. Vranesic, and M. Stumm. Cache consistency in hierarchical-ring-based multiprocessors. Tech. Rep. CSRI-273, Computer Systems Research Institute, Univ. of Toronto, Ontario, Canada, January 1993.

[11] K. Farkas, Z. Vranesic, and M. Stumm. Scalable cache consistency for hierarchically-structured multiprocessors. Journal of Supercomputing, 1995. In press.

[12] Kendall Square Research. KSR1 Technical Summary, 1992.

[13] J. Kuskin, D. Ofelt, M. Heinrich, et al. The Stanford FLASH multiprocessor. In Proc. of the 21st Annual ISCA, pages 302-313, Chicago, Illinois, April 1994.

[14] R. P. LaRowe Jr. and C. S. Ellis. Experimental comparison of memory management policies for NUMA multiprocessors. ACM Transactions on Computer Systems, 9(4):319-363, Nov. 1991.

[15] D. Lenoski, J. Laudon, K. Gharachorloo, et al. The Stanford DASH multiprocessor. Computer, 25(3):63-79, March 1992.

[16] D. E. Lenoski. The design and analysis of DASH: A scalable directory-based multiprocessor. Technical Report CSL-TR-92-507, Stanford University, January 1992.

[17] S. Mori, H. Saito, M. Goshima, et al. A distributed shared memory multiprocessor: ASURA -- memory and cache architectures. In Supercomputing '93, pages 740-749, Portland, Oregon, November 1993.

[18] W. Oed. The Cray Research massively parallel processor system CRAY T3D. Technical report, Cray Research GmbH, Munchen, Germany, Nov. 15, 1993.

[19] S. K. Reinhardt, B. Falsafi, and D. A. Wood. Kernel support for the Wisconsin Wind Tunnel. In Usenix Symposium on Microkernels and Other Kernel Architectures, pages 73-89, September 1993.

[20] S. K. Reinhardt, J. R. Larus, and D. A. Wood. Tempest and Typhoon: User-level shared memory. In Proc. of the 21st Annual ISCA, pages 325-336, Chicago, Illinois, April 1994.

[21] H. S. Sandhu, B. Gamsa, and S. Zhou. The shared regions approach to software cache coherence on multiprocessors. In Proc. of the 4th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, May 1993.

[22] J. E. Veenstra. Mint Tutorial and User Manual. Technical Report 452, Computer Science Department, University of Rochester, May 1993.
