graphpim: enabling instruction-level pim offloading in ... · ⎮ pim offloading for atomic...
TRANSCRIPT
GraphPIM:EnablingInstruction-LevelPIMOffloadinginGraphComputingFrameworks
LifengNai,RamyadHadidi,JaewoongSim*,HyojongKim,PranithKumar,HyesoonKim
GeorgiaTech,*IntelLabs
INTRODUCTION
⎮ Graphcomputing:processingbignetworkdata} Socialnetwork,knowledgenetwork,bioinformatics,etc.
⎮ Graphcomputingisinefficientonconventionalarchitectures} Inefficiencyinmemorysubsystems
2
INTRODUCTION
⎮ Processing-in-memory(PIM)} PIMhasthepotentialofhelpinggraphcomputingperformance} PIMisbeingrealizedinrealproducts:Hybridmemorycube(HMC)2.0
3
EnablePIMforgraphcomputing
CHALLENGES
⎮ WhatarethebenefitsofPIMforgraphcomputing?} KnownbenefitsofPIM
} Bandwidthsavings,latencyreduction,morecomputationpower} But,theyarenotgoodenough} Weexploresomethingmore!
⎮ HowtoenablePIMforgraphinapracticalway?} Minorhardware/softwarechange} Noprogrammerburden
4
OVERVIEW 5
GraphPIM:aPIM-enabled graphframework
WeidentifyanewbenefitofPIMoffloading
WedeterminePIMoffloadingtargets
WeenablePIMwithoutuser-applicationchange
KNOWNPIMBENEFITS
⎮ Morecomputationpower} Extracomputationunitsinmemory
⎮ Bandwidthsavings} Pullingdatavs.pushingcomputationcommand
⎮ Latencyreduction} Bypassingcacheforoffloadedaccessesavoidscache-checkingoverhead} Avoidscachepollutionandincreaseseffectivecachesize
Request Response Total64-byteREAD 1FLIT(addr) 5FLITs(data) 6FLITs
64-byteWRITE 5FLITs(addr,data) 1FLIT(ack) 6 FLITs
CPUrd-modify-wr(rd-Miss;wr-evict)
6FLIT 6FLIT 12FLITs
PIMrd-modify-wr 2FLITs(addr,imm) 1 FLIT(ack) 3FLITs
6
(FLIT:16byte,basicflowunit)
PERFORMANCEBENEFITS
⎮ Morecomputationpower?} Limited#ofFUsinmemory
⎮ Bandwidthsavings?} NotBWsaturated
⎮ Latencyreduction?} Yes,butsmall
7
GRAPHPIM EXPLORES…
⎮ Atomicoverheadreduction} AtomicinstructionsonCPUshavesubstantialoverhead[Schweizer’15]
} RMW:read-modify-write} Cacheoperation:cache-lineinvalidation,coherencetrafficetc.} Dataordering:writebufferdraining,pipelinefreezeetc.
} Becauseofthecharacteristicsofgraphprogrammingmodel,PIMoffloadingcanavoidtheatomicoverhead
H.Schweizer etal.,“EvaluatingtheCostofAtomicOperationsonModernArchitectures,”PACT’15
AtomicInstruction
RMWDataOrdering CacheOperation
8
ATOMICOVERHEADREDUCTION 9
RMWDataOrdering CacheOperation
PipelineStall(Serialization)
Retire
CPU
RMW
Offload
ACK
Offload
ACK
RetireCPU
PIM Atomic
ContinueExecution
SerializationinPIM
ATOMICOVERHEADESTIMATION
⎮ AtomicoverheadexperimentsonaXeonE5machine} AtomicRMWà regularload+compute+store
⎮ Atomicinstructionsincur30%performancedegradation
0
0.5
1
1.5
2
BFS CComp DC kCore SSSP TC BC PRank GMean
Normalize
dExecution
Time
AtomicNon-Atomic
10
(Non-Atomic:artificialexperiment,notpreciseestimation)
PERFORMANCEBENEFITS
⎮ Morecomputationpower?} Limited#ofFUsinmemory
⎮ Bandwidthsavings?} NotBWsaturated
⎮ Latencyreduction?} Yes,butsmall
⎮ Atomicoverheadreduction?} Yesandsignificant!} MainsourceofPIMbenefitforgraph
11
OFFLOADINGTARGETS
⎮ Codesnippet:Breadth-firstsearch(BFS)
F← {source}while F isnotempty
F’← {∅}for eachu∈ F inparallel
d← u.depth+1for eachv∈ neighbor(u)
ret←CAS(v.depth,inf,d)ifret==success
F’← F’ ∪ vendif
endforendforbarrier()F← F’
endwhile
123456789101112131415
F:frontiervertexsetofcurrentstepF’:frontiervertexsetofnextstepu.depth:depthvalueofvertexuneighbor(u):neighborverticesofuCAS(v.depth,inf,d):atomiccompareandswapoperation
line4-5,8-10:accessingmetadata
line6:accessinggraphstructure
line7:accessinggraphproperty
✔
✔
12
CacheFriendly
CacheUnfriendly+Atomic
Offloadatomic operationsongraphproperty
INDICATEOFFLOADINGTARGETS 13
⎮ Howtoindicateoffloadingtargets?
⎮ Option#1:Markinstructions} NewinstructionsinISA} Requireschangesinuser-levelapplications
⎮ Option#2:Markmemoryregions} Specialmemoryregionforoffloadingdata} Canbetransparenttoapplicationprogrammers
GRAPHFRAMEWORK
⎮ Graphcomputingisframework-based
} Userapplicationisdesignedontopofframeworkinterfaces} Dataismanagedwithintheframework
14
UserApplication
G=load_graph(“Fruit”);V1=G.find_vertex(“Apple”);V1.property().price=5;V1.add_neighbor(“Orange”);
load_graph (Framework)G_structure =malloc(size1);G_property = malloc(size2);Openfile&loaddata
GraphAP
Is
Middlew
are
ENABLEPIM INGRAPHFRAMEWORK
…
GraphAPI
Middleware
OS
HardwareArchitecture
GraphDataManagement
UserApplication
UserApplication
15
SWHW
GraphFramework
HostProcessorCore
PIMOffloadingUnit
malloc()à pmr_malloc()
GraphProperty
No userapplicationchangeload_graph
G_structure =malloc(size1);G_property =malloc(size2);Openfile&loaddata
pmr_malloc(size2);
FRAMEWORKCHANGE
⎮ PIMmemoryregion(PMR)} Uncacheable memoryregioninvirtualmemoryspace} Utilizingexistinguncacheable (UC)supportinX86
⎮ Frameworkchange} malloc()à pmr_malloc()} pmr_malloc():customizedmalloc functionthatallocatesmemobjectsin
PMR
VirtualMemorySpacePIMMemRegion
GraphProperty
pmr_malloc()Graph
Structure
malloc()
OtherData
GraphDataManagement(Framework)
16
ARCHITECTURECHANGE
⎮ PIMoffloadingunit(POU)} Identifiesatomic instructionsthatareaccessingPIMMemoryRegion} OffloadsthemasPIMinstructions
Core
Caches
POU
HMC
Atom
ic
Unit
HostProcessor HMC
HardwareArchitecture
17
CHANGES
⎮ Softwarechanges:} Nouserapplicationchange} Minorchangeinframework:malloc()à pmr_malloc()
⎮ Hardwarechanges:} PIMmemoryregion:utilizesexistinguncacheable (UC)support} PIMoffloadingunit(POU):identifiesoffloadingtargets
Noburdenonprogrammers+MinorHW/SWchange
18
EVALUATION
⎮ SimulationEnvironment} SST(framework)+MacSim (CPU)+VaultSim (HMC)
⎮ Benchmark} GraphBIGbenchmarksuite[Nai’15](https://github.com/graphbig)} LDBCdatasetfromLinkedDataBenchmarkingCouncil(LDBC)
⎮ Configuration} 16OoO cores,2GHz,4-issue} 32KBL1/256KBL2/16MBsharedL3} HMC2.0spec,8GB,32vaults,512banks,4links,120GB/sperlink
19
L.Naietal.“GraphBIG:UnderstandingGraphComputingintheContextofIndustrialSolutions,”SC’15
EVALUATION:PERFORMANCE
⎮ Baseline:NoPIMoffloading⎮ U-PEI:Performanceupper-boundofPIM-enabledinstructions[Ahn’15]
0.0%
1.0%
2.0%
3.0%
0
1
2
3
%PIM-AtomicinAll
Instructions
Speedu
poverBaseline
Baseline U-PEI GraphPIM %PIM-Atomic
Upto2.4XspeedupOnaverage1.6Xspeedup
20
J.Ahn etal.“PIM-EnabledInstructions:ALow-Overhead,Locality-AwarePIMArchitecture,”ISCA’15
EVALUATION:EXECUTIONTIMEBREAKDOWN
⎮ Breakdownofnormalizedexecutiontime} Atomic-inCore:atomicoverheadofoffloadingtargets(atomicinst.)} Atomic-inCache:cache-checkingoverheadofoffloadingtargets
00.20.40.60.81
Baseline
GraphP
IM
Baseline
GraphP
IM
Baseline
GraphP
IM
Baseline
GraphP
IM
Baseline
GraphP
IM
Baseline
GraphP
IM
Baseline
GraphP
IM
Baseline
GraphP
IM
BFS CComp DC kCore SSSP TC BC PRank
Normalize
dExecution
Time
Other Atomic-inCore Atomic-inCache
21
CONCLUSION
⎮ Graphcomputingisinefficiencyonconventionalarchitectures
⎮ GraphPIMenablesPIMingraphcomputingframeworks} ExploresanewbenefitofPIMoffloading:atomicoverhead reduction} Identifiesatomicoperations ongraphproperty astheoffloadingtarget} Requiresnouser-applicationchange andonlyminorchange in
frameworkandarchitecture
22
THANKYOU!
BACKUPSLIDES
DYNAMICCACHEBEHAVIOR? 25
⎮ GraphPIMmarksmemoryregionstatically} Cons:cannotbeadaptive totheworkingsetsizes} But,propertyaccessestographshaveveryhighcachemissesregardless
ofgraphinputsexceptforreallysmallgraphsizes[JPDC’16,SC’15]} Pros:coherence supportbetweenmemoryandprocessor-cacheisnot
required
CONSISTENCY? 26
⎮ PIMoffloadingforatomicinstructionsworksfinebecause…} Theprogrammingmodelofgraphapplicationsnaturallyavoids
consistencyissues:allPIMwritesaredonebeforereads} Graphapplicationsrequireonlyatomicityfromatomicinstructions} But,atomicinstructionsinCPUsdon’tallowtospecifyatomicitywithout
fence
} Wealsohaveafollow-upworkdiscussingtheconsistencyissueforPIMinstructionsinthecontextofgeneralapplications[MEMSYS’17]
CONSISTENCY? 27
⎮ GraphapplicationswithBSPmodelnaturallyavoidsconsistencyissues} BarriersensuresallPIMwritesaredonebeforereads
ProgramPhases Operationloop:foreach vertex intaskqueue:readpropertyfetchneighborlistforeach neighbor:updateneighborpropertyupdatenext-iter taskqueue
barrier
// Reads
//HMCInst.
WHYGRAPHPIM ISAFRAMEWORK? 28
⎮ GraphPIM} Considerstheseparation offrameworkanduserapplication} Proposesafull-stack solution:SWframework+HWarchitecture} Requiresno applicationprogrammers’efforts
⎮ Userscaneasilyenable/disableGraphPIMbyswitchingbetweendifferentframeworklibraries.
BANDWIDTHSENSITIVITY? 29
⎮ GraphonCPUsarenotverysensitivetoBWchanges} SpeedupoverbaselinesystemwithdifferentHMClinkbandwidth
0
0.5
1
1.5
2
2.5
BFS CComp DC kCore SSSP TC BC PRank GMean
Speedu
poverBaseline
Baseline Baseline-half-BW Baseline-double-BW
GraphPIM GraphPIM-Half-BW GraphPIM-Double-BW
APPLICABILITY? 30
Category Workload Applicable? OffloadingTarget PIMInst.
GraphTraversal
Breadth-firstsearch ✔ lockcmpxchg CASifequalDegreecentrality ✔ lockaddw Singed addBetweenness centrality ✘ (Floating pointadd) (FPadd)Shortest path ✔ lockcmpxchg CASifequalK-core decomposition ✔ locksubw Singed addConnectedcomponent ✔ lockcmpxchg CASifequalPagerank ✘ (Floating pointadd) (FPadd)
DynamicGraph
Graphconstruction ✘ (Complexoperation)Graphupdate ✘ (Complexoperation)Topologymorphing ✘ (Complexoperation)
RichProperty
Trianglecount ✔ lockadd Singed addGibbs inference ✘ (Compute intensive)
✔
✔
GRAPHPIM:EVALUATION 31
⎮ GraphPIMspeedupoverbaselinewithdifferentdatasetsizes
0
1
2
3
4
5
BFS CComp DC kCore SSSP TC BC PRank
Speedu
p
LDBC-1M LDBC-100k LDBC-10k LDBC-1k
GRAPHPIM:EVALUATION 32
⎮ Normalizeduncore energyconsumption
00.20.40.60.81
Baseline
GraphP
IMBa
seline
GraphP
IMBa
seline
GraphP
IMBa
seline
GraphP
IMBa
seline
GraphP
IMBa
seline
GraphP
IMBa
seline
GraphP
IMBa
seline
GraphP
IMBa
seline
GraphP
IM
BFS CComp DC kCore SSSP TC BC PRank GMean
Normalize
dEn
ergy
Breakdow
n
Caches HMCLink HMCFU HMCLogicLayer HMCDRAM
Onaverage,GraphPIM saves37% ofuncore energybecauseofreductionincacheaccessesandmemory
bandwidth
GRAPHPIM:EVALUATION 33
⎮ Normalizedbandwidthconsumptionwithrequest/responsebreakdown
00.20.40.60.81
1.2
Baseline
U-PEI
GraphP
IMBa
seline
U-PEI
GraphP
IMBa
seline
U-PEI
GraphP
IMBa
seline
U-PEI
GraphP
IMBa
seline
U-PEI
GraphP
IMBa
seline
U-PEI
GraphP
IMBa
seline
U-PEI
GraphP
IMBa
seline
U-PEI
GraphP
IM
BFS CComp DC kCore SSSP TC BC PRankNormalize
dMem
oryBa
ndwidth
Request Response
GRAPHPIM:EVALUATION 34
⎮ Performanceandenergyresultsoftworeal-worldapplications} Basedonananalyticalmodel} FD:Financialfrauddetection;RS:Recommendersystem
0
0.5
1
1.5
2
2.5
Baseline
GraphP
IM
Baseline
GraphP
IM
FD RS
Speedu
poverbaseline
00.20.40.60.81
Baseline
GraphP
IM
Baseline
GraphP
IM
FD RS
Normalize
dEn
ergyBreakdo
wn
Caches HMCLink HMCOther
HARDWARECHANGES 35
⎮ PIMMemoryRegion(PMR)} Auncacheable memoryregioninvirtualmemoryspace} Utilizingexistinguncacheable (UC)supportinX86
⎮ PIMOffloadingUnit(POU)
Links
Core
Vault00Logic DRAMPartitionAtomicUnit
Vault01Logic DRAMPartitionAtomicUnit
Vault31Logic DRAMPartitionAtomicUnit
...Switch
HMCCo
ntroller
Last-le
velCache
HMCHOST
L2L1
Core
POU
... ...L1 L2
PIMMemRegion?
AtomicInst?
HMCPIMRequest
toL1NN
Y YMemoryInst.
POU
toHMCMemReq
MOTIVATION 36
⎮ ProfilingusingHWperformancecounters} Executioncyclebreakdown:top-downmethodologyfromIntel
050
100150200
Miss
esPerKilo
Instructions
L1D L2 L30%
25%
50%
75%
100%
ExecutionCycle
Breakdow
n
Backend Frontend BadSpeculation Retiring
Bottleneckcausedbybackendstalls
Highnumberofcachemisses
BACKGROUND:PIMOFFLOADINGINHMC2.0 37
⎮ HybridMemoryCube(HMC)2.0} OneofthefirstindustrialPIMproposals} Instruction-level PIMoffloading
} 1logicdie+4/8DRAMdies} 32Vaults} 4seriallinks
SerialLinks
TSVs
Logic LayerDRAM Layers
Vault
BACKGROUND:PIMOFFLOADINGINHMC2.0 38
⎮ Packet-basedprotocol
⎮ RegularREAD/WRITE} FLIT:16-byte;basicflowunit
HeaderPayloadTail
Request Response64-byteREAD 1FLIT 5FLITs
64-byteWRITE 5FLITs 1FLIT
8byte 8byte0~256byte
BACKGROUND:PIMOFFLOADINGINHMC2.0 39
⎮ PIMInstruction:read-modify-write (RMW)operation} SimilarasregularREAD/WRITE,justdifferentCMD intheHeader} DRAMbankislockedduringthewholeRMWforatomicity
PIM-ADD(addr,imm)
Header(PIM-ADD)addr,immTail
ACK