graphpim: enabling instruction-level pim offloading in ... · ⎮ pim offloading for atomic...

39
GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks Lifeng Nai , Ramyad Hadidi, Jaewoong Sim*, Hyojong Kim, Pranith Kumar, Hyesoon Kim Georgia Tech, *Intel Labs

Upload: others

Post on 05-Aug-2020

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

GraphPIM:EnablingInstruction-LevelPIMOffloadinginGraphComputingFrameworks

LifengNai,RamyadHadidi,JaewoongSim*,HyojongKim,PranithKumar,HyesoonKim

GeorgiaTech,*IntelLabs

Page 2: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

INTRODUCTION

⎮ Graphcomputing:processingbignetworkdata} Socialnetwork,knowledgenetwork,bioinformatics,etc.

⎮ Graphcomputingisinefficientonconventionalarchitectures} Inefficiencyinmemorysubsystems

2

Page 3: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

INTRODUCTION

⎮ Processing-in-memory(PIM)} PIMhasthepotentialofhelpinggraphcomputingperformance} PIMisbeingrealizedinrealproducts:Hybridmemorycube(HMC)2.0

3

EnablePIMforgraphcomputing

Page 4: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

CHALLENGES

⎮ WhatarethebenefitsofPIMforgraphcomputing?} KnownbenefitsofPIM

} Bandwidthsavings,latencyreduction,morecomputationpower} But,theyarenotgoodenough} Weexploresomethingmore!

⎮ HowtoenablePIMforgraphinapracticalway?} Minorhardware/softwarechange} Noprogrammerburden

4

Page 5: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

OVERVIEW 5

GraphPIM:aPIM-enabled graphframework

WeidentifyanewbenefitofPIMoffloading

WedeterminePIMoffloadingtargets

WeenablePIMwithoutuser-applicationchange

Page 6: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

KNOWNPIMBENEFITS

⎮ Morecomputationpower} Extracomputationunitsinmemory

⎮ Bandwidthsavings} Pullingdatavs.pushingcomputationcommand

⎮ Latencyreduction} Bypassingcacheforoffloadedaccessesavoidscache-checkingoverhead} Avoidscachepollutionandincreaseseffectivecachesize

Request Response Total64-byteREAD 1FLIT(addr) 5FLITs(data) 6FLITs

64-byteWRITE 5FLITs(addr,data) 1FLIT(ack) 6 FLITs

CPUrd-modify-wr(rd-Miss;wr-evict)

6FLIT 6FLIT 12FLITs

PIMrd-modify-wr 2FLITs(addr,imm) 1 FLIT(ack) 3FLITs

6

(FLIT:16byte,basicflowunit)

Page 7: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

PERFORMANCEBENEFITS

⎮ Morecomputationpower?} Limited#ofFUsinmemory

⎮ Bandwidthsavings?} NotBWsaturated

⎮ Latencyreduction?} Yes,butsmall

7

Page 8: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

GRAPHPIM EXPLORES…

⎮ Atomicoverheadreduction} AtomicinstructionsonCPUshavesubstantialoverhead[Schweizer’15]

} RMW:read-modify-write} Cacheoperation:cache-lineinvalidation,coherencetrafficetc.} Dataordering:writebufferdraining,pipelinefreezeetc.

} Becauseofthecharacteristicsofgraphprogrammingmodel,PIMoffloadingcanavoidtheatomicoverhead

H.Schweizer etal.,“EvaluatingtheCostofAtomicOperationsonModernArchitectures,”PACT’15

AtomicInstruction

RMWDataOrdering CacheOperation

8

Page 9: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

ATOMICOVERHEADREDUCTION 9

RMWDataOrdering CacheOperation

PipelineStall(Serialization)

Retire

CPU

RMW

Offload

ACK

Offload

ACK

RetireCPU

PIM Atomic

ContinueExecution

SerializationinPIM

Page 10: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

ATOMICOVERHEADESTIMATION

⎮ AtomicoverheadexperimentsonaXeonE5machine} AtomicRMWà regularload+compute+store

⎮ Atomicinstructionsincur30%performancedegradation

0

0.5

1

1.5

2

BFS CComp DC kCore SSSP TC BC PRank GMean

Normalize

dExecution

Time

AtomicNon-Atomic

10

(Non-Atomic:artificialexperiment,notpreciseestimation)

Page 11: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

PERFORMANCEBENEFITS

⎮ Morecomputationpower?} Limited#ofFUsinmemory

⎮ Bandwidthsavings?} NotBWsaturated

⎮ Latencyreduction?} Yes,butsmall

⎮ Atomicoverheadreduction?} Yesandsignificant!} MainsourceofPIMbenefitforgraph

11

Page 12: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

OFFLOADINGTARGETS

⎮ Codesnippet:Breadth-firstsearch(BFS)

F← {source}while F isnotempty

F’← {∅}for eachu∈ F inparallel

d← u.depth+1for eachv∈ neighbor(u)

ret←CAS(v.depth,inf,d)ifret==success

F’← F’ ∪ vendif

endforendforbarrier()F← F’

endwhile

123456789101112131415

F:frontiervertexsetofcurrentstepF’:frontiervertexsetofnextstepu.depth:depthvalueofvertexuneighbor(u):neighborverticesofuCAS(v.depth,inf,d):atomiccompareandswapoperation

line4-5,8-10:accessingmetadata

line6:accessinggraphstructure

line7:accessinggraphproperty

12

CacheFriendly

CacheUnfriendly+Atomic

Offloadatomic operationsongraphproperty

Page 13: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

INDICATEOFFLOADINGTARGETS 13

⎮ Howtoindicateoffloadingtargets?

⎮ Option#1:Markinstructions} NewinstructionsinISA} Requireschangesinuser-levelapplications

⎮ Option#2:Markmemoryregions} Specialmemoryregionforoffloadingdata} Canbetransparenttoapplicationprogrammers

Page 14: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

GRAPHFRAMEWORK

⎮ Graphcomputingisframework-based

} Userapplicationisdesignedontopofframeworkinterfaces} Dataismanagedwithintheframework

14

UserApplication

G=load_graph(“Fruit”);V1=G.find_vertex(“Apple”);V1.property().price=5;V1.add_neighbor(“Orange”);

load_graph (Framework)G_structure =malloc(size1);G_property = malloc(size2);Openfile&loaddata

GraphAP

Is

Middlew

are

Page 15: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

ENABLEPIM INGRAPHFRAMEWORK

GraphAPI

Middleware

OS

HardwareArchitecture

GraphDataManagement

UserApplication

UserApplication

15

SWHW

GraphFramework

HostProcessorCore

PIMOffloadingUnit

malloc()à pmr_malloc()

GraphProperty

No userapplicationchangeload_graph

G_structure =malloc(size1);G_property =malloc(size2);Openfile&loaddata

pmr_malloc(size2);

Page 16: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

FRAMEWORKCHANGE

⎮ PIMmemoryregion(PMR)} Uncacheable memoryregioninvirtualmemoryspace} Utilizingexistinguncacheable (UC)supportinX86

⎮ Frameworkchange} malloc()à pmr_malloc()} pmr_malloc():customizedmalloc functionthatallocatesmemobjectsin

PMR

VirtualMemorySpacePIMMemRegion

GraphProperty

pmr_malloc()Graph

Structure

malloc()

OtherData

GraphDataManagement(Framework)

16

Page 17: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

ARCHITECTURECHANGE

⎮ PIMoffloadingunit(POU)} Identifiesatomic instructionsthatareaccessingPIMMemoryRegion} OffloadsthemasPIMinstructions

Core

Caches

POU

HMC

Atom

ic

Unit

HostProcessor HMC

HardwareArchitecture

17

Page 18: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

CHANGES

⎮ Softwarechanges:} Nouserapplicationchange} Minorchangeinframework:malloc()à pmr_malloc()

⎮ Hardwarechanges:} PIMmemoryregion:utilizesexistinguncacheable (UC)support} PIMoffloadingunit(POU):identifiesoffloadingtargets

Noburdenonprogrammers+MinorHW/SWchange

18

Page 19: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

EVALUATION

⎮ SimulationEnvironment} SST(framework)+MacSim (CPU)+VaultSim (HMC)

⎮ Benchmark} GraphBIGbenchmarksuite[Nai’15](https://github.com/graphbig)} LDBCdatasetfromLinkedDataBenchmarkingCouncil(LDBC)

⎮ Configuration} 16OoO cores,2GHz,4-issue} 32KBL1/256KBL2/16MBsharedL3} HMC2.0spec,8GB,32vaults,512banks,4links,120GB/sperlink

19

L.Naietal.“GraphBIG:UnderstandingGraphComputingintheContextofIndustrialSolutions,”SC’15

Page 20: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

EVALUATION:PERFORMANCE

⎮ Baseline:NoPIMoffloading⎮ U-PEI:Performanceupper-boundofPIM-enabledinstructions[Ahn’15]

0.0%

1.0%

2.0%

3.0%

0

1

2

3

%PIM-AtomicinAll

Instructions

Speedu

poverBaseline

Baseline U-PEI GraphPIM %PIM-Atomic

Upto2.4XspeedupOnaverage1.6Xspeedup

20

J.Ahn etal.“PIM-EnabledInstructions:ALow-Overhead,Locality-AwarePIMArchitecture,”ISCA’15

Page 21: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

EVALUATION:EXECUTIONTIMEBREAKDOWN

⎮ Breakdownofnormalizedexecutiontime} Atomic-inCore:atomicoverheadofoffloadingtargets(atomicinst.)} Atomic-inCache:cache-checkingoverheadofoffloadingtargets

00.20.40.60.81

Baseline

GraphP

IM

Baseline

GraphP

IM

Baseline

GraphP

IM

Baseline

GraphP

IM

Baseline

GraphP

IM

Baseline

GraphP

IM

Baseline

GraphP

IM

Baseline

GraphP

IM

BFS CComp DC kCore SSSP TC BC PRank

Normalize

dExecution

Time

Other Atomic-inCore Atomic-inCache

21

Page 22: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

CONCLUSION

⎮ Graphcomputingisinefficiencyonconventionalarchitectures

⎮ GraphPIMenablesPIMingraphcomputingframeworks} ExploresanewbenefitofPIMoffloading:atomicoverhead reduction} Identifiesatomicoperations ongraphproperty astheoffloadingtarget} Requiresnouser-applicationchange andonlyminorchange in

frameworkandarchitecture

22

Page 23: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

THANKYOU!

Page 24: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

BACKUPSLIDES

Page 25: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

DYNAMICCACHEBEHAVIOR? 25

⎮ GraphPIMmarksmemoryregionstatically} Cons:cannotbeadaptive totheworkingsetsizes} But,propertyaccessestographshaveveryhighcachemissesregardless

ofgraphinputsexceptforreallysmallgraphsizes[JPDC’16,SC’15]} Pros:coherence supportbetweenmemoryandprocessor-cacheisnot

required

Page 26: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

CONSISTENCY? 26

⎮ PIMoffloadingforatomicinstructionsworksfinebecause…} Theprogrammingmodelofgraphapplicationsnaturallyavoids

consistencyissues:allPIMwritesaredonebeforereads} Graphapplicationsrequireonlyatomicityfromatomicinstructions} But,atomicinstructionsinCPUsdon’tallowtospecifyatomicitywithout

fence

} Wealsohaveafollow-upworkdiscussingtheconsistencyissueforPIMinstructionsinthecontextofgeneralapplications[MEMSYS’17]

Page 27: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

CONSISTENCY? 27

⎮ GraphapplicationswithBSPmodelnaturallyavoidsconsistencyissues} BarriersensuresallPIMwritesaredonebeforereads

ProgramPhases Operationloop:foreach vertex intaskqueue:readpropertyfetchneighborlistforeach neighbor:updateneighborpropertyupdatenext-iter taskqueue

barrier

// Reads

//HMCInst.

Page 28: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

WHYGRAPHPIM ISAFRAMEWORK? 28

⎮ GraphPIM} Considerstheseparation offrameworkanduserapplication} Proposesafull-stack solution:SWframework+HWarchitecture} Requiresno applicationprogrammers’efforts

⎮ Userscaneasilyenable/disableGraphPIMbyswitchingbetweendifferentframeworklibraries.

Page 29: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

BANDWIDTHSENSITIVITY? 29

⎮ GraphonCPUsarenotverysensitivetoBWchanges} SpeedupoverbaselinesystemwithdifferentHMClinkbandwidth

0

0.5

1

1.5

2

2.5

BFS CComp DC kCore SSSP TC BC PRank GMean

Speedu

poverBaseline

Baseline Baseline-half-BW Baseline-double-BW

GraphPIM GraphPIM-Half-BW GraphPIM-Double-BW

Page 30: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

APPLICABILITY? 30

Category Workload Applicable? OffloadingTarget PIMInst.

GraphTraversal

Breadth-firstsearch ✔ lockcmpxchg CASifequalDegreecentrality ✔ lockaddw Singed addBetweenness centrality ✘ (Floating pointadd) (FPadd)Shortest path ✔ lockcmpxchg CASifequalK-core decomposition ✔ locksubw Singed addConnectedcomponent ✔ lockcmpxchg CASifequalPagerank ✘ (Floating pointadd) (FPadd)

DynamicGraph

Graphconstruction ✘ (Complexoperation)Graphupdate ✘ (Complexoperation)Topologymorphing ✘ (Complexoperation)

RichProperty

Trianglecount ✔ lockadd Singed addGibbs inference ✘ (Compute intensive)

Page 31: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

GRAPHPIM:EVALUATION 31

⎮ GraphPIMspeedupoverbaselinewithdifferentdatasetsizes

0

1

2

3

4

5

BFS CComp DC kCore SSSP TC BC PRank

Speedu

p

LDBC-1M LDBC-100k LDBC-10k LDBC-1k

Page 32: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

GRAPHPIM:EVALUATION 32

⎮ Normalizeduncore energyconsumption

00.20.40.60.81

Baseline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IM

BFS CComp DC kCore SSSP TC BC PRank GMean

Normalize

dEn

ergy

Breakdow

n

Caches HMCLink HMCFU HMCLogicLayer HMCDRAM

Onaverage,GraphPIM saves37% ofuncore energybecauseofreductionincacheaccessesandmemory

bandwidth

Page 33: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

GRAPHPIM:EVALUATION 33

⎮ Normalizedbandwidthconsumptionwithrequest/responsebreakdown

00.20.40.60.81

1.2

Baseline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IM

BFS CComp DC kCore SSSP TC BC PRankNormalize

dMem

oryBa

ndwidth

Request Response

Page 34: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

GRAPHPIM:EVALUATION 34

⎮ Performanceandenergyresultsoftworeal-worldapplications} Basedonananalyticalmodel} FD:Financialfrauddetection;RS:Recommendersystem

0

0.5

1

1.5

2

2.5

Baseline

GraphP

IM

Baseline

GraphP

IM

FD RS

Speedu

poverbaseline

00.20.40.60.81

Baseline

GraphP

IM

Baseline

GraphP

IM

FD RS

Normalize

dEn

ergyBreakdo

wn

Caches HMCLink HMCOther

Page 35: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

HARDWARECHANGES 35

⎮ PIMMemoryRegion(PMR)} Auncacheable memoryregioninvirtualmemoryspace} Utilizingexistinguncacheable (UC)supportinX86

⎮ PIMOffloadingUnit(POU)

Links

Core

Vault00Logic DRAMPartitionAtomicUnit

Vault01Logic DRAMPartitionAtomicUnit

Vault31Logic DRAMPartitionAtomicUnit

...Switch

HMCCo

ntroller

Last-le

velCache

HMCHOST

L2L1

Core

POU

... ...L1 L2

PIMMemRegion?

AtomicInst?

HMCPIMRequest

toL1NN

Y YMemoryInst.

POU

toHMCMemReq

Page 36: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

MOTIVATION 36

⎮ ProfilingusingHWperformancecounters} Executioncyclebreakdown:top-downmethodologyfromIntel

050

100150200

Miss

esPerKilo

Instructions

L1D L2 L30%

25%

50%

75%

100%

ExecutionCycle

Breakdow

n

Backend Frontend BadSpeculation Retiring

Bottleneckcausedbybackendstalls

Highnumberofcachemisses

Page 37: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

BACKGROUND:PIMOFFLOADINGINHMC2.0 37

⎮ HybridMemoryCube(HMC)2.0} OneofthefirstindustrialPIMproposals} Instruction-level PIMoffloading

} 1logicdie+4/8DRAMdies} 32Vaults} 4seriallinks

SerialLinks

TSVs

Logic LayerDRAM Layers

Vault

Page 38: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

BACKGROUND:PIMOFFLOADINGINHMC2.0 38

⎮ Packet-basedprotocol

⎮ RegularREAD/WRITE} FLIT:16-byte;basicflowunit

HeaderPayloadTail

Request Response64-byteREAD 1FLIT 5FLITs

64-byteWRITE 5FLITs 1FLIT

8byte 8byte0~256byte

Page 39: GraphPIM: Enabling Instruction-Level PIM Offloading in ... · ⎮ PIM offloading for atomic instructions works fine because…}The programming model of graph applications naturally

BACKGROUND:PIMOFFLOADINGINHMC2.0 39

⎮ PIMInstruction:read-modify-write (RMW)operation} SimilarasregularREAD/WRITE,justdifferentCMD intheHeader} DRAMbankislockedduringthewholeRMWforatomicity

PIM-ADD(addr,imm)

Header(PIM-ADD)addr,immTail

ACK