graphpim: enabling instruction-level pim offloading in ... · ⎮ pim offloading for atomic...

GraphPIM:EnablingInstruction-LevelPIMOffloadinginGraphComputingFrameworks

LifengNai,RamyadHadidi,JaewoongSim*,HyojongKim,PranithKumar,HyesoonKim

GeorgiaTech,*IntelLabs

INTRODUCTION

⎮ Graphcomputing:processingbignetworkdata} Socialnetwork,knowledgenetwork,bioinformatics,etc.

⎮ Graphcomputingisinefficientonconventionalarchitectures} Inefficiencyinmemorysubsystems

2

INTRODUCTION

⎮ Processing-in-memory(PIM)} PIMhasthepotentialofhelpinggraphcomputingperformance} PIMisbeingrealizedinrealproducts:Hybridmemorycube(HMC)2.0

3

EnablePIMforgraphcomputing

CHALLENGES

⎮ WhatarethebenefitsofPIMforgraphcomputing?} KnownbenefitsofPIM

} Bandwidthsavings,latencyreduction,morecomputationpower} But,theyarenotgoodenough} Weexploresomethingmore!

⎮ HowtoenablePIMforgraphinapracticalway?} Minorhardware/softwarechange} Noprogrammerburden

4

OVERVIEW 5

GraphPIM:aPIM-enabled graphframework

WeidentifyanewbenefitofPIMoffloading

WedeterminePIMoffloadingtargets

WeenablePIMwithoutuser-applicationchange

KNOWNPIMBENEFITS

⎮ Morecomputationpower} Extracomputationunitsinmemory

⎮ Bandwidthsavings} Pullingdatavs.pushingcomputationcommand

⎮ Latencyreduction} Bypassingcacheforoffloadedaccessesavoidscache-checkingoverhead} Avoidscachepollutionandincreaseseffectivecachesize

Request Response Total64-byteREAD 1FLIT(addr) 5FLITs(data) 6FLITs

64-byteWRITE 5FLITs(addr,data) 1FLIT(ack) 6 FLITs

CPUrd-modify-wr(rd-Miss;wr-evict)

6FLIT 6FLIT 12FLITs

PIMrd-modify-wr 2FLITs(addr,imm) 1 FLIT(ack) 3FLITs

6

(FLIT:16byte,basicflowunit)

PERFORMANCEBENEFITS

⎮ Morecomputationpower?} Limited#ofFUsinmemory

⎮ Bandwidthsavings?} NotBWsaturated

⎮ Latencyreduction?} Yes,butsmall

7

GRAPHPIM EXPLORES…

⎮ Atomicoverheadreduction} AtomicinstructionsonCPUshavesubstantialoverhead[Schweizer’15]

} RMW:read-modify-write} Cacheoperation:cache-lineinvalidation,coherencetrafficetc.} Dataordering:writebufferdraining,pipelinefreezeetc.

} Becauseofthecharacteristicsofgraphprogrammingmodel,PIMoffloadingcanavoidtheatomicoverhead

H.Schweizer etal.,“EvaluatingtheCostofAtomicOperationsonModernArchitectures,”PACT’15

AtomicInstruction

RMWDataOrdering CacheOperation

8

ATOMICOVERHEADREDUCTION 9

RMWDataOrdering CacheOperation

PipelineStall(Serialization)

Retire

CPU

RMW

Offload

ACK

Offload

ACK

RetireCPU

PIM Atomic

ContinueExecution

SerializationinPIM

ATOMICOVERHEADESTIMATION

⎮ AtomicoverheadexperimentsonaXeonE5machine} AtomicRMWà regularload+compute+store

⎮ Atomicinstructionsincur30%performancedegradation

0

0.5

1

1.5

2

BFS CComp DC kCore SSSP TC BC PRank GMean

Normalize

dExecution

Time

AtomicNon-Atomic

10

(Non-Atomic:artificialexperiment,notpreciseestimation)

PERFORMANCEBENEFITS

⎮ Morecomputationpower?} Limited#ofFUsinmemory

⎮ Bandwidthsavings?} NotBWsaturated

⎮ Latencyreduction?} Yes,butsmall

⎮ Atomicoverheadreduction?} Yesandsignificant!} MainsourceofPIMbenefitforgraph

11

OFFLOADINGTARGETS

⎮ Codesnippet:Breadth-firstsearch(BFS)

F← {source}while F isnotempty

F’← {∅}for eachu∈ F inparallel

d← u.depth+1for eachv∈ neighbor(u)

ret←CAS(v.depth,inf,d)ifret==success

F’← F’ ∪ vendif

endforendforbarrier()F← F’

endwhile

123456789101112131415

F:frontiervertexsetofcurrentstepF’:frontiervertexsetofnextstepu.depth:depthvalueofvertexuneighbor(u):neighborverticesofuCAS(v.depth,inf,d):atomiccompareandswapoperation

line4-5,8-10:accessingmetadata

line6:accessinggraphstructure

line7:accessinggraphproperty

✔

✔

12

CacheFriendly

CacheUnfriendly+Atomic

Offloadatomic operationsongraphproperty

INDICATEOFFLOADINGTARGETS 13

⎮ Howtoindicateoffloadingtargets?

⎮ Option#1:Markinstructions} NewinstructionsinISA} Requireschangesinuser-levelapplications

⎮ Option#2:Markmemoryregions} Specialmemoryregionforoffloadingdata} Canbetransparenttoapplicationprogrammers

GRAPHFRAMEWORK

⎮ Graphcomputingisframework-based

} Userapplicationisdesignedontopofframeworkinterfaces} Dataismanagedwithintheframework

14

UserApplication

G=load_graph(“Fruit”);V1=G.find_vertex(“Apple”);V1.property().price=5;V1.add_neighbor(“Orange”);

load_graph (Framework)G_structure =malloc(size1);G_property = malloc(size2);Openfile&loaddata

GraphAP

Is

Middlew

are

ENABLEPIM INGRAPHFRAMEWORK

…

GraphAPI

Middleware

OS

HardwareArchitecture

GraphDataManagement

UserApplication

UserApplication

15

SWHW

GraphFramework

HostProcessorCore

PIMOffloadingUnit

malloc()à pmr_malloc()

GraphProperty

No userapplicationchangeload_graph

G_structure =malloc(size1);G_property =malloc(size2);Openfile&loaddata

pmr_malloc(size2);

FRAMEWORKCHANGE

⎮ PIMmemoryregion(PMR)} Uncacheable memoryregioninvirtualmemoryspace} Utilizingexistinguncacheable (UC)supportinX86

⎮ Frameworkchange} malloc()à pmr_malloc()} pmr_malloc():customizedmalloc functionthatallocatesmemobjectsin

PMR

VirtualMemorySpacePIMMemRegion

GraphProperty

pmr_malloc()Graph

Structure

malloc()

OtherData

GraphDataManagement(Framework)

16

ARCHITECTURECHANGE

⎮ PIMoffloadingunit(POU)} Identifiesatomic instructionsthatareaccessingPIMMemoryRegion} OffloadsthemasPIMinstructions

Core

Caches

POU

HMC

Atom

ic

Unit

HostProcessor HMC

HardwareArchitecture

17

CHANGES

⎮ Softwarechanges:} Nouserapplicationchange} Minorchangeinframework:malloc()à pmr_malloc()

⎮ Hardwarechanges:} PIMmemoryregion:utilizesexistinguncacheable (UC)support} PIMoffloadingunit(POU):identifiesoffloadingtargets

Noburdenonprogrammers+MinorHW/SWchange

18

EVALUATION

⎮ SimulationEnvironment} SST(framework)+MacSim (CPU)+VaultSim (HMC)

⎮ Benchmark} GraphBIGbenchmarksuite[Nai’15](https://github.com/graphbig)} LDBCdatasetfromLinkedDataBenchmarkingCouncil(LDBC)

⎮ Configuration} 16OoO cores,2GHz,4-issue} 32KBL1/256KBL2/16MBsharedL3} HMC2.0spec,8GB,32vaults,512banks,4links,120GB/sperlink

19

L.Naietal.“GraphBIG:UnderstandingGraphComputingintheContextofIndustrialSolutions,”SC’15

EVALUATION:PERFORMANCE

⎮ Baseline:NoPIMoffloading⎮ U-PEI:Performanceupper-boundofPIM-enabledinstructions[Ahn’15]

0.0%

1.0%

2.0%

3.0%

0

1

2

3

%PIM-AtomicinAll

Instructions

Speedu

poverBaseline

Baseline U-PEI GraphPIM %PIM-Atomic

Upto2.4XspeedupOnaverage1.6Xspeedup

20

J.Ahn etal.“PIM-EnabledInstructions:ALow-Overhead,Locality-AwarePIMArchitecture,”ISCA’15

EVALUATION:EXECUTIONTIMEBREAKDOWN

⎮ Breakdownofnormalizedexecutiontime} Atomic-inCore:atomicoverheadofoffloadingtargets(atomicinst.)} Atomic-inCache:cache-checkingoverheadofoffloadingtargets

00.20.40.60.81

Baseline

GraphP

IM

Baseline

GraphP

IM

Baseline

GraphP

IM

Baseline

GraphP

IM

Baseline

GraphP

IM

Baseline

GraphP

IM

Baseline

GraphP

IM

Baseline

GraphP

IM

BFS CComp DC kCore SSSP TC BC PRank

Normalize

dExecution

Time

Other Atomic-inCore Atomic-inCache

21

CONCLUSION

⎮ Graphcomputingisinefficiencyonconventionalarchitectures

⎮ GraphPIMenablesPIMingraphcomputingframeworks} ExploresanewbenefitofPIMoffloading:atomicoverhead reduction} Identifiesatomicoperations ongraphproperty astheoffloadingtarget} Requiresnouser-applicationchange andonlyminorchange in

frameworkandarchitecture

22

THANKYOU!

BACKUPSLIDES

DYNAMICCACHEBEHAVIOR? 25

⎮ GraphPIMmarksmemoryregionstatically} Cons:cannotbeadaptive totheworkingsetsizes} But,propertyaccessestographshaveveryhighcachemissesregardless

ofgraphinputsexceptforreallysmallgraphsizes[JPDC’16,SC’15]} Pros:coherence supportbetweenmemoryandprocessor-cacheisnot

required

CONSISTENCY? 26

⎮ PIMoffloadingforatomicinstructionsworksfinebecause…} Theprogrammingmodelofgraphapplicationsnaturallyavoids

consistencyissues:allPIMwritesaredonebeforereads} Graphapplicationsrequireonlyatomicityfromatomicinstructions} But,atomicinstructionsinCPUsdon’tallowtospecifyatomicitywithout

fence

} Wealsohaveafollow-upworkdiscussingtheconsistencyissueforPIMinstructionsinthecontextofgeneralapplications[MEMSYS’17]

CONSISTENCY? 27

⎮ GraphapplicationswithBSPmodelnaturallyavoidsconsistencyissues} BarriersensuresallPIMwritesaredonebeforereads

ProgramPhases Operationloop:foreach vertex intaskqueue:readpropertyfetchneighborlistforeach neighbor:updateneighborpropertyupdatenext-iter taskqueue

barrier

// Reads

//HMCInst.

WHYGRAPHPIM ISAFRAMEWORK? 28

⎮ GraphPIM} Considerstheseparation offrameworkanduserapplication} Proposesafull-stack solution:SWframework+HWarchitecture} Requiresno applicationprogrammers’efforts

⎮ Userscaneasilyenable/disableGraphPIMbyswitchingbetweendifferentframeworklibraries.

BANDWIDTHSENSITIVITY? 29

⎮ GraphonCPUsarenotverysensitivetoBWchanges} SpeedupoverbaselinesystemwithdifferentHMClinkbandwidth

0

0.5

1

1.5

2

2.5


Speedu

poverBaseline

Baseline Baseline-half-BW Baseline-double-BW

GraphPIM GraphPIM-Half-BW GraphPIM-Double-BW

APPLICABILITY? 30

Category Workload Applicable? OffloadingTarget PIMInst.

GraphTraversal

Breadth-firstsearch ✔ lockcmpxchg CASifequalDegreecentrality ✔ lockaddw Singed addBetweenness centrality ✘ (Floating pointadd) (FPadd)Shortest path ✔ lockcmpxchg CASifequalK-core decomposition ✔ locksubw Singed addConnectedcomponent ✔ lockcmpxchg CASifequalPagerank ✘ (Floating pointadd) (FPadd)

DynamicGraph

Graphconstruction ✘ (Complexoperation)Graphupdate ✘ (Complexoperation)Topologymorphing ✘ (Complexoperation)

RichProperty

Trianglecount ✔ lockadd Singed addGibbs inference ✘ (Compute intensive)

✔

✔

GRAPHPIM:EVALUATION 31

⎮ GraphPIMspeedupoverbaselinewithdifferentdatasetsizes

0

1

2

3

4

5

BFS CComp DC kCore SSSP TC BC PRank

Speedu

p

LDBC-1M LDBC-100k LDBC-10k LDBC-1k


⎮ Normalizeduncore energyconsumption

00.20.40.60.81

Baseline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IMBa

seline

GraphP

IM


Normalize

dEn

ergy

Breakdow

n

Caches HMCLink HMCFU HMCLogicLayer HMCDRAM

Onaverage,GraphPIM saves37% ofuncore energybecauseofreductionincacheaccessesandmemory

bandwidth


⎮ Normalizedbandwidthconsumptionwithrequest/responsebreakdown

00.20.40.60.81

1.2

Baseline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IMBa

seline

U-PEI

GraphP

IM

BFS CComp DC kCore SSSP TC BC PRankNormalize

dMem

oryBa

ndwidth

Request Response


⎮ Performanceandenergyresultsoftworeal-worldapplications} Basedonananalyticalmodel} FD:Financialfrauddetection;RS:Recommendersystem

0

0.5

1

1.5

2

2.5

Baseline

GraphP

IM

Baseline

GraphP

IM

FD RS

Speedu

poverbaseline

00.20.40.60.81

Baseline

GraphP

IM

Baseline

GraphP

IM

FD RS

Normalize

dEn

ergyBreakdo

wn

Caches HMCLink HMCOther

HARDWARECHANGES 35

⎮ PIMMemoryRegion(PMR)} Auncacheable memoryregioninvirtualmemoryspace} Utilizingexistinguncacheable (UC)supportinX86

⎮ PIMOffloadingUnit(POU)

Links

Core

Vault00Logic DRAMPartitionAtomicUnit



...Switch

HMCCo

ntroller

Last-le

velCache

HMCHOST

L2L1

Core

POU

... ...L1 L2

PIMMemRegion?

AtomicInst?

HMCPIMRequest

toL1NN

Y YMemoryInst.

POU

toHMCMemReq

MOTIVATION 36

⎮ ProfilingusingHWperformancecounters} Executioncyclebreakdown:top-downmethodologyfromIntel

050

100150200

Miss

esPerKilo

Instructions

L1D L2 L30%

25%

50%

75%

100%

ExecutionCycle

Breakdow

n

Backend Frontend BadSpeculation Retiring

Bottleneckcausedbybackendstalls

Highnumberofcachemisses

BACKGROUND:PIMOFFLOADINGINHMC2.0 37

⎮ HybridMemoryCube(HMC)2.0} OneofthefirstindustrialPIMproposals} Instruction-level PIMoffloading

} 1logicdie+4/8DRAMdies} 32Vaults} 4seriallinks

SerialLinks

TSVs

Logic LayerDRAM Layers

Vault


⎮ Packet-basedprotocol

⎮ RegularREAD/WRITE} FLIT:16-byte;basicflowunit

HeaderPayloadTail

Request Response64-byteREAD 1FLIT 5FLITs

64-byteWRITE 5FLITs 1FLIT

8byte 8byte0~256byte


⎮ PIMInstruction:read-modify-write (RMW)operation} SimilarasregularREAD/WRITE,justdifferentCMD intheHeader} DRAMbankislockedduringthewholeRMWforatomicity

PIM-ADD(addr,imm)

Header(PIM-ADD)addr,immTail

ACK

graphpim: enabling instruction-level pim offloading in ... · ⎮ pim offloading for atomic...

Documents