Exploiting High Bandwidth Memory for Graph Algorithms
TRANSCRIPT
Exploiting High Bandwidth Memory for
Graph Algorithms
George M. Slota¹, Sivasankaran Rajamanickam², Cynthia Phillips², Jonathan Berry²
¹Rensselaer Polytechnic Institute, ²Sandia National Laboratories
[email protected], [email protected], [email protected], [email protected]
SIAM PP, 8 March 2018
Intro and Overview of Talk
Ongoing trend: expansion of the memory hierarchy for increased CPU throughput – e.g., the high-bandwidth memory (HBM) layer on current-generation Intel Xeon Phis (Knights Landing)
Can we explicitly design graph computations to effectively utilize this layer?
We explore a work-chunking approach that iteratively brings in pieces of a large graph to perform local updates in HBM – we specifically look at the label propagation algorithm. We find:
Chunking has minimal impact on solution quality
Chunking can also decrease time to solution
Primary assumption: the graphs being processed are too large to fit entirely within MCDRAM
Intel Knights Landing (KNL)
68–72 cores with high-bandwidth Multi-Channel DRAM (MCDRAM)
Source: Intel
Stream Triad Bandwidths
DDR: 90 GB/s
MCDRAM: 450 GB/s
Multiple MCDRAM modes
Cache Mode
Flat Mode
Hybrid Mode
Label Propagation
Randomly assign n = |V| unique labels, one per vertex
Iteratively update each v ∈ V(G) with the label of maximum per-label count over its neighbors, ties broken randomly
The algorithm completes when no further updates are possible; for large graphs, a fixed iteration count is used instead
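The per-vertex update rule can be sketched as a small Python helper (a hypothetical illustration, not the authors' code; `labels` and `adj` are assumed dict/adjacency-list inputs):

```python
import random
from collections import Counter

def update_vertex(v, labels, adj, rng=random):
    """Return the new label for vertex v: the most frequent label among
    v's neighbors, with ties broken uniformly at random."""
    counts = Counter(labels[u] for u in adj[v])
    if not counts:                      # isolated vertex keeps its label
        return labels[v]
    best = max(counts.values())
    return rng.choice([lab for lab, c in counts.items() if c == best])
```

For example, a vertex whose neighbors carry labels {a, a, b} deterministically adopts a; only genuine ties invoke the random choice.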
Why Label Propagation?
Iterative vertex updates – prototypical of many other graph computations
Wide usage – community detection, partitioning, other unsupervised learning problems
Nondeterministic algorithm by design – solution quality can vary based on processing methodology
Straightforward to implement via work chunking
Multilevel Memory Label Propagation via Work Chunking
1:  L ← LPChunking(G(V,E), Cnum, Citer)
2:  for all v ∈ V : L(v) ← id(v)                ▷ Initialize labels as vertex ids
3:  while at least one L(v) updates do
4:    for c = 0 … (Cnum − 1) do
5:      Vc ← Chunk(c, V);  Ec ← {⟨v, u⟩ ∈ E : v or u ∈ Vc}
6:      for iter = 1 … Citer while some L(v) : v ∈ Vc updates do
7:        for all v ∈ Vc do in parallel          ▷ Random order
8:          Counts ← ∅                           ▷ Hash table
9:          for all ⟨v, u⟩ ∈ Ec do
10:           Counts(L(u)) ← Counts(L(u)) + 1
11:         NewLabel ← Max(Counts)
12:         if NewLabel ≠ L(v) then
13:           L(v) ← NewLabel
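A rough executable rendering of the pseudocode above, as a sequential Python sketch. Two details the slides leave open are assumed here: `Chunk(c, V)` is realized as a contiguous vertex-block split, and a vertex keeps its current label when it is among the tied maxima:

```python
import random
from collections import Counter

def lp_chunking(n, edges, c_num, c_iter, max_sweeps=40, rng=random):
    # Build an undirected adjacency list from the edge list.
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    labels = list(range(n))                      # L(v) <- id(v)
    for _ in range(max_sweeps):                  # bounded "while updates" loop
        swept_change = False
        for c in range(c_num):                   # Chunk(c, V): contiguous block
            chunk = list(range(c * n // c_num, (c + 1) * n // c_num))
            for _ in range(c_iter):              # local iterations on this chunk
                changed = False
                rng.shuffle(chunk)               # random processing order
                for v in chunk:
                    counts = Counter(labels[u] for u in adj[v])
                    if not counts:
                        continue
                    best = max(counts.values())
                    if counts.get(labels[v], 0) == best:
                        continue                 # keep current label on ties
                    labels[v] = rng.choice(
                        [l for l, k in counts.items() if k == best])
                    changed = swept_change = True
                if not changed:
                    break
        if not swept_change:
            break
    return labels
```

On a graph of two disjoint triangles split into two chunks, each triangle converges to a single label entirely within its own chunk's local iterations.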
Chunking Considerations
Primary chunking variables
Number of total chunks (Cnum)
Work iterations performed on each chunk (Citer)
How to determine data per chunk?
Block methods (vertex block, edge block)
Randomization
Explicit partitioning
How to transfer chunked data?
All threads transfer, then all threads work
Overlap transfer of Ci+1 with work on Ci
Vary number of work/transfer threads to ensure balance
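The transfer/work overlap above can be sketched with a bounded staging queue: one thread "transfers" (loads) upcoming chunks while the main thread works on the current one. `load` and `work` are hypothetical callables standing in for the MCDRAM copy and the per-chunk computation:

```python
import threading
from queue import Queue

def process_chunks(chunks, load, work, depth=1):
    # A bounded queue models the limited MCDRAM staging buffers:
    # load() of chunk i+1 proceeds while work() runs on chunk i.
    staged = Queue(maxsize=depth)
    done = object()                       # sentinel marking end of stream

    def transfer():
        for c in chunks:
            staged.put(load(c))           # blocks when staging space is full
        staged.put(done)

    t = threading.Thread(target=transfer)
    t.start()
    results = []
    while (item := staged.get()) is not done:
        results.append(work(item))
    t.join()
    return results
```

Varying `depth` (and, in a real implementation, the split of threads between the two roles) is how the transfer and work rates would be balanced.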
Algorithmic Variants
Baseline Cache
Baseline implementation running in cache mode
Baseline Hybrid
Baseline implementation with hash table allocated in MCDRAM
Graph structure and other data handled by the MCDRAM cache
Chunk-HBM
All data explicitly allocated in MCDRAM
Per-chunk graph structure transferred into MCDRAM
All vertex labels static in MCDRAM
Experimental Setup
Test system and test graphs
Test System: Bowman at Sandia Labs – each node has a KNL with 68 cores, 96 GB DDR, and 16 GB MCDRAM
Test Graphs:
| Network | n | m | d_avg | d_max | D̃ |
|---|---|---|---|---|---|
| LiveJournal | 4.8 M | 69 M | 18 | 20 K | 18 |
| Friendster | 66 M | 1.8 B | 27 | 5.2 K | 34 |
| Twitter | 52 M | 2.0 B | 37 | 3.7 M | 19 |
| Host | 89 M | 2.0 B | 22 | 3.4 M | 23 |
| uk-2007 | 105 M | 3.3 B | 31 | 975 K | 82 |
| wBTER 50 | 50 M | 1.2 B | 24 | 110 K | 12 |
| wBTER 100 | 100 M | 2.4 B | 24 | 135 K | 12 |
How does chunking impact solution quality?
Convergence and Solution Quality
For label propagation and community detection algorithms in general
Defining convergence
True convergence: no more label updates can occur
Looser criteria: fixed iterations, some modularity gain or change, number of labels, others
We run to true convergence when possible, but fix iterations toenable a parametric study of chunking variables.
Defining solution quality
Standard metrics when no ground truth exists: modularity, conductance, among many others
When ground truth exists: normalized mutual information (NMI) and related measurements
Despite some observed flaws with their usage, we select thestandard measurements of modularity and NMI.
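Both measures can be computed directly; a minimal sketch (undirected modularity summed over communities, and NMI normalized by the geometric mean of the label entropies – one of several common normalization conventions):

```python
from collections import Counter
from math import log

def modularity(edges, label):
    """Q = sum_c (e_c/m - (d_c/2m)^2) over communities c, for an
    undirected edge list and a vertex -> community mapping."""
    m = len(edges)
    intra, deg = Counter(), Counter()
    for u, v in edges:
        deg[label[u]] += 1
        deg[label[v]] += 1
        if label[u] == label[v]:
            intra[label[u]] += 1
    return sum(intra[c] / m - (deg[c] / (2 * m)) ** 2 for c in deg)

def nmi(a, b):
    """Normalized mutual information between two labelings of the same set."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(k / n * log(n * k / (ca[x] * cb[y])) for (x, y), k in cab.items())
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    return mi / (ha * hb) ** 0.5 if ha > 0 and hb > 0 else 1.0
```

NMI is invariant under relabeling (a permutation of community ids scores 1.0), which is why it is the natural choice when ground-truth communities exist.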
Chunking Parameters
Evaluating impact of number of chunks and iterations per chunk
Heatmaps of iterations to convergence (left) and impact on final modularity (right) – lighter is better
About a 5× increase in iterations is captured in the left plot and a 2% total modularity change in the right plot
While chunking increases iterations to convergence, it has minimal impact on final solution quality (and actually improves it in several instances – LiveJournal, Host, wBTER)
[Heatmaps: Iter Per Chunk (1–50) vs. Number of Chunks (2–50), for iterations to convergence (left) and modularity impact (right)]
Chunking Parameters
Lancichinetti–Fortunato–Radicchi (LFR) benchmark
Ran the same parametric tests on the LFR benchmark (n = 10,000, k = 15, maxk = 500, t1 = 2, t2 = 1, µ = 0.05 … 0.6)
Heatmap of iterations to convergence (left) and NMI versus baseline (right)
Similar takeaways to the real-world test instances
[Left: heatmap of iterations to convergence, Iter Per Chunk vs. Number of Chunks. Right: NMI (0.980–1.000) vs. mixing parameter µ (0.1–0.6) for Base, Chunk_5_5, and Chunk_50_50]
Can HBM chunking improve time to solution?
Consideration 1: Partitioning Methodology
5 iterations per chunk, minimum number of chunks possible (∼5), 40 iterations
Effects of partitioning method on per-iteration speedup vs. baseline (left) and modularity (right)
Explicit partitioning demonstrates the largest improvements, but at the obvious cost of computing the partition
[Bar charts: per-iteration speedup (0.0–2.0, left) and modularity improvement (0.00–1.00, right) by partitioning strategy – PuLP, VertBlock, EdgeBlock, Random]
Consideration 2: Overlapping Communication
Average speedups across all partitioning methods while overlapping communication
Note: when overlapping, we double the number of chunks; this can lead to greater than 2× relative speedup due to cache effects on graph data and hash tables
[Bar chart: per-iteration speedup (0–5) for LiveJournal, Friendster, Twitter, Host, uk-2007, wBTER_50, wBTER_100]
Overall: Cache vs. Hybrid vs. Flat Modes
Best times (in seconds) in each mode for each graph, for 40 iterations or convergence
| Network | Cache | Hybrid | Flat | Method |
|---|---|---|---|---|
| LiveJournal | 33 | 29 | 25 | P-OL |
| Friendster | 495 | 337 | 333 | VB |
| Twitter | 1,793 | 871 | 242 | P-OL |
| Host | 2,447 | 2,086 | 712 | EB-OL |
| uk-2007 | 1,981 | 1,241 | 783 | P-OL |
| wBTER 50 | 577 | 474 | 225 | VB-OL |
| wBTER 100 | 1,602 | 491 | 435 | EB-OL |

Methods: VB: Vertex Block; EB: Edge Block; P: PuLP partitioning; -OL: with overlapping communication
Time and Modularity vs. Iterations
Per-iteration time and total time don't tell the whole story
Friendster (left) and Twitter (right): modularity vs. iterations (top) and time per iteration (bottom), for the baseline and chunked variants labeled Cnum_Citer.
[Plots: modularity (0.00–1.00) vs. number of global iterations (top) and vs. time in seconds (bottom), for Friendster (Base, 5_2, 10_5, 20_5) and Twitter (Base, 10_5, 20_5)]
Discussion: Generalization
To other vertex programs on KNLs with HBM:
Tested chunked versions of PageRank and K-cores
Speedups are still present but much smaller – under 25%
The hash table for label propagation is likely just extremely ill-performant in cache mode, so it benefits most from explicit memory considerations
Minimal impact on solution quality for PageRank (for K-cores, we run to true convergence)
GPU- and SSD-based graph processing:
Note: the biggest general takeaway is that running multiple local iterations doesn't impact solution quality
So limited-memory GPUs and large-scale processing with SSD arrays might consider similar approaches
Distributed processing:
Equivalence to only communicating every nth iteration
Conclusions and Future Work
Chunking minimally affects solution quality of labelpropagation, but can increase the number of iterationsrequired for a given “quality”
Explicit handling of HBM generally improves per-iterationtiming and can improve time-to-solution in selectinstances
Future work:
Further explore generalizations to other vertex programs
Multi-tiered chunking – hold key vertices in HBM and update them every iteration
Paper to appear in IPDPS 2018
www.gmslota.com, [email protected]