![Page 1: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/1.jpg)
Understanding the SIMD Efficiency of Graph Traversal on GPUYichao Cheng, Hong An, Zhitao Chen, Feng Li, Zhaohui Wang, Xia Jiang and Yi Peng
University of Science and Technology of China
![Page 2: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/2.jpg)
Breadth-first Search (BFS)
A C
D E F
G H I
A C
DE F
G
H I
1 1
2 22
3 3
4
Source
![Page 3: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/3.jpg)
Breadth-first Search (BFS)
A
B
C
DE F
G
H I
BFS_Iteration:for u ∈ Current Frontier for v ∈ u’ s neighbors do if v has not been labeled label v put v in Next Frontier
![Page 4: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/4.jpg)
Application of BFS• Many datasets in real world are represented by graph• VLSI circuits• Social relationship• Road connections
• Primitive for building complex algorithms• Path-finding• Belief propagation• Points-to Analysis (PTA)
![Page 5: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/5.jpg)
The Problem•GPU relies on high SIMD lanes occupancy to boost performance • 100% efficiency is achieved only if all SIMD lanes fall in the same path
I
Do_something_common();If (thread_id > 5) { do_something_red(); } else { do something_blue();}
100% utilization
![Page 6: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/6.jpg)
The Problem•GPU relies on high SIMD lanes occupancy to boost performance • 100% efficiency is achieved only if all SIMD lanes fall in the same path
I
37.5% utilization
Do_something_common();If (thread_id > 5) { do_something_red(); } else { do something_blue();}
![Page 7: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/7.jpg)
The Problem•GPU relies on high SIMD lanes occupancy to boost performance • 100% efficiency is achieved only if all SIMD lanes fall in the same path
I
62.5% utilization
Do_something_common();If (thread_id > 5) { do_something_red(); } else { do something_blue();}
![Page 8: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/8.jpg)
Traditional ImplementationGPU_BFS_Iteration u = C[tid] for v ∈ u’ s neighbors do
end for
The # of sub-iterations depends on
the size of u ’s adjacent list
task 1 = 4 sub-iterations
task 2 = 2 sub-iterations
…
![Page 9: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/9.jpg)
Visualizing the Irregularity
vertex range < 8
Highly skewedoutlier exists
irregular but concentrate
distributedbetween a wide rage
![Page 10: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/10.jpg)
Alternative Way• Assign each task with a warp of threads• Vectorize the sub-iterations!
I
So, what’s the relationship between graph topology and SIMD efficiency?
![Page 11: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/11.jpg)
Topology and Utilization• Assign each vertex with a group of threads
Thread Warp Group
task 1 = 2 sub-iterationstask 2 = 1 sub-iteration
![Page 12: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/12.jpg)
Topology and Utilization
Divide the SIMD underutilization into two parts• InteR-group Underutilization (UR)
• IntrA-group Underutilization (UA)
SIMD Window
![Page 13: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/13.jpg)
Conclusions From the Model
•UR is induced by the heterogeneity of workloads• Affected by the graph topology
•UR is sensitive to the group size (S)• Large logical SIMD window can narrow the gap• When S = 32, UR = 0
•UA is determined by the intrinsic irregularity of vertex degree• It can be limited by shrink the S• When S = 1, UA = 0
•UR and UA can convert to each other
![Page 14: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/14.jpg)
Comparing Different Mapping Strategies
Expansion Rate (ME/s)
Scalability
good
poor
low high
![Page 15: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/15.jpg)
Evaluating the SIMD Efficiency•Metrics derived from the model:
UR = inter-group underutilizationUA = intra-group underutilizationME = mapping efficiency UR + UA + ME = 100%
• Captures utilization trend with increasing S
![Page 16: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/16.jpg)
Explaining the Result
Expansion Rate (ME/s)
Scalability
good
poor
low high
alleviate the UR , introducing minor UA
![Page 17: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/17.jpg)
Explaining the Result
Expansion Rate (ME/s)
Scalability
good
poor
low high
ME in a high level (~80%)
![Page 18: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/18.jpg)
Explaining the Result
Expansion Rate (ME/s)
Scalability
good
poor
low high
outweighed by the fast-growing UA
![Page 19: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/19.jpg)
Explaining the Result
Expansion Rate (ME/s)
Scalability
good
poor
low high
do little help to UR but lead to severe UA
![Page 20: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/20.jpg)
Conclusion• Study the link between graph topo & hardware util• Present a model for analyzing the components of SIMD underutilization• Discover that the SIMD are wasted due to:•Develop 3 metrics for quantifying SIMD efficiency• Provide a foundation for developing techniques of static analysis and runtime optimization
imbalance of vertex degree distribution heterogeneity of
each vertex degree
![Page 21: Understanding t he SIMD Efficiency o f Graph Traversal o n GPU](https://reader036.vdocuments.mx/reader036/viewer/2022062408/56813c63550346895da5ed83/html5/thumbnails/21.jpg)
Q&A