Locality Aware Scheduling of Sparse Computations for Energy and Performance Efficiencies
Michael Frasca
Kamesh Madduri
Padma Raghavan
Department of Computer Science & Engineering
The Pennsylvania State University
March 1, 2013
ABSTRACT
Exploiting parallelism for large-scale irregular applications requires efficient use of a complex memory hierarchy. We develop dynamic workload strategies that map the demands of parallel graph algorithms onto shared compute and memory resources, while achieving improved power and performance efficiencies.
Outline
Background
  Non-Uniform Memory Access (NUMA)
  Betweenness Centrality (BC)
NUMA-Aware Dynamic Strategies
  Adaptive Data Layout (ADL)
  NUMA-Aware Workload Queues (NWQ)
Power & Performance Results
  20 large graph inputs
  1-32 cores/threads
  Detailed working-set analysis
Conclusions
Background | NUMA Cache Hierarchy
Scalable cache design
  Local caches with low latency
  Shared caches with higher capacity
  Shared cache latency depends on distance to core
  Multi-socket systems connected via point-to-point interconnect
Coherence policy
  Multiple copies of shared data
  Writing to data invalidates non-owned copies
  Applies to caches across multiple sockets
Background | Graph Mining
Graph analysis
  Inherently unstructured data
  Irregular access patterns, unknown until runtime
  Data partitioning is hard
Parallelism
  Shared memory, dynamic partitioning
  Lightweight threading
  Efficient synchronization
Betweenness Centrality
  Measure of a node’s importance in a graph
www.cise.ufl.edu
Background | Betweenness Centrality
Count of all shortest paths s ⇢ t
Count of all shortest paths s ⇢ t that contain v
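For reference, these two counts are conventionally written as σ_st and σ_st(v). The symbols are assumed here, since the slide's own formula did not survive extraction. With that notation, the standard betweenness centrality of a vertex v is their ratio summed over all source-target pairs:

```latex
BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
```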
All-pairs shortest path
  One BFS per node + updates to per-node metadata
Lock-free implementation
  Level-synchronous design
  Dynamic workload balancing
  Madduri et al. [IPDPS 2009]
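The "one BFS per node plus per-node metadata" structure can be illustrated with a sequential sketch in the style of Brandes' algorithm. This is not the lock-free parallel implementation cited above; the function name and the dictionary-based graph format are assumptions made for illustration. Scores count ordered (s, t) pairs, so on an undirected graph they are twice the unordered-pair convention.

```python
from collections import deque

def betweenness_centrality(adj):
    """adj: dict mapping every vertex to a list of its neighbors (unweighted)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: one BFS from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}    # per-node metadata: path counts
        dist = {v: -1 for v in adj}    # per-node metadata: BFS depth
        preds = {v: [] for v in adj}   # per-node metadata: BFS predecessors
        sigma[s] = 1
        dist[s] = 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:                # first discovery of w
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:     # edge lies on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward phase: accumulate dependencies in reverse discovery order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

For a three-vertex path `{0: [1], 1: [0, 2], 2: [1]}`, only the middle vertex receives a nonzero score.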
Parallel BFS | Memory Behavior
Dynamic thread-vertex assignment at each level
Outgoing edge => update to frontier node
Dictates reuse across tree depths
Can generate unnecessary sharing on the frontier
[Figure: BFS boundary and frontier sets, with read and write accesses marked]
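A minimal sequential sketch of a level-synchronous BFS may help make the boundary/frontier terminology concrete. It assumes the boundary is the set being read at the current level and the frontier is the set being written for the next level; the function name and graph format are illustrative only. In a parallel version, the boundary vertices of each level would be divided among threads, which is where the thread-vertex assignment above comes in.

```python
def level_synchronous_bfs(adj, source):
    """adj: dict mapping every vertex to a list of its out-neighbors."""
    depth = {source: 0}
    boundary = [source]   # vertices whose edges are scanned at this level
    level = 0
    while boundary:
        frontier = []     # vertices discovered at this level
        for v in boundary:
            for w in adj[v]:            # read: outgoing edges of a boundary vertex
                if w not in depth:      # write: first update to a frontier vertex
                    depth[w] = level + 1
                    frontier.append(w)
        boundary = frontier             # the frontier becomes the next boundary
        level += 1
    return depth
```

Two boundary vertices that share a neighbor both touch the same frontier entry, which is the kind of sharing the slide warns about when those vertices are handled by different threads.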
Parallel BFS | Memory Behavior
[Figure: Adjacency Matrix Representation]
Parallel BFS | Memory Behavior
[Figure: Workload distribution of a front]
NUMA-AWARE TECHNIQUES
NUMA-Aware | Adaptive Data Layout
We assume graphs have a random ordering
  Poor spatial data locality
  High amount of false sharing
Dynamic graph reordering
  The first BFS traverses the random graph
  Order of discovery is used to permute the graph
  Improved locality for remaining BFS traversals
Permutation Sort
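The reordering step can be sketched as follows, assuming the graph is stored as a list of neighbor lists indexed by vertex id. The function names are hypothetical, and sorting each relabeled neighbor list is an assumption made here for layout regularity rather than something the slides specify.

```python
from collections import deque

def bfs_discovery_order(adj, source):
    """Return all vertices of adj (list of neighbor lists) in BFS discovery order."""
    seen = [False] * len(adj)
    seen[source] = True
    order = [source]
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if not seen[w]:
                seen[w] = True
                order.append(w)
                queue.append(w)
    # Vertices unreachable from the source are appended in their original order.
    order.extend(v for v in range(len(adj)) if not seen[v])
    return order

def permute_graph(adj, order):
    """Relabel vertices so that the i-th discovered vertex gets id i."""
    new_id = [0] * len(adj)
    for new, old in enumerate(order):
        new_id[old] = new
    # Row i of the result holds the relabeled neighbors of the i-th discovered vertex.
    return [sorted(new_id[w] for w in adj[old]) for old in order]
```

After relabeling, vertices discovered close together in the first traversal occupy neighboring positions in memory, which is the locality that later traversals can benefit from.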
NUMA-Aware | Dynamic Work Queues
Applicable to iterative algorithms that dynamically generate work
BC: Boundary/Frontier represented as per-thread work queues
NUMA-Aware | Dynamic Work Queues
Each Core/Thread has its own work queue
When a thread finishes its own queue, it helps other threads with theirs
Work queues traversed in order of NUMA-distance
[Figure: cores C0, C1, C2, C3 with two L2 caches and one L3 cache]
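A single-threaded sketch of the queue discipline described on this slide is shown below. The class and method names are hypothetical, and the sketch deliberately omits the synchronization a real multithreaded implementation would need. Each thread is given a list of queue indices ordered nearest first, with its own queue at position 0.

```python
from collections import deque

class WorkQueues:
    def __init__(self, victim_order):
        # victim_order[t] lists queue indices for thread t, nearest first.
        # victim_order[t][0] is assumed to be thread t's own queue.
        self.victim_order = victim_order
        self.queues = [deque() for _ in victim_order]

    def push(self, thread_id, item):
        """Add newly generated work to the calling thread's own queue."""
        self.queues[thread_id].append(item)

    def pop(self, thread_id):
        """Return (queue_index, item) from the nearest non-empty queue, or None."""
        for q in self.victim_order[thread_id]:
            if self.queues[q]:
                return q, self.queues[q].popleft()
        return None
```

A thread with an empty queue therefore takes work from its closest neighbor first and only reaches for more distant queues when the nearer ones are empty.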
Work Queue Traversal
T0: { C0, C1, C2, C3 }
T1: { C1, C0, C3, C2 }
T2: { C2, C3, C0, C1 }
T3: { C3, C2, C1, C0 }
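The four traversal orders listed here coincide with `thread_id XOR k` for k = 0..3, which is the order one obtains for a balanced binary hierarchy. Whether the authors generate the orders this way is not stated; the function below is only one way to reproduce the table, under the assumption that the number of queues is a power of two.

```python
def traversal_order(thread_id, num_queues):
    """Queue indices for thread_id, nearest first, in a balanced binary hierarchy."""
    if num_queues <= 0 or num_queues & (num_queues - 1):
        raise ValueError("num_queues must be a positive power of two")
    # XOR with k flips low-order bits first, so nearby indices come early.
    return [thread_id ^ k for k in range(num_queues)]
```

For example, `traversal_order(1, 4)` yields queues 1, 0, 3, 2, matching the T1 row.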
NUMA-Aware | Dynamic Work Queues
Improved per-thread reuse at frontier
Reduced frontier sharing
Shared frontier likely between neighboring threads
EXPERIMENTAL ANALYSIS
Experimental | Cache Architecture
4 x Intel Xeon E7-8837 (Westmere)
8 cores per socket, 32 cores total
Direct QPI inter-processor communication
4 memory channels per socket
qdpma.com
Experimental | Inputs
20 large sparse graphs
  Road networks
  Finite element meshes
  Web crawls
  |V| ∈ [11.5, 118.1] million
  |E| ∈ [12.4, 1930.3] million
Scaling from 1 to 32 cores
  Measured time, power, cache
Performance Results | Per Input Speedup
Performance Results | Average Workload Speedup
Speedup       BC       ADL      ADL+NWQ
8 Threads     6.8x     16.0x    18.4x
32 Threads    16.9x    20.2x    32.9x
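To read the table as gains over the baseline BC implementation at the same thread count, the ratios can be worked out as below. This assumes all three columns are measured against the same reference run, which the slide does not state explicitly; the helper name is illustrative.

```python
# Average speedups copied from the table above.
speedup = {
    8: {"BC": 6.8, "ADL": 16.0, "ADL+NWQ": 18.4},
    32: {"BC": 16.9, "ADL": 20.2, "ADL+NWQ": 32.9},
}

def gain_over_baseline(threads, variant):
    """Ratio of a variant's speedup to the BC baseline's at the same thread count."""
    return speedup[threads][variant] / speedup[threads]["BC"]

for threads in (8, 32):
    for variant in ("ADL", "ADL+NWQ"):
        gain = gain_over_baseline(threads, variant)
        print(f"{threads} threads, {variant}: {gain:.2f}x over BC")
```

Under that assumption, ADL+NWQ is roughly 1.95x faster than the baseline at 32 threads, and roughly 2.71x faster at 8 threads.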
Energy Results | Per Input Savings
Mean energy savings
  ADL: 17.9%
  ADL+NWQ: 52.4%
Reasonably correlated with speedup
Energy Results | Understanding Consumption
Static Energy
  Power consumed by system at idle
Dynamic Energy
  Increased power consumed during utilization
  Arithmetic, logic, branch units, cache and DRAM
Efficient code uses less of both
  Reduced runtime => less static energy consumed
  Fewer cache misses, branch mispredictions, pipeline stalls => less dynamic energy consumed
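One way to write this decomposition down, using symbols that are assumptions of this note rather than the slides (T for runtime, P_idle for idle power, P(t) for instantaneous power under load), is:

```latex
E_{\text{total}}
  = \underbrace{P_{\text{idle}} \, T}_{E_{\text{static}}}
  + \underbrace{\int_{0}^{T} \bigl( P(t) - P_{\text{idle}} \bigr) \, dt}_{E_{\text{dynamic}}}
```

Shortening T reduces the first term directly, while fewer misses and stalls lower P(t) - P_idle in the second.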
Energy Results | Scaling Trends
Dynamic energy increases due to parallel overheads
Improved dynamic energy requirements
Better performance creates energy savings at scale
Working Set Analysis
Considerable per-thread overlap
Inefficient use of memory bandwidth across sockets
High costs associated with coherence traffic
NUMA-Aware scheduling reduces redundancy
More efficient use of cache space
Reduced cache invalidations
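As a rough illustration of what per-thread overlap means, the sketch below computes the fraction of distinct items touched by more than one thread. How the authors actually measured working sets is not described in these slides; the function name and the use of plain Python sets of cache-line ids are assumptions.

```python
from collections import Counter

def shared_fraction(working_sets):
    """working_sets: one set of accessed items (e.g. cache-line ids) per thread."""
    touches = Counter()
    for ws in working_sets:
        touches.update(ws)   # each thread contributes at most one touch per item
    if not touches:
        return 0.0
    shared = sum(1 for n in touches.values() if n > 1)
    return shared / len(touches)
```

A lower value means threads work on more disjoint data, which corresponds to less redundancy in cache contents and fewer invalidations.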
Working Set Analysis | Cache Impact
Conclusions
Memory issues have great impact as we scale algorithms and architectures
We believe dynamic runtime environments are key in exploiting workload-specific variability
Conclusions | Future Directions
NUMA-Aware Workload Scheduling
  Adapting scheduler for other irregular algorithms
  Incorporating other forms of system heterogeneity
Detailed analysis of cache behavior via simulation
  Location of shared frontier sets within the NUMA hierarchy
  Impact on load at functional units (e.g. reordering, branches)
NUMA-Aware graph data structures
  Appropriate for distributed memory?
Thank you