Theoretical Modeling of Multicore Computation
DIMACS Workshop on Parallelism: A 2020 Vision
Alejandro Salinger, University of Waterloo
March 16, 2011

Multicore Challenges
• The purpose of modeling is to capture the salient characteristics of phenomena with clarity and the right degree of accuracy to facilitate analysis and prediction [Maggs et al. ’95]
• A model should provide clear, productive design incentives while providing strong messages to platform designers about the quality of characteristics required for efficient solution
• The development of a unifying paradigm also requires a somewhat unified and stable technological environment
Theoretical Modeling of Multicore Computation - Alejandro Salinger

Multicore Challenges
We would like a model that:
• Reflects the characteristics of the architecture
• Relatively flexible
• Easy theoretical analysis
• Cost model linked to programming model
• Easy to learn
• Easy to program
• Others? (parameter-oblivious?)

Multicore models
[Figure: diagram labeled "Simple" and "Accurate"]

Low Degree PRAM
• $p = O(\log n)$ processors, MIMD mode
• Thread-based parallelism
• Scheduled with work-stealing
• Optimal speedups on a large class of divide-and-conquer algorithms and dynamic programming
• Works in practice
• Easy to program
• Message: if we design algorithms to obtain small speedups, design and programming become easy
[Dorrigiv, López-Ortiz, S. ’08]
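
The idea of running a divide-and-conquer algorithm on a small number of threads can be sketched as follows. This is an illustration only, not the algorithm from the cited paper: the function names and the rule "spawn a thread only in the top floor(log2 p) levels of the recursion" are assumptions made for the example. In CPython the global interpreter lock prevents a real speedup, so the sketch shows the structure rather than the performance.

```python
import threading

def merge(a, b):
    """Sequentially merge two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def mergesort(xs, spawn_depth):
    """Mergesort that forks a thread only while spawn_depth > 0."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    if spawn_depth > 0:
        result = {}

        def sort_left():
            result["left"] = mergesort(xs[:mid], spawn_depth - 1)

        t = threading.Thread(target=sort_left)
        t.start()                                   # left half in a new thread
        right = mergesort(xs[mid:], spawn_depth - 1)  # right half in this thread
        t.join()
        left = result["left"]
    else:
        # Below the spawn levels the recursion is purely sequential.
        left = mergesort(xs[:mid], 0)
        right = mergesort(xs[mid:], 0)
    return merge(left, right)

def low_degree_sort(xs, p):
    """Sort xs using at most p concurrent threads (hypothetical helper)."""
    depth = max(p.bit_length() - 1, 0)  # floor(log2 p) spawn levels
    return mergesort(xs, depth)
```

With `p = 4` the recursion forks in its top two levels, so at most four threads run at once and everything below is ordinary sequential mergesort.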

Communication is key
• Parallel computing is as much about communicating data between processors as it is about partitioning computing load between processors [Pal]
It’s all about the cache
• Not only time complexity, also cache complexity: number of cache misses, parallel transfers
• Reducing misses can lead to overall faster running time even if processors are not fully utilized
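
A minimal sketch of what "cache complexity" measures, assuming a single fully associative LRU cache (the replacement policy, the sizes, and the function name are assumptions for this example, not part of the talk). It counts misses for two traversals of the same row-major matrix that perform the same number of operations.

```python
from collections import OrderedDict

def count_misses(addresses, cache_blocks, block_size):
    """Count misses of a fully associative LRU cache on an address trace."""
    cache = OrderedDict()          # block id -> True, ordered by recency
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict least recently used block
    return misses

n = 64  # n x n matrix stored in row-major order
row_major = [i * n + j for i in range(n) for j in range(n)]
col_major = [i * n + j for j in range(n) for i in range(n)]

row_misses = count_misses(row_major, cache_blocks=8, block_size=8)
col_misses = count_misses(col_major, cache_blocks=8, block_size=8)
print(row_misses, col_misses)  # 512 4096
```

Both traversals touch all 4096 entries, but the column-order traversal misses on every access because each column spans 64 blocks and the cache holds only 8.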

Cache models
[Figure: four memory organizations for a 4-core machine: (1) four cores, one cache, RAM; (2) four cores, four caches, RAM; (3) four cores, five caches, RAM; (4) four cores, six caches, RAM]

Parallel External Memory (PEM)
• P synchronized processors
• Private memory of M words
• Blocks of size B words
• Measures:
  • Computational complexity: maximum memory accesses to cache
  • I/O complexity: parallel block transfers from memory
[Figure: four cores, each with a private memory of size M, connected to RAM]
[Arge, Goodrich, Nelson, Sitchinava ’08]
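
To make the I/O measure concrete, here is a small sketch that counts block transfers for scanning an array of n words split evenly among P processors. It assumes each processor can transfer one block of B words per parallel I/O and that the array starts on a block boundary; the function name is hypothetical.

```python
import math

def parallel_scan_ios(n, p, b):
    """Return (parallel I/Os, total block transfers) for a P-way scan."""
    chunk = math.ceil(n / p)            # words assigned to each processor
    per_processor = []
    for k in range(p):
        lo = min(k * chunk, n)
        hi = min(lo + chunk, n)
        # Number of distinct blocks covering the half-open range [lo, hi).
        blocks = 0 if hi == lo else (hi - 1) // b - lo // b + 1
        per_processor.append(blocks)
    # Processors transfer in parallel, so the slowest one sets the cost.
    return max(per_processor), sum(per_processor)

print(parallel_scan_ios(1024, 4, 16))  # (16, 64)
print(parallel_scan_ios(1024, 1, 16))  # (64, 64)
```

The total number of transfers is unchanged, while the parallel I/O count drops by a factor of P, which is the kind of speedup over the sequential external memory bound that the model is meant to expose.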

speedup over External Memory bounds
Most algorithms cache aware, PRAM style
Problem | PEM I/O complexity
• Sorting
• Weighted list ranking
• Euler tour
• Tree contraction
• Expression tree evaluation
• Lowest Common Ancestor (Q queries)
• Minimum Spanning Tree
• Connected and biconnected components
• Ear decomposition
• Line Segment Intersection Reporting (K output size)
[Arge, Goodrich, Sitchinava ’10; Ajwani, Sitchinava, Zeh ’11]

DAG model
• Nodes: tasks (with weights)
• Edges: dependencies (spawn, data, etc.)
• Work: total weight of all nodes
• Depth: weight of the heaviest path
• E.g. Mergesort
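
As a worked illustration of work and depth, the sketch below evaluates both for a mergesort DAG. The figures shown on the slide did not survive, so the cost assumptions here are my own: each leaf has weight 1, the two recursive calls run in parallel, and merging n elements is a single sequential task of weight n.

```python
def mergesort_work_depth(n):
    """Work and depth of a mergesort DAG with a sequential merge (assumed costs)."""
    if n <= 1:
        return 1, 1                      # a leaf task of weight 1
    left = n // 2
    right = n - left
    work_l, depth_l = mergesort_work_depth(left)
    work_r, depth_r = mergesort_work_depth(right)
    merge_cost = n                       # sequential merge of n elements
    work = work_l + work_r + merge_cost  # all nodes count toward work
    depth = max(depth_l, depth_r) + merge_cost  # halves overlap, merge does not
    return work, depth

work, depth = mergesort_work_depth(8)
print(work, depth)  # 32 15
```

Under these assumptions work grows as n log n while depth grows linearly in n, so the available parallelism (work divided by depth) is only logarithmic; a parallel merge would be needed to lower the depth further.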

Schedulers
It’s all about the scheduler
• Multithreaded computations with arbitrary dependencies can be impossible to schedule efficiently
Restrict computation
• Fully strict computation: all data dependencies go to thread’s parent
Work-stealing
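
The work-stealing discipline can be sketched with a small step-by-step simulation. This is a simplified illustration under assumptions of my own (unit-time tasks, a full binary spawn tree, one steal attempt per idle worker per step, a uniformly random victim), not the scheduler analyzed in the cited papers. Each worker pops from the bottom of its own deque, and idle workers steal from the top of a victim's deque.

```python
import random
from collections import deque

def simulate_work_stealing(depth, workers, seed=0):
    """Simulate work stealing on a full binary spawn tree.

    A task is represented by its remaining depth; executing a task with
    remaining depth d > 0 spawns two children of depth d - 1.
    Returns (steps, tasks_executed, successful_steals).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(workers)]
    deques[0].append(depth)                 # root task starts on worker 0
    steps = executed = steals = 0
    while any(deques):
        steps += 1
        busy = [bool(d) for d in deques]    # snapshot at the start of the step
        # Phase 1: every busy worker runs one task from the bottom of its deque.
        for w in range(workers):
            if busy[w]:
                task = deques[w].pop()
                executed += 1
                if task > 0:
                    deques[w].append(task - 1)
                    deques[w].append(task - 1)
        # Phase 2: every idle worker tries to steal from the top of a victim.
        for w in range(workers):
            if not busy[w]:
                victim = rng.randrange(workers)
                if victim != w and deques[victim]:
                    deques[w].append(deques[victim].popleft())
                    steals += 1
    return steps, executed, steals

steps, executed, steals = simulate_work_stealing(depth=6, workers=4, seed=1)
print(steps, executed, steals)
```

Stealing from the top takes the oldest, and therefore typically largest, piece of pending work, which is why a thief rarely needs to steal again soon after a successful steal.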

Schedulers: Work-Stealing
For any fully strict computation:
• Expected running time $T_1/P + O(T_\infty)$ ($T_1$: work, $T_\infty$: depth, $P$: processors)
• Space at most $S_1 P$ ($S_1$: min space for sequential computation)
• Expected communication bound (tight)
Good for private caches:
• Caches of size C, transfer time m, steal time s
• DAG consistency
• Nested-parallel
[Figure: four cores, each with a private cache of size C, connected to RAM]
[Acar, Blelloch, Blumofe ’02] [Blumofe, Leiserson ’94] [Blumofe, Frigo, Joerg, Leiserson, Randall ’96]

Schedulers: Parallel Depth First
• Sequential computation on a cache of size $C_1$ with $M_1$ misses
• Shared cache of size $C_p \ge C_1 + pD$ ($D$: depth): $M_p \le M_1$ misses
• If …: the number of parallel steps is small
[Figure: four cores sharing a cache of size $C_p$, connected to RAM]
[Blelloch, Gibbons ’04]

Schedulers
Competing demands for private and shared caches
• Work stealing can suffer shared cache misses
• PDF can suffer private cache misses
Multicore-cache model
• Time, $L_1$-cache complexity, $L_2$-cache complexity
• Controlled-PDF scheduler
[Figure: four cores, each with a private L1 cache, sharing an L2 cache connected to RAM]
[Blelloch, Chowdhury, Gibbons, Ramachandran, Chen, Kozuch ’08]

Schedulers: Controlled-PDF
• $L_2$-supernodes, 1DF schedule
• $L_1$-supernodes, PDF schedule
• User specifies space usage function
• Only scheduler knows about cache sizes
• Large class of divide-and-conquer algorithms
• Optimal speedups
• $L_1$-cache complexity and $L_2$-cache complexity within constant factor of sequential cache complexities

Cache obliviousness
• Parallel cache complexity $Q_p$ can be bounded by sequential cache complexity $Q$ ($D$: depth, $p$: processors, cache size $M$, block size $B$):
  • Private caches: $Q_p(n; M, B) < Q(n; M, B) + O(pDM/B)$
  • Shared caches: $Q_p(n; M + pBD, B) \le Q(n; M, B)$
• For nested-parallel computations
• Bounds can be extended to multilevel shared or private hierarchies
• Idea: develop nested-parallel algorithms with low cache complexity and low depth
[Blelloch, Gibbons, Simhadri ’10]

Low-depth cache oblivious
Problem | Depth | Cache (size M, block B)
• Sorting (rand)
• List ranking
• Euler tour on trees
• Tree contraction
• Lowest Common Ancestor (k queries)
• Minimum Spanning Forest
• Connected components
• Sparse-Matrix Vector Multiply (nonzeros, separators)

Resource Oblivious Algorithms - HM
• Hierarchical model HM
• Extension to multicore model
• Efficient oblivious algorithms for:
  • Matrix transposition
  • FFT
  • Sorting
  • Gaussian Elimination Paradigm
  • List ranking
  • Connected components
• Scheduler hints
[Figure: four cores, six caches, RAM]
[Chowdhury, Silvestri, Blakeley, Ramachandran ’10]

Multi-BSP
• $d$ levels, each with parameters $(p_j, L_j, m_j, g_j)$:
  • $p_j$: number of components
  • $L_j$: synchronization cost
  • $m_j$: size of memory
  • $g_j$: data rate
• Level 0: cores
• Portable algorithms
• “Immortal algorithms”
• Optimal algorithms for matrix multiplication, FFT, and sorting
• L closer to latency than synchronization
• Prescriptive: e.g. support for synchronization operation
[Figure: a level $j$ component made of level $j-1$ components numbered 1, 2, ..., $p_j$, with memory $m_j$ and data rate $g_j$; alongside, four cores with six caches and RAM]
[Valiant ’08]
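
A machine description in this style can be written down as a list of per-level parameter tuples. The sketch below is only a data-structure illustration: the class name, the helper, and all numeric values are hypothetical and do not describe any real machine. It uses the fact that a level-j component contains $p_j$ level-(j-1) components, so the number of cores is the product of the $p_j$.

```python
from dataclasses import dataclass
from math import prod

@dataclass(frozen=True)
class Level:
    p: int      # number of level j-1 components inside one level j component
    L: float    # synchronization cost at this level
    m: int      # size of the memory at this level
    g: float    # data rate parameter at this level

def total_cores(levels):
    """Level 0 components are cores, so multiply the fan-out of every level."""
    return prod(level.p for level in levels)

# Hypothetical two-level machine: 4 cores per chip, 2 chips (values made up).
example_machine = [
    Level(p=4, L=10.0, m=256 * 1024, g=2.0),
    Level(p=2, L=100.0, m=8 * 1024 * 1024, g=8.0),
]
print(total_cores(example_machine))  # 8
```

An algorithm written against such a description receives the parameters as inputs, which is one way to read the slide's "portable algorithms" point.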

Models Summary
• Modeling parallel computation is hard
• Multicore architecture constantly changing
• Cache should be part of the equation
• Maybe later inter-processor communication, synchronization, energy

Models Summary
Good:
• No need to reinvent everything
• Large class of algorithms with good cache complexity for shared or private caches
• Some relatively simple design in terms of work, depth, and sequential cache complexity
• Parameters of the machine only known by scheduler
• Cilk Plus: model, scheduler, tools widely available
Needs improvement:
• More algorithms or scheduler with good shared and private cache complexities
• How to choose the scheduler?
• Theory needs to be accessible to the masses

Parallel training
• Current CS degree prepares for programming on obsolete model
• Change of mentality:
  • Parallel thinking (algorithms, programming), but also
  • I/O complexity, locality of reference
  • Programming languages
• Right balance between practical skills and underlying theory?
• How to add new concepts without too much sacrifice?
• More specialized majors?

Final thoughts
• Constant factor speedup, opportunity for simplicity
• Use of more efficient, low-level algorithms where appropriate (library tools)
• Should we marry multicores? What’s the next thing?

Bibliography
U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3), 2002.
D. Ajwani, N. Sitchinava, and N. Zeh. I/O-optimal algorithms for orthogonal problems for private-cache chip multiprocessors. In IPDPS ’11, 2011.
L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In ACM SPAA ’08, 2008.
L. Arge, M. T. Goodrich, and N. Sitchinava. Parallel external memory graph algorithms. In IPDPS ’10, 2010.
G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In ACM-SIAM SODA ’08, 2008.
G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In ACM SPAA ’04, 2004.
G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low-depth cache oblivious algorithms. In ACM SPAA ’10, 2010.
R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5), 1999.
R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In SPAA ’96, 1996.
R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In IEEE IPDPS ’10, 2010.
R. Cole and V. Ramachandran. Resource oblivious sorting on multicores. In ICALP ’10, 2010.
R. Dorrigiv, A. López-Ortiz, and A. Salinger. Optimal speedup on a low-degree multi-core parallel architecture (LoPRAM). In ACM SPAA ’08, 2008.
B. M. Maggs, L. R. Matheson, and R. E. Tarjan. Models of parallel computation: a survey and synthesis. In HICSS ’95, 1995.
L. G. Valiant. A bridging model for multicore computing. Journal of Computer and System Sciences, 2010.