
Theoretical Modeling of Multicore Computation

DIMACS Workshop on Parallelism: A 2020 Vision

Alejandro Salinger
University of Waterloo

March 16, 2011

Multicore Challenges

• The purpose of modeling is to capture the salient characteristics of phenomena with clarity and the right degree of accuracy to facilitate analysis and prediction [Maggs et al. ’95]
• A model should provide clear, productive design incentives while providing strong messages to platform designers about the quality of characteristics required for efficient solution
• The development of a unifying paradigm also requires a somewhat unified and stable technological environment

Multicore Challenges

We would like a model that:
• Reflects the characteristics of the architecture
• Is relatively flexible
• Allows easy theoretical analysis
• Has a cost model linked to the programming model
• Is easy to learn
• Is easy to program
• Others? (parameter-oblivious?)
Multicore models

[Figure: multicore models placed between two poles, "Simple" and "Accurate"]

Low Degree PRAM (LoPRAM)

• p = O(log n) processors, MIMD mode
• Thread-based parallelism
• Scheduled with work-stealing
• Optimal speedups on a large class of divide-and-conquer and dynamic programming algorithms
• Works in practice
• Easy to program
• Message: if we design algorithms to obtain small speedups, design and programming become easy

[Dorrigiv, López-Ortiz, Salinger ’08]

Communication is key

• Parallel computing is as much about communicating data between processors as it is about partitioning computing load between processors [Pal]

It’s all about the cache

• Not only time complexity, also cache complexity: number of cache misses, parallel transfers
• Reducing misses can lead to overall faster running time even if processors are not fully utilized
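To make "cache complexity" concrete, the sketch below counts misses for two traversal orders of the same matrix. It is an illustration only, under stated assumptions: a single fully associative LRU cache, a row-major layout, and hypothetical sizes (`n`, `B`, `blocks`) that do not come from the slides.

```python
from collections import OrderedDict

def count_misses(addresses, cache_blocks, block_size):
    """Count misses of a fully associative LRU cache that holds `cache_blocks` blocks."""
    cache = OrderedDict()  # block id -> None, ordered from least to most recently used
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            cache[block] = None
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses

def row_major(n):
    """Addresses touched when an n x n row-major matrix is scanned row by row."""
    return [i * n + j for i in range(n) for j in range(n)]

def column_major(n):
    """Addresses touched when the same matrix is scanned column by column."""
    return [i * n + j for j in range(n) for i in range(n)]

n, B, blocks = 64, 8, 16  # hypothetical matrix side, block size, cache capacity in blocks
row_misses = count_misses(row_major(n), blocks, B)
col_misses = count_misses(column_major(n), blocks, B)
print(row_misses, col_misses)  # prints 512 4096
```

Both scans do the same amount of work, yet the column-wise scan misses on every access because each column touches more distinct blocks than the cache can hold.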


Cache models

[Figure: four cache organizations for a four-core machine with RAM: (1) one cache shared by all cores; (2) one private cache per core; (3) one private cache per core plus one shared cache; (4) one private cache per core plus two caches at the next level]

Parallel External Memory (PEM) model

• P synchronized processors
• Private memory of M words
• Blocks of size B words
• Measures:
  • Computational complexity: maximum number of memory accesses to cache
  • I/O complexity: number of parallel block transfers from memory

[Figure: four cores, each with a private memory of size M, connected to RAM]

[Arge, Goodrich, Nelson, Sitchinava ’08]

• Speedup over External Memory bounds
• Most algorithms are cache-aware
• PRAM style

Problems with PEM I/O complexity bounds:
• Sorting
• Weighted list ranking
• Euler tour
• Tree contraction
• Expression tree evaluation
• Lowest Common Ancestor (Q queries)
• Minimum Spanning Tree
• Connected and biconnected components
• Ear decomposition
• Line Segment Intersection Reporting (K output size)

[Arge, Goodrich, Sitchinava ’10] [Ajwani, Sitchinava, Zeh ’11]
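As a rough illustration of what a speedup over External Memory bounds looks like for sorting, the sketch below compares the sorting bounds usually quoted for the two models, with constants dropped: (N/B) log_{M/B}(N/B) sequentially and (N/(PB)) log_{M/B}(N/B) in PEM. The formulas are stated here as assumptions rather than taken from the slide, and the parameter values are hypothetical.

```python
import math

def em_sort_io(n, m, b):
    """Sequential external-memory sorting bound (n/b) * log_{m/b}(n/b), constants dropped."""
    return (n / b) * math.log(n / b, m / b)

def pem_sort_io(n, m, b, p):
    """PEM sorting bound (n/(p*b)) * log_{m/b}(n/b), constants dropped (assumed form)."""
    return (n / (p * b)) * math.log(n / b, m / b)

# Hypothetical input size, private cache size, block size (in words) and core count.
n, m, b, p = 2**30, 2**20, 2**6, 8
speedup = em_sort_io(n, m, b) / pem_sort_io(n, m, b, p)
print(speedup)
```

Under these assumed formulas the ratio is the number of processors, which is the best one can hope for when only the block transfers are parallelized.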

DAG model

• Nodes: tasks (with weights)
• Edges: dependencies (spawn, data, etc.)
• Work: total weight of all nodes
• Depth: weight of the heaviest path
• Example: Mergesort
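The work and depth of a task DAG can be computed directly from its weights and dependencies. The sketch below does so for a mergesort DAG built under simplifying assumptions that are not from the slides: each base case costs 1, each merge is sequential and costs the number of keys merged, and spawn overhead is ignored. The names `work_and_depth` and `mergesort_dag` are hypothetical.

```python
def work_and_depth(weights, deps):
    """weights: task -> cost; deps: task -> set of tasks it must wait for."""
    finish = {}  # task -> weight of the heaviest path ending at that task

    def finish_time(task):
        if task not in finish:
            finish[task] = weights[task] + max(
                (finish_time(d) for d in deps[task]), default=0)
        return finish[task]

    work = sum(weights.values())                      # total weight of all nodes
    depth = max(finish_time(t) for t in weights)      # heaviest path
    return work, depth

def mergesort_dag(n, weights, deps, name="root"):
    """Add the task DAG of mergesort on n keys; returns the name of its merge task."""
    if n == 1:
        weights[name], deps[name] = 1, set()          # base case: unit cost
        return name
    left = mergesort_dag(n // 2, weights, deps, name + "L")
    right = mergesort_dag(n - n // 2, weights, deps, name + "R")
    weights[name] = n                                 # sequential merge of n keys
    deps[name] = {left, right}                        # merge waits for both halves
    return name

weights, deps = {}, {}
mergesort_dag(8, weights, deps)
print(work_and_depth(weights, deps))  # prints (32, 15)
```

With a sequential merge the depth grows linearly in n, so the available parallelism (work divided by depth) is only logarithmic; a parallel merge would be needed to lower the depth.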


Schedulers

• It’s all about the scheduler
• Multithreaded computations with arbitrary dependencies can be impossible to schedule efficiently
• Restrict the computation
  • Fully strict computation: all data dependencies go to the thread’s parent
• Work-stealing


Schedulers: Work-Stealing

For any fully strict computation:
• Expected running time: T_1/P + O(T_∞), where T_1 is the work and T_∞ the depth
• Space: at most S_1 · P (S_1: min space for sequential computation)
• Expected communication (tight)

Good for private caches:
• Caches of size C, transfer time m, steal time s
• DAG consistency
• Nested-parallel

[Figure: four cores, each with a private cache of size C, connected to RAM]

[Acar, Blelloch, Blumofe ’02] [Blumofe, Leiserson ’94] [Blumofe, Frigo, Joerg, Leiserson, Randall ’96]
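The following toy simulation illustrates the work-stealing discipline: each processor works at the bottom of its own deque, and an idle processor tries to steal from the top of a random victim's deque. It is a sketch under strong assumptions (a complete binary tree of unit-cost tasks, no join nodes, one steal attempt per idle step) and not the scheduler analyzed in the cited papers; `simulate_work_stealing` is a hypothetical name.

```python
import random
from collections import deque

def simulate_work_stealing(tree_depth, num_procs, seed=0):
    """Simulate unit-cost tasks forming a complete binary tree.

    Returns (tasks_executed, steps, successful_steals).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_procs)]
    deques[0].append(tree_depth)   # a task is identified by the height of its subtree
    executed = steps = steals = 0
    while any(deques):
        steps += 1
        busy = [p for p in range(num_procs) if deques[p]]
        idle = [p for p in range(num_procs) if not deques[p]]
        for p in busy:
            height = deques[p].pop()             # owner takes from the bottom
            executed += 1
            if height > 0:                       # internal task spawns two children
                deques[p].append(height - 1)
                deques[p].append(height - 1)
        for p in idle:
            victim = rng.randrange(num_procs)    # uniformly random victim
            if victim != p and deques[victim]:
                deques[p].append(deques[victim].popleft())  # steal the oldest task
                steals += 1
    return executed, steps, steals

executed, steps, steals = simulate_work_stealing(tree_depth=10, num_procs=4)
print(executed, steps, steals)
```

In this model the number of steps can never drop below either the work divided by the number of processors or the length of the longest chain of tasks, which mirrors the two terms of the running-time bound.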

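Schedulers: Parallel Depth First

• Sequential computation on a cache of size C_1 incurs M_1 misses
• Shared cache of size C_p: if C_p ≥ C_1 + P·D (D: depth), a PDF schedule incurs at most M_1 misses
• The extra cache required is small when the number of parallel steps D is small

[Figure: four cores sharing a single cache of size C_p above RAM]

[Blelloch, Gibbons ’04]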

Schedulers

• Competing demands for private and shared caches
  • Work stealing can suffer shared cache misses
  • PDF can suffer private cache misses
• Multicore-cache model
  • Measures: time, L1-cache complexity, L2-cache complexity
• Controlled-PDF scheduler

[Figure: four cores, each with a private L1 cache, sharing one L2 cache above RAM]

[Blelloch, Chowdhury, Gibbons, Ramachandran, Chen, Kozuch ’08]

Schedulers: Controlled-PDF

• L2-supernodes: 1DF schedule
• L1-supernodes: PDF schedule
• User specifies a space usage function
• Only the scheduler knows about cache sizes
• Applies to a large class of divide-and-conquer algorithms
• Optimal speedups
• L1-cache complexity and L2-cache complexity within a constant factor of the sequential cache complexities


Cache obliviousness

• Parallel cache complexity can be bounded by sequential cache complexity, for private caches and for shared caches
• Holds for nested-parallel computations
• Bounds can be extended to multilevel shared or private hierarchies
• Idea: develop nested-parallel algorithms with low cache complexity and low depth

[Blelloch, Gibbons, Simhadri ’10]
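A small example of a nested-parallel, cache-oblivious algorithm is the classic recursive matrix transpose, sketched below. It never refers to a cache or block size, and the two recursive calls in each branch touch disjoint data, so they could be forked in parallel; here they simply run one after the other. The function names and the `base` cutoff are hypothetical choices for illustration, not taken from the cited work.

```python
def transpose_into(a, b, r0, r1, c0, c1, base=4):
    """Write the transpose of rows r0..r1-1, columns c0..c1-1 of `a` into `b`."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        for i in range(r0, r1):                  # small block: copy directly
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2                     # split the longer side in half
        transpose_into(a, b, r0, mid, c0, c1, base)   # independent subproblems:
        transpose_into(a, b, mid, r1, c0, c1, base)   # these could run in parallel
    else:
        mid = c0 + cols // 2
        transpose_into(a, b, r0, r1, c0, mid, base)
        transpose_into(a, b, r0, r1, mid, c1, base)

def cache_oblivious_transpose(a):
    """Return the transpose of a non-empty rectangular list-of-lists matrix."""
    n_rows, n_cols = len(a), len(a[0])
    b = [[None] * n_rows for _ in range(n_cols)]
    transpose_into(a, b, 0, n_rows, 0, n_cols)
    return b

matrix = [[i * 7 + j for j in range(7)] for i in range(5)]
print(cache_oblivious_transpose(matrix)[0])  # prints [0, 7, 14, 21, 28]
```

Because the recursion halves the problem until it fits in any cache level, the same code adapts to every level of a hierarchy, and the recursion depth stays logarithmic in the matrix dimensions.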

Low-depth cache-oblivious algorithms

Bounds on depth and cache complexity (cache size M, block size B) for:
• Sorting (randomized)
• List ranking
• Euler tour on trees
• Tree contraction
• Lowest Common Ancestor (k queries)
• Minimum Spanning Forest
• Connected components
• Sparse-matrix vector multiply (nonzeros, separators)


Resource Oblivious Algorithms - HM

• Hierarchical model HM
• Extension to multicore model
• Efficient oblivious algorithms for:
  • Matrix transposition
  • FFT
  • Sorting
  • Gaussian Elimination Paradigm
  • List ranking
  • Connected components
• Scheduler hints

[Figure: four cores, each with a private cache, two caches at the next level, and RAM]

[Chowdhury, Silvestri, Blakeley, Ramachandran ’10]

Multi-BSP

• d levels, each described by parameters (p_j, L_j, m_j, g_j)
  • p_j: number of components
  • L_j: synchronization cost
  • m_j: size of memory
  • g_j: data rate
• Level 0: cores
• Portable algorithms
• “Immortal algorithms”
• Optimal algorithms for matrix multiplication, FFT, and sorting
• L closer to latency than synchronization
• Prescriptive: e.g. support for synchronization operation

[Figure: a level-j component made of p_j level-(j-1) components and a memory of size m_j, with data rate g_j, shown next to a four-core cache hierarchy]

[Valiant ’08]
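A Multi-BSP machine description is just a list of per-level parameter tuples, which the sketch below writes down as a small data structure. The field comments follow the slide's wording; the example machine and all its numbers are hypothetical, and `total_cores` assumes that each level-j component contains p_j components of the level below, with level 0 being a single core.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Level:
    p: int    # number of components (of the level below) in one component of this level
    L: float  # synchronization cost
    m: int    # size of memory
    g: float  # data rate

def total_cores(levels):
    """Number of level-0 cores in a machine described by levels 1..d."""
    cores = 1
    for level in levels:
        cores *= level.p
    return cores

# Hypothetical two-level machine: 2 chips, each with 4 cores sharing a cache.
machine = [
    Level(p=4, L=10.0, m=32 * 1024, g=1.0),          # level 1: cores sharing a cache
    Level(p=2, L=100.0, m=8 * 1024 * 1024, g=4.0),   # level 2: chips sharing RAM
]
print(total_cores(machine))  # prints 8
```

An algorithm written against such a description takes the level parameters as inputs rather than hard-coding them, which is what makes it portable across machines.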

Models Summary

• Modeling parallel computation is hard
• Multicore architecture is constantly changing
• Cache should be part of the equation
• Maybe later: inter-processor communication, synchronization, energy


Models Summary

Good:
• No need to reinvent everything
• Large class of algorithms with good cache complexity for shared or private caches
• Some relatively simple design in terms of work, depth, and sequential cache complexity
• Parameters of the machine only known by the scheduler
• Cilk Plus: model, scheduler, tools widely available

Needs improvement:
• More algorithms or a scheduler with good shared and private cache complexities
• How to choose the scheduler?
• Theory needs to be accessible to the masses


Parallel training

• Current CS degree prepares for programming on an obsolete model
• Change of mentality:
  • Parallel thinking (algorithms, programming), but also
  • I/O complexity, locality of reference
  • Programming languages
• Right balance between practical skills and underlying theory?
• How to add new concepts without too much sacrifice?
• More specialized majors?

Final thoughts

• Constant factor speedup, opportunity for simplicity
• Use of more efficient, low-level algorithms where appropriate (library tools)
• Should we marry multicores? What’s the next thing?


Bibliography

U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3), 2002.

D. Ajwani, N. Sitchinava, and N. Zeh. I/O-optimal algorithms for orthogonal problems for private-cache chip multiprocessors. In IPDPS ’11, 2011.

L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In ACM SPAA ’08, 2008.

L. Arge, M. T. Goodrich, and N. Sitchinava. Parallel external memory graph algorithms. In IPDPS ’10, 2010.

G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In ACM-SIAM SODA ’08, 2008.

Bibliography (2)

G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In ACM SPAA ’04, 2004.

G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low-depth cache oblivious algorithms. In ACM SPAA ’10, 2010.

R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5), 1999.

R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In ACM SPAA ’96, 1996.

Bibliography (3)

R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In IEEE IPDPS ’10, 2010.

R. Cole and V. Ramachandran. Resource oblivious sorting on multicores. In ICALP ’10, 2010.

R. Dorrigiv, A. López-Ortiz, and A. Salinger. Optimal speedup on a low-degree multi-core parallel architecture (LoPRAM). In ACM SPAA ’08, 2008.

B. M. Maggs, L. R. Matheson, and R. E. Tarjan. Models of parallel computation: a survey and synthesis. In HICSS ’95, 1995.

L. G. Valiant. A bridging model for multicore computing. Journal of Computer and System Sciences, 2010.
