Theoretical Modeling of Multicore Computation DIMACS Workshop on Parallelism: A 2020 Vision Alejandro Salinger University of Waterloo March 16, 2011


Theoretical Modeling of Multicore Computation

DIMACS Workshop on Parallelism: A 2020 Vision

Alejandro Salinger, University of Waterloo

March 16, 2011


Multicore Challenges

The purpose of modeling is to capture the salient characteristics of phenomena with clarity and the right degree of accuracy to facilitate analysis and prediction [Maggs et al. ‘95]

A model should provide clear, productive design incentives while providing strong messages to platform designers about the quality of characteristics required for efficient solution

The development of a unifying paradigm also requires a somewhat unified and stable technological environment


Multicore Challenges

We would like a model that:
• Reflects the characteristics of the architecture
• Relatively flexible
• Easy theoretical analysis
• Cost model linked to programming model
• Easy to learn
• Easy to program

Others? (parameter-oblivious?)


Multicore models

[Figure: labels "Simple" and "Accurate"]


Low Degree PRAM (LoPRAM)

• p = O(log n) processors, MIMD mode
• Thread-based parallelism
• Scheduled with work-stealing
• Optimal speedups on a large class of divide-and-conquer algorithms and dynamic programming
• Works in practice
• Easy to program
• Message: if we design algorithms to obtain small speedups, design and programming become easy

[Dorrigiv, López-Ortiz, Salinger ‘08]
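As a rough illustration of thread-based divide and conquer with a small number of processors, the sketch below forks threads only in the top log2(p) levels of a mergesort recursion and runs sequentially below that. The function name `parallel_mergesort` and the cut-off rule are assumptions made for this example, not the construction from the cited paper. In CPython the threads do not run in parallel because of the global interpreter lock, so the code shows the structure only.

```python
import math
import threading

def parallel_mergesort(a, p):
    """Sort a list, forking threads only in the top log2(p) recursion levels."""
    max_depth = max(0, int(math.log2(p)))

    def merge(left, right):
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i])
                i += 1
            else:
                out.append(right[j])
                j += 1
        return out + left[i:] + right[j:]

    def sort(xs, depth):
        if len(xs) <= 1:
            return xs
        mid = len(xs) // 2
        if depth < max_depth:
            # Fork: the left half runs in a new thread, the right half in this one.
            result = {}
            t = threading.Thread(
                target=lambda: result.update(left=sort(xs[:mid], depth + 1)))
            t.start()
            right = sort(xs[mid:], depth + 1)
            t.join()
            left = result["left"]
        else:
            # Below the cut-off the recursion is purely sequential.
            left = sort(xs[:mid], depth + 1)
            right = sort(xs[mid:], depth + 1)
        return merge(left, right)

    return sort(list(a), 0)

sorted_values = parallel_mergesort([5, 2, 9, 1, 5, 6], p=4)
```

With p = 4 the recursion forks at depths 0 and 1, so at most four branches are active at once.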


Communication is key

Parallel computing is as much about communicating data between processors as it is about partitioning computing load between processors [Pal]

It’s all about the cache

Not only time complexity, also cache complexity: number of cache misses, parallel transfers

Reducing misses can lead to overall faster running time even if processors are not fully utilized
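To make the cache-complexity measure concrete, the sketch below counts misses in a simulated fully associative LRU cache while an n × n matrix stored in row-major order is read by rows and then by columns. The cache size `M`, block size `B`, matrix size `n`, and all function names are illustrative assumptions, not values from the talk.

```python
from collections import OrderedDict

def count_misses(addresses, M, B):
    """Count misses of a fully associative LRU cache of M words in blocks of B words."""
    capacity = M // B              # number of blocks that fit in the cache
    cache = OrderedDict()          # block id -> None, least recently used first
    misses = 0
    for addr in addresses:
        block = addr // B
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return misses

def row_major(n):
    """Addresses touched when reading the matrix row by row."""
    return (i * n + j for i in range(n) for j in range(n))

def column_major(n):
    """Addresses touched when reading the same matrix column by column."""
    return (i * n + j for j in range(n) for i in range(n))

n, M, B = 64, 256, 8
row_misses = count_misses(row_major(n), M, B)
col_misses = count_misses(column_major(n), M, B)
```

Both traversals do the same amount of work, but the row traversal misses once per block while the column traversal touches more distinct blocks per column than the cache can hold.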


Cache models

[Figure: four diagrams, each showing Core 1 to Core 4 and RAM, with one, four, five, and six caches respectively]


Parallel External Memory (PEM)

• P synchronized processors
• Private memory of M words
• Blocks of size B words

Measures:
• Computational complexity: maximum memory accesses to cache
• I/O complexity: parallel block transfers from memory

[Figure: Core 1 to Core 4, each with a private memory of size M, connected to RAM]

[Arge, Goodrich, Nelson, Sitchinava ‘08]
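The I/O measure can be made tangible with a small calculator. The sketch below assumes that each of the P processors transfers one block of B words per parallel I/O step, so scanning N words takes ⌈N/(PB)⌉ steps, and it evaluates the leading term (N/(PB)) · log_{M/B}(N/B) of the sorting bound as it is usually quoted for the PEM model. Constant factors and the conditions under which the bound holds are omitted, and the function names are hypothetical.

```python
import math

def pem_scan_ios(N, P, B):
    """Parallel block transfers needed to read N words with P processors and block size B."""
    return math.ceil(N / (P * B))

def pem_sort_ios(N, P, M, B):
    """Leading term (N / (P*B)) * log_{M/B}(N / B) of the sorting bound, constants dropped."""
    return (N / (P * B)) * math.log(N / B, M / B)

# Example parameters, chosen only for illustration.
N, P, M, B = 2**20, 4, 2**10, 2**4
scan = pem_scan_ios(N, P, B)
sort_term = pem_sort_ios(N, P, M, B)
```

Setting P = 1 in both functions gives the corresponding sequential External Memory quantities, which makes the factor-P improvement visible.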


• Speedup over External Memory bounds
• Most algorithms are cache aware
• PRAM style

Problem / PEM - I/O complexity
• Sorting
• Weighted list ranking
• Euler tour
• Tree contraction
• Expression tree evaluation
• Lowest Common Ancestor (Q queries)
• Minimum Spanning Tree
• Connected and biconnected components
• Ear decomposition
• Line Segment Intersection Reporting (K output size)

[Arge, Goodrich, Sitchinava ‘10; Ajwani, Sitchinava, Zeh ‘11]


DAG model

• Nodes: tasks (with weights)
• Edges: dependencies (spawn, data, etc.)
• Work: total weight of all nodes
• Depth: weight of the heaviest path
• E.g. Mergesort
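Work and depth can be computed directly from a weighted DAG. The sketch below is a minimal example under assumed names (`work_and_depth`, the node labels, and the weights are all made up for illustration): it sums the node weights for the work and finds the heaviest dependency path for the depth, using a toy four-node mergesort DAG with a sequential merge.

```python
from functools import lru_cache

def work_and_depth(weights, edges):
    """Return (work, depth) of a weighted DAG.

    weights: dict mapping node -> weight
    edges: list of (u, v) pairs meaning task v depends on task u
    """
    preds = {v: [] for v in weights}
    for u, v in edges:
        preds[v].append(u)

    @lru_cache(maxsize=None)
    def finish(v):
        # Weight of the heaviest path ending at v, including v itself.
        return weights[v] + max((finish(u) for u in preds[v]), default=0)

    work = sum(weights.values())
    depth = max(finish(v) for v in weights)
    return work, depth

# Toy mergesort on 4 elements: split, sort both halves in parallel, then merge.
weights = {"split": 1, "sort_left": 2, "sort_right": 2, "merge": 4}
edges = [("split", "sort_left"), ("split", "sort_right"),
         ("sort_left", "merge"), ("sort_right", "merge")]
work, depth = work_and_depth(weights, edges)
```

The two recursive sorts contribute to the work twice but to the depth only once, which is where the available parallelism comes from.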


Schedulers

It’s all about the scheduler
• Multithreaded computations with arbitrary dependencies can be impossible to schedule efficiently
• Restrict computation
• Fully strict computation: all data dependencies go to thread’s parent
• Work-stealing

[Figure: Core 1 to Core 4]
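The work-stealing idea can be sketched with a small round-based simulation. Each worker owns a deque, pushes and pops its own tasks at the bottom, and when idle tries to steal from the top of a randomly chosen victim. This is a simplified model under stated assumptions: tasks take unit time, the computation is a binary fork tree with no join dependencies, and the function name and parameters are hypothetical.

```python
import random
from collections import deque

def simulate_work_stealing(depth, P, seed=0):
    """Simulate P workers running a binary fork tree of the given depth.

    Returns (steps, executed, steals).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(P)]
    deques[0].append(depth)          # a task is identified by its remaining depth
    executed = steals = steps = 0
    while any(deques):
        steps += 1
        for w in range(P):
            if deques[w]:
                task = deques[w].pop()           # owner works at the bottom
                executed += 1
                if task > 0:
                    deques[w].append(task - 1)   # spawn two children
                    deques[w].append(task - 1)
            else:
                victim = rng.randrange(P)
                if victim != w and deques[victim]:
                    # Thief takes the oldest task from the top of the victim's deque.
                    deques[w].append(deques[victim].popleft())
                    steals += 1
    return steps, executed, steals

steps, executed, steals = simulate_work_stealing(depth=10, P=4)
```

Stealing from the top tends to hand over large subtrees, so thieves need to steal only rarely relative to the total number of tasks.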


Schedulers: Work-Stealing

For any fully strict computation:
• Expected running time
• Space (relative to the minimum space for a sequential computation)
• Expected communication (tight)

Good for private caches:
• Caches of size C, transfer time m, steal time s
• DAG consistency
• Nested-parallel

[Figure: Core 1 to Core 4, each with a private cache of size C, connected to RAM]

[Acar, Blelloch, Blumofe ‘02] [Blumofe, Leiserson ‘94] [Blumofe, Frigo, Joerg, Leiserson, Randall ‘96]
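For reference, the work-stealing bounds for fully strict computations are usually stated as below. The notation follows the cited Blumofe and Leiserson paper and is an assumption about the symbols used on the slide: $T_1$ is the work, $T_\infty$ the depth, $P$ the number of processors, $S_1$ the minimum space of a sequential execution, $S_{\max}$ the size of the largest activation record, and $n_d$ the maximum number of times a thread synchronizes with its parent.

```latex
\begin{align*}
\mathbb{E}[T_P] &= T_1 / P + O(T_\infty) \\
S_P &\le S_1 \, P \\
\mathbb{E}[\text{communication}] &= O\bigl(P \, T_\infty \, (1 + n_d) \, S_{\max}\bigr)
\end{align*}
```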


Schedulers: Parallel Depth First

• Sequential computation: misses
• Shared cache of size Cp
• If parallel steps is small

[Figure: Core 1 to Core 4 sharing one cache of size Cp, connected to RAM]

[Blelloch, Gibbons ‘04]
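The shared-cache guarantee for Parallel Depth First scheduling is commonly stated in the following form. The symbols are assumptions made here for illustration: $M_1$ is the number of misses of the sequential depth-first execution with a cache of size $C_1$, $M_p$ the number of misses of the PDF schedule on $p$ processors with a shared cache of size $C_p$, and $d$ the depth of the computation DAG.

```latex
\[
C_p \;\ge\; C_1 + p \cdot d
\quad\Longrightarrow\quad
M_p \;\le\; M_1
\]
```

Read this way, a modest amount of extra shared cache (proportional to the number of processors times the depth) is enough to keep the parallel execution from missing more often than the sequential one.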


Schedulers

Competing demands for private and shared caches
• Work stealing can suffer shared cache misses
• PDF can suffer private cache misses

Multicore-Cache model
• Time
• L1-cache complexity
• L2-cache complexity

Controlled PDF-scheduler

[Figure: Core 1 to Core 4, each with an L1 cache, sharing one L2 cache, connected to RAM]

[Blelloch, Chowdhury, Gibbons, Ramachandran, Chen, Kozuch ‘08]


Schedulers: Controlled-PDF

• Two kinds of supernodes: one kind follows a 1DF schedule, the other a PDF schedule
• User specifies space usage function
• Only scheduler knows about cache sizes
• Large class of divide-and-conquer algorithms
• Optimal speedups
• L1-cache complexity and L2-cache complexity within constant factor of sequential cache complexities


Cache obliviousness

• Parallel cache complexity can be bounded by sequential cache complexity, for private caches and for shared caches
• For nested-parallel computations
• Bounds can be extended to multilevel shared or private hierarchies
• Idea: develop nested-parallel algorithms with low cache complexity and low depth

[Blelloch, Gibbons, Simhadri ‘10]
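The two bounds are usually written as follows. The notation is an assumption taken from the cited paper rather than from the slide: $Q(n; M, B)$ is the sequential cache complexity with cache size $M$ and block size $B$, $Q_p$ the parallel cache complexity on $p$ processors, and $D$ the depth of the computation. The private-cache bound is for a work-stealing scheduler and holds with high probability; the shared-cache bound is for a parallel depth-first scheduler.

```latex
\begin{align*}
\text{Private caches:} \quad & Q_p(n; M, B) < Q(n; M, B) + O\!\left(\frac{p \, M \, D}{B}\right) \\
\text{Shared cache:}   \quad & Q_p(n; M + p B D, B) \le Q(n; M, B)
\end{align*}
```

In both cases the extra term grows with the depth $D$, which is why low depth matters alongside low sequential cache complexity.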


Low-depth cache oblivious

Problem / Depth / Cache complexity (cache size M, block size B)
• Sorting (randomized)
• List ranking
• Euler tour on trees
• Tree contraction
• Lowest Common Ancestor (k queries)
• Minimum Spanning Forest
• Connected components
• Sparse-Matrix Vector Multiply (nonzeros, separators)


Resource Oblivious Algorithms - HM

• Hierarchical model HM
• Extension to multicore model
• Efficient oblivious algorithms for:
  matrix transposition, FFT, sorting, Gaussian Elimination Paradigm, list ranking, connected components
• Scheduler hints

[Figure: Core 1 to Core 4, six caches, and RAM]

[Chowdhury, Silvestri, Blakeley, Ramachandran ‘10]


Multi-BSP

d levels, each with parameters (pj, Lj, mj, gj):
• pj: number of components
• Lj: synchronization cost
• mj: size of memory
• gj: data rate

• Level 0: cores
• Portable algorithms
• “Immortal algorithms”
• Optimal algorithms for matrix multiplication, FFT, and sorting
• L closer to latency than synchronization
• Prescriptive: e.g. support for synchronization operation

[Figure: a level j component with memory mj, connected at data rate gj to level j-1 components numbered 1, 2, ..., pj; shown next to Core 1 to Core 4 with six caches and RAM]

[Valiant ‘08]
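The four parameters per level map naturally onto a small data structure. The sketch below is an illustration under assumptions: the class and function names are hypothetical, the example machine is made up, and the superstep charge (local work plus words moved times gj plus Lj) is the familiar BSP-style accounting, used here as an assumed simplification rather than the exact cost definition of the cited paper.

```python
from dataclasses import dataclass
from math import prod

@dataclass(frozen=True)
class Level:
    p: int      # number of level-(j-1) components inside one level-j component
    L: float    # synchronization cost
    m: int      # size of memory
    g: float    # data rate (cost per word exchanged with the level above)

def total_cores(levels):
    """Number of cores in the machine: the product of the p_j over all levels."""
    return prod(level.p for level in levels)

def superstep_cost(level, work, words):
    """Assumed BSP-style charge: local work + communication + synchronization."""
    return work + words * level.g + level.L

# Hypothetical two-level machine: 4 cores per chip, 2 chips sharing main memory.
machine = [
    Level(p=4, L=10.0, m=32 * 1024, g=1.0),
    Level(p=2, L=100.0, m=8 * 1024**2, g=4.0),
]
cores = total_cores(machine)
cost = superstep_cost(machine[0], work=1000, words=50)
```

Because an algorithm only sees the parameter tuples, the same code can be analyzed on a machine with more levels by appending entries to the list.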


Models Summary

• Modeling parallel computation is hard
• Multicore architecture constantly changing
• Cache should be part of the equation
• Maybe later inter-processor communication, synchronization, energy


Models Summary

Good:
• No need to reinvent everything
• Large class of algorithms with good cache complexity for shared or private caches
• Some relatively simple design in terms of work, depth, and sequential cache complexity
• Parameters of the machine only known by scheduler
• Cilk Plus: model, scheduler, tools widely available

Needs improvement:
• More algorithms or scheduler with good shared and private cache complexities
• How to choose the scheduler?
• Theory needs to be accessible to the masses


Parallel training

Current CS degree prepares for programming on obsolete model

Change of mentality:
• Parallel thinking (algorithms, programming), but also I/O complexity, locality of reference
• Programming languages

Right balance between practical skills and underlying theory?

How to add new concepts without too much sacrifice?

More specialized majors?


Final thoughts

• Constant factor speedup, opportunity for simplicity
• Use of more efficient, low-level algorithms where appropriate (library tools)
• Should we marry multicores? What’s the next thing?


Bibliography

U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3), 2002.

D. Ajwani, N. Sitchinava, and N. Zeh. I/O-optimal algorithms for orthogonal problems for private-cache chip multiprocessors. In IPDPS ’11, 2011.

L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In ACM SPAA ’08, 2008.

L. Arge, M. T. Goodrich, and N. Sitchinava. Parallel external memory graph algorithms. In IPDPS ’10, 2010.

G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In ACM-SIAM SODA ’08, 2008.


Bibliography (2)

G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In ACM SPAA ’04, 2004.

G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low-depth cache oblivious algorithms. In ACM SPAA ’10, 2010.

R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5), 1999.

R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In ACM SPAA ’96, 1996.


Bibliography (3)

R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In IEEE IPDPS ’10, 2010.

R. Cole and V. Ramachandran. Resource oblivious sorting on multicores. In ICALP ’10, 2010.

R. Dorrigiv, A. López-Ortiz, and A. Salinger. Optimal speedup on a low-degree multi-core parallel architecture (LoPRAM). In ACM SPAA ’08, 2008.

B. M. Maggs, L. R. Matheson, and R. E. Tarjan. Models of parallel computation: a survey and synthesis. In HICSS ’95, 1995.

L. G. Valiant. A bridging model for multicore computing. Journal of Computer and System Sciences, 2010.