Page 1: A Few Thoughts on Programming Models for Massively Parallel Systems

Bill Gropp and Rusty Lusk
Mathematics and Computer Science Division
www.mcs.anl.gov/~{gropp,lusk}

Page 2: Application Realities

• The applications for massively parallel systems already exist, because applications take years to write.

• They are written in a variety of models: MPI, shared memory, vector, and others.

• Challenges include expressing massive parallelism and giving natural expression to spatial and temporal locality.

Page 3: What is the hardest problem?

• (Overly simplistic statement) Program difficulty is directly related to the relative gap in latency and overhead.

• The biggest relative gap is the remote (MPI) gap, right?

Memory Layer               Access Time (cycles)    Relative
Register                   1                       1
Cache                      1–10                    10
DRAM Memory                100 (now) +             10 or greater
Remote Memory (with MPI)   10000 (CPU limited)     10

Page 4: Short Term

• Transition existing applications:
  - Compiler does it all. Model: vectorizing compilers (with feedback to retrain the user).
  - Libraries (component software does it all). Model: BLAS, CCA, "PETSc in PIM".
  - Take MPI or MPI/OpenMP codes only.

• Challenges: remember history (Cray vs. STAR-100 vs. Attached Processors).

Page 5: Mid Term

• Use variations or extensions of familiar languages

• E.g., Co-Array Fortran, UPC, OpenMP, HPF, Brook

• Issues:

• Local vs. global. Where is the middle (for hierarchical algorithms)?

• Dynamic software (see libraries, CCA above); adaptive algorithms.

• Support for modular or component-oriented software.

Page 6: Long Term

• Performance: how much can we shield the user from managing memory?

• Fault tolerance: particularly the impact on data distribution strategies

• Debugging for performance and correctness. Intel lessons: lock-out makes it difficult to perform post-mortems on parallel systems.

Page 7: Danger! Danger! Danger!

• Massively parallel systems are needed for hard problems, not easy ones. Programming models must make difficult problems possible; the focus must not be on making simple problems trivial.
  - E.g., fast dense matrix-matrix multiply isn't a good measure of the suitability of a programming model.

Page 8: Don't Forget the 90/10 Rule

• 90% of the execution time is spent in 10% of the code. The performance focus emphasizes this 10%.

• The other 90% of the effort goes into the other 90% of the code. Modularity, expressivity, and maintainability are important here.

Page 9: Supporting the Writing of Correct Programs

• Deterministic algorithms should have an expression that is easy to prove deterministic. This doesn't mean enforcing a particular execution order or preventing the use of non-deterministic algorithms.

• Races are just too hard to avoid; only "hero" programmers may be reliable.

• Will we have "structured parallel programming"?
  - Undisciplined access to shared objects is very risky (see the sketch below)
  - Like goto, access to shared objects is both powerful and (as was pointed out about goto) able to simplify programs
  - The challenge, repeated: what are the structured parallel programming constructs?
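As a concrete illustration of the risk (a sketch in C with OpenMP, not taken from the slides; the loop and variable names are invented): the first loop updates a shared object without discipline and races, while the reduction clause is a "structured" construct whose behavior is easy to reason about.

```c
#include <stdio.h>

/* Illustrative sketch only: an unstructured update to a shared object races,
 * while the structured reduction form names the pattern and avoids the race.
 * Assumes OpenMP as the shared-memory model. */
int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* Risky: every thread updates the shared variable directly.
     * The final value depends on how the updates interleave. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1);           /* data race */

    /* Structured: the reduction clause states the shared-update pattern,
     * so the runtime can implement it without a race. */
    double sum2 = 0.0;
    #pragma omp parallel for reduction(+:sum2)
    for (int i = 0; i < n; i++)
        sum2 += 1.0 / (i + 1);

    printf("racy sum = %f, reduction sum = %f\n", sum, sum2);
    return 0;
}
```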

Page 10: Concrete Challenges for Programming Models for Massively Parallel Systems

• Completeness of expression: how many advanced and emerging algorithms do we exclude? How many legacy applications do we abandon?

• Fault tolerance

• Expressing (or avoiding) problem decomposition

• Correctness debugging

• Performance debugging

• I/O

• Networking

Page 11: Completeness of Expression

• Can you efficiently implement MPI? No, MPI is not the best or even a great model for WIMPS. But:
  - It is well defined
  - The individual operations are relatively simple (see the sketch below)
  - Parallel implementation issues are relatively well understood
  - MPI is designed for scalability (applications are already running on thousands of processors)
  Thus, any programming model should be able to implement MPI with a reasonable amount of effort. Consider MPI a "null test" of the power of a programming model.

• Side effect: gives insight into how to transition existing MPI applications onto massively parallel systems

• Gives some insight into the performance of many applications, because it factors the problem into local and non-local performance issues
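For reference, a sketch of the kind of simple, well-defined MPI operation a candidate model would need to express efficiently (illustrative only, not from the slides; the periodic nearest-neighbor exchange and its names are invented):

```c
#include <mpi.h>
#include <stdio.h>

/* Illustrative sketch: a periodic nearest-neighbor exchange, one of the
 * simple, well-understood operations any candidate model should be able
 * to express and implement efficiently. */
int main(int argc, char **argv)
{
    int rank, size, left, right;
    double send, recv_left, recv_right;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    send  = (double)rank;
    left  = (rank - 1 + size) % size;   /* periodic neighbors */
    right = (rank + 1) % size;

    /* Exchange one value with each neighbor; MPI_Sendrecv avoids deadlock. */
    MPI_Sendrecv(&send, 1, MPI_DOUBLE, right, 0,
                 &recv_left, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&send, 1, MPI_DOUBLE, left, 1,
                 &recv_right, 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %f from left, %f from right\n",
           rank, recv_left, recv_right);
    MPI_Finalize();
    return 0;
}
```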

Page 12: Fault Tolerance

• Do we require fault tolerance on every operation, or just on the application?
  - Checkpoints vs. "reliable computing"
  - Cost of fine- vs. coarse-grain guarantees: both software and performance costs!

• What is the support for fault-tolerant algorithms? (A minimal checkpoint sketch follows below.)
  - Coarse-grain (checkpoint) vs. fine-grain (transactions)
  - Interaction with data decomposition: regular decompositions vs. turning off dead processors
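A minimal sketch of the coarse-grain end of that spectrum (application-level checkpointing; not from the slides, and the file name, interval, and state layout are invented for illustration):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch: coarse-grain, application-level checkpointing.
 * The state is saved every CHECK_INTERVAL iterations; after a failure the
 * run restarts from the most recent checkpoint, not from the beginning. */
#define N 1024
#define CHECK_INTERVAL 100

static void checkpoint(const double *state, int n, int step)
{
    FILE *f = fopen("checkpoint.dat", "wb");   /* hypothetical file name */
    if (!f) { perror("checkpoint"); exit(1); }
    fwrite(&step, sizeof step, 1, f);          /* save the progress marker */
    fwrite(state, sizeof *state, n, f);        /* and the coarse-grain state */
    fclose(f);
}

int main(void)
{
    double state[N];
    memset(state, 0, sizeof state);

    for (int step = 0; step < 1000; step++) {
        for (int i = 0; i < N; i++)            /* stand-in for the real update */
            state[i] += 1.0;

        if (step % CHECK_INTERVAL == 0)        /* coarse-grain guarantee only */
            checkpoint(state, N, step);
    }
    return 0;
}
```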

Page 13: Problem Decomposition

• Decomposition-centric (e.g., data-centric) programming models
  - Vectors and streams are examples
  - Divide-and-conquer or recursive generation (Mou, Leiserson, many others)
  - More freedom in storage association (e.g., blocking to natural memory sizes; padding to eliminate false sharing, as in the sketch below)
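As one hypothetical illustration of that storage freedom (not from the slides; the 64-byte cache-line size and the counter layout are assumptions): padding per-thread data out to a cache line removes the false sharing that a packed layout would suffer.

```c
#include <stdio.h>

/* Illustrative sketch: packed per-thread counters share a cache line and
 * false-share; padding each counter to a full line eliminates that.
 * The 64-byte line size is an assumption. */
#define NTHREADS  8
#define LINE_SIZE 64

struct packed_counters {          /* counters land in the same cache line */
    long count[NTHREADS];
};

struct padded_counter {           /* each counter gets its own cache line */
    long count;
    char pad[LINE_SIZE - sizeof(long)];
};

int main(void)
{
    struct packed_counters packed = {{0}};
    struct padded_counter  padded[NTHREADS] = {{0}};

    printf("packed counters are %ld bytes apart\n",
           (long)((char *)&packed.count[1] - (char *)&packed.count[0]));
    printf("padded counters are %ld bytes apart\n",
           (long)((char *)&padded[1].count - (char *)&padded[0].count));
    return 0;
}
```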

Page 14: Problem Decomposition Approaches

• Very fine grain (i.e., ignore the decomposition): individual words. Many think that this is the most general way; you build a fast UMA-PRAM and I'll believe it. Low overhead and latency tolerance require the discovery of a significant amount of independent work.

• Special aggregates: vectors, streams, tasks (object-based decompositions)

• Implicit, via a user-visible specification: e.g., recursive subdivision (sketched below)
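A sketch of what decomposition by recursive subdivision can look like in code (illustrative only, not from the slides; the leaf size is an assumed tuning parameter):

```c
#include <stdio.h>

/* Illustrative sketch: the decomposition is expressed implicitly by
 * recursive subdivision.  The range is split until it fits a "natural"
 * block size; the leaves are the units a runtime could schedule in
 * parallel or block for the memory hierarchy. */
#define LEAF 1024

static double block_sum(const double *x, long lo, long hi)
{
    if (hi - lo <= LEAF) {                 /* base case: cache-sized block */
        double s = 0.0;
        for (long i = lo; i < hi; i++)
            s += x[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2;         /* recursive generation of subproblems */
    return block_sum(x, lo, mid) + block_sum(x, mid, hi);
}

int main(void)
{
    static double x[1 << 16];
    long n = (long)(sizeof x / sizeof x[0]);
    for (long i = 0; i < n; i++)
        x[i] = 1.0;
    printf("sum = %f\n", block_sum(x, 0, n));
    return 0;
}
```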

Page 15: Application Kernels

• Application kernels are needed to understand and evaluate candidate programming models.

• Risks:
  - Not representative
  - Over-simplified: implicit information gets exploited in the solution (give example)
  - Under-simplified: too hard to work with
  - Wrong evaluation metric
  - Results are "fragile": small changes in the specification cause large changes in the results (called "ill-posed" in numerical analysis)

• Widely recognized: "the only real benchmark is your own application."

Page 16: Example Application Kernels

• Bad:
  - Dense matrix-matrix multiply
    - Rarely a good algorithmic choice in practice
    - Too easy (even if most compilers don't do a good job with it)
  - Fixed-length FFT
  - Jacobi sweeps

• Getting better: sparse matrix-vector multiply

Page 17: Reality Check

[Figure from ATLAS: performance of compiler-generated vs. hand-tuned code]

Enormous effort is required to get good performance.

Page 18: Better Application Kernels

• Even better: sparse matrix assembly followed by matrix-vector multiply, on q of p processing elements, where the matrix elements are r x r blocks (a minimal matvec sketch follows below)
  - Assembly: often a disproportionate amount of the coding; stresses expressivity
  - q < p: supports hierarchical algorithms
  - Sparse matrix: covers many aspects of PDE simulation (explicit variable-coefficient problems, Krylov methods and some preconditioners, multigrid); r x r blocks are typical for real multi-component problems
  - Freedoms: the data structure used for the sparse matrix representation (but with bounded spatial overhead)

• Best: your description here (please!)
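To make the kernel concrete, here is a minimal sketch of the matrix-vector multiply part for a block-CSR representation with r = 2 (illustrative only, not from the slides; the storage format, sizes, and values are assumptions, and the assembly and q-of-p aspects are omitted):

```c
#include <stdio.h>

/* Illustrative sketch: sparse matrix-vector multiply with a block-CSR
 * matrix of r x r dense blocks (here r = 2).  The point of the kernel is
 * the irregular access pattern plus the small dense blocks. */
#define R  2                    /* block size r */
#define NB 3                    /* number of block rows/columns */

int main(void)
{
    /* Block-CSR for a 3x3 block matrix with nonzero blocks
     * (0,0), (0,2), (1,1), (2,0), (2,2). */
    int    row_ptr[NB + 1] = {0, 2, 3, 5};
    int    col_idx[5]      = {0, 2, 1, 0, 2};
    double blocks[5][R][R] = {
        {{4, 1}, {1, 4}}, {{1, 0}, {0, 1}}, {{5, 2}, {2, 5}},
        {{1, 0}, {0, 1}}, {{3, 1}, {1, 3}},
    };
    double x[NB * R] = {1, 2, 3, 4, 5, 6};
    double y[NB * R] = {0};

    for (int ib = 0; ib < NB; ib++)                 /* block row */
        for (int k = row_ptr[ib]; k < row_ptr[ib + 1]; k++) {
            int jb = col_idx[k];                    /* block column */
            for (int i = 0; i < R; i++)             /* dense r x r block */
                for (int j = 0; j < R; j++)
                    y[ib * R + i] += blocks[k][i][j] * x[jb * R + j];
        }

    for (int i = 0; i < NB * R; i++)
        printf("y[%d] = %f\n", i, y[i]);
    return 0;
}
```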

Page 19: Some Other Comments

• Is a general-purpose programming model needed? Consider domain-specific environments:
  - Combine languages, libraries, and static and dynamic tools
  - JIT optimization
  - Tools to construct efficient special-purpose systems

• First steps in this direction: OpenMP (warts like "lastprivate" and all)

• Name the newest widely accepted, non-derivative programming language
  - Not T, Java, Visual Basic, or Python

Page 20: Challenges

• The processor in memory (PIM): ignore the M(assive). How can we program the PIM?
  - Implicitly adopts the hybrid model; pragmatic, if ugly

• Supporting legacy applications: implementing MPI efficiently at large scale
  - Reconsider SMP and DSM-style implementations (many current implementations are immature)

• Supporting important classes of applications: don't pick a single model
  - Recall Dan Reed's comment about losing half the users with each new architecture
  - Explicitly make tradeoffs between features: massive virtualization vs. ruthless exploitation of compile-time knowledge

• Interacting with the OS: is the OS interface intrinsically nonscalable? Or is it scalable, but only with heroic levels of implementation effort?

Page 21: Scalable System Services

• 100000 independent tasks: are they truly independent? One property of related tasks is that the probability that a significant number of them will make the same (or any!) nonlocal system call (e.g., an I/O request) in the same time interval is far greater than random chance.

• What is the programming model's role in:
  - Aggregating nonlocal operations? (A minimal sketch follows below.)
  - Providing a framework in which it is natural to write programs that make scalable calls to system services?
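One hypothetical way to aggregate a nonlocal, value-return operation (a sketch, not from the slides; time() merely stands in for an arbitrary system service such as a file stat):

```c
#include <mpi.h>
#include <stdio.h>
#include <time.h>

/* Illustrative sketch: instead of every task hitting the system service,
 * one rank makes the nonlocal call and the result is distributed with a
 * scalable collective. */
int main(int argc, char **argv)
{
    int rank;
    long long stamp = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        stamp = (long long)time(NULL);      /* the single nonlocal call */

    /* Every task gets the value via the collective, not by repeating the call. */
    MPI_Bcast(&stamp, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);

    printf("rank %d sees timestamp %lld\n", rank, stamp);
    MPI_Finalize();
    return 0;
}
```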

Page 22: Cautionary Tales

• Timers. The application programmer uses gettimeofday to time the program, and each thread uses it to generate profiling data.

• File systems. Some applications write one file per task (or one file per task per timestep), leading to zillions of files. How long does ls take? ls -lt? Don't forget, all of the names are almost identical (worst-case sorting?).

• Job startup. 100000 tasks start from their local executable, then all access a shared object (e.g., MPI_Init). What happens to the file system?

Page 23: New OS Semantics?

• Define value-return calls (e.g., file stat, gettimeofday) to allow on-the-fly aggregation
  - A defensive move for the OS: you can always write a nonscalable program

• Define state-update calls with scalable semantics
  - Collective operations
  - Thread safe

• Avoid seek; provide write_at and read_at (see the MPI-IO sketch below)
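MPI-IO already follows this pattern; as a sketch (illustrative, not from the slides; the file name and block size are invented), each task writes at an explicit offset with a collective call, so no seek is needed and the implementation can aggregate the requests:

```c
#include <mpi.h>
#include <stdio.h>

/* Illustrative sketch: seek-free, collective I/O in the spirit of
 * "avoid seek, provide write_at/read_at".  Each task writes its block at
 * an explicit offset; the _all form is collective, so the file system sees
 * an aggregated request rather than many independent seek+write pairs. */
#define BLOCK 1024

int main(int argc, char **argv)
{
    int rank;
    double buf[BLOCK];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < BLOCK; i++)
        buf[i] = (double)rank;              /* each task's local block */

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* An explicit offset replaces seek; collective form enables aggregation. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```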