A Few Thoughts on Programming Models for Massively Parallel Systems
Bill Gropp and Rusty Lusk
Mathematics and Computer Science Division
www.mcs.anl.gov/~{gropp,lusk}
University of Chicago / Department of Energy
Application Realities
• The applications for massively parallel systems already exist
  Because they take years to write
• They are in a variety of models
  MPI
  Shared memory
  Vector
  Other
• Challenges include expressing massive parallelism and giving natural expression to spatial and temporal locality.
What is the hardest problem?
• (Overly simplistic statement): Program difficulty is directly related to the relative gap in latency and overhead
• The biggest relative gap is the remote (MPI) gap, right?
Memory Layer                 Access Time (cycles)    Relative
Register                     1                       1
Cache                        1–10                    10
DRAM Memory                  100 (now) +             10 or greater
Remote Memory (with MPI)     10000 (CPU limited)     10
Short Term
• Transition existing applications
  Compiler does it all
    Model: vectorizing compilers (with feedback to retrain the user)
  Libraries (component software does it all)
    Model: BLAS, CCA, "PETSc in PIM"
  Take MPI or MPI/OpenMP codes only
• Challenges
  Remember history: Cray vs. STAR-100 vs. attached processors
Mid Term
• Use variations or extensions of familiar languages
  E.g., Co-Array Fortran, UPC, OpenMP, HPF, Brook
• Issues:
  Local vs. global: where is the middle (for hierarchical algorithms)? (A small sketch of the bookkeeping follows this list.)
  Dynamic software (see libraries, CCA above); adaptive algorithms
  Support for modular or component-oriented software
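To make the local-vs-global issue concrete, here is a minimal sketch (C with MPI; the block distribution, problem size, and names are assumptions, not from the slides) of the index bookkeeping that global-view languages such as Co-Array Fortran and UPC aim to hide:

/* Sketch (assumptions: 1-D block distribution, global size divisible by
 * the number of ranks). Each rank owns a contiguous block of a conceptual
 * global array and must translate local indices to global ones by hand. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define GLOBAL_N 1024          /* hypothetical global problem size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_n = GLOBAL_N / size;          /* size of my block */
    int offset  = rank * local_n;           /* global index of my first element */
    double *x   = malloc(local_n * sizeof(double));

    /* The loop is local, but the values are defined in terms of the global
     * index; a global-view language would let the loop range over the whole
     * index space and leave the owner computation to the compiler/runtime. */
    for (int i = 0; i < local_n; i++)
        x[i] = (double)(offset + i);

    printf("rank %d owns global indices [%d, %d)\n", rank, offset, offset + local_n);
    free(x);
    MPI_Finalize();
    return 0;
}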
Long Term
• Performance
  How much can we shield the user from managing memory?
• Fault Tolerance
  Particularly the impact on data distribution strategies
• Debugging for performance and correctness
  Intel lessons: lock-out makes it difficult to perform post-mortems on parallel systems
Danger! Danger! Danger!
• Massively parallel systems are needed for hard, not easy, problems
  Programming models must make difficult problems possible; the focus must not be on making simple problems trivial.
  E.g., fast dense matrix-matrix multiply isn't a good measure of the suitability of a programming model.
Don’t Forget the 90/10 Rule
• 90% of the execution time is in 10% of the code
  A focus on performance emphasizes this 10%
• The other 90% of the effort goes into the other 90% of the code
  Modularity, expressivity, and maintainability are important here
Supporting the Writing of Correct Programs
• Deterministic algorithms should have an expression that is easy to prove is deterministic
  This doesn't mean enforcing a particular execution order or preventing the use of non-deterministic algorithms
• Races are just too hard to avoid
  Only "hero" programmers may be reliable
• Will we have "structured parallel programming"?
  Undisciplined access to shared objects is very risky
  Like goto, access to shared objects is both powerful and (as was pointed out about goto) can simplify programs
  The challenge repeated: what are the structured parallel programming constructs? (A small contrast is sketched below.)
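As an illustration of the contrast, not taken from the slides: a hedged C/OpenMP sketch in which the first loop updates a shared object with no discipline (a data race), while the second expresses the same sum as a structured reduction whose determinism is easy to argue (up to floating-point reassociation).

/* Sketch only: contrasts undisciplined shared access with a structured
 * (reduction) construct. Compile with: cc -fopenmp race_vs_reduction.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    double sum_racy = 0.0, sum_structured = 0.0;

    /* Undisciplined: every thread updates the shared variable directly.
     * The result depends on thread interleaving -- a data race. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        sum_racy += 1.0 / (i + 1);          /* RACE: unsynchronized update */

    /* Structured: the reduction clause names the shared object and the
     * combining operation, so the parallel expression is easy to reason
     * about and does not race. */
    #pragma omp parallel for reduction(+:sum_structured)
    for (int i = 0; i < N; i++)
        sum_structured += 1.0 / (i + 1);

    printf("racy = %f, structured = %f\n", sum_racy, sum_structured);
    return 0;
}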
Concrete Challenges for Programming Models for
Massively Parallel Systems
• Completeness of expression
  How many advanced and emerging algorithms do we exclude?
  How many legacy applications do we abandon?
• Fault tolerance
• Expressing (or avoiding) problem decomposition
• Correctness debugging
• Performance debugging
• I/O
• Networking
Completeness of Expression
• Can you efficiently implement MPI?
  No, MPI is not the best or even a great model for WIMPS. But ...
    It is well defined
    The individual operations are relatively simple
    Parallel implementation issues are relatively well understood
    MPI is designed for scalability (apps are already running on thousands of processors)
  Thus, any programming model should be able to implement MPI with a reasonable amount of effort. Consider MPI a "null test" of the power of a programming model. (A minimal example of what must be expressible is sketched below.)
• Side effect: gives insight into how to transition existing MPI applications onto massively parallel systems
• Gives some insight into the performance of many applications because it factors the problem into local and non-local performance issues.
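A hedged C sketch, not from the slides, of the kind of pattern the "null test" asks a candidate model to express with MPI-like cost: matched two-sided messages between arbitrary ranks plus a collective over all of them (here, a ring exchange and a global sum).

/* Sketch: the core pattern behind the "null test" -- matched point-to-point
 * messages (a ring exchange) and a collective (global reduction). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    double mine = (double)rank, from_left, total;

    /* Matched two-sided messages: send to the right neighbor, receive from
     * the left, without deadlock. */
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, right, 0,
                 &from_left, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Collective: global reduction across all ranks. */
    MPI_Allreduce(&mine, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d got %g from left, global sum %g\n", rank, from_left, total);
    MPI_Finalize();
    return 0;
}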
Fault Tolerance
• Do we require fault tolerance on every operation, or just on the application?
  Checkpoints vs. "reliable computing"
  Cost of fine- vs. coarse-grain guarantees
    Software and performance costs!
• What is the support for fault-tolerant algorithms?
  Coarse-grain (checkpoint) vs. fine-grain (transactions)
  Interaction with data decomposition
    Regular decompositions vs. turning off dead processors
  (A coarse-grain checkpoint sketch follows.)
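A hedged C/MPI sketch of the coarse-grain end of the spectrum: application-level checkpointing, where the state is written every few steps so a failed run can restart from the last complete set. The file names, interval, and one-file-per-rank layout are assumptions for illustration (and, as the later "Cautionary Tales" slide warns, one file per task does not itself scale).

/* Sketch of coarse-grain, application-level checkpointing (one file per
 * rank; names and interval are assumptions for illustration). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void checkpoint(const double *u, int n, int rank, int step)
{
    char name[64];
    snprintf(name, sizeof(name), "ckpt_rank%04d.dat", rank);
    FILE *f = fopen(name, "wb");
    if (!f) { perror("checkpoint"); MPI_Abort(MPI_COMM_WORLD, 1); }
    fwrite(&step, sizeof(int), 1, f);
    fwrite(u, sizeof(double), n, f);
    fclose(f);
}

int main(int argc, char **argv)
{
    int rank, n = 1000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double *u = calloc(n, sizeof(double));

    for (int step = 0; step < 100; step++) {
        for (int i = 0; i < n; i++) u[i] += 1.0;   /* stand-in for real work */
        if (step % 10 == 0) {
            /* Coarse grain: all ranks agree on the step, then each writes
             * its piece of the state; recovery restarts from the last set. */
            MPI_Barrier(MPI_COMM_WORLD);
            checkpoint(u, n, rank, step);
        }
    }
    free(u);
    MPI_Finalize();
    return 0;
}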
Problem Decomposition
• Decomposition-centric (e.g., data-centric) programming models
  Vectors and streams are examples
  Divide-and-conquer or recursive generation (Mou, Leiserson, many others)
  More freedom in storage association (e.g., blocking to natural memory sizes; padding to eliminate false sharing)
Problem Decomposition Approaches
• Very fine grain (i.e., ignore the decomposition)
  Individual words. Many think that this is the most general way.
    You build a fast UMA-PRAM and I'll believe it.
  Low overhead and latency tolerance require the discovery of significant independent work
• Special aggregates
  Vectors, streams, tasks (object-based decompositions)
• Implicit by user-visible specification
  E.g., recursive subdivision (see the sketch below)
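A hedged C sketch of the last approach: the program specifies only the recursive split (here, summing an array divide-and-conquer style), leaving the system freedom in how the independent halves are mapped to processors and memory; the cutoff value is an assumption.

/* Sketch of decomposition by recursive subdivision (cache-oblivious style):
 * the code states the recursive split rather than an explicit distribution. */
#include <stdio.h>
#include <stdlib.h>

#define CUTOFF 4096   /* assumed leaf size; tune to the memory hierarchy */

static double recursive_sum(const double *x, size_t n)
{
    if (n <= CUTOFF) {                 /* base case: small enough to do directly */
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += x[i];
        return s;
    }
    size_t half = n / 2;               /* subdivide; the halves are independent
                                          and could be done in parallel */
    return recursive_sum(x, half) + recursive_sum(x + half, n - half);
}

int main(void)
{
    size_t n = 1 << 20;
    double *x = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; i++) x[i] = 1.0;
    printf("sum = %f\n", recursive_sum(x, n));
    free(x);
    return 0;
}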
Application Kernels
• Are needed to understand and evaluate candidates
• Risks
  Not representative
  Over-simplified
    Implicit information exploited in the solution (give example)
  Under-simplified
    Too hard to work with
  Wrong evaluation metric
  Results are "fragile": small changes in the specification cause large changes in results
    Called "ill-posed" in numerical analysis
  Widely recognized: "the only real benchmark is your own application"
Example Application Kernels
• Bad:
  Dense matrix-matrix multiply
    Rarely a good algorithmic choice in practice
    Too easy (even if most compilers don't do a good job with it); a naive version is sketched after this list
  Fixed-length FFT
  Jacobi sweeps
• Getting better:
  Sparse matrix-vector multiply
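For reference (not from the slides), the "too easy" kernel is just three nested loops with completely regular access and obvious reuse, which is why it says little about a programming model's expressiveness:

/* Naive dense matrix-matrix multiply, C = A * B, square n x n matrices
 * in row-major order. Sketch for illustration only. */
#include <stddef.h>

void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];   /* regular, predictable access */
            C[i * n + j] = sum;
        }
}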
Reality Check
[Chart from ATLAS: hand-tuned code vs. compiler-generated code]
Enormous effort is required to get good performance.
Better Application Kernels
• Even better: sparse matrix assembly followed by matrix-vector multiply, on q of p processing elements; matrix elements are r x r blocks (a serial sketch of the multiply half appears below)
  Assembly: often a disproportionate amount of coding; stresses expressivity
  q < p: supports hierarchical algorithms
  Sparse matrix: many aspects of PDE simulation (explicit variable-coefficient problems, Krylov methods and some preconditioners, multigrid); r x r blocks are typical for real multi-component problems
  Freedoms: data structure for the sparse matrix representation (but bounded spatial overhead)
• Best: your description here (please!)
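A hedged, serial C sketch of the matrix-vector-multiply half of this kernel. The block-CSR layout below is one possible exercise of the "freedom" the slide leaves open, not a prescription, and the assembly phase and the q-of-p distribution are omitted.

/* Sketch: y = A*x where A is stored in a block-CSR format with r x r
 * dense blocks. The row_ptr/col_idx/vals layout is one possible choice. */
#include <stddef.h>

typedef struct {
    int     nb;        /* number of block rows (and block columns) */
    int     r;         /* block size */
    int    *row_ptr;   /* nb+1 entries: start of each block row in col_idx/vals */
    int    *col_idx;   /* block-column index of each stored block */
    double *vals;      /* blocks stored contiguously, r*r doubles each, row-major */
} BlockCSR;

void bcsr_matvec(const BlockCSR *A, const double *x, double *y)
{
    int r = A->r;
    for (int ib = 0; ib < A->nb; ib++) {
        double *yb = &y[ib * r];
        for (int i = 0; i < r; i++) yb[i] = 0.0;
        for (int k = A->row_ptr[ib]; k < A->row_ptr[ib + 1]; k++) {
            const double *blk = &A->vals[(size_t)k * r * r];
            const double *xb  = &x[A->col_idx[k] * r];
            for (int i = 0; i < r; i++)          /* y_block += blk * x_block */
                for (int j = 0; j < r; j++)
                    yb[i] += blk[i * r + j] * xb[j];
        }
    }
}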
Some Other Comments
• Is a general-purpose programming model needed?
  Domain-specific environments
    Combine languages, libraries, static and dynamic tools
    JIT optimization
  Tools to construct efficient special-purpose systems
• First steps in this direction
  OpenMP (warts like "lastprivate" and all; a small example follows)
  Name the newest widely accepted, non-derivative programming language
    Not T, Java, Visual Basic, Python
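For readers who have not met the "wart", a hedged C/OpenMP sketch of lastprivate: each thread gets a private copy of the variable, and the copy from the sequentially last iteration is written back to the shared one after the loop.

/* Sketch of OpenMP's lastprivate clause: 'last' is privatized per thread,
 * and the value from the sequentially final iteration (i == N-1) is copied
 * back to the shared variable when the loop ends. */
#include <omp.h>
#include <stdio.h>

#define N 100

int main(void)
{
    double last = 0.0;

    #pragma omp parallel for lastprivate(last)
    for (int i = 0; i < N; i++)
        last = i * 0.5;        /* after the loop, last == (N-1) * 0.5 */

    printf("last = %f\n", last);
    return 0;
}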
Challenges
• The Processor in Memory (PIM)
  Ignore the M(assive). How can we program the PIM?
    Implicitly adopts the hybrid model; pragmatic if ugly
• Supporting legacy applications
  Implementing MPI efficiently at large scale
    Reconsider SMP- and DSM-style implementations (many current implementations are immature)
• Supporting important classes of applications
  Don't pick a single model
    Recall Dan Reed's comment about losing half the users with each new architecture
  Explicitly make tradeoffs between features
    Massive virtualization vs. ruthless exploitation of compile-time knowledge
• Interacting with the OS
  Is the OS interface intrinsically nonscalable?
  Is the OS interface scalable, but only with heroic levels of implementation effort?
Scalable System Services
• 100000 independent tasks
  Are they truly independent? One property of related tasks is that the probability that a significant number will make the same (or any!) nonlocal system call (e.g., an I/O request) in the same time interval is >> random chance
• What is the programming model's role in
  Aggregating nonlocal operations? (A sketch follows.)
  Providing a framework in which it is natural to write programs that make scalable calls to system services?
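One hedged C/MPI sketch of aggregation at the programming-model level: instead of every task issuing the same metadata call, one rank issues it and broadcasts the result; the file name is a stand-in.

/* Sketch: aggregate a nonlocal system call. Rather than every rank calling
 * stat() on the same shared file, rank 0 does it once and broadcasts the
 * size -- one system call plus a scalable collective. */
#include <mpi.h>
#include <sys/stat.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    long long fsize = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        struct stat sb;
        if (stat("shared_input.dat", &sb) == 0)   /* hypothetical input file */
            fsize = (long long)sb.st_size;
    }
    MPI_Bcast(&fsize, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);

    printf("rank %d sees file size %lld\n", rank, fsize);
    MPI_Finalize();
    return 0;
}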
Cautionary Tales
• Timers. Application programmer uses gettimeofday to time program. Each thread uses this to generate profiling data.
• File systems. Some applications write one file/task (or one file/task/timestep), leading to zillions of files. How long does ls take? ls -lt? Don't forget, all of the names are almost identical (worst-case sorting?). A shared-file alternative is sketched after this list.
• Job startup. 100000 tasks start from their local executable, then all access a shared object (e.g., MPI_Init). What happens to the file system?
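A hedged C sketch (using MPI-IO, which exists for exactly this purpose) of the usual alternative to one file per task: every rank writes its block into a single shared file at an explicit offset, through a collective call that lets the implementation aggregate the requests. The file name and block size are assumptions.

/* Sketch: every rank writes its block of data to one shared file at an
 * explicit offset (no seek), using the collective MPI-IO interface so
 * requests can be merged by the implementation. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK 1024

int main(int argc, char **argv)
{
    int rank;
    double data[BLOCK];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < BLOCK; i++) data[i] = rank;   /* stand-in payload */

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* write_at semantics: the offset is part of the call, so there is no
     * shared file-pointer state; the _all form is collective. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
    MPI_File_write_at_all(fh, offset, data, BLOCK, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}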
New OS Semantics?
• Define value-return calls (e.g., file stat, gettimeofday) to allow on-the-fly aggregation
  Defensive move for the OS
  You can always write a nonscalable program
• Define state-update with scalable semantics
  Collective operations
  Thread safe
• Avoid seek; provide write_at, read_at (a POSIX analogue is sketched below)
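For the last point, POSIX already offers an analogue worth noting: pread/pwrite take the offset as an argument, so there is no seek and no shared file-pointer state to serialize on. A minimal hedged sketch; the file name is a stand-in.

/* Sketch: write_at / read_at semantics via POSIX pwrite/pread -- the
 * offset travels with the call, so no seek and no shared file position. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[16] = {0};
    int fd = open("example.dat", O_RDWR | O_CREAT, 0644);   /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    pwrite(fd, "hello", 5, 100);       /* write 5 bytes at offset 100 */
    pread(fd, buf, 5, 100);            /* read them back from the same offset */

    printf("read back: %.5s\n", buf);
    close(fd);
    return 0;
}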