Experiencing Cluster Computing, Class 1: Introduction to Parallelism
Post on 21-Dec-2015
TRANSCRIPT
Experiencing Cluster Computing
Class 1
Introduction to Parallelism
Outline
• Why Parallelism
• Types of Parallelism
• Drawbacks
• Concepts
• Starting Parallelization
• Simple Example
Why Parallelism
Why Parallelism – Passively
Suppose you are using the most efficient algorithm with an optimal implementation and the program still takes too long or does not even fit onto your machine?
Parallelization is the last chance.
Why Parallelism – Proactively

• Faster: finish the work earlier
– Same work in a shorter time
• Do more work
– More work in the same time
• Most importantly, you may want to predict the result before the event occurs
Examples

Many scientific and engineering problems require enormous computational power. A few fields to mention:
– Quantum chemistry, statistical mechanics, and relativistic physics
– Cosmology and astrophysics
– Computational fluid dynamics and turbulence
– Material design and superconductivity
– Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling
– Medicine, and modeling of human organs and bones
– Global weather and environmental modeling
– Machine vision
Parallelism
• The upper bound for the computing power that can be obtained from a single processor is limited by the fastest processor available at any given time.
• The upper bound for the computing power available can be dramatically increased by integrating a set of processors together.
• Synchronization and exchange of partial results among processors are therefore unavoidable.
Computer Architecture

Flynn's taxonomy defines 4 categories:
SISD: Single Instruction Single Data
SIMD: Single Instruction Multiple Data
MISD: Multiple Instruction Single Data
MIMD: Multiple Instruction Multiple Data
Computer Architecture

Processor organizations:
– SISD: uniprocessor (single-processor computer)
– SIMD: vector processor, array processor
– MISD: (rarely built in practice)
– MIMD: shared memory (multiprocessors: SMP, NUMA); distributed memory (multicomputers: clusters)
Parallel Computer Architecture

[Figure: Two parallel architectures. Shared memory (symmetric multiprocessors, SMP): control units CU 1..n issue instruction streams (IS) to processing units PU 1..n, which exchange data streams (DS) with a single shared memory. Distributed memory (cluster): CPUs 1..n, each with its own local memory (LM), communicate over an interconnecting network.]
Types of Parallelism
Parallel Programming Paradigms

• Multithreading (shared memory only)
– OpenMP
• Message passing (shared memory or distributed memory)
– MPI (Message Passing Interface)
– PVM (Parallel Virtual Machine)
Threads

• In computer programming, a thread is the placeholder information associated with a single use of a program that can serve multiple concurrent users.
• From the program's point of view, a thread is the information needed to serve one individual user or one particular service request.
• If multiple users are using the program, or concurrent requests from other programs occur, a thread is created and maintained for each of them.
• The thread lets the program know which user is being served as the program is alternately re-entered on behalf of different users.
Threads

• Programmer's view:
– Single CPU
– Single block of memory
– Several threads of action
• Parallelization
– Done by the compiler

[Figure: Fork-Join model: the master thread forks a team of parallel threads (1, 2, 3, 4) at the start of a parallel region and joins them at its end.]
Shared Memory

• Programmer's view:
– Several CPUs
– Single block of memory
– Several threads of action
• Parallelization
– Done by the compiler
• Example
– OpenMP

[Figure: A single-threaded process executes P1, P2, P3 in sequence; a multi-threaded process runs P1, P2, P3 concurrently as threads, with data exchange via shared memory.]
Multithreaded Parallelization

[Figure: The master thread forks a team of parallel threads at each !$OMP PARALLEL directive (parallel regions 1 and 2) and joins them at the matching !$OMP END PARALLEL.]
Distributed Memory

• Programmer's view:
– Several CPUs
– Several blocks of memory
– Several threads of action
• Parallelization
– Done by hand
• Example
– MPI

[Figure: Serial execution runs P1, P2, P3 within a single process; with message passing, processes 0, 1 and 2 each run one part and exchange data via the interconnection.]
Drawbacks
Drawbacks of Parallelism

• Traps
– Deadlocks
– Process synchronization
• Programming effort
– Few tools support automated parallelization and debugging
• Task distribution (load balancing)
Deadlock
• The earliest computer operating systems ran only one program at a time.
• All of the resources of the system were available to this one program.
• Later, operating systems ran multiple programs at once, interleaving them.
• Programs were required to specify in advance what resources they needed so that they could avoid conflicts with other programs running at the same time.
• Eventually some operating systems offered dynamic allocation of resources. Programs could request further allocations of resources after they had begun running. This led to the problem of the deadlock.
Deadlock

• Parallel tasks require resources to accomplish their work. If the resources are not available, the work cannot be finished. Each resource can be locked (controlled) by exactly one task at any given point in time.
• Consider the situation:
– Two tasks need the same two resources.
– Each task manages to gain control over just one resource, but not the other.
– Neither task releases the resource that it already holds.
• This is called a deadlock, and the program will not terminate.
Deadlock

[Figure: Two processes each hold one resource while waiting for the resource held by the other, forming a cycle.]
Dining Philosophers

• Each philosopher either thinks or eats.
• In order to eat, he requires two forks.
• Each philosopher tries to pick up the right fork first.
• If successful, he waits for the left fork to become available.
→ Deadlock: if every philosopher holds his right fork, none can ever obtain a left fork.
Dining Philosophers Demo

• Problem
– http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/deadlock/Diners.htm
• Solution
– http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/deadlock/FixedDiners.htm
Concepts
Speedup

Given a fixed problem size:

TS: sequential wall-clock execution time (in seconds)
TN: parallel wall-clock execution time using N processors (in seconds)

speedup = TS / TN

Ideally, speedup = N (linear speedup).
Speedup

• Absolute speedup: sequential time on 1 processor / parallel time on N processors
• Relative speedup: parallel time on 1 processor / parallel time on N processors
• They differ because the parallel code on 1 processor carries unnecessary MPI overhead
– It may be slower than the sequential code on 1 processor
Parallel Efficiency

Efficiency is a measure of processor utilization in a parallel program, relative to the serial program.

Parallel efficiency E: speedup per processor

E = speedup / N = TS / (N × TN)

Ideally, E = 1.
Amdahl's Law

It states that the potential program speedup is defined by the fraction of code f that can be parallelized:

speedup = 1 / (1 − f)

If none of the code can be parallelized, f = 0 and the speedup = 1 (no speedup). If all of the code is parallelized, f = 1 and the speedup is infinite (in theory).
Amdahl's Law

Introducing the number of processors N performing the parallel fraction of work, the relationship can be modeled by:

speedup = 1 / (S + P/N)

where P is the parallel fraction, S the serial fraction (S + P = 1), and N the number of processors.
Amdahl's Law

As N → ∞:

lim (N → ∞) speedup = 1/S

Interpretation: no matter how many processors are used, the upper bound on the speedup is determined by the sequential section.
Amdahl's Law – Example

If the sequential section of a program amounts to 5% of the run time, then S = 0.05 and hence:

speedup ≤ 1 / 0.05 = 20
Behind Amdahl's Law
1. How much faster can a given problem be solved?
2. Which problem size can be solved on a parallel machine in the same time as on a sequential one? (Scalability)
Starting Parallelization
Parallelization – Option 1

• Starting from an existing, sequential program
– Easy on shared memory architectures (OpenMP)
– Potentially adequate for a small number of processes (moderate speed-up)
– Does not scale to a large number of processes
– Restricted to trivially parallel problems on distributed memory machines
Parallelization – Option 2

• Starting from scratch
– Not popular, but often inevitable
– Needs a new program design
– Increased complexity (data distribution)
– Widely applicable
– Often the best choice for large-scale problems
Goals for Parallelization

• Avoid or reduce
– Synchronization
– Communication
• Try to maximize the computationally intensive sections.
Simple Example
Summation

Given an N-dimensional vector of type integer:

// Initialization
for (int i = 0; i < len; i++)
    vec[i] = i * i;

// Sum calculation
for (int i = 0; i < len; i++)
    sum += vec[i];
Parallel Algorithm

1. Divide the vector into parts
2. Each CPU initializes its own part
3. Use a global reduction to calculate the sum of the vector
OpenMP

Compiler directives (#pragma omp) are inserted to tell the compiler to perform parallelization. The compiler is then responsible for automatically parallelizing certain types of loops.

#pragma omp parallel for
for (int i = 0; i < len; i++)
    vec[i] = i * i;

#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < len; i++)
    sum += vec[i];
MPI

// In each process, do the initialization
for (int i = rank; i < len; i += np)
    vec[i] = i * i;
// Calculate the local sum
for (int i = rank; i < len; i += np)
    localsum += vec[i];
// Perform the global reduction
MPI_Reduce(&localsum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

[Figure: With np = 3 processes, ranks 0, 1 and 2 each initialize and sum an interleaved part of vec; MPI_Reduce combines the local sums into sum on rank 0.]
END