Intro Parallel Processing 566
TRANSCRIPT
-
Introduction to Parallel Processing
Shantanu Dutt
University of Illinois at Chicago
-
Acknowledgements
Ashish Agrawal, IIT Kanpur, Fundamentals of Parallel Processing (slides), w/ some modifications and augmentations by Shantanu Dutt
John Urbanic, Parallel Computing: Overview (slides), w/ some modifications and augmentations by Shantanu Dutt
John Mellor-Crummey, COMP 422 Parallel Computing: An Introduction, Department of Computer Science, Rice University (slides), w/ some modifications and augmentations by Shantanu Dutt
-
Outline
Moore's Law and its limits
Different uni-processor performance enhancement techniques and their limits
Classification of parallel computations
Classification of parallel architectures - Distributed and Shared memory
Simple examples of parallel processing
Example applications
Future advances
Summary
Some text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur
-
Moore's Law & Need for Parallel Processing
Chip performance doubles every 18-24 months
Power consumption is prop. to freq.
Limits of serial computing:
Heating issues
Limit to transmission speeds
Leakage currents
Limit to miniaturization
Multi-core processors already commonplace.
Most high-performance servers already parallel.
-
Quest for Performance
Pipelining
Superscalar Architecture
Out of Order Execution
Caches
Instruction Set Design Advancements
Parallelism:
Multi-core processors
Clusters
Grid
This is the future
-
Top text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur
Pipelining
Illustration of pipelining using the fetch, load, execute, store stages.
At the start of execution: wind-up.
At the end of execution: wind-down.
Pipeline stalls due to data dependency (RAW, WAR), resource conflict, or incorrect branch prediction hit performance and speedup.
Pipeline depth: no. of instructions in execution simultaneously.
Intel Pentium 4: 35 stages.
-
Pipelining
Tpipe(n), the pipelined time to process n instructions, = fill time + n*max{ti}, where ti = exec. time of the ith stage and fill time = (k-1)*max{ti} for a k-stage pipeline (the time for the first instruction to reach the last stage)
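To make the formula concrete, here is a small sketch (not from the slides; the stage count, stage latency, and instruction count are assumed) comparing serial and pipelined execution time in C:

#include <stdio.h>

/* Compare serial vs. pipelined time for n instructions on a k-stage
   pipeline whose slowest stage takes t_max time units per cycle. */
int main(void) {
    int k = 5;            /* pipeline depth (stages) */
    double t_max = 1.0;   /* cycle time = max stage latency, in ns */
    int n = 1000;         /* number of instructions */

    double t_serial = (double)n * k * t_max;        /* one instr. at a time */
    double t_pipe = (k - 1) * t_max + n * t_max;    /* fill time + n cycles */

    printf("serial: %.0f ns, pipelined: %.0f ns, speedup: %.2f\n",
           t_serial, t_pipe, t_serial / t_pipe);    /* speedup -> k as n grows */
    return 0;
}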
-
Cache
Desire for fast, cheap, and non-volatile memory
Memory speed grows at 7% per annum while processor speed grows at 50% p.a.
Cache: fast small memory. L1 and L2 caches.
Retrieval from memory takes several hundred clock cycles
Retrieval from L1 cache takes on the order of one clock cycle, and from L2 cache on the order of 10 clock cycles.
Cache hit and miss. Prefetch used to avoid cache misses at the start of the execution of the program.
Cache lines used to avoid latency time in case of a cache miss
Order of search: L1 cache -> L2 cache -> RAM -> Disk
Cache coherency: correctness of data. Important for distributed parallel computing
Limit to cache improvement: improving cache performance will at most improve efficiency to match processor efficiency
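Cache lines reward sequential access; a small sketch (not from the slides; the array size and timing harness are assumed) shows the effect in C by traversing the same matrix in row-major vs. column-major order:

#include <stdio.h>
#include <time.h>

#define N 2048
static double a[N][N];   /* row-major in C: a[i][0..N-1] is contiguous */

int main(void) {
    double sum = 0.0;
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)      /* row-major: each cache line is   */
        for (int j = 0; j < N; j++)  /* fetched once and fully used     */
            sum += a[i][j];
    clock_t t1 = clock();
    for (int j = 0; j < N; j++)      /* column-major: jumps N*8 bytes   */
        for (int i = 0; i < N; i++)  /* per access, so far more misses  */
            sum += a[i][j];
    clock_t t2 = clock();
    printf("row-major %.3fs, col-major %.3fs (sum=%g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    return 0;
}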
-
(exs. of limited data parallelism)
(exs. of limited & low-level functional parallelism)
(single-instr. multiple data)
Instruction-level parallelism: degree generally low and dependent on how the sequential code has been written, so not v. effective
-
Thus need development of explicit parallel algorithms that are based on a fundamental understanding of the parallelism inherent in a problem, and exploiting that parallelism with minimum interaction/communication between the parallel parts
-
(simultaneous multi-threading)
(multi-threading)
-
Applications of Parallel Processing
-
Example problems & solutions
Easy parallel situation: each data part is independent. No communication is required between the execution units solving two different parts.
data1 data2 ... data N
Heat equation:
The initial temperature is zero on the boundaries and high in the middle
The boundary temperature is held at zero.
The calculation of an element is dependent upon its neighbor elements
-
Code from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur
Master/worker pseudocode:

find out if I am MASTER or WORKER
if I am MASTER
  initialize array
  send each WORKER starting info and subarray
  do until all WORKERS converge
    gather from all WORKERS convergence data
    broadcast to all WORKERS convergence signal
  end do
  receive results from each WORKER
else if I am WORKER
  receive from MASTER starting info and subarray
  do until solution converged {
    update time
    non-blocking send neighbors my border info
    non-blocking receive neighbors' border info
    update interior of my portion of solution array
    wait for non-blocking commun. to complete
    update border of my portion of solution array
    determine if my solution has converged
    if so { send MASTER convergence signal
            recv. from MASTER convergence signal }
  } end do
  send MASTER results
endif

Serial code:

do iy = 2, ny-1
  do ix = 2, nx-1
    u2(ix,iy) = u1(ix,iy) + cx*(u1(ix+1,iy) + u1(ix-1,iy) - 2*u1(ix,iy)) + cy*(u1(ix,iy+1) + u1(ix,iy-1) - 2*u1(ix,iy))
  enddo
enddo

(Figure labels: Master (can be one of the workers); Workers; Problem Grid)
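A compact sketch of the worker's inner loop in C with MPI, assuming a 1-D decomposition of the grid into blocks of rows (array names follow the serial code; the sizes, step count, and coefficients are made up, and the convergence test is omitted for brevity):

#include <mpi.h>
#include <string.h>

#define NX 256      /* columns (global) */
#define MYROWS 64   /* rows owned by this process (assumed equal shares) */
#define STEPS 500

/* Rows 0 and MYROWS+1 are ghost rows holding the neighbors' borders. */
static double u1g[MYROWS + 2][NX], u2g[MYROWS + 2][NX];

int main(int argc, char **argv) {
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    int up = (rank > 0) ? rank - 1 : MPI_PROC_NULL;      /* neighbor above */
    int dn = (rank < np - 1) ? rank + 1 : MPI_PROC_NULL; /* neighbor below */
    double cx = 0.1, cy = 0.1;

    for (int t = 0; t < STEPS; t++) {
        MPI_Request rq[4];
        /* non-blocking send my border rows; receive neighbors' borders */
        MPI_Isend(u1g[1],          NX, MPI_DOUBLE, up, 0, MPI_COMM_WORLD, &rq[0]);
        MPI_Isend(u1g[MYROWS],     NX, MPI_DOUBLE, dn, 0, MPI_COMM_WORLD, &rq[1]);
        MPI_Irecv(u1g[0],          NX, MPI_DOUBLE, up, 0, MPI_COMM_WORLD, &rq[2]);
        MPI_Irecv(u1g[MYROWS + 1], NX, MPI_DOUBLE, dn, 0, MPI_COMM_WORLD, &rq[3]);

        /* update interior rows while communication is in flight */
        for (int iy = 2; iy < MYROWS; iy++)
            for (int ix = 1; ix < NX - 1; ix++)
                u2g[iy][ix] = u1g[iy][ix]
                    + cx * (u1g[iy][ix+1] + u1g[iy][ix-1] - 2*u1g[iy][ix])
                    + cy * (u1g[iy+1][ix] + u1g[iy-1][ix] - 2*u1g[iy][ix]);

        MPI_Waitall(4, rq, MPI_STATUSES_IGNORE);

        /* now update my two border rows using the received ghost rows */
        for (int iy = 1; iy <= MYROWS; iy += MYROWS - 1)
            for (int ix = 1; ix < NX - 1; ix++)
                u2g[iy][ix] = u1g[iy][ix]
                    + cx * (u1g[iy][ix+1] + u1g[iy][ix-1] - 2*u1g[iy][ix])
                    + cy * (u1g[iy+1][ix] + u1g[iy-1][ix] - 2*u1g[iy][ix]);

        memcpy(u1g, u2g, sizeof u1g);   /* u1 <- u2 for the next time step */
    }
    MPI_Finalize();
    return 0;
}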
-
How to interconnect the multiple cores/processors is a major consideration in a parallel architecture
-
Parallelism - A simplistic understanding
Multiple tasks at once.
Distribute work into multiple execution units.
A classification of parallelism:
Data Parallelism
Functional or Control Parallelism
Data Parallelism - Divide the dataset and solve each sector similarly on a separate execution unit.
Functional Parallelism - Divide the 'problem' into different tasks and execute the tasks on different units. What would func. parallelism look like for the example on the right? (A sketch contrasting the two styles follows below.)
(Figure: Sequential vs. Data Parallelism)
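A tiny sketch (not from the slides) contrasting the two styles in C with OpenMP; the kernel and task functions are made-up stand-ins:

#include <stdio.h>

#define N 1000000
static double in_data[N], out_data[N];

static double f(double x) { return 2.0 * x + 1.0; }   /* stand-in kernel */
static void taskA(void) { /* e.g., filter the data    */ }
static void taskB(void) { /* e.g., compute statistics */ }

int main(void) {
    /* Data parallelism: every thread applies the SAME operation to its
       own slice of the dataset. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        out_data[i] = f(in_data[i]);

    /* Functional (control) parallelism: DIFFERENT tasks run at the
       same time. */
    #pragma omp parallel sections
    {
        #pragma omp section
        taskA();
        #pragma omp section
        taskB();
    }
    printf("done\n");
    return 0;
}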
-
Data Parallelism
Functional Parallelism
-
Flynn's Classification
Flynn's Classical Taxonomy:
Single Instruction, Single Data (SISD): your single-core uni-processor PC
Single Instruction, Multiple Data (SIMD): special-purpose low-granularity multi-processor machine w/ a single control unit relaying the same instruction to all processors (w/ different data) every clock cycle
Multiple Instruction, Single Data (MISD): pipelining is a major example
Multiple Instruction, Multiple Data (MIMD): the most prevalent model. SPMD (Single Program Multiple Data) is a very useful subset. Note that this is v. different from SIMD. Why?
Note that Data vs Control Parallelism is another classification, independent of the above
-
Flynn's Classification (contd.)
Data Parallelism: SIMD and SPMD fall into this category
Functional Parallelism: MISD falls into this category
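To make the SIMD vs. SPMD distinction concrete, a small hypothetical sketch in C (not from the slides): under SIMD one control unit drives all processing elements in lockstep, while under SPMD each process runs the same program but takes its own control path based on its id:

#include <stdio.h>

#define P 4   /* number of processing elements / processes (assumed) */

int main(void) {
    int a[P] = {1, 2, 3, 4}, b[P] = {10, 20, 30, 40}, c[P];

    /* SIMD view: ONE instruction stream; every processing element
       performs the same add on its own data element in lockstep. */
    for (int p = 0; p < P; p++)
        c[p] = a[p] + b[p];            /* conceptually simultaneous */
    printf("c[0..3] = %d %d %d %d\n", c[0], c[1], c[2], c[3]);

    /* SPMD view: every process runs this SAME program, but each one
       branches independently on its own id; no lockstep required. */
    for (int id = 0; id < P; id++) {   /* stand-in for P concurrent processes */
        if (id == 0)
            printf("process %d coordinates the others\n", id);
        else
            printf("process %d works on data slice %d\n", id, id);
    }
    return 0;
}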
-
Parallel Arch. Classification
Multi-processor Architectures:
Distributed Memory: most prevalent architecture model for # processors > 8
  Indirect interconnection n/ws
  Direct interconnection n/ws
Shared Memory:
  Uniform Memory Access (UMA)
  Non-Uniform Memory Access (NUMA): distributed shared memory
-
Distributed Memory / Message-Passing Architectures
Each processor P (with its own local cache C) is connected to exclusive local memory, i.e. no other CPU has direct access to it.
Each node comprises at least one network interface (NI) that mediates the connection to a communication network.
On each CPU runs a serial process that can communicate with other processes on other CPUs by means of the network.
Non-blocking vs blocking communication
Direct vs indirect communication/interconnection network
Example: a 2x4 mesh n/w (direct connection n/w)
-
The ARGO Beowulf Cluster at UIC (http://accc.uic.edu/service/argo-cluster)
Has 56 compute nodes/computers and a master node. Master here has a different meaning (generally a system front-end where you login and perform various tasks before submitting your parallel code to run on several compute nodes) than the master node in a parallel algorithm (e.g., the one we saw for the finite-element heat distribution problem), which would actually be one of the compute nodes, and generally distributes data to the other compute nodes, monitors progress of the computation, determines the end of the computation, etc., and may also additionally perform a part of the computation.
Compute nodes are divided among 14 zones, each zone containing 4 nodes which are connected as a ring network. Zones are connected to each other by a higher-level n/w.
Each node (compute or master) has 2 processors. The processors on some nodes are single-core, and dual-core on others; see http://accc.uic.edu/service/arg/nodes
-
System Computational Actions in a Message-Passing Program
(a) Two basic parallel processes X and Y, and their data dependency:
Proc. X: a := b+c;   Proc. Y: b := x*y;
(b) Their mapping to a message-passing multicomputer (message-passing mapping):
Proc. Y (on the processor/core containing Y, P(Y)): b := x*y; send(P1, b);
Proc. X (on the processor/core containing X, P(X)): recv(P2, b); a := b+c;
Message passing of data item b over a link (direct or indirect) betw. the 2 processors.
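A minimal runnable version of this two-process exchange in C with MPI (the rank numbering and operand values are assumed, not from the slides):

#include <mpi.h>
#include <stdio.h>

/* Rank 1 plays Proc. Y (computes b and sends it); rank 0 plays
   Proc. X (receives b and computes a = b + c). */
int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                       /* Proc. Y */
        double x = 3.0, y = 4.0;
        double b = x * y;
        MPI_Send(&b, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {                /* Proc. X */
        double b, c = 5.0;
        MPI_Recv(&b, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double a = b + c;
        printf("a = %g\n", a);             /* prints a = 17 */
    }
    MPI_Finalize();
    return 0;
}

Run with two processes, e.g. mpirun -np 2 ./a.out.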
-
(Figure: Dual-Core and Quad-Core processors, with L1 and L2 caches)
Distributed Shared Memory Arch.: UMA
Flat memory model
Memory bandwidth and latency are the same for all processors and all memory locations.
Simplest example: dual-core processor
Most commonly represented today by Symmetric Multiprocessor (SMP) machines
Cache-coherent UMA: consistent cache values of the same data item in different proc./core caches
-
System Computational Actions in a Shared-Memory Program
(a) Two basic parallel processes X and Y, and their data dependency:
Proc. X: a := b+c;   Proc. Y: b := x*y;
(b) Their mapping to a shared-memory multiprocessor (shared-memory mapping): P(X) and P(Y) communicate via data item b in shared memory.
Possible actions by O.S. for the writer (Proc. Y):
(i) Since b is a shared data item (e.g., designated by compiler or programmer), check b's location to see if it can be written to (all prev. reads done: read_cntr for b = 0).
(ii) If so, write b to its location and mark status bit as written by Y. Initialize read_cntr for b to pre-determined value.
Possible actions by O.S. for the reader (Proc. X):
(i) Since b is a shared data item (e.g., designated by compiler or programmer), check b's location to see if it has been written to by Y or any process (if don't care about the writing process).
(ii) If so {read b & decrement read_cntr for b} else go to (i) and busy wait (check periodically).
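A runnable sketch of this write-flag/busy-wait protocol in C, using C11 atomics and POSIX threads to emulate the status bit and read counter the slide attributes to the O.S. (all names assumed):

#include <stdio.h>
#include <stdbool.h>
#include <stdatomic.h>
#include <pthread.h>

static double b;                           /* the shared data item */
static atomic_bool b_written = false;      /* status bit: written by Y? */
static atomic_int read_cntr = 0;           /* pending reads of b */

static void *proc_Y(void *arg) {           /* writer: b := x*y */
    double x = 3.0, y = 4.0;
    b = x * y;
    atomic_store(&read_cntr, 1);           /* init read_cntr: one reader */
    atomic_store(&b_written, true);        /* mark status bit as written */
    return NULL;
}

static void *proc_X(void *arg) {           /* reader: a := b+c */
    while (!atomic_load(&b_written))
        ;                                  /* busy wait: check periodically */
    double a = b + 5.0;
    atomic_fetch_sub(&read_cntr, 1);       /* this read is done */
    printf("a = %g\n", a);                 /* prints a = 17 */
    return NULL;
}

int main(void) {
    pthread_t tx, ty;
    pthread_create(&tx, NULL, proc_X, NULL);
    pthread_create(&ty, NULL, proc_Y, NULL);
    pthread_join(tx, NULL);
    pthread_join(ty, NULL);
    return 0;
}

The sequentially consistent atomic store/load pair guarantees that X sees Y's write to b once it observes the flag as true.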
-
Most text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur
Distributed Shared Memory Arch.: NUMA
Memory is physically distributed but logically shared.
The physical layout is similar to the distributed-memory message-passing case.
The aggregated memory of the whole system appears as one single address space.
Due to the distributed nature, memory access performance varies depending on which CPU accesses which parts of memory (local vs. remote access).
Two locality domains linked through a high-speed connection called HyperTransport (in general via a link, as in message-passing archs, only here these links are used by the O.S. to transmit read/write non-local data to/from processor/non-local memory).
Advantage: scalability (compared to UMAs)
Disadvantages: a) locality problems and connection congestion; b) not a natural parallel prog./algo. model (it is easier to partition data among procs instead of thinking of all of it as occupying a large monolithic address space that each proc. can access).
(Figure: 2x2 mesh connection)
-
An example of an SPMD message-passing parallel program
-
SPMD message-passing parallel program (contd.)
(Code figure, not fully captured in the transcript: communication partner computed as node xor D; a hedged reconstruction follows below.)
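Since the program text itself did not survive, here is a hedged SPMD sketch in C with MPI built around that node-xor-D pattern: in a hypercube, each node exchanges data with partner node (node xor 2^d) along each dimension d, which yields a global sum in log2(np) steps (all names and the reduction operation are assumed, not from the slides):

#include <mpi.h>
#include <stdio.h>

/* SPMD: every process runs this same program and acts on its own rank. */
int main(int argc, char **argv) {
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);   /* assume np is a power of 2 */

    double mine = rank + 1.0, theirs;     /* stand-in local value */
    for (int D = 1; D < np; D <<= 1) {    /* D = 2^d for dimension d */
        int partner = rank ^ D;           /* the "node xor D" pattern */
        MPI_Sendrecv(&mine, 1, MPI_DOUBLE, partner, 0,
                     &theirs, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        mine += theirs;                   /* after all steps: global sum */
    }
    printf("rank %d: global sum = %g\n", rank, mine);
    MPI_Finalize();
    return 0;
}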
-
Most text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur
Summary
Serial computers / microprocessors will probably not get much faster; parallelization unavoidable
Pipelining, cache, and other optimization strategies for serial computers reaching a plateau
Data and functional parallelism
Flynn's taxonomy: SIMD, MISD, MIMD/SPMD
Parallel Architectures Intro
Distributed Memory
Shared Memory
Uniform Memory Access
Non-Uniform Memory Access
Application examples
Parallel program/algorithm examples
-
Additional References
Computer Organization and Design, Patterson & Hennessy
Modern Operating Systems, Tanenbaum
Concepts of High Performance Computing, Georg Hager & Gerhard Wellein
Cramming more components onto Integrated Circuits, Gordon Moore, 1965
Introduction to Parallel Computing, https://computing.llnl.gov/tutorials/parallel_comp
The Landscape of Parallel Computing Research: A View from Berkeley, 2006