TRANSCRIPT

CIS669 Distributed and Parallel Processing
Spring 2002
Professor Yuan Shi
Distributed Processing
Transaction-oriented
Geographically dispersed locations
I/O intense
Database-centric
Parallel Processing
Non-transactional, single goal computing
Computing intense and/or data-intense
May or may not involve databases
Is There a Real Difference?
Not in terms of functionality and resource-use intensity.
For transactional systems, there are OLAP (Online Analytical Processing) and data mining tools that are compute-intensive and single-goal-oriented.
For parallel processing, many scientific/engineering applications need to interact with databases to make more accurate calculations.
Parallelism and Programming Difficulties
For distributed processing, parallelism is given and usually cannot be easily changed. Programming is relatively easy.
For parallel processing, the programmer defines parallelism by partitioning the serial program(s). Parallel programming is in general more difficult than programming transaction applications.
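For example, partitioning a serial summation loop across worker processes might look like this minimal Python sketch (the function names and chunking scheme are illustrative, not from the course):

```python
from multiprocessing import Pool

def partial_sum(bounds):
    """Worker: sums the squares in one partition of the index range."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def parallel_sum(n, p):
    """Partition [0, n) into p chunks and sum the chunks in parallel."""
    step = n // p
    # The last chunk absorbs any remainder so the whole range is covered.
    chunks = [(k * step, (k + 1) * step if k < p - 1 else n) for k in range(p)]
    with Pool(p) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    # Same answer as the serial loop, computed on 4 processes.
    assert parallel_sum(1000, 4) == sum(i * i for i in range(1000))
```

The hard part the slide alludes to is exactly this manual step: the programmer, not the system, decides where the partition boundaries go.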
This picture is changing…
Industrial-strength distributed applications are evolving to become more parallel-like.
Lab-based parallel applications are blending into industrial strength applications by incorporating transactions.
Why Clusters (the textbook)?
We have tried all others: vector, dataflow, NUMA, hypercube, 3D-torus, etc.
Parallel programming does not get easier with any configuration.
Clusters promise the greatest cost/performance potential. Check this out:
Types of Parallelism (Flynn, 1972*)
1. SIMD (Single Instruction Multiple Data)
2. MIMD (Multiple Instruction Multiple Data)
3. MISD (Pipeline)
* Flynn, M., "Some Computer Organizations and Their Effectiveness," IEEE Trans. Comput., Vol. C-21, pp. 94, 1972.
** Other taxonomies exist to categorize parallel machines (see http://csep1.phy.ornl.gov/ca/node11.html).
SIMD
A single instruction stream (I) is broadcast to four processors, each operating on its own data item (D1, D2, D3, D4):
Tseq = 4
Tpar = 1
Sp = Tseq/Tpar = 4 = P
MIMD
Four independent instruction streams (I1..I4) each operate on their own data items (D1..D4):
Tseq = 4
Tpar = 1
Sp = Tseq/Tpar = 4 = P
Pipeline (MISD)
Four instructions (I1..I4) form the pipeline stages; four data items (D1..D4) stream through them:
Tseq = 4 x 4 = 16
Tpar = 4 + 3 = 7
Sp = Tseq/Tpar ~= 2.3
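The speedup figures above can be checked with a few lines of Python (a sketch using the standard formulas implied by the slides):

```python
def simd_mimd_speedup(p):
    """P processors each handle one item at once: Tseq = P, Tpar = 1."""
    return p / 1

def pipeline_speedup(stages, items):
    """MISD pipeline: Tseq = stages * items,
    Tpar = stages + (items - 1) once the pipeline is full."""
    tseq = stages * items
    tpar = stages + items - 1
    return tseq / tpar

# Values from the slides: 4 processors / a 4-stage pipeline, 4 data items.
print(simd_mimd_speedup(4))              # 4.0
print(round(pipeline_speedup(4, 4), 2))  # 2.29 (i.e. 16/7, ~2.3)
```

Note that pipeline speedup approaches the stage count only as the number of items grows large; with just 4 items the fill/drain cost keeps it near 2.3.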
Machines that can work in parallel
Cray: X-MP, Y-MP, T3D.
TMC: CM-1 through CM-5.
Kendall Square Research: KSR-1
SGI: Power Challenge, Origin
IBM: 3090, SP2…
PCs
History
Single CPU: Smaller size -> faster speed (Cray, remember Moore’s Law?)
Multi-CPU: shared memory or not?
The war between Big Iron and Many Irons: Cray against TMC.
Result: all lost. Clusters won by survival.
State of the Art
Symmetric Multiprocessing (SMP) is still the only practical industrial approach. Vendors include HP, Sun, SGI, IBM, Compaq/Tandem, and Stratus.
Special-purpose, small-scale multiprocessors: CISCO routers, SSL processors, MPEG decoders, etc.
Special-purpose massively parallel processors are designed for special types of applications, such as human genome classification, nuclear accelerator simulation, fluid-dynamics simulations, etc.
Hardware Technology Advances*
Computing Laws
Transistor density doubles every 18 months (a 60% increase per year):
– chip density (transistors/die)
– microprocessor speeds
Exponential growth:
– the past does not matter
– 10x here, 10x there … means REAL change
PC costs decline faster than any other platform:
– volume and learning curves
– PCs are the building bricks of all future systems
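The claimed equivalence (doubling every 18 months is roughly a 60% annual increase) is easy to verify:

```python
# Doubling every 18 months means growth by a factor of 2^(12/18) per year.
annual_factor = 2 ** (12 / 18)
print(round(annual_factor - 1, 3))  # 0.587, i.e. about a 59-60% increase/year
```

So the two phrasings of Moore's Law on this slide are consistent with each other.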
Moore’s First Law
[Chart: single-chip memory size, 1970–2000. DRAM capacity per chip grows from 1 Kbit through 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M to 256 Mbit; corresponding memory sizes range from 8 KB through 128 KB, 1 MB, 8 MB, 128 MB to 1 GB; one-chip memory size: 2 MB to 32 MB.]
* Credit: Gordon Bell
Everything cyberizable will be in Cyberspace and covered by a hierarchy of computers!
[Figure: the hierarchy spans body, home, cars/physical nets, campus/buildings, region/intranet, continent, and world.]
* Credit: Gordon Bell
Distributed Programming Tools
• C/C++ with TCP/IP
• Perl with TCP/IP
• Java
• CORBA
• ASP
• .NET
Parallel Programming Tools
PVM
MPI
Synergy
Others (proprietary hardware)
Semester Outline
Parallel programming
Architecture and performance evaluation
Distributed programming
Architecture and performance evaluation
Project selection
Project implementation
Presentation
Parallel Programming Difficulties
Program partitioning and allocation
Data partitioning and allocation
Program (process) synchronization
Data access mutual exclusion
Dependencies
Process(or) failures
Scalability…
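As a small illustration of the mutual exclusion difficulty listed above, here is a Python sketch: four threads increment a shared counter, and only the lock makes the final value deterministic.

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    """Increment the shared counter n times under a lock.
    Without the lock, the read-modify-write below could interleave
    across threads and lose updates."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 -- deterministic only because of the lock
```

This is exactly the kind of bookkeeping that the SPP approach discussed next claims to eliminate.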
Meeting the Challenge
Use the Stateless Parallel Processing (SPP) principle (U.S. Patent #5,517,656, May 1996).
Advantages:
High performance – automatic formation of SIMD, MIMD and MISD clusters at runtime.
Adding/removing processors at runtime allows for ultimate scalability.
It is the ONLY multiprocessor architecture designed with fault tolerance in mind.
Ease of programming – no mutual exclusion problems; automatic tools are possible.
Stateless Parallel Processing
A stateless program is any program whose execution neither hard-wires nor incurs side effects on ANY global information.
Non-stateless program example: all PVM/MPI programs, since they create processes with IDs (global information).
Why Stateless Programs?
A stateless program can execute on any processor. This allows dynamic formation of SIMD, MIMD and MISD clusters at runtime.
Only stateless programs can promise ultimate scalability (adding a processor on the fly) and fault tolerance (losing a processor on the fly).
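A hypothetical Python sketch of the stateless property: the worker's output is a pure function of the tuple it consumes, so any processor can run it and the work can be re-issued after a failure (the names here are illustrative, not from the SPP patent):

```python
# Hypothetical illustration: a stateless worker's result depends only on
# the input tuple, never on process identity or global variables.

def stateless_worker(work_tuple):
    """Any processor may run this; output is a pure function of input."""
    key, data = work_tuple
    return key, sum(data)

# The same tuple gives the same answer no matter where it runs, so if a
# processor dies mid-task, the tuple can simply be handed to another one.
t = ("row-3", [1, 2, 3])
assert stateless_worker(t) == stateless_worker(t) == ("row-3", 6)
```

Contrast this with an MPI program, where behavior typically branches on the process rank (a piece of global information), tying each task to a particular process.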
Stateless Parallel Processor
[Figure: six processors, each connected to the others through a high-speed switch and linked together by a unidirectional ring.]
Operations of a Stateless Parallel Processor
The shared disk stores ALL stateless programs.
The unidirectional ring carries control tuples of two types: read and exclusive read. Read tuples drop off the ring after one rotation; exclusive-read tuples drop off the ring after being consumed.
Each processor can execute ANY stateless program from the shared disk.
Control tuples carry data locations to allow direct data access via the high-speed switch.
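The two control-tuple types can be simulated with a toy, centralized tuple space in Python (an assumed interface for illustration only; the real SPP uses a hardware ring):

```python
class TupleSpace:
    """Toy stand-in for the SPP control ring (assumed interface).
    Read tuples stay visible to every consumer; exclusive-read
    tuples vanish as soon as one consumer takes them."""

    def __init__(self):
        self.tuples = {}  # name -> (value, exclusive)

    def put(self, name, value, exclusive=False):
        self.tuples[name] = (value, exclusive)

    def read(self, name):
        """Non-destructive read (plain read tuple)."""
        return self.tuples[name][0]

    def take(self, name):
        """Destructive read: consumes an exclusive-read tuple."""
        value, _ = self.tuples.pop(name)
        return value

space = TupleSpace()
space.put("work", (0, 100), exclusive=True)       # one worker gets this
space.put("config", {"tolerance": 1e-6})          # everyone can read this

print(space.read("config"))   # still available to every processor
print(space.take("work"))     # (0, 100); gone after this consumer
```

The exclusive-read rule is what lets idle processors compete for work safely: only one of them can ever consume a given work tuple.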
How does a stateless system start?
An initialization program sends initial ER tuple(s) onto the ring.
It fires up all dependent programs on multiple processors (MIMD).
Newly generated tuples fire up more programs.
A SIMD cluster forms when a stateless program can accept multiple tuple values (multiple data).
MISD (pipeline) forms when multiple processors form a chain of dependency with sufficient data supply.
How do you get your hands on a SPP?
Synergy. Synergy is a close approximation of SPP. It uses a tuple space to replace the unidirectional ring (same function, but slower). Multiple tuple spaces are used to simulate the high-speed switch.
Note: the absence of the high-speed switch costs a great deal of performance.
Next: Parallel Program Performance Analysis
Next week: no lecture.
Homework 1 (due 2/4/02; submit a .doc file to [email protected] with subject: 669 HW1)
Reading: textbook chapters 1-4.
Problems:
1. What is the most likely performance bottleneck of an SPP machine? Explain.
2. Why the unidirectional ring? Explain.
3. Is it possible to build an SPP system using a cluster of PCs? How? What would you propose to make Synergy a true SPP system? Justify.
4. Compare SMP (symmetric multiprocessor) with SPP. Explain pros and cons. Are they compatible?
5. Compare SPP with Massively Parallel Processors. Explain pros and cons. Restrict discussion at architecture level.
6. Design a stateless matrix multiplication system. How many programs do you need? Explain. How many forms of parallelism can you find?