TRANSCRIPT

CIS669 Distributed and Parallel Processing
Spring 2002
Professor Yuan Shi
Distributed Processing
Transaction-oriented
Geographically dispersed locations
I/O intense
Database-centric
Parallel Processing
Non-transactional, single goal computing
Computing intense and/or data-intense
May or may not involve databases
Is There a Real Difference?
Not in terms of functionality and resource-use intensity.
For transactional systems, there are OLAP (Online Analytical Processing) and data mining tools that are compute-intensive and single-goal-oriented.
For parallel processing, many scientific/engineering applications need to interact with databases to make more accurate calculations.
Parallelism and Programming Difficulties
For distributed processing, parallelism is given and usually cannot be easily changed. Programming is relatively easy.
For parallel processing, the programmer defines parallelism by partitioning the serial program(s). Parallel programming is in general more difficult than programming transaction applications.
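For example, partitioning a serial summation loop across worker processes might look like this minimal Python sketch (the function names and chunking scheme are illustrative, not from the course):

```python
from multiprocessing import Pool

def partial_sum(bounds):
    """Worker: sums the squares in one partition of the index range."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def parallel_sum(n, p):
    """Partition [0, n) into p chunks and sum the chunks in parallel."""
    step = n // p
    # The last chunk absorbs any remainder so the whole range is covered.
    chunks = [(k * step, (k + 1) * step if k < p - 1 else n) for k in range(p)]
    with Pool(p) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    # Same answer as the serial loop, computed on 4 processes.
    assert parallel_sum(1000, 4) == sum(i * i for i in range(1000))
```

The hard part the slide alludes to is exactly this manual step: the programmer, not the system, decides where the partition boundaries go.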
This picture is changing…
Industrial-strength distributed applications are evolving to become more parallel-like.
Lab-based parallel applications are blending into industrial strength applications by incorporating transactions.
Why Clusters (the textbook)?
We have tried all others: vector, dataflow, NUMA, hypercube, 3D-torus, etc.
Parallel programming does not get easier with any configuration.
Clusters promise the greatest cost/performance potential. Check this out:
Types of Parallelism (Flynn, 1972*)
1. SIMD (Single Instruction Multiple Data)
2. MIMD (Multiple Instruction Multiple Data)
3. MISD (Pipeline)
* Flynn, M., "Some Computer Organizations and Their Effectiveness," IEEE Trans. Comput., Vol. C-21, pp. 94, 1972.
** Other taxonomies exist to categorize parallel machines (see http://csep1.phy.ornl.gov/ca/node11.html).
SIMD
A single instruction stream (I) is broadcast to four processors, each operating on its own data item (D1, D2, D3, D4):
Tseq = 4
Tpar = 1
Sp = Tseq/Tpar = 4 = P
MIMD
Four independent instruction streams (I1..I4) each operate on their own data items (D1..D4):
Tseq = 4
Tpar = 1
Sp = Tseq/Tpar = 4 = P
Pipeline (MISD)
Four instructions (I1..I4) form the pipeline stages; four data items (D1..D4) stream through them:
Tseq = 4 x 4 = 16
Tpar = 4 + 3 = 7
Sp = Tseq/Tpar ~= 2.3
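The speedup figures above can be checked with a few lines of Python (a sketch using the standard formulas implied by the slides):

```python
def simd_mimd_speedup(p):
    """P processors each handle one item at once: Tseq = P, Tpar = 1."""
    return p / 1

def pipeline_speedup(stages, items):
    """MISD pipeline: Tseq = stages * items,
    Tpar = stages + (items - 1) once the pipeline is full."""
    tseq = stages * items
    tpar = stages + items - 1
    return tseq / tpar

# Values from the slides: 4 processors / a 4-stage pipeline, 4 data items.
print(simd_mimd_speedup(4))              # 4.0
print(round(pipeline_speedup(4, 4), 2))  # 2.29 (i.e. 16/7, ~2.3)
```

Note that pipeline speedup approaches the stage count only as the number of items grows large; with just 4 items the fill/drain cost keeps it near 2.3.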
Machines that can work in parallel
Cray: X-MP, Y-MP, T3D.
TMC: CM-1 through CM-5.
Kendall Square Research: KSR-1
SGI: Power Challenge, Origin
IBM: 3090, SP2…
PCs
History
Single CPU: Smaller size -> faster speed (Cray, remember Moore’s Law?)
Multi-CPU: shared memory or not?
The war between Big Iron and Many Irons: Cray against TMC.
Result: all lost. Clusters won by survival.
State of the Art
Symmetric Multiprocessing (SMP) is still the only practical industrial approach. Vendors include HP, Sun, SGI, IBM, Compaq/Tandem, and Stratus.
Special-purpose, small-scale multiprocessors: CISCO routers, SSL processors, MPEG decoders, etc.
Special-purpose massively parallel processors are designed for special types of applications, such as human genome classification, nuclear accelerator simulation, fluid-dynamics simulations, etc.
Hardware Technology Advances*
Computing Laws
Transistor density doubles every 18 months (a 60% increase per year):
– chip density (transistors/die)
– microprocessor speeds
Exponential growth:
– the past does not matter
– 10x here, 10x there … means REAL change
PC costs decline faster than any other platform:
– volume and learning curves
– PCs are the building bricks of all future systems
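The claimed equivalence (doubling every 18 months is roughly a 60% annual increase) is easy to verify:

```python
# Doubling every 18 months means growth by a factor of 2^(12/18) per year.
annual_factor = 2 ** (12 / 18)
print(round(annual_factor - 1, 3))  # 0.587, i.e. about a 59-60% increase/year
```

So the two phrasings of Moore's Law on this slide are consistent with each other.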
Moore’s First Law
[Chart: single-chip memory size, 1970–2000. DRAM capacity per chip grows from 1 Kbit through 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M to 256 Mbit; corresponding memory sizes range from 8 KB through 128 KB, 1 MB, 8 MB, 128 MB to 1 GB; one-chip memory size: 2 MB to 32 MB.]
* Credit: Gordon Bell
Everything cyberizable will be in Cyberspace and covered by a hierarchy of computers!
[Figure: the hierarchy spans body, home, cars/physical nets, campus/buildings, region/intranet, continent, and world.]
* Credit: Gordon Bell
Distributed Programming Tools
• C/C++ with TCP/IP
• Perl with TCP/IP
• Java
• CORBA
• ASP
• .NET
Parallel Programming Tools
PVM
MPI
Synergy
Others (proprietary hardware)
Semester Outline
Parallel programming
Architecture and performance evaluation
Distributed programming
Architecture and performance evaluation
Project selection
Project implementation
Presentation
Parallel Programming Difficulties
Program partitioning and allocation
Data partitioning and allocation
Program (process) synchronization
Data access mutual exclusion
Dependencies
Process(or) failures
Scalability…
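As a small illustration of the mutual exclusion difficulty listed above, here is a Python sketch: four threads increment a shared counter, and only the lock makes the final value deterministic.

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    """Increment the shared counter n times under a lock.
    Without the lock, the read-modify-write below could interleave
    across threads and lose updates."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 -- deterministic only because of the lock
```

This is exactly the kind of bookkeeping that the SPP approach discussed next claims to eliminate.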
Meeting the Challenge
Use the Stateless Parallel Processing (SPP) principle (U.S. Patent #5,517,656, May 1996).
Advantages:
High performance – automatic formation of SIMD, MIMD and MISD clusters at runtime.
Adding/removing processors at runtime allows for ultimate scalability.
It is the ONLY multiprocessor architecture designed with fault tolerance in mind.
Ease of programming – no mutual exclusion problems; automatic tools are possible.
Stateless Parallel Processing
A stateless program is any program whose execution neither hard-wires nor incurs side effects on ANY global information.
Non-stateless program example: all PVM/MPI programs, since they create processes with IDs (global information).
Why Stateless Programs?
A stateless program can execute on any processor. This allows dynamic formation of SIMD, MIMD and MISD clusters at runtime.
Only stateless programs can promise ultimate scalability (adding a processor on the fly) and fault tolerance (losing a processor on the fly).
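A hypothetical Python sketch of the stateless property: the worker's output is a pure function of the tuple it consumes, so any processor can run it and the work can be re-issued after a failure (the names here are illustrative, not from the SPP patent):

```python
# Hypothetical illustration: a stateless worker's result depends only on
# the input tuple, never on process identity or global variables.

def stateless_worker(work_tuple):
    """Any processor may run this; output is a pure function of input."""
    key, data = work_tuple
    return key, sum(data)

# The same tuple gives the same answer no matter where it runs, so if a
# processor dies mid-task, the tuple can simply be handed to another one.
t = ("row-3", [1, 2, 3])
assert stateless_worker(t) == stateless_worker(t) == ("row-3", 6)
```

Contrast this with an MPI program, where behavior typically branches on the process rank (a piece of global information), tying each task to a particular process.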
Stateless Parallel Processor
[Figure: six processors, each connected to the others through a high-speed switch and linked together by a unidirectional ring.]
Operations of a Stateless Parallel Processor
The shared disk stores ALL stateless programs.
The unidirectional ring carries control tuples of two types: read and exclusive read. Read tuples drop off the ring after one rotation; exclusive-read tuples drop off the ring after being consumed.
Each processor can execute ANY stateless program from the shared disk.
Control tuples carry data locations to allow direct data access via the high-speed switch.
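The two control-tuple types can be simulated with a toy, centralized tuple space in Python (an assumed interface for illustration only; the real SPP uses a hardware ring):

```python
class TupleSpace:
    """Toy stand-in for the SPP control ring (assumed interface).
    Read tuples stay visible to every consumer; exclusive-read
    tuples vanish as soon as one consumer takes them."""

    def __init__(self):
        self.tuples = {}  # name -> (value, exclusive)

    def put(self, name, value, exclusive=False):
        self.tuples[name] = (value, exclusive)

    def read(self, name):
        """Non-destructive read (plain read tuple)."""
        return self.tuples[name][0]

    def take(self, name):
        """Destructive read: consumes an exclusive-read tuple."""
        value, _ = self.tuples.pop(name)
        return value

space = TupleSpace()
space.put("work", (0, 100), exclusive=True)       # one worker gets this
space.put("config", {"tolerance": 1e-6})          # everyone can read this

print(space.read("config"))   # still available to every processor
print(space.take("work"))     # (0, 100); gone after this consumer
```

The exclusive-read rule is what lets idle processors compete for work safely: only one of them can ever consume a given work tuple.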
How does a stateless system start?
An initialization program sends initial ER tuple(s) onto the ring.
It fires up all dependent programs on multiple processors (MIMD).
Newly generated tuples fire up more programs.
A SIMD cluster forms when a stateless program can accept multiple tuple values (multiple data).
MISD (pipeline) forms when multiple processors form a chain of dependency with sufficient data supply.
How do you get your hands on a SPP?
Synergy. Synergy is a close approximation of SPP. It uses a tuple space to replace the unidirectional ring (same function, but slower). Multiple tuple spaces are used to simulate the high-speed switch.
Note: the absence of the high-speed switch costs a great deal of performance.
Next: Parallel Program Performance Analysis
Next week: no lecture.
Homework 1 (due 2/4/02; submit a .doc file to [email protected] with subject: 669 HW1)
Reading: textbook chapters 1-4.
Problems:
1. What is the most likely performance bottleneck of an SPP machine? Explain.
2. Why the unidirectional ring? Explain.
3. Is it possible to build an SPP system using a cluster of PCs? How? What would you propose to make Synergy a true SPP system? Justify.
4. Compare SMP (symmetric multiprocessor) with SPP. Explain pros and cons. Are they compatible?
5. Compare SPP with Massively Parallel Processors. Explain pros and cons. Restrict discussion at architecture level.
6. Design a stateless matrix multiplication system. How many programs do you need? Explain. How many forms of parallelism can you find?