Transcript
Page 1:

Parallel Computer Architecture

• A parallel computer is a collection of processing elements that cooperate to solve large computational problems fast.
• Broad issues involved:

– The concurrency and communication characteristics of parallel algorithms for a given computational problem

– Computing Resources and Computation Allocation:
• The number of processing elements (PEs), the computing power of each element, and the amount of physical memory used.

• What portions of the computation and data are allocated to each PE.

– Data Access, Communication and Synchronization:
• How the elements cooperate and communicate.

• How data is transmitted between processors.

• Abstractions and primitives for cooperation.

– Performance and Scalability:
• Maximize the performance enhancement from parallelism: speedup.
– By minimizing parallelization overheads.
• Scalability of performance to larger systems/problems.

Page 2:

The Need and Feasibility of Parallel Computing

• Application demands: more computing cycles needed:
– Scientific computing: CFD, biology, chemistry, physics, ...
– General-purpose computing: video, graphics, CAD, databases, transaction processing, gaming, ...
– Mainstream multithreaded programs are similar to parallel programs.

• Technology trends:
– The number of transistors on a chip is growing rapidly. Clock rates are expected to go up, but only slowly.

• Architecture trends:
– Instruction-level parallelism is valuable but limited.
– Coarser-level parallelism, as in multiprocessor systems, is the most viable approach to further improve performance.

• Economics:
– The increased use of commodity off-the-shelf (COTS) components in high-performance parallel computing systems, instead of the costly custom components used in traditional supercomputers, leads to much lower parallel system cost.

• Today’s microprocessors offer high performance and have multiprocessor support, eliminating the need to design expensive custom PEs.

• Commercial System Area Networks (SANs) offer an alternative to more costly custom networks.

Page 3:

Scientific Computing Demands

[Figure: memory requirements of demanding scientific computing applications.]

Page 4:

Scientific Supercomputing Trends

• Proving ground and driver for innovative architecture and advanced computing techniques:
– Market is much smaller relative to the commercial segment.
– Dominated by vector machines from the 1970s through the 1980s.
– Meanwhile, microprocessors have made huge gains in floating-point performance:
• High clock rates.

• Pipelined floating point units.

• Instruction-level parallelism.

• Effective use of caches.

• Large-scale multiprocessors and computer clusters are replacing vector supercomputers

Page 5:

CPU Performance Trends

[Figure: performance (log scale, 0.1 to 100) vs. year (1965 to 1995) for supercomputers, mainframes, minicomputers, and microprocessors.]

The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance.

Page 6:

General Technology Trends

• Microprocessor performance increases 50% - 100% per year.
• Transistor count doubles every 3 years.
• DRAM size quadruples every 3 years.

[Figure: integer and FP benchmark performance, 1987 to 1992, for the Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, and DEC Alpha.]

Page 7:

Clock Frequency Growth Rate

[Figure: clock rate (MHz, log scale 0.1 to 1,000) vs. year (1970 to 2005) for the i4004, i8008, i8080, i8086, i80286, i80386, Pentium, and R10000.]

• Currently increasing 30% per year

Page 8:

Transistor Count Growth Rate

[Figure: transistor count (log scale, 1,000 to 100,000,000) vs. year (1970 to 2005) for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000.]

• One billion transistors on a chip by early 2004.
• Transistor count grows much faster than clock rate:

- Currently 40% per year

Page 9:

Parallelism in Microprocessor VLSI Generations

[Figure: transistor count (log scale, 1,000 to 100,000,000) vs. year (1970 to 2005), marking the progression from bit-level parallelism (i4004 through i80286) to instruction-level parallelism (i80386 through Pentium) to thread-level parallelism (?) (R10000 and beyond).]

SMT: e.g., Intel’s Hyper-Threading.

Page 10:

Uniprocessor Attributes to Performance

• Performance benchmarking is program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures:

– Cycles per instruction (CPI).
– Total CPU time:  T = C x τ = C / f = Ic x CPI x τ = Ic x (p + m x k) x τ
   where:
   Ic = instruction count            τ = CPU cycle time
   p = instruction decode cycles
   m = memory cycles                 k = ratio between memory/processor cycle times
   C = total program clock cycles    f = clock rate
– MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) = f x Ic / (C x 10^6)
– Throughput rate: Wp = f / (Ic x CPI) = (MIPS) x 10^6 / Ic  (programs/second)

• Performance factors: (Ic, p, m, k, τ) are influenced by: instruction-set architecture, compiler design, CPU implementation and control, cache and memory hierarchy, and the program’s instruction mix and instruction dependencies.
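
As a quick numeric illustration of these relations, here is a minimal Python sketch; all input values (Ic, CPI, f) are invented for the example:

```python
# Minimal sketch of the uniprocessor performance relations above.
# Input values are illustrative, not measurements of a real machine.
Ic = 200_000_000    # instruction count
CPI = 1.5           # average cycles per instruction
f = 500e6           # clock rate in Hz

tau = 1 / f         # CPU cycle time
C = Ic * CPI        # total program clock cycles
T = C * tau         # total CPU time: T = C x tau = Ic x CPI x tau

mips = Ic / (T * 1e6)   # MIPS rate, equals f / (CPI x 10^6)
Wp = 1 / T              # throughput (programs/second): f / (Ic x CPI)

print(f"T = {T:.3f} s, MIPS = {mips:.1f}, Wp = {Wp:.3f} programs/s")
```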

Page 11:

Raw Uniprocessor Performance: LINPACK

[Figure: LINPACK MFLOPS (log scale, 1 to 10,000) vs. year (1975 to 2000), at n = 100 and n = 1,000, for CRAY vector processors (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) and microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha AXP, HP9000/735, MIPS R4400, DEC Alpha, IBM Power2/990, DEC 8200).]

Page 12:

Raw Parallel Performance: LINPACK

[Figure: LINPACK GFLOPS (log scale, 0.1 to 10,000) vs. year (1985 to 1996), CRAY peak (Xmp/416(4), Ymp/832(8), C90(16), T932(32)) vs. MPP peak (iPSC/860, nCUBE/2(1024), CM-2, CM-200, Delta, CM-5, Paragon XP/S, T3D, Paragon XP/S MP(1024), Paragon XP/S MP(6768), ASCI Red).]

Page 13:

LINPACK Performance Trends

[Figure: the two previous LINPACK charts side by side; left panel: uniprocessor performance (MFLOPS), right panel: parallel system performance (GFLOPS).]

Page 14:

Computer System Peak FLOP Rating History/Near Future

[Figure: peak FLOP ratings of computer systems over time, marking the teraflop and petaflop milestones.]

Page 15:

The Goal of Parallel Processing

• Goal of applications in using parallel machines: maximize speedup over single-processor performance.

   Speedup (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time:

   Speedup fixed problem (p processors) = Time (1 processor) / Time (p processors)

• Ideal speedup = number of processors = p. Very hard to achieve.
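
A small sketch of measuring fixed-problem speedup with Python's multiprocessing module; the workload and process count are invented for illustration:

```python
import time
from multiprocessing import Pool

def work(chunk):
    # Trivially parallel workload: sum of squares over a range.
    return sum(i * i for i in chunk)

def run(num_procs, n=2_000_000):
    chunks = [range(k, n, num_procs) for k in range(num_procs)]
    start = time.perf_counter()
    with Pool(num_procs) as pool:
        total = sum(pool.map(work, chunks))  # fixed problem, split p ways
    return time.perf_counter() - start

if __name__ == "__main__":
    t1 = run(1)                       # Time (1 processor)
    tp = run(4)                       # Time (p processors), here p = 4
    print(f"Speedup(4) = {t1 / tp:.2f} (ideal = 4)")
```

In practice the measured speedup falls short of the ideal p because of the parallelization overheads discussed on the next slide.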

Page 16:

The Goal of Parallel Processing

• The parallel processing goal is to maximize parallel speedup:
• Ideal speedup = p = number of processors.
– Very hard to achieve: implies no parallelization overheads and perfect load balance among all processors.
• Maximize parallel speedup by:
– Balancing computations on processors (every processor does the same amount of work).
– Minimizing communication cost and other overheads associated with each step of parallel program creation and execution.

• Performance Scalability:

Achieve a good speedup for the parallel application on the parallel architecture as problem size and machine size (number of processors) are increased.

   Speedup = Time(1) / Time(p)

   Speedup < Sequential work on one processor / Max (Work + Synch wait time + Comm cost + Extra work)

   where the synchronization wait time, communication cost, and extra work are the parallelization overheads, and the Max is taken over all processors.
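
A minimal sketch of this bound: given per-processor work and overhead figures (all numbers invented), the parallel time is the maximum over processors and the speedup follows directly:

```python
# Illustrative per-processor work and overheads, in seconds, for p = 4.
work       = [10.0, 11.5, 9.0, 12.0]   # uneven load balance
synch_wait = [1.0, 0.2, 2.0, 0.1]
comm_cost  = [0.5, 0.5, 0.5, 0.5]
extra_work = [0.3, 0.3, 0.3, 0.3]

time_1 = sum(work)                      # Time(1): all work on one processor
time_p = max(w + s + c + e              # Time(p): the slowest processor governs
             for w, s, c, e in zip(work, synch_wait, comm_cost, extra_work))

print(f"Speedup = {time_1 / time_p:.2f} (ideal = {len(work)})")
```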

Page 17:

Elements of Parallel Computing

[Diagram relating the elements: Computing Problems; Algorithms and Data Structures; Hardware Architecture; Operating System; Applications Software; High-level Languages; Performance Evaluation; connected by Mapping, Programming, and Binding (Compile, Load).]

Page 18:

Elements of Parallel Computing

1 Computing Problems:
– Numerical computing: science and technology numerical problems demand intensive integer and floating-point computations.
– Logical reasoning: artificial intelligence (AI) demands logic inferences, symbolic manipulations, and large space searches.

2 Algorithms and Data Structures:
– Special algorithms and data structures are needed to specify the computations and communication present in computing problems.
– Most numerical algorithms are deterministic, using regular data structures.

– Symbolic processing may use heuristics or non-deterministic searches.

– Parallel algorithm development requires interdisciplinary interaction.

Page 19:

Elements of Parallel Computing

3 Hardware Resources:
– Processors, memory, and peripheral devices form the hardware core of a computer system.
– The processor instruction set, processor connectivity, and memory organization influence the system architecture.

4 Operating Systems:
– Manage the allocation of resources to running processes.

– Mapping to match algorithmic structures with hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.

– Parallelism exploitation at: algorithm design, program writing, compilation, and run time.

Page 20:

Elements of Parallel Computing

5 System Software Support:
– Needed for the development of efficient programs in high-level languages (HLLs).
– Assemblers, loaders.
– Portable parallel programming languages.
– User interfaces and tools.

6 Compiler Support:
– Preprocessor compiler: a sequential compiler plus a low-level library of the target parallel computer.
– Precompiler: some program flow analysis, dependence checking, and limited optimizations for parallelism detection.
– Parallelizing compiler: can automatically detect parallelism in source code and transform sequential code into parallel constructs.

Page 21:

Approaches to Parallel Programming

(a) Implicit parallelism: the programmer writes source code in sequential languages (C, C++, FORTRAN, LISP, ...); a parallelizing compiler produces parallel object code, which is executed by the runtime system.

(b) Explicit parallelism: the programmer writes source code in concurrent dialects of C, C++, FORTRAN, LISP, ...; a concurrency-preserving compiler produces concurrent object code, which is executed by the runtime system.

Page 22:

Factors Affecting Parallel System Performance

• Parallel algorithm related:
– Available concurrency and its profile, grain, uniformity, and patterns.
– Required communication/synchronization, its uniformity and patterns.
– Data size requirements.
– Communication-to-computation ratio.
• Parallel program related:
– Programming model used.
– Resulting data/code memory requirements, locality, and working set characteristics.
– Parallel task grain size.
– Assignment: dynamic or static.
– Cost of communication/synchronization.
• Hardware/architecture related:
– Total CPU computational power available.
– Types of computation modes supported.
– Shared address space vs. message passing.
– Communication network characteristics (topology, bandwidth, latency).
– Memory hierarchy properties.

Page 23:

Evolution of Computer Architecture

[Diagram: scalar sequential machines evolve via lookahead and I/E overlap into functional parallelism (multiple functional units, pipelining); pipelined machines lead to implicit and explicit vector machines (memory-to-memory and register-to-register); these branch into SIMD (processor arrays, associative processors) and MIMD (multiprocessors, multicomputers), leading to Massively Parallel Processors (MPPs) and computer clusters.]

I/E: Instruction Fetch and Execute
SIMD: Single Instruction stream over Multiple Data streams
MIMD: Multiple Instruction streams over Multiple Data streams

Page 24:

Parallel Architectures History

Historically, parallel architectures were tied to programming models:

[Diagram: application software and system software built on divergent architectures: SIMD, message passing, shared memory, dataflow, systolic arrays.]

• Divergent architectures, with no predictable pattern of growth.

Page 25:

Parallel Programming Models

• Programming methodology used in coding applications.
• Specifies communication and synchronization.

• Examples:

– Multiprogramming: no communication or synchronization at the program level; a number of independent programs.
– Shared memory address space: parallel program threads or tasks communicate using a shared memory address space.
– Message passing: explicit point-to-point communication is used between parallel program tasks.
– Data parallel: more regimented, global actions on data.
• Can be implemented with a shared address space or with message passing.

Page 26:

Flynn’s 1972 Classification of Computer Architecture

• Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines.

• Single Instruction stream over Multiple Data streams (SIMD): Vector computers, array of synchronized processing elements.

• Multiple Instruction streams and a Single Data stream (MISD): Systolic arrays for pipelined execution.

• Multiple Instruction streams over Multiple Data streams (MIMD): Parallel computers:

• Shared memory multiprocessors.

• Multicomputers: Unshared distributed memory, message-passing used instead.

Page 27:

Flynn’s Classification of Computer Architecture

[Figure: Fig. 1.3, page 12, in Advanced Computer Architecture: Parallelism, Scalability, Programmability, Hwang, 1993.]

Page 28:

Current Trends in Parallel Architectures

• The extension of “computer architecture” to support communication and cooperation:

– OLD: Instruction Set Architecture

– NEW: Communication Architecture

• Defines:
– Critical abstractions, boundaries, and primitives (interfaces).

– Organizational structures that implement interfaces (hardware or software)

• Compilers, libraries and OS are important bridges today

Page 29:

Modern Parallel Architecture Layered Framework

[Diagram, layers from top to bottom:
Parallel applications (CAD, database, scientific modeling);
Programming models (multiprogramming, shared address, message passing, data parallel);
Communication abstraction (the user/system boundary; realized by compilation or library);
Operating systems support;
Communication hardware (the hardware/software boundary);
Physical communication medium.]

Page 30:

Shared Address Space Parallel Architectures

• Any processor can directly reference any memory location.
– Communication occurs implicitly as a result of loads and stores.
• Convenient:
– Location transparency.
– Programming model similar to time-sharing on uniprocessors:
• Except processes run on different processors.
• Good throughput on multiprogrammed workloads.
• Naturally provided on a wide range of platforms.
– Wide range of scale: a few to hundreds of processors.
• Popularly known as shared memory machines or model.
– Ambiguous: memory may be physically distributed among processors.

Page 31:

Shared Address Space (SAS) Parallel Programming Model

• Process: a virtual address space plus one or more threads of control.
• Portions of the address spaces of processes are shared.
• Writes to a shared address are visible to other threads (in other processes too).
• A natural extension of the uniprocessor model:
• Conventional memory operations are used for communication.
• Special atomic operations are needed for synchronization (see the sketch after the figure below).
• The OS uses shared memory to coordinate processes.

[Figure: virtual address spaces for a collection of processes (P0 ... Pn) communicating via shared addresses; loads and stores to the shared portion of each address space map to common physical addresses in the machine physical address space, while each process keeps a private portion.]
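
A minimal sketch of the SAS model using Python threads: ordinary reads and writes to a shared variable communicate, while a lock plays the role of the special atomic operation needed for synchronization (illustrative only):

```python
import threading

shared = {"counter": 0}        # shared portion of the address space
lock = threading.Lock()        # atomic synchronization primitive

def worker(n):
    for _ in range(n):
        with lock:             # without the lock, the read-modify-write races
            shared["counter"] += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["counter"])       # 400000: every write is visible to all threads
```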

Page 32:

Models of Shared-Memory Multiprocessors

• The Uniform Memory Access (UMA) model:
– The physical memory is shared by all processors.
– All processors have equal access to all memory addresses.
– Also referred to as Symmetric Memory Processors (SMPs).
• The distributed memory or Nonuniform Memory Access (NUMA) model:
– Shared memory is physically distributed locally among processors. Access latency to remote memory is higher.

• The Cache-Only Memory Architecture (COMA) Model:

– A special case of a NUMA machine where all distributed main memory is converted to caches.

– No memory hierarchy at each processor.

Page 33:

Models of Shared-Memory MultiprocessorsModels of Shared-Memory Multiprocessors

I/O ctrlMem Mem Mem

Interconnect

Mem I/O ctrl

Processor Processor

Interconnect

I/Odevices

M M M

Network

P

$

P

$

P

$

Network

D

P

C

D

P

C

D

P

C

Distributed memory or Nonuniform Memory Access (NUMA) Model

Uniform Memory Access (UMA) Model

or Symmetric Memory Processors (SMPs). Interconnect: Bus, Crossbar, Multistage networkP: ProcessorM: MemoryC: CacheD: Cache directory

Cache-Only Memory Architecture (COMA)

Page 34:

Uniform Memory Access Example: Intel Pentium Pro Quad

[Diagram: four P-Pro modules (CPU, 256-KB L2 cache, interrupt controller, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller and MIU driving 1-, 2-, or 4-way interleaved DRAM, and PCI bridges connecting PCI buses with I/O cards.]

• All coherence and multiprocessing glue in processor module

• Highly integrated, targeted at high volume

• Low latency and bandwidth

Page 35:

Uniform Memory Access Example: SUN Enterprise

– 16 cards of either type: processors + memory, or I/O.
– All memory is accessed over the bus, so access is symmetric.
– Higher bandwidth, higher latency bus.

[Diagram: CPU/memory cards (two processors, each with a level-2 cache, plus a memory controller) and I/O cards (bus interface/switch with SBUS slots, 2 FiberChannel, 100bT, and SCSI interfaces) on the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]

Page 36:

Distributed Shared-Memory Multiprocessor System Example: Cray T3E

[Diagram: each node has a processor with cache, local memory, and a memory controller with network interface (NI); nodes connect through switches into a 3-D (X, Y, Z) torus, with external I/O attached.]

• Scale up to 1024 processors, 480MB/s links

• Memory controller generates communication requests for nonlocal references

• No hardware mechanism for coherence (SGI Origin etc. provide this)

Page 37:

Message-Passing Multicomputers

• Comprised of multiple autonomous computers (nodes) connected via a suitable network.

• Each node consists of one or more processors, local memory, attached storage and I/O peripherals.

• Local memory is only accessible by local processors in a node.

• Inter-node communication is carried out by message passing through the connection network

• Process communication is achieved using a message-passing programming environment.
– The programming model is more removed from basic hardware operations.
• Examples include:
– A number of commercial Massively Parallel Processor systems (MPPs).
– Computer clusters that utilize commodity off-the-shelf (COTS) components.

Page 38:

Message-Passing Abstraction

• Send specifies buffer to be transmitted and receiving process

• Receive specifies sending process and application storage to receive into

• Memory to memory copy possible, but need to name processes

• Optional tag on send and matching rule on receive

• User process names local data and entities in process/tag space too

• In its simplest form, the send/receive match achieves a pairwise synchronization event

• Many overheads: copying, buffer management, protection

[Diagram: process P executes Send X, Q, t (send the data at local address X to process Q with tag t); process Q executes Receive Y, P, t (receive into local address Y from process P, matching tag t); the match pairs the two operations and copies between the local process address spaces.]
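
A minimal sketch of the abstraction using Python's multiprocessing module; the (tag, data) tuple emulates the optional tag by hand, since Pipe itself has no tag matching (illustrative only):

```python
from multiprocessing import Process, Pipe

def process_p(conn):
    x = [1, 2, 3]               # data at local "address X" in P's space
    conn.send(("t", x))         # Send X, Q, t

def process_q(conn):
    tag, y = conn.recv()        # Receive Y, P, t: memory-to-memory copy
    assert tag == "t"           # matching rule on receive
    print("Q received:", y)

if __name__ == "__main__":
    p_end, q_end = Pipe()       # channel between the two processes
    p = Process(target=process_p, args=(p_end,))
    q = Process(target=process_q, args=(q_end,))
    p.start(); q.start()
    p.join(); q.join()
```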

Page 39:

Message-Passing Example: IBM SP-2

[Diagram: an SP-2 node is essentially an RS6000 workstation: a Power 2 CPU with L2 cache on the memory bus, a memory controller with 4-way interleaved DRAM, and an i860-based NIC with DMA on the MicroChannel I/O bus; nodes connect through a general interconnection network formed from 8-port switches.]

• Made out of essentially complete RS6000 workstations.
• Network interface integrated in the I/O bus (bandwidth limited by the I/O bus).

Page 40:

Message-Passing Example: Intel Paragon

[Diagram: each Paragon node has two i860 processors with L1 caches on a 64-bit, 50 MHz memory bus, a memory controller with 4-way interleaved DRAM, and an NI with DMA driver; nodes attach to a 2-D grid network (8-bit, 175 MHz, bidirectional links) with a processing node at every switch. Pictured: Sandia's Intel Paragon XP/S-based supercomputer.]

Page 41:

Message-Passing Programming Tools

• Message-passing programming environments include:
– Message Passing Interface (MPI):
• Provides a standard for writing concurrent message-passing programs.
• MPI implementations include parallel libraries used by existing programming languages.

– Parallel Virtual Machine (PVM):
• Enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource.
• PVM support software executes on each machine in a user-configurable pool and provides a computational environment for concurrent applications.
• User programs, written for example in C, Fortran, or Java, are provided access to PVM through calls to PVM library routines.
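
As a concrete taste of MPI, here is a minimal two-rank send/receive using the mpi4py Python bindings (assumes mpi4py and an MPI runtime are installed; run with, e.g., mpiexec -n 2 python hello_mpi.py):

```python
# hello_mpi.py -- minimal MPI point-to-point message between two ranks.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {"msg": "hello", "payload": [1, 2, 3]}
    comm.send(data, dest=1, tag=11)      # blocking send to rank 1
elif rank == 1:
    data = comm.recv(source=0, tag=11)   # blocking receive, matching the tag
    print("rank 1 received:", data)
```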

Page 42:

Data Parallel Systems
SIMD in Flynn's Taxonomy

• Programming model:
– Operations are performed in parallel on each element of a data structure.
– Logically a single thread of control performs sequential or parallel steps.
– Conceptually, a processor is associated with each data element.
• Architectural model:
– Array of many simple, cheap processors, each with little memory.
• Processors don't sequence through instructions.
– Attached to a control processor that issues instructions.
– Specialized and general communication, cheap global synchronization.

• Example machines:
– Thinking Machines CM-1, CM-2 (and CM-5).
– Maspar MP-1 and MP-2.

[Diagram: a control processor broadcasting instructions to a grid of PEs.]
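
The data-parallel style survives today in array languages; a minimal NumPy sketch of a single logical thread of control applying global actions to every element at once (conceptually, one PE per data element):

```python
import numpy as np

a = np.arange(9, dtype=np.float64).reshape(3, 3)
b = np.ones((3, 3))

c = a * b + 2.0      # one global operation on all elements at once
mask = c > 4.0       # elementwise predicate, akin to PE activity masks
c[mask] = 0.0        # conditional update applied only to "active" elements
print(c)
```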

Page 43:

Dataflow Architectures

• Represent computation as a graph of essential dependences.

– Logical processor at each node, activated by availability of operands

– Message (tokens) carrying tag of next instruction sent to next processor

– Tag compared with others in matching store; match fires execution

[Dataflow graph for: a = (b + 1) x (b - c); d = c x e; f = a x d.]

[Diagram of a dataflow processor pipeline: tokens arriving from the network enter the token store and token queue; the waiting/matching stage fires instruction fetch from the program store, then execute, then form token, sending result tokens back out through the network.]

Research dataflow machine prototypes include:
• The MIT Tagged-Token Dataflow Architecture
• The Manchester Dataflow Machine
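
A toy software sketch of firing-by-operand-availability for the graph above (real dataflow machines do this token matching in hardware; the scheduling loop here is purely illustrative):

```python
# Toy dataflow interpreter for: a = (b+1)*(b-c); d = c*e; f = a*d.
# A node fires as soon as tokens are available on all of its inputs.
import operator

nodes = {                                  # node name: (operation, inputs)
    "t1": (lambda b: b + 1, ["b"]),
    "t2": (operator.sub, ["b", "c"]),
    "a":  (operator.mul, ["t1", "t2"]),
    "d":  (operator.mul, ["c", "e"]),
    "f":  (operator.mul, ["a", "d"]),
}
tokens = {"b": 4, "c": 1, "e": 3}          # initial operand tokens

fired = set()
while len(fired) < len(nodes):
    for name, (op, ins) in nodes.items():
        if name not in fired and all(i in tokens for i in ins):
            tokens[name] = op(*(tokens[i] for i in ins))   # fire the node
            fired.add(name)

print(tokens["f"])    # (4+1)*(4-1) * (1*3) = 45
```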

Page 44:

Systolic Architectures

[Diagram: conventional organization (memory feeding a single PE) vs. a systolic organization (memory feeding an array of PEs, with data flowing from PE to PE).]

• Replace single processor with an array of regular processing elements

• Orchestrate data flow for high throughput with less memory access

• Different from pipelining:
– Nonlinear array structure, multidirectional data flow; each PE may have a (small) local instruction and data memory.
• Different from SIMD: each PE may do something different.
• Initial motivation: VLSI enables inexpensive special-purpose chips.
• Represent algorithms directly by chips connected in a regular pattern.

Page 45:

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

• Processors arranged in a 2-D grid.
• Each processor accumulates one element of the product.

[Figure, T = 0: rows of A and columns of B, staggered in time ("alignments in time"), queued at the edges of the array; no products computed yet.]

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

Page 46:

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

• Processors arranged in a 2-D grid.
• Each processor accumulates one element of the product.

[Figure, T = 1: a0,0 and b0,0 meet at PE(0,0), which accumulates a0,0*b0,0.]

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

Page 47:

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

• Processors arranged in a 2-D grid.
• Each processor accumulates one element of the product.

[Figure, T = 2: PE(0,0) holds a0,0*b0,0 + a0,1*b1,0; PE(0,1) starts a0,0*b0,1; PE(1,0) starts a1,0*b0,0.]

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

Page 48:

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

• Processors arranged in a 2-D grid.
• Each processor accumulates one element of the product.

[Figure, T = 3: PE(0,0) completes a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0; PE(0,1) holds a0,0*b0,1 + a0,1*b1,1; PE(1,0) holds a1,0*b0,0 + a1,1*b1,0; PE(0,2), PE(1,1), and PE(2,0) start their first products.]

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

Page 49:

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

• Processors arranged in a 2-D grid.
• Each processor accumulates one element of the product.

[Figure, T = 4: PE(0,1) completes a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1; PE(1,0) completes a1,0*b0,0 + a1,1*b1,0 + a1,2*b2,0; PE(0,2), PE(1,1), and PE(2,0) hold two-term sums; PE(1,2) and PE(2,1) start their first products.]

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

Page 50:

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

• Processors arranged in a 2-D grid.
• Each processor accumulates one element of the product.

[Figure, T = 5: PE(0,2) completes a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2; PE(1,1) completes a1,0*b0,1 + a1,1*b1,1 + a1,2*b2,1; PE(2,0) completes a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0; PE(1,2) and PE(2,1) hold two-term sums; PE(2,2) starts a2,0*b0,2.]

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

Page 51:

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

• Processors arranged in a 2-D grid.
• Each processor accumulates one element of the product.

[Figure, T = 6: PE(1,2) completes a1,0*b0,2 + a1,1*b1,2 + a1,2*b2,2; PE(2,1) completes a2,0*b0,1 + a2,1*b1,1 + a2,2*b2,1; PE(2,2) holds a2,0*b0,2 + a2,1*b1,2.]

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

Page 52:

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

• Processors arranged in a 2-D grid.
• Each processor accumulates one element of the product.

[Figure, T = 7: PE(2,2) completes a2,0*b0,2 + a2,1*b1,2 + a2,2*b2,2. Done: all nine elements of the product have been accumulated.]

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/
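
A small Python model of the staggered schedule above: at step t, PE(i, j) multiplies a[i][k] and b[k][j] with k = t - i - j, which is exactly the "alignment in time" of the operands (the slides number steps from T = 1; this model starts at t = 0):

```python
# Software model of the 3x3 systolic matrix multiply: at step t, PE(i, j)
# receives a[i][k] from the left and b[k][j] from above, where k = t - i - j,
# and accumulates one partial product per step.
n = 3
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
c = [[0] * n for _ in range(n)]

for t in range(3 * n - 2):           # enough steps for PE(n-1, n-1) to finish
    for i in range(n):
        for j in range(n):
            k = t - i - j            # which operand pair reaches PE(i, j) now
            if 0 <= k < n:
                c[i][j] += a[i][k] * b[k][j]

print(c)   # matches an ordinary matrix multiply
```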

