parallel computer architecture - rochester institute of...
TRANSCRIPT
![Page 1: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/1.jpg)
EECC756 - Shaaban #1 lec # 1 Spring 2002 3-12-2002
Parallel Computer Architecture
• A parallel computer is a collection of processing elements that cooperate to solve large problems fast.
• Broad issues involved:
  – Resource Allocation:
    • Number of processing elements (PEs).
    • Computing power of each element.
    • Amount of physical memory used.
  – Data access, Communication and Synchronization:
    • How the elements cooperate and communicate.
    • How data is transmitted between processors.
    • Abstractions and primitives for cooperation.
  – Performance and Scalability:
    • Performance enhancement of parallelism: Speedup.
    • Scalability of performance to larger systems/problems.
![Page 2: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/2.jpg)
The Need and Feasibility of Parallel Computing
• Application demands: More computing cycles:
  – Scientific computing: CFD, Biology, Chemistry, Physics, ...
  – General-purpose computing: Video, Graphics, CAD, Databases, Transaction Processing, Gaming…
  – Mainstream multithreaded programs are similar to parallel programs.
• Technology Trends:
  – Number of transistors on chip growing rapidly.
  – Clock rates expected to go up, but only slowly.
• Architecture Trends:
  – Instruction-level parallelism is valuable but limited.
  – Coarser-level parallelism, as in MPs, is the most viable approach.
• Economics:
  – Today’s microprocessors have multiprocessor support, eliminating the need for designing expensive custom PEs.
  – Lower parallel system cost.
  – Multiprocessor systems to offer a cost-effective replacement of uniprocessor systems in mainstream computing.
![Page 3: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/3.jpg)
Scientific Computing Demand
![Page 4: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/4.jpg)
Scientific Supercomputing Trends
• Proving ground and driver for innovative architecture and advanced techniques:
  – Market is much smaller relative to commercial segment.
  – Dominated by vector machines starting in the 70s.
  – Meanwhile, microprocessors have made huge gains in floating-point performance:
    • High clock rates.
    • Pipelined floating point units.
    • Instruction-level parallelism.
    • Effective use of caches.
• Large-scale multiprocessors replace vector supercomputers:
  – Well under way already.
![Page 5: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/5.jpg)
Raw Uniprocessor Performance: LINPACK
[Figure: LINPACK performance (MFLOPS, log scale from 1 to 10,000) versus year (1975 to 2000). Four series: CRAY n = 100, CRAY n = 1,000, Micro n = 100, Micro n = 1,000. Labeled CRAY machines: CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94. Labeled microprocessor systems: Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha AXP, DEC Alpha, HP9000/735, MIPS R4400, IBM Power2/990, DEC 8200.]
![Page 6: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/6.jpg)
Raw Parallel Performance: LINPACK
[Figure: LINPACK performance (GFLOPS, log scale from 0.1 to 10,000) versus year (1985 to 1996). Two series: CRAY peak and MPP peak. Labeled machines: Xmp/416(4), Ymp/832(8), C90(16), T932(32), nCUBE/2(1024), iPSC/860, CM-2, CM-200, Delta, Paragon XP/S, CM-5, T3D, Paragon XP/S MP (1024), Paragon XP/S MP (6768), ASCI Red.]
![Page 7: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/7.jpg)
General Technology Trends
• Microprocessor performance increases 50% - 100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
[Figure: Integer and FP performance (scale 0 to 180) versus year (1987 to 1992) for Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC alpha.]
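Growth stated as "doubles every 3 years" is easier to compare with a per-year percentage once it is converted to an annual rate. The following is a minimal sketch assuming simple compound growth; the helper name `annual_rate` is illustrative and not part of the lecture.

```python
def annual_rate(factor: float, years: float) -> float:
    """Annual compound growth rate implied by growing `factor`-fold in `years` years."""
    return factor ** (1.0 / years) - 1.0


# Doubling every 3 years (transistor count bullet)
print(f"doubling every 3 years    -> {annual_rate(2, 3):.1%} per year")
# Quadrupling every 3 years (DRAM size bullet)
print(f"quadrupling every 3 years -> {annual_rate(4, 3):.1%} per year")
```

Under this assumption, doubling every 3 years corresponds to roughly 26% per year and quadrupling every 3 years to roughly 59% per year.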
![Page 8: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/8.jpg)
Clock Frequency Growth Rate
[Figure: Clock rate (MHz, log scale from 0.1 to 1,000) versus year (1970 to 2005). Labeled processors: i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, R10000.]
• Currently increasing 30% per year
![Page 9: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/9.jpg)
Transistor Count Growth Rate
[Figure: Transistors per chip (log scale from 1,000 to 100,000,000) versus year (1970 to 2005). Labeled processors: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
• 100 million transistors on chip by early 2000’s A.D.
• Transistor count grows much faster than clock rate
  – Currently 40% per year
![Page 10: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/10.jpg)
System Attributes to Performance
• Performance benchmarking is program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures:
  – Cycles per instruction (CPI)
  – Total CPU time: $T = C \times \tau = C / f = I_c \times CPI \times \tau = I_c \times (p + m \times k) \times \tau$
    $I_c$ = Instruction count
    $\tau$ = CPU cycle time
    $p$ = Instruction decode cycles
    $m$ = Memory cycles
    $k$ = Ratio between memory/processor cycles
    $C$ = Total program clock cycles
    $f$ = clock rate
  – MIPS Rate $= I_c / (T \times 10^6) = f / (CPI \times 10^6) = f \times I_c / (C \times 10^6)$
  – Throughput Rate: $W_p = f / (I_c \times CPI) = (MIPS) \times 10^6 / I_c$
• Performance factors ($I_c$, $p$, $m$, $k$, $\tau$) are influenced by: instruction-set architecture, compiler design, CPU implementation and control, cache and memory hierarchy.
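The relations above can be checked numerically. This is a small sketch, not part of the lecture: the function name `cpu_metrics` and all input values are hypothetical, and `m` is interpreted here as a per-instruction count so that CPI = p + m × k.

```python
def cpu_metrics(ic, p, m, k, f_hz):
    """Return (CPI, CPU time in seconds, MIPS rate, throughput in programs/second).

    ic   : instruction count (Ic)
    p    : processor cycles per instruction (decode and execute)
    m    : memory term per instruction (assumed interpretation)
    k    : ratio between memory cycle and processor cycle
    f_hz : clock rate f in Hz, so the cycle time is tau = 1 / f
    """
    tau = 1.0 / f_hz
    cpi = p + m * k                 # CPI = p + m x k
    t = ic * cpi * tau              # T = Ic x CPI x tau
    mips = ic / (t * 1e6)           # MIPS = Ic / (T x 10^6)
    wp = f_hz / (ic * cpi)          # Wp = f / (Ic x CPI)
    return cpi, t, mips, wp


# Hypothetical program: 2 million instructions on a 100 MHz processor.
cpi, t, mips, wp = cpu_metrics(ic=2_000_000, p=2, m=0.5, k=4, f_hz=100e6)
print(f"CPI = {cpi}, T = {t:.3f} s, MIPS = {mips:.1f}, Wp = {wp:.1f} programs/s")
```

With these made-up inputs the CPI is 4, so the two MIPS expressions agree: Ic / (T × 10^6) and f / (CPI × 10^6) both give 25.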
![Page 11: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/11.jpg)
CPU Performance Trends
[Figure: Performance (log scale from 0.1 to 100) versus year (1965 to 1995) for Supercomputers, Mainframes, Minicomputers and Microprocessors.]
The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance.
![Page 12: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/12.jpg)
Parallelism in Microprocessor VLSI Generations
[Figure: Transistors per chip (log scale from 1,000 to 100,000,000) versus year (1970 to 2005), divided into three eras: Bit-level parallelism, Instruction-level, Thread-level (?). Labeled processors: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
![Page 13: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/13.jpg)
The Goal of Parallel Computing
• Goal of applications in using parallel machines: Speedup

  $\text{Speedup (p processors)} = \dfrac{\text{Performance (p processors)}}{\text{Performance (1 processor)}}$

• For a fixed problem size (input data set), performance = 1/time

  $\text{Speedup}_{\text{fixed problem}}\text{ (p processors)} = \dfrac{\text{Time (1 processor)}}{\text{Time (p processors)}}$
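As a worked illustration of the two speedup definitions, the sketch below computes speedup from execution times and from performance values. The timings are hypothetical and the function names are illustrative only.

```python
def speedup_from_time(time_1: float, time_p: float) -> float:
    """Fixed-problem speedup: Time(1 processor) / Time(p processors)."""
    return time_1 / time_p


def speedup_from_performance(perf_p: float, perf_1: float) -> float:
    """General speedup: Performance(p processors) / Performance(1 processor)."""
    return perf_p / perf_1


# Hypothetical execution times (seconds) for one fixed input data set.
times = {1: 120.0, 2: 64.0, 4: 35.0, 8: 20.0}
for p, t in times.items():
    s = speedup_from_time(times[1], t)
    print(f"p = {p}: time = {t:6.1f} s, speedup = {s:.2f}")
```

Because performance = 1/time for a fixed problem, `speedup_from_performance(1 / t_p, 1 / t_1)` yields the same value as `speedup_from_time(t_1, t_p)`.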
![Page 14: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/14.jpg)
Elements of Modern Computers
[Figure: Block diagram relating Computing Problems, Algorithms and Data Structures, High-level Languages, Applications Software, Operating System, Hardware Architecture and Performance Evaluation, with the steps Mapping, Programming and Binding (Compile, Load).]
![Page 15: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/15.jpg)
Elements of Modern Computers
1 Computing Problems:
  – Numerical Computing: Science and technology numerical problems demand intensive integer and floating point computations.
  – Logical Reasoning: Artificial intelligence (AI) demands logic inferences, symbolic manipulations and large space searches.
2 Algorithms and Data Structures:
  – Special algorithms and data structures are needed to specify the computations and communication present in computing problems.
  – Most numerical algorithms are deterministic, using regular data structures.
  – Symbolic processing may use heuristics or non-deterministic searches.
  – Parallel algorithm development requires interdisciplinary interaction.
![Page 16: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/16.jpg)
Elements of Modern Computers
3 Hardware Resources:
  – Processors, memory, and peripheral devices form the hardware core of a computer system.
  – Processor instruction set, processor connectivity and memory organization influence the system architecture.
4 Operating Systems:
  – Manage the allocation of resources to running processes.
  – Mapping to match algorithmic structures with hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.
  – Parallelism exploitation at: algorithm design, program writing, compilation, and run time.
![Page 17: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/17.jpg)
Elements of Modern Computers
5 System Software Support:
  – Needed for the development of efficient programs in high-level languages (HLLs).
  – Assemblers, loaders.
  – Portable parallel programming languages.
  – User interfaces and tools.
6 Compiler Support:
  – Preprocessor compiler: Sequential compiler and low-level library of the target parallel computer.
  – Precompiler: Some program flow analysis, dependence checking, limited optimizations for parallelism detection.
  – Parallelizing compiler: Can automatically detect parallelism in source code and transform sequential code into parallel constructs.
![Page 18: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/18.jpg)
Approaches to Parallel Programming
(a) Implicit Parallelism: Programmer → Source code written in sequential languages C, C++, FORTRAN, LISP ... → Parallelizing compiler → Parallel object code → Execution by runtime system
(b) Explicit Parallelism: Programmer → Source code written in concurrent dialects of C, C++, FORTRAN, LISP ... → Concurrency preserving compiler → Concurrent object code → Execution by runtime system
![Page 19: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/19.jpg)
Evolution of Computer Architecture
[Figure: Evolution tree starting from Scalar, through Sequential and Lookahead, I/E Overlap and Functional Parallelism, Multiple Func. Units and Pipeline, Implicit Vector and Explicit Vector (Memory-to-Memory, Register-to-Register), to SIMD (Associative Processor, Processor Array) and MIMD (Multicomputer, Multiprocessor), ending in Massively Parallel Processors (MPPs).]
I/E: Instruction Fetch and Execute
SIMD: Single Instruction stream over Multiple Data streams
MIMD: Multiple Instruction streams over Multiple Data streams
![Page 20: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/20.jpg)
Parallel Architectures History
Historically, parallel architectures tied to programming models
• Divergent architectures, with no predictable pattern of growth.
[Figure: Application Software and System Software layered over separate Architecture branches: Systolic Arrays, SIMD, Message Passing, Dataflow, Shared Memory.]
![Page 21: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/21.jpg)
Programming Models
• Programming methodology used in coding applications
• Specifies communication and synchronization
• Examples:
  – Multiprogramming: No communication or synchronization at program level
  – Shared memory address space:
  – Message passing: Explicit point to point communication
  – Data parallel: More regimented, global actions on data
    • Implemented with shared address space or message passing
![Page 22: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/22.jpg)
Flynn’s 1972 Classification of Computer Architecture
• Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines.
• Single Instruction stream over Multiple Data streams (SIMD): Vector computers, array of synchronized processing elements.
• Multiple Instruction streams and a Single Data stream (MISD): Systolic arrays for pipelined execution.
• Multiple Instruction streams over Multiple Data streams (MIMD): Parallel computers:
  • Shared memory multiprocessors.
  • Multicomputers: Unshared distributed memory, message-passing used instead.
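The four classes follow mechanically from the number of instruction streams and data streams. A minimal sketch of that mapping (the function name `flynn_class` is illustrative, not from the lecture):

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Map stream counts to Flynn's class name: SISD, SIMD, MISD or MIMD."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    i = "S" if instruction_streams == 1 else "M"   # Single or Multiple Instruction
    d = "S" if data_streams == 1 else "M"          # Single or Multiple Data
    return f"{i}I{d}D"


for streams in [(1, 1), (1, 64), (4, 1), (16, 16)]:
    print(streams, "->", flynn_class(*streams))
```

The stream counts in the loop are arbitrary examples; only "one versus more than one" matters for the classification.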
![Page 23: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/23.jpg)
Flynn’s Classification of Computer Architecture
Fig. 1.3 page 12 in Advanced Computer Architecture: Parallelism, Scalability, Programmability, Hwang, 1993.
![Page 24: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/24.jpg)
Current Trends In Parallel Architectures
• The extension of “computer architecture” to support communication and cooperation:
  – OLD: Instruction Set Architecture
  – NEW: Communication Architecture
• Defines:
  – Critical abstractions, boundaries, and primitives (interfaces)
  – Organizational structures that implement interfaces (hardware or software)
• Compilers, libraries and OS are important bridges today
![Page 25: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/25.jpg)
Modern Parallel Architecture Layered Framework
[Figure: Layers, top to bottom:
  Parallel applications: CAD, Database, Scientific modeling
  Programming models: Multiprogramming, Shared address, Message passing, Data parallel
  Communication abstraction (User/system boundary), reached through Compilation or library
  Operating systems support
  Communication hardware (Hardware/software boundary)
  Physical communication medium]
![Page 26: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/26.jpg)
Shared Address Space Parallel Architectures
• Any processor can directly reference any memory location
  – Communication occurs implicitly as result of loads and stores
• Convenient:
  – Location transparency
  – Similar programming model to time-sharing in uniprocessors
    • Except processes run on different processors
    • Good throughput on multiprogrammed workloads
• Naturally provided on a wide range of platforms
  – Wide range of scale: few to hundreds of processors
• Popularly known as shared memory machines or model
  – Ambiguous: Memory may be physically distributed among processors
![Page 27: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/27.jpg)
Shared Address Space (SAS) Model
• Process: virtual address space plus one or more threads of control
• Portions of address spaces of processes are shared
• Writes to shared address visible to other threads (in other processes too)
• Natural extension of the uniprocessor model:
  – Conventional memory operations used for communication
  – Special atomic operations needed for synchronization
• OS uses shared memory to coordinate processes
[Figure: Virtual address spaces for a collection of processes (P0, P1, P2, ..., Pn) communicating via shared addresses. The shared portion of each address space maps to common physical addresses in the machine physical address space and is accessed with Load and Store; each process also has a private portion (P0 private, P1 private, P2 private, ..., Pn private).]
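The two ingredients named above, ordinary memory operations for communication and a special operation for synchronization, can be sketched with Python threads. This is only an analogy under stated assumptions: the threads share one process address space rather than portions of several, the names `shared`, `worker` and `parallel_sum` are illustrative, and `threading.Lock` stands in for a hardware atomic operation.

```python
import threading

shared = {"sum": 0}        # stands in for the shared portion of the address space
lock = threading.Lock()    # stands in for a special atomic synchronization operation


def worker(values):
    local = sum(values)            # private computation on private data
    with lock:                     # synchronization: one writer at a time
        shared["sum"] += local     # communication: an ordinary store to a shared location


def parallel_sum(data, n_threads=4):
    """Sum `data` by letting each thread add its partial sum to a shared location."""
    shared["sum"] = 0
    chunks = [data[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["sum"]           # an ordinary load observes the other threads' writes


print(parallel_sum(list(range(1, 101))))
```

No explicit message is ever sent: the result becomes visible to the main thread simply because every worker wrote to the same location.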
![Page 28: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/28.jpg)
Models of Shared-Memory Multiprocessors
• The Uniform Memory Access (UMA) Model:
  – The physical memory is shared by all processors.
  – All processors have equal access to all memory addresses.
• Distributed memory or Nonuniform Memory Access (NUMA) Model:
  – Shared memory is physically distributed locally among processors.
• The Cache-Only Memory Architecture (COMA) Model:
  – A special case of a NUMA machine where all distributed main memory is converted to caches.
  – No memory hierarchy at each processor.
![Page 29: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/29.jpg)
Models of Shared-Memory Multiprocessors
[Figure: Three organizations.
  Uniform Memory Access (UMA) Model: processors, memory modules and I/O controllers (with I/O devices) attached to a common interconnect.
  Distributed memory or Nonuniform Memory Access (NUMA) Model: nodes, each with a processor (P), cache ($) and local memory (M), connected by a network.
  Cache-Only Memory Architecture (COMA): nodes, each with a processor (P), cache (C) and cache directory (D), connected by a network.
  Legend: Interconnect: Bus, Crossbar, Multistage network. P: Processor. M: Memory. C: Cache. D: Cache directory.]
![Page 30: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/30.jpg)
Uniform Memory Access Example: Intel Pentium Pro Quad
[Figure: P-Pro modules (each with CPU, interrupt controller, 256-KB L2 $ and bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), together with a memory controller and MIU driving 1-, 2-, or 4-way interleaved DRAM, and PCI bridges to PCI buses with PCI I/O cards.]
• All coherence and multiprocessing glue in processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth
![Page 31: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/31.jpg)
Uniform Memory Access Example: SUN Enterprise
– 16 cards of either type: processors + memory, or I/O
– All memory accessed over bus, so symmetric
– Higher bandwidth, higher latency bus
[Figure: CPU/mem cards (processors, each with $ and $2 caches, and a memory controller behind a bus interface/switch) and I/O cards (bus interface, SBUS slots, 2 FiberChannel, 100bT, SCSI) on the Gigaplane bus (256 data, 41 address, 83 MHz).]
![Page 32: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/32.jpg)
Distributed Shared-Memory Multiprocessor System Example: Cray T3E
[Figure: T3E node with processor (P), cache ($), memory (Mem), memory controller and network interface (Mem ctrl and NI), external I/O, and a switch with X, Y and Z links.]
• Scale up to 1024 processors, 480 MB/s links
• Memory controller generates communication requests for nonlocal references
• No hardware mechanism for coherence (SGI Origin etc. provide this)
![Page 33: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/33.jpg)
Message-Passing Multicomputers
• Composed of multiple autonomous computers (nodes).
• Each node consists of a processor, local memory, attached storage and I/O peripherals.
• Programming model more removed from basic hardware operations.
• Local memory is only accessible by local processors.
• A message passing network provides point-to-point static connections among the nodes.
• Inter-node communication is carried out by message passing through the static connection network.
• Process communication achieved using a message-passing programming environment.
![Page 34: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/34.jpg)
Message-Passing Abstraction
• Send specifies buffer to be transmitted and receiving process
• Recv specifies sending process and application storage to receive into
• Memory to memory copy, but need to name processes
• Optional tag on send and matching rule on receive
• User process names local data and entities in process/tag space too
• In simplest form, the send/recv match achieves pairwise synch event
• Many overheads: copying, buffer management, protection
[Figure: Process P executes "Send X, Q, t" on Address X in its local process address space; Process Q executes "Receive Y, P, t" into Address Y in its local process address space; the two operations are paired by a match.]
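The send/receive pairing with a tag-based matching rule can be sketched with two threads and a mailbox per process. This is a toy model under stated assumptions, not a real message-passing library: the class `Process` and its methods are hypothetical names, and a copy of the buffer stands in for the memory-to-memory transfer.

```python
import queue
import threading


class Process:
    """Toy process with a private mailbox (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.mailbox = queue.Queue()
        self.unmatched = []      # messages that arrived but did not match yet

    def send(self, data, dest, tag):
        # Send names the buffer, the receiving process and a tag.
        # The buffer is copied, so sender and receiver share no memory.
        dest.mailbox.put((self.name, tag, list(data)))

    def recv(self, source, tag):
        # Recv names the sending process and a tag; first check buffered messages.
        for i, (src, t, payload) in enumerate(self.unmatched):
            if src == source.name and t == tag:
                del self.unmatched[i]
                return payload
        while True:
            src, t, payload = self.mailbox.get()   # blocks until a message arrives
            if src == source.name and t == tag:
                return payload                     # match: pairwise synchronization
            self.unmatched.append((src, t, payload))


P, Q = Process("P"), Process("Q")
result = {}


def run_q():
    result["Y"] = Q.recv(P, tag=7)    # Receive Y, P, t


receiver = threading.Thread(target=run_q)
receiver.start()
P.send([1, 2, 3], Q, tag=99)          # different tag: buffered, not matched
P.send([4, 5, 6], Q, tag=7)           # Send X, Q, t
receiver.join()
print(result["Y"])
```

The `unmatched` list hints at the buffer-management overhead mentioned above: a message whose tag is not yet wanted has to be stored until a matching receive is posted.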
![Page 35: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/35.jpg)
Message-Passing Example: IBM SP-2
[Figure: IBM SP-2 node with a Power 2 CPU and L2 $ on the memory bus, a memory controller with 4-way interleaved DRAM, and a MicroChannel bus carrying I/O, DMA and the NIC (i860, NI, DRAM). Nodes connect to a general interconnection network formed from 8-port switches.]
• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bandwidth limited by I/O bus)
![Page 36: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/36.jpg)
Message-Passing Example: Intel Paragon
[Figure: Intel Paragon node with two i860 processors (each with L1 $), NI, DMA, driver, and a memory controller with 4-way interleaved DRAM on the memory bus (64-bit, 50 MHz). Nodes connect to a 2D grid network (8 bits, 175 MHz, bidirectional) with a processing node attached to every switch. Photo: Sandia’s Intel Paragon XP/S-based Supercomputer.]
![Page 37: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/37.jpg)
Message-Passing Programming Tools
• Message-passing programming libraries include:
  – Message Passing Interface (MPI):
    • Provides a standard for writing concurrent message-passing programs.
    • MPI implementations include parallel libraries used by existing programming languages.
  – Parallel Virtual Machine (PVM):
    • Enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource.
    • PVM support software executes on each machine in a user-configurable pool, and provides a computational environment for concurrent applications.
    • User programs written, for example, in C, Fortran or Java are provided access to PVM through calls to PVM library routines.
![Page 38: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/38.jpg)
Data Parallel Systems (SIMD in Flynn taxonomy)
• Programming model
  – Operations performed in parallel on each element of data structure
  – Logically single thread of control, performs sequential or parallel steps
  – Conceptually, a processor is associated with each data element
• Architectural model
  – Array of many simple, cheap processors each with little memory
    • Processors don’t sequence through instructions
  – Attached to a control processor that issues instructions
  – Specialized and general communication, cheap global synchronization
• Example machines:
  – Thinking Machines CM-1, CM-2 (and CM-5)
  – Maspar MP-1 and MP-2
[Figure: A control processor attached to a two-dimensional array of processing elements (PEs).]
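The programming model can be sketched as a single thread of control that issues one operation at a time, each applied to every element of a data structure. This is a rough software analogy, not a model of any of the machines listed: a thread pool stands in for the array of processing elements, and the name `data_parallel_step` is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def data_parallel_step(op, *arrays):
    """Apply `op` to corresponding elements of `arrays`, conceptually one PE per element."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(op, *arrays))


# The main program plays the control processor: it issues one step at a time.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
c = data_parallel_step(lambda x, y: x + y, a, b)   # parallel step: c[i] = a[i] + b[i]
d = data_parallel_step(lambda x: x * x, c)         # parallel step: d[i] = c[i] ** 2
total = sum(d)                                     # global action on the data (reduction)
print(c, d, total)
```

Leaving the `with` block waits for all workers, which plays the role of the global synchronization between steps.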
![Page 39: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/39.jpg)
Dataflow Architectures
• Represent computation as a graph of essential dependences
  – Logical processor at each node, activated by availability of operands
  – Message (tokens) carrying tag of next instruction sent to next processor
  – Tag compared with others in matching store; match fires execution
[Figure: Dataflow graph for
  a = (b + 1) × (b − c)
  d = c × e
  f = a × d
and the organization of a dataflow processor: Network, Token queue, Token store (Waiting Matching), Instruction fetch with Program store, Execute, Form token, Network.]
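The firing rule, in which a node executes as soon as all of its operands are available, can be sketched for the graph a = (b + 1) × (b − c), d = c × e, f = a × d. This is a simplified software sketch: the intermediate node names `t1` and `t2`, the constant input `one` and the function `run_dataflow` are illustrative assumptions, and a dictionary replaces the tagged token matching store.

```python
import operator

# Each node: (operation, names of its two input operands).
GRAPH = {
    "t1": (operator.add, ("b", "one")),   # b + 1
    "t2": (operator.sub, ("b", "c")),     # b - c
    "a": (operator.mul, ("t1", "t2")),    # a = (b + 1) * (b - c)
    "d": (operator.mul, ("c", "e")),      # d = c * e
    "f": (operator.mul, ("a", "d")),      # f = a * d
}


def run_dataflow(graph, inputs):
    """Fire nodes whenever all their operands are available; return values and firing order."""
    tokens = dict(inputs)       # operand values produced so far
    pending = dict(graph)       # nodes that have not fired yet
    order = []
    while pending:
        ready = [n for n, (_, args) in pending.items()
                 if all(x in tokens for x in args)]
        if not ready:
            raise ValueError("no node can fire: missing operands")
        for n in ready:         # nodes in one wave are independent and could fire in parallel
            op, args = pending.pop(n)
            tokens[n] = op(*(tokens[x] for x in args))
            order.append(n)
    return tokens, order


values, order = run_dataflow(GRAPH, {"b": 4, "c": 2, "e": 3, "one": 1})
print(values["a"], values["d"], values["f"], order)
```

With the sample inputs b = 4, c = 2, e = 3, the nodes `t1`, `t2` and `d` fire in the first wave, `a` in the second and `f` in the third; no program counter determines the order, only operand availability.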
![Page 40: Parallel Computer Architecture - Rochester Institute of …meseec.ce.rit.edu/eecc756-spring2002/756-3-12-2002.pdf · 2002-03-12 · General Technology Trends • Microprocessor performance](https://reader035.vdocuments.mx/reader035/viewer/2022070611/5b305e467f8b9adc6e8e11ec/html5/thumbnails/40.jpg)
Systolic Architectures
[Figure: A memory (M) feeding a single PE, compared with a memory (M) feeding a chain of PEs.]
• Replace single processor with an array of regular processing elements
• Orchestrate data flow for high throughput with less memory access
• Different from pipelining
  – Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
• Different from SIMD: each PE may do something different
• Initial motivation: VLSI enables inexpensive special-purpose chips
• Represent algorithms directly by chips connected in regular pattern