computer architecture guidance keio university amano, hideharu hunga@am . ics . keio . ac ....
TRANSCRIPT
Computer Architecture Guidance
Keio University
AMANO, Hideharu
hunga@am . ics . keio . ac . jp
Contents
Techniques of two key architectures for future system LSIs Parallel Architectures Reconfigurable Architectures
Advanced uni-processor architecture→ Special Course of Microprocessors (by Prof. Y
amasaki, Fall term)
Class
Lecture using Powerpoint: (70mins. ) The ppt file is uploaded on the web site
http://www.am.ics.keio.ac.jp, and you can down load/print before the lecture.
When the file is uploaded, the message is sent to you by E-mail.
Textbook: “Parallel Computers” by H.Amano (Sho-ko-do)
Exercise (20mins.) Simple design or calculation on design issues
Evaluation
Exercise on Parallel Programming using SCore on RHiNET-2 (50%)
Exercise after every lecture (50%)
Computer Architecture 1Introduction to Parallel Architectures
Keio University
AMANO, Hideharu
hunga@am . ics . keio . ac . jp
Parallel Architecture
Purposes Classifications Terms Trends
A parallel architecture consists of multiple processing units which work simultaneously.
Boundary between Parallel machines and Uniprocessors
ILP(Instruction Level Parallelism) A single Program Counter Parallelism Inside/Between instructions
TLP(Tread Level Parallelism) Multiple Program Counters Parallelism between processes and jobs
Uniprocessors
Parallel MachinesDefinition
Hennessy & Petterson’s Computer Architecture: A quantitative approach
Increasing of simultaneous issued instructions vs. Tightly couplingSingle pipeline
Multiple instructions issue
Multiple Threads execution
Connecting Multiple Processors
Shared memory, Shared register
On chip implementation
Tightly Coupling
Performance improvement
Purposes of providing multiple processors Performance
A job can be executed quickly with multiple processors Fault tolerance
If a processing unit is damaged, total system can be available : Redundant systems
Resource sharing Multiple jobs share memory and/or I/O modules for cost
effective processing : Distributed systems Low power
High performance with Low frequency operation
Parallel Architecture: Performance Centric!
Flynn’s Classification
The number of Instruction Stream : M(Multiple)/S(Single)
The number of Data Stream : M/S SISD
Uniprocessors ( including Super scalar 、 VLIW ) MISD : Not existing ( Analog Computer ) SIMD MIMD
SIMD (Single Instruction StreamsMultiple Data Streams
Instruction
InstructionMemory
Processing Unit
Data memory
•All Processing Units executes the same instruction•Low degree of flexibility•Illiac-IV/MMX ( coarse grain )•CM-2 type ( fine grain )
Two types of SIMD
Coarse grain : Each node performs floating point numerical operations ILLIAC-IV , BSP,GF-11 Multimedia instructions in recent high-end CPUs Dedicated on-chip approach: NEC’s IMEP
Fine grain : Each node only performs a few bits operations
ICL DAP, CM-2 , MP-2 Image/Signal Processing Connection Machines extends the application to Artificial I
ntelligence (CmLisp)
A processing unit of CM-2
A B F
C
256bit memory
s c
OP
Flags
Context
1bit serial ALU
Element of CM2
P P P PP P P PP P P PP P P P
Router
4x4 Processor Array
12links 4096 Hypercubeconnection
256bit x 16 PE RAM
Instruction
LSI chip
4096 chips =64K PE
Thinking Machines’ CM2 (1996)
The future of SIMD
Coarse grain SIMD A large scale supercomputer like Illiac-IV/GF-11 will not
revive. Multi-media instructions will be used in the future. Special purpose on-chip system will become popular.
Fine grain SIMD Advantageous to specific applications like image
processing General purpose machines are difficult to be built
ex.CM2 → CM5
MIMD
Processors
Memory modules (Instructions ・ Data )
•Each processor executes individual instructions•Synchronization is required•High degree of flexibility•Various structures are possible
Interconnectionnetworks
Classification of MIMD machines Structure of shared memory UMA(Uniform Memory Access Model )
provides shared memory which can be accessed from all processors with the same manner.
NUMA(Non-Uniform Memory Access Model )provides shared memory but not uniformly accessed.
NORA/NORMA ( No Remote Memory Access Model )provides no shared memory. Communication is done with
message passing.
UMA The simplest structure of shared memory machin
e The extension of uniprocessors OS which is an extension for single processor ca
n be used. Programming is easy. System size is limited.
Bus connected Switch connected
A total system can be implemented on a single chip
On-chip multiprocessorChip multiprocessorSingle chip multiprocessor
An example of UMA : Bus connected
PU PU
SnoopCache
PU
SnoopCache
PU
SnoopCache
Main Memory
shared bus
SnoopCache
SMP(Symmetric MultiProcessor)
On chip multiprocessor
Switch connected UMA
Switch
Local Memory
CPU
Interface
Main Memory
. . . .
… .
The gap between switch and bus becomes small
NUMA Each processor provides a local memory,
and accesses other processors’ memory through the network.
Address translation and cache control often make the hardware structure complicated.
Scalable : Programs for UMA can run without modification. The performance is improved as the system size.
Competitive to WS/PC clusters with Software DSM
Typical structure of NUMA
Node 1
Node 2
Node 3
Node 0 0
1
2
3
InterconnectonNetwork
Logical address space
Classification of NUMA Simple NUMA :
Remote memory is not cached. Simple structure but access cost of remote memory is large.
CC-NUMA : Cache Coherent Cache consistency is maintained with hardware. The structure tends to be complicated.
COMA:Cache Only Memory Architecture No home memory Complicated control mechanism
Cray’s T3D: A simple NUMA supercomputer (1993)
Using
Alpha 21064
The Earth simulator(2002)
SGI Origin
Hub Chip
Main Memory
Network
Bristled Hypercube
Main Memory is connected directly with Hub Chip1 cluster consists of 2 PE.
SGI’s CC-NUMA Origin3000(2000)
Using R12000
DDM(Data Diffusion Machine )
... ... ... ...
D
NORA/NORMA
No shared memory Communication is done with message
passing Simple structure but high peak performance
The fastest processor is always NORA(except The Earth Simulator)
Hard for programming
Inter-PU communications Cluster computing
Early Hypercube machine nCUBE2
Fujitsu’s NORA AP1000(1990)
Mesh connection SPARC
Intel’s Paragon XP/S(1991)
Mesh connection i860
PC Cluster
Beowulf Cluster (NASA’s Beowulf Projects 1994, by Sterling) Commodity components TCP/IP Free software
Others Commodity components High performance networks like Myrinet / Infiniban
d Dedicated software
RHiNET-2 cluster
Terms(1)
Multiprocessors : MIMD machines with shared memory ( Strict definition :by Enslow J
r.) Shared memory Shared I/O Distributed OS Homogeneous
Extended definition: All parallel machines ( Wrong usage )
Terms ( 2) Multicomputer
MIMD machines without shared memory, that is NORA/NORMA
Arraycomputer A machine consisting of array of processing elements :
SIMD A supercomputer for array calculation (Wrong usage)
Loosely coupled ・ Tightly coupled Loosely coupled: NORA, Tightly coupled: UMA But, are NORAs really loosely coupled??
Don’t use if possible
Stored programmingbased
Others
SIMDFine grain Coarse grain
MIMD
Bus connected UMASwitch connected UMA
NUMASimple NUMACC-NUMACOMA
NORA
Systolic architectureData flow architectureMixed controlDemand driven architecture
Multiprocessors
Multicomputers
Classification
Systolic Architecture
Data x
Data y
Computational Array
Data streams are inserted into an array of special purpose computing node in a certain rhythm.
Introduced in Reconfigurable Architectures
Data flow machines
x
+
x
+
a bc
d e
(a+b)x(c+(dxe))
A process is driven by the data
Also introduced in Reconfigurable Systems
Exercise 1
In this class, a PC cluster RHiNET is used for exercising parallel programming. Is RHiNET classified into Beowulf cluster ? Why do you think so?
If you take this class, send the answer with your name and student number to [email protected]