computer architecture guidance keio university amano, hideharu hunga@am ． ics ． keio ． ac ．...

Computer Architecture Guidance

Keio University

AMANO, Hideharu

hunga@am.ics.keio.ac.jp

Contents

Techniques of two key architectures for future system LSIs:
• Parallel Architectures
• Reconfigurable Architectures

Advanced uniprocessor architecture → Special Course of Microprocessors (by Prof. Yamasaki, Fall term)

Class

• Lecture using PowerPoint (70 min.)
  • The ppt file is uploaded to the web site http://www.am.ics.keio.ac.jp, and you can download and print it before the lecture.
  • When the file is uploaded, a notification is sent to you by e-mail.
  • Textbook: “Parallel Computers” by H. Amano (Sho-ko-do)
• Exercise (20 min.): simple design or calculation on design issues

Evaluation

Exercise on Parallel Programming using SCore on RHiNET-2 (50%)

Exercise after every lecture (50%)

Computer Architecture 1: Introduction to Parallel Architectures

Parallel Architecture

• Purposes
• Classifications
• Terms
• Trends

A parallel architecture consists of multiple processing units which work simultaneously.

Boundary between Parallel machines and Uniprocessors

• Uniprocessors: ILP (Instruction-Level Parallelism)
  • A single program counter
  • Parallelism inside/between instructions
• Parallel machines: TLP (Thread-Level Parallelism)
  • Multiple program counters
  • Parallelism between processes and jobs

Definition: Hennessy & Patterson’s “Computer Architecture: A Quantitative Approach”

Increasing the number of simultaneously issued instructions vs. tight coupling

[Figure: two directions of performance improvement]

• Increasing issue width: single pipeline → multiple instruction issue → multiple thread execution
• Tight coupling: connecting multiple processors → shared memory, shared register → on-chip implementation

Purposes of providing multiple processors

• Performance: a job can be executed quickly with multiple processors.
• Fault tolerance: if a processing unit is damaged, the total system remains available (redundant systems).
• Resource sharing: multiple jobs share memory and/or I/O modules for cost-effective processing (distributed systems).
• Low power: high performance with low-frequency operation.
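The low-power point can be illustrated with a rough calculation. The sketch below assumes the standard CMOS dynamic power relation P = a·C·V²·f (not stated on the slide), ideal parallel speedup, and that a lower clock frequency allows a lower supply voltage. All numbers are hypothetical.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """Standard CMOS dynamic power estimate: P = activity * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency


# Hypothetical values, chosen only for illustration.
C = 1e-9  # switched capacitance per core [F]

# One core at 2 GHz and 1.2 V.
single = dynamic_power(C, voltage=1.2, frequency=2e9)

# Two cores at 1 GHz and 0.9 V: same aggregate clock rate,
# assuming the job parallelizes perfectly across both cores.
dual = 2 * dynamic_power(C, voltage=0.9, frequency=1e9)
```

Under these assumptions the two slower cores deliver the same nominal throughput at lower power, because the voltage term is squared.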

Parallel Architecture: Performance Centric!

Flynn’s Classification

The number of instruction streams: M (Multiple) / S (Single)

The number of data streams: M / S

• SISD: uniprocessors (including superscalar and VLIW)
• MISD: does not exist (analog computer)
• SIMD
• MIMD
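As a small illustration of how the four classes follow from the two stream counts, here is a sketch. The function name `flynn_class` is hypothetical and used only for this example.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return Flynn's class name from the number of instruction and data streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"
```

For example, a machine with one instruction stream and 64 data streams is classified as SIMD.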

SIMD (Single Instruction Stream, Multiple Data Streams)

[Figure: an instruction memory sends each instruction to all processing units; each processing unit has its own data memory]

• All processing units execute the same instruction
• Low degree of flexibility
• Illiac-IV / MMX (coarse grain)
• CM-2 type (fine grain)
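A minimal sketch of the SIMD execution model, assuming each list element stands for the local data of one processing unit. The function name `simd_step` is hypothetical.

```python
import operator


def simd_step(op, local_a, local_b):
    """Apply one broadcast instruction `op` in every processing unit.

    Each position in the lists models the local data memory of one unit;
    all units execute the same operation in the same step.
    """
    return [op(a, b) for a, b in zip(local_a, local_b)]


# One "add" instruction, four processing units, four different data items.
result = simd_step(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])
```

The low flexibility shows up directly: there is no way for one unit to execute a different operation in the same step.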

Two types of SIMD

• Coarse grain: each node performs floating-point numerical operations
  • ILLIAC-IV, BSP, GF-11
  • Multimedia instructions in recent high-end CPUs
  • Dedicated on-chip approach: NEC’s IMEP
• Fine grain: each node performs operations on only a few bits
  • ICL DAP, CM-2, MP-2
  • Image/signal processing
  • Connection Machines extend the application to artificial intelligence (CmLisp)

A processing unit of CM-2

[Figure: element of the CM-2, showing a 256-bit memory, flags, context, and a 1-bit serial ALU (labels: A, B, F, C, OP, s, c)]
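A 1-bit serial ALU handles a multi-bit word one bit per cycle. The sketch below shows bit-serial addition as one example. The choice of addition, the LSB-first bit order, and all function names are assumptions for illustration, not details taken from the slide.

```python
def bit_serial_add(a_bits, b_bits):
    """Add two equal-length LSB-first bit lists, one bit position per cycle."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)            # sum bit of a full adder
        carry = (a & b) | (carry & (a ^ b))  # carry kept for the next cycle
    return out, carry


def to_bits(value, width):
    """Convert a non-negative integer to an LSB-first list of `width` bits."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Convert an LSB-first bit list back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))
```

Adding two n-bit words takes n cycles in this model, which is why fine grain SIMD machines rely on a very large number of processing elements.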

[Figure: CM-2 LSI chip. A 4x4 processor array (16 PEs) with a 256-bit x 16 PE RAM, an instruction input, and a router with 12 links forming a 4096-node hypercube connection. 4096 chips = 64K PEs.]

Thinking Machines’ CM-2 (1987)
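The figures on the chip slide fit together arithmetically: 12 links per router give a 12-dimensional hypercube of 2^12 = 4096 chips, and 16 PEs per chip give 64K PEs. A small sketch, using the usual binary node numbering for hypercubes (an assumption, as the slide does not show the numbering):

```python
DIMENSIONS = 12            # links per router chip
CHIPS = 2 ** DIMENSIONS    # nodes in a 12-dimensional hypercube
PES_PER_CHIP = 16          # 4x4 processor array per chip

total_pes = CHIPS * PES_PER_CHIP


def hypercube_neighbors(node, dimensions=DIMENSIONS):
    """Neighbors of a node differ from it in exactly one address bit."""
    return [node ^ (1 << d) for d in range(dimensions)]


def hop_count(src, dst):
    """Minimum number of hops is the number of differing address bits."""
    return bin(src ^ dst).count("1")
```

In this numbering the two most distant chips are only 12 hops apart, even though there are 4096 of them.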

The future of SIMD

• Coarse grain SIMD
  • A large-scale supercomputer like Illiac-IV/GF-11 will not be revived.
  • Multimedia instructions will be used in the future.
  • Special-purpose on-chip systems will become popular.
• Fine grain SIMD
  • Advantageous for specific applications like image processing
  • General-purpose machines are difficult to build (ex. CM2 → CM5)

MIMD

[Figure: processors connected to memory modules (instructions and data) through interconnection networks]

• Each processor executes its own instructions
• Synchronization is required
• High degree of flexibility
• Various structures are possible
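A minimal sketch of why synchronization is required when independent instruction streams update shared data. Python threads stand in for the processors here, which is an assumption made only for illustration.

```python
import threading


def parallel_count(num_threads=4, increments=10_000):
    """Each thread runs its own instruction stream and updates a shared counter."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # The lock makes the read-modify-write of the shared counter atomic.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Without the lock, two threads could read the same old value and one update could be lost.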

Classification of MIMD machines

Structure of shared memory:

• UMA (Uniform Memory Access model): provides shared memory that all processors access in the same manner.
• NUMA (Non-Uniform Memory Access model): provides shared memory, but it is not uniformly accessed.
• NORA/NORMA (No Remote Memory Access model): provides no shared memory. Communication is done with message passing.

UMA

• The simplest structure of a shared memory machine
• The extension of uniprocessors
• An OS that is an extension of a single-processor OS can be used.
• Programming is easy.
• System size is limited.
  • Bus connected
  • Switch connected
• A total system can be implemented on a single chip: on-chip multiprocessor, chip multiprocessor, single-chip multiprocessor

An example of UMA: Bus connected

[Figure: four PUs, each with a snoop cache, connected to the main memory through a shared bus]

SMP (Symmetric Multiprocessor)

On-chip multiprocessor

Switch connected UMA

[Figure: CPUs, each with a local memory and an interface, connected to main memory modules through a switch]

The gap between switch and bus becomes small

NUMA

• Each processor provides a local memory and accesses other processors’ memory through the network.
• Address translation and cache control often make the hardware structure complicated.
• Scalable:
  • Programs for UMA can run without modification.
  • The performance improves with the system size.
• Competitive with WS/PC clusters with software DSM

Typical structure of NUMA

[Figure: Nodes 0 to 3 connected by an interconnection network; the logical address space is divided into regions 0 to 3, one per node]
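One way to read the figure is that each node owns one contiguous region of the logical address space. The sketch below assumes that layout. The memory size, the cost values, and the function names are hypothetical.

```python
NUM_NODES = 4
NODE_MEMORY_SIZE = 1 << 20  # hypothetical: 1 MiB of local memory per node


def home_node(address):
    """Map a logical address to (home node, offset in that node's local memory)."""
    node, offset = divmod(address, NODE_MEMORY_SIZE)
    if not 0 <= node < NUM_NODES:
        raise ValueError("address outside the logical address space")
    return node, offset


def access_cost(requesting_node, address, local_cost=1, remote_cost=10):
    """Non-uniform access: local memory is cheap, remote memory goes through the network."""
    node, _ = home_node(address)
    return local_cost if node == requesting_node else remote_cost
```

The same load instruction therefore has a different cost depending on where the data lives, which is what "non-uniform" refers to.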

Classification of NUMA

• Simple NUMA: remote memory is not cached. Simple structure, but the access cost of remote memory is large.
• CC-NUMA (Cache-Coherent NUMA): cache consistency is maintained by hardware. The structure tends to be complicated.
• COMA (Cache Only Memory Architecture): no home memory. Complicated control mechanism.

Cray’s T3D: A simple NUMA supercomputer (1993)

Using Alpha 21064

The Earth Simulator (2002)

SGI Origin

[Figure: clusters, each with a hub chip and main memory, connected by a network in a bristled hypercube]

• Main memory is connected directly to the hub chip.
• One cluster consists of 2 PEs.

SGI’s CC-NUMA Origin 3000 (2000)

Using R12000

DDM (Data Diffusion Machine)

NORA/NORMA

• No shared memory
• Communication is done with message passing
• Simple structure but high peak performance
• The fastest machine is always a NORA (except the Earth Simulator)
• Hard to program
• Inter-PU communications
• Cluster computing
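A minimal sketch of the message-passing style, in which each node works only on its own local data and exchanges results explicitly. Threads and `queue.Queue` stand in for separate machines and the network here. This is an assumption for illustration and is not the SCore interface used in the exercise.

```python
import queue
import threading


def run_two_nodes(values):
    """Sum a list on two 'nodes' that share no data and communicate by messages."""
    channel = queue.Queue()  # models the link from node 1 to node 0
    result = queue.Queue()   # models node 0 reporting the final answer
    half = len(values) // 2

    def node0(local):
        partial = sum(local)
        remote = channel.get()        # receive: blocks until the message arrives
        result.put(partial + remote)

    def node1(local):
        channel.put(sum(local))       # send the partial sum to node 0

    threads = [
        threading.Thread(target=node0, args=(values[:half],)),
        threading.Thread(target=node1, args=(values[half:],)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result.get()
```

The programmer has to decide how the data is split and when each message is sent and received, which is the main reason this style is harder to program than shared memory.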

Early Hypercube machine nCUBE2

Fujitsu’s NORA AP1000 (1990)

• Mesh connection
• SPARC

Intel’s Paragon XP/S (1991)

• Mesh connection
• i860

PC Cluster

• Beowulf cluster (NASA’s Beowulf Project, 1994, by Sterling)
  • Commodity components
  • TCP/IP
  • Free software
• Others
  • Commodity components
  • High-performance networks like Myrinet / InfiniBand
  • Dedicated software

RHiNET-2 cluster

Terms (1)

• Multiprocessors: MIMD machines with shared memory
  • Strict definition (by Enslow Jr.): shared memory, shared I/O, distributed OS, homogeneous
  • Extended definition: all parallel machines (wrong usage)

Terms (2)

• Multicomputer: MIMD machines without shared memory, that is, NORA/NORMA
• Array computer
  • A machine consisting of an array of processing elements: SIMD
  • A supercomputer for array calculation (wrong usage)
• Loosely coupled / tightly coupled
  • Loosely coupled: NORA; tightly coupled: UMA
  • But are NORAs really loosely coupled?
  • Don’t use these terms if possible

Classification

• Stored-programming based
  • SIMD: fine grain, coarse grain
  • MIMD
    • Multiprocessors
      • Bus connected UMA, switch connected UMA
      • NUMA: simple NUMA, CC-NUMA, COMA
    • Multicomputers
      • NORA
• Others
  • Systolic architecture
  • Data flow architecture
  • Mixed control
  • Demand driven architecture

Systolic Architecture

[Figure: data streams x and y flow into a computational array]

Data streams are fed into an array of special-purpose computing nodes in a fixed rhythm.

Introduced in Reconfigurable Architectures
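A cycle-by-cycle sketch of one possible systolic design, a linear array computing a matrix-vector product y = A·x. This particular design is an assumption chosen for illustration and is not taken from the slide. Cell j holds x[j], partial sums move one cell to the right per cycle, and row i of A reaches cell j at cycle i + j.

```python
def systolic_matvec(A, x):
    """Simulate a linear systolic array computing y = A x for non-empty A and x."""
    rows, cols = len(A), len(x)
    latch = [None] * cols  # output register of each cell
    y = []
    for t in range(rows + cols - 1):
        # Update right to left so each cell reads its left neighbor's
        # value from the previous cycle.
        for j in reversed(range(cols)):
            i = t - j  # row whose partial sum is in cell j at cycle t
            if 0 <= i < rows:
                incoming = latch[j - 1] if j > 0 else 0
                latch[j] = incoming + A[i][j] * x[j]
            else:
                latch[j] = None
        if latch[-1] is not None:
            y.append(latch[-1])  # the last cell emits one finished result per cycle
    return y
```

After an initial fill of `cols - 1` cycles, one result leaves the array every cycle, which is the "rhythm" the slide refers to.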

Data flow machines

[Figure: data flow graph with inputs a, b, c, d, e computing (a + b) × (c + (d × e))]

A process is driven by the data

Also introduced in Reconfigurable Systems
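A minimal sketch of data-driven execution for the expression (a + b) × (c + (d × e)). A node fires as soon as both of its input tokens are available. The graph encoding and the node names n1 to n4 are hypothetical.

```python
import operator

# Each node: (operation, left input, right input).
GRAPH = {
    "n1": (operator.add, "a", "b"),
    "n2": (operator.mul, "d", "e"),
    "n3": (operator.add, "c", "n2"),
    "n4": (operator.mul, "n1", "n3"),
}


def run_dataflow(graph, inputs):
    """Fire every node whose operands have arrived; repeat until all nodes have fired."""
    tokens = dict(inputs)
    pending = dict(graph)
    fired = []  # nodes fired in each step, to show the available parallelism
    while pending:
        ready = [n for n, (_, l, r) in pending.items() if l in tokens and r in tokens]
        if not ready:
            raise ValueError("no node can fire: an input token is missing")
        for n in ready:
            op, l, r = pending.pop(n)
            tokens[n] = op(tokens[l], tokens[r])
        fired.append(sorted(ready))
    return tokens, fired


tokens, fired = run_dataflow(GRAPH, {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5})
```

There is no program counter in this model: the order of execution is determined only by the arrival of data, and n1 and n2 can fire in the same step.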

Exercise 1

In this class, a PC cluster, RHiNET, is used for exercises in parallel programming. Is RHiNET classified as a Beowulf cluster? Why do you think so?

If you take this class, send the answer with your name and student number to [email protected]
