Shared Memory MIMD Architectures (Sima, Fountain and Kacsuk, Chapter 18), CSE462

Post on 20-Dec-2015



David Abramson, 2004 Material from Sima, Fountain and Kacsuk, Addison Wesley 1997

Design choices

Types of shared memory
– Physically shared memory
– Virtual (or distributed) shared memory

Scalability issues
– Organisation of memory
– Design of interconnection network
– Cache coherence protocols

Design space of shared memory computers

Shared memory computers
– Single address space memory access
• Physical shared memory: UMA
• Virtual shared memory: NUMA, CC-NUMA, COMA
– Interconnection scheme
• Shared path: single bus based; multiple bus based (bus multiplication, grid of buses, hierarchical system)
• Switching network: crossbar; multistage network (Omega, Banyan, Benes)
– Cache coherency
• Hardware based
• Software based

Classification of dynamic interconnection networks

Enable temporary connection of any two components of a multiprocessor

Dynamic interconnection networks
– Shared path networks
• Single bus
• Multiple buses
– Switching networks
• Crossbar
• Multistage networks

Buses

Very limited scalability
– Typically 3-5 processors unless special techniques (TDM) are used
– Can be expanded significantly by using
• Private memory
• Coherent cache memory
• Multiple buses

Structure of a single bus multiprocessor (no caches)

[Figure: processors P1..Pk, memories M1..Mn and I/O units I/O1..I/OM attached to a single bus made up of address, data, control, interrupt and bus exchange lines, with a bus arbiter and control logic]

Locking or multiplexing the bus

Two main approaches

Locking and holding
– Acquire the bus
– Send out address and/or data
– Wait for data (read), or wait for write to complete
– Release the bus

Multiplexing
– Acquire bus time slot
– Send address and/or data
– Come back for data n cycles later (read), or keep going if write

Memory write on locked bus

[Figure: timing diagram of writes by processors P1 to P4, showing bus cycles and memory cycles; time axis ticks at 4, 8, 12 and 16]

Memory write on multiplexed buses

[Figure: timing diagram of writes by processors P1 to P4, showing bus cycles and memory cycles; time axis ticks at 4 and 7]

Note: this assumes different memory banks
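
The tick marks on the two write diagrams (16 on the locked bus, 7 on the multiplexed bus, four processors in both) can be reproduced with a small calculation. The sketch below is only an illustration: the split of one bus cycle plus three memory cycles per write is an assumption chosen because it matches those ticks, and the function names are hypothetical.

```c
#include <assert.h>

/* Cycles for n back-to-back writes on a locked bus: every write holds the
   bus for its bus phase and for its whole memory phase. */
unsigned locked_bus_cycles(unsigned n, unsigned bus_cycles, unsigned mem_cycles)
{
    return n * (bus_cycles + mem_cycles);
}

/* Cycles for n writes on a multiplexed bus, assuming every write goes to a
   different memory bank: the bus phases run back to back and only the last
   memory phase is not overlapped with other traffic. */
unsigned multiplexed_bus_cycles(unsigned n, unsigned bus_cycles, unsigned mem_cycles)
{
    assert(n > 0);
    return n * bus_cycles + mem_cycles;
}
```

With four writers, one bus cycle and three memory cycles, the first function gives 16 and the second gives 7.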

Memory read on locked bus

[Figure: timing diagram of reads by processors P1 to P3; time axis ticks at 5, 10, 15, 20, 25 and 30. Phase 1: address bus is used. Phase 2: bus is not used. Phase 3: data bus is used.]

Memory read on multiplexed bus

[Figure: timing diagram of reads by processors P1 to P3; time axis ticks at 5, 10 and 12. Phase 1: address bus is used. Phase 2: bus is not used. Phase 3: data bus is used.]

Memory read on split-transaction bus

[Figure: timing diagram of reads by processors P1 to P3; time axis ticks at 5 and 10. Phase 1: address bus is used. Phase 2: bus is not used. Phase 3: data bus is used.]

Next transfer started before last one completed!

Needs special associative hardware

Arbiter Logic

Because the bus is a shared resource, masters must arbitrate for access

Arbiter may be
– Centralised
• Central unit which looks at all requests
– Decentralised
• Logic is split amongst bus masters
• Scalable: each new master adds more logic

Design space of arbiter logic

Arbiter logics
– Organization
• Centralized
• Distributed
– Bus allocation policy
• Fixed priority
• Rotating
• Round robin
• Least recently used
• First come first served
– Handling of requests
• Fixed priority
• Rotating
– Handling of grants
• Fixed priority
• Rotating

Centralized arbitration with independent requests and grants

[Figure: central bus arbiter connected to Master 1, Master 2, ..., Master N by separate request (R1..RN) and grant (G1..GN) lines; all masters share the bus lines and a bus busy line]

Sequence shown across the slides
– Masters request bus
– One is granted
– Successful master claims bus
– Bus is released

Daisy-chained bus arbitration scheme

[Figure: central bus arbiter whose bus grant line is chained through Master 1, Master 2, ..., Master N (Grant 1, G2, ..., GN); the masters share the bus request line, the bus busy line and the bus lines]

Sequence shown across the slides
– Masters request bus
– Bus grant generated
– Bus grant not propagated
– Master claims bus
– Bus released
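
The grant propagation in the daisy chain can be sketched in a few lines of C. This is a simplified model rather than a hardware description: the array of request flags and the function name are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns the 0-based index of the master that keeps the grant, or -1 if
   nobody is requesting. requesting[i] is true when master i has raised the
   shared bus request line. */
int daisy_chain_grant(const bool requesting[], int n_masters)
{
    assert(n_masters >= 0);
    for (int i = 0; i < n_masters; i++) {
        if (requesting[i]) {
            return i;   /* grant is absorbed here and not propagated further */
        }
        /* master i is idle, so it passes the grant on to master i + 1 */
    }
    return -1;
}
```

The master closest to the arbiter always wins, which is the lack of fairness that rotating priority is meant to address.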

Decentralized rotating arbiter with independent requests and grants

Problems with previous design
– Lack of fairness
– Wait whilst grant signal propagates

Rotating priority solves lack of fairness
– Logical first not same as physical first

[Figure: Master 1, Master 2, ..., Master N on the bus lines, each with its own arbiter (Arbiter 1..N) and its own request and grant lines; the arbiters are linked by priority lines P1..PN and share a bus busy line. R: Request, G: Grant, P: Priority]
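
A minimal sketch of rotating priority, assuming the highest priority moves to the master just after the most recent winner (one common way to rotate; the slide does not fix the rule). The function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Rotating priority: the search starts at the master after the last winner,
   so the logically first master changes after every grant.
   Returns the winning master's index, or -1 if nobody is requesting. */
int rotating_grant(const bool requesting[], int n_masters, int last_granted)
{
    assert(n_masters > 0);
    assert(last_granted >= 0 && last_granted < n_masters);
    for (int offset = 1; offset <= n_masters; offset++) {
        int candidate = (last_granted + offset) % n_masters;
        if (requesting[candidate]) {
            return candidate;
        }
    }
    return -1;
}
```

A master that has just been served becomes the last one examined on the next round, so a busy neighbour cannot starve the others.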


Multiple buses

Increase bandwidth by adding additional resources

Bus is limiting factor

1-dimension multiple bus multiprocessor

– Each processor connected to all buses
– Each memory connected to all buses
– Processor chooses bus dynamically
– Load can be spread across buses

[Figure: processors P1, P2, ..., Pn and memories M1, M2, ..., Mm, each attached to every bus B1, B2, ..., Bb]

2 and 3 dimensional bus system

[Figure: processor (P) and memory (M) nodes arranged on 2- and 3-dimensional grids of buses]

2 Dimensional bus design

Can support specialised access patterns, e.g. climate model
– Access to local data
– Access to data in same latitude
– Access to data in same longitude

Cluster bus architecture

– Hierarchy of buses
– Arbitrarily large networks
– Cache coherence becomes very difficult

[Figure: Cluster 1 to Cluster 8, each drawn as a Multimax with uniform interconnection cards and a uniform cluster cache on a cluster bus (Nanobus); the clusters are joined by a global bus (Nanobus)]

Switching Networks

Multistage networks
– No. of stages
– No. of switches at a stage
– Topology of links among stages
– Switch type
• No. of input and output links
• Operation mode: normal switch, queuing switch, combining switch
– Operation mode
• Blocking
• Non-blocking

View of a crossbar network

Crossbar allows any processor to connect with any memory

As long as there is no contention for the memory, the network is non-blocking

[Figure, shown in three steps: processors P1, P2, ..., Pn as rows and memories M1, M2, ..., Mn as columns, with a switch (S) at every crossing]

Detailed structure of a crossbar network

[Figure: each processor connects through a BBCU, an arbiter and a switch to the control, address and data bus of memory Mi]

Multi-stage interconnection networks

Cannot directly connect processor to memory

Use cross-bar switches as components to build larger network

Minimum number of stages is logarithmic
– Single path
– No fault tolerance
– Blocking (if intermediate switch in use)

Omega network topology

2 x 2 cross bar switch components
– Butterfly built from 8x8

Unique path from one port to another

Log depth

[Figure: 8x8 omega network with inputs and outputs labelled 000 to 111; switch settings shown: upper broadcast, lower broadcast, straight through]

Omega network topology

Some configurations are non-blocking
– e.g. reversal: 0->7, 1->6, 2->5, 3->4, 4->3, 5->2, 6->1, 7->0

[Figure: 8x8 omega network with inputs and outputs labelled 000 to 111, set up for the reversal permutation]

Broadcast in the omega network

[Figure: 8x8 omega network with inputs and outputs labelled 000 to 111, showing a broadcast]

Blocking in an omega network

[Figure: 8x8 omega network with inputs and outputs labelled 000 to 111, showing two paths that collide]

(0->5, . . ., 6->4, . . .)
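
Why 0->5 and 6->4 block each other can be checked with destination-tag routing. The sketch below assumes the usual omega wiring: a perfect shuffle (rotate the 3-bit address left by one) in front of every stage, followed by a 2x2 switch that sets the low bit to the next destination bit, most significant bit first. The figure on the slide may be drawn differently, and all names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define OMEGA_BITS  3                      /* log2(8) stages */
#define OMEGA_PORTS (1u << OMEGA_BITS)     /* 8 inputs and 8 outputs */

/* Output link occupied by a message after the given stage (0-based). */
unsigned omega_link(unsigned src, unsigned dst, unsigned stage)
{
    assert(src < OMEGA_PORTS && dst < OMEGA_PORTS && stage < OMEGA_BITS);
    unsigned pos = src;
    for (unsigned s = 0; s <= stage; s++) {
        /* perfect shuffle: rotate the address left by one bit */
        pos = ((pos << 1) | (pos >> (OMEGA_BITS - 1))) & (OMEGA_PORTS - 1);
        /* 2x2 switch: the low bit becomes the s-th destination bit */
        unsigned bit = (dst >> (OMEGA_BITS - 1 - s)) & 1u;
        pos = (pos & ~1u) | bit;
    }
    return pos;
}

/* Two messages block each other if they need the same link at some stage. */
bool omega_conflict(unsigned src_a, unsigned dst_a,
                    unsigned src_b, unsigned dst_b)
{
    for (unsigned stage = 0; stage < OMEGA_BITS; stage++) {
        if (omega_link(src_a, dst_a, stage) == omega_link(src_b, dst_b, stage)) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions both 0->5 and 6->4 need output link 010 of the second stage, so one of them has to wait, whereas no two paths of the reversal permutation share a link.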

Multistage Network Properties

Network type      | No. of stages    | Switches/stage | Topology                      | Switch size | Op mode
Omega             | log2 N           | N/2            | 2-way shuffle                 | 2x2         | Blocking
Butterfly         | log8 N           | N/8            | 8-way shuffle                 | 8x8         | Blocking
Generalized-cube  | S = log2 N       | N/2            | [0,1] shuffle, [1,S] exchange | 2x2         | Blocking
Benes             | S = 2 log2 N - 1 | N/2            | [2,S] exchange                | 2x2         | Non-blocking
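
The stage and switch counts in the table can be turned into numbers for a given N. The helper names below are hypothetical, N is assumed to be an exact power of the switch size, and the crossbar count of N x N switches is taken from the crossbar slides (one switch per processor and memory pair).

```c
#include <assert.h>

/* Integer logarithm of n in the given base; n must be an exact power. */
unsigned ilog(unsigned n, unsigned base)
{
    assert(base >= 2 && n >= 1);
    unsigned k = 0;
    while (n > 1) {
        assert(n % base == 0);
        n /= base;
        k++;
    }
    return k;
}

unsigned omega_stages(unsigned n)      { return ilog(n, 2); }
unsigned omega_switches(unsigned n)    { return omega_stages(n) * (n / 2); }
unsigned butterfly_stages(unsigned n)  { return ilog(n, 8); }
unsigned butterfly_switches(unsigned n){ return butterfly_stages(n) * (n / 8); }

unsigned benes_stages(unsigned n)
{
    assert(n >= 2);
    return 2 * ilog(n, 2) - 1;
}

unsigned benes_switches(unsigned n)    { return benes_stages(n) * (n / 2); }
unsigned crossbar_switches(unsigned n) { return n * n; }
```

For N = 8 this gives 12 2x2 switches for the omega network, 20 for the Benes network and 64 crossing switches for the crossbar.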

Hot-spot saturation in a blocking omega network

[Figure: omega network connecting processors P0 to P7 with memories M0 to M7, with the paths below marked]

P2->M4 active => P7->M4 blocked => P1->M5 blocked => P5->M7 blocked

Hotspots in Omega networks

In a shared memory machine there are two sorts of contention
– Memory unit
– Switch elements

Certain access patterns can repeatedly block each other even though they address different memory units

Message combining can solve these problems
– Switch element buffers request
– Memory only sees one request

[Figure: three Read 100 requests]

Structure of a combining switch

Introduced on NYU Ultracomputer

[Figure: 2x2 switch between Proc(i), Proc(j) and Mem(k), Mem(I), built from two combining queues, two non-combining queues and two wait buffers]

Cache Coherence

Cache coherence problems
– Sharing of writable data
– Process migration
– I/O activity

[Figure, shown in several steps: three processors, each with a cache and a memory; the steps annotate the caches with Write 100,5, Read 100 and Write 100,100, and the last step adds an I/O device (IODev) with Read 100 and IORead 100]

Classification of data structures

Read only
– Never cause cache coherence problems

Shared writable
– Main source of cache coherence problems

Private writable data
– Causes problems with process migration

Solutions
– Hardware based protocols
– Software based protocols

Design space of hardware-based cache coherence protocols

Hardware-based cache coherence protocols
– Memory update policy
• Write-through
• Write-back
– Cache coherence policy
• Write-invalidate
• Write-update
– Interconnection scheme
• Single bus: snoopy cache protocols
• Multistage: directory schemes (full-map directories, limited directories, chained directories; centralized, distributed)
• Multiple bus: hierarchical cache coherence protocols

Write-through memory update policy

– Memory always updated on a write
– Intuitively easier to keep caches coherent

[Figure: processor Pi executes Store D1; its cache and the memory both hold D1, while the cache of Pj still holds D]

Write-back memory update policy

– Data only written back to memory when flushed
– Processor can do many writes before the block is flushed

[Figure: processor Pi executes Store D1; its cache holds D1, while the memory and the cache of Pj still hold D]

Write-update cache coherence policy

When a processor writes a variable, the write updates all copies held in other processors' caches

[Figure: processor Pi executes Store D1 and broadcasts Update (D1); the caches of Pi, Pj and Pk all hold D1]

Write-invalidate cache coherence policy

When a processor writes a variable, the write invalidates the copy in any other cache

Makes one processor the “owner”

[Figure: processor Pi executes Store D1 and broadcasts Invalidate (addr(D)); its cache holds D1 and the copies in the caches of Pj and Pk are marked as invalid data]

Snoopy Protocols

If interconnection network supports broadcasting (cheaply) then a snoopy policy is effective
– Every cache “watches” every transaction to memory
– Works for buses

If broadcast is not efficient
– Directory based scheme
– Keeps track of where cache blocks are located

Snoopy write update protocol

Possible cache block states
– Used to support cache coherence protocol
– Valid-exclusive
• Only copy of this cache block. Cache and memory are consistent
– Shared
• Several copies of this cache block
– Dirty
• Only copy but cache and memory are inconsistent

Read Miss logic

Snoopy cache controller broadcasts a Read-Blk command on the bus
– If there are shared copies
• Delivered by a cache with a copy
– If there is a dirty copy
• It is supplied and flushed to main memory
• All copies become shared
– If a valid-exclusive copy exists
• Copy supplied and all become shared
– If no cache copy
• Memory supplies data
• Becomes valid-exclusive

Snoopy Update - Read miss

[Figure: processors Pi and Pj with caches and memory; Load D issues Read-blk (addr(D)) on the bus; block states marked Shared, Dirty and Exclusive]

Write hit logic

If block is valid-exclusive or dirty
– Write is performed locally
– New state is dirty

If block is shared
– Broadcast update block on bus
– All copies (including memory) update
– Status remains shared

Snoopy Update – Write hit Exclusive

[Figure: processors Pi and Pj with caches and memory; Load D and Read-blk (addr(D)) followed by Write D on the block; block states marked Shared, Dirty and Exclusive]

Snoopy Update – Write hit Shared

[Figure: processors Pi and Pj with caches and memory; both issue Load D with Read-blk (addr(D)), then Write D broadcasts Write (addr(D)); block states marked Shared, Dirty and Exclusive]

Write miss

If only memory contains a copy
– Memory updated
– Requesting cache loaded with data: valid-exclusive

If shared copies are available
– All copies (including the memory one) updated
– Requesting cache loaded with data: shared

If a dirty or valid-exclusive copy exists
– Other blocks updated
– Memory updated
– Requesting cache loaded with data: shared

Snoopy Update – Write miss

[Figure: processors Pi and Pj with caches and memory; Write D issues Write (addr(D)) on the bus; block states marked Shared, Dirty and Exclusive]

State transition graph for snoopy update

Cache responds to
– P-READs, P-WRITEs from the Processor and
– READ-BLK, WRITE-BLK from the Bus

[Figure: state transition graph over the states Valid-exclusive, Shared and Dirty, with edges labelled P-Read, P-Write, P-Read/P-Write, Read-Blk/Write-Blk and Read-Blk/Write-Blk/Update-Blk]
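
The read miss, write hit and write miss rules of the snoopy update protocol can be collected into one transition function. This is a sketch under assumptions: the INVALID state (used here to stand for "block not in this cache") and the others_have_copy flag are additions for the illustration, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { INVALID, VALID_EXCLUSIVE, SHARED, DIRTY } BlockState;
typedef enum { P_READ, P_WRITE, READ_BLK, WRITE_BLK, UPDATE_BLK } Event;

/* Next state of one cache's copy of a block.
   P_READ and P_WRITE come from the local processor; the *_BLK events are
   snooped from the bus. others_have_copy matters only on a miss. */
BlockState next_state(BlockState state, Event event, bool others_have_copy)
{
    assert(state <= DIRTY);
    switch (event) {
    case P_READ:
        if (state == INVALID) {          /* read miss */
            return others_have_copy ? SHARED : VALID_EXCLUSIVE;
        }
        return state;                    /* read hit: no change */
    case P_WRITE:
        if (state == INVALID) {          /* write miss */
            return others_have_copy ? SHARED : VALID_EXCLUSIVE;
        }
        if (state == SHARED) {
            return SHARED;               /* update is broadcast, stays shared */
        }
        return DIRTY;                    /* valid-exclusive or dirty: local write */
    case READ_BLK:
    case WRITE_BLK:
        /* another cache is loading the block, so our copy becomes shared */
        return state == INVALID ? INVALID : SHARED;
    case UPDATE_BLK:
        return state;                    /* a shared copy is refreshed in place */
    }
    return state;
}
```

For example, a block loaded on a read miss with no other copies starts as valid-exclusive, becomes dirty on the first local write, and drops to shared as soon as another cache issues Read-Blk for it.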

Structure of the snoopy cache controller

Snoopy controller needs to operate at bus speed

[Figure: processing elements PEi..PEn on the address (A) and data (D) bus with the memory; each PE contains a processor, a cache and a snoopy cache controller made up of a processor interface, a cache controller, a cache directory, a snoopy controller and a bus interface]

Directory Schemes

Directory schemes only send consistency commands to those caches where a valid copy of the shared block is held

Designed for systems where snooping is not possible

Three main approaches
– Full map directory
• Each entry points to all caches
• Entry indicates whether block is present in remote caches
• Not efficient for large systems
– Limited directory
• Only points to a subset of the caches (works because a variable tends not to be shared with all processors)
• Same information as in full map
– Chained directory
• Directory entries form a linked list
• Scalable: can add processors without increasing directory width
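
A full map directory entry can be pictured as one presence bit per cache. The sketch below assumes at most 64 caches and a write-invalidate policy; the struct, field and function names are hypothetical and not taken from any particular machine.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CACHES 64

/* One directory entry per memory block: bit i is set when cache i holds a
   valid copy of the block. */
typedef struct {
    uint64_t present;
} FullMapEntry;

/* Record that cache_id has loaded a copy of the block. */
void dir_record_read(FullMapEntry *entry, unsigned cache_id)
{
    assert(cache_id < MAX_CACHES);
    entry->present |= (uint64_t)1 << cache_id;
}

/* Record a write by `writer`. Returns the set of caches that must receive a
   consistency command (here: an invalidation); afterwards only the writer
   is recorded as holding the block. */
uint64_t dir_record_write(FullMapEntry *entry, unsigned writer)
{
    assert(writer < MAX_CACHES);
    uint64_t writer_bit = (uint64_t)1 << writer;
    uint64_t targets = entry->present & ~writer_bit;
    entry->present = writer_bit;
    return targets;
}
```

The entry width grows with the number of caches, which is why the full map is not efficient for large systems; a limited directory would store a few cache identifiers instead of the whole bit vector.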

Chained directory scheme

[Figure, shown in two steps: processing elements PE0, PE1, ..., PEn, each with a processor (P0..Pn) and a cache (C1..Cn), above the shared memory; the directory entry for block X in shared memory points into a chain of caches holding X, ended by a chain terminator (CT); a Read X request adds a cache to the chain]

Scalable Coherence Interface

Concrete example of chained directory

IEEE Standard

Defines
– Interface to interconnection network
– Not any particular interconnection network

Interface
– Point to point
– Well suited to networks like Convex Exemplar
• Simple, uni-directional ring

Designed for building scalable shared memory machines

Structure of sharing-lists in the SCI

Operations defined for
– Creation
– Insertion
– Deletion
– Reduction to single node

[Figure: Nodei, Nodej and Nodek, each with a processor (Pi, Pj, Pk) and a cache (Ci, Cj, Ck), linked into a list that starts at the memory. Memory entry fields: mstate, forw_id, data (64 bits). Cache entry fields: cstate, mem_id, forw_id, back_id, data (64 bits)]

Insertion in a sharing-list

[Figure: Nodei, Nodej, Nodek and the memory; messages labelled new-head responses and prepend]

Messages for deletion

[Figure: Nodei, Nodej, Nodek and the memory; two messages labelled Update forward, numbered 1 and 2]

Structure of the sharing-list after deletion

[Figure: the list now links only Nodei and Nodek with the memory]
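
The sharing list behaves like a doubly linked list whose head pointer lives in memory. The sketch below models only the pointer updates for creation, insertion at the head and deletion; it leaves out the SCI states and the message exchange. The forw_id and back_id fields follow the slide, while head_id (standing in for the memory's forw_id), NIL, MAX_NODES and the function names are assumptions.

```c
#include <assert.h>

#define NIL       (-1)
#define MAX_NODES 16

typedef struct {
    int forw_id;   /* next node towards the tail, or NIL */
    int back_id;   /* previous node towards the head, or NIL at the head */
} CacheEntry;

typedef struct {
    int head_id;                  /* memory's pointer to the head of the list */
    CacheEntry cache[MAX_NODES];
} SharingList;

/* Creation: an empty list with no sharers. */
void list_create(SharingList *list)
{
    list->head_id = NIL;
    for (int i = 0; i < MAX_NODES; i++) {
        list->cache[i].forw_id = NIL;
        list->cache[i].back_id = NIL;
    }
}

/* Insertion: the new sharer is prepended and becomes the new head. */
void list_prepend(SharingList *list, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    int old_head = list->head_id;
    list->cache[node].forw_id = old_head;
    list->cache[node].back_id = NIL;
    if (old_head != NIL) {
        list->cache[old_head].back_id = node;
    }
    list->head_id = node;
}

/* Deletion: the neighbours are re-linked around the leaving node.
   The node must currently be in the list. */
void list_delete(SharingList *list, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    int prev = list->cache[node].back_id;
    int next = list->cache[node].forw_id;
    assert(prev != NIL || list->head_id == node);
    if (prev != NIL) {
        list->cache[prev].forw_id = next;
    } else {
        list->head_id = next;
    }
    if (next != NIL) {
        list->cache[next].back_id = prev;
    }
    list->cache[node].forw_id = NIL;
    list->cache[node].back_id = NIL;
}
```

Deleting the middle node of a three-node list leaves the other two pointing at each other, as in the diagram after deletion.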

Hierarchical cache coherence

[Figure: main memory on the top-level bus B20 with second-level caches C20, C21 and C22; below them the buses B10, B11 and B12, each with processors P0 and P1 and their first-level caches C10 and C11; a write to block X on B10 is passed up to B20 and invalidate commands are sent to the other copies of X]

Software Based Coherence

Software approaches rely on compiler assistance

Identify different classes of variables
– Read-only
– Read-only for any number of processes and read-write for one process
– Read-write for one process
– Read-write for any number of processes

Once identified (by static analysis), handled differently

Software based cache coherence

Read only variables
– Can be cached any time

Read only for any number and read-write for one process
– Can only be cached on writing processor

Read-write for one process
– Cache only on that processor

Read-write for many processes
– Cannot be cached at all

Clearly need accurate information in order to limit performance hit

Classification of software-based cache coherence protocols

Software-based protocols
– Indiscriminate invalidation
– Selective invalidation
• Critical section based
• Parallel for-loop based: fast selective invalidation, version control scheme, timestamp scheme

Invalidation

Can invalidate the entire cache
– Single hardware mechanism for clearing valid bits
– Very conservative!

Selective invalidation
– Invalidate before critical sections
– Understand parallel for-loop and invalidate
– Still needs hardware support to clear effectively

[Figure: cache entries with Key and Data fields and a valid bit (v) on each entry]

Using knowledge of critical regions

    :
    Secure_lock()
    Invalidate_cache()
    :
    :        (variables in here can be used without worrying about any other processes)
    Flush_cache()
    Release_lock()
    :


Using knowledge of parallel loops

Original program
    :
    Par For (i = 0; i < 100; i++) {
        :
    }
    :

With knowledge about loops, the iterations are split between processors

Processor 0
    :
    Par For (i = 0; i < 50; i++) {
        :
    }

Processor 1
    :
    Par For (i = 50; i < 100; i++) {
        :
    }


Selective invalidation schemes

Add change bit to cache block status
  • Set change bit to true
– If read on block then invalidate and reload
Add timestamp to cache block
– Clock associated with a data structure
– Update timestamp in cache when block changed
– Can compare timestamp in block with current timestamp
Adding version number
– Similar to clock scheme
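A minimal sketch of the timestamp idea follows. The slide does not state the exact update and comparison rules, so the ones used here are assumptions: the clock of a data structure advances at the end of every parallel loop that writes it, a block written locally is stamped `clock + 1`, a block merely loaded is stamped `clock`, and a block is usable while its timestamp is not older than the clock. All type and function names are hypothetical.

```c
#include <stdbool.h>

struct shared_array { int clock; };                  /* one clock per data structure */
struct cache_block  { bool present; int timestamp; };

/* Block loaded from memory during the current loop (assumed rule). */
void block_filled(struct cache_block *b, const struct shared_array *s)
{
    b->present = true;
    b->timestamp = s->clock;
}

/* Block written by this processor: stamped one ahead so that it
   survives the clock increment at the end of the loop (assumed rule). */
void block_written(struct cache_block *b, const struct shared_array *s)
{
    b->present = true;
    b->timestamp = s->clock + 1;
}

/* Called at the end of a parallel loop that wrote the data structure. */
void loop_finished(struct shared_array *s)
{
    s->clock++;
}

/* Compare the block's timestamp with the current clock. */
bool block_usable(const struct cache_block *b, const struct shared_array *s)
{
    return b->present && b->timestamp >= s->clock;
}
```

Under these rules a copy that was only read becomes stale as soon as the loop ends, because another processor may have written it, while the writer's own copy stays valid for one more loop.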


Synchronization & Event Ordering

Mutual exclusion required in many parallel algorithms
– Monitors
– Semaphores
All high level schemes are based on low level synchronization tools
Atomic test-and-set is common in shared memory multiprocessors
– Needs to take account of cache
  • Minimum traffic generated while waiting
  • Low latency release of a waiting processor
  • Low latency acquisition of a free lock
– Typically works well on small bus based machines


Synchronization with test-and-set

Lock variable
– Open
– Closed
Acquire lock
    char *lock;
    while (exchange(lock, CLOSED) == CLOSED);
Release lock
    *lock = OPEN;
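On a current compiler the same lock can be sketched with C11 atomics, where `atomic_exchange` plays the role of the slide's `exchange`. The `spinlock_t` type and the function names are illustrative choices, and `try_acquire` is an added convenience that is not on the slide.

```c
#include <stdatomic.h>

enum { OPEN = 0, CLOSED = 1 };

typedef atomic_int spinlock_t;

/* Spin until the exchange returns OPEN, meaning this caller closed the lock. */
void acquire(spinlock_t *lock)
{
    while (atomic_exchange(lock, CLOSED) == CLOSED)
        ;  /* busy wait; every iteration is an atomic write access */
}

/* Release is a plain store of OPEN. */
void release(spinlock_t *lock)
{
    atomic_store(lock, OPEN);
}

/* Non-blocking variant: returns 1 if the lock was obtained, 0 otherwise. */
int try_acquire(spinlock_t *lock)
{
    return atomic_exchange(lock, CLOSED) == OPEN;
}
```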


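Cache states after Pi successfully executed test&set on lock

[Figure: Node i, Node j and Node k (processors Pi, Pj, Pk with caches Ci, Cj, Ck) and memory on a shared bus. After the exchange, the copies of the lock block are labelled invalid or dirty.]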


Bus commands when Pj executes test&set on lock and cache states after

[Figure: same system as the previous slide. The exchange by Pj produces the numbered bus commands Read-Blk (lock), Block (lock) and Invalidate (lock). The lock stays closed, and its copies are labelled invalid or dirty.]


Cache states after Pk executed test&set on lock

[Figure: same system as the previous slides. After the exchange by Pk the lock is still closed, and its copies in the caches and memory are labelled invalid or dirty.]


Busy waiting with cache coherence

Indivisible test-and-set instruction requires write access to lock
– Causes processor doing test-and-set to acquire variable in cache, invalidating all other copies
When multiple processors spin on lock
– Each one tries to acquire the variable in cache
– Causes cache thrashing
Instead use snooping lock
– Spin on test without indivisible test-and-set
– Only exchange once OPEN


Efficient algorithm for locking

    while (exchange(lock, CLOSED) == CLOSED)
        while (*lock == CLOSED);

First while will claim lock if OPEN and lock it
But if already CLOSED, control transfers to the second loop
– Continuously reads lock
– No bus traffic during this phase
– When lock is OPEN, try test-and-set again.


Test and test-and-set

Even more efficient to test the lock before trying to set it
    for (;;) {
        while (*lock == CLOSED);
        if (exchange(lock, CLOSED) != CLOSED)
            break;
    }
Introduces extra latency for unused locks
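A test and test-and-set lock could look as follows in C11. This is a sketch: the function names and the specific memory orderings are assumptions made for the example, not something the slides prescribe.

```c
#include <stdatomic.h>

enum { OPEN = 0, CLOSED = 1 };

void ttas_acquire(atomic_int *lock)
{
    for (;;) {
        /* Test: spin on an ordinary read, which a coherent cache can
           serve locally without bus traffic while the lock is CLOSED. */
        while (atomic_load_explicit(lock, memory_order_relaxed) == CLOSED)
            ;
        /* Test-and-set: attempt the atomic exchange only once the
           lock has been observed OPEN. */
        if (atomic_exchange_explicit(lock, CLOSED, memory_order_acquire) != CLOSED)
            break;
    }
}

void ttas_release(atomic_int *lock)
{
    atomic_store_explicit(lock, OPEN, memory_order_release);
}
```

The extra latency mentioned above is visible here: even an uncontended acquire performs one read before the exchange.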


Lock implementation on scalable multiprocessors

New York Ultracomputer and IBM RP3 implemented
– Fetch-and-add
Fetch-and-add
– Atomic operation
– All memory modules augmented with adder circuit
    int fetch_and_add(int *x, int a)
    {
        int temp;
        temp = *x;
        *x = *x + a;
        return (temp);
    }


Example of fetch-and-add

Suppose we want to implement a parallel loop
    DOALL N = 1 to 1000
        <loop body using N>
    ENDDO
Suppose we want to allocate iterations to processors dynamically
    N = 1;
    i = fetch_and_add(&N, 1);
    while (i <= 1000) {
        loop_body(i);
        i = fetch_and_add(&N, 1);
    }
Regardless of how many processors execute the loop
– Each processor will get a different value of i.
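The dynamic allocation loop can be sketched in C11, with `atomic_fetch_add` standing in for the hardware fetch-and-add. The names `run_worker`, `mark_seen` and `seen` are hypothetical; in a real program each thread of whatever threading library is in use would call `run_worker` with the same shared counter.

```c
#include <stdatomic.h>

/* Atomically returns the old value of *x and adds a to it. */
int fetch_and_add(atomic_int *x, int a)
{
    return atomic_fetch_add(x, a);
}

/* Claims loop indexes from the shared counter until `last` is passed.
   Returns how many iterations this worker executed. */
int run_worker(atomic_int *next, int last, void (*loop_body)(int))
{
    int done = 0;
    int i = fetch_and_add(next, 1);
    while (i <= last) {
        loop_body(i);
        done++;
        i = fetch_and_add(next, 1);
    }
    return done;
}

/* Example loop body (illustrative): count how often each index runs. */
int seen[1001];

void mark_seen(int i)
{
    seen[i]++;
}
```

With the counter initialised to 1 and `last` set to 1000, the workers together execute every index exactly once, however the iterations happen to be divided among them.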


Fetch-and-add

Fetch-and-add automatically allocates loop indexes in this example
But location N becomes a hotspot.
Combining network described before will not work correctly without modification.
– A combined read returns the same value to every requester, but each fetch-and-add must receive a different value
Change each switch element so it can implement the fetch-and-add operation.
Distributed operation without hotspots


Forward propagation of fetch-and-add

[Figure: eight processors each issue F & A (N,1) into a multistage network in front of the memory modules (M0 to M6 are labelled). The first switch stage combines the requests pairwise into four F & A (N,2), the second into two F & A (N,4), and the last into a single F & A (N,8) sent to M6. Each combining switch stores one of the increments it merged (1, 1, 1, 1, 2, 2, 4). M6 returns N=1 and N becomes 9.]


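Back propagation of fetch-and-add

[Figure: the reply N=1 travels back through the same switches towards the processors (memory modules M0 to M7). Each switch forwards the returned value to one of its two requests and the returned value plus its stored increment to the other: 1 and 1+4 at the last stage, then 1, 1+2, 5 and 5+2, then 1+1, 5+1, 3+1 and 7+1. The eight processors receive 1, 5, 3, 7, 2, 6, 4 and 8.]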
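The combining behaviour can be imitated in software to check the arithmetic. The sketch below assumes an idealised binary combining tree rather than the exact network wiring of the figure, so the processors receive the values 1 to 8 in index order instead of the figure's 1, 5, 3, 7, 2, 6, 4, 8. The function names are hypothetical.

```c
/* Back propagation: distribute the reply `base` to requests lo..hi-1.
   A switch gives `base` to its first half and `base` plus the summed
   increments of the first half to its second half. */
void combine_reply(const int *incr, int *result, int lo, int hi, int base)
{
    if (hi <= lo)
        return;
    if (hi - lo == 1) {
        result[lo] = base;  /* value seen by this processor */
        return;
    }
    int mid = lo + (hi - lo) / 2;
    int left_sum = 0;       /* increment stored in the switch */
    for (int i = lo; i < mid; i++)
        left_sum += incr[i];
    combine_reply(incr, result, lo, mid, base);
    combine_reply(incr, result, mid, hi, base + left_sum);
}

/* Forward propagation: all n requests are combined into a single
   fetch-and-add on *x. Returns the old value of *x. */
int combined_fetch_and_add(int *x, const int *incr, int *result, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += incr[i];
    int old = *x;
    *x += total;            /* the only access to the memory module */
    combine_reply(incr, result, 0, n, old);
    return old;
}
```

With N = 1 and eight increments of 1, memory is touched once, N becomes 9, and the eight results are all different, which is the property the loop allocation example relies on.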


A quick tour of some UMA machines

Single bus multiprocessors
– Bus working mode
  • Locked bus
  • Pending bus (Multimax)
  • Split-transaction bus (Power Challenge)
– Arbiter logic (continued on next slide)
– Memory update policy
  • Write-through (Multimax)
  • Write-back (Power Challenge)
– Cache coherency policy
  • Write-update
  • Write-invalidate (Multimax, Power Challenge)


Some real UMA machines

Arbiter logic
– Organization
  • Centralized (Multimax)
  • Distributed (Power Challenge)
– Bus allocation policy
  • Fixed priority (Multimax data bus)
  • Rotating
  • Round robin (Multimax address bus, Power Challenge)
  • Least recently used
  • First come first served


Structure of the Hector machine - NUMA

[Figure: stations, each with a station controller, are attached to local rings. Inter-ring interfaces connect the local rings to a global ring.]

To be continued


Structure of the Hector machine (cont.)

[Figure: a station consists of processor modules and an I/O module on a station bus, together with the station controller. A processor module contains a station bus interface, processor + cache, and memory. The I/O module connects an I/O adaptor (display, ethernet, disk) to the station bus.]


Structure of the Cray T3D system - NUMA

[Figure: Cray Y-MP host and I/O clusters, with connections to workstations, tape drives, disks and networks.]


Design space of CC-NUMA machines

CC-NUMA machines
– Complexity of nodes
  • Single processor node
  • Cluster
    Single bus based
    Crossbar based
– Main memory distribution
  • Per column bus
  • Per node
  • Per cluster
– Cache consistency scheme
  • Snoopy cache
  • Snoopy cache + directory
  • Directory
– Interconnection network
  • Grid of buses
  • Mesh
  • Ring


The Stanford Dash interconnection network

[Figure: clusters 11, 12, 13 and 21, 22, 23 arranged in a two-dimensional grid.]


Structure of a cluster

[Figure: a cluster contains processors (Pi1, ...) with their caches (Ci1, ...), memory, an I/O interface, and the directory and intercluster interface.]


Processor level

[Figure: Pi1 issues Load X and X is found in its own cache Ci1.]

Access time: 1 clock


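Local cluster level

[Figure: Pi1 issues Load X, which misses in Ci1. X is supplied from within the same cluster, which also contains Pij with cache Cij and the cluster memory.]

Access time: 30 clocks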


Home cluster level

[Figure: Pi1 in cluster C1i (local cluster) issues Load X. The request crosses the interconnection network to cluster C1j (home cluster), whose memory holds X.]

Access time: 100 clocks


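Remote cluster level

[Figure: Pi1 in cluster C1i (local cluster) issues Load X. The directory logic units DLi and DLj of clusters C1i and C1j take part in a numbered message sequence, of which steps 1 and 5 appear on this slide.]

DL: Directory logic
Access time: 135 clocks

Continue on next slide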


Remote cluster level (cont.)

Continue from previous slide

[Figure: in cluster C1m (home cluster) the directory entry for X is D (dirty) and names cluster Clk. Messages shown: Read-Req, Read, Read-Rply and Sharing-Writeback between the home cluster and cluster C1k (remote cluster), as steps 2, 3, 4 and 8 of the sequence.]

DL: Directory logic
D = Dirty
Access time: 135 clocks
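The access times quoted on the Dash slides (1, 30, 100 and 135 clocks) can be combined into an average memory access time once the fraction of references served at each level is known. The sketch below does only that arithmetic; the names are hypothetical, and any fractions passed in are inputs chosen by the reader, not measurements from the source.

```c
/* Levels of the Dash memory hierarchy, in the order used on the slides. */
enum dash_level {
    PROCESSOR_LEVEL,  /* hit in the processor's own cache: 1 clock */
    LOCAL_CLUSTER,    /* served inside the local cluster: 30 clocks */
    HOME_CLUSTER,     /* served by the home cluster: 100 clocks */
    REMOTE_CLUSTER,   /* served by a remote (dirty) cluster: 135 clocks */
    DASH_LEVELS
};

static const double dash_clocks[DASH_LEVELS] = { 1.0, 30.0, 100.0, 135.0 };

/* Weighted average of the access times. fraction[] gives the share of
   references satisfied at each level and is expected to sum to 1. */
double average_access_clocks(const double fraction[DASH_LEVELS])
{
    double sum = 0.0;
    for (int level = 0; level < DASH_LEVELS; level++)
        sum += fraction[level] * dash_clocks[level];
    return sum;
}
```

If every level served a quarter of the references, the average would be (1 + 30 + 100 + 135) / 4 = 66.5 clocks, which shows how quickly remote accesses dominate unless most references hit in the processor cache.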


Structure of the Dash directory

[Figure: the directory logic is split across two boards. The RC board holds the reply controller (RC), the pseudo-CPU (PCPU) and the Y-dimension routers, which carry requests and replies to clusters Y+1/Y-1. The DC board holds the directory controller (DC), the performance monitor and the X-dimension routers, which carry requests and replies to clusters X+1/X-1. Signals shown: arbitration masks, cluster bus request, remote cache status, bus retry and events. Both boards attach to the cluster address/control bus and the cluster data bus.]


Sequence of actions in a store operation requiring remote service

[Figure: Pi1 in cluster C1i issues Store X. Clusters C1i and C1j are shown with their processors, caches, memory and directory logic (DLi, DLj). Messages shown: Read-Ex-Req, Read-exclusive, Inv-Req and Inv-Ack, as steps 1, 3, 4 and 5 of the sequence.]

DL: Directory logic

Continue on next slide


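Sequence of actions in a store operation requiring remote service (cont.)

Continue from previous slide

[Figure: clusters C1m and C1k are shown with their processors, caches, memory and directory logic (DLm, DLk). The directory entry for X is S (shared) and names clusters Clk and Clj. Messages shown: Read-Ex Req, Read-exclusive, Read-Ex Rply and Inv-Req, as steps 2 and 3 of the sequence.]

DL: Directory logic
S = Shared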


Convex Exemplar architecture

[Figure: Hypernode 1 contains four functional blocks and an I/O subsystem joined by a 5x5 crossbar (1.25 Gbytes/sec). Each block has two CPUs (CPU1 to CPU8 in total), each with a 2 MB cache, plus an agent, a cache/mem control unit and 512 MB of memory. Hypernodes 1, 2, ... 16 are linked by Scalable Coherent Interface rings (600 Mbyte/sec each).]


Parallel matrix multiply code for cache-coherent machine

    global c(idim, idim), a(idim, idim)
    global b(idim, idim), nCPUs
    private i, j, k, itid
    call spawn ( nCPUs )
    do j = 1, idim
      if ( jmod(j, nCPUs).eq.itid ) then
        do i = 1, idim
          c(i, j) = 0.0
          do k = 1, idim
            c(i, j) = c(i, j) + a(i, k) * b(k, j)
          enddo
        enddo
      endif
    enddo
    call join
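The work distribution in this code is cyclic by column: thread `itid` computes every column `j` whose index modulo the number of CPUs equals its id. A C sketch of the per-thread work is given below, assuming row-major square matrices and 0-based column indexes. The function name and signature are illustrative, and the spawning and joining of threads is left out.

```c
/* Computes the columns of c = a * b that belong to thread `tid`
   when columns are dealt out cyclically over `ncpus` threads.
   a, b and c are n x n matrices stored in row-major order. */
void matmul_columns(int tid, int ncpus, int n,
                    const double *a, const double *b, double *c)
{
    for (int j = 0; j < n; j++) {
        if (j % ncpus != tid)
            continue;                    /* column owned by another thread */
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;          /* only this thread writes column j */
        }
    }
}
```

Because each element of `c` is written by exactly one thread, no locking is needed on a cache-coherent machine; the hardware keeps the copies of `a`, `b` and `c` consistent.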


Parallel matrix multiply code for non-cache-coherent machine

    global c(idim, idim), a(idim, idim)
    global b(idim, idim), nCPUs
    private i, j, k, itid, tmp
    semaphore is (idim,idim)
    call spawn ( nCPUs )
    do j = 1, idim
      if ( jmod(j, nCPUs).eq.itid ) then
        do i = 1, idim
          tmp = 0.0
          do k = 1, idim
            call flush (a(i, k))
            call flush (b(k, j))
            tmp = tmp + a(i, k) * b(k, j)
          enddo


Parallel matrix multiply code for non-cache-coherent machine (cont.)

          call lock (c(i, j), is(i, j))
          c(i, j) = tmp
          call flush (c(i, j))
          call unlock (c(i, j), is(i, j))
        enddo
      endif
    enddo
    call join


The hierarchical structure of the Kendall Square Research (KSR1) machine - COMA

[Figure: processors, each with a local cache and a local cache directory, are attached to Ring:0 rings (ALLCACHE Group:0). Each Ring:0 has a Ring:0 directory that connects it to Ring:1 (ALLCACHE Group:1). Two request paths are marked: Requester 1 with Responder 1, and Requester 2 with Responder 2.]


The convergence of scalable MIMD computers

Distributed memory computers (scalable)
– Hypercube (store & forward)
– Mesh (wormhole routing)
– Processor + comm. proc + router
Shared memory computers
– Scalable
  • Multistage (no cache consistency)
  • NUMA (no cache consistency)
  • CC-NUMA, COMA (cluster concept)
– Small size
  • Shared bus (snoopy cache)
Multi-threaded computers (scalable)
– Multi-threaded processor + communication processor + router + cache + directory

[Figure: the three families are laid out across the 1st, 2nd, 3rd and 4th generations and converge on the multi-threaded design.]