Multiprocessing & Cache Coherency


Page 1: Multiprocessing & Cache Coherency

11/14 Multiprocessing.1

Multiprocessing & Cache Coherency

Page 2: Multiprocessing & Cache Coherency

What is Multiprocessing? (REVIEW)

• Computer system – supports several simultaneous processes
• All OSes support multiprocessing
• More complex – must share system resources
• ILP running out of steam
• Today's CPUs are chip multiprocessors (CMPs)

Page 3: Multiprocessing & Cache Coherency

Multiple Processes – One CPU (review)

[Figure: one processor and its memory holding several process contexts; each context consists of a stack, a task priority, and saved CPU registers.]

Page 4: Multiprocessing & Cache Coherency

Context Switch to Share the CPU (review)

• Time-slicing
  – Time-slice: the period of time a task runs before a context switch
  – The system timer raises a hardware interrupt
  – The kernel does the scheduling
• Preemption
  – The current task is halted and switched out by a higher-priority task
  – Typical in embedded and real-time systems

[Figure: timeline of tasks separated by context switches, one time-slice apart.]

Page 5: Multiprocessing & Cache Coherency

Process State (review)

• A process can be in one of many states: Dormant, Ready, Running, Waiting for Event, Delayed, or Interrupted

[Figure: state diagram — "task create" moves Dormant to Ready; a context switch moves Ready to Running; an interrupt moves Running to Interrupted and back; "wait for event" moves Running to Waiting for Event until the event occurs; "delay task for n ticks" moves Running to Delayed until the delay expires; "task delete" returns a task from any state to Dormant.]

Page 6: Multiprocessing & Cache Coherency

Extensions of the Memory System

[Figure: two organizations, shown in order of scale. Centralized memory ("dance hall", UMA): processors P1 … Pn, each with a cache ($), reach shared memory modules through an interconnection network. Distributed memory (NUMA): each processor/cache pair has its own local memory, and the pairs communicate through the interconnection network.]

Page 7: Multiprocessing & Cache Coherency

Symmetric Multiprocessors

• Symmetric:
  – All memory is equally far away from all processors
  – Any processor can do any I/O (e.g., set up a DMA transfer)

[Figure: processors share a CPU–memory bus with memory and a bus bridge; I/O controllers for graphics output, networks, and other devices sit on the I/O bus.]

Page 8: Multiprocessing & Cache Coherency

Bus-Based Symmetric Shared Memory

• On-chip building blocks for larger systems; already on the desktop
• Attractive for servers and parallel programs
  – Fine-grain resource sharing
  – Uniform access via loads/stores
  – Automatic data movement and coherent replication in caches
  – Cheap and powerful extension
• Normal uniprocessor mechanisms used to access data

[Figure: processors P1 … Pn, each with a cache ($), share a bus with memory and I/O devices.]

Page 9: Multiprocessing & Cache Coherency

SMP Example: Connecting IBM Power Chips

• 8-way SMP
• Each CMP has 2 cores

Page 10: Multiprocessing & Cache Coherency

Parallel Programming Models

• Programming model: languages and libraries create an abstract view of the machine
• Control
  – How is parallelism created?
  – Operation ordering
  – Synchronization control
• Data
  – Private vs. shared
  – Communication: how is shared data accessed?
• Synchronization
  – What operations can be used?
  – What are the atomic (indivisible) operations?

Page 11: Multiprocessing & Cache Coherency

Programming Model 1: Shared Memory

• Program: a collection of threads with private variables, AND shared variables (e.g., static variables, shared common blocks)
  – Threads communicate implicitly by writing/reading shared variables
  – Threads coordinate by synchronizing on shared variables

[Figure: threads P0, P1, …, Pn each hold a private variable i (i: 8, i: 2, i: 5) in private memory, while all of them read and write a shared variable s (s = …, y = ..s…) in shared memory.]

Page 12: Multiprocessing & Cache Coherency

11/14 Multiprocessing.12

Synchronization Techniques• Mutexes – mutual exclusion locks (binary semaphore)

– threads are mostly independent and must access common data lock *l = alloc_and_init(); /* shared */ lock(l); access data unlock(l);• Barrier – global (/coordinated) synchronization

– simple use of barriers -- all threads hit the same one work_on_my_subgrid(); barrier; read_neighboring_values(); barrier;• Need atomic operations bigger than loads/stores

– atomic swap, test-and-test-and-set• Transactional memory

– Hardware equivalent of optimistic concurrency– Solves many parallel programming problems

Page 13: Multiprocessing & Cache Coherency

Programming Model 2: Message Passing

• Program: a collection of processes
  – Usually fixed at program startup
  – Each process has a local address space – NO shared data
  – Logically shared data is partitioned among the processes
• Processes communicate by explicit send/receive pairs
  – Coordination is implicit in every communication event
  – MPI (Message Passing Interface) is the most commonly used software

[Figure: processes P0, P1, …, Pn, each with private memory holding its own copies of s and i (s: 12 / i: 2, s: 14 / i: 3, s: 11 / i: 1); one process executes "send P1, s" and another "receive Pn, s" across the network.]

Page 14: Multiprocessing & Cache Coherency

MPI – the De Facto Standard

• MPI has become the de facto standard for parallel computing using message passing
• Example (FYI):

    for (i = 1; i < numprocs; i++) {
        sprintf(buff, "Hello %d! ", i);
        MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
    }
    for (i = 1; i < numprocs; i++) {
        MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
        printf("%d: %s\n", myid, buff);
    }

• Pros and cons of standards
  – MPI is a standard for development in the HPC community: portability
  – The MPI standard is built on mid-80s technology

Page 15: Multiprocessing & Cache Coherency

Shared Memory vs. Message Passing

• Advantages of shared memory:
  – Implicit communication (loads/stores)
  – Low overhead when cached
• Disadvantages of shared memory:
  – Complex to scale well
  – Requires synchronization operations
• Advantages of message passing:
  – Explicit communication (sending/receiving of messages)
  – Easier to control data placement (no automatic caching)
• Disadvantages of message passing:
  – High message-passing overhead
  – Complex to program
• Due to CMPs, cache-coherent shared-memory systems will be the dominant form of multiprocessor

Page 16: Multiprocessing & Cache Coherency

Caches and Cache Coherence

• Caches play a key role
  – Reduce average data access time
  – Reduce bandwidth demands placed on the shared interconnect
• Private processor caches create a problem
  – Copies of a variable can be present in multiple caches
  – A write by one processor may not become visible to others
    • they keep reading the stale value in their caches
• Solutions
  – Cache snoop architecture & protocols

Page 17: Multiprocessing & Cache Coherency

Example Cache Coherence Problem

[Figure: P1, P2, P3, each with a cache ($), on a bus with memory and I/O devices; memory holds u:5. Events: (1) P1 reads u (caches u:5), (2) P3 reads u (caches u:5), (3) P3 writes u = 7, (4) P1 reads u = ?, (5) P2 reads u = ?]

Notes:
• Processors see different values for u after event 3
• With write-back caches, the value written back to memory depends on which cache flushes, and when it writes the value back
• Processes accessing main memory see a stale value

Page 18: Multiprocessing & Cache Coherency

Problems with Parallel I/O

• Memory → Disk: physical memory may be stale if a cache copy is dirty
• Disk → Memory: the cache may hold stale data and not see the memory writes
• Use non-cacheable pages to solve this

[Figure: disk DMA transfers travel over the memory bus to physical memory, bypassing the processor cache, which may hold cached portions of the page.]

Page 19: Multiprocessing & Cache Coherency

Snoopy Cache-Coherence Protocols

• Each cache controller "snoops" all transactions on the shared bus
  – a transaction is relevant if it is for a block the cache contains
  – the controller takes action to ensure coherence
    • invalidate, update, or supply the value
  – the action depends on the state of the block and the protocol

[Figure: processors P1 … Pn, each cache ($) holding state/address/data per line; bus-snoop logic watches cache–memory transactions on the bus shared with memory and I/O devices.]

Page 20: Multiprocessing & Cache Coherency

Write-Through Invalidate Protocol

• Basic bus-based protocol
  – Each processor has a cache with per-block state
  – All transactions over the bus are snooped
• Writes invalidate all other caches
  – there can be multiple simultaneous readers of a block, but a write invalidates them
• Two states per block in each cache
  – state bits are associated with blocks that are in the cache
  – other blocks can be seen as being in an invalid (not-present) state in that cache

[Figure: P1 … Pn caches, each line holding state/tag/data, on a bus with memory and I/O devices.]

Page 21: Multiprocessing & Cache Coherency

Example: Write-Through Invalidate

[Figure: the same sequence as the earlier coherence example — (1) P1 reads u (u:5), (2) P3 reads u (u:5), (3) P3 writes u = 7, which goes on the bus and invalidates P1's copy, (4) P1 reads u, misses, and fetches u = 7, (5) P2 reads u = 7.]

Page 22: Multiprocessing & Cache Coherency

Write-Through vs. Write-Back

• The write-through protocol is simple
  – every write is observable
• Every write goes on the bus
  – only one write can take place at a time in any processor
• Uses a lot of bandwidth!

  Example: 200 MHz dual-issue CPU, CPI = 1, 15% of instructions are 8-byte stores
  → 30 M stores per second per processor
  → 240 MB/s per processor!

[Figure: P1 … Pn caches with state/tag/data lines on a bus with memory and I/O devices.]

Page 23: Multiprocessing & Cache Coherency

Invalidate vs. Update

• Basic question of program behavior:
  – Is a block written by one processor later read by others before it is overwritten?
• Invalidate:
  – yes: readers will take a miss
  – no: multiple writes without additional traffic
• Update:
  – yes: avoids misses on later references
  – no: multiple useless updates

Page 24: Multiprocessing & Cache Coherency

Coherent Memory System

• Reading a location should return the latest value written by any process
• Easy in uniprocessors; except for I/O, which is infrequent enough that software solutions work
  – e.g., non-cacheable operations, …
• The coherence problem is more pervasive and performance-critical in multiprocessors

Page 25: Multiprocessing & Cache Coherency

Coherence Means "As If No Cache Exists"

1. Operations issued by any process occur in the order issued by that process, and
2. The value returned by a read is the last value written to that location in the serial order
3. Two necessary features:
   – Write propagation: a written value must become visible to others
   – Write serialization: writes to a location are seen in the same order by all
     • if I see w1 after w2, you should not see w1 before w2

Page 26: Multiprocessing & Cache Coherency

Two Hardware Cache Coherence Solutions

• "Snoopy" schemes
  – rely on broadcast to observe all coherence traffic
  – well suited to buses and small-scale systems
• Directory schemes
  – use centralized information to avoid broadcast
  – scale well to large numbers of processors

Page 27: Multiprocessing & Cache Coherency

Snoopy Cache Protocols

• All coherence-related activity is broadcast to all processors on a bus (e.g., the MESI protocol)
• Each processor monitors ("snoops") bus actions
• A processor reacts when the activity is relevant to its current cache contents
  – if another processor wishes to write to a line, you may need to "invalidate" (i.e., discard) the copy in your own cache
  – if another processor wishes to read a line for which you have a dirty copy, you may need to supply it

Page 28: Multiprocessing & Cache Coherency

MESI Invalidate Cache Protocol

• 4 states (per cache block/line)
  – Invalid (I)
  – Shared (S): two or more caches have a copy
  – Dirty or Modified (M): only one cache has it
  – Exclusive (E): only this cache has a copy, and it is not modified
• Implemented in most commercial processors: Core Duo, Core 2, IBM Power, …

Each cache line has an address tag plus state bits:
  M: Modified (exclusive)    E: Exclusive, unmodified
  S: Shared                  I: Invalid

Page 29: Multiprocessing & Cache Coherency

MESI Protocol

• Modified / Exclusive / Shared / Invalid
• Upon loading, a line is marked E; subsequent reads hit; a write marks it M
• If another processor's load is seen, mark the line S
• A write to an S line sends an invalidate (I) to all others and marks the line M
• If another processor reads an M line, write it back and mark it S
• A read/write to an I line misses

Page 30: Multiprocessing & Cache Coherency

Snooping with Level-2 Caches

• Processors have two-level caches
• Inclusion property: entries in the IL1 & DL1 are also in the L2, so an invalidation in the L2 forces an invalidation in the L1
• Snooping on the L2 does not affect CPU–L1 bandwidth

[Figure: four CPUs, each with an L1 cache backed by an L2 cache; a snooper sits at each L2's bus connection.]

Page 31: Multiprocessing & Cache Coherency

Cache-Coherent System Summary

• Provide a set of states, a state-transition diagram, and actions
• Manage the coherence protocol
  – (0) Determine when to invoke the coherence protocol
  – (a) Find info about the state of the block in other caches to determine the action – whether we need to communicate with other cached copies
  – (b) Locate the other copies
  – (c) Communicate with those copies (invalidate/update)
• (0) is done the same way on all systems
  – the state of the line is maintained in the cache
  – the protocol is invoked if an "access fault" occurs on the line
• Different approaches are distinguished by (a) to (c)

Page 32: Multiprocessing & Cache Coherency

Bus-Based Coherence Summary

• All of (a), (b), (c) done through broadcast on the bus
  – the faulting processor sends out a "search"
  – the others respond to the search probe and take the necessary action
• Could do it in a scalable network too
  – broadcast to all processors, and let them respond
• Conceptually simple, but broadcast doesn't scale with p
  – on a bus, bus bandwidth doesn't scale
  – on a scalable network, every fault leads to at least p network transactions
• Scalable coherence:
  – can have the same cache states and state-transition diagram
  – different mechanisms to manage the protocol

Page 33: Multiprocessing & Cache Coherency

A More Scalable Coherence Approach: Directories

• Every memory block/line has an associated directory entry
  – Tracks copies of cached blocks and their states
  – on a miss, find the directory entry; communicate only with the nodes that have copies
  – in scalable networks, communication with the directory and the copies is through network transactions
• There are alternatives for organizing directory information

[Figure: processor/cache nodes and memory/directory nodes on an interconnection network; each directory entry holds presence bits and a dirty bit.]

• k processors
• With each cache block in memory: k presence bits, 1 dirty bit
• With each cache block in the cache: 1 valid bit and 1 dirty (owner) bit

Page 34: Multiprocessing & Cache Coherency

Directory Operation

• k processors
• With each cache block in memory: k presence bits, 1 dirty bit
• With each cache block in the cache: 1 valid bit and 1 dirty (owner) bit

[Figure: same organization as the previous slide — processor/cache nodes and memory/directory nodes on an interconnection network.]

• Read from memory by processor i:
  – If the dirty bit is OFF then { read from main memory; turn p[i] ON; }
  – If the dirty bit is ON then { recall the line from the dirty processor; update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i; }
• Write to memory by processor i:
  – If the dirty bit is OFF then { send invalidations to all caches that have the block; turn the dirty bit ON; supply data to i; turn p[i] ON; ... }