using prediction to accelerate coherence protocols
Using Prediction to Accelerate
Coherence Protocols
Shubhendu S. Mukherjee and Mark D. Hill, University of Wisconsin-Madison
The topic once again
Using Prediction to Accelerate Coherence Protocols
Discuss the concept of using prediction in a coherence protocol
See how it can be used to accelerate the protocol
Organization
Introduction
Background: Directory Protocol, Two-level Branch Predictor
Cosmos: Basic Structure, Obtaining Predictions, Implementation Issues
Integration with a Coherence Protocol: How and when to act on the predictions, Handling mis-predictions, Performance
Evaluation: Benchmarks, Results
Summary and Conclusions
Introduction
Large shared-memory multiprocessors suffer from long latencies for misses to remotely cached blocks
Proposals to lessen these latencies: multithreading, non-blocking caches, application-specific coherence protocols that predict future sharing patterns and overlap execution with coherence work
Drawbacks: a more complex programming model; sophisticated compilers required; existing predictors are directed at specific sharing patterns known a priori
Need for a general predictor, hence this paper!
If a general predictor is not in the army, then what is it? A general predictor would sit beside a standard directory or cache module, monitor coherence activity, and take appropriate actions
See the design of Cosmos coherence message predictor
Evaluate Cosmos on some scientific applications
All's well that ends well? Summarize and conclude
Background: 6810 strikes back!
Structure of a Directory Protocol
Distributed-memory multiprocessor
Hardware-based cache coherence
Directory and memory distributed among processors
Physical address gives the location of memory
Nodes connected to each other via a scalable interconnect
Messages routed from sender to receiver
Directory keeps track of sharing states, which are?
Directory Structure
[Figure: four nodes, each with a processor & caches, memory, I/O, and a directory, connected by an interconnection network]
Example: Coherence Protocol Actions
[Figure: P1 issues a write to block A, which is cached at P2; messages 1-5 flow between the two nodes' caches and directories over the interconnection network]
What are messages 1-5?
Example: Coherence Protocol Actions
[Figure: the same two-node system, now with messages 1-5 labeled]
1. P1 sends a Wr request to Dir 1
2. Dir 1 sends an Inval request to Dir 2
3. Dir 2 invalidates the cached copy at P2
4. Dir 2 sends an Inval response to Dir 1
5. Dir 1 sends a Wr response to P1
Point to ponder: multiple long-latency operations, performed sequentially
Background: 6810 strikes back!
Branch predictor. Need: execute probable instructions without waiting, thus improving performance
Two-level, basically a local predictor:
Use the PC of the branch to index into a (local) Branch History Table
Use this BHT entry to index into a per-branch Pattern History Table to obtain a branch prediction
Two Level Predictor
[Figure: Two-level predictor. Six bits of the branch PC index a 64-entry table of 14-bit per-branch histories; the 14-bit history (e.g., 10110111011001) indexes the next level, a Pattern History Table of 16K entries of 2-bit saturating counters]
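The slide's two-level scheme can be sketched in a few lines of Python. This is a minimal illustration, not from the talk; the 64-entry history table and shared 16K-entry PHT follow the sizes in the figure, and all names are mine.

```python
# Minimal two-level (local-history) branch predictor sketch, assuming a
# 64-entry branch history table of 14-bit histories and a single shared
# 16K-entry pattern history table of 2-bit saturating counters.

class TwoLevelPredictor:
    def __init__(self, bht_bits=6, hist_bits=14):
        self.bht = [0] * (1 << bht_bits)     # per-branch 14-bit histories
        self.pht = [1] * (1 << hist_bits)    # 2-bit counters, start weakly not-taken
        self.bht_mask = (1 << bht_bits) - 1
        self.hist_mask = (1 << hist_bits) - 1

    def predict(self, pc):
        hist = self.bht[pc & self.bht_mask]
        return self.pht[hist] >= 2           # counter of 2 or 3 means predict taken

    def update(self, pc, taken):
        i = pc & self.bht_mask
        hist = self.bht[i]
        # train the 2-bit saturating counter, then shift the outcome into history
        if taken:
            self.pht[hist] = min(3, self.pht[hist] + 1)
        else:
            self.pht[hist] = max(0, self.pht[hist] - 1)
        self.bht[i] = ((hist << 1) | int(taken)) & self.hist_mask

# A loop branch that is always taken quickly trains the predictor:
p = TwoLevelPredictor()
for _ in range(20):
    p.update(0x40, True)
assert p.predict(0x40)
```

The same structure reappears in Cosmos below: a first-level history table indexed by an identifier, and a second-level pattern table indexed by the history.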
What in the Universe is COSMOS?
Cosmos is a Coherence Message Predictor: it predicts the sender and type of the next incoming message for a particular block. Its structure is similar to a two-level branch predictor.
Structure of Cosmos
[Figure: a Message History Table (MHT) of Message History Registers (MHRs); each MHR is a sequence of <sender, type> tuples and indexes a per-block-address Pattern History Table]
The number of tuples per MHR constitutes its depth
The first-level table is called the Message History Table (MHT)
An MHT consists of a series of Message History Registers (MHRs), one per cache-block address
An MHR contains a sequence of <sender, type> tuples (its depth)
The second-level table is called the Pattern History Table (PHT)
There is one PHT for each MHR; the PHT is indexed by the entry in the MHR; each PHT contains prediction tuples corresponding to MHR entries
An Example: Producer - Consumer
repeat
  …
  if (producer)
    private_counter++
    shared_counter = private_counter
    barrier
  else if (consumer)
    barrier
    private_counter = shared_counter
  else
    barrier
  endif
  …
until done
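The producer-consumer loop above can be run directly with Python threads. This sketch is mine, not from the slides: it uses `threading.Barrier` for the barrier primitive and adds a second barrier per iteration so the consumer's read finishes before the producer's next write.

```python
# Runnable sketch of the slides' producer-consumer pattern (assumed names;
# a second barrier per iteration is added to keep the read/write race-free).
import threading

ITERATIONS = 5
shared_counter = 0
consumed = []                      # values observed by the consumer
barrier = threading.Barrier(2)     # producer + consumer

def producer():
    global shared_counter
    private_counter = 0
    for _ in range(ITERATIONS):
        private_counter += 1
        shared_counter = private_counter   # write, then meet at the barrier
        barrier.wait()
        barrier.wait()                     # let the consumer read before the next write

def consumer():
    for _ in range(ITERATIONS):
        barrier.wait()                     # wait for the producer's write
        consumed.append(shared_counter)
        barrier.wait()

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert consumed == [1, 2, 3, 4, 5]
```

On a directory machine, each iteration of this loop generates the same short sequence of coherence messages, which is exactly the regularity Cosmos exploits.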
An Example: Producer - Consumer
[Figure: the Producer on node 1 and the Consumer on node 2, each with a processor & caches, memory, I/O, and a directory, connected by the interconnection network]
An Example: Producer - Consumer
Messages seen by the Producer Cache (from the directory):
1. Get Wr response
2. Invalidate Wr request
[Figure: message timeline between the directory and the producer cache]
An Example: Producer - Consumer
Messages seen by the Consumer Cache (from the directory):
1. Get Rd response
2. Invalidate Rd request
[Figure: message timeline between the directory and the consumer cache]
An Example: Producer - Consumer
Messages seen by the Directory:
1. Get Wr request from the producer
2. Invalidate Rd response from the consumer
3. Get Rd request from the consumer
4. Invalidate Wr response from the producer
Sharing Pattern Signature: predictable message patterns
Producer: send Get Wr request to directory; receive Get Wr response from directory; receive Invalidate Wr request from directory; send Invalidate Wr response to directory
Consumer: send Get Rd request to directory; receive Get Rd response from directory; receive Invalidate Rd request from directory; send Invalidate Rd response to directory
Back to Cosmos
[Figure: the MHT is indexed by the global address of shared_counter; P1 is the producer, P2 the consumer]
The directory receives a get Rd request from the consumer. Indexing the Pattern History Table for shared_counter with <P2, get Rd request> yields the prediction <P1, Inval Wr response>.
Obtaining Predictions
Index into the MHT with the address of the cache block
Use the MHR entry to index into the corresponding PHT
Return the prediction (if one exists) from the PHT; this prediction is of the form <sender, message-type>
Updating Cosmos
Index into the MHT with the address of the cache block
Use the MHR entry to index into the corresponding PHT
Write the new <sender, message-type> tuple as the prediction for the index corresponding to the MHR entry
Insert the <sender, message-type> tuple into the MHR for the cache block
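These two operations can be sketched with the MHT and per-block PHTs as plain Python dicts. Class and method names are mine, not from the paper; this is a functional model, not a hardware design.

```python
# Minimal sketch of Cosmos's two-level structure. Tuples are
# (sender, message_type) pairs; `depth` is the MHR depth.
from collections import deque

class Cosmos:
    def __init__(self, depth=1):
        self.depth = depth
        self.mht = {}   # block address -> MHR (deque of recent tuples)
        self.pht = {}   # block address -> {MHR contents -> predicted tuple}

    def _mhr(self, addr):
        return self.mht.setdefault(addr, deque(maxlen=self.depth))

    def predict(self, addr):
        # Index the PHT with the current MHR contents; None if no history yet.
        key = tuple(self._mhr(addr))
        return self.pht.get(addr, {}).get(key)

    def update(self, addr, sender, msg_type):
        mhr = self._mhr(addr)
        key = tuple(mhr)
        if key:  # record the observed tuple as the prediction for this history
            self.pht.setdefault(addr, {})[key] = (sender, msg_type)
        mhr.append((sender, msg_type))

# Train on the directory-side producer-consumer signature from the example:
c = Cosmos(depth=1)
signature = [("P1", "get Wr request"), ("P2", "inval Rd response"),
             ("P2", "get Rd request"), ("P1", "inval Wr response")]
for sender, mtype in signature * 2 + signature[:3]:
    c.update("shared_counter", sender, mtype)
# Having just seen <P2, get Rd request>, Cosmos predicts <P1, inval Wr response>:
assert c.predict("shared_counter") == ("P1", "inval Wr response")
```

With `depth > 1`, the PHT key becomes a sequence of tuples, which is how deeper MHRs disambiguate signatures that share a single most-recent message.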
How Cosmos adapts to complex signatures
Consider one producer and two consumers. The two get Rd requests arrive out of order. The depth-1 PHT will then be as shown below:
[Table: depth-1 PHT, with single-tuple indices such as <P1, get Rd request> and <P2, get Rd request> and their corresponding predictions]
How Cosmos adapts to complex signatures
[Table: depth-2 PHT, with two-tuple indices such as <P2, get Rd request> <P3, get Rd request> and their corresponding predictions]
An MHR with depth greater than 1 maps the two arrival orders to distinct PHT entries, so both can be predicted correctly
Implementation issues
Storage issues:
Possible to merge the first-level table with the cache-block state at the cache and the directory?
The second-level table will need more memory to capture the pattern histories for each cache block
If the number of pattern histories per cache block is found to be low, pre-allocate memory for the pattern histories
If more pattern histories are needed, allocate them from a common pool of dynamically allocated memory
Higher prediction accuracies require greater MHR depths, which may result in large amounts of memory
Integration with a Coherence Protocol
Predictors sit beside the cache and directory modules and accelerate coherence activity in two steps:
Step 1: Monitor message activity and make a prediction
Step 2: Invoke an action based on the prediction
Key challenges: knowing how and when to act on the predictions; handling mis-predictions; performance
How to act on predictions: some examples

Prediction                          | Location  | Static/Dynamic | Action                                                                        | Protocol
Ld/St from processor                | Cache     | Static         | Prefetch block in shared or exclusive state                                   | Stanford DASH protocol
Read-modify-write                   | Directory | Static         | Directory responds with block in exclusive state on a read miss to an idle block | SGI Origin protocol
Read-modify-write                   | Cache     | Static         | Cache requests exclusive copy on read miss                                    | Dir1 SW, Dir1 SW+
Store from different processor      | Cache     | Static         | Replace block and return it to the directory                                  | Dir1 SW, Dir1 SW+
Store from different processor      | Directory | Dynamic        | Invalidate and return block to the directory if exclusive                     | Dynamic self-invalidation
Block migrates between processors   | Directory | Dynamic        | On read miss, return block to the requesting processor in exclusive state     | Migratory protocols
Detecting and Handling Mis-predictions
The usual problem with predictions: mis-predictions may leave the processor state / protocol state inconsistent
Actions taken after predictions can be classified into three categories:
1. Actions that move the protocol between two legal states
2. Actions that move the protocol to a future state, but do not expose this state to the processor
3. Actions that allow both the processor and the protocol to move to future states
Handling Mis-Predictions
Actions that move the protocol between two legal states
Example: replacement of a cache block, which moves the block from the "exclusive" to the "invalid" state
No explicit recovery is needed in this case
[Timeline: P1 Cache, Directory, and P2 Cache exchange Get Wr request, Inval Wr response, and Get Wr response messages]
Actions that move the protocol to a future state, but do not expose this state to the processor
On a mis-prediction, simply discard the future state
If the prediction is correct, commit the future state and expose it to the processor
Handling Mis-Predictions
[Timelines: P1 Cache, Directory, P2 Cache. The predictor predicts, updates protocol state, and generates a message (e.g., an Inval Wr request) ahead of time; on a mis-predict, the speculative state is discarded and the correct response is sent instead]
Actions that allow both the processor and the protocol to move to future states
These need greater support for recovery:
Before speculating, both the processor and the protocol checkpoint their states
On detecting a mis-prediction, they roll back to the checkpointed states
On a correct prediction, the current protocol and processor states are committed
Performance: how prediction affects runtime
A simplistic execution model is as follows. Let:
p be the prediction accuracy for each message,
f be the fraction of delay incurred on messages predicted correctly (e.g., f = 0 means that the time of a correctly predicted message is completely overlapped with other delays), and
r be the penalty due to a mis-predicted message (e.g., r = 0.5 implies a mis-predicted message takes 1.5 times the delay of a message without prediction).
If performance is completely determined by the number of messages in the critical path of a parallel program, then the speedup due to prediction is:

speedup = time(w/o prediction) / time(with prediction) = 1 / (p * f + (1 - p) * (1 + r))
Performance
E.g.: for a prediction accuracy of 80% (p = 0.8), a mis-prediction penalty of 100% (r = 1), and correctly predicted messages incurring only 30% of their delay (f = 0.3), the speedup is 1/0.64, i.e., about 56%
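Plugging the slide's numbers into the model above confirms the claimed speedup:

```python
# Speedup model from the previous slide: 1 / (p*f + (1-p)*(1+r)).
def speedup(p, f, r):
    return 1.0 / (p * f + (1.0 - p) * (1.0 + r))

s = speedup(p=0.8, f=0.3, r=1.0)
assert abs(s - 1.5625) < 1e-12   # 1 / (0.24 + 0.40) = 1 / 0.64, about a 56% speedup
```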
Evaluation
Cosmos's prediction accuracy is evaluated using traces of coherence messages obtained from the Wisconsin Stache protocol running five parallel scientific applications
Wisconsin Stache protocol: Stache is a software, full-map, write-invalidate directory protocol that uses part of local memory as a cache for remote data
Benchmarks: five parallel scientific applications: appbt, barnes, dsmc, moldyn, unstructured
Benchmarks
Appbt: a parallel three-dimensional computational fluid dynamics application
Barnes: simulates the interaction of a system of bodies in three dimensions using the Barnes-Hut hierarchical N-body method
Dsmc: studies the properties of a gas by simulating the movement and collision of a large number of particles in a three-dimensional domain with the discrete simulation Monte Carlo method
Moldyn: a molecular dynamics application
Unstructured: a computational fluid dynamics application that uses an unstructured mesh to model a physical structure, such as an airplane wing or body
Results
Prediction accuracy (%) by depth of MHR (1 2 3 4):

                 C (cache)        D (directory)    O (overall)
appbt            91 90 89 89      77 79 80 80      84 85 85 85
barnes           80 81 79 78      42 56 57 56      62 69 69 68
dsmc             94 95 94 94      73 77 92 92      84 86 93 93
moldyn           92 91 90 90      79 80 79 77      86 86 85 84
unstructured     85 90 90 96      65 86 88 88      74 88 89 92

C: cache prediction rate; D: directory prediction rate; O: overall prediction rate
Results: Observations
Overall prediction accuracy: 62 ~ 86%
Higher accuracy for caches than for directories: why?
Prediction accuracy increases with MHR depth, but not by much beyond a depth of 3
Appbt: high prediction accuracy; producer-consumer sharing pattern (producer reads and writes, consumer reads)
Barnes: lower accuracy than the other applications; nodes of the octree are assigned different shared-memory addresses in different iterations
Dsmc: highest accuracy among all the applications; producer-consumer sharing patterns (producer writes, consumer reads). Why higher than Appbt?
Moldyn: high accuracy; migratory and producer-consumer sharing patterns
Unstructured: different dominant signatures for the same data structures in different phases of the application; migratory and producer-consumer sharing patterns
Results: Observations
Effects of noise filters (remember them?)
Cosmos noise filter: a saturating counter from 0 to MAXCOUNT, here up to 2
For MHR depth > 2, filters do not help much. Why? Predictors with MHR depth > 1 can themselves adapt to noise, giving greater accuracy for repeating noise
Overall prediction accuracy (%) by MAXCOUNT (0, 1, 2) and depth of MHR (1 2):

                 MAXCOUNT 0    MAXCOUNT 1    MAXCOUNT 2
appbt            84 85         85 85         85 86
barnes           62 69         66 71         66 71
dsmc             84 86         86 88         86 88
moldyn           86 86         86 86         86 86
unstructured     74 88         78 89         78 89
Summary and Conclusions
Comparison with directed optimizations:
Worse: less cost-effective, as more hardware is required
Better: composing the predictors of several directed optimizations in a single protocol would be more complex than a single Cosmos; Cosmos can discover application-specific sharing patterns not known a priori
We explored using prediction to accelerate coherence protocols: a protocol executes faster if future actions can be predicted and executed speculatively
We came across Cosmos: a two-level predictor (MHT, MHR, PHT) that uses <sender, message-type> tuples
We evaluated Cosmos using scientific applications: high prediction accuracy because of predictable coherence message patterns
Cosmos is more general than directed optimizations; can be costly because of large resource usage; can be easily integrated with a protocol; can discover and track application-specific patterns not known a priori
Finally, more work is needed to determine whether the high prediction rates can be used to significantly reduce execution time with a coherence protocol
Questions?