using prediction to accelerate coherence protocols
Using Prediction to Accelerate
Coherence Protocols
Shubhendu S. Mukherjee and Mark D. Hill, University of Wisconsin-Madison
The topic once again
Using Prediction to Accelerate Coherence Protocols
Discuss the concept of using prediction in a coherence protocol
See how it can be used to accelerate the protocol
Organization
Introduction
Background: Directory Protocol, Two-level Branch Predictor
Cosmos: Basic Structure, Obtaining Predictions, Implementation Issues
Integration with a Coherence Protocol: How and when to act on the predictions, Handling mis-predictions, Performance
Evaluation: Benchmarks, Results
Summary and Conclusions
Introduction
Large shared-memory multiprocessors suffer from long latencies for misses to remotely cached blocks
Proposals to lessen these latencies: multithreading, non-blocking caches, application-specific coherence protocols that predict future sharing patterns and overlap execution with coherence work
Drawbacks: a more complex programming model; sophisticated compilers required; existing predictors are directed at specific sharing patterns known a priori
Need for a general predictor, hence this paper!
If a general predictor is not in the army, then what is it? A general predictor would sit beside a standard directory or cache module, monitor coherence activity, and take appropriate actions
See the design of Cosmos coherence message predictor
Evaluate Cosmos on some scientific applications
All's well that ends well? Summarize and conclude
Background: 6810 strikes back!
Structure of a Directory Protocol
Distributed-memory multiprocessor
Hardware-based cache coherence
Directory and memory distributed among processors
Physical address gives the location of memory
Nodes connected to each other via a scalable interconnect
Messages routed from sender to receiver
Directory keeps track of sharing states, which are?
Directory Structure
[Figure: four nodes, each with a processor & caches, memory, I/O, and a directory, connected by an interconnection network]
Example: Coherence Protocol Actions
[Figure: P1 issues a write to block A, which is cached at P2; messages 1-5 flow between the two nodes' caches and directories over the interconnection network]
What are messages 1-5?
Example: Coherence Protocol Actions
[Figure: the same two-node system, now with messages 1-5 labeled]
1. P1 sends a Wr request to Dir 1
2. Dir 1 sends an Inval request to Dir 2
3. Dir 2 invalidates the cached copy at P2
4. Dir 2 sends an Inval response to Dir 1
5. Dir 1 sends a Wr response to P1
Point to ponder: multiple long-latency operations, performed sequentially
Background: 6810 strikes back!
Branch predictor. Need: execute probable instructions without waiting, thus improving performance
Two-level, basically a local predictor:
Use the PC of the branch to index into a (local) Branch History Table
Use this BHT entry to index into a per-branch Pattern History Table to obtain a branch prediction
Two Level Predictor
[Figure: Two-level predictor. Six bits of the branch PC index a 64-entry table of 14-bit per-branch histories; the 14-bit history (e.g., 10110111011001) indexes the next level, a Pattern History Table of 16K entries of 2-bit saturating counters]
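The slide's two-level scheme can be sketched in a few lines of Python. This is a minimal illustration, not from the talk; the 64-entry history table and shared 16K-entry PHT follow the sizes in the figure, and all names are mine.

```python
# Minimal two-level (local-history) branch predictor sketch, assuming a
# 64-entry branch history table of 14-bit histories and a single shared
# 16K-entry pattern history table of 2-bit saturating counters.

class TwoLevelPredictor:
    def __init__(self, bht_bits=6, hist_bits=14):
        self.bht = [0] * (1 << bht_bits)     # per-branch 14-bit histories
        self.pht = [1] * (1 << hist_bits)    # 2-bit counters, start weakly not-taken
        self.bht_mask = (1 << bht_bits) - 1
        self.hist_mask = (1 << hist_bits) - 1

    def predict(self, pc):
        hist = self.bht[pc & self.bht_mask]
        return self.pht[hist] >= 2           # counter of 2 or 3 means predict taken

    def update(self, pc, taken):
        i = pc & self.bht_mask
        hist = self.bht[i]
        # train the 2-bit saturating counter, then shift the outcome into history
        if taken:
            self.pht[hist] = min(3, self.pht[hist] + 1)
        else:
            self.pht[hist] = max(0, self.pht[hist] - 1)
        self.bht[i] = ((hist << 1) | int(taken)) & self.hist_mask

# A loop branch that is always taken quickly trains the predictor:
p = TwoLevelPredictor()
for _ in range(20):
    p.update(0x40, True)
assert p.predict(0x40)
```

The same structure reappears in Cosmos below: a first-level history table indexed by an identifier, and a second-level pattern table indexed by the history.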
What in the Universe is COSMOS?
Cosmos is a Coherence Message Predictor: it predicts the sender and type of the next incoming message for a particular block. Its structure is similar to a two-level branch predictor.
Structure of Cosmos
[Figure: a Message History Table (MHT) of Message History Registers (MHRs); each MHR is a sequence of <sender, type> tuples and indexes a per-block-address Pattern History Table]
The number of tuples per MHR constitutes its depth
The first-level table is called the Message History Table (MHT)
An MHT consists of a series of Message History Registers (MHRs), one per cache-block address
An MHR contains a sequence of <sender, type> tuples (its depth)
The second-level table is called the Pattern History Table (PHT)
There is one PHT for each MHR; the PHT is indexed by the entry in the MHR; each PHT contains prediction tuples corresponding to MHR entries
An Example: Producer - Consumer
repeat
  …
  if (producer)
    private_counter++
    shared_counter = private_counter
    barrier
  else if (consumer)
    barrier
    private_counter = shared_counter
  else
    barrier
  endif
  …
until done
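The producer-consumer loop above can be run directly with Python threads. This sketch is mine, not from the slides: it uses `threading.Barrier` for the barrier primitive and adds a second barrier per iteration so the consumer's read finishes before the producer's next write.

```python
# Runnable sketch of the slides' producer-consumer pattern (assumed names;
# a second barrier per iteration is added to keep the read/write race-free).
import threading

ITERATIONS = 5
shared_counter = 0
consumed = []                      # values observed by the consumer
barrier = threading.Barrier(2)     # producer + consumer

def producer():
    global shared_counter
    private_counter = 0
    for _ in range(ITERATIONS):
        private_counter += 1
        shared_counter = private_counter   # write, then meet at the barrier
        barrier.wait()
        barrier.wait()                     # let the consumer read before the next write

def consumer():
    for _ in range(ITERATIONS):
        barrier.wait()                     # wait for the producer's write
        consumed.append(shared_counter)
        barrier.wait()

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert consumed == [1, 2, 3, 4, 5]
```

On a directory machine, each iteration of this loop generates the same short sequence of coherence messages, which is exactly the regularity Cosmos exploits.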
An Example: Producer - Consumer
[Figure: the Producer on node 1 and the Consumer on node 2, each with a processor & caches, memory, I/O, and a directory, connected by the interconnection network]
An Example: Producer - Consumer
Messages seen by the Producer Cache (from the directory):
1. Get Wr response
2. Invalidate Wr request
[Figure: message timeline between the directory and the producer cache]
An Example: Producer - Consumer
Messages seen by the Consumer Cache (from the directory):
1. Get Rd response
2. Invalidate Rd request
[Figure: message timeline between the directory and the consumer cache]
An Example: Producer - Consumer
Messages seen by the Directory:
1. Get Wr request from the producer
2. Invalidate Rd response from the consumer
3. Get Rd request from the consumer
4. Invalidate Wr response from the producer
Sharing Pattern Signature: predictable message patterns
Producer: send Get Wr request to directory; receive Get Wr response from directory; receive Invalidate Wr request from directory; send Invalidate Wr response to directory
Consumer: send Get Rd request to directory; receive Get Rd response from directory; receive Invalidate Rd request from directory; send Invalidate Rd response to directory
Back to Cosmos
[Figure: the MHT is indexed by the global address of shared_counter; P1 is the producer, P2 the consumer]
The directory receives a get Rd request from the consumer. Indexing the Pattern History Table for shared_counter with <P2, get Rd request> yields the prediction <P1, Inval Wr response>.
Obtaining Predictions
Index into the MHT with the address of the cache block
Use the MHR entry to index into the corresponding PHT
Return the prediction (if one exists) from the PHT; this prediction is of the form <sender, message-type>
Updating Cosmos
Index into the MHT with the address of the cache block
Use the MHR entry to index into the corresponding PHT
Write the new <sender, message-type> tuple as the prediction for the index corresponding to the MHR entry
Insert the <sender, message-type> tuple into the MHR for the cache block
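These two operations can be sketched with the MHT and per-block PHTs as plain Python dicts. Class and method names are mine, not from the paper; this is a functional model, not a hardware design.

```python
# Minimal sketch of Cosmos's two-level structure. Tuples are
# (sender, message_type) pairs; `depth` is the MHR depth.
from collections import deque

class Cosmos:
    def __init__(self, depth=1):
        self.depth = depth
        self.mht = {}   # block address -> MHR (deque of recent tuples)
        self.pht = {}   # block address -> {MHR contents -> predicted tuple}

    def _mhr(self, addr):
        return self.mht.setdefault(addr, deque(maxlen=self.depth))

    def predict(self, addr):
        # Index the PHT with the current MHR contents; None if no history yet.
        key = tuple(self._mhr(addr))
        return self.pht.get(addr, {}).get(key)

    def update(self, addr, sender, msg_type):
        mhr = self._mhr(addr)
        key = tuple(mhr)
        if key:  # record the observed tuple as the prediction for this history
            self.pht.setdefault(addr, {})[key] = (sender, msg_type)
        mhr.append((sender, msg_type))

# Train on the directory-side producer-consumer signature from the example:
c = Cosmos(depth=1)
signature = [("P1", "get Wr request"), ("P2", "inval Rd response"),
             ("P2", "get Rd request"), ("P1", "inval Wr response")]
for sender, mtype in signature * 2 + signature[:3]:
    c.update("shared_counter", sender, mtype)
# Having just seen <P2, get Rd request>, Cosmos predicts <P1, inval Wr response>:
assert c.predict("shared_counter") == ("P1", "inval Wr response")
```

With `depth > 1`, the PHT key becomes a sequence of tuples, which is how deeper MHRs disambiguate signatures that share a single most-recent message.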
How Cosmos adapts to complex signatures
Consider one producer and two consumers. The two get Rd requests arrive out of order. The depth-1 PHT will then be as shown below:
[Table: depth-1 PHT, with single-tuple indices such as <P1, get Rd request> and <P2, get Rd request> and their corresponding predictions]
How Cosmos adapts to complex signatures
[Table: depth-2 PHT, with two-tuple indices such as <P2, get Rd request> <P3, get Rd request> and their corresponding predictions]
An MHR with depth greater than 1 maps the two arrival orders to distinct PHT entries, so both can be predicted correctly
Implementation issues
Storage issues:
Possible to merge the first-level table with the cache-block state at the cache and the directory?
The second-level table will need more memory to capture the pattern histories for each cache block
If the number of pattern histories per cache block is found to be low, pre-allocate memory for the pattern histories
If more pattern histories are needed, allocate them from a common pool of dynamically allocated memory
Higher prediction accuracies require greater MHR depths, which may result in large amounts of memory
Integration with a Coherence Protocol
Predictors sit beside the cache and directory modules and accelerate coherence activity in two steps:
Step 1: Monitor message activity and make a prediction
Step 2: Invoke an action based on the prediction
Key challenges: knowing how and when to act on the predictions; handling mis-predictions; performance
How to act on predictions: some examples

Prediction                          | Location  | Static/Dynamic | Action                                                                        | Protocol
Ld/St from processor                | Cache     | Static         | Prefetch block in shared or exclusive state                                   | Stanford DASH protocol
Read-modify-write                   | Directory | Static         | Directory responds with block in exclusive state on a read miss to an idle block | SGI Origin protocol
Read-modify-write                   | Cache     | Static         | Cache requests exclusive copy on read miss                                    | Dir1 SW, Dir1 SW+
Store from different processor      | Cache     | Static         | Replace block and return it to the directory                                  | Dir1 SW, Dir1 SW+
Store from different processor      | Directory | Dynamic        | Invalidate and return block to the directory if exclusive                     | Dynamic self-invalidation
Block migrates between processors   | Directory | Dynamic        | On read miss, return block to the requesting processor in exclusive state     | Migratory protocols
Detecting and Handling Mis-predictions
The usual problem with predictions: mis-predictions may leave the processor state / protocol state inconsistent
Actions taken after predictions can be classified into three categories:
1. Actions that move the protocol between two legal states
2. Actions that move the protocol to a future state, but do not expose this state to the processor
3. Actions that allow both the processor and the protocol to move to future states
Handling Mis-Predictions
Actions that move the protocol between two legal states
Example: replacement of a cache block, which moves the block from the "exclusive" to the "invalid" state
No explicit recovery is needed in this case
[Timeline: P1 Cache, Directory, and P2 Cache exchange Get Wr request, Inval Wr response, and Get Wr response messages]
Actions that move the protocol to a future state, but do not expose this state to the processor
On a mis-prediction, simply discard the future state
If the prediction is correct, commit the future state and expose it to the processor
Handling Mis-Predictions
[Timelines: P1 Cache, Directory, P2 Cache. The predictor predicts, updates protocol state, and generates a message (e.g., an Inval Wr request) ahead of time; on a mis-predict, the speculative state is discarded and the correct response is sent instead]
Actions that allow both the processor and the protocol to move to future states
These need greater support for recovery:
Before speculating, both the processor and the protocol checkpoint their states
On detecting a mis-prediction, they roll back to the checkpointed states
On a correct prediction, the current protocol and processor states are committed
Performance: how prediction affects runtime
A simplistic execution model is as follows. Let:
p be the prediction accuracy for each message,
f be the fraction of delay incurred on messages predicted correctly (e.g., f = 0 means that the time of a correctly predicted message is completely overlapped with other delays), and
r be the penalty due to a mis-predicted message (e.g., r = 0.5 implies a mis-predicted message takes 1.5 times the delay of a message without prediction).
If performance is completely determined by the number of messages in the critical path of a parallel program, then the speedup due to prediction is:

speedup = time(w/o prediction) / time(with prediction) = 1 / (p * f + (1 - p) * (1 + r))
Performance
E.g.: for a prediction accuracy of 80% (p = 0.8), a mis-prediction penalty of 100% (r = 1), and correctly predicted messages incurring only 30% of their delay (f = 0.3), the speedup is 1/0.64, i.e., about 56%
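Plugging the slide's numbers into the model above confirms the claimed speedup:

```python
# Speedup model from the previous slide: 1 / (p*f + (1-p)*(1+r)).
def speedup(p, f, r):
    return 1.0 / (p * f + (1.0 - p) * (1.0 + r))

s = speedup(p=0.8, f=0.3, r=1.0)
assert abs(s - 1.5625) < 1e-12   # 1 / (0.24 + 0.40) = 1 / 0.64, about a 56% speedup
```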
Evaluation
Cosmos's prediction accuracy is evaluated using traces of coherence messages obtained from the Wisconsin Stache protocol running five parallel scientific applications
Wisconsin Stache protocol: Stache is a software, full-map, write-invalidate directory protocol that uses part of local memory as a cache for remote data
Benchmarks: five parallel scientific applications: appbt, barnes, dsmc, moldyn, unstructured
Benchmarks
Appbt: a parallel three-dimensional computational fluid dynamics application
Barnes: simulates the interaction of a system of bodies in three dimensions using the Barnes-Hut hierarchical N-body method
Dsmc: studies the properties of a gas by simulating the movement and collision of a large number of particles in a three-dimensional domain with the discrete simulation Monte Carlo method
Moldyn: a molecular dynamics application
Unstructured: a computational fluid dynamics application that uses an unstructured mesh to model a physical structure, such as an airplane wing or body
Results
Prediction accuracy (%) by depth of MHR (1 2 3 4):

                 C (cache)        D (directory)    O (overall)
appbt            91 90 89 89      77 79 80 80      84 85 85 85
barnes           80 81 79 78      42 56 57 56      62 69 69 68
dsmc             94 95 94 94      73 77 92 92      84 86 93 93
moldyn           92 91 90 90      79 80 79 77      86 86 85 84
unstructured     85 90 90 96      65 86 88 88      74 88 89 92

C: cache prediction rate; D: directory prediction rate; O: overall prediction rate
Results: Observations
Overall prediction accuracy: 62 ~ 86%
Higher accuracy for caches than for directories: why?
Prediction accuracy increases with MHR depth, but not by much beyond a depth of 3
Appbt: high prediction accuracy; producer-consumer sharing pattern (producer reads and writes, consumer reads)
Barnes: lower accuracy than the other applications; nodes of the octree are assigned different shared-memory addresses in different iterations
Dsmc: highest accuracy among all the applications; producer-consumer sharing patterns (producer writes, consumer reads). Why higher than Appbt?
Moldyn: high accuracy; migratory and producer-consumer sharing patterns
Unstructured: different dominant signatures for the same data structures in different phases of the application; migratory and producer-consumer sharing patterns
Results: Observations
Effects of noise filters (remember them?)
Cosmos noise filter: a saturating counter from 0 to MAXCOUNT, here up to 2
For MHR depth > 2, filters do not help much. Why? Predictors with MHR depth > 1 can themselves adapt to noise, giving greater accuracy for repeating noise
Overall prediction accuracy (%) by MAXCOUNT (0, 1, 2) and depth of MHR (1 2):

                 MAXCOUNT 0    MAXCOUNT 1    MAXCOUNT 2
appbt            84 85         85 85         85 86
barnes           62 69         66 71         66 71
dsmc             84 86         86 88         86 88
moldyn           86 86         86 86         86 86
unstructured     74 88         78 89         78 89
Summary and Conclusions
Comparison with directed optimizations:
Worse: less cost-effective, as more hardware is required
Better: composing the predictors of several directed optimizations in a single protocol would be more complex than a single Cosmos; Cosmos can discover application-specific sharing patterns not known a priori
We explored using prediction to accelerate coherence protocols: a protocol executes faster if future actions can be predicted and executed speculatively
We came across Cosmos: a two-level predictor (MHT, MHR, PHT) that uses <sender, message-type> tuples
We evaluated Cosmos using scientific applications: high prediction accuracy because of predictable coherence message patterns
Cosmos is more general than directed optimizations; can be costly because of large resource usage; can be easily integrated with a protocol; can discover and track application-specific patterns not known a priori
Finally, more work is needed to determine whether the high prediction rates can be used to significantly reduce execution time with a coherence protocol
Questions?