18-447: Computer Architecture
Lecture 24: Runahead and Multiprocessing
Prof. Onur Mutlu
Carnegie Mellon University
Spring 2012, 4/23/2012
Reminder: Lab Assignment 6
- Due Today
- Implementing a more realistic memory hierarchy
  - L2 cache model
  - DRAM, memory controller models
  - MSHRs, multiple outstanding misses
- Extra credit: Prefetching
2
Reminder: Lab Assignment 7
- Cache coherence in multi-core systems
- MESI cache coherence protocol
- Due May 4
- Extra credit: Improve the protocol (open-ended)
3
Midterm II
- Midterms will be distributed today
- Please give me a 15-minute warning!
4
Final Exam
- May 10
- Comprehensive (over all topics in the course)
- Three cheat sheets allowed
- We will have a review session (stay tuned)
- Remember this is 30% of your grade
  - I will take into account your improvement over the course
  - Know the previous midterm concepts by heart
5
A Belated Note on Course Feedback
- We have taken into account your feedback
- Some feedback was contradictory
  - Pace of course
    - Fast or slow?
    - Videos help
  - Homeworks
    - Love or hate?
  - Workload
    - Too little/easy
    - Too heavy
- Many of you indicated you are learning a whole lot
6
Readings for Today
- Required
  - Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967.
  - Hill, Jouppi, Sohi, “Multiprocessors and Multicomputers,” pp. 551-560 in Readings in Computer Architecture.
  - Hill, Jouppi, Sohi, “Dataflow and Multithreading,” pp. 309-314 in Readings in Computer Architecture.
- Recommended
  - Historical: Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966.
7
Readings for Wednesday
- Required cache coherence readings:
  - Culler and Singh, Parallel Computer Architecture
    - Chapter 5.1 (pp. 269-283), Chapter 5.3 (pp. 291-305)
  - P&H, Computer Organization and Design
    - Chapter 5.8 (pp. 534-538 in 4th and 4th revised eds.)
- Recommended:
  - Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs,” IEEE Transactions on Computers, 1979.
  - Papamarcos and Patel, “A low-overhead coherence solution for multiprocessors with private cache memories,” ISCA 1984.
8
Last Lecture
- Wrap up prefetching
  - Markov prefetching
  - Content-directed prefetching
  - Execution-based prefetching
  - Runahead execution
9
Today
- Wrap up runahead execution
- Multiprocessing fundamentals
- The cache coherence problem
10
Memory Latency Tolerance
and Runahead Execution
How Do We Tolerate Stalls Due to Memory?
- Two major approaches
  - Reduce/eliminate stalls
  - Tolerate the effect of a stall when it happens
- Four fundamental techniques to achieve these
  - Caching
  - Prefetching
  - Multithreading
  - Out-of-order execution
- Many techniques have been developed to make these four fundamental techniques more effective in tolerating memory latency
12
Review: Execution-based Prefetchers
- Idea: Pre-execute a piece of the (pruned) program solely for prefetching data
  - Only need to distill pieces that lead to cache misses
- Speculative thread: Pre-executed program piece can be considered a “thread”
- Speculative thread can be executed
  - On a separate processor/core
  - On a separate hardware thread context (think fine-grained multithreading)
  - On the same thread context in idle cycles (during cache misses)
13
Review: Thread-Based Pre-Execution
- Dubois and Song, “Assisted Execution,” USC Tech Report 1998.
- Chappell et al., “Simultaneous Subordinate Microthreading (SSMT),” ISCA 1999.
- Zilles and Sohi, “Execution-based Prediction Using Speculative Slices,” ISCA 2001.
14
Review: Runahead Execution

[Timeline diagram comparing three executions:
- Small Window: Compute, Load 1 Miss, Stall, Compute, Load 2 Miss, Stall; each miss serializes with compute.
- Perfect Caches: Compute, Load 1 Hit, Compute, Load 2 Hit, Compute; no stalls.
- Runahead: On Load 1 Miss the processor enters runahead mode and encounters Load 2 Miss, so Miss 1 and Miss 2 overlap; on re-execution Load 2 hits, yielding Saved Cycles.]
Review: Runahead Execution Pros and Cons
- Advantages:
  + Very accurate prefetches for data/instructions (all cache levels)
  + Follows the program path
  + Simple to implement, most of the hardware is already built in
  + Versus other pre-execution based prefetching mechanisms:
    + Uses the same thread context as main thread, no waste of context
    + No need to construct a pre-execution thread
- Disadvantages/Limitations:
  -- Extra executed instructions
  -- Limited by branch prediction accuracy
  -- Cannot prefetch dependent cache misses. Solution?
  -- Effectiveness limited by available “memory-level parallelism” (MLP)
  -- Prefetch distance limited by memory latency
- Implemented in IBM POWER6, Sun “Rock”
16
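To make the mechanism concrete, here is a toy sketch of how runahead mode propagates invalid (INV) results, my own simplification for illustration rather than the HPCA 2003 implementation: the missing load's destination is marked INV, INV poisons dependents, and everything else executes normally (so independent loads still generate useful prefetch addresses).

```python
# Toy model of INV propagation in runahead mode (an illustrative sketch,
# not the actual hardware design).
INV = object()  # marker for values that depend on the missing load

def runahead_step(regs, instr, miss_dest=None):
    """Execute one instruction in runahead mode.

    instr = (op, dst, srcs); the toy ALU simply adds its sources.
    If this instruction is the missing load (dst == miss_dest), or any
    source register is INV, the destination becomes INV.
    """
    op, dst, srcs = instr
    if dst == miss_dest:
        regs[dst] = INV                      # missing load: result unknown
    elif any(regs.get(s) is INV for s in srcs):
        regs[dst] = INV                      # INV propagates to dependents
    else:
        regs[dst] = sum(regs[s] for s in srcs)
    return regs
```

Instructions independent of the miss still compute real values, which is why runahead's prefetches are accurate; anything data-dependent on the miss is poisoned, which is exactly the dependent-miss limitation discussed below.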
Execution-based Prefetchers Pros and Cons
+ Can prefetch pretty much any access pattern
+ Can be very low cost (e.g., runahead execution)
  + Especially if it uses the same hardware context
  + Why? The processor is equipped to execute the program anyway
+ Can be bandwidth-efficient (e.g., runahead execution)
-- Depends on branch prediction and possibly value prediction accuracy
  -- Mispredicted branches dependent on missing data throw the thread off the correct execution path
-- Can be wasteful
  -- Speculatively executes many instructions
  -- Can occupy a separate thread context
17
Performance of Runahead Execution

[Chart: Micro-operations Per Cycle for S95, FP00, INT00, WEB, MM, PROD, SERV, WS, and AVG under four configurations: no prefetcher/no runahead, only prefetcher (baseline), only runahead, and prefetcher + runahead. Improvement labels include 12%, 35%, 15%, 22%, 13%, 16%, and 52%.]

18
Runahead Execution vs. Large Windows

[Chart: Micro-operations Per Cycle for S95, FP00, INT00, WEB, MM, PROD, SERV, WS, and AVG, comparing a 128-entry window (baseline), a 128-entry window with runahead, and 256-, 384-, and 512-entry windows.]

19
Runahead on In-order vs. Out-of-order

[Chart: Micro-operations Per Cycle for S95, FP00, INT00, WEB, MM, PROD, SERV, WS, and AVG, comparing in-order baseline, in-order + runahead, out-of-order baseline, and out-of-order + runahead. Runahead improvement labels range from 10% to 73%.]

20
Effect of Runahead in Sun ROCK
- Shailender Chaudhry talk, Aug 2008.
21
Limitations of the Baseline Runahead Mechanism
- Energy inefficiency
  - A large number of instructions are speculatively executed
  - Efficient Runahead Execution [ISCA’05, IEEE Micro Top Picks’06]
- Ineffectiveness for pointer-intensive applications
  - Runahead cannot parallelize dependent L2 cache misses
  - Address-Value Delta (AVD) Prediction [MICRO’05]
- Irresolvable branch mispredictions in runahead mode
  - Cannot recover from a mispredicted L2-miss dependent branch
  - Wrong Path Events [MICRO’04]
The Efficiency Problem

[Chart: % Increase in IPC vs. % Increase in Executed Instructions for the SPEC CPU2000 benchmarks (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise) and AVG. On average, runahead increases IPC by 22% while increasing executed instructions by 27%; the worst case reaches 235%.]
Causes of Inefficiency
- Short runahead periods
- Overlapping runahead periods
- Useless runahead periods
- Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” ISCA 2005, IEEE Micro Top Picks 2006.
Overall Impact on Executed Instructions

[Chart: Increase in Executed Instructions for the SPEC CPU2000 benchmarks and AVG, comparing baseline runahead (26.5% average increase, 235% worst case) against all efficiency techniques combined (6.2% average increase).]
The Problem: Dependent Cache Misses

[Timeline: During runahead triggered by Load 1 Miss, Load 2 depends on Load 1's data, so runahead cannot compute its address; Load 2 is marked INV, and Miss 2 cannot be started in parallel with Miss 1.]

- Runahead execution cannot parallelize dependent misses
  - Wasted opportunity to improve performance
  - Wasted energy (useless pre-execution)
- Runahead performance would improve by 25% if this limitation were ideally overcome
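A hypothetical pointer-chasing loop makes the dependence concrete: each node's address is the value loaded from the previous node, so the misses cannot overlap no matter how far runahead runs ahead.

```python
# Illustrative example: linked-list traversal produces dependent misses,
# because the address of each load is the value returned by the prior load.
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

def sum_list(head):
    total = 0
    node = head
    while node is not None:
        total += node.value
        node = node.next  # dependent load: address unknown until the prior load returns
    return total
```

In runahead mode, the second load's address register would be INV, so no prefetch can be issued for it.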
Parallelizing Dependent Cache Misses
- Idea: Enable the parallelization of dependent L2 cache misses in runahead mode with a low-cost mechanism
- How: Predict the values of L2-miss address (pointer) loads
  - Address load: loads an address into its destination register, which is later used to calculate the address of another load
  - As opposed to a data load
- Read:
  - Mutlu et al., “Address-Value Delta (AVD) Prediction,” MICRO 2005.
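A minimal sketch of the AVD idea: for many address loads, the loaded value minus the load address is a stable delta, so the value of an L2-miss load can be predicted from its address. The PC-indexed table organization below is an assumption for illustration, not the paper's exact design.

```python
# Hedged sketch of Address-Value Delta (AVD) prediction: track the last
# observed (value - address) delta per load PC and use it to predict the
# value of a future L2-miss load. Table organization is hypothetical.
class AVDPredictor:
    def __init__(self):
        self.delta = {}          # last observed (value - address) per load PC

    def train(self, pc, address, value):
        self.delta[pc] = value - address

    def predict(self, pc, address):
        if pc not in self.delta:
            return None          # no prediction: the load stays INV in runahead
        return address + self.delta[pc]
```

With a predicted value, runahead can compute the dependent load's address and issue Miss 2 in parallel with Miss 1.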
Parallelizing Dependent Cache Misses

[Timeline comparison: Without value prediction, Load 2 is INV during runahead (its address cannot be computed), so Miss 2 starts only after Miss 1 returns. With the value of Load 1 predicted, runahead can compute Load 2's address, Miss 2 overlaps with Miss 1, Load 2 hits on re-execution, and both cycles and speculative instructions are saved.]
Readings
- Required
  - Mutlu et al., “Runahead Execution,” HPCA 2003.
- Recommended
  - Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” ISCA 2005, IEEE Micro Top Picks 2006.
  - Mutlu et al., “Address-Value Delta (AVD) Prediction,” MICRO 2005.
  - Armstrong et al., “Wrong Path Events,” MICRO 2004.
29
Multiprocessors and
Issues in Multiprocessing
Readings for Today
- Required
  - Hill, Jouppi, Sohi, “Multiprocessors and Multicomputers,” pp. 551-560 in Readings in Computer Architecture.
  - Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967.
  - Hill, Jouppi, Sohi, “Dataflow and Multithreading,” pp. 309-314 in Readings in Computer Architecture.
- Recommended
  - Historical: Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966.
31
Readings for Wednesday
- Required cache coherence readings:
  - Culler and Singh, Parallel Computer Architecture
    - Chapter 5.1 (pp. 269-283), Chapter 5.3 (pp. 291-305)
  - P&H, Computer Organization and Design
    - Chapter 5.8 (pp. 534-538 in 4th and 4th revised eds.)
- Recommended:
  - Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs,” IEEE Transactions on Computers, 1979.
  - Papamarcos and Patel, “A low-overhead coherence solution for multiprocessors with private cache memories,” ISCA 1984.
32
Remember: Flynn’s Taxonomy of Computers
- Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966
- SISD: Single instruction operates on single data element
- SIMD: Single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- MISD: Multiple instructions operate on single data element
  - Closest form: systolic array processor, streaming processor
- MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor
33
Why Parallel Computers?
- Parallelism: Doing multiple things at a time
- Things: instructions, operations, tasks
- Main Goal
  - Improve performance (execution time or task throughput)
    - Execution time of a program is governed by Amdahl’s Law
- Other Goals
  - Reduce power consumption
    - (4N units at freq F/4) consume less power than (N units at freq F)
    - Why?
  - Improve cost efficiency and scalability, reduce complexity
    - Harder to design a single unit that performs as well as N simpler units
  - Improve dependability: Redundant execution in space

34
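The power claim follows from the dynamic power relation P ∝ C · V² · f, together with the fact that a lower clock frequency permits a lower supply voltage. A hedged sketch (the linear V-with-f scaling below is an assumption for illustration, not a statement from the slides):

```python
# Dynamic power scales roughly as C * V^2 * f. Running 4N slower units at
# F/4 with a proportionally lowered voltage can use far less power than
# N units at F. Linear voltage-frequency scaling is assumed here.
def dynamic_power(units, c, v, f):
    return units * c * v**2 * f

baseline = dynamic_power(1, 1.0, 1.0, 1.0)    # N units at frequency F (normalized, N=1)
parallel = dynamic_power(4, 1.0, 0.25, 0.25)  # 4N units at F/4, V scaled to V/4

assert parallel < baseline  # 0.0625 vs. 1.0 under these assumptions
```

The quadratic voltage term is what makes the trade profitable: quadrupling units costs 4x, but the V² · f reduction saves much more.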
Types of Parallelism and How to Exploit Them
- Instruction Level Parallelism
  - Different instructions within a stream can be executed in parallel
  - Pipelining, out-of-order execution, speculative execution, VLIW
  - Dataflow
- Data Parallelism
  - Different pieces of data can be operated on in parallel
  - SIMD: Vector processing, array processing
  - Systolic arrays, streaming processors
- Task Level Parallelism
  - Different “tasks/threads” can be executed in parallel
  - Multithreading
  - Multiprocessing (multi-core)

35
Task-Level Parallelism: Creating Tasks
- Partition a single problem into multiple related tasks (threads)
  - Explicitly: Parallel programming
    - Easy when tasks are natural in the problem
      - Web/database queries
    - Difficult when natural task boundaries are unclear
  - Transparently/implicitly: Thread level speculation
    - Partition a single thread speculatively
- Run many independent tasks (processes) together
  - Easy when there are many processes
    - Batch simulations, different users, cloud computing workloads
  - Does not improve the performance of a single task
36
MIMD Processing Overview
37
MIMD Processing
- Loosely coupled multiprocessors
  - No shared global memory address space
  - Multicomputer network
    - Network-based multiprocessors
  - Usually programmed via message passing
    - Explicit calls (send, receive) for communication
- Tightly coupled multiprocessors
  - Shared global memory address space
  - Traditional multiprocessing: symmetric multiprocessing (SMP)
    - Existing multi-core processors, multithreaded processors
  - Programming model similar to uniprocessors (i.e., multitasking uniprocessor) except
    - Operations on shared data require synchronization

38
Main Issues in Tightly-Coupled MP
- Shared memory synchronization
  - Locks, atomic operations
- Cache consistency
  - More commonly called cache coherence
- Ordering of memory operations
  - What should the programmer expect the hardware to provide?
- Resource sharing, contention, partitioning
- Communication: Interconnection networks
- Load imbalance
39
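The synchronization issue above can be shown with a minimal shared-memory example: concurrent updates to shared data need a lock (or an atomic operation) so that no update is lost.

```python
# Minimal illustration of shared-memory synchronization: four threads
# increment a shared counter inside a critical section guarded by a lock.
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:            # critical section around the shared counter
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock, counter is deterministically 4 * 10_000; without it,
# concurrent read-modify-write sequences could lose updates.
```

Hardware supports this with atomic read-modify-write instructions; the coherence protocol (next lecture's topic) keeps each core's cached copy of the shared data consistent.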
Aside: Hardware-based Multithreading
- Coarse grained
  - Quantum based
  - Event based (switch-on-event multithreading)
- Fine grained
  - Cycle by cycle
  - Thornton, “CDC 6600: Design of a Computer,” 1970.
  - Burton Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
- Simultaneous
  - Can dispatch instructions from multiple threads at the same time
  - Good for improving execution unit utilization
40
Parallel Speedup Example
- a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
- Assume each operation takes 1 cycle, there is no communication cost, and each operation can be executed on a different processor
- How fast is this with a single processor?
- How fast is this with 3 processors?
41
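For the single-processor case, Horner's rule (revisited a few slides later) evaluates the degree-4 polynomial with 4 multiplies and 4 adds, i.e. 8 cycles under the 1-cycle-per-operation assumption. A sketch:

```python
# Horner's rule: a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
# rewritten as (((a4*x + a3)*x + a2)*x + a1)*x + a0.
# For degree n this takes n multiplies and n adds: 8 sequential
# operations (8 cycles) for the degree-4 example above.
def horner(coeffs, x):
    """Evaluate the polynomial; coeffs given highest degree first."""
    result = coeffs[0]
    for c in coeffs[1:]:
        result = result * x + c
    return result
```

With 3 processors, independent products (powers of x, coefficient multiplies) can be computed concurrently, so the parallel version finishes in fewer cycles at the cost of more total operations.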
Speedup with 3 Processors
44
Revisiting the Single-Processor Algorithm
45
Horner, “A new method of solving numerical equations of all orders, by continuous approximation,” Philosophical Transactions of the Royal Society, 1819.
Superlinear Speedup
- Can speedup be greater than P with P processing elements?
  - Cache effects
  - Working set effects
- Happens in two ways:
  - Unfair comparisons
  - Memory effects
47
Utilization, Redundancy, Efficiency
- Traditional metrics
  - Assume all P processors are tied up for parallel computation
- Utilization: How much processing capability is used
  - U = (# operations in parallel version) / (processors x Time)
- Redundancy: How much extra work is done with parallel processing
  - R = (# operations in parallel version) / (# operations in best single-processor algorithm version)
- Efficiency
  - E = (Time with 1 processor) / (processors x Time with P processors)
  - E = U / R

48
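The three metrics can be computed directly from operation and cycle counts. The numbers below are hypothetical but loosely match the polynomial example: the best serial algorithm (Horner) takes 8 operations in 8 cycles, while an assumed parallel version does 10 operations in 5 cycles on 3 processors.

```python
# Utilization, Redundancy, Efficiency as defined above; the specific
# operation/cycle counts in the usage below are illustrative assumptions.
def utilization(parallel_ops, processors, parallel_time):
    return parallel_ops / (processors * parallel_time)

def redundancy(parallel_ops, best_serial_ops):
    return parallel_ops / best_serial_ops

def efficiency(serial_time, processors, parallel_time):
    return serial_time / (processors * parallel_time)

U = utilization(10, 3, 5)    # 10/15: two thirds of capacity is used
R = redundancy(10, 8)        # 10/8: 25% extra work done in parallel
E = efficiency(8, 3, 5)      # 8/15, which equals U / R as the slide states
```

At one operation per cycle, serial time equals the best serial operation count, which is why the identity E = U / R holds.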
Utilization of Multiprocessor
49
Caveats of Parallelism (I)
51
Amdahl’s Law
52
Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967.
Amdahl’s Law Implication 1
53
Amdahl’s Law Implication 2
54
Caveats of Parallelism (II)
- Amdahl’s Law
  - f: Parallelizable fraction of a program
  - N: Number of processors

  Speedup = 1 / ((1 - f) + f / N)

- Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967.
- Maximum speedup limited by serial portion: Serial bottleneck
- Parallel portion is usually not perfectly parallel
  - Synchronization overhead (e.g., updates to shared data)
  - Load imbalance overhead (imperfect parallelization)
  - Resource sharing overhead (contention among N processors)
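The limit Amdahl’s Law imposes is easy to evaluate numerically. A minimal sketch (the function name `amdahl_speedup` is illustrative, not from the lecture):

```c
/* Amdahl's Law: overall speedup when a fraction f of a program is
   parallelizable and the parallel part runs on n processors. */
static double amdahl_speedup(double f, int n) {
    return 1.0 / ((1.0 - f) + f / (double)n);
}
```

Even with N = 1000 processors, f = 0.90 gives a speedup of only about 9.9x: the 10% serial portion dominates.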
![Page 56: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/56.jpg)
Midterm II Results
![Page 57: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/57.jpg)
Midterm II Scores
� Average: 148 / 315
� Minimum: 58 / 315
� Maximum: 258 / 315
� Std. Dev.: 52
(Figure: “Midterm 2 Distribution”: histogram of student counts per 10-point score bin, from 0 to 310.)
![Page 58: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/58.jpg)
Midterm II Per-Question Statistics (I)
(Figures: per-question score histograms: I. Potpourri (0-60), II. Vector Processing (0-50), III. DRAM Refresh (0-60), IV. Dataflow (0-30), V. Memory Scheduling (0-60).)
![Page 59: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/59.jpg)
Midterm II Per-Question Statistics (II)
(Figures: per-question score histograms: VI. Caches and Virtual Memory (0-50), VII. Memory Hierarchy (Bonus) (0-40).)
![Page 60: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/60.jpg)
We did not cover the following slides in lecture.
These are for your preparation for the next lecture.
![Page 61: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/61.jpg)
Sequential Bottleneck
(Figure: speedup vs. f, the parallel fraction, plotted for N = 10, N = 100, and N = 1000 processors.)
![Page 62: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/62.jpg)
Why the Sequential Bottleneck?
- Parallel machines have the sequential bottleneck
- Main cause: Non-parallelizable operations on data (e.g., loops with loop-carried dependences)

  for (i = 1; i < N; i++)
      A[i] = (A[i] + A[i-1]) / 2;

- Single thread prepares data and spawns parallel tasks (usually sequential)
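To make the dependence concrete, here is a sketch contrasting the slide's recurrence, whose iterations are serialized by the read of A[i-1], with a loop whose iterations are independent. The function names are illustrative, and an integer variant is used:

```c
#define N 8

/* The slide's recurrence: iteration i reads A[i-1], which iteration
   i-1 just wrote, so iterations cannot run concurrently.
   (i starts at 1 so A[-1] is never read.) */
void smooth_serial(int A[]) {
    for (int i = 1; i < N; i++)
        A[i] = (A[i] + A[i - 1]) / 2;
}

/* No cross-iteration dependence: each A[i] depends only on B[i],
   so iterations could be distributed across processors freely. */
void scale_parallel(int A[], const int B[]) {
    for (int i = 0; i < N; i++)
        A[i] = 2 * B[i];
}
```

The second loop is "embarrassingly parallel"; the first one is a sequential bottleneck no matter how many processors are available.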
![Page 63: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/63.jpg)
Another Example of Sequential Bottleneck
![Page 64: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/64.jpg)
Bottlenecks in Parallel Portion
- Synchronization: Operations manipulating shared data cannot be parallelized
  - Locks, mutual exclusion, barrier synchronization
- Communication: Tasks may need values from each other
  - Causes thread serialization when shared data is contended
- Load Imbalance: Parallel tasks may have different lengths
  - Due to imperfect parallelization or microarchitectural effects
  - Reduces speedup in parallel portion
- Resource Contention: Parallel tasks can share hardware resources, delaying each other
  - Replicating all resources (e.g., memory) is expensive
  - Additional latency not present when each task runs alone
![Page 65: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/65.jpg)
Difficulty in Parallel Programming
- Little difficulty if parallelism is natural
  - “Embarrassingly parallel” applications
  - Multimedia, physical simulation, graphics
  - Large web servers, databases?
- Difficulty is in
  - Getting parallel programs to work correctly
  - Optimizing performance in the presence of bottlenecks
- Much of parallel computer architecture is about
  - Designing machines that overcome the sequential and parallel bottlenecks to achieve higher performance and efficiency
  - Making the programmer’s job easier in writing correct and high-performance parallel programs
![Page 66: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/66.jpg)
Cache Coherence
![Page 67: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/67.jpg)
Cache Coherence
- Basic question: If multiple processors cache the same block, how do they ensure they all see a consistent state?

(Figure: two processors P1 and P2 with private caches, connected through an interconnection network to main memory; location x in memory holds 1000.)
![Page 68: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/68.jpg)
The Cache Coherence Problem
(Figure: P1 executes “ld r2, x” and caches the value 1000 fetched from memory.)
![Page 69: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/69.jpg)
The Cache Coherence Problem
(Figure: P2 also executes “ld r2, x”; both caches now hold 1000.)
![Page 70: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/70.jpg)
The Cache Coherence Problem
(Figure: P1 executes “ld r2, x; add r1, r2, r4; st x, r1”, updating its cached copy of x to 2000; P2’s cache and main memory still hold 1000.)
![Page 71: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/71.jpg)
The Cache Coherence Problem
(Figure: P2 now executes “ld r5, x” and hits on its stale cached copy: it should NOT load 1000, yet it does. P1’s cache holds 2000 while P2’s still holds 1000.)
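The scenario on these slides can be mimicked with a toy model: two private caches that fetch x on a miss but take no coherence action on a write. All names here are illustrative, and the store is modeled as write-through to memory:

```c
/* One private cache's view of the single location x. */
typedef struct {
    int value;
    int valid;   /* 1 if this cache holds a copy of x */
} Cache;

static int mem_x = 1000;   /* main memory's copy of x */

/* Load x: a miss fetches from memory; a hit returns the local copy,
   which may be stale. */
int load_x(Cache *c) {
    if (!c->valid) { c->value = mem_x; c->valid = 1; }
    return c->value;
}

/* Store to x: updates the local copy and writes through to memory,
   but does NOT invalidate the other cache's copy (no coherence). */
void store_x(Cache *c, int v) {
    c->value = v;
    mem_x = v;
}
```

With P1 and P2 each starting invalid: both load 1000; after P1 stores 2000, P2’s next load still returns 1000 from its stale copy.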
![Page 72: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/72.jpg)
Cache Coherence: Whose Responsibility?
- Software
  - Can the programmer ensure coherence if caches are invisible to software?
  - What if the ISA provided the following instruction?
    - FLUSH-LOCAL A: Flushes/invalidates the cache block containing address A from a processor’s local cache
  - When does the programmer need to FLUSH-LOCAL an address?
  - What if the ISA provided the following instruction?
    - FLUSH-GLOBAL A: Flushes/invalidates the cache block containing address A from all other processors’ caches
  - When does the programmer need to FLUSH-GLOBAL an address?
- Hardware
  - Simplifies software’s job
  - One idea: Invalidate all other copies of block A when a processor writes to it
![Page 73: 18-447: Computer Architectureece447/s12/lib/exe/fetch...18-447: Computer Architecture Lecture 24: Runahead and Multiprocessing Prof. Onur Mutlu Carnegie Mellon University Spring 2012,](https://reader033.vdocuments.mx/reader033/viewer/2022042405/5f1e061ee8fdfd6a582ab348/html5/thumbnails/73.jpg)
Snoopy Cache Coherence
- Caches “snoop” (observe) each other’s write/read operations
- A simple protocol:
  - Write-through, no-write-allocate cache
  - Actions: PrRd, PrWr (processor read/write), BusRd, BusWr (bus read/write)

(Figure: two-state diagram per cache block, Valid and Invalid. Transitions: Invalid on PrRd issues BusRd and moves to Valid; Valid on PrRd stays Valid (PrRd/--); PrWr issues BusWr from either state; a snooped BusWr moves Valid to Invalid.)
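The two-state diagram can be sketched as transition functions. This is a hedged reading of the slide, using the standard interpretation that PrWr issues BusWr without changing the local state (write-through means memory is always updated; no-write-allocate means a write miss does not install the block):

```c
typedef enum { INVALID, VALID } BlockState;

/* Processor read (PrRd): a miss issues BusRd and the block becomes
   Valid; a hit stays Valid (PrRd/-- in the diagram). */
BlockState on_pr_rd(BlockState s) {
    (void)s;           /* either way, the block ends up Valid */
    return VALID;
}

/* Processor write (PrWr): always issues BusWr (write-through); with
   no-write-allocate, a write miss does not install the block, so the
   local state is unchanged. */
BlockState on_pr_wr(BlockState s) {
    return s;
}

/* Snooped BusWr from another cache: our copy, if any, is invalidated
   so the next PrRd refetches the up-to-date value from memory. */
BlockState on_bus_wr(BlockState s) {
    (void)s;
    return INVALID;
}
```

Because every write goes to memory and invalidates all other cached copies, no cache can ever return a stale value, which is exactly what went wrong in the earlier P1/P2 example.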