TRANSCRIPT
Memory Subsystem Design
or
Nothing Beats Cold, Hard Cache
Print out the two PI questions on cache diagrams and bring copies to class for students to work on. Reading: 5.3, 5.4
Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
The memory subsystem
[Diagram: a computer comprises a datapath, control, memory, input, and output.]
Movie Rental Store
• You have a huge warehouse with EVERY movie ever made (hits, training films, etc.).
• Getting a movie from the warehouse takes 15 minutes.
• You can’t stay in business if every rental takes 15 minutes.
• You have some small shelves in the front office.
[Diagram: small front office in front of a large warehouse.]
Think for a bit about what you might do to improve this (on your own).
Here are some suggested improvements to the store:
1. Whenever someone rents a movie, just keep it in the front office for a while in case someone else wants to rent it.
2. Watch the trends in movie watching and attempt to guess movies that will be rented soon – put those in the front office.
3. Whenever someone rents a movie in a series (Star Wars), grab the other movies in the series and put them in the front office.
4. Buy motorcycles to ride in the warehouse to get the movies faster.
Extending the analogy to locality in caches, which pair of changes most closely corresponds to spatial and temporal locality?
Selection Spatial Temporal
A 2 1
B 4 2
C 4 3
D 3 1
E None of the above
Memory Locality
• Memory hierarchies take advantage of memory locality.
• Memory locality is the principle that future memory accesses are near past accesses.
• Memories take advantage of two types of locality:
– temporal locality: near in time => we will often access the same data again very soon
– spatial locality: near in space/distance => our next access is often very close to our last access (or recent accesses)
(this sequence of addresses exhibits both temporal and spatial locality)
1,2,3,1,2,3,8,8,47,9,10,8,8...
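To make the two kinds concrete, here is a minimal C sketch (illustrative only; the array size and trip counts are arbitrary, not from the slides). The first loop leans on spatial locality, the second on temporal locality:

#include <stdio.h>

int main(void) {
    int a[1024];
    long sum = 0;

    /* Spatial locality: consecutive elements share a cache line,
       so after one miss the next several accesses hit. */
    for (int i = 0; i < 1024; i++)
        a[i] = i;

    /* Temporal locality: the same eight words are re-read on every
       pass, so they stay resident in the cache. */
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i < 8; i++)
            sum += a[i];

    printf("%ld\n", sum);
    return 0;
}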
From the book we know SRAM is very fast, expensive ($/GB), and small. We also know disks are slow, inexpensive ($/GB), and large. Which statement best describes the role of caching when it works?
Selection Role of caching
A Locality allows us to keep frequently touched data in SRAM.
B Locality allows us the illusion of memory as fast as SRAM but as large as a disk.
C SRAM is too expensive to have large – so it must be small and caching helps use it well.
D Disks are too slow – we have to have something faster for our processor to access.
E None of these accurately describes the role of caching.
Locality and caching
• Memory hierarchies exploit locality by caching (keeping close to the processor) data likely to be used again.
• This is done because we can build large, slow memories and small, fast memories, but we can’t build large, fast memories.
SRAM access times are 0.5 – 2.5ns at cost of $2000 to $5000 per GB.
DRAM access times are 60-120ns at cost of $20 to $75 per GB.
Disk access times are 5 to 20 million ns at cost of $0.20 to $2 per GB.
A typical memory hierarchy
[Diagram: CPU → on-chip cache(s) → off-chip cache → main memory → disk. Levels close to the CPU are small, fast, and expensive ($/bit); levels farther away are big, slow, and cheap ($/bit).]
• So then, where is my program and data?
Cache Fundamentals
• cache hit -- an access where the data is found in the cache.
• cache miss -- an access where it isn't.
• hit time -- time to access the cache.
• miss penalty -- time to move data from a farther level to a closer one, then to the CPU.
• hit ratio -- percentage of accesses where the data is found in the cache.
• miss ratio -- (1 - hit ratio)
[Diagram: CPU ↔ lowest-level cache ↔ next-level memory/cache.]
Cache Fundamentals, cont.
• cache block size or cache line size -- the amount of data that gets transferred on a cache miss.
• instruction cache -- cache that only holds instructions.
• data cache -- cache that only caches data.
• unified cache -- cache that holds both.
Caching Issues
On a memory access -
• How do I know if this is a hit or miss?
On a cache miss -
• where to put the new data?
• what data to throw out?
• how to remember what data this is?
A simple cache
• A cache that can put a line of data anywhere is called ______________
• The most popular replacement strategy is LRU (least recently used).
tag data
(the tag identifies the address of the cached data)
4 blocks, each block holds one word, any block can hold any word.
address string:
4    00000100
8    00001000
12   00001100
4    00000100
8    00001000
20   00010100
4    00000100
8    00001000
20   00010100
24   00011000
12   00001100
8    00001000
4    00000100
Fully associative
Point out that the tag identifies the address (a pointer) and the data is the value.
Fully Associative Cache
tag data
4 blocks, each block holds one word, any block can hold any word.
addresses:
4    00 00 01 00
8    00 00 10 00
12   00 00 11 00
4    00 00 01 00
8    00 00 10 00
20   00 01 01 00
4    00 00 01 00
8    00 00 10 00
20   00 01 01 00
24   00 01 10 00
12   00 00 11 00
8    00 00 10 00
4    00 00 01 00
A simpler cache
• A cache that can put a line of data in exactly one place is called __________________.
• Advantages/disadvantages vs. fully-associative?
(an index is used to determine which line an address might be found in)
4 blocks, each block holds one word, each word in memory maps to exactly one cache location.
address string:
4    00000100
8    00001000
12   00001100
4    00000100
8    00001000
20   00010100
4    00000100
8    00001000
20   00010100
24   00011000
12   00001100
8    00001000
4    00000100
example address: 00000100
tag data
Direct Mapped
Direct Mapped Cache
tag data
4 blocks, each block holds one word, each word in memory maps to exactly one cache location.
addresses: 4 8 12 4 8 20 4 8 20 24 12 8 4
4    00 00 01 00
8    00 00 10 00
12   00 00 11 00
4    00 00 01 00
8    00 00 10 00
20   00 01 01 00
4    00 00 01 00
8    00 00 10 00
20   00 01 01 00
24   00 01 10 00
12   00 00 11 00
8    00 00 10 00
4    00 00 01 00
cache indices: 00, 01, 10, 11

Selection  Hit/miss pattern
A  MMMHHMHHHMMHM
B  MMMHHMMHHMHHH
C  MMMHHMMHMMHMM
D  MHMHHMHHHMHHM
E  None are correct
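A sketch for checking such patterns mechanically (mine, not the slides'): a direct-mapped cache with 4 one-word blocks has 2 offset bits and 2 index bits, and the C program below prints H or M for each address in the string above.

#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 4
#define OFFSET_BITS 2   /* one 4-byte word per block */
#define INDEX_BITS 2    /* 4 blocks, direct mapped */

int main(void) {
    unsigned tags[NUM_BLOCKS] = {0};
    bool valid[NUM_BLOCKS] = {false};
    unsigned addrs[] = {4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4};
    int n = sizeof addrs / sizeof addrs[0];

    for (int i = 0; i < n; i++) {
        unsigned index = (addrs[i] >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        unsigned tag = addrs[i] >> (OFFSET_BITS + INDEX_BITS);
        if (valid[index] && tags[index] == tag) {
            printf("%2u H\n", addrs[i]);
        } else {
            printf("%2u M\n", addrs[i]);  /* miss: fill this block */
            valid[index] = true;
            tags[index] = tag;
        }
    }
    return 0;
}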
An n-way set-associative cache
• A cache that can put a line of data in exactly n places is called n-way set-associative.
• The cache lines/blocks that share the same index are a cache ____________.
tag data
4 entries, each block holds one word, each word in memory maps to one of a set of n cache lines.
address string:
4    00000100
8    00001000
12   00001100
4    00000100
8    00001000
20   00010100
4    00000100
8    00001000
20   00010100
24   00011000
12   00001100
8    00001000
4    00000100
example address: 00000100
tag data
Set
2-way Set Associative Cache
tag data | tag data
4 entries, each block holds one word, each word in memory maps to one of a set of n cache lines.
addresses: 4 8 12 4 8 20 4 8 20 24 12 8 4
4    00 000 1 00
8    00 001 0 00
12   00 001 1 00
4    00 000 1 00
8    00 001 0 00
20   00 010 1 00
4    00 000 1 00
8    00 001 0 00
20   00 010 1 00
24   00 011 0 00
12   00 001 1 00
8    00 001 0 00
4    00 000 1 00
set indices: 0, 1

Selection  Hit/miss pattern
A  MMMHHMMHHMMHM
B  MMMHHMHHHMMHM
C  MMMHHMMHMMHMM
D  MHMHHMHHHMHHM
E  None are correct
Longer Cache Blocks
• Large cache blocks take advantage of spatial locality.
• Too large of a block size can waste cache space.
• Longer cache blocks require less tag space.
tag data
DM, 4 blocks, each block holds two words, each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches).
address string:
4    00000100
8    00001000
12   00001100
4    00000100
8    00001000
20   00010100
4    00000100
8    00001000
20   00010100
24   00011000
12   00001100
8    00001000
4    00000100
example address: 00000100
Longer Cache Blocks
tag data
DM, 4 blocks, each block holds two words, each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches).
addresses: 4 8 12 4 8 20 4 8 20 24 12 8 4
4    00 0 00 100
8    00 0 01 000
12   00 0 01 100
4    00 0 00 100
8    00 0 01 000
20   00 0 10 100
4    00 0 00 100
8    00 0 01 000
20   00 0 10 100
24   00 0 11 000
12   00 0 01 100
8    00 0 01 000
4    00 0 00 100
Cache Parameters
Cache size = Number of sets * block size * associativity
• 128 blocks, 32-byte block size, direct mapped, size = ?
  2^7 * 2^5 = 2^12 bytes = 4 KB
• 128 KB cache, 64-byte blocks, 512 sets, associativity = ?
  2^17 / 2^6 = 2^11 blocks; 2^11 blocks / 2^9 sets = 2^2 = 4-way
Draw it
Equations (all "sizes" are in bytes):
1. log2(block_size)
2. log2(cache_size / (assoc * block_size))
3. 32 - log2(cache_size / assoc)
Selection # tag bits # index bits # block offset bits
A 3 2 1
B 1 2 3
C 1 3 2
D 2 1 3
E None of the above
Descriptions of caches:
1. Exceptional usage of the cache space in exchange for a slow hit time
2. Poor usage of the cache space in exchange for an excellent hit time
3. Reasonable usage of cache space in exchange for a reasonable hit time
Selection  Fully-Associative  8-way Set-Associative  Direct Mapped
A 3 2 1
B 3 3 2
C 1 2 3
D 3 2 1
E None of the above
Cache Associativity
[Figure omitted.]
Block Size and Miss Rate
[Figure omitted.]
Handling a Cache Access
1. Use index and tag to access cache and determine hit/miss.
2. If hit, return requested data.
3. If miss, select a cache block to be replaced, and access memory or next lower cache (possibly stalling the processor).
-load entire missed cache line into cache
-return requested data to CPU (or higher cache)
4. If the next lower memory is a cache, go to step 1 for that cache.
[Pipeline diagram: IF ID EX MEM WB, with the instruction cache (ICache) accessed in IF and the data cache (DCache) in MEM.]
Accessing a Sample Cache
• 64 KB cache, direct-mapped, 32-byte cache block size
Address (32 bits): tag = bits 31-16 (16 bits), index = bits 15-5 (11 bits), block offset = bits 4-0 (5 bits).
64 KB / 32 bytes = 2K cache blocks/sets, so the cache has rows 0 through 2047, each holding a valid bit, a 16-bit tag, and 256 bits (32 bytes) of data.
The 11-bit index selects a row; an equality compare (=) of the stored tag against the address tag, qualified by the valid bit, produces hit/miss, while the block offset selects the 32-bit word within the block.
Point out the valid bit – show that the data can be grabbed in parallel with the tag compare.
Accessing a Sample Cache
• 32 KB cache, 2-way set-associative, 16-byte block size
Address (32 bits): tag = bits 31-14 (18 bits), index = bits 13-4 (10 bits), block offset = bits 3-0 (4 bits).
32 KB / 16 bytes / 2 = 1K cache sets, rows 0 through 1023, each holding two ways: (valid, tag, data) per way.
The 10-bit index selects a set; both ways' 18-bit tags are compared (=) against the address tag in parallel, and a hit in either valid way returns that way's data.
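A possible C rendering of the address split used in both sample caches (the OFFSET_BITS/INDEX_BITS values below are the ones from this 32 KB, 2-way, 16-byte-block example; the example address is arbitrary):

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 4   /* 16-byte blocks */
#define INDEX_BITS 10   /* 1K sets */

int main(void) {
    uint32_t addr = 0x12345678;  /* arbitrary example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
    return 0;
}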
for (int i = 0; i < 10000000; i++)
    sum += A[i];
Assume each element of A is 4 bytes and sum is kept in a register. Assume a baseline direct-mapped 32KB L1 cache with 32 byte blocks. Which changes would help the hit rate of the above code?
Selection Change
A Increase to 2-way set associativity
B Increase block size to 64 bytes
C Increase cache size to 64 KB
D A and C combined
E A, B, and C combined
Isomorphic
for (int i = 0; i < 10000000; i++)
    for (int j = 0; j < 8192; j++)
        sum += A[j] - B[j];
Assume each element of A and B is 4 bytes and each array is at least 32 KB in size. Assume sum is kept in a register. Assume a baseline direct-mapped 32KB L1 cache with 32-byte blocks. Which changes would help the hit rate of the above code?
Selection Change
A Increase to 2-way set associativity
B Increase block size to 64 bytes
C Increase cache size to 64 KB
D A and C combined
E A, B, and C combined
Assume a 1KB cache with 64-byte blocks. Assume the following byte addresses are repeatedly accessed in a loop.
Selection Address Stream
A 1
B 2
C 3
D None of the above, 2-way always has a better HR than DM
E None of the above
Stream 1:
000000 0000 000000
000000 0100 000100
000001 0000 000000

Stream 2:
000000 0000 000000
000000 1000 000100
000001 0000 000000

Stream 3:
000000 0000 000000
000000 0100 000100
000001 1000 000000
*The addresses above are broken up (bitwise) for a DM Cache.
For which of the address streams above does a 2-way set-associative cache (same size cache, same block size) suffer a worse hit rate than a DM cache?
Stream 1: 33% DM, 100% 2-way
Stream 2: 33% DM, 0% 2-way
Stream 3: 100% DM, 100% 2-way
Dealing with Stores
• Stores must be handled differently than loads, because...
– they don't necessarily require the CPU to stall.
– they change the content of cache/memory (creating memory consistency issues).
– they may require both a load and a store to complete.
Draw value in cache vs. not in cache
There have been a number of issues glossed over – we’ll cover those now
Policy decisions for stores
• Keep memory and cache identical?
– Write-through => all writes go to both cache and main memory.
– Write-back => writes go only to cache. Modified cache lines are written back to memory when the line is replaced.
• Make room in cache for store miss?
– Write-allocate => on a store miss, bring the written line into the cache.
– Write-around => on a store miss, ignore the cache.
Store Policies
• Given either high store locality or low store locality, which policies might you expect to find?
Selection  High Locality (Miss / Hit)       Low Locality (Miss / Hit)
A          Write-allocate / Write-through   Write-around / Write-back
B          Write-around / Write-through     Write-allocate / Write-back
C          Write-allocate / Write-back      Write-around / Write-through
D          Write-around / Write-back        Write-allocate / Write-through
E None of the above
Dealing with stores
• On a store hit, write the new data to cache. In a write-through cache, write the data immediately to memory. In a write-back cache, mark the line as dirty.
• On a store miss, initiate a cache block load from memory for a write-allocate cache. Write directly to memory for a write-around cache.
• On any kind of cache miss in a write-back cache, if the line to be replaced in the cache is dirty, write it back to memory.
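As a summary, here is a small runnable C sketch of these rules (illustrative only: a one-block "cache" with made-up helper state, not a real design):

#include <stdio.h>
#include <stdbool.h>

typedef enum { WRITE_THROUGH, WRITE_BACK } hit_policy;
typedef enum { WRITE_ALLOCATE, WRITE_AROUND } miss_policy;

static unsigned mem[256];                  /* "main memory" */
static unsigned cache_addr, cache_data;    /* a one-block cache */
static bool cache_valid, cache_dirty;

static void store(unsigned addr, unsigned data, hit_policy hp, miss_policy mp) {
    if (cache_valid && cache_addr == addr) {     /* store hit */
        cache_data = data;
        if (hp == WRITE_THROUGH)
            mem[addr] = data;                    /* keep memory identical */
        else
            cache_dirty = true;                  /* defer write to eviction */
    } else if (mp == WRITE_ALLOCATE) {           /* store miss: bring line in */
        if (cache_valid && cache_dirty)
            mem[cache_addr] = cache_data;        /* write back dirty victim */
        cache_valid = true;
        cache_dirty = false;
        cache_addr = addr;
        cache_data = mem[addr];                  /* load the missed line */
        store(addr, data, hp, mp);               /* retry: now a hit */
    } else {                                     /* write-around: bypass cache */
        mem[addr] = data;
    }
}

int main(void) {
    store(5, 42, WRITE_BACK, WRITE_ALLOCATE);
    printf("cache=%u mem=%u (write-back leaves memory stale)\n",
           cache_data, mem[5]);
    return 0;
}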
Cache Performance
CPI = BCPI + MCPI
– BCPI = base CPI, which means the CPI assuming perfect memory (BCPI = peak CPI + PSPI + BSPI)
  PSPI => pipeline stalls per instruction
  BSPI => branch hazard stalls per instruction
– MCPI = the memory CPI, the number of cycles (per instruction) the processor is stalled waiting for memory.
  MCPI = accesses/instruction * miss rate * miss penalty
  – this assumes we stall the pipeline on both read and write misses, that the miss penalty is the same for both, and that cache hits require no stalls.
  – If the miss penalty or miss rate is different for the instruction cache and data cache (the common case), then
    MCPI = I$ accesses/inst * I$MR * I$MP + D$ accesses/inst * D$MR * D$MP
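As a quick check of the split-cache formula, a few lines of C (the numbers are those of the worked example on the next slide):

#include <stdio.h>

int main(void) {
    double bcpi  = 1.0;            /* base CPI, perfect memory */
    double i_acc = 1.0;            /* I-cache accesses per instruction */
    double i_mr  = 0.04, i_mp = 12.0;
    double d_acc = 0.20;           /* loads/stores per instruction */
    double d_mr  = 0.10, d_mp = 12.0;

    double mcpi = i_acc * i_mr * i_mp + d_acc * d_mr * d_mp;
    printf("CPI = %.2f\n", bcpi + mcpi);   /* prints CPI = 1.72 */
    return 0;
}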
Cache Performance
• Instruction cache miss rate of 4%, data cache miss rate of 10%, BCPI = 1.0 (no data or control hazards), 20% of instructions are loads and stores, miss penalty = 12 cycles, CPI = ?
CPI = 1 + accesses/inst * miss rate * miss penalty
CPI = 1 + (1.0)*0.04*12 + 0.2*0.10*12 = 1 + 0.48 + 0.24 = 1.72
Selection CPI (rounded if necessary)
A 1.24
B 1.34
C 1.48
D 1.72
E None of the above
Example -- DEC Alpha 21164 Caches
[Diagram: 21164 CPU core with on-chip instruction cache and data cache, a unified on-chip L2 cache, and an off-chip L3 cache.]
• ICache and DCache -- 8 KB, DM, 32-byte lines
• L2 cache -- 96 KB, ?-way SA, 32-byte lines
• L3 cache -- 1 MB, DM, 32-byte lines
Cache Alignment
• The data that gets moved into the cache on a miss are all data whose addresses share the same tag and index (regardless of which data gets accessed first).
• This results in
– no overlap of cache lines
– easy mapping of addresses to cache lines (no additions)
– data at address X always being present in the same location in the cache block (at byte X mod blocksize) if it is there at all.
• Think of main memory as organized into cache-line sized pieces (because in reality, it is!).
[Diagram: a memory address split into tag, index, and block offset; main memory drawn as cache-line-sized pieces numbered 0, 1, 2, 3, ...]
Three types of cache misses
• Compulsory (or cold-start) misses
– first access to the data.
• Capacity misses
– we missed only because the cache isn't big enough.
• Conflict misses
– we missed because the data maps to the same line as other data that forced it out of the cache.
tag data
address string:
4    00000100
8    00001000
12   00001100
4    00000100
8    00001000
20   00010100
4    00000100
8    00001000
20   00010100
24   00011000
12   00001100
8    00001000
4    00000100
DM cache
Reading Quiz Variant
• Suppose you experience a cache miss on a block (let's call it block A). You have accessed block A in the past. There have been precisely 1027 different blocks accessed between your last access to block A and your current miss. Your block size is 32-bytes and you have a 64KB cache. What kind of miss was this?
Explain the way to know if it is a capacity vs. conflict miss – would a fully associative cache of the same size get a miss?
Selection Cache Miss
A Compulsory
B Capacity
C Conflict
D Both Capacity and Conflict
E None of the above
So, then, how do we decrease...
• Compulsory misses? Block size, prefetch.
• Capacity misses? Increase cache size.
• Conflict misses? Increase associativity.
Cache Miss Components
[Figure omitted: miss rate broken into capacity misses and conflict misses for one-way, two-way, and four-way caches.]
LRU replacement algorithms
• only needed for associative caches
• requires one bit for 2-way set-associative, 8 bits for 4-way, 24 bits for 8-way.
• can be emulated with log n bits (NMRU)
• can be emulated with use bits for highly associative caches (like page tables)
• However, for most caches (e.g., associativity <= 8), LRU is calculated exactly.
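For the 2-way case, the single LRU bit per set is easy to sketch in C (names below are illustrative, not from the slides):

#include <stdio.h>

#define NUM_SETS 4
static int lru[NUM_SETS];   /* per set: the way (0 or 1) used least recently */

/* On any access to a way, the OTHER way becomes least recently used. */
static void touch(int set, int way) { lru[set] = 1 - way; }

/* On a miss, replace the least-recently-used way. */
static int victim(int set) { return lru[set]; }

int main(void) {
    touch(2, 0);                              /* access way 0 of set 2 */
    printf("replace way %d\n", victim(2));    /* prints: replace way 1 */
    return 0;
}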
Caches in Current Processors
• A few years ago, caches were DM at the highest level (closest to the CPU) and associative farther away; this is less true today. Now they are less associative near the processor (4-8 way) and more associative farther away (8-16 way).
• split I and D close to the processor (for throughput rather than miss rate), unified further away.
• write-through and write-back both common, but never write-through all the way to memory.
• 64-byte cache lines common (but getting larger)
• Non-blocking – the processor doesn't stall on a miss, but only on the use of the missed data (if even then)
– this means the cache must be able to keep track of multiple outstanding accesses.
Prefetching
• “Watch the trends in movie watching and attempt to guess movies that will be rented soon – put those in the front office.”
• Hardware Prefetching
– suppose you are reading a single field from each element of an array of large objects
– hardware determines the "stride" and starts grabbing values early
• Software Prefetching
– issue a load into $0 a fair number of instructions before the value is needed
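In C on GCC/Clang, software prefetching is usually expressed with the __builtin_prefetch intrinsic rather than a literal load into $0 (which is MIPS-specific). A hedged sketch, with a guessed prefetch distance that would need tuning:

#include <stdio.h>

long sum_with_prefetch(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64]);  /* hint: fetch ahead of use */
        sum += a[i];
    }
    return sum;
}

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++) a[i] = 1;
    printf("%ld\n", sum_with_prefetch(a, 1024));
    return 0;
}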
Writing Cache-Aware Code
• Focus on your working set.
• If your "working set" fits in L1 it will be vastly better than a "working set" that fits only on disk.
• If you have a large data set, do processing on it in chunks (see the sketch below).
• Think about regularity in data structures (can a prefetcher guess where you are going, or are you pointer chasing?).
• Instrumentation tools (PIN, Atom, PEBIL) can often help you analyze your working set.
• Profiling can give you an idea of which section of code is dominant, which can tell you where to focus.
HW – matrix example
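One way to read "do processing in chunks", sketched as a blocked matrix transpose (the matrix size and BLOCK tile are illustrative guesses, not course values): each BLOCK x BLOCK tile of both arrays fits in cache, so lines are reused before being evicted.

#include <stdio.h>

#define N 512
#define BLOCK 32   /* tile chosen so a BLOCK x BLOCK chunk fits in L1; tune it */

static double a[N][N], b[N][N];

/* Blocked transpose: touches b in cache-line-sized tiles instead of
   striding through whole columns. */
void transpose_blocked(void) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    b[j][i] = a[i][j];
}

int main(void) {
    a[1][2] = 3.0;
    transpose_blocked();
    printf("%g\n", b[2][1]);  /* prints 3 */
    return 0;
}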
Working Set Size
[Figure: hit rate (60% to 100%) vs. cache size (64 KB, 256 KB, 4096 KB; Nehalem) for SPEC benchmarks astar, bwaves, bzip2, calculix, gamess, gemsFDTD, gobmk, gromacs, h264ref, hmmer, lbm, leslie3d, libquantum, mcf, milc, namd, sjeng, soplex, wrf, xalan, and zeusmp.]
Key Points
• Caches give the illusion of a large, cheap memory with the access time of a fast, expensive memory.
• Caches take advantage of memory locality, specifically temporal locality and spatial locality.
• Cache design presents many options (block size, cache size, associativity, write policy) that an architect must combine to minimize miss rate and access time to maximize performance.