Lecture 13
Cache Storage System

CS510 Computer Architectures
4 Questions for Cache Storage Designers
• Q1: Where can a block be placed in the Cache? (Block placement)
• Q2: How is a block found if it is in the Cache? (Block identification)
• Q3: Which block should be replaced on a Cache miss? (Block replacement)
• Q4: What happens on a Cache write? (Write strategy)
Example: Alpha 21064 Data Cache

[Figure: 8-KB direct-mapped data cache with 256 blocks of 32 bytes each
(Index = 8 bits: 256 blocks = 8192/(32 bytes x 1)). The 34-bit CPU address is
split into a Block Address (Tag <21>, Index <8>) and a Block Offset <5>.
Each cache entry holds Valid <1>, Tag <21>, and Data <256>. If the Valid bit
is 1, the stored Tag is compared (=?) with the Tag from the CPU address; a
hit sends a load signal to the CPU, and a 4:1 MUX selects the word for Data
Out. Writes go through a Write Buffer to the Lower Level Memory.]
Example: Alpha 21064 Data Cache

• READ: four steps
[1] CPU sends 34-bit address to cache for Tag comparison
[2] Index selects the Tag to be compared with the Tag in the address from CPU
[3] Compare the Tags if the information in the directory is Valid
[4] If Tags match, signal CPU to load the data
These 4 steps take 2 clock cycles; instructions issued during these 2 cycles need to be stalled
• WRITE: – the first three steps + data write
Writes in Alpha 21064

• Write Through Cache
  – A write cannot complete in 2 clock cycles, but the data is written into the Write Buffer within 2 clock cycles; from there it must be stored to memory while the CPU continues working
• No-Write Merging vs. Write Merging in the write buffer
Write merging example (each buffer entry holds four words, one valid bit per word):

  No-Write Merging               Write Merging (16 sequential bytes per entry)
  Write Address  V V V V         Write Address  V V V V
  100            1 0 0 0         100            1 1 1 1
  104            1 0 0 0                        0 0 0 0
  108            1 0 0 0                        0 0 0 0
  112            1 0 0 0                        0 0 0 0

  4 writes use 4 entries         4 writes merged into a single buffer entry
Structural Hazard: Split Cache or Unified Cache?
Miss Rates for Separate Instruction Cache and Data Cache vs. Unified Cache
- Direct mapped, 32-byte blocks, SPEC92 average (instruction references are about 75%), DECstation 5000
  Size      Split Cache                       Unified Cache
            Instruction Cache   Data Cache
1 KB 3.06% 24.61% 13.34%
2 KB 2.26% 20.57% 9.78%
4 KB 1.78% 15.94% 7.24%
8 KB 1.10% 10.19% 4.57%
16 KB 0.64% 6.47% 2.87%
32 KB 0.39% 4.82% 1.99%
64 KB 0.15% 3.77% 1.35%
128 KB 0.02% 2.88% 0.95%
Example: Miss Rate and Average Access Time
• Compare 16 KB I-cache and 16 KB D-cache vs. 32 KB U-cache
• Assume
  – a hit takes 1 clock cycle and a miss takes 50 clock cycles
  – a load or a store hit takes 1 extra clock cycle on a U-cache, since there is only one cache port to satisfy two simultaneous requests (1 instruction fetch and 1 LD or ST)
  – Integer program: LD 26%, ST 9%
• Answer
– instruction accesses: 100%/(100%+26%+9%) = 75%
– data accesses: (26%+9%) /(100%+26%+9%) = 25%
– the overall miss rate for the split cache
(75% x 0.64%) + (25% x 6.47%) = 2.10%
U-cache has a slightly lower miss rate of 1.99%
Continue on the next slide
Example: Miss Rate and Average Access Time

• Average access time = % instructions x (hit time + instruction miss rate x miss penalty) + % data x (hit time + data miss rate x miss penalty)
• 16 KB split cache average access time
  AMAT_split = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50)
             = 0.990 + 1.059 = 2.05 clock cycles
• 32 KB unified cache average access time
  AMAT_unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50)
               = 1.496 + 0.749 = 2.24 clock cycles
  (the extra 1 cycle on data accesses is the stall caused by the instruction fetch using the single port)
• The split cache in this example has a better average access time than the single-ported unified cache even though its effective miss rate is higher
Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time
Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty)
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
Cache Performance
CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time

Misses per instruction = Memory accesses per instruction x Miss rate

CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time
Cache Performance

• What is the impact of two different cache organizations (DM and SA cache) on the performance of a CPU?
• Assume
– CPI with a perfect cache -- 2.0
– clock cycle time(=cache access time) -- 2 ns
– 1.3 memory references per instruction
– 64 KB cache : block size -- 32 bytes
– one cache - direct mapped, the other - two-way SA
– assume the CPU clock cycle time must be stretched 1.10 times to accommodate the selection MUX of the SA cache (2 ns x 1.1 = 2.2 ns)
– cache miss penalty -- 70 ns for either cache
– miss rate : Direct mapped -- 1.4%, SA -- 1.0%
2-way Set Associative, Address to Select Word

[Figure: two-way set-associative cache with two sets of address tags and
data RAM. The block address is split into a Tag <22> and Index <7>, plus a
block offset <5>. Each entry holds Valid <1>, Tag <22>, and Data <64>. The
index selects one entry from Set 0 and one from Set 1; both tags are
compared (=?) in parallel, and a 2:1 MUX selects the data from the matching
way. Writes go through a Write Buffer to the lower-level memory.]
Cache Performance

Answer
– Average access time = Hit time + Miss rate x Miss penalty
  • Average access time_direct = 2 + (.014 x 70) = 2.98 ns
  • Average access time_2-way = 2 x 1.10 + (.010 x 70) = 2.90 ns
– CPU performance
  CPU time = IC x (CPI_ex + Misses per instr. x Miss penalty) x Clock cycle time
  • CPU time_direct = IC x (2.0 x 2 + (1.3 x .014 x 70)) = 5.27 x IC
  • CPU time_2-way  = IC x (2.0 x 2 x 1.10 + (1.3 x .010 x 70)) = 5.31 x IC
– The direct-mapped cache leads to slightly better average performance
Improving Cache Performance
• Average memory access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Improve performance by:
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Cache Performance Improvement
Reducing Miss Rate
Reducing Misses

Classifying Misses: 3 Cs
Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses.(Misses in Infinite Cache)
Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Size of Cache)
Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses.(Misses in N-way Set Associative, Size of Cache)
3Cs Absolute Miss Rate

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for
1-way (DM), 2-way, 4-way, and 8-way caches, broken down into compulsory,
capacity, and conflict components; the conflict component shrinks as
associativity increases.]
2:1 Cache Rule on Set Associativity

2:1 Cache rule of thumb:
A direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for
1-way (DM), 2-way, 4-way, and 8-way caches, with the compulsory, capacity,
and conflict breakdown, illustrating the rule.]
3Cs Relative Miss Rate (Scaled to Direct-Mapped Miss Ratio)

[Figure: miss rate per type, scaled from 0% to 100% of the direct-mapped
miss ratio, vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and
8-way caches, broken down into compulsory, capacity, and conflict
components.]
How Can We Reduce Misses?
• Change Block Size? Which of 3Cs affected?
• Change Associativity? Which of 3Cs affected?
• Change Compiler? Which of 3Cs affected?
1. Reduce Misses via Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache
capacities of 1K, 4K, 16K, 64K, and 256K.]

For larger caches, the miss rate falls as the block size increases. But the block size should not be large relative to the cache capacity; that will instead hurt the miss rate.
2. Reduce Misses via Higher Associativity

• 2:1 Cache Rule:
  – Miss rate of a DM cache of size N is about the miss rate of a 2-way SA cache of size N/2
• Beware: execution time is the only final measure!
  – Will clock cycle time increase?
  – Hill [1988] suggested hit times increase by +10% for TTL or ECL board-level (external) caches and +2% for CMOS internal caches, for 2-way vs. 1-way
Example: AMAT vs. Miss Rate

Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the direct-mapped CCT; 32 bytes/block

  Cache      Average Memory Access Time
  Size (KB)  1-way  2-way  4-way  8-way
    1        7.56   6.60   6.22   5.44
    2        5.09   4.90   4.62   4.09
    4        4.60   3.95   3.57   3.19
    8        3.30   3.00   2.87   2.59
   16        2.45   2.20   2.12   2.04
   32        2.00   1.80   1.77   1.79
   64        1.70   1.60   1.57   1.59
  128        1.50   1.45   1.42   1.44

For small caches, AMAT is improved by more associativity; for large caches (32 KB and up, where 8-way is slightly worse than 4-way), AMAT is not improved significantly by more associativity.
3. Reducing Misses via Victim Cache

• How to combine the fast hit time of Direct Mapped yet still avoid conflict misses?
• Add a small buffer (the victim cache) to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache

[Figure: a small fully associative victim cache (Tag/Data entries, each
compared with =?) sits between the direct-mapped cache and the lower-level
memory, alongside the write buffer; blocks evicted from the cache go into
the victim cache, which is checked on a miss.]
4. Reducing Misses via Pseudo-Associativity

• How to combine the fast hit time of Direct Mapped with the lower conflict misses of a 2-way SA cache?

[Figure: access-time line showing Hit Time < Pseudo-Hit Time < Miss Penalty]

• Divide the cache: on a DM miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)
• Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
  – Better for caches not tied directly to the processor
5. Reducing Misses by HW Prefetching

• Prefetch into the cache or into an external buffer, e.g., instruction prefetching
  – Alpha 21064 fetches 2 consecutive blocks on a miss
  – The extra prefetched block is placed in an instruction stream buffer
  – On a miss, check the instruction stream buffer
• Works with data blocks too:
  – Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
  – Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set-associative caches
• Prefetching relies on extra memory bandwidth that can be used without penalty
6. Reducing Misses by Compiler Prefetching of Data

• Prefetch instructions request the data in advance
• Data Prefetch
  – Register Prefetch: load the value of data into a register (HP PA-RISC loads)
  – Cache Prefetch: load data into the cache (MIPS IV, PowerPC, SPARC v.9)
  – Special prefetching instructions cannot cause faults; a non-faulting or nonbinding prefetch doesn't change register or memory contents and cannot cause exceptions: a form of speculative execution
  – Makes sense only if a nonblocking (lockup-free) cache is used, i.e., the processor can proceed while waiting for prefetched data
• Issuing prefetch instructions takes time
  – Is the cost of issuing prefetches < the savings in reduced misses?
Big Picture

The challenge in designing memory hierarchies is that every change that potentially improves the miss rate can also negatively affect overall performance. This combination of positive and negative effects is what makes the design of a memory hierarchy challenging.

  Design change      Effect on miss rate            Possible negative performance effect
  Increase size      Decreases capacity misses      May increase access time
  Increase assoc.    Decreases miss rate due to     May increase access time
                     conflict misses
  Increase blk size  Decreases miss rate for a      May increase miss penalty
                     wide range of blk sizes
Compiler Prefetching of Data

a and b are 8 bytes long -> 2 data per cache block

for (i = 0; i < 3; i = i+1)
  for (j = 0; j < 100; j = j+1)
    a[i][j] = b[j][0] * b[j+1][0];

Analysis
– Spatial locality: exists in a but not in b
  Misses due to a: a miss on every other a[i,j] (2 data/block)
  a[0,0],a[0,1],...,a[0,99],a[1,0],a[1,1],...,a[1,99],a[2,0],...,a[2,99]
  • Total: 3 x 100 / 2 = 150 misses
– Misses due to b (ignoring potential conflict misses):
  i=0 accesses: b[0,0],b[1,0],b[1,0],b[2,0],b[2,0],b[3,0],...,b[99,0],b[99,0],b[100,0]
  • 100 misses on b[j+1,0] when i=0
  • 1 miss on b[j,0] when i=0 and j=0
  • Hits on every b access when i=1 and i=2
  • Total: 101 misses
– Total misses: 251
Compiler Prefetching of Data

Assume the miss penalty is so large that we need to prefetch at least seven iterations in advance. Split the iterations of the i-loop into i=0 (a j-loop that also fetches all of b) and the remainder (the i-loop):

for (j = 0; j < 100; j = j+1) {
  prefetch(b[j+7][0]);
  prefetch(a[0][j+7]);
  a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i = i+1)
  for (j = 0; j < 100; j = j+1) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0] * b[j+1][0];
  }

First loop: the first 7 iterations still miss before the prefetches take effect, giving 7/2 (rounded up: 4) misses on a[0,0]..a[0,6] and 7 misses on b[0,0]..b[7,0].
Second loop: 7/2 x 2 (rounded up: 4 + 4) misses on a[1,0..6] and a[2,0..6].

Total number of misses: 7/2 x 3 misses of a (= 12) + 7 misses of b = 19 misses
7. Reducing Misses by Compiler Optimizations

• Instructions
  – Reorder procedures in memory so as to reduce misses
  – Profiling to look at conflicts
  – McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks
• Data
  – Merging Arrays: improve spatial locality with a single array of compound elements vs. 2 arrays
  – Loop Interchange: change the nesting of loops to access data in the order stored in memory
  – Loop Fusion: combine 2 independent loops that have the same looping and some overlapping variables
  – Blocking: improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows
Merging Arrays Example
/* Before */
int val[SIZE];
int key[SIZE];
/* After */
struct merge {
int val;
int key;
};
struct merge merged_array[SIZE];
Reducing conflicts between val & key
Loop Interchange Example

Sequential accesses instead of striding through memory every 100 words
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Processing Sequence
  After (j inner):  x[0,0],x[0,1],...,x[0,m],x[1,0],x[1,1],...,x[1,m],...,x[n,0],x[n,1],...,x[n,m]  (spatial locality)
  Before (i inner): x[0,0],x[1,0],...,x[n,0],x[0,1],x[1,1],...,x[n,1],...,x[0,m],x[1,m],...,x[n,m]  (no spatial locality)
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

Before: 2 misses per access to a & c  vs.  After: 1 miss per access
Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }

[Figure: access pattern: x is written row by row (indices i, j), one row of
y is read repeatedly (indices i, k), and all of z is read (indices k, j).]
• Two inner loops:
  – Read all NxN elements of z[ ]
  – Read N elements of 1 row of y[ ] repeatedly
  – Write N elements of 1 row of x[ ]
• Capacity misses are a function of N and cache size:
  – If the cache can store all 3 NxN matrices => no capacity misses; possible conflict misses
• Idea: compute on BxB sub-matrix
Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }

[Figure: with blocking, x and y are accessed in BxB sub-blocks rather than
whole rows, and z in a BxB sub-block rather than the whole matrix.]

Blocking is useful to avoid capacity misses when dealing with large arrays.
• Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
• B is called the Blocking Factor; in this example B = 3
• Conflict misses too?
Reducing Conflict Misses by Blocking

Impact of conflict misses (in caches that are not fully associative) on the blocking factor:
– Lam et al [1991]: a blocking factor of 24 had one fifth the misses of 48, despite both fitting in the cache

[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a
direct-mapped cache and a fully associative cache.]

Blocking reduces capacity misses; conflict misses can also be reduced by increasing associativity, and a blocking factor smaller than what the capacity allows can also reduce conflict misses.
Summary: Compiler Optimization to Reduce Cache Misses

[Figure: performance improvement (1x to 3x) from merged arrays, loop
interchange, loop fusion, and blocking on the benchmarks compress, cholesky
(nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and
vpenta (nasa7).]
Summary

• 3 Cs: Compulsory, Capacity, Conflict misses
• Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
• Remember danger of concentrating on just one parameter when evaluating performance
Cache Performance Improvement
Reducing Miss Penalty
Reducing Miss Penalty: 1. Read Priority over Write on Miss

• Write-through caches with write buffers risk RAW conflicts between buffered writes and main-memory reads on cache misses
  – Simply waiting for the write buffer to empty might increase the read miss penalty by 50% (old MIPS 1000)
  – Instead, check the write buffer contents before the read; if there is no conflict, let the memory access continue
• Write Back?
  – Read miss replacing a dirty block
  – Normal: write the dirty block to memory, and then do the read
  – Instead: copy the dirty block to a write buffer, then do the read, and then do the write
  – The CPU stalls less since it restarts as soon as the read is done
Reducing Miss Penalty: 2. Subblock Placement

• Don't have to load the full block on a miss
• Have a valid bit per sub-block to indicate validity
• (Originally invented to reduce tag storage)

[Figure: four cache entries with tags 100, 300, 200, and 204, each divided
into four sub-blocks with one valid bit per sub-block; only the valid
sub-blocks of a block are present in the cache.]
Reducing Miss Penalty: 3. Early Restart and Critical Word First

• Don't wait for the full block to be loaded before restarting the CPU
  – Early Restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution while loading the remainder of the block
  – Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first
• Generally useful only with large blocks
• Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear how much early restart helps
Reducing Miss Penalty: 4. Non-blocking Caches to Reduce Stalls on Misses

• A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss
• “Hit under miss” reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU
• “Hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
Value of Hit under Miss for SPEC

• FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB data cache, direct mapped, 32B blocks, 16-cycle miss penalty

[Figure: average memory access time (0 to 2) under “hit under i misses” for
the transitions 0->1, 1->2, and 2->64 vs. the blocking Base, for the integer
benchmarks eqntott, espresso, xlisp, compress, mdljsp2, and ear, and the
floating-point benchmarks fpppp, tomcatv, swm256, doduc, su2cor, wave5,
mdljdp2, hydro2d, alvinn, nasa7, spice2g6, and ora.]
Reducing Miss Penalty: 5. Second Level Cache

[Figure: CPU connected to the L1 cache, backed by the L2 cache, backed by
Memory.]
Reducing Miss Penalty: 5. Second Level Cache

• L2 Equations
  AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
• Definitions:
  – Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  – Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
Second Level Cache: An Example

• Example:
  – 1000 memory references
  – 40 misses in L1
  – 20 misses in L2
• The miss rate (either local or global) for L1: 40/1000 = 4%
• The local miss rate for L2: 20/40 = 50%
• The global miss rate for L2: 20/1000 = 2%
Comparing Local and Global Miss Rates

• 32 KByte L1 cache; increasing L2 cache size
• The global miss rate is close to the single-level cache miss rate, provided L2 >> L1
• Don't use the local miss rate: it is a function of the L1 miss rate, so use the L2 global miss rate instead
• L2 is not tied to the CPU clock cycle; it affects only the L1 miss penalty, not the CPU clock cycle
• In L2 design, two questions must be considered: will it lower AMAT, and how much will it cost?
• Generally: aim for fast hit times and fewer misses
• Since L2 hits are few, the target is reducing misses

[Figure: L1 and L2 miss rates vs. L2 cache size, shown on both a linear
scale and a log scale.]
Reducing Misses: Which Apply to L2 Cache?

Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Capacity/Conf. Misses by Compiler Optimizations
The Impact of L2 Cache Associativity on the Miss Penalty

• Does set associativity make more sense for L2 caches?
  – Two-way set associativity increases hit time by 10% of a clock cycle (0.1 clock cycle)
  – Hit time_L2 for direct mapped = 10 cycles
  – Local miss rate_L2 for direct mapped = 25%
  – Local miss rate_L2 for 2-way SA = 20%
  – Miss penalty_L2 = 50 cycles
• The first-level cache miss penalty (Miss penalty_L1 = Hit time_L2 + Miss rate_L2 x Miss penalty_L2):
  Miss penalty_L1, 1-way = 10 + 25% x 50 = 22.5 clock cycles
  Miss penalty_L1, 2-way = 10.1 + 20% x 50 = 20.1 clock cycles
  Worst case (hit time must be an integral number of clocks, so round 10.1 up to 11):
  Miss penalty_L1, 2-way = 11 + 20% x 50 = 21.0 clock cycles
L2 Cache Block Size and AMAT

512KB L2 cache; memory bus = 32 bits. Long memory access time: 1 clock for sending the address, 6 clocks for accessing data, then 1 word per clock.

  Block size of L2 cache (bytes):  16    32    64    128   256   512
  Relative CPU execution time:     1.36  1.28  1.27  1.34  1.54  1.95

64 bytes is the popular size. In a small L2 cache, conflict misses would increase as the block size increases, but here the L2 capacity is large.
L2 Cache Block Size

• Multilevel Inclusion Property
  – All data in the first-level cache are always in the second-level cache
  – Consistency between I/O and caches (or between caches in multiprocessors) can be determined by checking the L2 cache
  – Drawback
    • Different block sizes between caches in different levels
      - Usually, the smaller cache has smaller cache blocks
      - Thus, different block sizes for L1 and L2
      - Still maintain the multilevel inclusion property
    • On a miss, complex invalidations; requires non-blocking secondary caches
Reducing Miss Penalty Summary

• Five techniques
  – Read priority over write on miss
  – Sub-block placement
  – Early restart and critical word first on miss
  – Non-blocking caches (hit under miss)
  – Second-level cache
• Can be applied recursively to multilevel caches
  – The danger is that the time to access DRAM will increase with multiple levels in between
Cache Performance Improvement: Fast Hit Time
Fast Hit Times: 1. Small and Simple Caches

• Smaller and simpler hardware is faster
  – Smaller cache
    • On-chip cache
    • On-chip tag, off-chip data
  – Simple cache
    • A direct-mapped cache allows data to be transmitted while the tag is checked
• The Alpha 21164 has an 8KB instruction cache and an 8KB data cache, plus a 96KB second-level cache
• Cache hit time: most of the time is spent reading the tag memory using the INDEX part of the address and comparing
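The index/tag path can be sketched for the 21064-style organization shown earlier in the lecture (34-bit address: 21-bit tag, 8-bit index, 5-bit block offset); this is a simplified behavioral model, not the actual hardware:

```python
OFFSET_BITS = 5   # 32-byte blocks
INDEX_BITS = 8    # 256 blocks, direct mapped

def split_address(addr):
    # Carve the address into (tag, index, block offset) fields.
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def lookup(cache, addr):
    # cache: 256 entries of (valid, tag, block_data).
    # The index selects the entry whose tag is compared; a hit requires
    # a valid entry with a matching tag.
    tag, index, _ = split_address(addr)
    valid, stored_tag, data = cache[index]
    return data if valid and stored_tag == tag else None
```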
Fast Hit Times: 2. Fast Hits by Avoiding Address Translation

• Send the virtual address to the cache? Called a Virtually Addressed Cache, or just Virtual Cache, vs. a Physical Cache
  – Make the common case fast: the common case is a hit, which allows eliminating the virtual address translation on hits
  – Every time the process is switched, the cache logically must be flushed; otherwise false hits occur on the same virtual address
    • Cost is the time to flush + the compulsory misses from the empty cache
  – Must deal with aliases (sometimes called synonyms): the OS as well as user programs may map two different virtual addresses to the same physical address
  – I/O (physical addresses) must interact with the cache's virtual addresses
Solutions to Cache Flush and Alias

• Solution to aliases
  – HW anti-aliasing: guarantee that every cache block has a unique physical address
  – SW: guarantee that the lower n bits of aliases are the same, with n large enough to cover the index field (for direct mapped, the index must then be unique); page coloring makes the least significant several bits of the physical and virtual addresses identical
• Solution to cache flush
  – Add a process identifier (PID) to the address tag, identifying the process as well as the address within the process: the wrong process cannot get a hit
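The PID scheme amounts to a tag comparison that also matches the process identifier; a toy model (field names are made up):

```python
def vcache_hit(entry, vtag, pid):
    # entry: (valid, pid, virtual_tag). A hit requires the PID to match
    # as well as the tag, so the wrong process can never get a false hit
    # and the cache need not be flushed on a process switch.
    valid, stored_pid, stored_tag = entry
    return valid and stored_pid == pid and stored_tag == vtag
```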
Conventional Virtually Addressed Cache

• Hit time is slow

[Diagram: CPU → (VA) → TB → (PA) → Cache → (PA) → MEM]
1. Translate the virtual address into a physical address
2. Then access the cache with the physical address
Alternative Virtually Addressed Cache

• Synonym problem
• Higher miss penalty

[Diagram: CPU → (VA) → Cache with VA tags; on a miss, VA → TB → (PA) → MEM]
1. Access the cache with the virtual address
2. If it misses, translate the virtual address for the memory access
Cache Access and VA Translation Overlapping

• Requires the cache index to remain invariant across translation, i.e., the index is part of the physical address (the page offset)

[Diagram: virtual address = (VPA, page offset); the page offset indexes the cache while the TB translates VPA → PPA in parallel; on a miss, the L2 cache is accessed with the physical address]
1. Access the cache and translate the virtual address simultaneously
2. On a miss, access the L2 cache with the physical address
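The invariance condition can be stated numerically: the index and block-offset bits must fit within the page offset, which translation leaves unchanged. A sketch with illustrative parameters:

```python
import math

def can_overlap(cache_bytes, block_bytes, assoc, page_bytes):
    # The index selects a set; index + block-offset bits must lie inside
    # the page offset to be usable before translation completes.
    sets = cache_bytes // (block_bytes * assoc)
    index_bits = int(math.log2(sets))
    offset_bits = int(math.log2(block_bytes))
    return index_bits + offset_bits <= int(math.log2(page_bytes))

# 8KB direct-mapped cache, 32B blocks, 8KB pages: 8 + 5 <= 13, overlap works.
# 16KB direct mapped, same page size: 9 + 5 > 13, overlap fails.
# 16KB 2-way: the extra way keeps the index at 8 bits, so overlap works again.
```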
2. Avoiding Translation: Process ID Impact

[Figure: miss rate (0%-20%) vs. cache size (2KB-1024KB) for a direct-mapped cache with 16-byte blocks, Ultrix running on a VAX:
• Green: uniprocess, without process switches
• Red: multiprocess, process switches using PIDs instead of flushing the cache
• Blue: multiprocess, process switches without PIDs, simply flushing the cache]
2. Avoiding Translation: Index with Physical Portion of Address

• If the INDEX is a physical part of the address, the tag access can start in parallel with address translation, so the comparison is against the physical tag
• This limits the cache size to the page size: what if we want a bigger cache using the same trick?
  – Higher associativity: a larger cache with more ways keeps the index within the page offset
  – Page coloring: the OS makes the last few bits of the virtual and physical addresses identical, so only the remaining page-offset bits need to match

[Address layout, bits 31..0: Page Address = Address Tag (bits 31..12); Page Offset (bits 11..0) = Index + Block Offset. Address translation operates on the page address while the tag access uses the page offset.]
Fast Hit Times: 3. Fast Hit Times via Pipelined Writes

• Pipeline the tag check and the cache update as separate stages: the current write's tag check overlaps the previous write's cache update
• Only writes use the pipeline; it is empty during a miss

[Diagram: a delayed write buffer holds the previous write between its tag check and its data update; the buffer must be checked on reads, which either complete the pending write or read the data from the buffer]
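The delayed write buffer can be modeled as a one-entry stage between tag check and data update; this is a behavioral sketch, not the hardware:

```python
class PipelinedWriteCache:
    """Tag check of the current write overlaps the data update of the
    previous write, which waits in a one-entry delayed write buffer."""

    def __init__(self):
        self.data = {}        # the cache data array (addr -> value)
        self.delayed = None   # previous write whose tag check passed

    def write(self, addr, value):
        if self.delayed is not None:
            # Stage 2: drain the previous write into the data array
            # while the current write is (conceptually) tag-checked.
            prev_addr, prev_val = self.delayed
            self.data[prev_addr] = prev_val
        self.delayed = (addr, value)   # stage 1: current write waits

    def read(self, addr):
        # Reads must check the delayed write buffer first.
        if self.delayed is not None and self.delayed[0] == addr:
            return self.delayed[1]
        return self.data.get(addr)
```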
Fast Hit Times: 4. Fast Writes on Misses via Small Sub-blocks

• If most writes are 1 word, the sub-block size is 1 word, and the cache is write-through, then always write the sub-block and tag immediately
  – Tag match and valid bit already set: writing the block was proper, and nothing is lost by setting the valid bit on again
  – Tag match and valid bit not set: the tag match means this is the proper block; writing the data into the sub-block makes it appropriate to turn the valid bit on
  – Tag mismatch: this is a miss that will modify the data portion of the block. As this is a write-through cache, however, no harm is done; memory still has an up-to-date copy of the old value. Only the tag (set to the address of the write) and the valid bits of the other sub-blocks need to be changed, because the valid bit for this sub-block has already been set
• Doesn't work with write-back, due to the last case
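The three cases collapse to a short rule: on a tag mismatch, claim the block and clear the other sub-blocks' valid bits; then write the sub-block and set its valid bit. A sketch for a write-through cache (field names are hypothetical):

```python
def write_subblock(block, tag, sub, value):
    # block: {'tag': int, 'valid': [bool] * n_sub, 'data': [word] * n_sub}
    if block['tag'] != tag:
        # Tag mismatch: safe to take over the block, since write-through
        # keeps memory up to date; invalidate the other sub-blocks.
        block['tag'] = tag
        block['valid'] = [False] * len(block['valid'])
    # Tag match (or freshly claimed block): write data, set the valid bit.
    block['data'][sub] = value
    block['valid'][sub] = True
```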
Cache Optimization Summary

(MR = miss rate, MP = miss penalty, HT = hit time; + improves the factor, - hurts it)

Technique                              MR   MP   HT   Complexity
----------------------------------------------------------------
Larger Block Size                      +    -         0
Higher Associativity                   +         -    1
Victim Caches                          +              2
Pseudo-Associative Caches              +              2
HW Prefetching of Instr/Data           +              2
Compiler Controlled Prefetching        +              3
Compiler Reduce Misses                 +              0
Priority to Read Miss over Write Miss       +         1
Sub-block Placement                         +    +    1
Early Restart & Critical Word 1st           +         2
Non-Blocking Caches                         +         3
Second-Level Caches                         +         2
Small & Simple Cache                   -         +    0
Avoiding Address Translation                     +    2
Pipelining Writes for Fast Write Hit             +    1
What is the Impact of What You've Learned About Caches?

• 1960-1985: Speed = f(no. of operations)
• 1995:
  – Pipelined execution & fast clock rate
  – Out-of-order completion
  – Superscalar instruction issue
• 1995: Speed = f(non-cached memory accesses)
• What does this mean for compilers? Operating systems? Algorithms? Data structures?

[Figure: processor-memory performance gap, 1980-2000 (log scale, 1 to 1000): CPU performance has grown far faster than DRAM performance]
Cross Cutting Issues

• Parallel execution vs. cache locality
  – Want wide separation to find independent operations, vs. wanting reuse of data accesses to avoid misses
• I/O and consistency of data between cache and memory
  – Caches => multiple copies of data
  – Consistency by HW or by SW?
  – Where to connect I/O to the computer?
Alpha 21064

• Separate instruction & data TLBs and caches
• TLBs fully associative
• TLB updates in SW ("Priv Arch Libr")
• Caches 8KB, direct mapped
• Critical 8 bytes first
• Prefetch instruction stream buffer
• 2MB L2 cache, direct mapped (off-chip)
• 256-bit path to main memory, 4 x 64-bit modules
Alpha Memory Performance: Miss Rates

[Figure: miss rates (log scale, 0.01% to 100%) of the 8KB I-cache, 8KB D-cache, and 2MB L2 cache across benchmarks: AlphaSort, TPC-B (db2), TPC-B (db1), Espresso, Li, Eqntott, Sc, Gcc, Compress, Mdljsp2, Ora, Fpppp, Ear, Swm256, Doduc, Alvinn, Tomcatv, Wave5, Mdljp2, Hydro2d, Spice, Nasa7, Su2cor]
Alpha CPI Components

• Primary limitations of performance
  – Instruction stall: branch mispredict
  – Other: compute + register conflicts, structural conflicts

[Figure: CPI (0.00-4.50) broken into L2, I$, D$, instruction-stall, and other components per benchmark (AlphaSort, TPC-B (db2), TPC-B (db1), Espresso, Li, Eqntott, Sc, Gcc, Compress, Mdljsp2, Ora, Fpppp, Ear, Swm256, Doduc, Alvinn, Tomcatv, Wave5, Mdljp2, Hydro2d); the commercial workloads show the largest memory-related CPI]
Pitfall: Predicting Cache Performance from Different Pgm (ISA, compiler, ...)

• 4KB data cache: miss rate 8%, 12%, or 28%?
• 1KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for 8KB data cache: 17% vs. 10%

[Figure: D-cache and I-cache miss rates (0%-35%) vs. cache size (1KB-128KB) for tomcatv, gcc, and espresso: the miss rate varies widely depending on the program]
Pitfall: Simulating Too Small an Address Trace

[Figure: cumulative average memory access time (1 to 4.5) vs. instructions executed (0-12 billion) for SOR (FORTRAN), Tree (Scheme), Multi (a multiprogrammed workload), and TV (Pascal): the cumulative AMAT keeps shifting for billions of instructions, so short traces can be misleading]