Lecture 13
Cache Storage System

CS510 Computer Architectures
4 Questions for Cache Storage Designers
• Q1: Where can a block be placed in the Cache? (Block placement)
• Q2: How is a block found if it is in the Cache? (Block identification)
• Q3: Which block should be replaced on a Cache miss? (Block replacement)
• Q4: What happens on a Cache write? (Write strategy)
Example: Alpha 21064 Data Cache

[Figure: 8-KB direct-mapped data cache with 256 blocks of 32 bytes each
(Index = 8 bits: 256 blocks = 8192/(32 bytes x 1)). The 34-bit CPU address is
split into a Block Address (Tag <21>, Index <8>) and a Block Offset <5>.
Each cache entry holds Valid <1>, Tag <21>, and Data <256>. If the Valid bit
is 1, the stored Tag is compared (=?) with the Tag from the CPU address; a
hit sends a load signal to the CPU, and a 4:1 MUX selects the word for Data
Out. Writes go through a Write Buffer to the Lower Level Memory.]
Example: Alpha 21064 Data Cache

• READ: four steps
[1] CPU sends 34-bit address to cache for Tag comparison
[2] Index selects the Tag to be compared with the Tag in the address from CPU
[3] Compare the Tags if the information in the directory is Valid
[4] If Tags match, signal CPU to load the data
These 4 steps take 2 clock cycles; instructions issued during these 2 cycles need to be stalled
• WRITE: – the first three steps + data write
Writes in Alpha 21064

• Write Through Cache
  – A write cannot complete in 2 clock cycles, but the data is written into the Write Buffer within 2 clock cycles; from there it must be stored to memory while the CPU continues working
• No-Write Merging vs. Write Merging in the write buffer
Write merging example (each buffer entry holds four words, one valid bit per word):

  No-Write Merging               Write Merging (16 sequential bytes per entry)
  Write Address  V V V V         Write Address  V V V V
  100            1 0 0 0         100            1 1 1 1
  104            1 0 0 0                        0 0 0 0
  108            1 0 0 0                        0 0 0 0
  112            1 0 0 0                        0 0 0 0

  4 writes use 4 entries         4 writes merged into a single buffer entry
Structural Hazard: Split Cache or Unified Cache?
Miss Rates for Separate Instruction Cache and Data Cache vs. Unified Cache
- Direct mapped, 32-byte blocks, SPEC92 average (instruction references are about 75%), DECstation 5000
  Size      Split Cache                       Unified Cache
            Instruction Cache   Data Cache
1 KB 3.06% 24.61% 13.34%
2 KB 2.26% 20.57% 9.78%
4 KB 1.78% 15.94% 7.24%
8 KB 1.10% 10.19% 4.57%
16 KB 0.64% 6.47% 2.87%
32 KB 0.39% 4.82% 1.99%
64 KB 0.15% 3.77% 1.35%
128 KB 0.02% 2.88% 0.95%
Example: Miss Rate and Average Access Time
• Compare 16 KB I-cache and 16 KB D-cache vs. 32 KB U-cache
• Assume
  – a hit takes 1 clock cycle and a miss takes 50 clock cycles
  – a load or a store hit takes 1 extra clock cycle on a U-cache, since there is only one cache port to satisfy two simultaneous requests (1 instruction fetch and 1 LD or ST)
  – Integer program: LD 26%, ST 9%
• Answer
– instruction accesses: 100%/(100%+26%+9%) = 75%
– data accesses: (26%+9%) /(100%+26%+9%) = 25%
– the overall miss rate for the split cache
(75% x 0.64%) + (25% x 6.47%) = 2.10%
U-cache has a slightly lower miss rate of 1.99%
Continue on the next slide
Example: Miss Rate and Average Access Time

• Average access time = % instructions x (hit time + instruction miss rate x miss penalty) + % data x (hit time + data miss rate x miss penalty)
• 16 KB split cache average access time
  AMAT_split = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50)
             = 0.990 + 1.059 = 2.05 clock cycles
• 32 KB unified cache average access time
  AMAT_unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50)
               = 1.496 + 0.749 = 2.24 clock cycles
  (the extra 1 cycle on data accesses is the stall caused by the instruction fetch using the single port)
• The split cache in this example has a better average access time than the single-ported unified cache even though its effective miss rate is higher
Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time
Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty)
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
Cache Performance
CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time

Misses per instruction = Memory accesses per instruction x Miss rate

CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time
Cache Performance

• What is the impact of two different cache organizations (DM and SA cache) on the performance of a CPU?
• Assume
– CPI with a perfect cache -- 2.0
– clock cycle time(=cache access time) -- 2 ns
– 1.3 memory references per instruction
– 64 KB cache : block size -- 32 bytes
– one cache - direct mapped, the other - two-way SA
– assume the CPU clock cycle time must be stretched 1.10 times to accommodate the selection MUX of the SA cache (2 ns x 1.1 = 2.2 ns)
– cache miss penalty -- 70 ns for either cache
– miss rate : Direct mapped -- 1.4%, SA -- 1.0%
2-way Set Associative, Address to Select Word

[Figure: two-way set-associative cache with two sets of address tags and
data RAM. The block address is split into a Tag <22> and Index <7>, plus a
block offset <5>. Each entry holds Valid <1>, Tag <22>, and Data <64>. The
index selects one entry from Set 0 and one from Set 1; both tags are
compared (=?) in parallel, and a 2:1 MUX selects the data from the matching
way. Writes go through a Write Buffer to the lower-level memory.]
Cache Performance

Answer
– Average access time = Hit time + Miss rate x Miss penalty
  • Average access time_direct = 2 + (.014 x 70) = 2.98 ns
  • Average access time_2-way = 2 x 1.10 + (.010 x 70) = 2.90 ns
– CPU performance
  CPU time = IC x (CPI_ex + Misses per instr. x Miss penalty) x Clock cycle time
  • CPU time_direct = IC x (2.0 x 2 + (1.3 x .014 x 70)) = 5.27 x IC
  • CPU time_2-way  = IC x (2.0 x 2 x 1.10 + (1.3 x .010 x 70)) = 5.31 x IC
– The direct-mapped cache leads to slightly better average performance
Improving Cache Performance
• Average memory access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Improve performance by:
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Cache Performance Improvement
Reducing Miss Rate
Reducing Misses

Classifying Misses: 3 Cs
Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses.(Misses in Infinite Cache)
Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Size of Cache)
Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses.(Misses in N-way Set Associative, Size of Cache)
3Cs Absolute Miss Rate

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for
1-way (DM), 2-way, 4-way, and 8-way caches, broken down into compulsory,
capacity, and conflict components; the conflict component shrinks as
associativity increases.]
2:1 Cache Rule on Set Associativity

2:1 Cache rule of thumb:
A direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for
1-way (DM), 2-way, 4-way, and 8-way caches, with the compulsory, capacity,
and conflict breakdown, illustrating the rule.]
3Cs Relative Miss Rate (Scaled to Direct-Mapped Miss Ratio)

[Figure: miss rate per type, scaled from 0% to 100% of the direct-mapped
miss ratio, vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and
8-way caches, broken down into compulsory, capacity, and conflict
components.]
How Can We Reduce Misses?
• Change Block Size? Which of 3Cs affected?
• Change Associativity? Which of 3Cs affected?
• Change Compiler? Which of 3Cs affected?
1. Reduce Misses via Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache
capacities of 1K, 4K, 16K, 64K, and 256K.]

For larger caches, the miss rate falls as the block size increases. But the block size should not be large relative to the cache capacity; that will instead hurt the miss rate.
2. Reduce Misses via Higher Associativity

• 2:1 Cache Rule:
  – Miss rate of a DM cache of size N is about the miss rate of a 2-way SA cache of size N/2
• Beware: execution time is the only final measure!
  – Will clock cycle time increase?
  – Hill [1988] suggested hit times increase by +10% for TTL or ECL board-level (external) caches and +2% for CMOS internal caches, for 2-way vs. 1-way
Example: AMAT vs. Miss Rate

Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the direct-mapped CCT; 32 bytes/block

  Cache      Average Memory Access Time
  Size (KB)  1-way  2-way  4-way  8-way
    1        7.56   6.60   6.22   5.44
    2        5.09   4.90   4.62   4.09
    4        4.60   3.95   3.57   3.19
    8        3.30   3.00   2.87   2.59
   16        2.45   2.20   2.12   2.04
   32        2.00   1.80   1.77   1.79
   64        1.70   1.60   1.57   1.59
  128        1.50   1.45   1.42   1.44

For small caches, AMAT is improved by more associativity; for large caches (32 KB and up, where 8-way is slightly worse than 4-way), AMAT is not improved significantly by more associativity.
3. Reducing Misses via Victim Cache

• How to combine the fast hit time of Direct Mapped yet still avoid conflict misses?
• Add a small buffer (the victim cache) to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache

[Figure: a small fully associative victim cache (Tag/Data entries, each
compared with =?) sits between the direct-mapped cache and the lower-level
memory, alongside the write buffer; blocks evicted from the cache go into
the victim cache, which is checked on a miss.]
4. Reducing Misses via Pseudo-Associativity

• How to combine the fast hit time of Direct Mapped with the lower conflict misses of a 2-way SA cache?

[Figure: access-time line showing Hit Time < Pseudo-Hit Time < Miss Penalty]

• Divide the cache: on a DM miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)
• Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
  – Better for caches not tied directly to the processor
5. Reducing Misses by HW Prefetching

• Prefetch into the cache or into an external buffer, e.g., instruction prefetching
  – Alpha 21064 fetches 2 consecutive blocks on a miss
  – The extra prefetched block is placed in an instruction stream buffer
  – On a miss, check the instruction stream buffer
• Works with data blocks too:
  – Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
  – Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set-associative caches
• Prefetching relies on extra memory bandwidth that can be used without penalty
6. Reducing Misses by Compiler Prefetching of Data

• Prefetch instructions request the data in advance
• Data Prefetch
  – Register Prefetch: load the value of data into a register (HP PA-RISC loads)
  – Cache Prefetch: load data into the cache (MIPS IV, PowerPC, SPARC v.9)
  – Special prefetching instructions cannot cause faults; a non-faulting or nonbinding prefetch doesn't change register or memory contents and cannot cause exceptions: a form of speculative execution
  – Makes sense only if a nonblocking (lockup-free) cache is used, i.e., the processor can proceed while waiting for prefetched data
• Issuing prefetch instructions takes time
  – Is the cost of issuing prefetches < the savings in reduced misses?
Big Picture

The challenge in designing memory hierarchies is that every change that potentially improves the miss rate can also negatively affect overall performance. This combination of positive and negative effects is what makes the design of a memory hierarchy challenging.

  Design change      Effect on miss rate            Possible negative performance effect
  Increase size      Decreases capacity misses      May increase access time
  Increase assoc.    Decreases miss rate due to     May increase access time
                     conflict misses
  Increase blk size  Decreases miss rate for a      May increase miss penalty
                     wide range of blk sizes
Compiler Prefetching of Data

a and b are 8 bytes long -> 2 data per cache block

for (i = 0; i < 3; i = i+1)
  for (j = 0; j < 100; j = j+1)
    a[i][j] = b[j][0] * b[j+1][0];

Analysis
– Spatial locality: exists in a but not in b
  Misses due to a: a miss on every other a[i,j] (2 data/block)
  a[0,0],a[0,1],...,a[0,99],a[1,0],a[1,1],...,a[1,99],a[2,0],...,a[2,99]
  • Total: 3 x 100 / 2 = 150 misses
– Misses due to b (ignoring potential conflict misses):
  i=0 accesses: b[0,0],b[1,0],b[1,0],b[2,0],b[2,0],b[3,0],...,b[99,0],b[99,0],b[100,0]
  • 100 misses on b[j+1,0] when i=0
  • 1 miss on b[j,0] when i=0 and j=0
  • Hits on every b access when i=1 and i=2
  • Total: 101 misses
– Total misses: 251
Compiler Prefetching of Data

Assume the miss penalty is so large that we need to prefetch at least seven iterations in advance. Split the iterations of the i-loop into i=0 (a j-loop that also fetches all of b) and the remainder (the i-loop):

for (j = 0; j < 100; j = j+1) {
  prefetch(b[j+7][0]);
  prefetch(a[0][j+7]);
  a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i = i+1)
  for (j = 0; j < 100; j = j+1) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0] * b[j+1][0];
  }

First loop: the first 7 iterations still miss before the prefetches take effect, giving 7/2 (rounded up: 4) misses on a[0,0]..a[0,6] and 7 misses on b[0,0]..b[7,0].
Second loop: 7/2 x 2 (rounded up: 4 + 4) misses on a[1,0..6] and a[2,0..6].

Total number of misses: 7/2 x 3 misses of a (= 12) + 7 misses of b = 19 misses
7. Reducing Misses by Compiler Optimizations

• Instructions
  – Reorder procedures in memory so as to reduce misses
  – Profiling to look at conflicts
  – McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks
• Data
  – Merging Arrays: improve spatial locality with a single array of compound elements vs. 2 arrays
  – Loop Interchange: change the nesting of loops to access data in the order stored in memory
  – Loop Fusion: combine 2 independent loops that have the same looping and some overlapping variables
  – Blocking: improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows
Merging Arrays Example
/* Before */
int val[SIZE];
int key[SIZE];
/* After */
struct merge {
int val;
int key;
};
struct merge merged_array[SIZE];
Reducing conflicts between val & key
Loop Interchange Example

Sequential accesses instead of striding through memory every 100 words
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Processing Sequence
  After (j inner):  x[0,0],x[0,1],...,x[0,m],x[1,0],x[1,1],...,x[1,m],...,x[n,0],x[n,1],...,x[n,m]  (spatial locality)
  Before (i inner): x[0,0],x[1,0],...,x[n,0],x[0,1],x[1,1],...,x[n,1],...,x[0,m],x[1,m],...,x[n,m]  (no spatial locality)
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

Before: 2 misses per access to a & c  vs.  After: 1 miss per access
Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }

[Figure: access pattern: x is written row by row (indices i, j), one row of
y is read repeatedly (indices i, k), and all of z is read (indices k, j).]
• Two inner loops:
  – Read all NxN elements of z[ ]
  – Read N elements of 1 row of y[ ] repeatedly
  – Write N elements of 1 row of x[ ]
• Capacity misses are a function of N and cache size:
  – If the cache can store all 3 NxN matrices => no capacity misses; possible conflict misses
• Idea: compute on BxB sub-matrix
Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }

[Figure: with blocking, x and y are accessed in BxB sub-blocks rather than
whole rows, and z in a BxB sub-block rather than the whole matrix.]

Blocking is useful to avoid capacity misses when dealing with large arrays.
• Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
• B is called the Blocking Factor; in this example B = 3
• Conflict misses too?
Reducing Conflict Misses by Blocking

Impact of conflict misses (in caches that are not fully associative) on the blocking factor:
– Lam et al [1991]: a blocking factor of 24 had one fifth the misses of 48, despite both fitting in the cache

[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a
direct-mapped cache and a fully associative cache.]

Blocking reduces capacity misses; conflict misses can also be reduced by increasing associativity, and a blocking factor smaller than what the capacity allows can also reduce conflict misses.
Summary: Compiler Optimization to Reduce Cache Misses

[Figure: performance improvement (1x to 3x) from merged arrays, loop
interchange, loop fusion, and blocking on the benchmarks compress, cholesky
(nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and
vpenta (nasa7).]
Summary

• 3 Cs: Compulsory, Capacity, Conflict misses
• Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
• Remember danger of concentrating on just one parameter when evaluating performance
Cache Performance Improvement
Reducing Miss Penalty
Reducing Miss Penalty: 1. Read Priority over Write on Miss

• Write-through caches with write buffers risk RAW conflicts between buffered writes and main-memory reads on cache misses
  – Simply waiting for the write buffer to empty might increase the read miss penalty by 50% (old MIPS 1000)
  – Instead, check the write buffer contents before the read; if there is no conflict, let the memory access continue
• Write Back?
  – Read miss replacing a dirty block
  – Normal: write the dirty block to memory, and then do the read
  – Instead: copy the dirty block to a write buffer, then do the read, and then do the write
  – The CPU stalls less since it restarts as soon as the read is done
Reducing Miss Penalty: 2. Subblock Placement

• Don't have to load the full block on a miss
• Have a valid bit per sub-block to indicate validity
• (Originally invented to reduce tag storage)

[Figure: four cache entries with tags 100, 300, 200, and 204, each divided
into four sub-blocks with one valid bit per sub-block; only the valid
sub-blocks of a block are present in the cache.]
Reducing Miss Penalty: 3. Early Restart and Critical Word First

• Don't wait for the full block to be loaded before restarting the CPU
  – Early Restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution while loading the remainder of the block
  – Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first
• Generally useful only with large blocks
• Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear how much early restart helps
Reducing Miss Penalty: 4. Non-blocking Caches to Reduce Stalls on Misses

• A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss
• “Hit under miss” reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU
• “Hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
Value of Hit under Miss for SPEC

• FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB data cache, direct mapped, 32B blocks, 16-cycle miss penalty

[Figure: average memory access time (0 to 2) under “hit under i misses” for
the transitions 0->1, 1->2, and 2->64 vs. the blocking Base, for the integer
benchmarks eqntott, espresso, xlisp, compress, mdljsp2, and ear, and the
floating-point benchmarks fpppp, tomcatv, swm256, doduc, su2cor, wave5,
mdljdp2, hydro2d, alvinn, nasa7, spice2g6, and ora.]
Reducing Miss Penalty: 5. Second Level Cache

[Figure: CPU connected to the L1 cache, backed by the L2 cache, backed by
Memory.]
Reducing Miss Penalty: 5. Second Level Cache

• L2 Equations
  AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
• Definitions:
  – Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  – Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
Second Level Cache: An Example

• Example:
  – 1000 memory references
  – 40 misses in L1
  – 20 misses in L2
• The miss rate (either local or global) for L1: 40/1000 = 4%
• The local miss rate for L2: 20/40 = 50%
• The global miss rate for L2: 20/1000 = 2%
Comparing Local and Global Miss Rates

• 32 KByte L1 cache; increasing L2 cache size
• The global miss rate is close to the single-level cache miss rate, provided L2 >> L1
• Don't use the local miss rate: it is a function of the L1 miss rate, so use the L2 global miss rate instead
• L2 is not tied to the CPU clock cycle; it affects only the L1 miss penalty, not the CPU clock cycle
• In L2 design, two questions must be considered: will it lower AMAT, and how much will it cost?
• Generally: aim for fast hit times and fewer misses
• Since L2 hits are few, the target is reducing misses

[Figure: L1 and L2 miss rates vs. L2 cache size, shown on both a linear
scale and a log scale.]
Reducing Misses: Which Apply to L2 Cache?

Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Capacity/Conf. Misses by Compiler Optimizations
The Impact of L2 Cache Associativity on the Miss Penalty

• Does set associativity make more sense for L2 caches?
  – Two-way set associativity increases hit time by 10% of a clock cycle (0.1 clock cycle)
  – Hit time_L2 for direct mapped = 10 cycles
  – Local miss rate_L2 for direct mapped = 25%
  – Local miss rate_L2 for 2-way SA = 20%
  – Miss penalty_L2 = 50 cycles
• The first-level cache miss penalty (Miss penalty_L1 = Hit time_L2 + Miss rate_L2 x Miss penalty_L2):
  Miss penalty_L1, 1-way = 10 + 25% x 50 = 22.5 clock cycles
  Miss penalty_L1, 2-way = 10.1 + 20% x 50 = 20.1 clock cycles
  Worst case (hit time must be an integral number of clocks, so round 10.1 up to 11):
  Miss penalty_L1, 2-way = 11 + 20% x 50 = 21.0 clock cycles
L2 Cache Block Size and AMAT

512KB L2 cache; memory bus = 32 bits. Long memory access time: 1 clock for sending the address, 6 clocks for accessing data, then 1 word per clock.

  Block size of L2 cache (bytes):  16    32    64    128   256   512
  Relative CPU execution time:     1.36  1.28  1.27  1.34  1.54  1.95

64 bytes is the popular size. In a small L2 cache, conflict misses would increase as the block size increases, but here the L2 capacity is large.
L2 Cache Block Size

• Multilevel Inclusion Property
  – All data in the first-level cache are always in the second-level cache
  – Consistency between I/O and caches (or between caches in multiprocessors) can be determined by checking the L2 cache
  – Drawback
    • Different block sizes between caches in different levels
      - Usually, the smaller cache has smaller cache blocks
      - Thus, different block sizes for L1 and L2
      - Still maintain the multilevel inclusion property
    • On a miss, complex invalidations; requires non-blocking secondary caches
Reducing Miss Penalty Summary

• Five techniques
  – Read priority over write on miss
  – Sub-block placement
  – Early restart and critical word first on miss
  – Non-blocking caches (hit under miss)
  – Second-level cache
• Can be applied recursively to multilevel caches
  – The danger is that the time to access DRAM will increase with multiple levels in between
Cache Performance Improvement: Fast Hit Time
Fast Hit Times: 1. Small and Simple Caches

• Smaller and simpler hardware is faster
  – Smaller cache
    • On-chip cache
    • On-chip tag, off-chip data
  – Simple cache
    • A direct-mapped cache allows data to be transmitted while the tag is checked
• The Alpha 21164 has an 8KB instruction cache and an 8KB data cache, plus a 96KB second-level cache
• Cache hit time: most of the time is spent reading the tag memory using the INDEX part of the address and comparing
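The index/tag path can be sketched for the 21064-style organization shown earlier in the lecture (34-bit address: 21-bit tag, 8-bit index, 5-bit block offset); this is a simplified behavioral model, not the actual hardware:

```python
OFFSET_BITS = 5   # 32-byte blocks
INDEX_BITS = 8    # 256 blocks, direct mapped

def split_address(addr):
    # Carve the address into (tag, index, block offset) fields.
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def lookup(cache, addr):
    # cache: 256 entries of (valid, tag, block_data).
    # The index selects the entry whose tag is compared; a hit requires
    # a valid entry with a matching tag.
    tag, index, _ = split_address(addr)
    valid, stored_tag, data = cache[index]
    return data if valid and stored_tag == tag else None
```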
Fast Hit Times: 2. Fast Hits by Avoiding Address Translation

• Send the virtual address to the cache? Called a Virtually Addressed Cache, or just Virtual Cache, vs. a Physical Cache
  – Make the common case fast: the common case is a hit, which allows eliminating the virtual address translation on hits
  – Every time the process is switched, the cache logically must be flushed; otherwise false hits occur on the same virtual address
    • Cost is the time to flush + the compulsory misses from the empty cache
  – Must deal with aliases (sometimes called synonyms): the OS as well as user programs may map two different virtual addresses to the same physical address
  – I/O (physical addresses) must interact with the cache's virtual addresses
Solutions to Cache Flush and Alias

• Solution to aliases
  – HW anti-aliasing: guarantee that every cache block has a unique physical address
  – SW: guarantee that the lower n bits of aliases are the same, with n large enough to cover the index field (for direct mapped, the index must then be unique); page coloring makes the least significant several bits of the physical and virtual addresses identical
• Solution to cache flush
  – Add a process identifier (PID) to the address tag, identifying the process as well as the address within the process: the wrong process cannot get a hit
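The PID scheme amounts to a tag comparison that also matches the process identifier; a toy model (field names are made up):

```python
def vcache_hit(entry, vtag, pid):
    # entry: (valid, pid, virtual_tag). A hit requires the PID to match
    # as well as the tag, so the wrong process can never get a false hit
    # and the cache need not be flushed on a process switch.
    valid, stored_pid, stored_tag = entry
    return valid and stored_pid == pid and stored_tag == vtag
```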
Conventional Virtually Addressed Cache

• Hit time is slow

[Diagram: CPU → (VA) → TB → (PA) → Cache → (PA) → MEM]
1. Translate the virtual address into a physical address
2. Then access the cache with the physical address
Alternative Virtually Addressed Cache

• Synonym problem
• Higher miss penalty

[Diagram: CPU → (VA) → Cache with VA tags; on a miss, VA → TB → (PA) → MEM]
1. Access the cache with the virtual address
2. If it misses, translate the virtual address for the memory access
Cache Access and VA Translation Overlapping

• Requires the cache index to remain invariant across translation, i.e., the index is part of the physical address (the page offset)

[Diagram: virtual address = (VPA, page offset); the page offset indexes the cache while the TB translates VPA → PPA in parallel; on a miss, the L2 cache is accessed with the physical address]
1. Access the cache and translate the virtual address simultaneously
2. On a miss, access the L2 cache with the physical address
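The invariance condition can be stated numerically: the index and block-offset bits must fit within the page offset, which translation leaves unchanged. A sketch with illustrative parameters:

```python
import math

def can_overlap(cache_bytes, block_bytes, assoc, page_bytes):
    # The index selects a set; index + block-offset bits must lie inside
    # the page offset to be usable before translation completes.
    sets = cache_bytes // (block_bytes * assoc)
    index_bits = int(math.log2(sets))
    offset_bits = int(math.log2(block_bytes))
    return index_bits + offset_bits <= int(math.log2(page_bytes))

# 8KB direct-mapped cache, 32B blocks, 8KB pages: 8 + 5 <= 13, overlap works.
# 16KB direct mapped, same page size: 9 + 5 > 13, overlap fails.
# 16KB 2-way: the extra way keeps the index at 8 bits, so overlap works again.
```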
2. Avoiding Translation: Process ID Impact

[Figure: miss rate (0%-20%) vs. cache size (2KB-1024KB) for a direct-mapped cache with 16-byte blocks, Ultrix running on a VAX:
• Green: uniprocess, without process switches
• Red: multiprocess, process switches using PIDs instead of flushing the cache
• Blue: multiprocess, process switches without PIDs, simply flushing the cache]
2. Avoiding Translation: Index with Physical Portion of Address

• If the INDEX is a physical part of the address, the tag access can start in parallel with address translation, so the comparison is against the physical tag
• This limits the cache size to the page size: what if we want a bigger cache using the same trick?
  – Higher associativity: a larger cache with more ways keeps the index within the page offset
  – Page coloring: the OS makes the last few bits of the virtual and physical addresses identical, so only the remaining page-offset bits need to match

[Address layout, bits 31..0: Page Address = Address Tag (bits 31..12); Page Offset (bits 11..0) = Index + Block Offset. Address translation operates on the page address while the tag access uses the page offset.]
Fast Hit Times: 3. Fast Hit Times via Pipelined Writes

• Pipeline the tag check and the cache update as separate stages: the current write's tag check overlaps the previous write's cache update
• Only writes use the pipeline; it is empty during a miss

[Diagram: a delayed write buffer holds the previous write between its tag check and its data update; the buffer must be checked on reads, which either complete the pending write or read the data from the buffer]
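The delayed write buffer can be modeled as a one-entry stage between tag check and data update; this is a behavioral sketch, not the hardware:

```python
class PipelinedWriteCache:
    """Tag check of the current write overlaps the data update of the
    previous write, which waits in a one-entry delayed write buffer."""

    def __init__(self):
        self.data = {}        # the cache data array (addr -> value)
        self.delayed = None   # previous write whose tag check passed

    def write(self, addr, value):
        if self.delayed is not None:
            # Stage 2: drain the previous write into the data array
            # while the current write is (conceptually) tag-checked.
            prev_addr, prev_val = self.delayed
            self.data[prev_addr] = prev_val
        self.delayed = (addr, value)   # stage 1: current write waits

    def read(self, addr):
        # Reads must check the delayed write buffer first.
        if self.delayed is not None and self.delayed[0] == addr:
            return self.delayed[1]
        return self.data.get(addr)
```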
Fast Hit Times: 4. Fast Writes on Misses via Small Sub-blocks

• If most writes are 1 word, the sub-block size is 1 word, and the cache is write-through, then always write the sub-block and tag immediately
  – Tag match and valid bit already set: writing the block was proper, and nothing is lost by setting the valid bit on again
  – Tag match and valid bit not set: the tag match means this is the proper block; writing the data into the sub-block makes it appropriate to turn the valid bit on
  – Tag mismatch: this is a miss that will modify the data portion of the block. As this is a write-through cache, however, no harm is done; memory still has an up-to-date copy of the old value. Only the tag (set to the address of the write) and the valid bits of the other sub-blocks need to be changed, because the valid bit for this sub-block has already been set
• Doesn't work with write-back, due to the last case
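The three cases collapse to a short rule: on a tag mismatch, claim the block and clear the other sub-blocks' valid bits; then write the sub-block and set its valid bit. A sketch for a write-through cache (field names are hypothetical):

```python
def write_subblock(block, tag, sub, value):
    # block: {'tag': int, 'valid': [bool] * n_sub, 'data': [word] * n_sub}
    if block['tag'] != tag:
        # Tag mismatch: safe to take over the block, since write-through
        # keeps memory up to date; invalidate the other sub-blocks.
        block['tag'] = tag
        block['valid'] = [False] * len(block['valid'])
    # Tag match (or freshly claimed block): write data, set the valid bit.
    block['data'][sub] = value
    block['valid'][sub] = True
```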
Cache Optimization Summary

(MR = miss rate, MP = miss penalty, HT = hit time; + improves the factor, - hurts it)

Technique                              MR   MP   HT   Complexity
----------------------------------------------------------------
Larger Block Size                      +    -         0
Higher Associativity                   +         -    1
Victim Caches                          +              2
Pseudo-Associative Caches              +              2
HW Prefetching of Instr/Data           +              2
Compiler Controlled Prefetching        +              3
Compiler Reduce Misses                 +              0
Priority to Read Miss over Write Miss       +         1
Sub-block Placement                         +    +    1
Early Restart & Critical Word 1st           +         2
Non-Blocking Caches                         +         3
Second-Level Caches                         +         2
Small & Simple Cache                   -         +    0
Avoiding Address Translation                     +    2
Pipelining Writes for Fast Write Hit             +    1
What is the Impact of What You've Learned About Caches?

• 1960-1985: Speed = f(no. of operations)
• 1995:
  – Pipelined execution & fast clock rate
  – Out-of-order completion
  – Superscalar instruction issue
• 1995: Speed = f(non-cached memory accesses)
• What does this mean for compilers? Operating systems? Algorithms? Data structures?

[Figure: processor-memory performance gap, 1980-2000 (log scale, 1 to 1000): CPU performance has grown far faster than DRAM performance]
Cross Cutting Issues

• Parallel execution vs. cache locality
  – Want wide separation to find independent operations, vs. wanting reuse of data accesses to avoid misses
• I/O and consistency of data between cache and memory
  – Caches => multiple copies of data
  – Consistency by HW or by SW?
  – Where to connect I/O to the computer?
Alpha 21064

• Separate instruction & data TLBs and caches
• TLBs fully associative
• TLB updates in SW ("Priv Arch Libr")
• Caches 8KB, direct mapped
• Critical 8 bytes first
• Prefetch instruction stream buffer
• 2MB L2 cache, direct mapped (off-chip)
• 256-bit path to main memory, 4 x 64-bit modules
Alpha Memory Performance: Miss Rates

[Figure: miss rates (log scale, 0.01% to 100%) of the 8KB I-cache, 8KB D-cache, and 2MB L2 cache across benchmarks: AlphaSort, TPC-B (db2), TPC-B (db1), Espresso, Li, Eqntott, Sc, Gcc, Compress, Mdljsp2, Ora, Fpppp, Ear, Swm256, Doduc, Alvinn, Tomcatv, Wave5, Mdljp2, Hydro2d, Spice, Nasa7, Su2cor]
Alpha CPI Components

• Primary limitations of performance
  – Instruction stall: branch mispredict
  – Other: compute + register conflicts, structural conflicts

[Figure: CPI (0.00-4.50) broken into L2, I$, D$, instruction-stall, and other components per benchmark (AlphaSort, TPC-B (db2), TPC-B (db1), Espresso, Li, Eqntott, Sc, Gcc, Compress, Mdljsp2, Ora, Fpppp, Ear, Swm256, Doduc, Alvinn, Tomcatv, Wave5, Mdljp2, Hydro2d); the commercial workloads show the largest memory-related CPI]
Pitfall: Predicting Cache Performance from Different Pgm (ISA, compiler, ...)

• 4KB data cache: miss rate 8%, 12%, or 28%?
• 1KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for 8KB data cache: 17% vs. 10%

[Figure: D-cache and I-cache miss rates (0%-35%) vs. cache size (1KB-128KB) for tomcatv, gcc, and espresso: the miss rate varies widely depending on the program]
Pitfall: Simulating Too Small an Address Trace

[Figure: cumulative average memory access time (1 to 4.5) vs. instructions executed (0-12 billion) for SOR (FORTRAN), Tree (Scheme), Multi (a multiprogrammed workload), and TV (Pascal): the cumulative AMAT keeps shifting for billions of instructions, so short traces can be misleading]