
Page 1: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

CSC 4250 Computer Architectures

December 5, 2006

Chapter 5. Memory Hierarchy.

Page 2: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Cache Optimizations

1. Reducing miss penalty: multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches

2. Reducing miss rate: larger block size, larger cache size, higher associativity, way prediction, and compiler optimization

3. Reducing miss penalty or miss rate via parallelism: hardware and compiler prefetching

4. Reducing time to hit in cache: small and simple caches, and pipelined cache access

Page 3: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Three Categories of Misses (Three C's)

Three C's: Compulsory, Capacity, and Conflict

Compulsory ─ The very first access to a block cannot be in the cache; also called cold-start misses or first-reference misses

Capacity ─ If the cache cannot contain all the blocks needed during execution, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved

Conflict ─ If the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block may be discarded and later retrieved if too many blocks map to its set; also called collision misses or interference misses

Page 4: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Figure 5.15. Total miss rate (top) and distribution of miss rate (bottom) for each size data cache according to the three C's.

Page 5: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Interpretation of Figure 5.15

The figure shows the relative frequencies of cache misses, broken down by the “three C’s”

Compulsory misses are those that occur in an infinite cache

Capacity misses are those that occur in a fully associative cache

Conflict misses are those that occur going from fully associative to 8-way associative, 4-way associative, and so on

To show the benefit of associativity, conflict misses are divided by each decrease in associativity:

8-way ─ Conflict misses from fully assoc. to 8-way assoc.
4-way ─ Conflict misses from 8-way assoc. to 4-way assoc.
2-way ─ Conflict misses from 4-way assoc. to 2-way assoc.
1-way ─ Conflict misses from 2-way assoc. to 1-way assoc.

Page 6: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Reducing Miss Rate

1. Larger Block Size

2. Larger Caches

3. Higher Associativity

4. Way Prediction

5. Compiler Optimizations

Page 7: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

1. Larger Block Size

Larger block size reduces compulsory misses, due to spatial locality

Larger blocks increase miss penalty

Larger blocks increase conflict misses, and even capacity misses if the cache is small

Do not increase the block size to a value beyond which either miss rate or average memory access time increases

Page 8: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Figure 5.16. Miss rate versus block size

Page 9: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Figure 5.18. Average memory access time versus block size for four caches sized 4KB, 16KB, 64KB, and 256KB.

Block sizes of 32B and 64B dominate; the smallest average time for each cache size is marked with an asterisk

What is the memory access overhead included in the miss penalty?

Block size   Miss penalty    4KB       16KB     64KB     256KB
16B          82              8.027     4.231    2.673    1.894
32B          84              7.082*    3.411    2.134    1.588
64B          88              7.160     3.323*   1.933*   1.449*
128B         96              8.469     3.659    1.979    1.470
256B         112            11.651     4.685    2.288    1.549
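The miss-penalty column suggests an answer to the question above: the values 82, 84, 88, 96, and 112 are consistent with a fixed overhead of 80 clock cycles plus 2 clock cycles for every 16 bytes transferred. Below is a minimal sketch of how one table entry is formed, assuming a 1-clock-cycle hit time and an illustrative 8.57% miss rate for the 4KB/16B case; the actual miss rates come from Figure 5.16, which is not reproduced here.

#include <stdio.h>

/* Sketch: reconstruct one entry of Figure 5.18.
   Assumptions (not stated on the slide): hit time = 1 clock cycle;
   miss penalty = 80-cycle overhead + 2 cycles per 16 bytes transferred,
   which matches the miss-penalty column (82, 84, 88, 96, 112).
   Miss rates would be taken from Figure 5.16. */
static double amat(double hit_time, double miss_rate, int block_bytes)
{
    double miss_penalty = 80.0 + 2.0 * (block_bytes / 16);
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* Illustrative value: an assumed 8.57% miss rate for a 4KB cache with
       16B blocks gives 1 + 0.0857 * 82 = 8.03, close to 8.027 in the table. */
    printf("AMAT = %.3f clock cycles\n", amat(1.0, 0.0857, 16));
    return 0;
}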

Page 10: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

2. Larger Caches

An obvious way to reduce capacity misses in Fig. 5.15 is to increase the capacity of the cache

The drawback is a longer hit time and a higher dollar cost

This technique is especially popular in off-chip caches: The size of second- or third-level caches in 2001 equals the size of main memory in desktop computers in 1990

Page 11: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

3. Higher Associativity

Figure 5.15 shows how miss rates improve with higher associativity. There are two general rules of thumb:

1. 8-way set associative is for practical purposes as effective in reducing misses as fully associative
2. A direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2

Improving one aspect of the average memory access time comes at the expense of another:

1. Increasing block size reduces miss rate while increasing miss penalty
2. Greater associativity comes at the cost of an increased hit time

Page 12: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Fig. 5.19. Average memory access time versus associativity

Entries marked with an asterisk are larger than the entry to their left, i.e., higher associativity increases average memory access time

Smaller caches need higher associativity

Cache size   1-way   2-way   4-way   8-way
4KB          3.44    3.25    3.22    3.28*
8KB          2.69    2.58    2.55    2.62*
16KB         2.23    2.40*   2.46*   2.53*
32KB         2.06    2.30*   2.37*   2.45*
64KB         1.92    2.14*   2.18*   2.25*
128KB        1.52    1.84*   1.92*   2.00*
256KB        1.32    1.66*   1.74*   1.82*
512KB        1.20    1.55*   1.59*   1.66*

Page 13: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

4. Way Prediction

This approach reduces conflict misses while maintaining the hit speed of a direct-mapped cache

Extra bits are kept in the cache to predict the way of the next cache access

Alpha 21264 uses way prediction in its 2-way set associative instruction cache: added to each block is a prediction bit, used to select which block to try on the next cache access

If predictor is correct, the instruction cache latency is 1 clock cycle; if not, the cache tries the other block, changes the way predictor, and has a latency of 3 clock cycles

SPEC95 suggests a way prediction accuracy of 85%
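To make the mechanism concrete, here is a small illustrative sketch of a 2-way set associative lookup with a per-set way predictor, in the spirit of the description above. The set count, address field widths, and data structures are assumptions chosen for illustration, not the actual Alpha 21264 design; only the 1-cycle and 3-cycle latencies come from the slide.

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 512                      /* assumed: 512 sets, 64B blocks */

typedef struct {
    uint64_t tag[2];                      /* tags for way 0 and way 1 */
    bool     valid[2];                    /* valid bits */
    int      predict;                     /* predicted way for the next access */
} cache_set;

static cache_set sets[NUM_SETS];

/* Returns the way that hit (0 or 1) or -1 on a miss; *latency is set to
   1 cycle on a correct prediction and 3 cycles when the other way hits,
   matching the latencies quoted on the slide. */
int lookup(uint64_t addr, int *latency)
{
    uint64_t index = (addr >> 6) & (NUM_SETS - 1);  /* 6 offset bits, 9 index bits */
    uint64_t tag   = addr >> 15;                    /* remaining high-order bits */
    cache_set *s   = &sets[index];

    int guess = s->predict;               /* try the predicted way first */
    if (s->valid[guess] && s->tag[guess] == tag) {
        *latency = 1;                     /* prediction correct: fast hit */
        return guess;
    }
    int other = 1 - guess;                /* prediction wrong: try the other way */
    if (s->valid[other] && s->tag[other] == tag) {
        s->predict = other;               /* retrain the way predictor */
        *latency = 3;                     /* slower hit, as stated on the slide */
        return other;
    }
    return -1;                            /* miss: block fetched from the lower level */
}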

Page 14: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

5. Compiler Optimizations

Code can be rearranged without affecting correctness:

Reordering the procedures of a program might reduce instruction miss rates by reducing conflict misses. Use profiling information to determine likely conflicts between groups of instructions

Aim for better efficiency from long cache blocks: aligning basic blocks so that the entry point is at the beginning of a cache block decreases the chance of a cache miss for sequential code

Improve the spatial and temporal locality of data

Page 15: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Loop Interchange

Some programs have nested loops that access data in memory in nonsequential order. Simply exchanging the nesting of the loops can make the code access the data in the order they are stored. Assuming the arrays do not fit in the cache, this technique reduces misses by improving spatial locality: reordering maximizes use of data in a cache block before the data are discarded.

/* Before: the inner loop walks down a column of x, so consecutive
   accesses are 100 elements apart in C's row-major storage */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2*x[i][j];

/* After: the inner loop walks along a row, so successive accesses fall
   in consecutive memory locations and reuse the same cache block */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2*x[i][j];

Page 16: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Reducing Hit Time

Small and Simple Caches

A time-consuming part of a cache hit is using the index portion of the address to read tag memory and then compare it to the address

We already know that smaller hardware is faster

It is critical to keep the cache small enough to fit on the same chip as the processor, to avoid the time penalty of going off chip

Keep the cache simple: say, use direct mapping; a main advantage is that we can overlap the tag check with transmission of the data

We use small and simple caches for level-1 caches

For level-2 caches, some designs strike a compromise by keeping tags on chip and data off chip, promising a fast tag check yet providing the greater capacity of separate memory chips
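As a small aside, here is a sketch of the address breakdown a direct-mapped cache uses for indexing; the cache and block sizes (16KB, 64B) are assumptions chosen for illustration, not from the slides. Because each index selects exactly one candidate block, the data at that index can be read out and forwarded while the single tag comparison completes, which is the overlap mentioned above.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 64                             /* assumed block size: 6 offset bits */
#define NUM_BLOCKS  (16 * 1024 / BLOCK_BYTES)      /* assumed 16KB cache: 256 blocks, 8 index bits */

int main(void)
{
    uint32_t addr   = 0x1234ABCDu;                 /* an arbitrary example address */
    uint32_t offset = addr % BLOCK_BYTES;                  /* byte within the block */
    uint32_t index  = (addr / BLOCK_BYTES) % NUM_BLOCKS;   /* selects the single candidate block */
    uint32_t tag    = addr / (BLOCK_BYTES * NUM_BLOCKS);   /* compared against the stored tag */

    printf("tag = 0x%x, index = %u, offset = %u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}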

Page 17: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Fig. 5.26. Summary of Cache Optimizations
(+ means the technique improves that factor, − means it hurts it; blank means no impact)

Technique                                   Miss penalty   Miss rate   Hit time   Hardware complexity
Multilevel caches                           +                                     2
Critical word first and early restart       +                                     2
Priority to read misses over write misses   +                                     1
Merging write buffer                        +                                     1
Victim caches                               +              +                      2
Larger block size                           −              +                      0
Larger cache size                                          +           −          1
Higher associativity                                       +           −          1
Way prediction                                             +                      2
Compiler techniques                                        +                      0
Small and simple caches                                    −           +          0
Pipelined cache access                                                 +          1

Page 18: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Virtual Cache

The guideline of making the common case fast suggests that we use virtual addresses for the cache, since hits are much more common than misses

Such caches are termed virtual caches, with physical cache used to identify the traditional cache that uses physical addresses

It is important to distinguish two tasks: indexing the cache and comparing addresses

The issues are whether a virtual or physical address is used to index the cache and whether a virtual or physical address is used in the tag comparison

Full virtual addressing for both indices and tags eliminates address translation time from a cache hit

Why doesn’t everyone build virtually addressed caches?

Page 19: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Reasons against Virtual Caches

First reason is protection. Page-level protection is checked as part of the virtual-to-physical address translation.

Second reason is that every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed. One solution is to increase the width of the cache address tag with a process-identifier tag (PID).

Third reason is that operating systems and user programs may use two different virtual addresses for the same physical address. These duplicate addresses could result in two copies of the same data in a virtual cache; if one is modified, the other will have the wrong value. With a physical cache, this wouldn’t happen, since accesses would first be translated to the same physical cache block.

Fourth reason is I/O. I/O typically uses physical addresses and thus would require mapping to virtual addresses to interact with a virtual cache.

Page 20: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

One Good Choice

One way to get the best of both virtual and physical caches is to use part of the page offset (the part that is identical in both virtual and physical addresses) to index the cache

At the same time as the cache is being read using the index, the virtual part of the address is translated, and the tag match uses physical addresses

This strategy allows the cache read to begin immediately, and yet the tag comparison is still with physical addresses

The limitation of this virtually indexed, physically tagged alternative is that a direct-mapped cache can be no bigger than the page size
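Stated as a constraint (a restatement of the limitation above, not a formula from the slide): the cache read uses the block offset plus index bits, and these must lie entirely within the page offset if translation is to be skipped, so

  index bits + block offset bits ≤ page-offset bits, i.e., (direct-mapped) cache size = 2^(index bits + block offset bits) ≤ page size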

Page 21: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

Example

In the figure on this slide (not reproduced here), the index is 9 bits and the cache block offset is 6 bits

To use the trick on the previous slide, what should the virtual page size be?

The virtual page size would have to be at least 2^(9+6) bytes, or 32KB

What is the size of the cache? 64KB (= 2 ways × 32KB per way)

Page 22: CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy

How to Build a Large Cache

Associativity can keep the index in the physical part of the address and yet still support a large cache

Doubling the associativity together with doubling the cache size does not change the size of the index

Pentium III, with 8KB pages, avoids translation with its 16KB cache by using 2-way set associativity

IBM 3033 cache is 16-way set associative, even though studies show that there is little benefit to miss rates above 8-way associativity. This high associativity allows a 64KB cache to be addressed with a physical index, despite the handicap of 4KB pages.
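A small sketch of the same arithmetic in C, using only figures quoted on these slides: for a virtually indexed, physically tagged cache, the capacity that can be indexed without address translation is the page size times the associativity.

#include <stdio.h>

/* Largest virtually indexed, physically tagged cache that can be indexed
   without address translation: page size times associativity. The three
   examples use only figures quoted on these slides. */
static long max_vipt_cache_kb(long page_kb, int associativity)
{
    return page_kb * associativity;
}

int main(void)
{
    printf("Page 21 example: %ld KB\n", max_vipt_cache_kb(32, 2));  /* 64 KB */
    printf("Pentium III:     %ld KB\n", max_vipt_cache_kb(8, 2));   /* 16 KB */
    printf("IBM 3033:        %ld KB\n", max_vipt_cache_kb(4, 16));  /* 64 KB */
    return 0;
}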