

Cache

Thomas Finley, May 2000

Contents and Introduction

    Overview

    Direct Mapped Cache

    o Working in the Fields

    o Example

    o Practice Problems

    N-Way Set Associative Cache

    o Sizes of Fields for N-Way Set Associative

    o 2-Way Set Associative Cache

    o 3-Way Set Associative Cache

    o Set Associative vs. Direct Mapped

    Fully Associative Cache

    Unifying Theory

    Cache Design and Other Details

    o Line Size

    o Types of Misses

    o Writing to Memory

    o Sub-Blocks

    Cache Aware Programming

The purpose of this document is to help people have a more complete understanding of what memory cache is and how it works. I discuss the implementation and comparative advantages of direct mapped cache, N-way set associative cache, and fully associative cache. Also included are details about cache design. I try to give a complete treatment of the more important aspects of cache.

This is a supplement for the notes and text, not a replacement. Although this text can be read on its own without reference to the book or the notes, it is more text heavy, while the notes provide more valuable visual aids and a far more concise overview than I provide here. Also, there are many aspects of cache described in the notes that I do not describe in this version of the text, like sub-blocks.

The reason you should care is that a programmer can design software to take advantage of cache. If a programmer is aware of cache and its trappings, it is a simple matter to arrange memory accesses to take advantage of cache and get faster run times. A great proportion of computing is accessing memory, and anything that can be done to speed memory accesses can only be a good thing. Because of this, I consider cache to be one of the more important topics of computer hardware.

Because cache is such a multifaceted subject, and because learning about cache requires the introduction of a lot of new ideas, this document is necessarily long. As with all of these notes, I tried to make each section portable, so that if you only need to firm up your understanding of one subject, you may skip to the relevant section.

    Overview

The purpose of cache is to speed up memory accesses by storing recently used chunks of memory. Now, main memory (RAM) is nice and cheap. Not as cheap as disk memory, but cheap nevertheless. The disadvantage of this sort of memory is that it is a bit slow. An access from main memory into a register may take 20-50 clock cycles. However, there are other sorts of memory that are very fast, that can be accessed within a single clock cycle. This fast memory is what we call cache.

However, this fast memory is prohibitively expensive, so we can't compose our main memory entirely of it. However, we can have a small amount, say 8K to 128K on average, in which we store some of our data. But what data do we choose? The way it happens, we choose data that has been most recently used. This is a policy that takes advantage of something called spatial and temporal locality.

Spatial and temporal locality of memory accesses refers to the idea that if you access a certain chunk of memory, in the near future you're likely to access that memory, or memory near it, again. For example, you execute consecutive instructions in memory, so the next dozen or so instructions that you access are probably contained in a given contiguous block of data; the next instruction will be accessed close by. Similarly, if you're accessing elements of an array, they will be located consecutively, so it makes sense to store a block of data that has recently been used. Separate variables used in a function are usually close by as well. That is what cache does.

When we want to access memory at a certain address, we look in the cache to see if it is there. If it is there, we get it from cache instead of going all the way to memory; that situation is called a "cache hit." If the data is not in cache, then we bring the data from memory to cache, which takes longer since the data must be brought in from the slower main memory; that situation is called a "cache miss," and the delay that we endure because that memory must be brought from main memory is called a "miss penalty."
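Although the original notes do not give it, a standard formula (consistent with the description above) summarizes the effect of cache on performance as the average memory access time:

    average access time = hit time + miss rate x miss penalty

For example, a 1-cycle hit time, a 5% miss rate, and a 40-cycle miss penalty give an average of 1 + 0.05 x 40 = 3 cycles per access.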

    Direct Mapped Cache

Direct mapped cache works like this. Picture cache as an array with elements. These elements are called "cache blocks." Each cache block holds a "valid bit" that tells us if anything is contained in the line of this cache block or if the cache block has not yet had any memory put into it. There is also a block of data called a "line" which stores a recently used block of data. Perhaps most importantly, there is a number called a "tag," composed of the more significant bits of the memory address from which the line's data came.

more cache blocks up here...

[ V | T | L ]   N
[ V | T | L ]   N+1
[ V | T | L ]   N+2
[ V | T | L ]   N+3
[ V | T | L ]   N+4

more cache blocks down here...

You can think about the direct mapped cache this way. Each row in the table above represents a cache block. We have our valid bit (V), which tells us if this cache block currently holds data. Next is our tag (T), a number that tells us where in memory this block's data is from. After that, we have our line (L), which is the data that we have stored in cache.

The number to the right is just the cache index. Cache blocks are indexed like elements in an array, and there is a cache index associated with each cache block. As you see, I show blocks N through N+4, where N is some integer such that these are valid cache blocks for this particular cache.

As a nota bene, you must know that for direct mapped cache, cache sizes and line sizes (and thus the number of cache blocks) are made in powers of two. It works out better for the hardware design if this is the case. Therefore, we're not going to see direct mapped caches of 15 kilobytes, or 700 bytes, or what have you. We're working only in powers of two for direct mapped cache.

For a direct mapped cache, suppose we have a cache size of 2^M bytes with a line size of 2^L bytes. That means that there are 2^(M-L) cache blocks. Now, we use the address of the data we're trying to access to determine which of these cache blocks we should look at to determine if the data we want is already in cache.

    Working in the Fields

| Tag (32-M bits) | Index (M-L bits) | Byte Select (L bits) |

We use the address in this way. Certain bits in our 32-bit memory address have special significance. We group the bits into three distinct groups we call fields. In this discussion, M and L are defined just as they were a paragraph ago.

The rightmost L bits of the address form the byte select field. If data is found in a cache block, since addresses are byte addressed, we use this field to determine which of the bytes in a line we're trying to access.

The M-L bits just to the left of the byte select field are called the index field. As we said before, we have 2^(M-L) cache blocks in our cache. These cache blocks are ordered, just as in an array. There is a first one (with index 0) and a last one (with index 2^(M-L)-1). If we're given an address, the value represented by the bits of the index field tells us which of these cache blocks we should examine.

The leftmost 32-M bits of the address, just to the left of the index field, are called the tag field. This field is put into the tag part of the cache block, and that uniquely identifies where in main memory the data in the line of this cache block came from. If we're looking at a cache block, we presumably know what the index of this cache block is, and therefore in conjunction with the cache tag we can compose the addresses from which the data in the line came. When we store a line of data into a cache block, we store the value represented by the tag field into the tag part of the cache block.

As a side note, it is not written in stone that the index field must be composed of the bits immediately to the left of the byte select field. However, the way that memory accesses tend to work, it happens that if the index field is in the bits immediately to the left of the byte select field, when we look for data, it is in cache rather than in main memory a larger proportion of the time. While there might be some special cases in which having the index field elsewhere would yield a higher hit ratio, in general it is best to have the index field immediately to the left of the byte select field.

When we're checking cache to determine if the memory at the address we're looking for is actually in cache, we take the address and extract the index field. We find the corresponding cache block. We then check to see if this cache block is valid. If it's not valid, then nothing is stored in this cache block, so we obviously must reference main memory. However, if the valid bit is set, then we compare the tag in the cache block to the tag field of our address. If the tags are the same, then we know that the data we're looking for and the data in the cache block are the same. Therefore, we may access this data from cache. At this point we take the value in the byte select field of the address. Suppose this value is N. We are then looking for byte N in this line (hence the name byte select).
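To make that procedure concrete, here is a minimal sketch in C of a direct mapped lookup. It is my own illustration, not code from the notes or from any real hardware; the sizes match the example in the next section (M = 9, L = 5), and the names (CacheBlock, lookup, and so on) are made up.

    #include <stdbool.h>
    #include <stdint.h>

    #define L 5                        /* log2 of the line size: 32-byte lines   */
    #define M 9                        /* log2 of the cache size: 512-byte cache */
    #define NUM_BLOCKS (1 << (M - L))  /* 2^(M-L) = 16 cache blocks              */

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  line[1 << L];         /* the cached 32-byte line of data */
    } CacheBlock;

    static CacheBlock cache[NUM_BLOCKS];

    /* Returns true on a cache hit and places the requested byte in *byte_out;
       returns false on a miss, in which case the line must be fetched. */
    bool lookup(uint32_t addr, uint8_t *byte_out)
    {
        uint32_t byte_select = addr & ((1u << L) - 1);              /* rightmost L bits */
        uint32_t index       = (addr >> L) & ((1u << (M - L)) - 1); /* next M-L bits    */
        uint32_t tag         = addr >> M;                           /* leftmost bits    */

        CacheBlock *b = &cache[index];
        if (b->valid && b->tag == tag) {
            *byte_out = b->line[byte_select];
            return true;
        }
        return false;
    }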

    Example

However, these things are always more clear when we talk about a specific example. Let's talk about my first computer, a Macintosh IIcx. Suppose we have a direct mapped cache whose cache lines are 32 bytes (2^5 bytes) in length, and suppose that the cache size is 512 bytes (2^9 bytes). Since there are 512 bytes in the cache, and 32 bytes per cache block, that means that there are 512/32 = 16 cache blocks.

In reality, a 512 byte cache is a very small cache; you'd never find a cache that small in today's computers. However, it makes the example that follows manageable.

Now, suppose we have addresses that are thirty-two bits in length. In this example, taking M and L to have the same definitions as they did at the beginning of our discussion on direct mapped cache, M = 9 and L = 5. So, our byte select field is the rightmost 5 bits. Our index field is the next M-L = 4 bits. Our tag field is the leftmost 32-M = 23 bits.

This actually makes sense. Since we have 16 cache blocks, we can uniquely represent the 16 possible cache blocks we may address with 4 bits, which is the size of our index field. Similarly, we have 32 bytes that we can access in the data portion of our cache block since the line size is 32 bytes, and we may uniquely address these 32 bytes with 5 bits, and indeed that is the size of our byte select field.

To illustrate how this works, I'll go step by step through a series of memory accesses. In my table representation of cache I omitted the cache line to save space, because it really doesn't matter what the line holds. The data in cache is the data in cache, and that's all there is to it. We're interested in what memory is accessed where, because that is what affects the workings of the cache: it changes the state of the valid bits and the tag numbers of our cache blocks.

For this example, I provide a series of tables, which represent the cache blocks and their corresponding valid bits and tags. The first number in a row of the table is the index of this particular cache block. The next number is the valid bit. The rightmost number is the tag, which tells us where the data in the line (which I do not show) is from.

    Initially

    The table for the initial conditions shows that the valid bits are all set to zero, because nothing has yet been stored in cache.

    Step 1

For step one, suppose we access the memory at address 0x0023AF7C. Looking at that in binary, that is 00000000001000111010111101111100. If we separate these bits into the lengths of the fields we have determined, with dashes between each field, that is more easily read as 00000000001000111010111-1011-11100. So, our index is 1011 in binary, which is 11 in decimal. We look at index 11. Is there anything in there yet? We see there is not. Therefore, we load the data contained at memory addresses 0x0023AF60 through 0x0023AF7F into the 32 byte line of the cache block with index 11.


Note that the memory addresses between those two addresses are the memory addresses that would have the same tag and index fields if we broke the addresses into their respective fields.

Now that the data is loaded into the line, we set the tag for this cache block to be the same as the tag of our address, since this cache block will now hold a line of data from the addresses with the same index and tag. This tag is 1000111010111 in binary, which is 4567 in decimal. We also must set our valid bit to true since this cache block now holds a block of data. These changes are reflected in the table for step 1.

After these changes and the memory loads, we know that the data we want is in cache block 11. Our byte select field is 11100 in binary, which is 28 in decimal, so we get the 29th byte of our line.

Since 0 would address the first byte, 28 addresses the 29th byte. It's exactly as though the cache line is an array of chars (a byte length quantity), and the byte select is the index of the particular element of that array that we're looking at right now.
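The same decomposition can be double-checked with shifts and masks (my own arithmetic, using M = 9 and L = 5 from this example):

    byte select = 0x0023AF7C & 0x1F       = 0x1C   = 28
    index       = (0x0023AF7C >> 5) & 0xF = 0xB    = 11
    tag         = 0x0023AF7C >> 9         = 0x11D7 = 4567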

Now, this is a cache miss, but there are different sorts of cache misses. Notice that before this access no data had been loaded into cache, because no memory accesses had taken place. Because we're just starting out, this is called a "compulsory miss." We cannot help but not have data in this cache block yet, because we just started.

    Step 2

After step 1, suppose we have another memory access, this time at memory address 0x0023AE4F. Ignoring the intervening binary operations, this works out to have a byte select field of 15, an index field of 2, and a tag field of 4567.

If we look at cache block 2, we see that it is not valid. Similar to last time, we load the 32 bytes of memory from addresses 0x0023AE40 to 0x0023AE5F into the line of this cache block, change the tag to 4567, and set the valid bit to true.

The tag field is the same as in the previous operation; this was intentional on my part. As we see, the fact that the tag is the same here as it was last time is of no significance, since the two are part of completely different indexed cache blocks; they are unrelated insofar as cache is concerned.

Similar to last time, the data we want is in the cache line. Our byte select field is 15, so we access the 16th byte (since 0 would address the first byte) of the line in cache block 2.

Even though we've already had a memory access in step two, this miss is also called a compulsory miss because the cache block that we're loading into hasn't yet had any data loaded into it.

    Step 3

Suppose now we have a memory access at address 0x148FE054 (in binary, 00010100100011111110000-0010-10100). This works out to have a byte select field of 10100 in binary (20 in decimal), an index field of 0010 in binary (2 in decimal), and a tag field of 00010100100011111110000 in binary (673776 in decimal).

We look at cache block 2. We see that it is valid. We then compare the tags. 673776 is not equal to 4567, which is the tag of cache block 2. This means that, even though there is data in this cache block, the data we want is not in here. We must load the data, and we do so just as if this block had not been valid.

We load the data at addresses 0x148FE040 to 0x148FE05F into the line at cache block 2. We set the tag to 673776. We don't need to set the valid bit to one since it already is one. Now this cache block holds a new portion of data.

As you see, we had two pieces of data that wanted to be in the same cache block: there was the data we loaded in the previous step, and the data we just loaded in this step. Now, this is not a conflict miss. What would be a conflict miss is if we tried to load the data that used to be in the cache (the data that we just replaced) back into cache. When that sort of miss occurs, it is called a "conflict miss," also known as a "collision."

We also see one of the primary weaknesses of the direct mapped cache. At the end of step two we had two blocks used up, which means that we're using 64 bytes of our 512 byte cache. However, in this step, step three, because of a conflict we overwrote one of those blocks even though we had 448 bytes unused in our cache. Though this example is completely contrived, this sort of cache inefficiency is a common problem with direct mapped cache.

Once we load the data into this cache block and set the tag field appropriately, we can find the byte that we want using the byte select field, which is 20, so we access the 21st byte of the line.

    Step 4

Now suppose we try to access memory address 0x0023AF64. This works out to have a byte select field of 00100 in binary (4 in decimal), an index field of 1011 in binary (11 in decimal), and a tag field of 1000111010111 in binary (4567 in decimal).

We look at cache block 11, and it is valid, so we compare the tags. The tags are the same, it turns out, so we finally have a cache hit! The data that we want is in this cache block. Since our byte select field is 4, we access the fifth byte of this block's cache line. We don't modify cache at all; tags and valid bits remain the same, and we needn't pull any data from main memory since the memory we want has already been loaded into cache.


This block was loaded with data in step one. The memory access for step one was 0x0023AF7C. Now, this step's memory access, 0x0023AF64, is obviously a different address than that, but 0x0023AF64 and 0x0023AF7C are pretty close. Since in the memory access for 0x0023AF7C we loaded that cache block's line with memory from 0x0023AF60 to 0x0023AF7F (32 bytes in all), the data for 0x0023AF64 could be found there. So, two memory accesses may access the same data in cache even though they are not exactly the same.

This set of tables traces the 16 cache blocks of the cache example through each of the memory accesses described above. Each "step" table reflects the state of cache at the end of that step. The columns are, from left to right, the cache index, the valid bit, and the tag.

    Initially

    0 0 0

    1 0 0

    2 0 0

    3 0 0

    4 0 0

    5 0 0

    6 0 0

    7 0 0

    8 0 0

    9 0 0

    10 0 0

    11 0 0

    12 0 0

13 0 0

14 0 0

    15 0 0

    Step 1

    0 0 0

    1 0 0

    2 0 0

    3 0 0

    4 0 0

    5 0 0

    6 0 0

    7 0 0

    8 0 0

    9 0 0

    10 0 0

    11 1 4567

    12 0 0

13 0 0

14 0 0

    15 0 0

    Step 2

    0 0 0

    1 0 0

    2 1 4567

    3 0 0

    4 0 0

    5 0 0

    6 0 0

    7 0 0

    8 0 0

    9 0 0

    10 0 0

    11 1 4567

    12 0 0

13 0 0

14 0 0

    15 0 0

    Step 3

    0 0 0

    1 0 0

    2 1 673776

    3 0 0

    4 0 0

    5 0 0

    6 0 0

    7 0 0

    8 0 0

    9 0 0

    10 0 0

    11 1 4567

    12 0 0

13 0 0

14 0 0

    15 0 0

    Step 4

    0 0 0

    1 0 0

    2 1 673776

    3 0 0

    4 0 0

    5 0 0

    6 0 0

    7 0 0

    8 0 0

    9 0 0

    10 0 0

    11 1 4567

    12 0 0

13 0 0

14 0 0

    15 0 0

    Wow. That just kept going, didn't it?

    Practice Problems

These are just a couple of questions that demonstrate the most basic concepts of direct mapped cache. The answers are provided after all the questions. These aren't terribly difficult, but they should get you thinking about cache. All of these questions pertain to direct mapped cache, in a computer with 32-bit addressing.

    Questions

1. If we have a cache size of 4KB with a 16 byte line size, what will be the sizes of the three fields (tag, index, and byte select) into which we divide our address?

2. Suppose our fields are arranged as described above (first tag field, then index field, then byte select field) in a cache of 16KB with 32 byte line sizes. What would be the values of each of the three fields for the following addresses?

    a. 0x00B248AC

    b. 0x5002AEF3

    c. 0x10203000

    d. 0x0023AF7C


    Answers

1. Our cache size is 4KB = 2^12 bytes, with a line size of 16 bytes = 2^4 bytes. Therefore, our byte select field is 4 bits, our index field is 12-4 = 8 bits, which leaves 20 bits for our tag field.

2. This is actually two problems in one. We must first determine the sizes of each of the fields. Our cache size is 16KB = 2^14 bytes, with a line size of 32 bytes = 2^5 bytes. So, our byte select field is 5 bits, our index field is 14-5 = 9 bits, which leaves 18 bits for our tag field.

    a. For 0x00B248AC, tag is 0x2C9, index is 0x45, and byte select is 0xC.

    b. For 0x5002AEF3, tag is 0x1400A, index is 0x177, and byte select is 0x13.

    c. For 0x10203000, tag is 0x4080, index is 0x180, and byte select is 0x0.

    d. For 0x0023AF7C, tag is 0x8E, index is 0x17B, and byte select is 0x1C.
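As a sanity check (my own addition, not part of the original problems), the field values in question 2 can be reproduced with a few lines of C; the shift amounts come from the 5-bit byte select and 9-bit index fields worked out above.

    #include <stdio.h>

    int main(void)
    {
        /* 16KB cache (M = 14) with 32-byte lines (L = 5) */
        unsigned addrs[] = { 0x00B248AC, 0x5002AEF3, 0x10203000, 0x0023AF7C };
        for (int i = 0; i < 4; i++) {
            unsigned a = addrs[i];
            printf("0x%08X: tag 0x%X, index 0x%X, byte select 0x%X\n",
                   a, a >> 14, (a >> 5) & 0x1FF, a & 0x1F);
        }
        return 0;
    }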

    N-Way Set Associative Cache

Another sort of cache is the N-way set associative cache. This sort of cache is similar to direct mapped cache, in that we use the address to map to a certain location in the cache. The important difference is that instead of mapping to a single cache block, an address will map to several cache blocks. For example, in a 2-way set associative cache, it will map to two cache blocks. In a 5-way set associative cache, it will map to five cache blocks.

In this cache there may be several cache blocks per index. This group of cache blocks is referred to collectively as an "index set." In our direct mapped cache, there was one cache block per index set. In an N-way set associative cache, there are N cache blocks per index set.

So, we have an address, and like in direct mapped cache each address maps to a certain index set. Once the index set is found, the tags of all the cache blocks in this index set are checked in parallel. If one of those cache blocks holds the data that we're looking for, then the data for that one block is extracted.

If none of those blocks holds the data we want, then like before we have a cache miss, and the data must be loaded. If one of the cache blocks in this index set is free, then we can simply load it in there. If, however, all of the blocks are already taken, then one of those blocks must be overwritten. We choose to free up the one that has been least recently used, and write the new data in that block. This is a policy called Least Recently Used (LRU) replacement.

There are any number of ways (some better than others) of implementing a system that traces the order in which the cache blocks of an index set have been used, so that we may appropriately choose which cache block to replace if we must replace a block. It is not very important to know exactly how this is done, but you should think about ways of implementing this feature; one simple possibility is sketched below.
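Here is one straightforward (if not especially clever) way to track recency. This is my own illustration, not how real hardware or the course notes do it: every block in a set carries a counter stamp, and the block with the smallest stamp is the least recently used.

    #include <stdint.h>

    #define WAYS 4   /* a 4-way set associative cache, for the sake of example */

    typedef struct {
        int      valid[WAYS];
        uint32_t tag[WAYS];
        uint32_t last_used[WAYS];   /* larger stamp = used more recently */
        uint32_t clock;             /* per-set access counter            */
    } IndexSet;

    /* Record that block `way` of this set was just accessed. */
    void touch(IndexSet *s, int way)
    {
        s->last_used[way] = ++s->clock;
    }

    /* Choose the block to replace: a free (invalid) block if one exists,
       otherwise the least recently used one. */
    int victim(const IndexSet *s)
    {
        int lru = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!s->valid[w])
                return w;
            if (s->last_used[w] < s->last_used[lru])
                lru = w;
        }
        return lru;
    }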

    Sizes of Fields for N-Way Set Associative

As with direct mapped cache, when a memory access occurs the address is broken into three fields: tag, index, and byte select. The sizes of the fields for an N-way set associative cache are very similar to those of direct mapped cache.

For an N-way set associative cache, suppose we have a cache size of N·2^M bytes with a line size of 2^L bytes. That means that there are N·2^(M-L) cache blocks, and at N cache blocks per index set, there are 2^(M-L) index sets.

For example, suppose we have a three-way set associative cache of size 12KB, with a line size of 16 bytes. Our line size may be expressed as 2^4 bytes, so L=4. Since N=3, and 12KB = 3·4KB = 3·2^12 bytes, M=12. So, in this example, our byte select field would be 4 bits long, our index field would be 12-4=8 bits long, and our tag field would be 32-12=20 bits long.
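If it helps to double-check such numbers mechanically, a small helper like the following (my own sketch, with made-up names) derives the field widths from N, the cache size, and the line size, and reproduces the figures above: 4, 8, and 20 bits, with 768 blocks spread over 256 index sets.

    #include <stdio.h>

    /* Integer log base 2, for exact powers of two. */
    static int ilog2(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

    int main(void)
    {
        unsigned N = 3, cache_bytes = 12 * 1024, line_bytes = 16;
        int L = ilog2(line_bytes);        /* 4                              */
        int M = ilog2(cache_bytes / N);   /* 12, since cache size = N * 2^M */
        printf("byte select %d bits, index %d bits, tag %d bits\n",
               L, M - L, 32 - M);
        printf("%u cache blocks in %u index sets\n",
               N * (1u << (M - L)), 1u << (M - L));
        return 0;
    }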

Example: 2-Way Set Associative Cache

Suppose we have a series of memory accesses. For the sake of the example, suppose that the memory accesses all map to the same index set, but that the addresses' tags differ. The particular index is unimportant, since they're all to the same index set. What is important is the tag of each memory access.

For each step I provide a table that shows the tags of the single index set that we're considering. The cell on the right is the structure that tells us the order in which things were last accessed. I have it initially set so that the system will think that the first block was the least recently used block, so the first block (leftmost) will be the one first used. The table is shown at the end of a step to show what the index set looks like at that point.

Block 1     Block 2     Last accessed
No tag      No tag      Second, then first.


Now, suppose we have a series of these memory accesses to the same index set as described, with the following tags for each memory access: 0x00010, 0x8A02F, 0x00010, 0xF1EEE.

For the first access (tag 0x00010), we're assuming that this index set is empty. So, we load the data with tag 0x00010 into the first of our two cache blocks.

Block 1     Block 2     Last accessed
0x00010     No tag      First, then second.

It is important to remember that in N-way set associative caches there is a structure for each index set that tells us the order in which the index set's cache blocks were last accessed. For our two-way cache, the possible states for the order of access are "first, then second" or "second, then first." For a three-way cache, the possible states are "first, then second, then third," "first, then third, then second," etc. (There are six states in all for a three-way cache.) This tells our hardware just which block to replace if we must replace a block, so that our more recently used blocks are not disposed of before their time.

For the second access (tag 0x8A02F), we compare the tags. 0x00010 does not equal 0x8A02F, but we have the second cache block unused (that is, invalid), so we stick the data with tag 0x8A02F into the second of the two cache blocks. At this point we update the structure in this index set that lets the hardware know that the second cache block has been used most recently.

Block 1     Block 2     Last accessed
0x00010     0x8A02F     Second, then first.

For the third access (tag 0x00010), we compare the tags. The tags for the first of our two cache blocks match. Our tags don't change and we don't load any more data, but we change the structure that describes the order in which the blocks were last accessed to reflect that the first block was more recently accessed.

Block 1     Block 2     Last accessed
0x00010     0x8A02F     First, then second.

For the fourth and last access (tag 0xF1EEE), we compare the tags. There is no match, and there are no blocks free for this index set. So, what happens? We replace the block that has been least recently used. We have stored the information that, for this index set, the first cache block was more recently used. Therefore, we replace the second block. We keep the data with tag 0x00010, and discard the data with tag 0x8A02F and load the data with tag 0xF1EEE in its place into the second block. Now we change the order of access to reflect that the second block was more recently accessed than the first block.

Block 1     Block 2     Last accessed
0x00010     0xF1EEE     Second, then first.

Example: 3-Way Set Associative Cache

I'm not going to go tremendously in depth here. I'm just going to show a 3-way cache, with each block's tag shown (or the text "no tag" if the block is invalid). After each index set, there is a cell that tells us the order of "last accessed" for this index set.

Notice that we have this "order of access" field for each index set. Also notice that it doesn't suffice to record only "the third block was the least recently used," because if we were then to access that third block, we would have no way of telling which of the remaining blocks was least recently used.

While we can have up to three memory locations that map to the same index in cache at the same time, we can't have four. We therefore run into problems similar to those discussed for direct mapped cache (conflict cache misses), even though we can store more cache blocks per index set. The problem of conflict misses has been alleviated, but not solved.

A simplified view of a 3-way set associative cache with 16 index sets.

Index   Block 1   Block 2   Block 3   Last accessed
0       0x49206   0xC6F76   0x65204   Third, then first, then second.
1       0x36172   0x6F6C2   No tag    First, then second, then third.
2       0x04272   0x61756   0xE2C20   Second, then first, then third.
3       No tag    No tag    No tag    Third, then second, then first.
4       0x62757   0x42073   0x68652   First, then second, then third.
5       0x0646F   0x65732   0x06E6F   Second, then third, then first.
6       0x74206   0xC6F76   No tag    Second, then first, then third.
7       0x65206   No tag    No tag    First, then second, then third.
8       0xD652E   0x0B3F9   0x636FC   Third, then first, then second.
9       0x4D67A   0x7AB7E   No tag    Second, then first, then third.
10      0x78A7F   No tag    No tag    First, then second, then third.
11      0x22448   0x37CAD   0x3DA1E   First, then third, then second.
12      0x33C1A   0x14E66   0x5D56C   Second, then third, then first.
13      0x3414D   0x2585F   0x202E1   First, then third, then second.
14      0x295D0   0x0BC9F   No tag    Second, then first, then third.
15      0x632E0   0x255C9   0x82823   Third, then second, then first.

    Set Associative vs. Direct Mapped

When we're dealing with an N-way set associative cache, when we're looking at a particular index set we have the added burden of comparing multiple tags against the tag of our address to see if there is a match. This is done in parallel, so we must have as many comparators as there are blocks in an index set to check the tags of the blocks against the address tag. We also have to deal with the access histories of each index set, another thing we didn't have to do with direct mapped cache. So, N-way set associative cache is considerably more difficult to design and to produce, and is therefore more expensive. For the same money, an N-way set associative cache would be smaller than a direct mapped cache, though the advantages of set associative with regard to conflict misses and hit ratios often justify the trade.


    Fully Associative Cache

Fully associative cache is perhaps the easiest cache to understand and explain, because it acts intuitively as one would expect a cache to work. We load cache blocks similarly to before, and when we fill up the cache and have to replace a cache block, we dispose of the block that has been least recently accessed. It's exactly as though we take the N-way set associative cache and set N such that there is only one index set. So, we're able to ignore all the garbage about the index field. We only have a tag field and a byte select field.

This sounds good as an idea, since fully associative cache acts exactly as a cache "should"; we don't run into the problem where a more recently used block is discarded and a less recently used block is kept simply because the index set where that recently used block happened to be stored was busier. Clearly fully associative cache has its attractive aspect.

However, from a hardware designer's point of view fully associative cache is extraordinarily complex. First of all, when you are checking the cache to see if a tag is there, you must check all the tags of the cache in parallel. That means that there must be as many comparators as there are cache blocks in the entire cache. In direct mapped cache we only needed one comparator, and in N-way set associative cache we needed N comparators to check the tags of cache blocks in parallel.

Also, we must deal with the fact that we must trace the order in which blocks were accessed. When we were dealing with two or three blocks per index set, this wasn't a huge problem. However, if we have to trace the access history of all the blocks of the entire cache, supposing we had C cache blocks to keep track of, then we have C! possible states for access histories. As a concrete example, if we have something as small as a 1K cache with 32 byte line sizes, that's (1024/32)! = 32! ≈ 2.63 × 10^35 combinations for access histories. For our CPS120 graduates, those states may be represented in 118 bits (possibly less after optimizations, but don't expect me to figure that out). The point is, no matter how we implement this cache, the circuitry for handling these combinations of access histories (especially considering that the circuitry would have to do its job in at most one or two clock cycles) would be extraordinary for larger sized caches.
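For the curious, the arithmetic behind those two figures (my own restatement of the numbers above) is:

    1024 bytes / 32 bytes per line = 32 blocks
    32! ≈ 2.63 × 10^35 possible access-history orderings
    ceil(log2(32!)) = 118 bits to distinguish them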

Because there is so much hardware involved, fully associative caches are necessarily very expensive. When a fully associative cache is used at all, its size is necessarily tiny compared to direct mapped and N-way set associative caches. However, for very small caches with very few blocks, fully associative is often a logical choice.

    Unifying Theory

If this all seems too complex, the important thing to realize about all these cache types is that they are all really the same cache type: N-way set associative.

The direct mapped cache is actually the special case of one-way set associative. For each "index set" there is only one cache block. You may notice that there is no structure in direct mapped cache that keeps track of the access history of the blocks of each index set, but that is because there is only one possibility: that the one block we have was accessed least recently (or most recently, depending on how you like to look at it).

Fully associative cache is equivalent to the special case of N-way set associative cache where N is chosen such that N equals the total number of blocks in the cache, i.e., such that there is only one index set. It is true that N-way set associative cache has an index field and fully associative cache does not, but in this case the index field works out to have a length of 0 bits, i.e., there is no index field. This is consistent with our discussion of fully associative cache.

    Cache Design and Other Details

    Line Size

In our examples, we often discuss line sizes (alternatively called block sizes) of 32 bytes, or 16 bytes, or what have you. But there is no real reason why you can't have line sizes of 8 bytes, or maybe 128 bytes, or 512 bytes.

There is some appeal to having a larger line size, especially when dealing with many memory accesses with sequential addresses. Suppose, for example, you're printing a string of many characters, or accessing elements of a large array of integers for computation, or executing a large number of sequential instructions.

If we're accessing integers in an array sequentially for computation, what we're really doing is accessing four byte quantities at a time. If our line size is 32 bytes, when we first load an integer we have a cache miss, and that line is loaded into cache. The next seven accesses to the array will access that line as the integers are read sequentially, so those will be cache hits. Then an integer is read that's just beyond the line we just loaded, so we must load the next line. This is a cache miss, but again the next seven will be hits since the integers will be in that line we just loaded. This trend continues. One miss, seven hits. One miss, seven hits. Et cetera, et cetera. All in all, we have a miss rate of one eighth. However, if our line size is 64 bytes, we'll have one miss, fifteen hits, one miss, fifteen hits, a miss rate of one sixteenth. Larger line sizes take advantage of spatial locality.
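In code, the access pattern being described is nothing more exotic than a sequential scan; the miss-rate figures in the comment are my own restatement, assuming 4-byte ints and the line sizes from the paragraph above.

    #include <stdint.h>

    #define N 4096

    /* Summing a large int array touches memory sequentially.  With 32-byte
       lines, each line holds 32/4 = 8 ints, so roughly one miss is followed
       by seven hits (miss rate about 1/8).  With 64-byte lines the pattern
       becomes one miss and fifteen hits (about 1/16). */
    int64_t sum(const int32_t a[N])
    {
        int64_t total = 0;
        for (int i = 0; i < N; i++)
            total += a[i];
        return total;
    }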

Larger line sizes aren't all good, however. The miss penalty, or the time we are delayed when we have a cache miss, is increased, since when we bring the data into the cache we must copy more data, which takes more time.

The other thing is that as we increase our line size, the number of our blocks will decrease. While this needn't be extremely bad, examine the extreme case where line size equals the cache size, so there is only one cache block in our entire cache. Suppose that our cache size is 8K, so our block size is 8K. So long as we access memory in the same contiguous 8K of space, we'll be great, but if the program ever strays from that memory space we'll necessarily have a miss. If we have a program that jumps around between two distinct memory locations, accessing one area, then accessing the other, back and forth, we'll necessarily have a miss every time, whereas if we had two cache blocks, after the data is initially loaded we'll have a hit every time.

    The point is that line sizes are best taken in moderation.

    Types of Misses

A cache miss occurs when the data we're looking for is not in cache, and so we must load it from main memory. In the example for direct mapped cache, we alluded to the fact that there are different types of cache misses. We mentioned two, but there are four, and how a system designer goes about minimizing misses depends greatly on the type of misses involved.

In this explanation I provide an analogous situation. Suppose we're writing a paper in Perkins. We can take books (data) from the stacks (main memory) and put them on our desk for easy reference (cache).

    Compulsory Miss

A compulsory miss occurs at the first access of a memory location, when that location has not yet been loaded into cache. These are also called cold start misses, or first reference misses. There isn't much that can be done about this problem. However, since a compulsory miss only occurs when data is first loaded, in the long run with a program with many memory accesses, compulsory misses are not a problem.

This situation is analogous to when you're just starting your research in Perkins. Your desk is empty, so you must first get some of the books that you initially want. You can't help this, since the books won't be at the desk waiting for you when you arrive.

    Conflict Miss (Collision)

A conflict miss occurs when there are many memory accesses mapped to the same index set in a cache, so data in a block of this index set may be discarded even though another block in another, relatively unused index set might be even older. If this discarded block is later needed and must be retrieved again, that is called a conflict miss.

This doesn't have a clean analogy to my Perkins library situation, but suppose that there is a policy in the library that you may only have a certain number of books of the same color at any one time. For example, you may have a maximum of three blue books, and three red books, and three yellow books, and three fuchsia books. You may have space on the desk for more, but since you have three fuchsia books, you'll have to put a book back into the stacks before you get another.

In this analogy, the index sets are like the different colors, and the number of books is like the set associativity. In this example we had a maximum of three books of a certain color at one time, so this was like a 3-way set associative cache.

    We can alleviate this problem if we increase the number of books of a certain color we can have at once. That is, if we increase the set associativity.

We can also increase the size of the cache. All other things being equal, this has the effect of increasing the number of index sets in the cache. In our analogy that would be the same as not only enlarging the desk, but also refining the spectrum of colors. For example, at first we may only consider three colors: red, yellow, and blue. However, we may include other colors, and instead our spectrum may be red, orange, brown, yellow, aquamarine, and blue. Some colors that were considered red may now be considered orange, some colors that used to be considered yellow we may now call brown, et cetera. We increase the number of books we may have on our desk in this sneaky way. Looking outside of the analogy for a second, if we double the cache size, our index field grows by one bit. Instead of an index of 10010, for example, we could have indexes of 010010 or 110010; our "spectrum" of indexes has increased.

As you may see, a fully associative cache would not have this problem because we ignore the nonsense of index sets entirely. We only have one "index set" (if you can call it that), and you can store as many cache blocks in that index set as there is space in the cache.

    Capacity Miss

A capacity miss occurs when a memory location is accessed once, but later, because the cache fills up, that data is discarded, and we then get a miss when accessing that memory location again because the data is no longer in cache.

The analogy for this is simple. Your desk fills up, so you must send back a book to make room for new books on the desk, even though you may need the book you're sending back later.

    The solution to this is to increase the cache size; that is, get a bigger desk.

    Invalidation Miss

What happens if main memory whose data is also loaded in cache is modified? In that case the cache data has been invalidated, and if the computer looks for it again it will have to be reloaded from main memory to get the up-to-date information. You might say, "that could never happen; all memory accesses are piped through the memory system, so cache would catch it." That's not always the case. I/O devices often write directly to memory (especially in older machines), so it is possible that we may have to reload data from main memory even though it may be in cache. This is called an invalidation miss.


The analogy for this is that a new edition of a book you have on your desk comes out, and it has new or corrected information. Regardless of how recently you brought this now out-of-date book to your desk, you must get the new book rather than refer to dated material.

    Writing to Memory

I alluded to the potential problems of memory writing in my brief discussion of the invalidation miss, but for the most part throughout this text I have dodged the issue of what happens to cache in the event of a memory write.

Memory writes are difficult to implement because something in memory changes when you do a memory write, and it is difficult to keep cache and main memory consistent with whatever change was made by the program. Memory reads, on the other hand, are no problem. If all a program does is read memory, memory cannot change, so there is no need to make sure that cache and memory are consistent with each other.

Before we go about making cache and memory consistent, however, the primary question is where our data will go: will writes first go to cache, in which case we'd have to bring main memory up to date later (called a "write back" policy), or will they go directly to main memory (called a "write through" policy)? The answer is that we can choose either of these policies.

    Write Back

Write back is the policy of writing changes to cache. This is simple enough; our cache blocks stay valid since they have the latest data. The problem with that is that if modified blocks were ever overwritten in the cache, we'd have to write the now discarded data back to main memory or the changes would be lost.

One inefficient solution would be to simply write all discarded data back to main memory, but this is undesirable; if every cache miss involved not only a read from main memory but in addition a write to main memory, there would be a dramatic increase in the miss penalty. A more efficient solution is to assign a "dirty bit" to each cache block. In the course of a program's execution, if a block of data is modified we set the dirty bit, and we write the data back to main memory only if the dirty bit for the block we're discarding has been set.
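A rough sketch of that bookkeeping in C follows. This is my own illustration with made-up names, and the actual memory transfers are reduced to stubs; it only shows where the dirty bit is tested and cleared.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 32

    typedef struct {
        bool     valid;
        bool     dirty;                 /* set whenever the line is written to */
        uint32_t tag;
        uint8_t  line[LINE_SIZE];
    } CacheBlock;

    /* Stubs standing in for the real DRAM transfers. */
    static void write_line_to_memory(uint32_t tag, uint32_t index, const uint8_t *line)
    { (void)tag; (void)index; (void)line; }
    static void read_line_from_memory(uint32_t tag, uint32_t index, uint8_t *line)
    { (void)tag; (void)index; (void)line; }

    /* Replace the contents of a block on a miss, under a write back policy. */
    void replace(CacheBlock *b, uint32_t index, uint32_t new_tag)
    {
        if (b->valid && b->dirty)                          /* modified data would be lost, */
            write_line_to_memory(b->tag, index, b->line);  /* so write it back first       */
        read_line_from_memory(new_tag, index, b->line);
        b->tag   = new_tag;
        b->valid = true;
        b->dirty = false;                                  /* freshly loaded, not yet modified */
    }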

The advantage of a write back policy is that instead of writing to slower memory, we're writing to cache. Just as we're more likely to read from the same location in memory in a given chunk of time, the same can be said for memory writes. Instead of writing to memory every time we have a memory write, we only write back once we're done with that block of data.

    One disadvantage of this is that its control is somewhat complicated compared to write through.

    Write Through

Write through is the policy of writing to cache, but also writing to memory. The problem is that writing data back to RAM slows things down. The solution is to have something called a write buffer. We stick our changed data in the write buffer, and whenever we have a DRAM write cycle the write buffer is emptied and the data is changed.

A problem arises here. A write buffer exists so a computer can continue to execute instructions faster than the time it takes to write new data through to main memory. It is therefore conceivable that a whole bunch of write instructions could fill the buffer faster than the buffer's capacity to get rid of data already in the buffer.

This is not an insurmountable problem. We can install a level two cache to hold writes. This also has the effect of speeding up the process of writing back to memory, since this L2 cache will be faster than DRAM. By appropriately regulating the flow of data between DRAM and the L2 cache, write overflow may be overcome.

    Sub-blocks

This revision does not cover sub-blocks. Sorry. If it's any comfort, I don't even remember them being mentioned in lecture, and we never did anything with them. I'm writing this on vacation, so I'm working from memory; I can't remember much about them aside from the fact that sub-blocks reduce the miss penalty.

    Cache Aware Programming

In the previous section we talked about ways of reducing misses by changing the cache size, set associativity, etc. The problem is that we as programmers don't deal with the machine at that level. We already have a machine, and the cache of that machine is the cache of that machine, and we must deal with that. The only way we can minimize these misses is to modify our program to take advantage of cache. If we have two different programs that do the same thing, it is likely that one program may use cache more efficiently than the other. There are plenty of examples of equivalent cache efficient and cache inefficient programs in the notes. Depending on the sort of program you're working on, it may also be helpful to understand the basics of the memory system of the particular computer your program will run on.
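A classic illustration (my own example here, not one taken from the course notes) is traversing a two-dimensional array. C stores arrays in row-major order, so walking along a row touches consecutive addresses and enjoys spatial locality, while walking down a column jumps a whole row's worth of bytes on every access.

    #define ROWS 1024
    #define COLS 1024

    long sum_row_major(const int a[ROWS][COLS])
    {
        long total = 0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                total += a[i][j];      /* consecutive addresses: mostly cache hits */
        return total;
    }

    long sum_column_major(const int a[ROWS][COLS])
    {
        long total = 0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                total += a[i][j];      /* a stride of COLS ints per access: far more misses */
        return total;
    }

Both functions compute the same sum; on most machines the row-major version runs noticeably faster purely because of how it uses cache.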

Another source of inefficiency not mentioned (at least in my version of the notes) arises from one of the few weaknesses of object oriented programming. When you're working with objects, you call a lot of class member functions. The result is that we're accessing instructions (stored in memory) all over the place. This is cache inefficient, so if you have a C program and a C++ program that do the same thing, the C program will tend to run a bit faster, especially if the C++ program takes advantage of its object oriented nature.


    Thomas Finley 2000