CS1104 Help Session I Memory Semester II 2001/02 Colin Tan, S15-04-05, [email protected]



Page 1: CS1104 Help Session I Memory Semester II 2001/02

CS1104 Help Session I: Memory

Semester II 2001/02

Colin Tan, S15-04-05, [email protected]

Page 2: CS1104 Help Session I Memory Semester II 2001/02

Memory

• Memory can be visualized as a stack of pigeon holes. Current computers have about 128,000,000 pigeon holes.

• Each pigeon hole is given a number, starting from 0. This number is called an “address”.

• Each pigeon hole will contain either data (e.g. numbers you want to add together) or an instruction (e.g. add two numbers).

Page 3: CS1104 Help Session I Memory Semester II 2001/02

Memory

• Memory locations 0 to 3 contain instructions, while locations 4 to 6 contain data.

• Note: In reality, instructions are also encoded into numbers!

Page 4: CS1104 Help Session I Memory Semester II 2001/02

Addresses

• As mentioned, each pigeon hole has a number identifying it called an “address”.

• When the CPU requires an instruction, it sends the instruction's "address" to memory, and the memory returns the instruction at that address.
  – E.g. at IF, the CPU sends address 0 to memory, and the memory returns the instruction li t1, 5.
  – At MEM, the CPU sends address 6 to memory, and the memory returns the value 10.
  – At WB, the CPU writes the value 10 back to register t1.

Page 5: CS1104 Help Session I Memory Semester II 2001/02

Addressing Bits

• Computers work only in binary.
  – Hence the addresses generated in the previous example are also in binary!

• In general, to address a maximum of n memory locations, you need m = ⌈log₂ n⌉ bits in your address.

• Conversely, if you have m bits in your address, you can access a maximum of 2^m memory locations.
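As a quick check, using the 128,000,000-location figure from the earlier slide: m = ⌈log₂ 128,000,000⌉ = 27 bits, since 2^26 = 67,108,864 locations would be too few while 2^27 = 134,217,728 is enough. Conversely, a 27-bit address can reach at most 2^27 = 134,217,728 locations.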

Page 6: CS1104 Help Session I Memory Semester II 2001/02

Memory Hierarchy

• Motivation: not all memory is created equal.
  – Cheap memory => slow
  – Fast memory => expensive
    • DRAM: 70 ns access time, $1/MB
    • SRAM: 8 ns access time, $50/MB
  – So you can choose either:
    • fast but very small memory, OR
    • large but very slow memory.

Page 7: CS1104 Help Session I Memory Semester II 2001/02

Memory Hierarchy

• The memory hierarchy gives you a third option:
  – large, but very fast memory
    • though slower than the expensive memory mentioned earlier.

Page 8: CS1104 Help Session I Memory Semester II 2001/02

Locality

• "Locality" is a particular type of behavior exhibited by running programs:
  – Spatial locality: if a memory location has been accessed, it is very likely that its neighbors will also be accessed.
  – Temporal locality: if a memory location has been accessed, it is very likely that it will be accessed again sometime soon.

Page 9: CS1104 Help Session I Memory Semester II 2001/02

Locality - Example

• Consider the following program:

for (i = 0; i < 10; i++)
    a[i] = b[i] + c[i];
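A self-contained version of this fragment, for anyone who wants to run it (the element type int and the initial values of b and c are assumptions; the slide shows only the loop):

#include <stdio.h>

int main(void) {
    int a[10], b[10], c[10];

    /* Fill the source arrays with arbitrary values. */
    for (int i = 0; i < 10; i++) {
        b[i] = i;
        c[i] = 2 * i;
    }

    /* The loop from the slide: the arrays are walked sequentially
       (spatial locality) and the loop body is fetched on every
       iteration (temporal locality). */
    for (int i = 0; i < 10; i++)
        a[i] = b[i] + c[i];

    printf("a[9] = %d\n", a[9]);
    return 0;
}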

Page 10: CS1104 Help Session I Memory Semester II 2001/02

Locality - Example

• In memory it will look like this:

Page 11: CS1104 Help Session I Memory Semester II 2001/02

Locality - Example

• Tracing the execution of the program:

Page 12: CS1104 Help Session I Memory Semester II 2001/02

Locality - Example

• Focusing only on instruction fetches, we see that the addresses the instructions are fetched from are:

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, …

• Here we see both:
  – spatial locality (e.g. after location 0 is accessed, location 1 is accessed, then 2, etc.)
  – temporal locality (e.g. location 2 is accessed 10 times!)

Page 13: CS1104 Help Session I Memory Semester II 2001/02

Effect of Locality

• Locality means that in the short run, out of all the memory you have (perhaps up to 128,000,000 pigeon holes!), only a very small number of locations are actually being accessed!
  – In our example, across ten iterations only memory locations 2 to 10 are accessed, out of 128,000,000 possible locations!
  – What if we had a tiny amount of very fast (but expensive!) memory and kept these locations in that fast memory?
    • We could speed up access times dramatically!
  – This is the idea behind caches.

Page 14: CS1104 Help Session I Memory Semester II 2001/02

How Do Caches Help?

• The average memory access time (AMAT) is given by:

AMAT = hit_rate × Tcache + miss_rate × (Tmemory + Tcache)

Tcache = time to read the cache (8 ns for the SRAM cache)
Tmemory = time to read main memory (70 ns for DRAM)
miss_rate = probability of not finding what we want in the cache (hit_rate = 1 − miss_rate)

• Because of locality, the miss_rate is very small
  – typically about 3% to 5%.

• Here, with a 5% miss rate, our AMAT = 0.95 × 8 ns + 0.05 × (70 + 8) ns = 11.5 ns.

• Our AMAT is about 43% slower than pure SRAM memory (11.5 ns vs. 8 ns).
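A minimal sketch of the same calculation in C, using the numbers from this slide:

#include <stdio.h>

/* AMAT = hit_rate * Tcache + miss_rate * (Tmemory + Tcache) */
double amat(double miss_rate, double t_cache, double t_memory) {
    double hit_rate = 1.0 - miss_rate;
    return hit_rate * t_cache + miss_rate * (t_memory + t_cache);
}

int main(void) {
    /* 5% miss rate, 8 ns SRAM cache, 70 ns DRAM main memory. */
    printf("AMAT = %.1f ns\n", amat(0.05, 8.0, 70.0));  /* prints 11.5 ns */
    return 0;
}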

Page 15: CS1104 Help Session I Memory Semester II 2001/02

How Do Caches Help?

• What about cost? Let's consider:
  – a system with 32 MB of DRAM main memory and 512 KB of SRAM cache;
  – costs of $1/MB for DRAM and $50/MB for SRAM.

• If we had 32 MB of SRAM, the access time would be 8 ns, but the cost would be $1,600.

• With 32 MB of DRAM, the cost is only $32, but the access time is 70 ns!

• But with 32 MB of DRAM and 512 KB (1/2 MB) of SRAM, the cost is: $32 + (512/1024) × $50 = $57!

Page 16: CS1104 Help Session I Memory Semester II 2001/02

How Do Caches Help?

• So with pure SRAM, we get an 8 ns average access time at a cost of $1,600.

• With pure DRAM, our memory costs only $32, but every access takes 70 ns!

• With DRAM memory and an SRAM cache, we get an 11.5 ns average access time at $57.

• So for a performance drop of about 43%, the cost falls from $1,600 to $57, a factor of about 28 (>2700%)!

• Hence caches give us a large memory (32 MB) at close to the cost of DRAM technology ($57 vs. $32), but at close to the speed of expensive SRAM technology (11.5 ns vs. 8 ns).

Page 17: CS1104 Help Session I Memory Semester II 2001/02

Cache Architecture

• Caches consist of blocks (or lines). Each block stores data from memory.

• The block allocation problem:
  – Given data from an address A, how do we decide which block of the cache it should go to?

Page 18: CS1104 Help Session I Memory Semester II 2001/02

The Block Allocation Problem

• There are 3 possible solutions:
  – Data from each address A goes to one fixed block.
    • Direct mapped cache
  – Data from each address A can go to any block.
    • Fully associative cache
  – Data from each address A goes to a fixed set of blocks.
    • Data may be put into any block within the set.
    • Set associative cache

Page 19: CS1104 Help Session I Memory Semester II 2001/02

Direct Mapped Caches

• The value of a portion of the memory address is used to decide which block to send the data to:

Address A: | Tag | Block Index | Block Offset | Byte Offset |

• The Block Index portion is used to decide which block data from this address should go to.

Page 20: CS1104 Help Session I Memory Semester II 2001/02

Example

• The number of bits in the block index is log₂ N, where N is the total number of blocks.

• For a 4-block cache, the block index portion of the address will be 2 bits, and these 2 bits can take on the value of 00, 01, 10 or 11.

• The exact value of these 2 bits will determine which block the data for that address will go to.

Page 21: CS1104 Help Session I Memory Semester II 2001/02

Direct Mapped Addressing E.g.

• Show how an address generated by the MIPS CPU will be divided into byte offset, block offset, block index and tag portions for the following cases:

i) Block size: 1 word, 128 blocks

ii) Block size: 4 words, 64 blocks

• All MIPS addresses are 32-bit byte addresses (i.e. they address individual bytes in a word).

Page 22: CS1104 Help Session I Memory Semester II 2001/02

Case I

• Block size 1 word, 128 blocks:
  – Byte offset: 2 bits (4 bytes per word)
  – Block offset: log₂ 1 = 0 bits (single-word blocks have no block offset)
  – Block index: log₂ 128 = 7 bits
  – Tag: the remaining 32 − 7 − 0 − 2 = 23 bits

Page 23: CS1104 Help Session I Memory Semester II 2001/02

Case II

• Block size 4 words, 64 blocks:
  – Byte offset: 2 bits
  – Block offset: log₂ 4 = 2 bits
  – Block index: log₂ 64 = 6 bits
  – Tag: the remaining 32 − 6 − 2 − 2 = 22 bits

Page 24: CS1104 Help Session I Memory Semester II 2001/02

Example

• The value of the two block index bits will determine which block the data will go to, following the scheme below:
  – index 00 → cache block 0, 01 → block 1, 10 → block 2, 11 → block 3

Page 25: CS1104 Help Session I Memory Semester II 2001/02

Solving Direct-Mapped Cache Problems

• Question 7.7

• Basic formula: Blk_Addr = floor(word_address / words_per_block) mod N
  – N here is the total number of blocks in the cache.
  – This is the mathematical version of taking the value of the Block Index bits from the address.
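A sketch of the basic formula in C (the word address and cache shape in the comment are made-up illustrative values, not the ones from Question 7.7):

/* Blk_Addr = floor(word_address / words_per_block) mod N */
unsigned block_addr(unsigned word_address,
                    unsigned words_per_block,
                    unsigned num_blocks) {
    /* C integer division already floors non-negative values. */
    return (word_address / words_per_block) % num_blocks;
}

/* E.g. block_addr(37, 4, 8): floor(37/4) = 9, and 9 mod 8 = 1,
   so word address 37 maps to block 1. */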

Page 26: CS1104 Help Session I Memory Semester II 2001/02

A Complication: Multiple-Word Blocks

• Single-word blocks do not support spatial locality.
  – Spatial locality: the likelihood of accessing a neighbor of a piece of data that was just accessed is high.
  – But with single-word blocks, none of the neighbors are in the cache!
    • All accesses to neighbors that were not accessed before will miss!

Page 27: CS1104 Help Session I Memory Semester II 2001/02

An Example: Question 7.8

Page 28: CS1104 Help Session I Memory Semester II 2001/02

Accessing Individual Words

• In our example, each block has 4 words.

• But we always access memory 1 word at a time! (e.g. lw)

• Use the Block Offset to specify which of the 4 words in a block we want to read:

Address A: | Tag | Block Index | Block Offset | Byte Offset |

Page 29: CS1104 Help Session I Memory Semester II 2001/02

The Block Offset

• Number of block offset bits = log₂ M, where M is the number of words per block.

• For our example, M = 4, so the number of block offset bits is 2.

• These two bits can take on the values 00, 01, 10 and 11.

• Note that for single-word blocks, the number of block offset bits is log₂ 1 = 0, i.e. there are no block offset bits for single-word blocks.

• These values determine exactly which word within a block address A is referring to.

Page 30: CS1104 Help Session I Memory Semester II 2001/02

Who Am I? The Purpose of the Tag

• Many different addresses may map to the same block, e.g. (the middle field is the Block Index portion):

01000 00010010 00000000 00

01010 00010010 00000000 00

11011 00010010 00000000 00

• All 3 addresses are different, but all map to block 00010010.

Page 31: CS1104 Help Session I Memory Semester II 2001/02

Disambiguation

• We need a way to disambiguate the situation.
  – Otherwise, how do we know that the data in block x actually comes from address A and not from another address A′ that has the same block index bit value?

• The portion of address A to the left of the Block Index can be used for disambiguation.

• This portion is called the tag, and the tag for address A is stored in the cache together with address A's data.

Page 32: CS1104 Help Session I Memory Semester II 2001/02

The Tag

• When we access the cache, the Tag portion and Block Index portions of address A are extracted.

• The Block Index portion will tell the cache controller which block of cache to look at.

• The Tag portion is compared against the tag stored in the block. If the tags match, we have a cache hit. The data is read from the cache.

[Figure: a cache block holds a Tag plus four data words (Word 00, Word 01, Word 10, Word 11), one for each block offset value 00-11.]
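A minimal sketch of this lookup in C, assuming the running example of a direct-mapped cache with 4 blocks of 4 words each (the struct layout and valid bit are illustrative; the slides do not give an implementation):

#include <stdint.h>
#include <stdbool.h>

#define NUM_BLOCKS      4
#define WORDS_PER_BLOCK 4

struct cache_block {
    bool     valid;                 /* does this block hold real data?  */
    uint32_t tag;                   /* tag of the address cached here   */
    uint32_t data[WORDS_PER_BLOCK]; /* one word per block offset value  */
};

static struct cache_block cache[NUM_BLOCKS];

/* Returns true on a hit, storing the requested word in *out. */
bool cache_read(uint32_t addr, uint32_t *out) {
    uint32_t block_offset = (addr >> 2) & (WORDS_PER_BLOCK - 1); /* bits 3:2 */
    uint32_t index        = (addr >> 4) & (NUM_BLOCKS - 1);      /* bits 5:4 */
    uint32_t tag          =  addr >> 6;                          /* the rest */

    struct cache_block *b = &cache[index];
    if (b->valid && b->tag == tag) {   /* tags match: cache hit */
        *out = b->data[block_offset];
        return true;
    }
    return false;                      /* miss: fetch from main memory */
}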

Page 33: CS1104 Help Session I Memory Semester II 2001/02

Accessing Individual Bytes

• MIPS addresses are byte addresses, and actually index individual bytes rather than words.

• Each MIPS word consists of 4 bytes.

• The byte offset tells us exactly which byte within a word we are referring to:

Address A: | Tag | Block Index | Block Offset | Byte Offset |
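As an illustration, here is how all four fields could be extracted in C for the Case II layout above (4 words per block, 64 blocks, 32-bit byte addresses); the sample address is arbitrary:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr = 0x12345678;                  /* arbitrary example address */

    uint32_t byte_offset  =  addr        & 0x3;  /* 2 bits: byte within word  */
    uint32_t block_offset = (addr >> 2)  & 0x3;  /* 2 bits: word within block */
    uint32_t block_index  = (addr >> 4)  & 0x3F; /* 6 bits: one of 64 blocks  */
    uint32_t tag          =  addr >> 10;         /* remaining 22 bits         */

    printf("tag=%u index=%u block_offset=%u byte_offset=%u\n",
           tag, block_index, block_offset, byte_offset);
    return 0;
}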

Page 34: CS1104 Help Session I Memory Semester II 2001/02

Advantages & Disadvantages of Direct-Mapped Caches

• Advantages:
  – Simple to implement.
  – Fast performance: less time to detect a cache hit => less time to get data from the cache => faster performance.

• Disadvantages:
  – Poor support for temporal locality.
    • Many addresses may map to the same block.
    • The next time address A is accessed, its data may have been replaced by the contents of address A′.

Page 35: CS1104 Help Session I Memory Semester II 2001/02

Improving Temporal Locality: The Fully Associative Cache

• In the fully associative cache, data from an address A can go to any block in the cache.
  – In practice, data will go into the first available cache block.
  – When the cache is full, a replacement policy is invoked to choose which block of the cache to throw out.

Page 36: CS1104 Help Session I Memory Semester II 2001/02

Advantages and Disadvantages: Fully Associative Cache

• Good temporal locality properties
  – Flexible block placement allows smart replacement policies, such that blocks that are likely to be referenced again will not be replaced, e.g. LRU, LFU.

• Disadvantages
  – Complex and too expensive for large caches.
    • Each block needs a comparator to check the tag.
    • With 8192 blocks, we need 8192 comparators!

Page 37: CS1104 Help Session I Memory Semester II 2001/02

A Compromise: Set Associative Caches

• Represents a compromise between direct-mapped and fully associative caches.

• Cache is divided into sets of blocks.

• An address A is mapped directly to a set using a similar scheme as for direct mapped caches.

• Once the set has been determined, the data from A may be stored in any block within that set: fully associative within a set!

Page 38: CS1104 Help Session I Memory Semester II 2001/02

Set Associative Cache

• An n-way set associative cache will have n blocks per set.

• For example, for a 16-block cache that is implemented as a 2-way set associative cache, each set has 2 blocks, and we have a total of 8 sets.
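For this 16-block, 2-way example, the set is chosen exactly the way a direct-mapped cache chooses a block, only over sets instead of blocks: Set = Blk_Addr mod 8, so log₂ 8 = 3 set index bits, and the data may then be placed in either of the 2 blocks of that set.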

Page 39: CS1104 Help Session I Memory Semester II 2001/02

Advantages and Disadvantages: Set Associative Cache

• Advantages
  – Almost as simple to build as a direct-mapped cache.
  – Only n comparators are needed for an n-way set associative cache; for 2-way set-associative, only 2 comparators are needed to compare tags.
  – Supports temporal locality by having full associativity within a set.

Page 40: CS1104 Help Session I Memory Semester II 2001/02

Advantages and Disadvantages: Set Associative Cache

• Disadvantages
  – Not as good as a fully-associative cache in supporting temporal locality.
  – For LRU schemes, because of the small associativity, it is actually possible to have a 0% hit rate for temporally local data.
  – E.g. if our accesses are A1 A2 A3 A1 A2 A3, and A1, A2 and A3 all map to the same 2-way set, then the hit rate is 0%, as each subsequent access replaces a previous one under the LRU scheme.
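A tiny simulation of that worst case (one 2-way LRU set; the tags 1, 2, 3 stand in for A1, A2, A3 and are arbitrary):

#include <stdio.h>

int main(void) {
    int way[2] = {-1, -1};             /* tags held by the two ways (-1 = empty) */
    int lru = 0;                       /* which way is least recently used       */
    int refs[6] = {1, 2, 3, 1, 2, 3};  /* the access pattern A1 A2 A3 A1 A2 A3   */
    int hits = 0;

    for (int i = 0; i < 6; i++) {
        if (way[0] == refs[i] || way[1] == refs[i]) {
            hits++;                       /* found in the set: hit       */
            lru = (way[0] == refs[i]);    /* the other way becomes LRU   */
        } else {
            way[lru] = refs[i];           /* miss: evict the LRU block   */
            lru = 1 - lru;                /* the other way is now LRU    */
        }
    }
    printf("hits = %d of 6\n", hits);     /* prints: hits = 0 of 6 */
    return 0;
}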

Page 41: CS1104 Help Session I Memory Semester II 2001/02

Multi-level Cache

• Let the first level of cache (closest to the CPU) be called "L1", and the next level "L2".

• Let P_hit_L1 be the hit rate of L1, T_cache_L1 the access time of L1, and T_miss_L1 the miss penalty of L1.

• AMAT of L1 = P_hit_L1 × T_cache_L1 + (1 − P_hit_L1) × T_miss_L1

• What is T_miss_L1?
  – If L1 misses, then we attempt to get the data from L2. Hence T_miss_L1 is actually just the AMAT of L2!

• Likewise, let P_hit_L2 be the hit rate of L2, T_cache_L2 the access time of L2, and T_miss_L2 the miss penalty of L2.

Page 42: CS1104 Help Session I Memory Semester II 2001/02

Multilevel Cache

• T_miss_L1 = AMAT of L2 = P_hit_L2 × T_cache_L2 + (1 − P_hit_L2) × T_miss_L2

• Substituting this back, we get:

AMAT of L1 = P_hit_L1 × T_cache_L1 + (1 − P_hit_L1) × (P_hit_L2 × T_cache_L2 + (1 − P_hit_L2) × T_miss_L2)

• T_miss_L2 is, of course, the time taken to access the slow DRAM main memory.

• What if we had an L3 cache?
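A sketch of the answer in C: the miss penalty of each level is just the AMAT of the level below it, so the formula applies recursively, and an L3 is simply one more level. The hit rates and access times below are made-up illustrative values, not figures from the slides:

#include <stdio.h>

/* AMAT of a level = P_hit * T_cache + (1 - P_hit) * T_miss,
   where T_miss is the AMAT of the next level down. */
double amat_multilevel(const double p_hit[], const double t_cache[],
                       int levels, double t_dram) {
    if (levels == 0)
        return t_dram;   /* below the last cache level sits DRAM */
    double t_miss = amat_multilevel(p_hit + 1, t_cache + 1,
                                    levels - 1, t_dram);
    return p_hit[0] * t_cache[0] + (1.0 - p_hit[0]) * t_miss;
}

int main(void) {
    double p_hit[]   = {0.95, 0.90, 0.80};  /* L1, L2, L3 hit rates */
    double t_cache[] = {1.0, 5.0, 20.0};    /* access times in ns   */
    printf("AMAT = %.2f ns\n", amat_multilevel(p_hit, t_cache, 3, 70.0));
    return 0;
}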

Page 43: CS1104 Help Session I Memory Semester II 2001/02

Other Problems

• Question 7.9

Page 44: CS1104 Help Session I Memory Semester II 2001/02

Virtual Memory: Motivation

• Drive space is very, very cheap
  – typically about 2 cents per megabyte.
  – It would be ideal if we could set aside a portion of drive space to be used as memory.
  – Unfortunately, disk drives are very slow.
    • The fastest access times are about 10 ms, which is roughly a million times slower than SRAM and over a hundred thousand times slower than DRAM.

• Idea: use drive space as memory, and use main memory to cache the drive space!
  – This is the idea behind virtual memory.

Page 45: CS1104 Help Session I Memory Semester II 2001/02

Main Idea

• Virtual memory (residing on disk) is cached by main memory.

• Main memory is cached by the system cache.

• All memory transfers take place only between consecutive levels (e.g. VM to main memory, main memory to cache).

[Figure: Virtual Memory is cached by Main Memory, which is in turn cached by the System Cache.]

Page 46: CS1104 Help Session I Memory Semester II 2001/02

Cache vs. VM

• The concept behind VM is almost identical to the concept behind a cache.

• But the terminology is different:
  – Cache: Block <-> VM: Page
  – Cache: Cache Miss <-> VM: Page Fault

• Caches are implemented completely in hardware; VM is implemented in software, with hardware support from the CPU.

• The cache speeds up main memory access, while main memory speeds up VM access.

Page 47: CS1104 Help Session I Memory Semester II 2001/02

Technical Issues of VM

• Cache misses are relatively cheap to remedy.
  – The miss penalty is essentially the time taken to access main memory (around 60-80 ns).
  – The pipeline freezes for about 60-80 cycles.

• Page faults are EXPENSIVE!
  – The page fault penalty is the time taken to access the disk.
  – This may take 50 ms or more, depending on the speed of the disk and I/O bus.
  – That wastes millions of processor cycles!

Page 48: CS1104 Help Session I Memory Semester II 2001/02

Virtual Memory Design

• Because page-fault penalties are so heavy, it is not practical to implement direct-mapped or set-associative architectures.
  – These have poorer hit rates.

• Main memory caching of VM is always fully associative.
  – This gives perhaps a 1% or 2% improvement in hit rate over direct-mapped or set-associative designs.
  – But with heavy page-fault penalties, a 1% improvement is A LOT!

• It is also relatively cheap to implement full associativity in software.

Page 49: CS1104 Help Session I Memory Semester II 2001/02

Summary

• Memory can be thought of as pigeon holes where CPU stores instructions and data.

• Each pigeon hole (memory location) is given a number called its address.

• Memory technology can be cheap and slow (DRAM) or fast and expensive (SRAM)

• Locality allows us to use a small amount of fast expensive memory to store parts of the cheap and slow memory to improve performance.

• Caches are organized into blocks.

Page 50: CS1104 Help Session I Memory Semester II 2001/02

Summary

• The mapping between memory addresses and cache blocks can be accomplished by:
  – directly mapping a memory location to one cache block (direct mapped);
  – slotting a memory location into any block (fully associative);
  – mapping a memory location to a set of blocks, then slotting it into any block within that set (set associative).

• Virtual memory uses disk space as "main memory", DRAM main memory as a cache for the disk, and SRAM as a cache for the DRAM.