TRANSCRIPT
Memory Subsystem Design
or
Nothing Beats Cold, Hard Cache
Print out the two PI questions on cache diagrams and bring copies to class for students to work on. Reading: 5.3, 5.4
Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
The memory subsystem
[Diagram: a computer comprises a datapath, control, memory, input, and output.]
Movie Rental Store
• You have a huge warehouse with EVERY movie ever made (hits, training films, etc.).
• Getting a movie from the warehouse takes 15 minutes.
• You can’t stay in business if every rental takes 15 minutes.
• You have some small shelves in the front office.
[Diagram: small front office in front of a large warehouse.]
Think for a bit about what you might do to improve this (on your own).
Here are some suggested improvements to the store:
1. Whenever someone rents a movie, just keep it in the front office for a while in case someone else wants to rent it.
2. Watch the trends in movie watching and attempt to guess movies that will be rented soon – put those in the front office.
3. Whenever someone rents a movie in a series (Star Wars), grab the other movies in the series and put them in the front office.
4. Buy motorcycles to ride in the warehouse to get the movies faster.
Extending the analogy to locality in caches, which pair of changes most closely corresponds to spatial and temporal locality?
Selection Spatial Temporal
A 2 1
B 4 2
C 4 3
D 3 1
E None of the above
Memory Locality
• Memory hierarchies take advantage of memory locality.
• Memory locality is the principle that future memory accesses are near past accesses.
• Memories take advantage of two types of locality:
– temporal locality: near in time => we will often access the same data again very soon
– spatial locality: near in space/distance => our next access is often very close to our last access (or recent accesses)
(this sequence of addresses exhibits both temporal and spatial locality)
1,2,3,1,2,3,8,8,47,9,10,8,8...
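To make the two kinds concrete, here is a minimal C sketch (illustrative only; the array size and trip counts are arbitrary, not from the slides). The first loop leans on spatial locality, the second on temporal locality:

#include <stdio.h>

int main(void) {
    int a[1024];
    long sum = 0;

    /* Spatial locality: consecutive elements share a cache line,
       so after one miss the next several accesses hit. */
    for (int i = 0; i < 1024; i++)
        a[i] = i;

    /* Temporal locality: the same eight words are re-read on every
       pass, so they stay resident in the cache. */
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i < 8; i++)
            sum += a[i];

    printf("%ld\n", sum);
    return 0;
}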
From the book we know SRAM is very fast, expensive ($/GB), and small. We also know disks are slow, inexpensive ($/GB), and large. Which statement best describes the role of caching when it works?
Selection Role of caching
A Locality allows us to keep frequently touched data in SRAM.
B Locality allows us the illusion of memory as fast as SRAM but as large as a disk.
C SRAM is too expensive to have large – so it must be small and caching helps use it well.
D Disks are too slow – we have to have something faster for our processor to access.
E None of these accurately describes the role of caching.
Locality and caching
• Memory hierarchies exploit locality by caching (keeping close to the processor) data likely to be used again.
• This is done because we can build large, slow memories and small, fast memories, but we can’t build large, fast memories.
SRAM access times are 0.5 – 2.5ns at cost of $2000 to $5000 per GB.
DRAM access times are 60-120ns at cost of $20 to $75 per GB.
Disk access times are 5 to 20 million ns at cost of $0.20 to $2 per GB.
A typical memory hierarchy
[Diagram: CPU → on-chip cache(s) → off-chip cache → main memory → disk. Levels close to the CPU are small, fast, and expensive ($/bit); levels farther away are big, slow, and cheap ($/bit).]
• So then, where is my program and data?
Cache Fundamentals
• cache hit -- an access where the data is found in the cache.
• cache miss -- an access where it isn't.
• hit time -- time to access the cache.
• miss penalty -- time to move data from a farther level to a closer one, then to the CPU.
• hit ratio -- percentage of accesses where the data is found in the cache.
• miss ratio -- (1 - hit ratio)
[Diagram: CPU ↔ lowest-level cache ↔ next-level memory/cache.]
Cache Fundamentals, cont.
• cache block size or cache line size -- the amount of data that gets transferred on a cache miss.
• instruction cache -- cache that only holds instructions.
• data cache -- cache that only caches data.
• unified cache -- cache that holds both.
Caching Issues
On a memory access -
• How do I know if this is a hit or miss?
On a cache miss -
• where to put the new data?
• what data to throw out?
• how to remember what data this is?
A simple cache
• A cache that can put a line of data anywhere is called ______________
• The most popular replacement strategy is LRU (least recently used).
tag data
(the tag identifies the address of the cached data)
4 blocks, each block holds one word, any block can hold any word.
address string:
4    00000100
8    00001000
12   00001100
4    00000100
8    00001000
20   00010100
4    00000100
8    00001000
20   00010100
24   00011000
12   00001100
8    00001000
4    00000100
Fully associative
Point out that the tag identifies the address (a pointer) and the data is the value.
Fully Associative Cache
tag data
4 blocks, each block holds one word, any block can hold any word.
addresses:
4    00 00 01 00
8    00 00 10 00
12   00 00 11 00
4    00 00 01 00
8    00 00 10 00
20   00 01 01 00
4    00 00 01 00
8    00 00 10 00
20   00 01 01 00
24   00 01 10 00
12   00 00 11 00
8    00 00 10 00
4    00 00 01 00
A simpler cache
• A cache that can put a line of data in exactly one place is called __________________.
• Advantages/disadvantages vs. fully-associative?
(an index is used to determine which line an address might be found in)
4 blocks, each block holds one word, each word in memory maps to exactly one cache location.
address string:
4    00000100
8    00001000
12   00001100
4    00000100
8    00001000
20   00010100
4    00000100
8    00001000
20   00010100
24   00011000
12   00001100
8    00001000
4    00000100
example address: 00000100
tag data
Direct Mapped
Direct Mapped Cache
tag data
4 blocks, each block holds one word, each word in memory maps to exactly one cache location.
addresses: 4 8 12 4 8 20 4 8 20 24 12 8 4
4    00 00 01 00
8    00 00 10 00
12   00 00 11 00
4    00 00 01 00
8    00 00 10 00
20   00 01 01 00
4    00 00 01 00
8    00 00 10 00
20   00 01 01 00
24   00 01 10 00
12   00 00 11 00
8    00 00 10 00
4    00 00 01 00
cache indices: 00, 01, 10, 11

Selection  Hit/miss pattern
A  MMMHHMHHHMMHM
B  MMMHHMMHHMHHH
C  MMMHHMMHMMHMM
D  MHMHHMHHHMHHM
E  None are correct
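A sketch for checking such patterns mechanically (mine, not the slides'): a direct-mapped cache with 4 one-word blocks has 2 offset bits and 2 index bits, and the C program below prints H or M for each address in the string above.

#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 4
#define OFFSET_BITS 2   /* one 4-byte word per block */
#define INDEX_BITS 2    /* 4 blocks, direct mapped */

int main(void) {
    unsigned tags[NUM_BLOCKS] = {0};
    bool valid[NUM_BLOCKS] = {false};
    unsigned addrs[] = {4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4};
    int n = sizeof addrs / sizeof addrs[0];

    for (int i = 0; i < n; i++) {
        unsigned index = (addrs[i] >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        unsigned tag = addrs[i] >> (OFFSET_BITS + INDEX_BITS);
        if (valid[index] && tags[index] == tag) {
            printf("%2u H\n", addrs[i]);
        } else {
            printf("%2u M\n", addrs[i]);  /* miss: fill this block */
            valid[index] = true;
            tags[index] = tag;
        }
    }
    return 0;
}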
An n-way set-associative cache
• A cache that can put a line of data in exactly n places is called n-way set-associative.
• The cache lines/blocks that share the same index are a cache ____________.
tag data
4 entries, each block holds one word, each word in memory maps to one of a set of n cache lines.
address string:
4    00000100
8    00001000
12   00001100
4    00000100
8    00001000
20   00010100
4    00000100
8    00001000
20   00010100
24   00011000
12   00001100
8    00001000
4    00000100
example address: 00000100
tag data
Set
2-way Set Associative Cache
tag data | tag data
4 entries, each block holds one word, each word in memory maps to one of a set of n cache lines.
addresses: 4 8 12 4 8 20 4 8 20 24 12 8 4
4    00 000 1 00
8    00 001 0 00
12   00 001 1 00
4    00 000 1 00
8    00 001 0 00
20   00 010 1 00
4    00 000 1 00
8    00 001 0 00
20   00 010 1 00
24   00 011 0 00
12   00 001 1 00
8    00 001 0 00
4    00 000 1 00
set indices: 0, 1

Selection  Hit/miss pattern
A  MMMHHMMHHMMHM
B  MMMHHMHHHMMHM
C  MMMHHMMHMMHMM
D  MHMHHMHHHMHHM
E  None are correct
Longer Cache Blocks
• Large cache blocks take advantage of spatial locality.
• Too large of a block size can waste cache space.
• Longer cache blocks require less tag space.
tag data
DM, 4 blocks, each block holds two words, each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches).
address string:
4    00000100
8    00001000
12   00001100
4    00000100
8    00001000
20   00010100
4    00000100
8    00001000
20   00010100
24   00011000
12   00001100
8    00001000
4    00000100
example address: 00000100
Longer Cache Blocks
tag data
DM, 4 blocks, each block holds two words, each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches).
addresses: 4 8 12 4 8 20 4 8 20 24 12 8 4
4    00 0 00 100
8    00 0 01 000
12   00 0 01 100
4    00 0 00 100
8    00 0 01 000
20   00 0 10 100
4    00 0 00 100
8    00 0 01 000
20   00 0 10 100
24   00 0 11 000
12   00 0 01 100
8    00 0 01 000
4    00 0 00 100
Cache Parameters
Cache size = Number of sets * block size * associativity
• 128 blocks, 32-byte block size, direct mapped, size = ?
  2^7 * 2^5 = 2^12 bytes = 4 KB
• 128 KB cache, 64-byte blocks, 512 sets, associativity = ?
  2^17 / 2^6 = 2^11 blocks; 2^11 blocks / 2^9 sets = 2^2 = 4-way
Draw it
Equations (all "sizes" are in bytes):
1. log2(block_size)
2. log2(cache_size / (assoc * block_size))
3. 32 - log2(cache_size / assoc)
Selection # tag bits # index bits # block offset bits
A 3 2 1
B 1 2 3
C 1 3 2
D 2 1 3
E None of the above
Descriptions of caches:
1. Exceptional usage of the cache space in exchange for a slow hit time
2. Poor usage of the cache space in exchange for an excellent hit time
3. Reasonable usage of cache space in exchange for a reasonable hit time
Selection  Fully-Associative  8-way Set-Associative  Direct Mapped
A 3 2 1
B 3 3 2
C 1 2 3
D 3 2 1
E None of the above
Cache Associativity
[Figure omitted.]
Block Size and Miss Rate
[Figure omitted.]
Handling a Cache Access
1. Use index and tag to access cache and determine hit/miss.
2. If hit, return requested data.
3. If miss, select a cache block to be replaced, and access memory or next lower cache (possibly stalling the processor).
-load entire missed cache line into cache
-return requested data to CPU (or higher cache)
4. If the next lower memory is a cache, go to step 1 for that cache.
[Pipeline diagram: IF ID EX MEM WB, with the instruction cache (ICache) accessed in IF and the data cache (DCache) in MEM.]
Accessing a Sample Cache
• 64 KB cache, direct-mapped, 32-byte cache block size
Address (32 bits): tag = bits 31-16 (16 bits), index = bits 15-5 (11 bits), block offset = bits 4-0 (5 bits).
64 KB / 32 bytes = 2K cache blocks/sets, so the cache has rows 0 through 2047, each holding a valid bit, a 16-bit tag, and 256 bits (32 bytes) of data.
The 11-bit index selects a row; an equality compare (=) of the stored tag against the address tag, qualified by the valid bit, produces hit/miss, while the block offset selects the 32-bit word within the block.
Point out the valid bit – show that the data can be grabbed in parallel with the tag compare.
Accessing a Sample Cache
• 32 KB cache, 2-way set-associative, 16-byte block size
Address (32 bits): tag = bits 31-14 (18 bits), index = bits 13-4 (10 bits), block offset = bits 3-0 (4 bits).
32 KB / 16 bytes / 2 = 1K cache sets, rows 0 through 1023, each holding two ways: (valid, tag, data) per way.
The 10-bit index selects a set; both ways' 18-bit tags are compared (=) against the address tag in parallel, and a hit in either valid way returns that way's data.
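A possible C rendering of the address split used in both sample caches (the OFFSET_BITS/INDEX_BITS values below are the ones from this 32 KB, 2-way, 16-byte-block example; the example address is arbitrary):

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 4   /* 16-byte blocks */
#define INDEX_BITS 10   /* 1K sets */

int main(void) {
    uint32_t addr = 0x12345678;  /* arbitrary example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
    return 0;
}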
for (int i = 0; i < 10000000; i++)
    sum += A[i];
Assume each element of A is 4 bytes and sum is kept in a register. Assume a baseline direct-mapped 32KB L1 cache with 32 byte blocks. Which changes would help the hit rate of the above code?
Selection Change
A Increase to 2-way set associativity
B Increase block size to 64 bytes
C Increase cache size to 64 KB
D A and C combined
E A, B, and C combined
Isomorphic
for (int i = 0; i < 10000000; i++)
    for (int j = 0; j < 8192; j++)
        sum += A[j] - B[j];
Assume each element of A and B is 4 bytes and each array is at least 32 KB in size. Assume sum is kept in a register. Assume a baseline direct-mapped 32KB L1 cache with 32-byte blocks. Which changes would help the hit rate of the above code?
Selection Change
A Increase to 2-way set associativity
B Increase block size to 64 bytes
C Increase cache size to 64 KB
D A and C combined
E A, B, and C combined
Assume a 1KB cache with 64-byte blocks. Assume the following byte addresses are repeatedly accessed in a loop.
Selection Address Stream
A 1
B 2
C 3
D None of the above, 2-way always has a better HR than DM
E None of the above
Stream 1:
000000 0000 000000
000000 0100 000100
000001 0000 000000

Stream 2:
000000 0000 000000
000000 1000 000100
000001 0000 000000

Stream 3:
000000 0000 000000
000000 0100 000100
000001 1000 000000
*The addresses above are broken up (bitwise) for a DM Cache.
For which of the address streams above does a 2-way set-associative cache (same size cache, same block size) suffer a worse hit rate than a DM cache?
Stream 1: 33% DM, 100% 2-way
Stream 2: 33% DM, 0% 2-way
Stream 3: 100% DM, 100% 2-way
Dealing with Stores
• Stores must be handled differently than loads, because...
– they don't necessarily require the CPU to stall.
– they change the content of cache/memory (creating memory consistency issues).
– they may require both a load and a store to complete.
Draw value in cache vs. not in cache
There have been a number of issues glossed over – we’ll cover those now
Policy decisions for stores
• Keep memory and cache identical?
– Write-through => all writes go to both cache and main memory.
– Write-back => writes go only to cache. Modified cache lines are written back to memory when the line is replaced.
• Make room in cache for store miss?
– Write-allocate => on a store miss, bring the written line into the cache.
– Write-around => on a store miss, ignore the cache.
Store Policies
• Given either high store locality or low store locality, which policies might you expect to find?
Selection  High Locality (Miss / Hit)       Low Locality (Miss / Hit)
A          Write-allocate / Write-through   Write-around / Write-back
B          Write-around / Write-through     Write-allocate / Write-back
C          Write-allocate / Write-back      Write-around / Write-through
D          Write-around / Write-back        Write-allocate / Write-through
E None of the above
Dealing with stores
• On a store hit, write the new data to cache. In a write-through cache, write the data immediately to memory. In a write-back cache, mark the line as dirty.
• On a store miss, initiate a cache block load from memory for a write-allocate cache. Write directly to memory for a write-around cache.
• On any kind of cache miss in a write-back cache, if the line to be replaced in the cache is dirty, write it back to memory.
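As a summary, here is a small runnable C sketch of these rules (illustrative only: a one-block "cache" with made-up helper state, not a real design):

#include <stdio.h>
#include <stdbool.h>

typedef enum { WRITE_THROUGH, WRITE_BACK } hit_policy;
typedef enum { WRITE_ALLOCATE, WRITE_AROUND } miss_policy;

static unsigned mem[256];                  /* "main memory" */
static unsigned cache_addr, cache_data;    /* a one-block cache */
static bool cache_valid, cache_dirty;

static void store(unsigned addr, unsigned data, hit_policy hp, miss_policy mp) {
    if (cache_valid && cache_addr == addr) {     /* store hit */
        cache_data = data;
        if (hp == WRITE_THROUGH)
            mem[addr] = data;                    /* keep memory identical */
        else
            cache_dirty = true;                  /* defer write to eviction */
    } else if (mp == WRITE_ALLOCATE) {           /* store miss: bring line in */
        if (cache_valid && cache_dirty)
            mem[cache_addr] = cache_data;        /* write back dirty victim */
        cache_valid = true;
        cache_dirty = false;
        cache_addr = addr;
        cache_data = mem[addr];                  /* load the missed line */
        store(addr, data, hp, mp);               /* retry: now a hit */
    } else {                                     /* write-around: bypass cache */
        mem[addr] = data;
    }
}

int main(void) {
    store(5, 42, WRITE_BACK, WRITE_ALLOCATE);
    printf("cache=%u mem=%u (write-back leaves memory stale)\n",
           cache_data, mem[5]);
    return 0;
}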
Cache Performance
CPI = BCPI + MCPI
– BCPI = base CPI, which means the CPI assuming perfect memory (BCPI = peak CPI + PSPI + BSPI)
  PSPI => pipeline stalls per instruction
  BSPI => branch hazard stalls per instruction
– MCPI = the memory CPI, the number of cycles (per instruction) the processor is stalled waiting for memory.
  MCPI = accesses/instruction * miss rate * miss penalty
  – this assumes we stall the pipeline on both read and write misses, that the miss penalty is the same for both, and that cache hits require no stalls.
  – If the miss penalty or miss rate is different for the instruction cache and data cache (the common case), then
    MCPI = I$ accesses/inst * I$MR * I$MP + D$ accesses/inst * D$MR * D$MP
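As a quick check of the split-cache formula, a few lines of C (the numbers are those of the worked example on the next slide):

#include <stdio.h>

int main(void) {
    double bcpi  = 1.0;            /* base CPI, perfect memory */
    double i_acc = 1.0;            /* I-cache accesses per instruction */
    double i_mr  = 0.04, i_mp = 12.0;
    double d_acc = 0.20;           /* loads/stores per instruction */
    double d_mr  = 0.10, d_mp = 12.0;

    double mcpi = i_acc * i_mr * i_mp + d_acc * d_mr * d_mp;
    printf("CPI = %.2f\n", bcpi + mcpi);   /* prints CPI = 1.72 */
    return 0;
}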
Cache Performance
• Instruction cache miss rate of 4%, data cache miss rate of 10%, BCPI = 1.0 (no data or control hazards), 20% of instructions are loads and stores, miss penalty = 12 cycles, CPI = ?
CPI = 1 + accesses/inst * miss rate * miss penalty
CPI = 1 + (1.0)*0.04*12 + 0.2*0.10*12 = 1 + 0.48 + 0.24 = 1.72
Selection CPI (rounded if necessary)
A 1.24
B 1.34
C 1.48
D 1.72
E None of the above
Example -- DEC Alpha 21164 Caches
[Diagram: 21164 CPU core with on-chip instruction cache and data cache, a unified on-chip L2 cache, and an off-chip L3 cache.]
• ICache and DCache -- 8 KB, DM, 32-byte lines
• L2 cache -- 96 KB, ?-way SA, 32-byte lines
• L3 cache -- 1 MB, DM, 32-byte lines
Cache Alignment
• The data that gets moved into the cache on a miss are all data whose addresses share the same tag and index (regardless of which data gets accessed first).
• This results in
– no overlap of cache lines
– easy mapping of addresses to cache lines (no additions)
– data at address X always being present in the same location in the cache block (at byte X mod blocksize) if it is there at all.
• Think of main memory as organized into cache-line sized pieces (because in reality, it is!).
[Diagram: a memory address split into tag, index, and block offset; main memory drawn as cache-line-sized pieces numbered 0, 1, 2, 3, ...]
Three types of cache misses
• Compulsory (or cold-start) misses
– first access to the data.
• Capacity misses
– we missed only because the cache isn't big enough.
• Conflict misses
– we missed because the data maps to the same line as other data that forced it out of the cache.
tag data
address string:
4    00000100
8    00001000
12   00001100
4    00000100
8    00001000
20   00010100
4    00000100
8    00001000
20   00010100
24   00011000
12   00001100
8    00001000
4    00000100
DM cache
Reading Quiz Variant
• Suppose you experience a cache miss on a block (let's call it block A). You have accessed block A in the past. There have been precisely 1027 different blocks accessed between your last access to block A and your current miss. Your block size is 32-bytes and you have a 64KB cache. What kind of miss was this?
Explain the way to know if it is a capacity vs. conflict miss – would a fully associative cache of the same size get a miss?
Selection Cache Miss
A Compulsory
B Capacity
C Conflict
D Both Capacity and Conflict
E None of the above
So, then, how do we decrease...
• Compulsory misses? Block size, prefetch.
• Capacity misses? Increase cache size.
• Conflict misses? Increase associativity.
Cache Miss Components
[Figure omitted: miss rate broken into capacity misses and conflict misses for one-way, two-way, and four-way caches.]
LRU replacement algorithms
• only needed for associative caches
• requires one bit for 2-way set-associative, 8 bits for 4-way, 24 bits for 8-way.
• can be emulated with log n bits (NMRU)
• can be emulated with use bits for highly associative caches (like page tables)
• However, for most caches (e.g., associativity <= 8), LRU is calculated exactly.
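For the 2-way case, the single LRU bit per set is easy to sketch in C (names below are illustrative, not from the slides):

#include <stdio.h>

#define NUM_SETS 4
static int lru[NUM_SETS];   /* per set: the way (0 or 1) used least recently */

/* On any access to a way, the OTHER way becomes least recently used. */
static void touch(int set, int way) { lru[set] = 1 - way; }

/* On a miss, replace the least-recently-used way. */
static int victim(int set) { return lru[set]; }

int main(void) {
    touch(2, 0);                              /* access way 0 of set 2 */
    printf("replace way %d\n", victim(2));    /* prints: replace way 1 */
    return 0;
}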
Caches in Current Processors
• A few years ago, caches were DM at the highest level (closest to the CPU) and associative farther away; this is less true today. Now they are less associative near the processor (4-8 way) and more associative farther away (8-16 way).
• split I and D close to the processor (for throughput rather than miss rate), unified further away.
• write-through and write-back both common, but never write-through all the way to memory.
• 64-byte cache lines common (but getting larger)
• Non-blocking – the processor doesn't stall on a miss, but only on the use of the missed data (if even then)
– this means the cache must be able to keep track of multiple outstanding accesses.
Prefetching
• “Watch the trends in movie watching and attempt to guess movies that will be rented soon – put those in the front office.”
• Hardware Prefetching
– suppose you are reading a single field from each element of an array of large objects
– hardware determines the "stride" and starts grabbing values early
• Software Prefetching
– issue a load into $0 a fair number of instructions before the value is needed
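In C on GCC/Clang, software prefetching is usually expressed with the __builtin_prefetch intrinsic rather than a literal load into $0 (which is MIPS-specific). A hedged sketch, with a guessed prefetch distance that would need tuning:

#include <stdio.h>

long sum_with_prefetch(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64]);  /* hint: fetch ahead of use */
        sum += a[i];
    }
    return sum;
}

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++) a[i] = 1;
    printf("%ld\n", sum_with_prefetch(a, 1024));
    return 0;
}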
Writing Cache-Aware Code
• Focus on your working set.
• If your "working set" fits in L1 it will be vastly better than a "working set" that fits only on disk.
• If you have a large data set, do processing on it in chunks (see the sketch below).
• Think about regularity in data structures (can a prefetcher guess where you are going, or are you pointer chasing?).
• Instrumentation tools (PIN, Atom, PEBIL) can often help you analyze your working set.
• Profiling can give you an idea of which section of code is dominant, which can tell you where to focus.
HW – matrix example
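One way to read "do processing in chunks", sketched as a blocked matrix transpose (the matrix size and BLOCK tile are illustrative guesses, not course values): each BLOCK x BLOCK tile of both arrays fits in cache, so lines are reused before being evicted.

#include <stdio.h>

#define N 512
#define BLOCK 32   /* tile chosen so a BLOCK x BLOCK chunk fits in L1; tune it */

static double a[N][N], b[N][N];

/* Blocked transpose: touches b in cache-line-sized tiles instead of
   striding through whole columns. */
void transpose_blocked(void) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    b[j][i] = a[i][j];
}

int main(void) {
    a[1][2] = 3.0;
    transpose_blocked();
    printf("%g\n", b[2][1]);  /* prints 3 */
    return 0;
}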
Working Set Size
[Figure: hit rate (60% to 100%) vs. cache size (64 KB, 256 KB, 4096 KB; Nehalem) for SPEC benchmarks astar, bwaves, bzip2, calculix, gamess, gemsFDTD, gobmk, gromacs, h264ref, hmmer, lbm, leslie3d, libquantum, mcf, milc, namd, sjeng, soplex, wrf, xalan, and zeusmp.]
Key Points
• Caches give the illusion of a large, cheap memory with the access time of a fast, expensive memory.
• Caches take advantage of memory locality, specifically temporal locality and spatial locality.
• Cache design presents many options (block size, cache size, associativity, write policy) that an architect must combine to minimize miss rate and access time to maximize performance.