2/15/99 CS520S99 Memory, C. Edward Chow

Page 1: What Impacts the Memory System Design?

• Principle of Locality
  – Temporal Locality (90% of execution time is spent in 10% of the code)
  – Spatial Locality (neighboring code has a high probability of being executed soon)
• Smaller hardware is faster? Speed vs. capacity
• Price/performance considerations (Amdahl's Law)

Page 2: Memory Hierarchy

Memory Level             Speed (access time)   Size        Price/MB '99   Price/MB '93
cache (SRAM)             fast   (8-35ns)       16K-2M      $50-200        $200-$500
primary memory (DRAM)           (90-120ns)     32M-512M    $2-5           $25-50
disk                     slow   (7.5-15ms)     2G-17G      $0.02-0.06     $1-2

Transfer units between levels: cache blocks (8-256 bytes) between cache and
primary memory; pages (4K-64K) between primary memory and disk.

Page 3: DRAM, SIMM, DIMM

• DRAM: Dynamic RAM
• SIMM: Single In-line Memory Module; saves space (30-pin/8 bits, 72-pin/32 bits)
• DIMM: Dual In-line Memory Module, with DRAM on both sides (168 pins/64 bits)
• SO-DIMM: Small Outline DIMM, used in most notebooks (72-pin/32 bits)
• See http://www.kingston.com/king/mg0.htm for an informative guide
• (Figure: SIMM sockets on a system board.)

Page 4: 30-pin and 72-pin SIMMs

• Most desktop computers use either 30- or 72-pin SIMMs.
• A 30-pin SIMM supplies 8 data bits (the data bus width).
  – A typical board has 2 banks (Bank 0 and Bank 1), each with 4 30-pin SIMM sockets, together delivering 32 bits.
• A 72-pin SIMM supplies 32 data bits in one bus cycle.
• Mixing SIMMs of different capacities in the same memory bank causes system booting problems.
• Newer DIMMs have 168 pins.

Page 5: Data Integrity

• To detect and fix bit errors in memory chips, parity checking (a 9th bit) or ECC (Error Correcting Code) is used. These additional bits are called check bits.
• In ECC, the number of check bits required for single-error correction is derived from the formula m + r + 1 <= 2^r, where m is the number of data bits and r is the number of check bits. For m=8, r=4; r=6 covers m <= 57 bits; r=7 covers m <= 120 bits.
• SIMM memory modules (x36, x39, x40 bits) provide the data bits and the redundant check bits to the memory controller, which actually performs the data integrity check.
• Most home PCs do not use check-bit memory.
• Most servers and high-end workstations support ECC memory.
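The check-bit condition is easy to sanity-check in code. The sketch below (the helper name min_check_bits is my own) searches for the smallest r satisfying the slide's formula:

```c
#include <assert.h>

/* Smallest r with m + r + 1 <= 2^r: the minimum number of check bits
   for single-error correction of m data bits (the slide's formula). */
int min_check_bits(int m) {
    int r = 1;
    while (m + r + 1 > (1 << r))
        r++;
    return r;
}
```

This reproduces the slide's numbers: m=8 needs r=4, r=6 covers up to m=57, and r=7 covers up to m=120.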

Page 6: SIMM Module Identification

Note that parity SIMMs are distinguished by the "x 9" or "x 36" format specifications. This is because parity memory adds a parity bit to every 8 bits of data. So a 30-pin SIMM provides 8 data bits per cycle plus a parity bit, which equals 9 bits; a 72-pin SIMM provides 32 bits per cycle plus 4 parity bits, which equals 36 bits.

Page 7: Page Mode, EDO, SDRAM

• Fast-page-mode chip: an entire row of bits is retrieved from the memory cell array and held in a special latch buffer. If the next access falls in the same row, the data is already latched and only about half the access time is needed.
• Extended Data Output (EDO) memory: access is 10 to 15 percent faster than comparable fast-page-mode chips.
• Synchronous DRAM (SDRAM) uses a clock coordinated with the CPU clock, so the timing of the memory chips and the timing of the CPU are in sync.
• SDRAM saves time in executing commands and transmitting data, thereby increasing the overall performance of the computer. The new PC100 (100MHz) bus requires SDRAM.
• SDRAM allows the CPU to access memory approximately 25 percent faster than EDO memory.

Page 8: Memory Access

• A memory access is said to hit (miss) in a memory level if the data is found (not found) in that level.
• Hit rate (miss rate): the fraction of memory accesses found (not found) in the level.
• Hit time: the time to access data in a memory level, including the time to decide whether the access is a hit or a miss.
• Miss penalty: the time to replace a block in a level with the corresponding block from the level below, plus the time to deliver the block to the CPU
  = time to access the first word on a miss (access time) + time to transfer the remaining words (transfer time).

Page 9: Memory Access in a System with Cache/Virtual Memory (Paging)

(Figure: memory access flow through the cache and paged virtual memory.)

Page 10: Evaluating Performance of a Memory Hierarchy

• Average memory access time is a better measure than the miss rate.
• AverageMemoryAccessTime = HitTime + MissRate * MissPenalty
• (Figure: block size vs. miss rate, miss penalty (= access time + transfer time), and average access time.)

Page 11: Goal of Memory Hierarchy

• Reduce execution time, not the number of misses.
• Computer designers favor a block size with the lowest average access time rather than the lowest miss rate.

Page 12: Classifying Memory Hierarchy Designs

Four questions for classifying memory hierarchies:
Q1: Where can a block be placed in the upper memory level? (Block placement)
Q2: How is a block found in a memory level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

Page 13: Cache

The memory level between the CPU and main memory.

Cache: a safe place for hiding or storing things.
(Webster's New World Dictionary of the American Language, Second College Edition, 1976)

Page 14: Q1: Where can a block be placed in a cache? (Block placement)

• Direct-mapped cache: each block has one fixed place in the cache, e.g., location = (block address) modulo (number of blocks in cache).
• Fully associative cache: a block can be placed anywhere in the cache.
• Set-associative cache: a block can be placed in a restricted set of places. If there are n blocks in a set, the placement is called n-way set associative.
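The direct-mapped placement rule is one line of code; a minimal sketch (the function name is mine):

```c
#include <assert.h>

/* Direct-mapped placement: a block can live in exactly one location,
   location = (block address) modulo (number of blocks in cache). */
unsigned direct_mapped_location(unsigned block_address, unsigned num_blocks) {
    return block_address % num_blocks;
}
```

For an n-way set-associative cache the same expression, taken modulo the number of sets rather than the number of blocks, picks the set instead of a single block.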

Page 15: Q2: How is a block found in a cache? (Block identification)

• Caches include an address tag on each block; the tag gives part of the block address.
• A valid bit is attached to each tag to indicate whether the information in the block is valid.
• The address from the CPU to the cache is divided into two fields: block address and block offset.
• The block address identifies the block; the block offset identifies the byte within the block. The number of bits in the block offset depends on the block size (BS): for BS = 32B the block offset is 5 bits, since 2^5 = 32.
• The block address is further divided into tag and index fields. The index selects the set, while the tag is compared to identify the block within the set.
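The field split can be expressed directly with shifts and masks. A sketch, assuming the field widths are given (the struct and function names are mine):

```c
#include <assert.h>

/* Split a byte address into tag / index / block-offset fields.
   offset_bits = log2(block size); index_bits = log2(number of sets). */
struct cache_fields { unsigned tag, index, offset; };

struct cache_fields split_address(unsigned addr, int index_bits, int offset_bits) {
    struct cache_fields f;
    f.offset = addr & ((1u << offset_bits) - 1);
    f.index  = (addr >> offset_bits) & ((1u << index_bits) - 1);
    f.tag    = addr >> (offset_bits + index_bits);
    return f;
}
```

With BS = 32B (5 offset bits) and 256 sets (8 index bits), address 24903 = 3*2^13 + 10*32 + 7 splits into tag 3, index 10, offset 7.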

Page 16: Searching the Tag Values

• When a block is brought into the cache, the corresponding tag value is saved in the tag field of the block.
• When an address arrives from the CPU, its tag field is extracted and compared with the tag values in the cache.
• The comparison is often implemented as content-addressable memory; a sequential search is slower but simpler to implement.

Page 17: Q3: Which block should be replaced on a miss? (Block replacement)

• For a direct-mapped cache this is easy, since only one block can be replaced.
• For fully associative and set-associative caches, there are two strategies:
  – Random
  – Least-recently used (LRU): replace the block that has not been accessed for the longest time (principle of temporal locality).
• The following sequence shows the block access history (upper row) and the LRU block at each time slot (lower row), with the time axis heading to the right. Initially, block 0 is the LRU block by default. Assume there are only 4 blocks.
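LRU bookkeeping can be sketched with per-block timestamps (a simple model, not a hardware-realistic one; the names are mine). All timestamps start at 0, so block 0 is reported as LRU by default, as on the slide:

```c
#include <assert.h>

#define NBLOCKS 4

static int last_used[NBLOCKS];   /* time of last access, 0 = never */
static int now;

void touch(int block) { last_used[block] = ++now; }

/* The LRU victim is the block with the oldest (smallest) timestamp;
   ties break toward block 0. */
int lru_block(void) {
    int lru = 0;
    for (int b = 1; b < NBLOCKS; b++)
        if (last_used[b] < last_used[lru])
            lru = b;
    return lru;
}
```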

Page 18: Q4: What happens on a write? (Write strategy)

• Reads dominate cache accesses; all instruction accesses are reads.
• Write policies (options when writing to the cache):
  – Write-through: the information is written to both the cache and main memory.
  – Write-back: the information is written only to the cache; the modified cache block is written to main memory only when it is replaced.
• A block in a write-back cache is either clean or dirty, depending on whether the block's content is the same as that in main memory.

Page 19: Write Back vs. Write Through

• A write-back cache
  – uses less memory bandwidth, since multiple writes within a block require only one write to main memory.
  – may turn a read miss (which causes a block to be replaced) into a write to main memory. Delaying writes is fast for the CPU, but data blocks can become inconsistent with main memory.
• A write-through cache
  – never writes to main memory on a read miss.
  – is easier to implement.
  – keeps the most current copy of the data in main memory (data consistency, coherence).

Page 20: Dealing with Write Misses

• Write miss: the data block is not in the cache.
• There are two options (whether to bring the block into the cache):
  – Write-allocate: the block is loaded into the cache, followed by the write-hit actions above.
  – No-write-allocate: the block is modified in main memory and not loaded into the cache.
• In general, write-back caches use write-allocate, hoping that there will be subsequent writes to the same block.
• Write-through caches often use no-write-allocate, since subsequent writes go to main memory anyway.

Page 21: Dealing with CPU Write Stalls

• With write-through, the CPU has to wait for writes to complete.
• This can be solved with a write buffer: the CPU continues while memory is updated using data from the write buffer.
• If the write buffer is full, the CPU and cache must wait.
• Write merging: multiple writes to the write buffer are merged into a single entry before being transferred to the lower-level memory.

Page 22: Write Merging

• Subsequent writes to addresses within the same write-buffer entry can be combined (merged).
• This leaves more vacant entries for the CPU to write into.
• It also reduces the memory bandwidth used.
• (Figure: without write merging, the write buffer fills and the CPU stalls; with write merging, 4 writes merge into one write-buffer entry and the CPU is free to proceed.)

Page 23: Alpha AXP 21064 Data/Instruction Cache

(Figure: 8KB = 2^13 B direct-mapped cache with 32B (256-bit) blocks; 34-bit physical address; 8B delivered to the CPU.)
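From the slide's parameters the address field widths follow mechanically: 8KB / 32B = 256 blocks, so a 34-bit address splits into a 5-bit offset, an 8-bit index, and a 34 - 13 = 21-bit tag. A sketch of that arithmetic, assuming the whole physical address is tag + index + offset (the helper names are mine):

```c
#include <assert.h>

static int ilog2(int x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

/* Field widths for a direct-mapped cache. */
int offset_bits(int block_size)                { return ilog2(block_size); }
int index_bits(int cache_size, int block_size) { return ilog2(cache_size / block_size); }
int tag_bits(int addr_bits, int cache_size, int block_size) {
    return addr_bits - index_bits(cache_size, block_size) - offset_bits(block_size);
}
```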

Page 24: Intel Pentium II

• 32 KB Level 1 cache (16KB instruction / 16KB data)
• Cacheable address space up to 64GB (36-bit physical address)
• Dual Independent Bus (D.I.B.) architecture increases performance and provides more data to the processor core
• 100MHz system bus speeds data transfer between the processor and the system
• The 400 MHz part offers 1 MB and 512 KB L2 cache options; the 450 MHz part offers 2 MB, 1 MB, and 512 KB L2 cache options
• Error Checking and Correction (ECC) to maintain the integrity of mission-critical data

Page 25: Cache Performance Formulas

• AverageMemoryAccessTime = HitTime + MissRate * MissPenalty
• MemoryStallClockCycles = #Reads * ReadMissRate * ReadMissPenalty + #Writes * WriteMissRate * WriteMissPenalty
• CPUtime = (CPUExecutionClockCycles + MemoryStallClockCycles) * CycleTime
• CPUtime = IC * (CPIexecution + MemoryStallClockCycles/IC) * CycleTime
• CPUtime = IC * (CPIexecution + MAPI * MissRate * MissPenalty) * CycleTime
• MAPI: memory accesses per instruction
• MissRate: fraction of memory accesses not found in the cache
• MissPenalty: the additional time to service the miss (related to block size and memory transfer time)
• IC: instruction count
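The formulas translate directly into code; a sketch (the function names are my own shorthand):

```c
#include <assert.h>

/* AverageMemoryAccessTime = HitTime + MissRate * MissPenalty */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

/* Effective CPI: CPIexecution + MAPI * MissRate * MissPenalty.
   Multiply by IC and CycleTime to get CPUtime. */
double effective_cpi(double cpi_exec, double mapi,
                     double miss_rate, double miss_penalty) {
    return cpi_exec + mapi * miss_rate * miss_penalty;
}
```

For example, amat(1, 0.02, 50) gives 2 cycles.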

Page 26: Split Caches vs. Unified Cache

• The percentage of instruction references is about 75%.
• Split (data + instruction) caches offer two memory ports per clock cycle and allow parallel access to instructions and data.
• A unified cache has only one port to satisfy both accesses, resulting in 1 more clock cycle of hit time.
• But with more cache memory (cache size) available to either reference stream, a unified cache has a lower miss rate.

Page 27: Miss Rates of Split/Unified Caches

(Table: miss rates of split vs. unified caches at various cache sizes.)

Page 28: Design Trade-off: Split Cache vs. Unified Cache

• 16KB instruction cache + 16KB data cache vs. a 32KB unified cache.
• Normal HitTime = 1 cycle; load and store HitTime in the unified cache = 2 cycles; MissPenalty = 50 cycles.
• Which one has the lower miss rate?
• Ans: Based on 75% instruction references, the overall MissRate(split) = 0.75*0.64% + 0.25*6.47% = 2.10%; MissRate(unified) = 1.99%.
• According to miss rate, the unified cache performs better.
• AverageMemoryAccessTime(split) = 75%*(1 + 0.64%*50) + 25%*(1 + 6.47%*50) = 75%*1.32 + 25%*4.235 = 0.990 + 1.059 = 2.05 cycles (better!)
• AverageMemoryAccessTime(unified) = 75%*(1 + 1.99%*50) + 25%*(2 + 1.99%*50) = 75%*1.995 + 25%*2.995 = 1.496 + 0.749 = 2.24 cycles
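The comparison can be checked numerically; the sketch below weights the instruction (75%) and data (25%) AMATs (the function name is mine):

```c
#include <assert.h>

/* Weighted AMAT over 75% instruction and 25% data references. */
double weighted_amat(double i_hit, double i_miss_rate,
                     double d_hit, double d_miss_rate, double penalty) {
    return 0.75 * (i_hit + i_miss_rate * penalty)
         + 0.25 * (d_hit + d_miss_rate * penalty);
}
```

weighted_amat(1, 0.0064, 1, 0.0647, 50) gives about 2.05 cycles for the split caches, and weighted_amat(1, 0.0199, 2, 0.0199, 50) about 2.24 cycles for the unified cache.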

Page 29: Impact of Cache on Performance

• Assume cache MissPenalty = 50 clock cycles.
• All instructions normally take 2.0 clock cycles (ignoring memory stalls).
• MissRate = 2%; MemoryAccessesPerInstruction = 1.33.
• Ans: Apply CPUtime = IC * (CPIexecution + MemoryStallClockCycles/IC) * CycleTime.
• CPUtime(with cache) = IC * (2.0 + 1.33*2%*50) * CycleTime = IC * 3.33 * CycleTime
• CPI(perfect cache) = 2.0; CPI(with cache) = 3.33; CPI(without cache) = 2.0 + 1.33*50 = 68.5!
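The three CPI figures are plain arithmetic on the slide's parameters; a sketch:

```c
#include <assert.h>

/* CPI = CPIexecution + MAPI * MissRate * MissPenalty, with the slide's
   numbers: CPIexecution = 2.0, MAPI = 1.33, MissPenalty = 50 cycles. */
double cpi_with_cache(void)    { return 2.0 + 1.33 * 0.02 * 50; }  /* 2% miss rate      */
double cpi_without_cache(void) { return 2.0 + 1.33 * 1.00 * 50; }  /* every access misses */
```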

Page 30: 2-Way Set-Associative vs. Direct-Mapped Cache

• A 2-way set-associative cache requires extra logic to select the block in the set -> longer hit time -> longer CPU clock cycle time.
• Will the advantage of a lower miss rate offset the slower hit time?
• Example (page 387): CPIexecution = 2, DataCacheSize = 64KB, MissPenalty = 70ns (35 CPU clock cycles), MemoryAccessesPerInstruction = 1.3.

                   direct-mapped           2-way set-associative
  ClockCycleTime   2ns                     2 * 1.1 = 2.2ns
  Miss rate        0.014                   0.010
  AMAT             2 + 0.014*70 = 2.98ns   2.2 + 0.010*70 = 2.90ns
                   (worse!)

• CPUtime = IC * (CPIexecution * ClockCycleTime + MemoryAccessesPerInstruction * MissRate * MissPenalty)
  CPUtime(direct-mapped) = IC * (2.0*2 + 1.3*0.014*70) = IC * 5.27
  CPUtime(2-way)         = IC * (2.0*2.2 + 1.3*0.010*70) = IC * 5.31
• Since CPU time is the bottom-line evaluation metric and a direct-mapped cache is simpler to build, in this case the direct-mapped cache is preferred.
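The bottom-line comparison is easy to reproduce; the sketch below returns nanoseconds per instruction (CPUtime / IC), with MissPenalty fixed at 70ns for both organizations (the function name is mine):

```c
#include <assert.h>

/* CPUtime / IC in ns: CPIexecution*ClockCycleTime + MAPI*MissRate*MissPenalty */
double ns_per_instruction(double cpi_exec, double cycle_ns,
                          double mapi, double miss_rate, double penalty_ns) {
    return cpi_exec * cycle_ns + mapi * miss_rate * penalty_ns;
}
```

The direct-mapped cache wins on CPU time (about 5.27 vs. 5.31 ns per instruction) even though it loses on AMAT.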

Page 31: Improving Cache Performance

• Caches can be improved by:
  – Reducing the miss rate
  – Reducing the miss penalty
  – Reducing the hit time
• These are often related; improving one area may hurt performance in the others.

Page 32: Reducing Cache Misses

• Three basic types of cache misses:
  – Compulsory: the first access to a block not yet in the cache (first-reference misses, cold-start misses).
  – Capacity: since the cache cannot contain all the blocks of a program, some blocks are replaced and later retrieved.
  – Conflict: when too many blocks map to the same set, some blocks are replaced and later retrieved.

Page 33: Reducing Miss Rate with Larger Block Size

• Good: larger blocks take advantage of spatial locality.
• Bad: larger blocks increase the miss penalty and reduce the number of blocks in the cache.
• (Figure: miss rate vs. block size.)

Page 34: Miss Rate vs. Block Size

(Table: miss rates for various block sizes and cache sizes.)

Page 35: Selecting the Block Size that Minimizes AMAT

• Assume the memory system takes 40 cycles of overhead and then delivers 16 bytes every 2 clock cycles. Miss rates are from Figure 5.12; Figure 5.13 shows the resulting AMAT.
• AMAT(BS=16B, CS=1KB) = 1 + 15.05% * (40 + (16/16)*2) = 7.321 clock cycles.
• AMAT(BS=256B, CS=256KB) = 1 + 0.49% * (40 + (256/16)*2) = 1.353 clock cycles.
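Both AMAT numbers follow from one formula; a sketch with the slide's memory system hard-coded (the function name is mine):

```c
#include <assert.h>

/* AMAT in clock cycles: hit time 1 cycle; miss penalty = 40-cycle
   overhead + 2 cycles per 16 bytes transferred. */
double amat_for_block_size(double miss_rate, int block_size_bytes) {
    double penalty = 40.0 + (block_size_bytes / 16.0) * 2.0;
    return 1.0 + miss_rate * penalty;
}
```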

Page 36: Reducing Miss Rate with a Victim Cache

• A victim cache contains blocks that are discarded from the cache on a miss.
• It is checked on a miss; if it matches, the victim block and the cache block are swapped.
• A four-entry victim cache removed 20% to 95% of the conflict misses in a 4KB direct-mapped data cache.

Page 37: Pseudo (Column)-Associative Caches

• On a hit, cache access works just like a direct-mapped cache.
• On a miss, an additional cache entry is checked for a match.
• The candidate cache entry can be the one with the most significant bit of its index field inverted.
• These cache blocks form the "pseudo set."
• One cache block has a fast hit time; the other has a slow hit time.
• What is the difference between this and a two-way set-associative cache? (It does not affect the processor clock rate.)

Page 38: Hardware Prefetching of Instructions and Data

• Prefetch items before the processor requests them; place them in the cache or in an external buffer.
• Instruction prefetch:
  – The AXP 21064 fetches two blocks on a miss: the requested block into the cache, and the next consecutive block (the prefetch block) into an instruction stream buffer (ISB).
  – If the requested instruction is in the ISB, the cache request is cancelled, the block is read from the ISB, and the next prefetch request is issued.
  – For a 4KB direct-mapped cache with BS=16B: 1 block in the ISB catches 15-25% of misses; 4 blocks improve the hit rate to about 50%; 16 blocks to 72%.
• Data prefetch:
  – 1 single data stream buffer catches 25% of misses.
  – 4 data stream buffers increase the data hit rate to 43%.

Page 39: Compiler-Controlled Prefetching

• The compiler generates prefetch instructions to request data before they are needed.
• Register prefetch loads the value into a register; cache prefetch loads data only into the cache.
• Since prefetching a block can cause a page fault in virtual memory (the page containing the block is not in main memory), prefetches are classified as faulting or nonfaulting.
• Nonbinding prefetch: a nonfaulting cache prefetch that does not cause virtual memory faults.
• A nonblocking (lockup-free) cache allows the processor to continue retrieving data and instructions while prefetched data is being fetched.
• Goal: overlap execution with prefetching of data. Loops are the key target.

Page 40: Example of Compiler-Controlled Prefetching

for (i=0; i<3; i=i+1)
  for (j=0; j<100; j=j+1)
    a[i][j]=b[j][0]*b[j+1][0];

Assume an 8KB direct-mapped cache, BS=16B, write-back with write-allocate.
a and b are double-precision (8B) floating-point arrays; a has 3 rows and 100 columns; b has 101 rows and 3 columns.
Assume a and b are not in the cache at the start of the program.

1. (Compiler) Determine the number of cache misses.
Ans: In C, a two-dimensional array is stored in row-major order:
a[0][0]; a[0][1]; …; a[0][99]; a[1][0]; a[1][1]; …; a[1][99]; a[2][0]; …
Accessing down a column of a (a[i][0] for successive i, elements separated by 800 B) is called striding.
In the loop above, a's elements are written in the same order they are laid out in memory. Since BS=16B, the access to a[0][0] misses and brings in a block containing a[0][0] and a[0][1]; the next access, a[0][1], hits. For array a, accesses with even values of j miss and accesses with odd values of j hit: 300/2 = 150 misses (spatial locality).

Page 41: Example of Compiler-Controlled Prefetching (cont.)

(Same loop and assumptions as the previous slide.)

1. (Compiler) Determine the number of cache misses.
Ans: In the first loop iteration, b[0][0] and b[1][0] are accessed. They are separated by 3*8 = 24 bytes and do not belong to the same block. Their misses bring in the blocks containing b[0][0] and b[0][1] (not used), and b[1][0] and b[1][1] (not used).
In the next iteration, b[1][0] and b[2][0] are accessed: b[1][0] hits and b[2][0] misses.
In the loop where i=0, there are 101 misses, for b[0][0] through b[100][0]. In the loops with i=1 and i=2, all accesses to array b hit.
Total: 150 + 101 = 251 misses.
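The 251-miss count can be reproduced by simulating the access stream against a direct-mapped cache. The sketch below assumes a[0][0] and b[0][0] are 16B-aligned and that the two arrays do not conflict in the 8KB cache (the base addresses and all names are mine):

```c
#include <assert.h>
#include <string.h>

#define LINES 512            /* 8KB cache / 16B blocks */
#define BLOCK 16

static long tags[LINES];
static int  valid[LINES];
static int  misses;

/* One access to byte address addr in a direct-mapped cache; with
   write-allocate, reads and writes are handled identically here. */
static void access_addr(long addr) {
    long block = addr / BLOCK;
    int line = (int)(block % LINES);
    if (!valid[line] || tags[line] != block) {
        valid[line] = 1;
        tags[line] = block;
        misses++;
    }
}

int count_misses(void) {
    long a = 0;      /* assumed base of a[3][100], 16B-aligned  */
    long b = 4096;   /* assumed base of b[101][3], no conflicts */
    memset(valid, 0, sizeof valid);
    misses = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 100; j++) {
            access_addr(b + (long)(j * 3) * 8);       /* read b[j][0]   */
            access_addr(b + (long)((j + 1) * 3) * 8); /* read b[j+1][0] */
            access_addr(a + (long)(i * 100 + j) * 8); /* write a[i][j]  */
        }
    return misses;
}
```

The simulated total is 150 misses for a plus 101 for b, matching the slide's analysis.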

Page 42: Example of Compiler-Controlled Prefetching (cont.)

2. Insert prefetch instructions to reduce misses. Assume the miss penalty is large, so we need to prefetch 7 iterations in advance. Prefetches at the beginning and end of the arrays do not fault.
Ans: The first loop prefetches both b and a; the second loop prefetches only a.

for (j=0; j<100; j++) {
  prefetch(b[j+7][0]);    /* b[j][0] for 7 iterations later */
  prefetch(a[0][j+7]);    /* a[0][j] for 7 iterations later */
  a[0][j]=b[j][0]*b[j+1][0];
}
for (i=1; i<3; i++)
  for (j=0; j<100; j++) {
    prefetch(a[i][j+7]);  /* a[i][j] for 7 iterations later */
    a[i][j]=b[j][0]*b[j+1][0];
  }

The revised code prefetches a[i][7]…a[i][99] and b[7][0]…b[100][0].
1st loop: nonprefetched misses to a[0][0-6] and b[0-6][0]: 4 + 7 = 11.
2nd loop: nonprefetched misses to a[1][0-6] and a[2][0-6]: 4 + 4 = 8.
This avoids 251 - 19 = 232 cache misses, at the cost of executing 400 prefetch instructions.

Page 43: Correction to Prefetch Results

• Assume that a[i][0-6] (for i >= 1) is prefetched by the trailing prefetches a[i-1][100-106]; then only a[0][0-6] causes ceil(7/2) = 4 nonprefetched misses.
• b[0-6][0] causes 7 nonprefetched misses.
• b[100][0] is prefetched.
• The total nonprefetched misses should be 4 + 7 = 11.
• Does prefetch(a[0][7]) bring in
  1. a[0][7] and a[0][8], or
  2. a[0][6] and a[0][7]?
• Note that if a[0][0] is allocated at an address divisible by 16, the answer is 2; otherwise the answer is 1.

Page 44: Improvement from Compiler-Controlled Prefetch

• Ignore instruction cache misses; assume no conflict or capacity misses in the data cache.
• Prefetches can overlap with each other and with cache misses.
• The original loop takes 7 clock cycles per iteration; the first prefetch loop takes 9 clock cycles per iteration; the second prefetch loop takes 8 clock cycles per iteration.
• A miss takes 50 clock cycles.
• Ans: Original loop: 300 iterations * 7 cycles = 2100 cycles; 251 cache misses * 50 cycles = 12550 cycles; total = 2100 + 12550 = 14650 cycles.
  1st prefetch loop: 100 iterations * 9 cycles + 11 misses * 50 cycles = 1450 cycles.
  2nd prefetch loop: 200 iterations * 8 cycles + 8 misses * 50 cycles = 2000 cycles.
  Total prefetch-loop cycles = 1450 + 2000 = 3450.
  The prefetch code is 14650/3450 = 4.2 times faster.
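The cycle accounting above is simple enough to assert directly:

```c
#include <assert.h>

/* Cycle totals from the slide's accounting. */
int original_loop_cycles(void) { return 300 * 7 + 251 * 50; }
int prefetch_loop_cycles(void) { return (100 * 9 + 11 * 50) + (200 * 8 + 8 * 50); }
```

original_loop_cycles() is 14650 and prefetch_loop_cycles() is 3450, a speedup of about 4.2.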

Page 45: Reducing Miss Rate Through Compiler Optimization

• A pure software solution.
• Using profiling information to rearrange code, the miss rate can be reduced by 50%. [McFarling 98]
• Data has fewer restrictions on relocation than code.
• Goal: rearrange code and data to improve spatial and temporal locality.

Page 46: Example 1: Merging Arrays

• Group related data into a record so that the fields are fetched together in one cache block.
• What locality is exploited here?

Before:
int val[SIZE];
int key[SIZE];

After:
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Page 47: Loop Interchange

• In C, arrays are arranged in row-major order; in Fortran, arrays are arranged in column-major order.
• Before: striding through x in column order causes many misses. The block containing x[0][1] may get kicked out by conflict misses before it is accessed when j=1.

for (j=0; j<100; j++)
  for (i=0; i<5000; i++)
    x[i][j]=2*x[i][j];

• After: access the data in consecutive memory locations.

for (i=0; i<5000; i++)
  for (j=0; j<100; j++)
    x[i][j]=2*x[i][j];

Page 48: Loop Fusion

• Before:

for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    a[i][j]=1/b[i][j]*c[i][j];
for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    d[i][j]=a[i][j]+c[i][j];

• After:

for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
    a[i][j]=1/b[i][j]*c[i][j];
    d[i][j]=a[i][j]+c[i][j];
  }

The second accesses of a and c will be hits.

Page 49:

Blocking (Improve Temporal Locality)
• Some algorithms access data by both row and column, so row/column-major order does not help.
• Idea: modify the algorithm to access data by blocks (submatrices).
• Before (matrix multiplication):
  for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
      r = 0;
      for (k=0; k<N; k++)
        r = r + y[i][k]*z[k][j];
      x[i][j] = r;
    }
• After (B = submatrix block size; note x must now accumulate partial sums across k-blocks):
  for (jj=0; jj<N; jj=jj+B)
    for (kk=0; kk<N; kk=kk+B)
      for (i=0; i<N; i++)
        for (j=jj; j<min(jj+B,N); j++) {
          r = 0;
          for (k=kk; k<min(kk+B,N); k++)
            r = r + y[i][k]*z[k][j];
          x[i][j] = x[i][j] + r;
        }

Page 50:

Access Pattern
[Figure: access pattern of the matrices for i=1, without blocking vs. with blocking. Gray: old accesses; black: new accesses; white: not yet accessed.]

Page 51:

Impact of conflict misses in direct-mapped caches on block size
• Smaller block sizes reduce conflict misses.
• See the right side of the upper curve (miss rate vs. blocking factor). [Lam91]

Page 52:

Reducing Cache Miss Penalty: Sub-block Placement
• Large block size: small tag field.
• Large block size: long miss penalty.
• Solution: bring in only a portion of the block, e.g., 8 bytes of a 32-byte block.
• Each sub-block needs its own valid bit.

Page 53:

Early Restart and Critical Word First
• Early Restart: as soon as the requested word of the block arrives, send it to the CPU; do not wait for the complete block to arrive.
• Critical Word First: request that the lower-level memory send the requested word first, then the rest of the block. When the requested word arrives, send it to the CPU immediately.

Page 54:

Nonblocking Data Cache
• Some pipelined machines allow instructions to execute out of order.
• After a data cache miss occurs, we can allow other instructions in the pipeline to retrieve their data.
• Their data may be in the cache; this is called "hit under miss".
• This type of cache is called a nonblocking cache.

Page 55:

Second-Level Caches

• AMAT = HitTimeL1 + MissRateL1 * MissPenaltyL1
• MissPenaltyL1 = HitTimeL2 + MissRateL2 * MissPenaltyL2
• AMAT = HitTimeL1 + MissRateL1 * (HitTimeL2 + MissRateL2 * MissPenaltyL2)
• Local miss rate: # of misses in this cache / memory accesses to this cache.
• Global miss rate: # of misses in this cache / memory accesses from the CPU.
• Example: 1000 memory references, 40 misses in the first-level cache, 20 misses in the second-level cache. The miss rate for the level-1 cache is 40/1000 = 4%, the local miss rate for the level-2 cache is 20/40 = 50%, and the global miss rate for the level-2 cache is 20/1000 = 2%.

Page 56:

Main Memory Organization
[Figure: alternative memory organizations with a 32-bit CPU-cache bus and a 128-bit path to memory.]
• Wider memory bus: faster transfer, but memory must be upgraded in bigger increments.
• Interleaved memory: the same address retrieves different words from different banks. The word in Bank 0 transfers first, followed by Bank 1, and so on; the 4 words of a block are spread across 4 banks.

Page 57:

Wider Main Memory vs. Interleaved Memory

Assume the basic memory organization takes 1 cycle to send the address, 6 cycles to access a word, and 1 cycle to transfer a word. For reading a block of 4 words:

Mem./Bus Width   Time (cycles)     Bandwidth (bytes/cycle)
1 word           4x(1+6+1) = 32    16 B/32 cycles = 0.5
2 words          2x(1+6+1) = 16    16 B/16 cycles = 1
4 words          1x(1+6+1) = 8     16 B/8 cycles  = 2

Page 58:

Compare Memory Organizations
CPIaverage = CPIexecution + MemoryAccessesPerInstruction * MissRate * MissPenalty
Example: use CPIaverage to trade off wide vs. interleaved memory. Assume the clock cycle time and instruction count do not change. For BlockSize (BS, in words) = 1 and MemoryBusWidth (MBW, in words) = 1: MissRate = 15% (10%, 5% for BS = 2, 4), MissPenalty = 8 cycles, MemoryAccessesPerInstruction = 1.2, average cycles per instruction (ignoring cache misses) = 2.

                                 MissRate   CPIaverage
BS=1, MBW=1, no interleaving     15%        2+(1.2*15%*8)       = 3.44
BS=2, MBW=1, no interleaving     10%        2+(1.2*10%*2*8)     = 3.92
BS=2, MBW=1, interleaving        10%        2+(1.2*10%*(1+6+2)) = 3.08
BS=2, MBW=2, no interleaving     10%        2+(1.2*10%*1*8)     = 2.96
BS=4, MBW=1, no interleaving     5%         2+(1.2*5%*4*8)      = 3.92
BS=4, MBW=1, interleaving        5%         2+(1.2*5%*(1+6+4))  = 2.66
BS=4, MBW=2, no interleaving     5%         2+(1.2*5%*2*8)      = 2.96

Page 59:

Virtual Memory

is a technique to
• share a small physical main memory among many processes.
• let programmers write programs larger than main memory (e.g., simulation programs or DB applications with large data sets); it makes the physical main memory size invisible to programmers.
• relieve programmers of the burden of writing swap routines to move overlay program segments in and out of main memory.
• protect the memory spaces of different processes.
• enable relocation of a program anywhere in main memory.
• reduce the time to start a program. (Why?)

Page 60:

Page vs. Segment

• Physical main memory is typically divided into fixed-length blocks, called pages. A miss is called a page fault.
• If the blocks are variable-length, they are called segments; a segment miss is called a segment fault.
• When a page fault happens, the page is brought in from disk.

Page 61:

Mapping Virtual Address to Physical Address
• Memory mapping (address translation): the CPU produces virtual addresses that are translated by a combination of hardware and software into the physical addresses used to access main memory.
[Figure: pages A-D appear contiguous in virtual memory; actually they are scattered across main memory and disk.]

Page 62:

Q1: Where can a block be placed in main memory? (Block placement)
• Because of the horrendous cost of a miss (DRAM speed and disk speed are 4 orders of magnitude apart), OS designers always pick the lower miss rate and allow blocks to be placed anywhere in main memory (fully associative).

Page 63:

Q2: How is a block found in main memory? (Block identification)
• Example: size of the page table. For a 28-bit virtual address, a 4KB page, and 4 bytes per page table entry, the page table has 2^16 = 64K entries and occupies 256KB (the equivalent of 64 pages). It is quite big, and the address translation can be quite slow.
[Figure: the virtual address = 16-bit virtual page number + 12-bit page offset; the page table maps the virtual page number to a 12-bit physical page-frame number (16MB main memory = 4K page frames), with 32 bits per entry.]

Page 64:

Page Table

• To reduce the size of the page table, some machines apply a hashing function to the virtual page number to get a value between 0 and 4K-1 and use an inverted page table, with one entry per physical page in main memory. A 16MB physical memory needs 4B*16MB/4KB = 16KB (4 pages) for the inverted page table.
• To reduce the address translation time, a special cache called a TLB (Translation-Lookaside Buffer) is used.

Page 65:

Translation-Lookaside Buffer: TLB

• To reduce the virtual-to-physical address translation time, a cache called the Translation-Lookaside Buffer (TLB, first used in the IBM 370) keeps the most recent address translations.
• A TLB entry contains
  – a tag field that holds the virtual page number,
  – a data field that holds the corresponding physical page-frame number, a protection field, a use bit, and a dirty bit.

Page 66:

AXP 21064 TLB (32 entries): a fully associative cache
[Figure: TLB entry format. V: valid bit; R: read permission bit; W: write permission bit.]

Page 67:

Combine Cache with Virtual Memory
• Is there a way to reduce the cache hit time?
[Figure: the CPU sends a virtual address to the TLB; the TLB produces the physical address used to probe the cache; if the data is not in the cache it comes from main memory; if the entry is not in the TLB, it is brought in from the page table in main memory.]
modified cache hit time = address translation time + original cache hit time

Page 68:

Q3: Which block should be replaced on a virtual memory miss? (Block replacement)
• Replacement on a cache miss is controlled by hardware.
• Replacement on a virtual memory miss is controlled by the OS.
• The OS tries to replace the Least Recently Used (LRU) block.
• To help estimate the "freshness" of blocks, many OSes provide a use bit (reference bit): it is set when the page is accessed and periodically cleared.

Page 69:

Q4: What happens on a write?
• Write-through is too expensive: a write to disk costs 0.7M~6M cycles.
• Always use write-back.

Page 70:

Exercise on Cache + Virtual Memory
Consider a memory system with:
  virtual address = 32 bits
  2-way set-associative TLB, total TLB entries = 512
  page size = 2KB
  physical address = 36 bits
  2-way set-associative cache, total cache data size (CS) = 256KB
  line size = block size (BS) = 32B

a = TLB tag field
b = TLB index field
c = page offset
d = physical page-frame address field
e = cache tag field
f = cache index field
g = block offset
h = block data field

What is the relation between the two sets of data?

Page 71:

Relation between field sizes
[Figure: the virtual address splits into fields a|b|c; the TLB maps it to the physical page-frame number d; the physical address splits into e|f|g to index the data cache, which delivers an h-bit block through a MUX.]
a = 32-b-c = 32-8-11 = 13
b = log2(TLB entries/associativity) = log2(512/2) = log2(256) = 8
c = log2(page size) = log2(2^11) = 11
d = 36-c = 36-11 = 25
e = 36-f-g = 36-12-5 = 19
f = log2(cache data size/(associativity*BS)) = log2(2^18/(2*2^5)) = 12
g = log2(BS) = log2(32) = 5
h = 32*8 = 256 bits

Page 72:

Find the Optimal Block Size
The following measurements have been made on the system:
  0.4 data references/instruction (MAPI = 1.4)
  TLB miss rate = 0.1%, TLB miss penalty = 25 cycles
  Cache miss penalty = 6 cycles + #words/block

b) Find the average memory access time for a 4-word block size, assuming 1 clock cycle for the hit time.
  AMAT = hit time + miss rate * miss penalty = AMATTLB + AMATcache
       = (1+0.001*25) + (1+0.015*(6+4)) = 2.175
c) Find the optimal block size (compare AMAT for 1W~16W blocks).
  The 8-word block size has the lowest AMAT = 2.165.

Page 73:

Cache Analysis Using dinero
• There is a trace-driven cache simulator called dinero in ~cs520/dinero/bin.
• Given a trace (the history of the instruction and data references of a program) and cache design parameters (such as block size, unified cache size or instruction and data cache sizes, associativity, replacement policy, and write policy), dinero performs a cache simulation and reports the miss rate.
• For example, the command
  dinero -b32 -u32K -a1 < eg.din > eg32k.out
  performs a cache analysis of a 32K unified direct-mapped cache with block size = 32 bytes, using the trace input file eg.din and writing the output to eg32k.out.
• The man page is in ~cs520/dinero/man. The eg.din trace is in ~cs520/dinero/demo.

Page 74:

Dinero Output
>dinero -b32 -u64K -a1 < ~cs520/traces/benchmarks/tex.din > tex.cache
---Dinero III by Mark D. Hill.
CMDLINE: dinero -b32 -u64K -a1
CACHE (bytes): blocksize=32, sub-blocksize=0, Usize=65536, Dsize=0, Isize=0.
POLICIES: assoc=1-way, replacement=l, fetch=d(1,0), write=c, allocate=w.
CTRL: debug=0, output=0, skipcount=0, maxcount=10000000, Q=0.
---Simulation begins.
Metrics                   Total    Instrn   Data     Read     Write    Misc
(totals, fraction)        ------   ------   ------   ------   ------   ------
Demand Fetches            832477   597309   235168   130655   104513   0
                          1.0000   0.7175   0.2825   0.1569   0.1255   0.0000
Demand Misses             1537     130      1407     343      1064     0
                          0.0018   0.0002   0.0060   0.0026   0.0102   0.0000
---Execution complete.
43.7u 11.1s 0:57 95% 92+470k 950+1io 0pf+0w
(The total Demand Misses fraction, 0.0018, is the overall miss rate.)

Page 75:

Homework #3

Prob. 2. Exercise on Cache + Virtual Memory
Consider a memory system with virtual address = 34 bits, fully associative TLB, total TLB entries = 1024, page size = 8KB, physical address = 36 bits, direct-mapped cache, total cache data size = 256KB, block size = 32B.
a) What are the tag field sizes in the TLB and the cache?
b) Assume 0.5 data references per instruction, TLB miss rate = 0.14%, TLB miss penalty = 20 cycles, and cache miss penalty = 6 cycles + #words in a block; the miss rate for a block size of 4 words is 1.4% while the miss rate for a block size of 8 words is 1%. Which block size offers better performance?

Page 76:

Homework #3

Problem 3.
  for (i=0; i<3; i=i+1)
    for (j=0; j<100; j=j+1)
      a[i][j] = b[j][0] * b[j+1][0];
Assume a 32KB direct-mapped cache, BS = 32B, write-back cache with write allocate.
a and b are double-precision (8B) floating-point arrays; a has 3 rows and 100 columns; b has 101 rows and 3 columns.
Assume a and b are not in the cache at the start of the program.
Determine the number of cache misses.