2/15/99 CS520S99 Memory, C. Edward Chow

Page 1: What Impacts the Memory System Design?

• Principle of Locality
  – Temporal Locality (90% of execution time is spent in 10% of the code)
  – Spatial Locality (neighboring code has a high probability of being executed soon)
• Smaller hardware is faster? Speed vs. capacity
• Price/performance considerations (Amdahl's Law)

Page 2: Memory Hierarchy

Memory Level             Speed (access time)   Size        Price/MB '99   Price/MB '93
cache (SRAM)             fast   (8-35ns)       16K-2M      $50-200        $200-$500
primary memory (DRAM)           (90-120ns)     32M-512M    $2-5           $25-50
disk                     slow   (7.5-15ms)     2G-17G      $0.02-0.06     $1-2

Transfer units between levels: cache blocks (8-256 bytes) between cache and
primary memory; pages (4K-64K) between primary memory and disk.

Page 3: DRAM, SIMM, DIMM

• DRAM: Dynamic RAM
• SIMM: Single In-line Memory Module; saves space (30-pin/8 bits, 72-pin/32 bits)
• DIMM: Dual In-line Memory Module, with DRAM on both sides (168 pins/64 bits)
• SO-DIMM: Small Outline DIMM, used in most notebooks (72-pin/32 bits)
• See http://www.kingston.com/king/mg0.htm for an informative guide
• (Figure: SIMM sockets on a system board.)

Page 4: 30-pin and 72-pin SIMMs

• Most desktop computers use either 30- or 72-pin SIMMs.
• A 30-pin SIMM supplies 8 data bits (the data bus width).
  – A typical board has 2 banks (Bank 0 and Bank 1), each with 4 30-pin SIMM sockets, together delivering 32 bits.
• A 72-pin SIMM supplies 32 data bits in one bus cycle.
• Mixing SIMMs of different capacities in the same memory bank causes system booting problems.
• Newer DIMMs have 168 pins.

Page 5: Data Integrity

• To detect and fix bit errors in memory chips, parity checking (a 9th bit) or ECC (Error Correcting Code) is used. These additional bits are called check bits.
• In ECC, the number of check bits required for single-error correction is derived from the formula m + r + 1 <= 2^r, where m is the number of data bits and r is the number of check bits. For m=8, r=4; r=6 covers m <= 57 bits; r=7 covers m <= 120 bits.
• SIMM memory modules (x36, x39, x40 bits) provide the data bits and the redundant check bits to the memory controller, which actually performs the data integrity check.
• Most home PCs do not use check-bit memory.
• Most servers and high-end workstations support ECC memory.
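The check-bit condition is easy to sanity-check in code. The sketch below (the helper name min_check_bits is my own) searches for the smallest r satisfying the slide's formula:

```c
#include <assert.h>

/* Smallest r with m + r + 1 <= 2^r: the minimum number of check bits
   for single-error correction of m data bits (the slide's formula). */
int min_check_bits(int m) {
    int r = 1;
    while (m + r + 1 > (1 << r))
        r++;
    return r;
}
```

This reproduces the slide's numbers: m=8 needs r=4, r=6 covers up to m=57, and r=7 covers up to m=120.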

Page 6: SIMM Module Identification

Note that parity SIMMs are distinguished by the "x 9" or "x 36" format specifications. This is because parity memory adds a parity bit to every 8 bits of data. So a 30-pin SIMM provides 8 data bits per cycle plus a parity bit, which equals 9 bits; a 72-pin SIMM provides 32 bits per cycle plus 4 parity bits, which equals 36 bits.

Page 7: Page Mode, EDO, SDRAM

• Fast-page-mode chip: an entire row of bits is retrieved from the memory cell array and held in a special latch buffer. If the next access falls in the same row, the data is already latched and only about half the access time is needed.
• Extended Data Output (EDO) memory: access is 10 to 15 percent faster than comparable fast-page-mode chips.
• Synchronous DRAM (SDRAM) uses a clock coordinated with the CPU clock, so the timing of the memory chips and the timing of the CPU are in sync.
• SDRAM saves time in executing commands and transmitting data, thereby increasing the overall performance of the computer. The new PC100 (100MHz) bus requires SDRAM.
• SDRAM allows the CPU to access memory approximately 25 percent faster than EDO memory.

Page 8: Memory Access

• A memory access is said to hit (miss) in a memory level if the data is found (not found) in that level.
• Hit rate (miss rate): the fraction of memory accesses found (not found) in the level.
• Hit time: the time to access data in a memory level, including the time to decide whether the access is a hit or a miss.
• Miss penalty: the time to replace a block in a level with the corresponding block from the level below, plus the time to deliver the block to the CPU
  = time to access the first word on a miss (access time) + time to transfer the remaining words (transfer time).

Page 9: Memory Access in a System with Cache/Virtual Memory (Paging)

(Figure: memory access flow through the cache and paged virtual memory.)

Page 10: Evaluating Performance of a Memory Hierarchy

• Average memory access time is a better measure than the miss rate.
• AverageMemoryAccessTime = HitTime + MissRate * MissPenalty
• (Figure: block size vs. miss rate, miss penalty (= access time + transfer time), and average access time.)

Page 11: Goal of Memory Hierarchy

• Reduce execution time, not the number of misses.
• Computer designers favor a block size with the lowest average access time rather than the lowest miss rate.

Page 12: Classifying Memory Hierarchy Designs

Four questions for classifying memory hierarchies:
Q1: Where can a block be placed in the upper memory level? (Block placement)
Q2: How is a block found in a memory level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

Page 13: Cache

The memory level between the CPU and main memory.

Cache: a safe place for hiding or storing things.
(Webster's New World Dictionary of the American Language, Second College Edition, 1976)

Page 14: Q1: Where can a block be placed in a cache? (Block placement)

• Direct-mapped cache: each block has one fixed place in the cache, e.g., location = (block address) modulo (number of blocks in cache).
• Fully associative cache: a block can be placed anywhere in the cache.
• Set-associative cache: a block can be placed in a restricted set of places. If there are n blocks in a set, the placement is called n-way set associative.
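The direct-mapped placement rule is one line of code; a minimal sketch (the function name is mine):

```c
#include <assert.h>

/* Direct-mapped placement: a block can live in exactly one location,
   location = (block address) modulo (number of blocks in cache). */
unsigned direct_mapped_location(unsigned block_address, unsigned num_blocks) {
    return block_address % num_blocks;
}
```

For an n-way set-associative cache the same expression, taken modulo the number of sets rather than the number of blocks, picks the set instead of a single block.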

Page 15: Q2: How is a block found in a cache? (Block identification)

• Caches include an address tag on each block; the tag gives part of the block address.
• A valid bit is attached to each tag to indicate whether the information in the block is valid.
• The address from the CPU to the cache is divided into two fields: block address and block offset.
• The block address identifies the block; the block offset identifies the byte within the block. The number of bits in the block offset depends on the block size (BS): for BS = 32B the block offset is 5 bits, since 2^5 = 32.
• The block address is further divided into tag and index fields. The index selects the set, while the tag is compared to identify the block within the set.
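The field split can be expressed directly with shifts and masks. A sketch, assuming the field widths are given (the struct and function names are mine):

```c
#include <assert.h>

/* Split a byte address into tag / index / block-offset fields.
   offset_bits = log2(block size); index_bits = log2(number of sets). */
struct cache_fields { unsigned tag, index, offset; };

struct cache_fields split_address(unsigned addr, int index_bits, int offset_bits) {
    struct cache_fields f;
    f.offset = addr & ((1u << offset_bits) - 1);
    f.index  = (addr >> offset_bits) & ((1u << index_bits) - 1);
    f.tag    = addr >> (offset_bits + index_bits);
    return f;
}
```

With BS = 32B (5 offset bits) and 256 sets (8 index bits), address 24903 = 3*2^13 + 10*32 + 7 splits into tag 3, index 10, offset 7.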

Page 16: Searching the Tag Values

• When a block is brought into the cache, the corresponding tag value is saved in the tag field of the block.
• When an address arrives from the CPU, its tag field is extracted and compared with the tag values in the cache.
• The comparison is often implemented as content-addressable memory; a sequential search is slower but simpler to implement.

Page 17: Q3: Which block should be replaced on a miss? (Block replacement)

• For a direct-mapped cache this is easy, since only one block can be replaced.
• For fully associative and set-associative caches, there are two strategies:
  – Random
  – Least-recently used (LRU): replace the block that has not been accessed for the longest time (principle of temporal locality).
• The following sequence shows the block access history (upper row) and the LRU block at each time slot (lower row), with the time axis heading to the right. Initially, block 0 is the LRU block by default. Assume there are only 4 blocks.
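LRU bookkeeping can be sketched with per-block timestamps (a simple model, not a hardware-realistic one; the names are mine). All timestamps start at 0, so block 0 is reported as LRU by default, as on the slide:

```c
#include <assert.h>

#define NBLOCKS 4

static int last_used[NBLOCKS];   /* time of last access, 0 = never */
static int now;

void touch(int block) { last_used[block] = ++now; }

/* The LRU victim is the block with the oldest (smallest) timestamp;
   ties break toward block 0. */
int lru_block(void) {
    int lru = 0;
    for (int b = 1; b < NBLOCKS; b++)
        if (last_used[b] < last_used[lru])
            lru = b;
    return lru;
}
```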

Page 18: Q4: What happens on a write? (Write strategy)

• Reads dominate cache accesses; all instruction accesses are reads.
• Write policies (options when writing to the cache):
  – Write-through: the information is written to both the cache and main memory.
  – Write-back: the information is written only to the cache; the modified cache block is written to main memory only when it is replaced.
• A block in a write-back cache is either clean or dirty, depending on whether the block's content is the same as that in main memory.

Page 19: Write Back vs. Write Through

• A write-back cache
  – uses less memory bandwidth, since multiple writes within a block require only one write to main memory.
  – may turn a read miss (which causes a block to be replaced) into a write to main memory. Delaying writes is fast for the CPU, but data blocks can become inconsistent with main memory.
• A write-through cache
  – never writes to main memory on a read miss.
  – is easier to implement.
  – keeps the most current copy of the data in main memory (data consistency, coherence).

Page 20: Dealing with Write Misses

• Write miss: the data block is not in the cache.
• There are two options (whether to bring the block into the cache):
  – Write-allocate: the block is loaded into the cache, followed by the write-hit actions above.
  – No-write-allocate: the block is modified in main memory and not loaded into the cache.
• In general, write-back caches use write-allocate, hoping that there will be subsequent writes to the same block.
• Write-through caches often use no-write-allocate, since subsequent writes go to main memory anyway.

Page 21: Dealing with CPU Write Stalls

• With write-through, the CPU has to wait for writes to complete.
• This can be solved with a write buffer: the CPU continues while memory is updated using data from the write buffer.
• If the write buffer is full, the CPU and cache must wait.
• Write merging: multiple writes to the write buffer are merged into a single entry before being transferred to the lower-level memory.

Page 22: Write Merging

• Subsequent writes to addresses within the same write-buffer entry can be combined (merged).
• This leaves more vacant entries for the CPU to write into.
• It also reduces the memory bandwidth used.
• (Figure: without write merging, the write buffer fills and the CPU stalls; with write merging, 4 writes merge into one write-buffer entry and the CPU is free to proceed.)

Page 23: Alpha AXP 21064 Data/Instruction Cache

(Figure: 8KB = 2^13 B direct-mapped cache with 32B (256-bit) blocks; 34-bit physical address; 8B delivered to the CPU.)
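From the slide's parameters the address field widths follow mechanically: 8KB / 32B = 256 blocks, so a 34-bit address splits into a 5-bit offset, an 8-bit index, and a 34 - 13 = 21-bit tag. A sketch of that arithmetic, assuming the whole physical address is tag + index + offset (the helper names are mine):

```c
#include <assert.h>

static int ilog2(int x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

/* Field widths for a direct-mapped cache. */
int offset_bits(int block_size)                { return ilog2(block_size); }
int index_bits(int cache_size, int block_size) { return ilog2(cache_size / block_size); }
int tag_bits(int addr_bits, int cache_size, int block_size) {
    return addr_bits - index_bits(cache_size, block_size) - offset_bits(block_size);
}
```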

Page 24: Intel Pentium II

• 32 KB Level 1 cache (16KB instruction / 16KB data)
• Cacheable address space up to 64GB (36-bit physical address)
• Dual Independent Bus (D.I.B.) architecture increases performance and provides more data to the processor core
• 100MHz system bus speeds data transfer between the processor and the system
• The 400 MHz part offers 1 MB and 512 KB L2 cache options; the 450 MHz part offers 2 MB, 1 MB, and 512 KB L2 cache options
• Error Checking and Correction (ECC) to maintain the integrity of mission-critical data

Page 25: Cache Performance Formulas

• AverageMemoryAccessTime = HitTime + MissRate * MissPenalty
• MemoryStallClockCycles = #Reads * ReadMissRate * ReadMissPenalty + #Writes * WriteMissRate * WriteMissPenalty
• CPUtime = (CPUExecutionClockCycles + MemoryStallClockCycles) * CycleTime
• CPUtime = IC * (CPIexecution + MemoryStallClockCycles/IC) * CycleTime
• CPUtime = IC * (CPIexecution + MAPI * MissRate * MissPenalty) * CycleTime
• MAPI: memory accesses per instruction
• MissRate: fraction of memory accesses not found in the cache
• MissPenalty: the additional time to service the miss (related to block size and memory transfer time)
• IC: instruction count
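The formulas translate directly into code; a sketch (the function names are my own shorthand):

```c
#include <assert.h>

/* AverageMemoryAccessTime = HitTime + MissRate * MissPenalty */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

/* Effective CPI: CPIexecution + MAPI * MissRate * MissPenalty.
   Multiply by IC and CycleTime to get CPUtime. */
double effective_cpi(double cpi_exec, double mapi,
                     double miss_rate, double miss_penalty) {
    return cpi_exec + mapi * miss_rate * miss_penalty;
}
```

For example, amat(1, 0.02, 50) gives 2 cycles.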

Page 26: Split Caches vs. Unified Cache

• The percentage of instruction references is about 75%.
• Split (data + instruction) caches offer two memory ports per clock cycle and allow parallel access to instructions and data.
• A unified cache has only one port to satisfy both accesses, resulting in 1 more clock cycle of hit time.
• But with more cache memory (cache size) available to either reference stream, a unified cache has a lower miss rate.

Page 27: Miss Rates of Split/Unified Caches

(Table: miss rates of split vs. unified caches at various cache sizes.)

Page 28: Design Trade-off: Split Cache vs. Unified Cache

• 16KB instruction cache + 16KB data cache vs. a 32KB unified cache.
• Normal HitTime = 1 cycle; load and store HitTime in the unified cache = 2 cycles; MissPenalty = 50 cycles.
• Which one has the lower miss rate?
• Ans: Based on 75% instruction references, the overall MissRate(split) = 0.75*0.64% + 0.25*6.47% = 2.10%; MissRate(unified) = 1.99%.
• According to miss rate, the unified cache performs better.
• AverageMemoryAccessTime(split) = 75%*(1 + 0.64%*50) + 25%*(1 + 6.47%*50) = 75%*1.32 + 25%*4.235 = 0.990 + 1.059 = 2.05 cycles (better!)
• AverageMemoryAccessTime(unified) = 75%*(1 + 1.99%*50) + 25%*(2 + 1.99%*50) = 75%*1.995 + 25%*2.995 = 1.496 + 0.749 = 2.24 cycles
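The comparison can be checked numerically; the sketch below weights the instruction (75%) and data (25%) AMATs (the function name is mine):

```c
#include <assert.h>

/* Weighted AMAT over 75% instruction and 25% data references. */
double weighted_amat(double i_hit, double i_miss_rate,
                     double d_hit, double d_miss_rate, double penalty) {
    return 0.75 * (i_hit + i_miss_rate * penalty)
         + 0.25 * (d_hit + d_miss_rate * penalty);
}
```

weighted_amat(1, 0.0064, 1, 0.0647, 50) gives about 2.05 cycles for the split caches, and weighted_amat(1, 0.0199, 2, 0.0199, 50) about 2.24 cycles for the unified cache.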

Page 29: Impact of Cache on Performance

• Assume cache MissPenalty = 50 clock cycles.
• All instructions normally take 2.0 clock cycles (ignoring memory stalls).
• MissRate = 2%; MemoryAccessesPerInstruction = 1.33.
• Ans: Apply CPUtime = IC * (CPIexecution + MemoryStallClockCycles/IC) * CycleTime.
• CPUtime(with cache) = IC * (2.0 + 1.33*2%*50) * CycleTime = IC * 3.33 * CycleTime
• CPI(perfect cache) = 2.0; CPI(with cache) = 3.33; CPI(without cache) = 2.0 + 1.33*50 = 68.5!
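The three CPI figures are plain arithmetic on the slide's parameters; a sketch:

```c
#include <assert.h>

/* CPI = CPIexecution + MAPI * MissRate * MissPenalty, with the slide's
   numbers: CPIexecution = 2.0, MAPI = 1.33, MissPenalty = 50 cycles. */
double cpi_with_cache(void)    { return 2.0 + 1.33 * 0.02 * 50; }  /* 2% miss rate      */
double cpi_without_cache(void) { return 2.0 + 1.33 * 1.00 * 50; }  /* every access misses */
```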

Page 30: 2-Way Set-Associative vs. Direct-Mapped Cache

• A 2-way set-associative cache requires extra logic to select the block in the set -> longer hit time -> longer CPU clock cycle time.
• Will the advantage of a lower miss rate offset the slower hit time?
• Example (page 387): CPIexecution = 2, DataCacheSize = 64KB, MissPenalty = 70ns (35 CPU clock cycles), MemoryAccessesPerInstruction = 1.3.

                   direct-mapped           2-way set-associative
  ClockCycleTime   2ns                     2 * 1.1 = 2.2ns
  Miss rate        0.014                   0.010
  AMAT             2 + 0.014*70 = 2.98ns   2.2 + 0.010*70 = 2.90ns
                   (worse!)

• CPUtime = IC * (CPIexecution * ClockCycleTime + MemoryAccessesPerInstruction * MissRate * MissPenalty)
  CPUtime(direct-mapped) = IC * (2.0*2 + 1.3*0.014*70) = IC * 5.27
  CPUtime(2-way)         = IC * (2.0*2.2 + 1.3*0.010*70) = IC * 5.31
• Since CPU time is the bottom-line evaluation metric and a direct-mapped cache is simpler to build, in this case the direct-mapped cache is preferred.
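The bottom-line comparison is easy to reproduce; the sketch below returns nanoseconds per instruction (CPUtime / IC), with MissPenalty fixed at 70ns for both organizations (the function name is mine):

```c
#include <assert.h>

/* CPUtime / IC in ns: CPIexecution*ClockCycleTime + MAPI*MissRate*MissPenalty */
double ns_per_instruction(double cpi_exec, double cycle_ns,
                          double mapi, double miss_rate, double penalty_ns) {
    return cpi_exec * cycle_ns + mapi * miss_rate * penalty_ns;
}
```

The direct-mapped cache wins on CPU time (about 5.27 vs. 5.31 ns per instruction) even though it loses on AMAT.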

Page 31: Improving Cache Performance

• Caches can be improved by:
  – Reducing the miss rate
  – Reducing the miss penalty
  – Reducing the hit time
• These are often related; improving one area may hurt performance in the others.

Page 32: Reducing Cache Misses

• Three basic types of cache misses:
  – Compulsory: the first access to a block not yet in the cache (first-reference misses, cold-start misses).
  – Capacity: since the cache cannot contain all the blocks of a program, some blocks are replaced and later retrieved.
  – Conflict: when too many blocks map to the same set, some blocks are replaced and later retrieved.

Page 33: Reducing Miss Rate with Larger Block Size

• Good: larger blocks take advantage of spatial locality.
• Bad: larger blocks increase the miss penalty and reduce the number of blocks in the cache.
• (Figure: miss rate vs. block size.)

Page 34: Miss Rate vs. Block Size

(Table: miss rates for various block sizes and cache sizes.)

Page 35: Selecting the Block Size that Minimizes AMAT

• Assume the memory system takes 40 cycles of overhead and then delivers 16 bytes every 2 clock cycles. Miss rates are from Figure 5.12; Figure 5.13 shows the resulting AMAT.
• AMAT(BS=16B, CS=1KB) = 1 + 15.05% * (40 + (16/16)*2) = 7.321 clock cycles.
• AMAT(BS=256B, CS=256KB) = 1 + 0.49% * (40 + (256/16)*2) = 1.353 clock cycles.
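Both AMAT numbers follow from one formula; a sketch with the slide's memory system hard-coded (the function name is mine):

```c
#include <assert.h>

/* AMAT in clock cycles: hit time 1 cycle; miss penalty = 40-cycle
   overhead + 2 cycles per 16 bytes transferred. */
double amat_for_block_size(double miss_rate, int block_size_bytes) {
    double penalty = 40.0 + (block_size_bytes / 16.0) * 2.0;
    return 1.0 + miss_rate * penalty;
}
```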

Page 36: Reducing Miss Rate with a Victim Cache

• A victim cache contains blocks that are discarded from the cache on a miss.
• It is checked on a miss; if it matches, the victim block and the cache block are swapped.
• A four-entry victim cache removed 20% to 95% of the conflict misses in a 4KB direct-mapped data cache.

Page 37: Pseudo (Column)-Associative Caches

• On a hit, cache access works just like a direct-mapped cache.
• On a miss, an additional cache entry is checked for a match.
• The candidate cache entry can be the one with the most significant bit of its index field inverted.
• These cache blocks form the "pseudo set."
• One cache block has a fast hit time; the other has a slow hit time.
• What is the difference between this and a two-way set-associative cache? (It does not affect the processor clock rate.)

Page 38: Hardware Prefetching of Instructions and Data

• Prefetch items before the processor requests them; place them in the cache or in an external buffer.
• Instruction prefetch:
  – The AXP 21064 fetches two blocks on a miss: the requested block into the cache, and the next consecutive block (the prefetch block) into an instruction stream buffer (ISB).
  – If the requested instruction is in the ISB, the cache request is cancelled, the block is read from the ISB, and the next prefetch request is issued.
  – For a 4KB direct-mapped cache with BS=16B: 1 block in the ISB catches 15-25% of misses; 4 blocks improve the hit rate to about 50%; 16 blocks to 72%.
• Data prefetch:
  – 1 single data stream buffer catches 25% of misses.
  – 4 data stream buffers increase the data hit rate to 43%.

Page 39: Compiler-Controlled Prefetching

• The compiler generates prefetch instructions to request data before they are needed.
• Register prefetch loads the value into a register; cache prefetch loads data only into the cache.
• Since prefetching a block can cause a page fault in virtual memory (the page containing the block is not in main memory), prefetches are classified as faulting or nonfaulting.
• Nonbinding prefetch: a nonfaulting cache prefetch that does not cause virtual memory faults.
• A nonblocking (lockup-free) cache allows the processor to continue retrieving data and instructions while prefetched data is being fetched.
• Goal: overlap execution with prefetching of data. Loops are the key target.

Page 40: Example of Compiler-Controlled Prefetching

for (i=0; i<3; i=i+1)
  for (j=0; j<100; j=j+1)
    a[i][j]=b[j][0]*b[j+1][0];

Assume an 8KB direct-mapped cache, BS=16B, write-back with write-allocate.
a and b are double-precision (8B) floating-point arrays; a has 3 rows and 100 columns; b has 101 rows and 3 columns.
Assume a and b are not in the cache at the start of the program.

1. (Compiler) Determine the number of cache misses.
Ans: In C, a two-dimensional array is stored in row-major order:
a[0][0]; a[0][1]; …; a[0][99]; a[1][0]; a[1][1]; …; a[1][99]; a[2][0]; …
Accessing down a column of a (a[i][0] for successive i, elements separated by 800 B) is called striding.
In the loop above, a's elements are written in the same order they are laid out in memory. Since BS=16B, the access to a[0][0] misses and brings in a block containing a[0][0] and a[0][1]; the next access, a[0][1], hits. For array a, accesses with even values of j miss and accesses with odd values of j hit: 300/2 = 150 misses (spatial locality).

Page 41: Example of Compiler-Controlled Prefetching (cont.)

(Same loop and assumptions as the previous slide.)

1. (Compiler) Determine the number of cache misses.
Ans: In the first loop iteration, b[0][0] and b[1][0] are accessed. They are separated by 3*8 = 24 bytes and do not belong to the same block. Their misses bring in the blocks containing b[0][0] and b[0][1] (not used), and b[1][0] and b[1][1] (not used).
In the next iteration, b[1][0] and b[2][0] are accessed: b[1][0] hits and b[2][0] misses.
In the loop where i=0, there are 101 misses, for b[0][0] through b[100][0]. In the loops with i=1 and i=2, all accesses to array b hit.
Total: 150 + 101 = 251 misses.
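The 251-miss count can be reproduced by simulating the access stream against a direct-mapped cache. The sketch below assumes a[0][0] and b[0][0] are 16B-aligned and that the two arrays do not conflict in the 8KB cache (the base addresses and all names are mine):

```c
#include <assert.h>
#include <string.h>

#define LINES 512            /* 8KB cache / 16B blocks */
#define BLOCK 16

static long tags[LINES];
static int  valid[LINES];
static int  misses;

/* One access to byte address addr in a direct-mapped cache; with
   write-allocate, reads and writes are handled identically here. */
static void access_addr(long addr) {
    long block = addr / BLOCK;
    int line = (int)(block % LINES);
    if (!valid[line] || tags[line] != block) {
        valid[line] = 1;
        tags[line] = block;
        misses++;
    }
}

int count_misses(void) {
    long a = 0;      /* assumed base of a[3][100], 16B-aligned  */
    long b = 4096;   /* assumed base of b[101][3], no conflicts */
    memset(valid, 0, sizeof valid);
    misses = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 100; j++) {
            access_addr(b + (long)(j * 3) * 8);       /* read b[j][0]   */
            access_addr(b + (long)((j + 1) * 3) * 8); /* read b[j+1][0] */
            access_addr(a + (long)(i * 100 + j) * 8); /* write a[i][j]  */
        }
    return misses;
}
```

The simulated total is 150 misses for a plus 101 for b, matching the slide's analysis.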

Page 42: Example of Compiler-Controlled Prefetching (cont.)

2. Insert prefetch instructions to reduce misses. Assume the miss penalty is large, so we need to prefetch 7 iterations in advance. Prefetches at the beginning and end of the arrays do not fault.
Ans: The first loop prefetches both b and a; the second loop prefetches only a.

for (j=0; j<100; j++) {
  prefetch(b[j+7][0]);    /* b[j][0] for 7 iterations later */
  prefetch(a[0][j+7]);    /* a[0][j] for 7 iterations later */
  a[0][j]=b[j][0]*b[j+1][0];
}
for (i=1; i<3; i++)
  for (j=0; j<100; j++) {
    prefetch(a[i][j+7]);  /* a[i][j] for 7 iterations later */
    a[i][j]=b[j][0]*b[j+1][0];
  }

The revised code prefetches a[i][7]…a[i][99] and b[7][0]…b[100][0].
1st loop: nonprefetched misses to a[0][0-6] and b[0-6][0]: 4 + 7 = 11.
2nd loop: nonprefetched misses to a[1][0-6] and a[2][0-6]: 4 + 4 = 8.
This avoids 251 - 19 = 232 cache misses, at the cost of executing 400 prefetch instructions.

Page 43: Correction to Prefetch Results

• Assume that a[i][0-6] (for i >= 1) is prefetched by the trailing prefetches a[i-1][100-106]; then only a[0][0-6] causes ceil(7/2) = 4 nonprefetched misses.
• b[0-6][0] causes 7 nonprefetched misses.
• b[100][0] is prefetched.
• The total nonprefetched misses should be 4 + 7 = 11.
• Does prefetch(a[0][7]) bring in
  1. a[0][7] and a[0][8], or
  2. a[0][6] and a[0][7]?
• Note that if a[0][0] is allocated at an address divisible by 16, the answer is 2; otherwise the answer is 1.

Page 44: Improvement from Compiler-Controlled Prefetch

• Ignore instruction cache misses; assume no conflict or capacity misses in the data cache.
• Prefetches can overlap with each other and with cache misses.
• The original loop takes 7 clock cycles per iteration; the first prefetch loop takes 9 clock cycles per iteration; the second prefetch loop takes 8 clock cycles per iteration.
• A miss takes 50 clock cycles.
• Ans: Original loop: 300 iterations * 7 cycles = 2100 cycles; 251 cache misses * 50 cycles = 12550 cycles; total = 2100 + 12550 = 14650 cycles.
  1st prefetch loop: 100 iterations * 9 cycles + 11 misses * 50 cycles = 1450 cycles.
  2nd prefetch loop: 200 iterations * 8 cycles + 8 misses * 50 cycles = 2000 cycles.
  Total prefetch-loop cycles = 1450 + 2000 = 3450.
  The prefetch code is 14650/3450 = 4.2 times faster.
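The cycle accounting above is simple enough to assert directly:

```c
#include <assert.h>

/* Cycle totals from the slide's accounting. */
int original_loop_cycles(void) { return 300 * 7 + 251 * 50; }
int prefetch_loop_cycles(void) { return (100 * 9 + 11 * 50) + (200 * 8 + 8 * 50); }
```

original_loop_cycles() is 14650 and prefetch_loop_cycles() is 3450, a speedup of about 4.2.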

Page 45: Reducing Miss Rate Through Compiler Optimization

• A pure software solution.
• Using profiling information to rearrange code, the miss rate can be reduced by 50%. [McFarling 98]
• Data has fewer restrictions on relocation than code.
• Goal: rearrange code and data to improve spatial and temporal locality.

Page 46: Example 1: Merging Arrays

• Group related data into a record so that the fields are fetched together in one cache block.
• What locality is exploited here?

Before:
int val[SIZE];
int key[SIZE];

After:
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Page 47: Loop Interchange

• In C, arrays are arranged in row-major order; in Fortran, arrays are arranged in column-major order.
• Before: striding through x in column order causes many misses. The block containing x[0][1] may get kicked out by conflict misses before it is accessed when j=1.

for (j=0; j<100; j++)
  for (i=0; i<5000; i++)
    x[i][j]=2*x[i][j];

• After: access the data in consecutive memory locations.

for (i=0; i<5000; i++)
  for (j=0; j<100; j++)
    x[i][j]=2*x[i][j];

Page 48: Loop Fusion

• Before:

for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    a[i][j]=1/b[i][j]*c[i][j];
for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    d[i][j]=a[i][j]+c[i][j];

• After:

for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
    a[i][j]=1/b[i][j]*c[i][j];
    d[i][j]=a[i][j]+c[i][j];
  }

The second accesses of a and c will be hits.

Page 49:

Blocking (Improve Temporal Locality)
• Some algorithms access data by both row and column, so row/column-major order does not help.
• Idea: modify the algorithm to access data by blocks (submatrices).
• Before (matrix multiplication):
  for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
      r = 0;
      for (k=0; k<N; k++)
        r = r + y[i][k]*z[k][j];
      x[i][j] = r;
    }
• After (B = submatrix block size; note x must now accumulate partial sums across k-blocks):
  for (jj=0; jj<N; jj=jj+B)
    for (kk=0; kk<N; kk=kk+B)
      for (i=0; i<N; i++)
        for (j=jj; j<min(jj+B,N); j++) {
          r = 0;
          for (k=kk; k<min(kk+B,N); k++)
            r = r + y[i][k]*z[k][j];
          x[i][j] = x[i][j] + r;
        }

Page 50:

Access Pattern
[Figure: access pattern of the matrices for i=1, without blocking vs. with blocking. Gray: old accesses; black: new accesses; white: not yet accessed.]

Page 51:

Impact of conflict misses in direct-mapped caches on block size
• Smaller block sizes reduce conflict misses.
• See the right side of the upper curve (miss rate vs. blocking factor). [Lam91]

Page 52:

Reducing Cache Miss Penalty: Sub-block Placement
• Large block size: small tag field.
• Large block size: long miss penalty.
• Solution: bring in only a portion of the block, e.g., 8 bytes of a 32-byte block.
• Each sub-block needs its own valid bit.

Page 53:

Early Restart and Critical Word First
• Early Restart: as soon as the requested word of the block arrives, send it to the CPU; do not wait for the complete block to arrive.
• Critical Word First: request that the lower-level memory send the requested word first, then the rest of the block. When the requested word arrives, send it to the CPU immediately.

Page 54:

Nonblocking Data Cache
• Some pipelined machines allow instructions to execute out of order.
• After a data cache miss occurs, we can allow other instructions in the pipeline to retrieve their data.
• Their data may be in the cache; this is called "hit under miss".
• This type of cache is called a nonblocking cache.

Page 55:

Second-Level Caches

• AMAT = HitTimeL1 + MissRateL1 * MissPenaltyL1
• MissPenaltyL1 = HitTimeL2 + MissRateL2 * MissPenaltyL2
• AMAT = HitTimeL1 + MissRateL1 * (HitTimeL2 + MissRateL2 * MissPenaltyL2)
• Local miss rate: # of misses in this cache / memory accesses to this cache.
• Global miss rate: # of misses in this cache / memory accesses from the CPU.
• Example: 1000 memory references, 40 misses in the first-level cache, 20 misses in the second-level cache. The miss rate for the level-1 cache is 40/1000 = 4%, the local miss rate for the level-2 cache is 20/40 = 50%, and the global miss rate for the level-2 cache is 20/1000 = 2%.

Page 56:

Main Memory Organization
[Figure: alternative memory organizations with a 32-bit CPU-cache bus and a 128-bit path to memory.]
• Wider memory bus: faster transfer, but memory must be upgraded in bigger increments.
• Interleaved memory: the same address retrieves different words from different banks. The word in Bank 0 transfers first, followed by Bank 1, and so on; the 4 words of a block are spread across 4 banks.

Page 57:

Wider Main Memory vs. Interleaved Memory

Assume the basic memory organization takes 1 cycle to send the address, 6 cycles to access a word, and 1 cycle to transfer a word. For reading a block of 4 words:

Mem./Bus Width   Time (cycles)     Bandwidth (bytes/cycle)
1 word           4x(1+6+1) = 32    16 B/32 cycles = 0.5
2 words          2x(1+6+1) = 16    16 B/16 cycles = 1
4 words          1x(1+6+1) = 8     16 B/8 cycles  = 2

Page 58:

Compare Memory Organizations
CPIaverage = CPIexecution + MemoryAccessesPerInstruction * MissRate * MissPenalty
Example: use CPIaverage to trade off wide vs. interleaved memory. Assume the clock cycle time and instruction count do not change. For BlockSize (BS, in words) = 1 and MemoryBusWidth (MBW, in words) = 1: MissRate = 15% (10%, 5% for BS = 2, 4), MissPenalty = 8 cycles, MemoryAccessesPerInstruction = 1.2, average cycles per instruction (ignoring cache misses) = 2.

                                 MissRate   CPIaverage
BS=1, MBW=1, no interleaving     15%        2+(1.2*15%*8)       = 3.44
BS=2, MBW=1, no interleaving     10%        2+(1.2*10%*2*8)     = 3.92
BS=2, MBW=1, interleaving        10%        2+(1.2*10%*(1+6+2)) = 3.08
BS=2, MBW=2, no interleaving     10%        2+(1.2*10%*1*8)     = 2.96
BS=4, MBW=1, no interleaving     5%         2+(1.2*5%*4*8)      = 3.92
BS=4, MBW=1, interleaving        5%         2+(1.2*5%*(1+6+4))  = 2.66
BS=4, MBW=2, no interleaving     5%         2+(1.2*5%*2*8)      = 2.96

Page 59:

Virtual Memory

is a technique to
• share a small physical main memory among many processes.
• let programmers write programs larger than main memory (e.g., simulation programs or DB applications with large data sets); it makes the physical main memory size invisible to programmers.
• relieve programmers of the burden of writing swap routines to move overlay program segments in and out of main memory.
• protect the memory spaces of different processes.
• enable relocation of a program anywhere in main memory.
• reduce the time to start a program. (Why?)

Page 60:

Page vs. Segment

• Physical main memory is typically divided into fixed-length blocks, called pages. A miss is called a page fault.
• If the blocks are variable-length, they are called segments; a segment miss is called a segment fault.
• When a page fault happens, the page is brought in from disk.

Page 61:

Mapping Virtual Address to Physical Address
• Memory mapping (address translation): the CPU produces virtual addresses that are translated by a combination of hardware and software into the physical addresses used to access main memory.
[Figure: pages A-D appear contiguous in virtual memory; actually they are scattered across main memory and disk.]

Page 62:

Q1: Where can a block be placed in main memory? (Block placement)
• Because of the horrendous cost of a miss (DRAM speed and disk speed are 4 orders of magnitude apart), OS designers always pick the lower miss rate and allow blocks to be placed anywhere in main memory (fully associative).

Page 63:

Q2: How is a block found in main memory? (Block identification)
• Example: size of the page table. For a 28-bit virtual address, a 4KB page, and 4 bytes per page table entry, the page table has 2^16 = 64K entries and occupies 256KB (the equivalent of 64 pages). It is quite big, and the address translation can be quite slow.
[Figure: the virtual address = 16-bit virtual page number + 12-bit page offset; the page table maps the virtual page number to a 12-bit physical page-frame number (16MB main memory = 4K page frames), with 32 bits per entry.]

Page 64:

Page Table

• To reduce the size of the page table, some machines apply a hashing function to the virtual page number to get a value between 0 and 4K-1 and use an inverted page table, with one entry per physical page in main memory. A 16MB physical memory needs 4B*16MB/4KB = 16KB (4 pages) for the inverted page table.
• To reduce the address translation time, a special cache called a TLB (Translation-Lookaside Buffer) is used.

Page 65:

Translation-Lookaside Buffer: TLB

• To reduce the virtual-to-physical address translation time, a cache called the Translation-Lookaside Buffer (TLB, first used in the IBM 370) keeps the most recent address translations.
• A TLB entry contains
  – a tag field that holds the virtual page number,
  – a data field that holds the corresponding physical page-frame number, a protection field, a use bit, and a dirty bit.

Page 66:

AXP 21064 TLB (32 entries): a fully associative cache
[Figure: TLB entry format. V: valid bit; R: read permission bit; W: write permission bit.]

Page 67:

Combine Cache with Virtual Memory
• Is there a way to reduce the cache hit time?
[Figure: the CPU sends a virtual address to the TLB; the TLB produces the physical address used to probe the cache; if the data is not in the cache it comes from main memory; if the entry is not in the TLB, it is brought in from the page table in main memory.]
modified cache hit time = address translation time + original cache hit time

Page 68:

Q3: Which block should be replaced on a virtual memory miss? (Block replacement)
• Replacement on a cache miss is controlled by hardware.
• Replacement on a virtual memory miss is controlled by the OS.
• The OS tries to replace the Least Recently Used (LRU) block.
• To help estimate the "freshness" of blocks, many OSes provide a use bit (reference bit): it is set when the page is accessed and periodically cleared.

Page 69:

Q4: What happens on a write?
• Write-through is too expensive: a write to disk costs 0.7M~6M cycles.
• Always use write-back.

Page 70:

Exercise on Cache + Virtual Memory
Consider a memory system with:
  virtual address = 32 bits
  2-way set-associative TLB, total TLB entries = 512
  page size = 2KB
  physical address = 36 bits
  2-way set-associative cache, total cache data size (CS) = 256KB
  line size = block size (BS) = 32B

a = TLB tag field
b = TLB index field
c = page offset
d = physical page-frame address field
e = cache tag field
f = cache index field
g = block offset
h = block data field

What is the relation between the two sets of data?

Page 71:

Relation between field sizes
[Figure: the virtual address splits into fields a|b|c; the TLB maps it to the physical page-frame number d; the physical address splits into e|f|g to index the data cache, which delivers an h-bit block through a MUX.]
a = 32-b-c = 32-8-11 = 13
b = log2(TLB entries/associativity) = log2(512/2) = log2(256) = 8
c = log2(page size) = log2(2^11) = 11
d = 36-c = 36-11 = 25
e = 36-f-g = 36-12-5 = 19
f = log2(cache data size/(associativity*BS)) = log2(2^18/(2*2^5)) = 12
g = log2(BS) = log2(32) = 5
h = 32*8 = 256 bits

Page 72:

Find the Optimal Block Size
The following measurements have been made on the system:
  0.4 data references/instruction (MAPI = 1.4)
  TLB miss rate = 0.1%, TLB miss penalty = 25 cycles
  Cache miss penalty = 6 cycles + #words/block

b) Find the average memory access time for a 4-word block size, assuming 1 clock cycle for the hit time.
  AMAT = hit time + miss rate * miss penalty = AMATTLB + AMATcache
       = (1+0.001*25) + (1+0.015*(6+4)) = 2.175
c) Find the optimal block size (compare AMAT for 1W~16W blocks).
  The 8-word block size has the lowest AMAT = 2.165.

Page 73:

Cache Analysis Using dinero
• There is a trace-driven cache simulator called dinero in ~cs520/dinero/bin.
• Given a trace (the history of the instruction and data references of a program) and cache design parameters (such as block size, unified cache size or instruction and data cache sizes, associativity, replacement policy, and write policy), dinero performs a cache simulation and reports the miss rate.
• For example, the command
  dinero -b32 -u32K -a1 < eg.din > eg32k.out
  performs a cache analysis of a 32K unified direct-mapped cache with block size = 32 bytes, using the trace input file eg.din and writing the output to eg32k.out.
• The man page is in ~cs520/dinero/man. The eg.din trace is in ~cs520/dinero/demo.

Page 74:

Dinero Output
>dinero -b32 -u64K -a1 < ~cs520/traces/benchmarks/tex.din > tex.cache
---Dinero III by Mark D. Hill.
CMDLINE: dinero -b32 -u64K -a1
CACHE (bytes): blocksize=32, sub-blocksize=0, Usize=65536, Dsize=0, Isize=0.
POLICIES: assoc=1-way, replacement=l, fetch=d(1,0), write=c, allocate=w.
CTRL: debug=0, output=0, skipcount=0, maxcount=10000000, Q=0.
---Simulation begins.
Metrics                   Total    Instrn   Data     Read     Write    Misc
(totals, fraction)        ------   ------   ------   ------   ------   ------
Demand Fetches            832477   597309   235168   130655   104513   0
                          1.0000   0.7175   0.2825   0.1569   0.1255   0.0000
Demand Misses             1537     130      1407     343      1064     0
                          0.0018   0.0002   0.0060   0.0026   0.0102   0.0000
---Execution complete.
43.7u 11.1s 0:57 95% 92+470k 950+1io 0pf+0w
(The total Demand Misses fraction, 0.0018, is the overall miss rate.)

Page 75:

Homework #3

Prob. 2. Exercise on Cache + Virtual Memory
Consider a memory system with virtual address = 34 bits, fully associative TLB, total TLB entries = 1024, page size = 8KB, physical address = 36 bits, direct-mapped cache, total cache data size = 256KB, block size = 32B.
a) What are the tag field sizes in the TLB and the cache?
b) Assume 0.5 data references per instruction, TLB miss rate = 0.14%, TLB miss penalty = 20 cycles, and cache miss penalty = 6 cycles + #words in a block; the miss rate for a block size of 4 words is 1.4% while the miss rate for a block size of 8 words is 1%. Which block size offers better performance?

Page 76:

Homework #3

Problem 3.
  for (i=0; i<3; i=i+1)
    for (j=0; j<100; j=j+1)
      a[i][j] = b[j][0] * b[j+1][0];
Assume a 32KB direct-mapped cache, BS = 32B, write-back cache with write allocate.
a and b are double-precision (8B) floating-point arrays; a has 3 rows and 100 columns; b has 101 rows and 3 columns.
Assume a and b are not in the cache at the start of the program.
Determine the number of cache misses.