1
CMPE 421 Parallel Computer Architecture
PART 3: Accessing a Cache
3
Direct Mapped Cache Example 2 (4 blocks, 1 word/block)
Consider the main memory word reference string 0 1 2 3 4 3 4 15.
Start with an empty cache - all blocks initially marked as not valid.

Word addr  Index  Tag  Hit/miss  Cache contents after the access
    0       00    00   miss      Mem(0)
    1       01    00   miss      Mem(0) Mem(1)
    2       10    00   miss      Mem(0) Mem(1) Mem(2)
    3       11    00   miss      Mem(0) Mem(1) Mem(2) Mem(3)
    4       00    01   miss      Mem(4) Mem(1) Mem(2) Mem(3)   (4 replaces 0)
    3       11    00   hit       Mem(4) Mem(1) Mem(2) Mem(3)
    4       00    01   hit       Mem(4) Mem(1) Mem(2) Mem(3)
   15       11    11   miss      Mem(4) Mem(1) Mem(2) Mem(15)  (15 replaces 3)

8 requests, 6 misses, 2 hits
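Not from the slides, but a minimal C sketch of exactly this direct mapped lookup may help; the structure and names below are illustrative (index = block address modulo the number of blocks, tag = the remaining high order bits):

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 4                    /* 4 blocks, 1 word per block */

int main(void) {
    int refs[] = {0, 1, 2, 3, 4, 3, 4, 15};
    int n = (int)(sizeof refs / sizeof refs[0]);

    bool valid[NUM_BLOCKS] = {false};
    int  tag[NUM_BLOCKS]   = {0};
    int  hits = 0, misses = 0;

    for (int i = 0; i < n; i++) {
        int addr  = refs[i];
        int index = addr % NUM_BLOCKS;  /* (block address) modulo (# of blocks) */
        int t     = addr / NUM_BLOCKS;  /* remaining high order bits            */

        if (valid[index] && tag[index] == t) {
            hits++;
            printf("addr %2d: hit  (block %d)\n", addr, index);
        } else {
            misses++;
            valid[index] = true;        /* fetch the block, replacing any      */
            tag[index]   = t;           /* previous occupant of this slot      */
            printf("addr %2d: miss (block %d, new tag %d)\n", addr, index, t);
        }
    }
    printf("%d requests, %d misses, %d hits\n", n, misses, hits);
    return 0;
}
```

Re-running it with NUM_BLOCKS = 8 and the reference string 22, 26, 22, 26, 16, 3, 16, 18 reproduces the hits and misses of Example 1 on the following slides, and the string 0, 4, 0, 4, ... reproduces the ping pong example on page 13.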
4
Direct Mapped Caching: A Simple First Example
[Figure: a 4-block direct mapped cache (Valid, Tag, and Data fields; indexes 00, 01, 10, 11) alongside a 16-word main memory with word addresses 0000xx through 1111xx. The two low order bits xx define the byte in the word (32-bit words).]

Q1: How do we find it?
Use the next 2 low order memory address bits - the index - to determine which cache block holds the word, i.e., (block address) modulo (# of blocks in the cache).
Q2: Is it there?
Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache.
The valid bit indicates whether an entry contains valid information - if the bit is not set, there cannot be a match for this block.
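As a small illustration (a hypothetical snippet, not part of the slides), the modulo rule maps each of the 16 memory word blocks onto the 4 cache blocks, and the quotient is the tag that tells them apart:

```c
#include <stdio.h>

int main(void) {
    /* 16 one-word blocks of main memory, 4 cache blocks */
    for (int block = 0; block < 16; block++)
        printf("memory block %2d -> cache index %d, tag %d\n",
               block, block % 4, block / 4);
    return 0;
}
```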
5
Direct Mapped Cache Example 1: 8 blocks, 1 word/block, direct mapped. Initial state:
Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 N
111 N
6
Direct Mapped Cache Example 1
Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N
Word addr Binary addr Hit/miss Cache block
22 10 110 Miss 110
7
Direct Mapped Cache Example 1
Index V Tag Data
000 N
001 N
010 Y 11 Mem[11010]
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N
Word addr Binary addr Hit/miss Cache block
26 11 010 Miss 010
8
Direct Mapped Cache Example 1
Index V Tag Data
000 N
001 N
010 Y 11 Mem[11010]
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N
Word addr Binary addr Hit/miss Cache block
22 10 110 Hit 110
26 11 010 Hit 010
9
Direct Mapped Cache Example 1
Index V Tag Data
000 Y 10 Mem[10000]
001 N
010 Y 11 Mem[11010]
011 Y 00 Mem[00011]
100 N
101 N
110 Y 10 Mem[10110]
111 N
Word addr Binary addr Hit/miss Cache block
16 10 000 Miss 000
3 00 011 Miss 011
16 10 000 Hit 000
10
Direct Mapped Cache Example 1
Index V Tag Data
000 Y 10 Mem[10000]
001 N
010 Y 10 Mem[10010]
011 Y 00 Mem[00011]
100 N
101 N
110 Y 10 Mem[10110]
111 N
Word addr Binary addr Hit/miss Cache block
18 10 010 Miss 010
11
Address Subdivision: Direct Mapped Cache
One word/block, cache size = 1K words.

[Figure 7.7: the 32-bit address (bits 31 30 ... 13 12 | 11 ... 2 | 1 0) is divided into a 20-bit tag, a 10-bit index, and a 2-bit byte offset. The index selects one of 1024 cache entries (0, 1, 2, ..., 1021, 1022, 1023), each holding a Valid bit, a 20-bit Tag, and a 32-bit Data word. Hit is asserted when the selected entry is valid and its tag matches the address tag; Data is the 32-bit word read out.]

FIGURE 7.7: For this cache, the lower portion of the address is used to select a cache entry consisting of a data word and a tag.

What kind of locality are we taking advantage of?
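A sketch of this subdivision in C, using Figure 7.7's field widths (the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr = 0x00003004;                   /* arbitrary example address      */
    uint32_t byte_offset = addr        & 0x3;     /* bits 1-0                       */
    uint32_t index       = (addr >> 2) & 0x3FF;   /* bits 11-2: 2^10 = 1024 entries */
    uint32_t tag         = addr >> 12;            /* bits 31-12: 20-bit tag         */
    printf("tag=0x%05x index=%u byte_offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)byte_offset);
    return 0;
}
```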
13
Another Example for Direct Mapping
Consider the main memory word reference string 0 4 0 4 0 4 0 4.
Start with an empty cache - all blocks initially marked as not valid. Both words map to cache block 00.

Word addr  Index  Tag  Hit/miss  Cache contents after the access
    0       00    00   miss      Mem(0)
    4       00    01   miss      Mem(4)   (4 replaces 0)
    0       00    00   miss      Mem(0)   (0 replaces 4)
    4       00    01   miss      Mem(4)
    0       00    00   miss      Mem(0)
    4       00    01   miss      Mem(4)
    0       00    00   miss      Mem(0)
    4       00    01   miss      Mem(4)

8 requests, 8 misses
Ping pong effect due to conflict misses - two memory locations that map into the same cache block.
14
Handling Cache Misses
Our control unit must detect a miss and process the miss by fetching the data from memory (or from a lower-level cache).
Approach:
- Stall the CPU, freezing the contents of all the registers
- A separate controller fetches the data from memory
- Once the data is present, execution of the datapath is resumed
15
Instruction Cache Misses
1. Send the original PC value (current PC - 4) to the memory.
2. Instruct main memory to perform a read and wait for the memory to complete its access.
3. Write the cache entry: put the data from memory in the data portion of the entry, write the upper bits of the address (from the ALU) into the tag field, and set the valid bit.
4. Restart the instruction execution at the first step, which will re-fetch the instruction (now in the cache).
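A C sketch of these four steps, standing in for the hardware controller; the cache size, toy memory array, and function name are illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_LINES 1024                  /* illustrative size, not the real one   */

struct iline { bool valid; uint32_t tag, data; };
static struct iline icache[NUM_LINES];
static uint32_t     imem[1 << 16];      /* toy word-addressed instruction memory */

uint32_t handle_imiss(uint32_t pc_plus4) {
    uint32_t addr  = pc_plus4 - 4;      /* step 1: recover the original PC       */
    uint32_t word  = addr >> 2;         /* word address                          */
    uint32_t index = word % NUM_LINES;

    uint32_t inst = imem[word];         /* step 2: read main memory and wait     */
    icache[index].data  = inst;         /* step 3: fill the data portion ...     */
    icache[index].tag   = word / NUM_LINES;  /* ... the tag (upper addr bits) ...*/
    icache[index].valid = true;              /* ... and set the valid bit        */
    return inst;                        /* step 4: restart; the re-fetch now hits */
}

int main(void) {
    imem[1] = 0x8C010000;               /* pretend instruction stored at PC = 4  */
    printf("fetched 0x%08x\n", (unsigned)handle_imiss(8));  /* current PC+4 = 8  */
    return 0;
}
```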
16
Example Machine
The Digital DECStation 3100, one of the first commercially available RISC-architecture machines, used a MIPS R2000 processor:
- 5-stage pipeline
- requested an instruction word and a data word on every clock cycle
- static branch prediction
- delayed branch instruction
- 64 KB data cache and 64 KB instruction cache
17
DECStation 3100 Cache (16K words)
This cache has 2^14 (16K) words with a block size of 1 word. 14 bits are used to index into the cache, 2 bits are the byte offset, and the remaining 16 bits are compared against the tag. A hit results if the upper 16 bits of the address match the tag AND the valid bit is set.
18
DECStation 3100 Cache
64 KB = 16K words, with a 1-word block. Read requests (lw instruction):
1. Send the address to the appropriate cache (instruction address from the PC, data address from the ALU).
2. If the cache signals a hit, read the requested word from the data lines. Else, send the address to main memory; when the requested word comes back from main memory, write it into the cache.
19
DECStation 3100 Cache
Write requests: on a sw instruction, the DECStation 3100 used a scheme called write-through. Write-through stores the word in the data cache AND in main memory; this is done to keep the data cache and main memory consistent.
Don't bother to check for hits and misses - just write the word into the cache and into main memory.

[Figure: the cache and memory both start with A=5. The process then wants to update A to 10. If only the cache copy were written (cache A=10, memory A=5), the cache copy of A would no longer equal the memory copy: inconsistency. With the write-through policy we write into both of them (cache A=10, memory A=10).]
20
DECStation 3100 Cache
Write requests using a write-through scheme (see page 17):
1. Index the cache using bits 2-15 of the address.
2. Write bits 31-16 of the address into the tag, write the data word into the data portion, and set the valid bit.
3. Also write the word to main memory using the entire address.
This simple approach slows our performance down because of the long write to main memory.
21
EXAMPLE: cost of write-through on a cache miss
Suppose 10% of the instructions are stores, the CPI without cache misses is 1.0, and every write spends 100 extra cycles:
=> CPI = 1.0 + 100 * 10% = 11
=> performance is reduced by more than a factor of 10
=> SOLUTION: Write Buffer Solution
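The slide's arithmetic, spelled out as a tiny check (all values come from the example itself):

```c
#include <stdio.h>

int main(void) {
    double base_cpi    = 1.0;    /* CPI without cache misses         */
    double store_frac  = 0.10;   /* 10% of instructions are stores   */
    double write_stall = 100.0;  /* extra cycles per write to memory */

    double cpi = base_cpi + store_frac * write_stall;  /* 1.0 + 0.1*100 */
    printf("effective CPI = %.1f\n", cpi);             /* prints 11.0   */
    return 0;
}
```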
22
Write Buffer Solution
Use a write buffer to store the data while it is waiting to be written to memory:
- Write the data into the cache and into the write buffer
- The processor continues with execution
- The write buffer (memory controller) copies the data into main memory; when a write to main memory completes, the entry in the write buffer is freed
Hopefully, the processor does not generate writes faster than the write buffer can take them. If the write buffer becomes full, stalls occur.
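A minimal FIFO sketch of this idea; the 4-entry depth and the interface are illustrative assumptions, not the DECStation's actual design:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WB_DEPTH 4    /* buffer depth is an assumption for illustration */

struct wb_entry { uint32_t addr, data; };

static struct wb_entry wbuf[WB_DEPTH];
static int head, tail, count;

/* CPU side: called on every store; returns false when the CPU must stall */
bool wb_enqueue(uint32_t addr, uint32_t data) {
    if (count == WB_DEPTH)
        return false;                       /* buffer full -> stall        */
    wbuf[tail] = (struct wb_entry){addr, data};
    tail = (tail + 1) % WB_DEPTH;
    count++;
    return true;                            /* CPU continues executing     */
}

/* Memory-controller side: drains one entry whenever the DRAM is free */
bool wb_dequeue(struct wb_entry *out) {
    if (count == 0)
        return false;                       /* nothing waiting             */
    *out = wbuf[head];
    head = (head + 1) % WB_DEPTH;
    count--;
    return true;                            /* entry freed after the write */
}

int main(void) {
    struct wb_entry e;
    wb_enqueue(0x100, 42);                  /* store goes into the buffer  */
    while (wb_dequeue(&e))                  /* controller writes it back   */
        printf("mem[0x%x] <- %u\n", (unsigned)e.addr, (unsigned)e.data);
    return 0;
}
```

The CPU stalls only when wb_enqueue returns false, which matches the slide: stalls occur only when stores arrive faster than the memory controller can drain them.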
23
Write Buffer Saturation
PROBLEM: the memory system designer's nightmare.
- If store frequency (w.r.t. time) << 1 / DRAM write cycle, the write buffer works fine.
- If store frequency (w.r.t. time) -> 1 / DRAM write cycle and this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it, because you are simply feeding data faster than the memory can empty it (the CPU cycle time << DRAM write cycle time).
24
Solutions for write buffer saturation: use a write-back cache, install a second-level (L2) cache, or use store compression.
25
Alternative: Write-Back Solution
The write-back scheme doesn't automatically write to the cache AND to main memory. Instead, it writes only to the cache. The data does not get written to memory until that cache block has to be replaced with a different block (i.e., when the block is evicted on a cache miss).
This can improve performance and greatly reduces the memory bandwidth requirement, but it is more complex to implement than write-through: the control can be complex, and we need a "dirty bit" for each cache block.
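A write-back sketch under toy assumptions (1-word blocks, direct mapped, word-addressed memory), showing where the dirty bit is set and when memory is finally updated:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_LINES 8

static uint32_t memory[1024];      /* toy word-addressed main memory */

struct line { bool valid, dirty; uint32_t tag, data; };  /* 1-word blocks */
static struct line cache[NUM_LINES];

void store_word(uint32_t addr, uint32_t word) {
    uint32_t index = addr % NUM_LINES;
    uint32_t tag   = addr / NUM_LINES;
    struct line *ln = &cache[index];

    if (!ln->valid || ln->tag != tag) {                /* miss              */
        if (ln->valid && ln->dirty)                    /* evict: write back */
            memory[ln->tag * NUM_LINES + index] = ln->data;
        ln->data  = memory[addr];                      /* fetch new block   */
        ln->tag   = tag;
        ln->valid = true;
        ln->dirty = false;
    }
    ln->data  = word;    /* the write stays in the cache ...           */
    ln->dirty = true;    /* ... memory is updated only at eviction time */
}

int main(void) {
    store_word(0, 1);    /* miss; line 0 becomes dirty                      */
    store_word(8, 2);    /* conflicts with 0: victim written back to mem[0] */
    printf("mem[0]=%u mem[8]=%u\n",    /* prints mem[0]=1 mem[8]=0: the     */
           (unsigned)memory[0],        /* write of 2 still lives only in    */
           (unsigned)memory[8]);       /* the cache until the next eviction */
    return 0;
}
```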
26
Sources of Cache Misses
Compulsory (cold start or process migration; first reference):
- Caused when we first start the program: the first access to a block is a "cold" fact of life, and there is not a whole lot you can do about it
- If you are going to run millions of instructions, compulsory misses are insignificant
Conflict (collision):
- Multiple memory locations mapped to the same cache location
- Solution 1: increase cache size; Solution 2: increase associativity (next lecture)
Capacity:
- The cache cannot contain all blocks accessed by the program
- Solution: increase cache size
27
Multiword (4-word) Block Direct Mapped Cache: Taking Advantage of Spatial Locality
The cache organization we have discussed so far (1 block = 1 word) does not exploit spatial locality. We want a cache block size > one word, so that when a cache miss occurs we fetch multiple adjacent words, and the probability that one of these words will be needed shortly is high.
Example: choose block size = 4 words (4 x 4 = 16 bytes).
- We now need an extra "block offset" field, which selects one of the four words in the indexed block according to the request (a sketch of the resulting address split follows below).
- The total size of the tag field is also reduced per word, because each tag is shared by 4 words (only 25% of the tag overhead compared to the case where the block size is 1 word).
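A sketch of the address split, using the field widths of the 16K-word cache shown on the following slides (2-bit byte offset, 2-bit block offset, 12-bit index, 16-bit tag); the example address is arbitrary:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr = 0x0001234C;                      /* arbitrary example       */
    uint32_t byte_offset  = addr        & 0x3;       /* bits 1-0                */
    uint32_t block_offset = (addr >> 2) & 0x3;       /* bits 3-2: word in block */
    uint32_t index        = (addr >> 4) & 0xFFF;     /* bits 15-4: 4096 blocks  */
    uint32_t tag          = addr >> 16;              /* bits 31-16: 16-bit tag  */
    printf("tag=%u index=%u word=%u byte=%u\n",
           (unsigned)tag, (unsigned)index,
           (unsigned)block_offset, (unsigned)byte_offset);
    return 0;
}
```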
28
Multiword (4-word) Block Direct Mapped Cache
Cache size: 4K blocks x 4 words = 16K words = 16K x 4 bytes = 64 KB cache
29
Formulas During Lecture Period
30
Multiword (4-word) Block Direct Mapped Cache

[Figure: the 32-bit address (bits 31 30 ... 17 16 | 15 ... 4 | 3 2 | 1 0) is divided into a 16-bit tag, a 12-bit index, a 2-bit block offset, and a 2-bit byte offset. The index selects one of 4096 cache entries (0, 1, 2, ..., 4093, 4094, 4095), each holding a Valid bit, a 16-bit Tag, and four 32-bit data words; the block offset drives a multiplexor that selects one of the four words. Outputs: Hit and the 32-bit Data word.]

Four words/block, cache size = 16K words.
What kind of locality are we taking advantage of?
32
Taking Advantage of Spatial Locality
Let the cache block hold more than one word: a 2-block cache with 2-word blocks, and the same reference string 0 1 2 3 4 3 4 15.
Start with an empty cache - all blocks initially marked as not valid.

Word addr  Index  Tag  Hit/miss  Cache contents after the access
    0       0     00   miss      Mem(1) Mem(0)
    1       0     00   hit       Mem(1) Mem(0)
    2       1     00   miss      Mem(1) Mem(0)  | Mem(3)  Mem(2)
    3       1     00   hit       Mem(1) Mem(0)  | Mem(3)  Mem(2)
    4       0     01   miss      Mem(5) Mem(4)  | Mem(3)  Mem(2)
    3       1     00   hit       Mem(5) Mem(4)  | Mem(3)  Mem(2)
    4       0     01   hit       Mem(5) Mem(4)  | Mem(3)  Mem(2)
   15       1     11   miss      Mem(5) Mem(4)  | Mem(15) Mem(14)

8 requests, 4 misses
33
Cache Hits and Misses
Read misses are processed the same as with a 1-word-block cache: read the entire block from memory into the cache.
Write hits and misses must be handled differently with a multiword block cache. Consider:
- Assume memory addresses X and Y both map to cache block C
- C is a 4-word block containing Y
- What would happen if we did a write to address X by simply overwriting the data and tag in cache block C?
Ans: Cache block C would contain the tag for X, 1 word of X, and 3 words of Y.
34
Cache Hits and Misses
Solution 1 (when a write miss occurs): ignore the cache.
- Do not change the tag for X and do not update the data in the cache; just write the word to memory.
- But then where is the point of using a cache? If the data resides only in memory, we cannot take advantage of locality.
Another solution?
35
Cache Hits and Misses
Solution 2 (first handle it like a read miss, then write): perform a tag comparison while writing the data.
- If equal, we have a write hit: no problem.
- If unequal, we have a write miss: fetch the block from memory, then rewrite the word that caused the miss (write-through).
So, with a multi-word cache, a write miss causes a read from memory. A sketch follows below.
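A sketch of Solution 2 under toy assumptions (direct mapped, 4-word blocks, write-through, word-addressed memory):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define WORDS_PER_BLOCK 4
#define NUM_LINES 8

static uint32_t memory[1024];                 /* toy word-addressed memory */

struct line {
    bool     valid;
    uint32_t tag;
    uint32_t data[WORDS_PER_BLOCK];
};
static struct line cache[NUM_LINES];

/* Store one word: write-through with fetch-on-write-miss (Solution 2). */
void store_word(uint32_t addr, uint32_t word) {
    uint32_t block = addr / WORDS_PER_BLOCK;
    uint32_t index = block % NUM_LINES;
    uint32_t tag   = block / NUM_LINES;
    uint32_t off   = addr % WORDS_PER_BLOCK;
    struct line *ln = &cache[index];

    if (!ln->valid || ln->tag != tag) {
        /* write miss: read the whole block first, so the line never
         * holds one word of X mixed with three words of Y */
        for (int i = 0; i < WORDS_PER_BLOCK; i++)
            ln->data[i] = memory[block * WORDS_PER_BLOCK + i];
        ln->tag   = tag;
        ln->valid = true;
    }
    ln->data[off] = word;   /* write the word into the cache ...      */
    memory[addr]  = word;   /* ... and through to main memory as well */
}

int main(void) {
    store_word(3, 42);
    printf("mem[3]=%u cached=%u\n", (unsigned)memory[3],
           (unsigned)cache[0].data[3]);       /* both print 42 */
    return 0;
}
```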
36
Miss Penalty and Miss Rate vs Block Size
In general, larger block sizes take advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty: it takes longer to fill up the block.
- If the block size is too big relative to the cache size, the miss rate will go up (too few cache blocks).
In practice, 16-64 byte blocks work fine.
In general, Average Access Time = Hit Time + Miss Penalty x Miss Rate. We need to find a middle ground (good design needs compromise).
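A worked instance of the formula; the 5% miss rate is an assumed figure, and the 65-cycle penalty anticipates the example on page 42:

```c
#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;   /* cycles                                   */
    double miss_rate    = 0.05;  /* 5% of accesses miss (assumed)            */
    double miss_penalty = 65.0;  /* cycles, the 4-word/1-word-wide-DRAM case */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.2f cycles\n", amat);    /* 1 + 0.05 * 65 = 4.25 */
    return 0;
}
```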
37
Miss Rate vs Block Size vs Cache Size

[Plot: miss rate (%), 0 to 10, vs. block size (8, 16, 32, 64, 128, 256 bytes), one curve per cache size: 8 KB, 16 KB, 64 KB, 256 KB.]

Miss rate goes up if the block size is too large relative to the cache size, because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses).
38
Spatial Locality
Does increasing the block size help the miss rate? Yes, until the number of blocks in the cache becomes small. Then a cache block may be swapped out before many of its words are accessed, losing any spatial locality benefits.

Program                Instruction miss rate  Data miss rate  Effective combined miss rate
gcc (1-word blocks)    6.1%                   2.1%            5.4%
gcc (4-word blocks)    2.0%                   1.7%            1.9%
spice (1-word blocks)  1.2%                   1.3%            1.2%
spice (4-word blocks)  0.3%                   0.6%            0.4%
39
Miss Penalty
Increasing the block size also increases the miss penalty, since we must read more words from memory for each miss, and reading more words takes more time.
Miss Penalty = latency to the first word + transfer time for the rest.
One way around this problem is to design the memory system to transfer blocks of memory efficiently:
- Increase the width of the memory and the bus (transfer 2 or 4 words at a time, paying 1 latency period).
- Interleaving: use banks of memory. The requested address is sent to each bank in parallel, so the memory latency is incurred once; then the banks take turns sending the requested words to the cache.
40
Cache Summary
The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time.
- Temporal locality: locality in time
- Spatial locality: locality in space
Three major categories of cache misses:
- Compulsory misses: sad facts of life (e.g., cold start misses)
- Conflict misses: increase cache size and/or associativity (nightmare scenario: the ping pong effect!)
- Capacity misses: increase cache size
Cache design space: total size, block size, associativity (replacement policy), write policy (write-through, write-back)
41
Main Memory Organizations

[Figure: three organizations. (a) One-word-wide memory organization: CPU, cache, and memory connected by a one-word-wide bus. (b) Wide memory organization: a wide memory and bus, with a multiplexor between the cache and the CPU. (c) Interleaved memory organization: four memory banks (bank 0, bank 1, bank 2, bank 3) sharing a one-word-wide bus.]

DRAM access time >> bus transfer time.

Processing the cache miss:
- Latency to fetch the first word from memory (finding the address for word 0)
- Block transfer time (bringing all the words of the block from memory)
It is difficult to reduce the latency to fetch the first word from memory. However, we can reduce the miss penalty if we increase the bandwidth from memory to cache.
42
Memory Access Time Example
Assume that it takes 1 cycle to send the address, 15 cycles for each DRAM access, and 1 cycle to send a word of data.
- With a cache block of 4 words and one-word-wide DRAM (page 41, fig. a): miss penalty = 1 + 4 x 15 + 4 x 1 = 65 cycles.
- With main memory and bus width of 2 words (page 41, fig. b): miss penalty = 1 + 2 x 15 + 2 x 1 = 33 cycles. For a 4-word-wide memory, the miss penalty is 17 cycles. This is expensive due to the wide bus and control circuits.
- With interleaved memory of 4 memory banks and the same one-word bus (page 41, fig. c): miss penalty = 1 + 1 x 15 + 4 x 1 = 20 cycles. The memory controller must supply consecutive addresses to different memory banks. Interleaving is universally adopted in high-performance computers.
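The same arithmetic as a small sketch, so the four cases can be recomputed for other latency and width parameters:

```c
#include <stdio.h>

int main(void) {
    int addr_cycles = 1, dram_cycles = 15, bus_cycles = 1, block_words = 4;

    /* (a) one-word-wide memory: every word pays DRAM latency + transfer */
    int narrow = addr_cycles + block_words * dram_cycles + block_words * bus_cycles;

    /* (b) two-word-wide memory and bus: half as many accesses/transfers */
    int wide2 = addr_cycles + (block_words / 2) * (dram_cycles + bus_cycles);

    /* (b') four-word-wide memory and bus: one access, one transfer */
    int wide4 = addr_cycles + dram_cycles + bus_cycles;

    /* (c) 4-way interleaved banks: one latency, then one word per cycle */
    int interleaved = addr_cycles + dram_cycles + block_words * bus_cycles;

    printf("%d %d %d %d\n", narrow, wide2, wide4, interleaved);  /* 65 33 17 20 */
    return 0;
}
```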
43
Cache Performance During Lecture Period