1
CSCI 2510 Computer Organization
Memory System II
Cache In Action
2
Cache-Main Memory Mapping
A way to record which part of the Main Memory is now in cache
Design concerns:
- Be Efficient: fast determination of cache hits/misses
- Be Effective: make full use of the cache; increase probability of cache hits
In the following discussion, we assume:
- 16-bit address, i.e. 2^16 = 64K bytes in the Main Memory
- One Word = One Byte
- Synonym: Cache line == Cache block
3
Imagine: Trivial Conceptual Case
Cache size == Main Memory size == 64KB
Trivial one-to-one mapping
Do we need Main Memory any more?
[Figure: CPU (FASTEST) — Cache, 64KB (FAST) — Main Memory, 64KB (SLOW)]
4
Reality: Cache Block/ Cache Line
Cache size is much smaller than the Main Memory size
A block in the Main Memory maps to a block in the Cache
Many-to-One Mapping
[Figure: Main Memory blocks 0, 1, …, 127, 128, 129, …, 255, 256, 257, …, 4095 map many-to-one onto Cache blocks 0-127; each cache block carries a tag]
5
Direct Mapping
Direct mapped cache
Block j of Main Memory maps to block (j mod 128) of Cache [same colour in figure]
Cache hit occurs if tag matches desired address
- there are 2^4 = 16 words (bytes) in a block
- 2^7 = 128 Cache blocks
- 2^(7+5) = 2^7 x 2^5 = 4096 Main Memory blocks
[Figure: Main Memory blocks 0-4095 map many-to-one onto tagged Cache blocks 0-127]
16-bit Main Memory address: Cache tag (5 bits) | Cache Block No. (7 bits) | Word/Byte address within block (4 bits)
The upper 5 + 7 = 12 bits form the 12-bit Main Memory block number/address.
6
Direct Mapping
Memory address divided into 3 fields:
- Main Memory block number determines the position of the block in cache
- Tag is used to keep track of which block is in cache (as many MM blocks can map to the same position in cache)
- The last 4 bits in the address select the target word in the block
Given an address t,b,w (16-bit):
- See if it is already in cache by comparing t with the tag in block b
- If not, cache miss! Replace the current block at b with a new one from memory block t,b (12-bit)
E.g. CPU is looking for [A7B4]:
MAR = 1010011110110100
Go to cache block 1111011
See if the tag is 10100
If YES, cache hit! Get the word/byte 0100 in cache block 1111011
7
Associative Mapping
In direct mapping, a Main Memory block is restricted to reside in a given position in the Cache (determined by mod)
Associative mapping allows a MM block to reside in an arbitrary Cache block location
In this example, all 128 tag entries must be compared with the address Tag in parallel (by hardware)
[Figure: any Main Memory block (0, 1, …, i, …, 4095) may reside in any Cache block (0-127); each cache block carries a 12-bit tag]
16-bit Main Memory address: Tag (12 bits) | Word/Byte (4 bits)
E.g. CPU is looking for [A7B4]:
MAR = 1010011110110100
See if the tag 101001111011 matches one of the 128 cache tags
If YES, cache hit! Get the word/byte 0100 in the BINGO (matching) cache block
Tag width is 12 bits
8
Set Associative Mapping
Combination of direct and associative
Same colour Main Memory blocks 0,64,128,…,4032 map to cache set 0 and can occupy either of 2 positions within that set
A cache with k blocks per set is called a k-way set associative cache.
(j mod 64) derives the Set Number.
Within the target Set, compare the 2 tag entries with the address Tag in parallel (by hardware).
[Figure: Main Memory blocks 0, 1, …, 63, 64, 65, …, 127, 128, 129, …, 4095 map onto a cache organized as 64 sets of 2 blocks (Set 0: blocks 0-1, Set 1: blocks 2-3, …, Set 63: blocks 126-127), each block with its own tag]
16-bit Main Memory address: Tag (6 bits) | Set Number (6 bits) | Word/Byte (4 bits)
Tag width is 6 bits
9
Set Associative Mapping
E.g. 2-Way Set Associative
CPU is looking for [A7B4]
MAR = 1010011110110100
Go to cache Set 111011 (59 decimal), i.e. Block 1110110 (118) or Block 1110111 (119)
See if ONE of the TWO tags in Set 111011 is 101001
If YES, cache hit! Get the word/byte 0100 in the BINGO (matching) cache block
10
Replacement Algorithms
Direct mapped cache: position of each block is fixed. Whenever replacement is needed (i.e. cache miss, new block to load), the choice is obvious and thus no "replacement algorithm" is needed.
Associative and Set Associative: need to decide which block to replace (thus keep/retain ones likely to be used again in the near future).
One strategy is least recently used (LRU), e.g. for a 4-block/set cache, use a log2(4) = 2-bit counter for each block. Reset the counter to 0 whenever the block is accessed; counters of other blocks in the same set are incremented. On a cache miss, replace/uncache the block whose counter has reached 3.
Another is random replacement (choose a random block). Its advantage is that it is easier to implement at high speed.
11
Cache Example
Assume separate instruction and data caches; we consider only the data cache.
Cache has space for 8 blocks.
A block contains one 16-bit word (i.e. 2 bytes).
A[10][4] is an array of words located at 7A00-7A27 in row-major order.
The program is to normalize the first column of the array (a vertical vector) by its average.
short A[10][4];
int sum = 0;
int j, i;
double mean;

// forward loop
for (j = 0; j <= 9; j++)
    sum += A[j][0];
mean = sum / 10.0;

// backward loop
for (i = 9; i >= 0; i--)
    A[i][0] = A[i][0] / mean;
12
Cache Example
[Table: the 40 array elements A[0][0], A[0][1], …, A[9][3] occupy word addresses 7A00-7A27, e.g. 7A00 = 0111101000000000 in binary. Each 16-bit word address splits differently per scheme: for Direct Mapping the high 13 bits are the tag and the low 3 bits the cache block number; for Set-Associative the high 15 bits are the tag and the low bit the set number; for Associative all 16 bits form the tag.]
To simplify the discussion in this example:
- 16-bit word address; byte addresses are not shown
- One item == one block == one word == two bytes
- 8 blocks in cache; 3 bits encode the cache block number
- 4 blocks/set, 2 cache sets; 1 bit encodes the cache set number
13
Direct Mapping
- Least significant 3 bits of the address determine the location in cache
- No replacement algorithm is needed in Direct Mapping
- When i = 9 and i = 8, we get a cache hit (2 hits in total)
- Only 2 out of the 8 cache positions are used: very inefficient cache utilization
Content of data cache after each loop pass (time line j = 0 … 9, then i = 9 … 0):
- Cache block 0 holds, in turn: A[0][0] (j = 0-1), A[2][0] (j = 2-3), A[4][0] (j = 4-5), A[6][0] (j = 6-7), A[8][0] (j = 8-9 and i = 9-7), then A[6][0] (i = 6-5), A[4][0] (i = 4-3), A[2][0] (i = 2-1), A[0][0] (i = 0)
- Cache block 4 holds, in turn: A[1][0] (j = 1-2), A[3][0] (j = 3-4), A[5][0] (j = 5-6), A[7][0] (j = 7-8), A[9][0] (j = 9 and i = 9-8), then A[7][0] (i = 7-6), A[5][0] (i = 5-4), A[3][0] (i = 3-2), A[1][0] (i = 1-0)
- Cache blocks 1, 2, 3, 5, 6, 7 stay empty
Tags not shown but are needed
14
Associative Mapping
LRU replacement policy: we get cache hits for i = 9, 8, …, 2. If the i loop were a forward one, we would get no hits!
Content of data cache after each loop pass (time line):
- Forward loop: A[0][0] through A[7][0] fill cache blocks 0-7 as j runs from 0 to 7. At j = 8 the LRU block A[0][0] is evicted for A[8][0] (block 0); at j = 9, A[1][0] is evicted for A[9][0] (block 1)
- Backward loop: i = 9 down to i = 2 all hit. At i = 1, A[9][0] (by then the LRU block) is evicted for A[1][0]; at i = 0, A[8][0] is evicted for A[0][0]
Tags not shown but are needed
LRU Counters not shown but are needed
15
Set Associative Mapping
Since all accessed blocks have even addresses (7A00, 7A04, 7A08, …), they all map to set 0, i.e. only half of the cache is used.
LRU replacement policy: we get hits for i = 9, 8, 7 and 6. Random replacement would have better average performance here. If the i loop were a forward one, we would get no hits!
Content of data cache after each loop pass (time line):
- Set 0 (cache blocks 0-3): the forward loop fills A[0][0]-A[3][0] (j = 0-3), then evicts them LRU-first for A[4][0]-A[7][0] (j = 4-7) and A[8][0], A[9][0] (j = 8-9)
- Backward loop: i = 9, 8, 7, 6 hit; from i = 5 down, every access misses and evicts the LRU block, ending with A[0][0]-A[3][0] back in the set
- Set 1 (cache blocks 4-7) is never used
Tags not shown but are needed
LRU Counters not shown but are needed
16
Comments on the Example
In this example, Associative performs best, then Set-Associative, and lastly Direct Mapping.
What are the advantages and disadvantages of each scheme?
In practice, hit rates as low as in the example are very rare; Set-Associative with an LRU replacement scheme is the usual choice.
Larger blocks and more blocks (i.e. more cache memory) greatly improve the cache hit rate.
17
Real-life Example: Intel Core 2 Duo
[Figure: two Core processing units, each with its own 32 kB L1 instruction cache and 32 kB L1 data cache, share a 4 MB L2 cache, which connects via the system bus to main memory and input/output]
18
Real-life Example: Intel Core 2 Duo
Number of processors : 1
Number of cores : 2 per processor
Number of threads : 2 per processor
Name : Intel Core 2 Duo E6600
Code Name : Conroe
Specification : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Technology : 65 nm
Core Speed : 2400 MHz
Multiplier x Bus speed : 9.0 x 266.0 MHz = 2400 MHz
Front-Side-Bus speed : 4 x 266.0MHz = 1066 MHz
Instruction sets : MMX, SSE, SSE2, SSE3, SSSE3, EM64T
L1 Data cache : 2 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache : 2 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache : 4096 KBytes, 16-way set associative, 64-byte line size
19
Memory Module Interleaving
Processor and cache are fast; main memory is slow.
Try to hide access latency by interleaving memory accesses across several memory modules.
Each memory module has its own Address Buffer Register (ABR) and Data Buffer Register (DBR).
Which scheme below can be better interleaved?
[Figure: (a) Consecutive words in a module — the MM address is split as module number (high k bits, modules 0 … n-1) followed by address in module (m bits). (b) Consecutive words in consecutive modules — the MM address is split as address in module (m bits) followed by module number (low k bits, modules 0 … 2^k - 1). Each module has its own ABR and DBR.]
20
Memory Module Interleaving
Memory interleaving can be realized in technology such as "Dual Channel Memory Architecture".
Two or more compatible (ideally identical) memory modules are used in matching banks.
Within a memory module, in fact, 8 chips are used in "parallel" to achieve an 8 x 8 = 64-bit memory bus. This is also a kind of memory interleaving.
21
Memory Interleaving Example
Suppose we have a cache read miss and need to load from main memory.
Assume a cache with an 8-word block (cache line size = 8 bytes).
Assume it takes one clock to send an address to DRAM memory and one clock to send data back.
In addition, DRAM has a 6-cycle latency for the first word; each subsequent word in the same row takes only 4 cycles.
Read 8 bytes without interleaving: 1 + (1x6) + (7x4) + 1 = 36 cycles.
For a 4-module interleaved scheme: 1 + (1x6) + (1x8) = 15 cycles. Transfer of the first 4 words is overlapped with access of the second 4 words.
22
Memory Without Interleaving
[Timing diagram: a single memory read takes 1 (address) + 6 (DRAM access) + 1 (data) = 8 cycles. Reading 8 bytes without interleaving repeats the access serially: 1 + 1x6 + 7x4 + 1 = 36 cycles.]
23
Four Memory Modules Interleaved
[Timing diagram: the four modules start their 6-cycle accesses one cycle apart, so the accesses overlap; reading 8 bytes takes 1 + 6 + 1x8 = 15 cycles.]
24
Cache Hit Rate and Miss Penalty
Goal is to have a big memory system with speed of cache
High cache hit rates (> 90%) are essential. The miss penalty must also be reduced.
Example: let h be the hit rate, M the miss penalty, and C the cache access time.
Average memory access time = h x C + (1 - h) x M
25
Cache Hit Rate and Miss Penalty
Optimistically assume 8 cycles to read a single memory item and 15 cycles to load an 8-byte block from main memory (previous example); cache access time = 1 cycle.
For every 100 instructions, statistically 30 instructions are data reads/writes.
Instruction fetch: 100 memory accesses; assume hit rate = 0.95.
Data read/write: 30 memory accesses; assume hit rate = 0.90.
Execution cycles without cache = (100 + 30) x 8
Execution cycles with cache = 100(0.95x1 + 0.05x15) + 30(0.9x1 + 0.1x15)
Ratio = 4.30, i.e. cache speeds up execution more than 4 times!
26
Summary
Cache Organizations: Direct, Associative, Set-Associative
Cache Replacement Algorithms: Random, Least Recently Used
Memory Interleaving
Cache Hit and Miss Penalty