Computer Structure: Memory

Memory

Computer memory is divided into levels, based on speed and cost.

The fastest memory is used at the highest level, closest to the processor. The slowest memory sits at the lowest level; an access reaches it only after it has failed at every intermediate level. The fastest memory is the most expensive and the smallest. The slowest memory is the cheapest and the largest.

Memory technology    Access time                    $ per MByte (in 1998)
SRAM                 5 - 25 ns                      100 - 250
DRAM                 60 - 120 ns                    5 - 10
Magnetic disk        10,000,000 - 20,000,000 ns     0.1 - 0.2

SRAM - Static random-access memory. Volatile - data is eventually lost when the memory is not powered.

DRAM - Dynamic random-access memory. The main memory (the "RAM") in personal computers. Also volatile. Loses its contents quickly when power is removed (unlike flash memory).

Magnetic Disk - Storage of data on a magnetic medium, such as a hard disk. Non-volatile memory.


The Speed/Size Hierarchy

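(Figure: the CPU sits above a stack of memory levels.)

Level nearest the CPU: fastest speed, smallest size, highest cost.
Intermediate levels: in between.
Level farthest from the CPU: slowest speed, largest size, lowest cost.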

Locality

The principle of locality makes it possible to have a memory as large as the slowest level which seems to run as fast as the highest level (even though that level is the smallest).

Two types of locality:

Temporal locality: If a value is referenced (needed from memory), it has a high probability of being referenced again soon (time-wise).

Spatial locality: If a value is referenced in memory at location A, then there is a high probability that values will be referenced soon from A+i (space-wise), where i is small. (If i is zero, this is the same as temporal locality.)

Interface Between Levels

Memory at level i is never accessed unless the access has failed at level i+1. Note: level i is slower and larger than level i+1.

Therefore, the same access protocol must hold, in principle, between every pair of levels.

We consider only a 2-level memory system, and the interface between those two levels. Interfaces between all other pairs of successive levels will use the same principles.

Virtual memory is an example of an interface between two memory levels: the higher level is usually DRAM; the lower level is usually a disk.

We consider a level which is one level higher than main memory in the hierarchy and which is called the cache.

Graphic Illustration of a 2-level Memory Hierarchy

(Figure: the processor reads and writes data in the cache, level i+1. Data is transferred between the cache and main memory, level i. Block = unit of transfer.)

Mapping to the Cache

3 questions to answer:

Where is data from memory placed in the cache?

How do we find it there when we need it?

What happens when we look for data in the cache but it is not there?

Simplest cache organization: direct mapping. For every memory location, there is exactly one cache location where it can be stored.


How should we map?

When using direct mapping to transfer information between the RAM and the cache, which way is better?

Mapping according to the least significant byte of the address (LSB).

Mapping according to the most significant byte of the address (MSB).

Example: 0xff02d12a

Remember: temporal locality and spatial locality.


How should we map?

(Figure: memory addresses 00..00 through 11..11 mapped onto the four cache locations 00, 01, 10 and 11, once by LSB and once by MSB.)

Which one is better?

Illustration of Direct Mapping

(Figure: a cache with 8 locations, 000 through 111, and the memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001 and 11101.)

The last 3 bits of the memory address determine the cache location (the cache block address, or index).

• Block = unit of transfer between memory technologies.
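The choice of index bits can be tried out with a short sketch. The Python below is only an illustration under stated assumptions: 5-bit addresses and an 8-location cache as in the figure, and the function names `index_by_lsb` and `index_by_msb` are hypothetical.

```python
ADDRESS_BITS = 5   # assumption: 5-bit addresses, as in the illustration
INDEX_BITS = 3     # 8 cache locations

def index_by_lsb(address):
    """Cache location taken from the low-order bits of the address."""
    return address & ((1 << INDEX_BITS) - 1)

def index_by_msb(address):
    """Cache location taken from the high-order bits of the address."""
    return address >> (ADDRESS_BITS - INDEX_BITS)

# Four consecutive addresses, the kind of run spatial locality produces.
run = [0b01000, 0b01001, 0b01010, 0b01011]
lsb_slots = [index_by_lsb(a) for a in run]
msb_slots = [index_by_msb(a) for a in run]
print(lsb_slots)
print(msb_slots)
```

With low-order indexing the four neighbouring addresses land in four different cache locations; with high-order indexing they all compete for the same one.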

Finding Data in the Cache (Read)

Each cache location has a tag.

Tag = the address bits which are higher order than the cache block address.

The tag uniquely distinguishes every memory location which can map to a specific cache block.

The processor read address is divided into 2 parts:

The lower order part (cache block address) is used to address the correct cache location.

The higher order part is compared with the tag at that location.

If the tags match, we have a cache hit.

Assume the cache size is 2^10 or 1024 words and each block is 1 word. The cache index is then represented by 10 bits, so 32 - 10 - 2 = 20 bits are reserved for the tag. If the tag is equal and the valid bit is on, we have a cache hit. The two lowest-order bits tell us which byte in the word holds the relevant data.


(Figure: a 32-bit address, bits 31 to 0, split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset (bits 1-0). The index selects one of 1024 cache entries, numbered 0 to 1023, each holding a valid bit, a 20-bit tag and a 32-bit data word. A comparator (=) on the tags produces the Hit signal.)

Address (showing bit positions). For this cache, the lower portion of the address is used to select a cache entry consisting of a data word and a tag. The tag from the cache is compared against the upper portion of the address to determine whether the entry in the cache corresponds to the requested address. Because the cache has 2^10 or 1024 words, and a block size of 1 word, 10 bits are used to index the cache, leaving 32 - 10 - 2 = 20 bits to be compared against the tag. If the tag and upper 20 bits of the address are equal and the valid bit is on, then the request hits in the cache and the word is supplied to the processor. Otherwise, a miss occurs.
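The 20/10/2 split can be checked on a concrete address. This is a minimal sketch in Python, assuming the 1024-entry, one-word-block cache described here; the function name `split_address` is hypothetical and the address is the example value 0xff02d12a used earlier.

```python
INDEX_BITS = 10        # 1024 one-word blocks
BYTE_OFFSET_BITS = 2   # 4 bytes per 32-bit word

def split_address(address):
    """Return (tag, index, byte_offset) for a 32-bit byte address."""
    byte_offset = address & ((1 << BYTE_OFFSET_BITS) - 1)
    index = (address >> BYTE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (BYTE_OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_offset

tag, index, byte_offset = split_address(0xFF02D12A)
print(hex(tag), index, byte_offset)
```

The three fields concatenate back to the original address, which is why the tag alone is enough to identify which memory word a cache entry holds.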

Cache Miss

If data is not found in the cache (tag does not match requested address), we have a cache miss.

Requested data must be read from memory.

We assume temporal locality -- the data will be used again soon -- so we write it into cache.

The datapath must wait for the new data to arrive, so control must be stalled (wait states are inserted).

Example

(Figure: contents of an 8-entry direct-mapped cache after each miss. V = valid bit.)

a. The initial state of the cache after power-on.

Index  V  Tag    Data
000    N
001    N
010    N
011    N
100    N
101    N
110    N
111    N

b. After handling a miss of address (10110two).

Index  V  Tag    Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10two  Memory (10110two)
111    N

c. After handling a miss of address (11010two).

Index  V  Tag    Data
000    N
001    N
010    Y  11two  Memory (11010two)
011    N
100    N
101    N
110    Y  10two  Memory (10110two)
111    N

d. After handling a miss of address (10000two).

Index  V  Tag    Data
000    Y  10two  Memory (10000two)
001    N
010    Y  11two  Memory (11010two)
011    N
100    N
101    N
110    Y  10two  Memory (10110two)
111    N

e. After handling a miss of address (00100two).

Index  V  Tag    Data
000    Y  10two  Memory (10000two)
001    N
010    Y  11two  Memory (11010two)
011    N
100    Y  00two  Memory (00100two)
101    N
110    Y  10two  Memory (10110two)
111    N

f. After handling a miss of address (10010two).

Index  V  Tag    Data
000    Y  10two  Memory (10000two)
001    N
010    Y  10two  Memory (10010two)
011    N
100    Y  00two  Memory (00100two)
101    N
110    Y  10two  Memory (10110two)
111    N

The cache contents are shown after each reference request that misses, with the index and tag fields shown in binary. The cache is initially empty, with all valid bits (V entry in cache) turned off (N). The processor requests the following addresses: 10110two (miss), 11010two (miss), 10110two (hit), 11010two (hit), 10000two (miss), 00100two (miss), 10000two (hit) and 10010two (miss). The figures show the cache contents after each miss in the sequence has been handled. When address 10010two (18) is referenced, the entry for address 11010two (26) must be replaced, and a reference to 11010two will cause a subsequent miss. Remember that the tag field will contain only the upper portion of the address. The full address of a word contained in cache block i with tag field j for this cache is 8 × j + i, or equivalently the concatenation of the tag field j and the index i. You can see this by looking at the block address in the Data field of any cache entry and the corresponding index and tag. For example, in cache f above, index 010 has tag 10 and corresponds to address 10010.
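The request sequence of this example can be replayed with a few lines of Python. This is a sketch only, assuming an 8-entry cache with one word per block; the name `run_direct_mapped` is hypothetical.

```python
NUM_BLOCKS = 8  # 8-entry cache, one word per block

def run_direct_mapped(addresses):
    """Return 'hit' or 'miss' for each word address, in order."""
    tags = [None] * NUM_BLOCKS          # None plays the role of valid bit = N
    outcomes = []
    for address in addresses:
        index = address % NUM_BLOCKS    # low 3 bits of the address
        tag = address // NUM_BLOCKS     # remaining high bits
        if tags[index] == tag:
            outcomes.append("hit")
        else:
            tags[index] = tag           # handle the miss: load the block
            outcomes.append("miss")
    return outcomes

requests = [0b10110, 0b11010, 0b10110, 0b11010,
            0b10000, 0b00100, 0b10000, 0b10010]
outcomes = run_direct_mapped(requests)
print(outcomes)
```

The last request, 10010two, shares index 010 with 11010two but carries a different tag, so it misses and overwrites that entry.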

Spatial locality - larger block size

A cache which brings in a single word on each read of memory does not exploit spatial locality.

If a word at location i is required, there is a high probability that words near i will be required soon.

It makes sense to bring in more than a single word.

The number of words brought into the cache on a single read from memory is called the block size.

(Figure: a 32-bit address split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2) and a 2-bit byte offset (bits 1-0). Each cache entry holds a valid bit, a 16-bit tag and 128 bits of data. A 4-to-1 multiplexor selects one 32-bit word, and a comparator (=) on the tags produces the Hit signal.)

Address (showing bit positions). A 64 KB cache using four-word (16 byte) blocks. The tag field is 16 bits wide and the index field is 12 bits wide, while a 2-bit field (bits 3-2) is used to index the block and select the word from the block using a 4-to-1 multiplexor.

Mapping address from memory to cache

Address in main memory = (No. of blocks in cache × Block size × Tag) + (Block size × Index) + Byte offset

Example: What cache block will be used for byte address 1240?

– block size = 16 bytes (4 words)
– cache has 64 blocks

Answer: 1240 / 16 = 77 whole blocks with 8 bytes left over, so the address lies in block number 77 (the 78th block, since blocks start at index 0), at byte offset 8.

There are 64 blocks in the cache, so block number 77 maps to cache block 77 mod 64 = 13.
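The same arithmetic, written out as a small Python sketch. The function name `cache_block_for` is hypothetical; the numbers are those of the example (16-byte blocks, 64 cache blocks, byte address 1240).

```python
def cache_block_for(byte_address, block_size_bytes, num_cache_blocks):
    """Return (memory block number, byte offset in block, cache block index)."""
    block_number = byte_address // block_size_bytes
    byte_offset = byte_address % block_size_bytes
    cache_block = block_number % num_cache_blocks
    return block_number, byte_offset, cache_block

result = cache_block_for(1240, 16, 64)
print(result)
```

Block numbers are counted from 0 here, so memory block 77 is the 78th block.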

Effect of Blocksize on Cache Performance

(Figure: miss rate, from 0% to 40%, plotted against block size (4, 16, 64 and 256 bytes), with one curve for each cache size: 1 KB, 8 KB, 16 KB, 64 KB and 256 KB.)

Effect of Blocksize on Cache Performance (2)

A larger blocksize exploits spatial locality; the miss rate goes down and fewer accesses to memory are required.

Beyond a certain size, the blocksize does not contribute to more spatial locality.

If the blocksize is too large, there will not be room in the cache for enough different blocks; blocks will compete for the few spaces and will constantly push each other out, so the miss rate goes up.

Miss rate is not the only factor. A large blocksize means the miss penalty is higher, since we have to read more from memory. High performance systems use special memory designs to allow a large block to be read from memory quickly.

Calculate Cache Size

Cache specifications:
2^n blocks
word size = address size = w bits
blocksize = b words

Total cache bits = 2^n × [b×w + (w - [n + log2(b×w/8)]) + 1]

where:
b×w = bits for the data words
w - [n + log2(b×w/8)] = bits for the tag field
n = index into the cache block
log2(b×w/8) = byte offset plus the select lines of the mux (b×w/8 is the block size in bytes)
1 = valid bit

Example: w = 32 bits; blocksize = 1 word; cache = 64KB

64KB = 16K words = 16 × 1024 = 16384 = 2^14 words

2^14 × [32 + (32 - [14 + 2]) + 1] = 2^14 × 49 = 784 × 2^10 = 784 Kbits

= 784/8 = 98 Kbytes, i.e., more than 50% overhead for managing the cache, 34 extra KB.

Yet another example

Example: w = 32 bits; blocksize = 16 words; cache = 64KB

64KB = 16K words = 16K/16 = 1K blocks = 2^10 blocks

2^10 × [32 × 16 + (32 - [10 + 6]) + 1] = 2^10 × 529 = 529 Kbits

= 66.125 Kbytes, i.e., much less overhead for managing the cache.
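Both examples can be recomputed with a short Python sketch of the total-bits formula. The function name `total_cache_bits` is hypothetical, and the sketch assumes the block size in bytes is a power of two.

```python
def total_cache_bits(n, w, b):
    """Storage bits of a direct-mapped cache with 2**n blocks.

    n: index bits, w: word size = address size in bits, b: words per block.
    """
    block_bytes = b * w // 8
    offset_bits = block_bytes.bit_length() - 1   # log2 of a power of two
    tag_bits = w - (n + offset_bits)
    return 2**n * (b * w + tag_bits + 1)         # data + tag + valid bit

one_word = total_cache_bits(n=14, w=32, b=1)
sixteen_words = total_cache_bits(n=10, w=32, b=16)
print(one_word / 1024, sixteen_words / 1024)     # sizes in Kbits
```

Larger blocks amortize one tag and one valid bit over more data bits, which is where the reduction in overhead comes from.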

Full Associative Mapping to the Cache

A block from memory may be mapped into any cache block. Usually it is mapped to the first block found to be free.

If there is no free block in the cache (all blocks are occupied), a search method is activated to find an appropriate block. When found, that block is taken out of the cache and the new memory block is written over it.

The most common searching method is LRU - Least Recently Used, which means that the block that was least recently used (read from or written to) may be taken out. This heuristic is based on the temporal locality principle.

Advantages: better usage of the cache, by taking out blocks only when they are supposedly unneeded and only when no more room is available. This method is considered more sophisticated.

Disadvantages: LRU is much more time consuming than simple direct mapping. Moreover, since a block may sit anywhere in the cache, the tag must contain (almost) all the address bits, and the whole cache needs to be searched for a block.

Set Associative Mapping to the Cache

The cache is organized into sets of blocks, each of which is composed of more than one block.

Mapping is done in two steps:

– First, a block is directly mapped into a unique set, according to the index field.

– Then, the block may be placed into any one of the blocks in that set.

A set associative cache with sets containing n blocks each is called an n-way set associative cache.

This method is considered a combination (and maybe a compromise) between direct mapping and full associative mapping. In fact, a 1-way set associative cache is actually a direct mapped cache. Similarly, a set associative cache where the entire cache is only one set is in fact a full associative cache.

Increasing the set associativity usually decreases the miss rate.

Advantage: enjoys (hopefully) the best of both worlds.

Disadvantage: more complex. All the blocks in the set are searched for a match (but if the set is small, this can be done in parallel).

(Set containing a block = Block number modulo number of sets in the cache.)

Exercise 1 (1)

In effect, we only need to populate the cache according to the rules. Each entry gives the cache block used and M (miss) or H (hit); (x) -> (y) means that block (x) is replaced by block (y).

Requested  Direct, 16 blocks  Direct, 4 blocks                       Full associative,
word       of 1w each         of 4w each                             each block 1w
4          4 M                1 M (4,5,6,7)                          0 M
1          1 M                0 M (0,1,2,3)                          1 M
8          8 M                2 M (8,9,10,11)                        2 M
17         1 M                0 M (0,1,2,3) -> (16,17,18,19)         3 M
50         2 M                0 M (16,17,18,19) -> (48,49,50,51)     4 M
9          9 M                2 H (8,9,10,11)                        5 M
21         5 M                1 M (4,5,6,7) -> (20,21,22,23)         6 M
6          6 M                1 M (20,21,22,23) -> (4,5,6,7)         7 M
9          9 H                2 H (8,9,10,11)                        5 H
11         11 M               2 H (8,9,10,11)                        8 M

Exercise 1 (2)

Full associative, 4 blocks, each block 4w. Each entry gives the cache block used and M (miss) or H (hit); "(x) - LRU" marks the block evicted by LRU.

Requested word  Block  Result
4               0      M (4,5,6,7)
1               1      M (0,1,2,3)
8               2      M (8,9,10,11)
17              3      M (16,17,18,19)
50              0      M (4,5,6,7) - LRU, replaced by (48,49,50,51)
9               2      H (8,9,10,11)
21              1      M (0,1,2,3) - LRU, replaced by (20,21,22,23)
6               3      M (16,17,18,19) - LRU, replaced by (4,5,6,7)
9               2      H (8,9,10,11)
11              2      H (8,9,10,11)

Exercise 1 (3): an example of Set Associative

Set associative, 4 sets, each set 4 blocks, each block 4w. Each entry gives the set used, M (miss) or H (hit), and the blocks held in that set afterwards.

Requested word  Set  Result
4               1    M (4,5,6,7)
1               0    M (0,1,2,3)
8               2    M (8,9,10,11)
17              0    M (0,1,2,3)(16,17,18,19)
50              0    M (0,1,2,3)(16,17,18,19)(48,49,50,51)
9               2    H (8,9,10,11)
21              1    M (4,5,6,7)(20,21,22,23)
6               1    H (4,5,6,7)(20,21,22,23)
9               2    H (8,9,10,11)
11              2    H (8,9,10,11)
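The hit/miss columns of Exercise 1 can be reproduced with one generic simulator, since direct mapping is the 1-way case and full associative is the 1-set case. The Python below is a sketch under stated assumptions: LRU replacement inside each set, a hypothetical function name `simulate`, and 16 blocks for the full associative cache with 1-word blocks (the exercise does not give that size).

```python
from collections import OrderedDict

def simulate(words, num_sets, ways, block_words):
    """Return a string of 'M'/'H' outcomes for a sequence of word addresses."""
    # One OrderedDict per set; key order records recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    trace = []
    for word in words:
        block = word // block_words          # memory block number
        entries = sets[block % num_sets]     # the set this block maps to
        if block in entries:
            entries.move_to_end(block)       # mark as most recently used
            trace.append("H")
        else:
            if len(entries) == ways:
                entries.popitem(last=False)  # evict the least recently used
            entries[block] = True
            trace.append("M")
    return "".join(trace)

words = [4, 1, 8, 17, 50, 9, 21, 6, 9, 11]
results = {
    "direct 16 x 1w": simulate(words, num_sets=16, ways=1, block_words=1),
    "direct 4 x 4w": simulate(words, num_sets=4, ways=1, block_words=4),
    "full assoc 16 x 1w": simulate(words, num_sets=1, ways=16, block_words=1),
    "full assoc 4 x 4w": simulate(words, num_sets=1, ways=4, block_words=4),
    "4 sets x 4 ways x 4w": simulate(words, num_sets=4, ways=4, block_words=4),
}
for name, trace in results.items():
    print(name, trace)
```

The sketch stores the whole block number in place of a tag, which is enough to decide hits and misses but is not how the tag field is sized in hardware.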

Writing Policies

When a system writes to the cache, it must at some point write to the backing store as well. The timing of this write is controlled by what is known as the write policy.

There are two basic writing approaches:

Write-through - Write is done synchronously both to the cache and to the backing store.

Write-back (or Write-behind) - Initially, writing is done only to the cache. The write to the backing store is postponed until the cache blocks containing the data are about to be modified/replaced by new content.


Cache Write-through policy

If the word is in the cache (write "hit"), write to the cache.

What about main memory? If the cache changes but memory stays the same, the memory data can become stale.

The word may be removed from the cache and later read again from memory; the result is incorrect (stale) data.

Write-through policy: update memory on every cache write.

If blocksize is 1, there is no write miss; the new block overrides the old one.

If blocksize > 1 and a write miss occurs, we cannot write just one word from cache to memory; we must write the whole block.

Write-through Overhead

Problem with write-through: a write to memory occurs frequently (on every write), hence average memory access time is poor.

Write-through systems use write buffering to reduce this problem.

A buffer is inserted between the processor and memory.

The buffer can hold a small number (1-8) of write requests; a write request includes the data and address to be written.

After a word has been sent to the buffer, the processor does not wait for the write to complete; it continues with other instructions simultaneously.

Cache Write-back Policy

Write-back policy: update memory only when needed.

Suppose a block b1 is to be read from memory into cache block b. If cache block b contains memory block b2 and b was updated since the last load of b2 into the cache, then write-back block b into memory block b2 and then load memory block b1 into cache block b.

A possible problem with write-back: if too many cache misses occur, write to memory might occur too frequently (both for read and write occasions), hence average memory access time is poor.

Write-back overhead: more complicated - needs to keep a dirty flag for each block.

Regardless of blocksize, old block must be written (updated) before new block may override it.

Write-back is NOT required for every write.

Write-back is done only when a cache block is to be "taken out" of the cache in favor of another block, either for reading this other block or for writing into it.

Write-back is mostly used in virtual memory systems, where reading or writing from/to a lower level is costly.

Write-back cache is more complex to implement, since it needs to track which of its locations have been written over, and mark them as dirty for later writing to the backing store. The data in these locations are written back to the backing store only when they are evicted from the cache, an effect referred to as a lazy write.

For this reason, a read miss in a write-back cache (which requires a block to be replaced by another) will often require two memory accesses to service: one to write the replaced data from the cache back to the store, and then one to retrieve the needed datum.
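The difference in memory traffic between the two policies can be illustrated with a small counter. This Python sketch makes several simplifying assumptions that are not in the slides: a direct-mapped cache with one-word blocks, no write buffer, and dirty blocks still resident at the end are not flushed. The function name `memory_writes` is hypothetical.

```python
def memory_writes(ops, num_blocks, policy):
    """Count writes that reach memory.

    ops: list of ("r" or "w", word_address)
    policy: "write-through" or "write-back"
    """
    resident = [None] * num_blocks   # word address held by each cache block
    dirty = [False] * num_blocks     # dirty flags (used by write-back only)
    writes = 0
    for kind, address in ops:
        index = address % num_blocks
        if resident[index] != address:          # miss: the block is replaced
            if policy == "write-back" and dirty[index]:
                writes += 1                     # lazy write of the evicted block
            resident[index] = address
            dirty[index] = False
        if kind == "w":
            if policy == "write-through":
                writes += 1                     # memory updated on every write
            else:
                dirty[index] = True             # memory updated on eviction
    return writes

# Five writes to word 0, then a read of word 8, which maps to the same block.
ops = [("w", 0)] * 5 + [("r", 8)]
through = memory_writes(ops, num_blocks=8, policy="write-through")
back = memory_writes(ops, num_blocks=8, policy="write-back")
print(through, back)
```

In this trace the write-back cache pays for a single memory write, at the moment the dirty block is evicted by the read miss.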


Write-back - write miss

For blocksize = 1 word:

Write the block from the cache to the memory (update memory).

Write the new word to the appropriate location in the cache.

For blocksize > 1 word:

Write the block from the cache to the memory (update memory).

Read the new block from memory to the cache.

Write the new word to the new block in the cache.

Exercise 2, Question 2

In effect, we need to find one sequence in which the disadvantage of the larger number of misses in C1 is offset by the extra time C2 needs to load a run of data.

m1*8 < m2*11, m1 = m2 + h2, (m2 + h2)*8 < m2*11, h2 < 3/8 m2

Substituting h2 = 1 (h2 is an integer greater than 0) gives 8/3 < m2.

Such a simple case is already achieved with the sequence 8, 4, 1, 0. We charge each cache according to its scheme:

C1: 4M, total time = 4*8 = 32
C2: 3M 1H, total time = 3*11 = 33

In the previous question the following valid sequence was presented (see there, the details for parts a and b):

(4, 1, 8, 17, 50, 9, 21, 6, 9, 11)

Both caches use direct mapping.
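The totals 32 and 33 can be checked with a short Python sketch. The cache shapes are an assumption carried over from Exercise 1 (C1 with 16 one-word blocks, C2 with 4 four-word blocks); the miss costs 8 and 11 come from the inequality in this exercise, and the function name `direct_mapped_misses` is hypothetical.

```python
def direct_mapped_misses(words, num_blocks, block_words):
    """Count misses of a direct-mapped cache on a sequence of word addresses."""
    resident = [None] * num_blocks       # memory block held by each cache block
    misses = 0
    for word in words:
        block = word // block_words
        index = block % num_blocks
        if resident[index] != block:
            resident[index] = block      # load the block on a miss
            misses += 1
    return misses

sequence = [8, 4, 1, 0]
# Assumed shapes: C1 = 16 blocks of 1 word, C2 = 4 blocks of 4 words.
c1_time = direct_mapped_misses(sequence, num_blocks=16, block_words=1) * 8
c2_time = direct_mapped_misses(sequence, num_blocks=4, block_words=4) * 11
print(c1_time, c2_time)
```

C2 saves one miss because words 0 and 1 share a block, yet its higher cost per miss still leaves it one time unit behind C1 on this sequence.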
