memory hierarchy part 1
Post on 02-Jan-2016
Faculty of Computer Science
CMPUT 229 © 2006
Memory Hierarchy Part 1
Refreshing Memory
© 2006
Department of Computing Science
CMPUT 229
Reading Assignment
Optional: Bryant, Randal E., O’Hallaron, David, Computer Systems: A Programmer’s Perspective, Prentice Hall, 2003. (B&H)
Chapter 6: The Memory Hierarchy
Required: Sections 8.4 and 12.4 of the Clements textbook.
Types of Memories
Read/Write Memory (RWM): we can store and retrieve data.
Random Access Memory (RAM): the time required to read or write a bit of memory is independent of the bit’s location.
Static Random Access Memory (SRAM): once a word is written to a location, it remains stored as long as power is applied to the chip, unless the location is written again.
Dynamic Random Access Memory (DRAM): the data stored at each location must be refreshed periodically by reading it and then writing it back again, or else it disappears.
[Figure, shown in four animation steps: an 8x4 SRAM array. A 3-to-8 decoder driven by address lines A2, A1, A0 selects one of eight rows (0 to 7) of four memory cells, each with IN, OUT, SEL, and WR terminals. Data inputs DIN3 to DIN0, data outputs DOUT3 to DOUT0. Control inputs WE_L, CS_L, OE_L; internal signals WR_L, IOE_L.]
Refreshing the Memory
[Figure: capacitor voltage Vcap versus time, between 0V and VCC, with the HIGH and LOW levels marked. Annotations: 0 stored, 1 written, refreshes.]
The charge stored on a DRAM cell's capacitor leaks away over time. The solution is to periodically refresh the memory cells by reading and writing back each one of them.
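The effect of refreshing can be sketched with a toy model of a single DRAM cell. This is only an illustration: the charge units, the leak per time step, and the sensing threshold are made-up numbers, and the names (`Cell`, `cell_tick`, `survive`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one DRAM cell. All constants are illustrative assumptions. */
enum {
    FULL_CHARGE = 100,  /* charge right after a 1 is written (arbitrary units) */
    THRESHOLD   = 50,   /* charge above this level is sensed as a 1 */
    LEAK        = 10    /* charge lost per time step */
};

typedef struct { int charge; } Cell;

/* Writing stores full charge for a 1 and no charge for a 0. */
void cell_write(Cell *c, bool bit) { assert(c); c->charge = bit ? FULL_CHARGE : 0; }

/* Sensing compares the remaining charge with the threshold. */
bool cell_read(const Cell *c) { assert(c); return c->charge > THRESHOLD; }

/* One time step: the capacitor leaks some charge. */
void cell_tick(Cell *c) { assert(c); c->charge = c->charge > LEAK ? c->charge - LEAK : 0; }

/* Refresh = read the bit, then write it back at full strength. */
void cell_refresh(Cell *c) { cell_write(c, cell_read(c)); }

/* Store a 1, let `steps` time steps pass, refreshing every `period` steps
   (period 0 means never refresh), and return the bit that is read back. */
bool survive(int steps, int period)
{
    Cell c;
    cell_write(&c, true);
    for (int t = 1; t <= steps; t++) {
        cell_tick(&c);
        if (period > 0 && t % period == 0)
            cell_refresh(&c);
    }
    return cell_read(&c);
}
```

With these numbers, `survive(20, 3)` keeps the stored 1, while `survive(20, 0)` (no refresh) and `survive(20, 6)` (refresh arrives after the charge has already dropped below the threshold) both lose it.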
SRAM with Bi-directional Data Bus
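[Figure: a row of four SRAM cells (IN, OUT, SEL, WR) whose inputs and outputs share the bi-directional data pins DIO3 to DIO0, which connect to the microprocessor. Control inputs WE_L, CS_L, OE_L; internal signals WR_L, IOE_L.]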
DRAM High Level View
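[Figure: a DRAM chip organized as a 4x4 array of supercells (Rows 0 to 3, Cols 0 to 3) with an internal row buffer. The memory controller (connected to the CPU) drives a 2-bit addr bus and an 8-bit data bus.]
Bryant/O’Hallaron, pp. 459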
DRAM RAS Request
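[Figure: the memory controller sends RAS = 2 on the 2-bit addr lines. The DRAM chip copies Row 2 of its 4x4 supercell array (Rows 0 to 3, Cols 0 to 3) into the internal row buffer. The data lines are 8 bits wide.]
RAS = Row Address Strobe
Bryant/O’Hallaron, pp. 460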
DRAM CAS Request
[Figure: the memory controller sends CAS = 1 on the 2-bit addr lines. The DRAM chip copies Supercell (2,1) from the internal row buffer onto the 8-bit data lines.]
CAS = Column Address Strobe
Bryant/O’Hallaron, pp. 460
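The RAS and CAS requests can be sketched as two operations on a small software model of the 4x4 DRAM chip. This is a minimal illustration, not a timing-accurate model; the type and function names (`Dram`, `dram_ras`, `dram_cas`, `dram_read`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { ROWS = 4, COLS = 4 };   /* the 4x4 array of supercells from the figures */

typedef struct {
    uint8_t cells[ROWS][COLS]; /* each supercell holds 8 bits */
    uint8_t row_buffer[COLS];  /* internal row buffer */
} Dram;

/* RAS request: copy the whole addressed row into the internal row buffer. */
void dram_ras(Dram *d, unsigned row)
{
    assert(d && row < ROWS);
    memcpy(d->row_buffer, d->cells[row], COLS);
}

/* CAS request: return one supercell from the row buffer on the data pins. */
uint8_t dram_cas(const Dram *d, unsigned col)
{
    assert(d && col < COLS);
    return d->row_buffer[col];
}

/* What the memory controller does to read supercell (row, col). */
uint8_t dram_read(Dram *d, unsigned row, unsigned col)
{
    dram_ras(d, row);
    return dram_cas(d, col);
}
```

Reading supercell (2,1) is then `dram_read(&d, 2, 1)`: the RAS step loads row 2 into the row buffer and the CAS step picks column 1 out of it.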
Memory Modules
[Figure: a 64 MB memory module consisting of 8 8Mx8 DRAMs (DRAM 0 to DRAM 7). The memory controller sends addr (row = i, col = j) to the module, and each DRAM outputs the 8 bits of its Supercell (i,j). The eight bytes (bits 0-7, 8-15, 16-23, 24-31, 32-39, 40-47, 48-55, 56-63) form the 64-bit doubleword at main memory address A, which is sent to the CPU chip.]
Bryant/O’Hallaron, pp. 461
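How a memory module builds one 64-bit doubleword from eight 8-bit supercells can be sketched as follows. The function names are hypothetical, and the sketch assumes that chip k supplies bits 8k to 8k+7 of the doubleword.

```c
#include <assert.h>
#include <stdint.h>

enum { CHIPS = 8 };  /* eight 8Mx8 DRAMs in the module */

/* Combine the eight supercell bytes into one 64-bit doubleword.
   Assumption: chip k supplies bits 8k .. 8k+7. */
uint64_t module_assemble(const uint8_t supercell[CHIPS])
{
    assert(supercell);
    uint64_t word = 0;
    for (int k = 0; k < CHIPS; k++)
        word |= (uint64_t)supercell[k] << (8 * k);
    return word;
}

/* Capacity check: 8 chips x 8M supercells x 1 byte per supercell. */
uint64_t module_capacity_bytes(void)
{
    const uint64_t supercells_per_chip = 8ull * 1024 * 1024;  /* 8M */
    return CHIPS * supercells_per_chip;                       /* 64 MB */
}
```

Because all eight chips receive the same (row, col) address, one RAS/CAS pair delivers a full doubleword to the memory controller.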
Read Cycle on an Asynchronous DRAM
Step 1: Apply the row address.
Step 2: RAS goes from high to low and remains low.
Step 3: Apply the column address.
Step 4: WE must be high.
Step 5: CAS goes from high to low and remains low.
Step 6: OE goes low.
Step 7: Data appears.
Step 8: RAS and CAS return to high.
Improved DRAMs
Central Idea: Each read to a DRAM actually reads a complete row of bits or word line from the DRAM core into an array of sense amps.
A traditional asynchronous DRAM interface then selects a small number of these bits to be delivered to the cache/microprocessor.
All the other bits already extracted from the DRAM cells into the sense amps are wasted.
Fast Page Mode DRAMs
In a DRAM with Fast Page Mode, a page is defined as all memory addresses that have the same row address.
To read in fast page mode, all the steps from 1 to 7 of a standard read cycle are performed.
Then OE and CAS are switched high, but RAS remains low.
Then the steps 3 to 7 (providing a new column address, asserting CAS and OE) are performed for each new memory location to be read.
A Fast Page Mode Read Cycle on an Asynchronous DRAM
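The saving from fast page mode can be sketched by counting how often RAS must be asserted for a sequence of reads. This is a simplified model with hypothetical names (`Addr`, `ras_count_fast_page`); it assumes RAS can stay low for as long as consecutive reads hit the same row, and ignores refresh and any limit on how long RAS may remain low.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned row, col; } Addr;

/* Standard read cycles: every read asserts RAS once. */
size_t ras_count_standard(size_t n) { return n; }

/* Fast page mode: RAS stays low while consecutive reads hit the same row
   (page), so RAS is asserted only when the row changes. */
size_t ras_count_fast_page(const Addr *reads, size_t n)
{
    assert(reads || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (i == 0 || reads[i].row != reads[i - 1].row)
            count++;
    return count;
}
```

For the reads (2,0), (2,1), (2,3), (1,0), (1,1), a standard interface asserts RAS five times, while fast page mode asserts it twice, once per page.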
Extended Data Output RAMs (EDO-RAM)
The process to read multiple locations in an EDO-RAM is very similar to the Fast Page Mode.
The difference is that the output drivers are not disabled when CAS goes high.
This distinction allows the data from the current read cycle to be present at the outputs while the next cycle begins.
As a result, faster read cycle times are allowed.
An Extended Data Output Read Cycle on an Asynchronous DRAM
Synchronous DRAMs (SDRAM)
A Synchronous DRAM (SDRAM) has a clock input. It operates in a fashion similar to fast page mode and EDO DRAM. However, consecutive data is output synchronously on the falling/rising edge of the clock, instead of on command by CAS.
How many data elements will be output (the length of the burst) is programmable up to the maximum size of the row.
The clock period of an SDRAM is typically one order of magnitude shorter than the access time for an individual access.
DDR SDRAM
A Double Data Rate (DDR) SDRAM is an SDRAM that allows data transfers both on the rising and falling edge of the clock.
Thus the effective data transfer rate of a DDR SDRAM is two times the data transfer rate of a standard SDRAM with the same clock frequency.
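The factor of two can be checked with a small calculation of the peak transfer rate. The function name and the example values (a 100 MHz clock and an 8-byte-wide data bus) are assumptions chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Peak transfer rate in bytes per second.
   transfers_per_clock is 1 for a standard SDRAM and 2 for a DDR SDRAM. */
uint64_t peak_rate(uint64_t clock_hz, unsigned bus_bytes, unsigned transfers_per_clock)
{
    assert(transfers_per_clock == 1 || transfers_per_clock == 2);
    return clock_hz * bus_bytes * transfers_per_clock;
}
```

With the assumed values, `peak_rate(100000000, 8, 1)` gives 800,000,000 bytes per second for a standard SDRAM and `peak_rate(100000000, 8, 2)` gives 1,600,000,000 for a DDR SDRAM at the same clock frequency.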
The Rambus DRAM (RDRAM)
Multiple memory arrays (banks)
Rambus DRAMs are synchronous and transfer data on both edges of the clock.
SDRAM Memory Systems
Complex circuits for RAS/CAS/OE.
Each DIMM is connected in parallel with the memory controller. (DIMM = Dual In-line Memory Module)
Often requires buffering.
Needs the whole clock cycle to establish valid data.
Making the bus wider is mechanically complicated.
RDRAM Memory Systems
Locality
We say that a computer program exhibits good locality if the program tends to reference data that is nearby or data that has been referenced recently.
Because a program might do one of these things, but not the other, the principle of locality is separated into two flavors:
Temporal locality: a memory location that is referenced once is likely to be referenced multiple times in the near future.
Spatial locality: if a memory location is referenced once, then nearby locations are likely to be referenced in the near future.
Bryant/O’Hallaron, pp. 478
Examples
In the Sampler function below, RandInt returns a randomly selected integer within the specified interval. Which program has better locality?

int SumVec(int v[], int N)
{
    int i;
    int sum = 0;

    for (i = 0; i < N; i = i + 1)
        sum += v[i];
    return sum;
}

int Sampler(int v[], int N, int K)
{
    int i, j;
    int sum = 0;

    for (i = 0; i < K; i = i + 1)
    {
        j = RandInt(0, N-1);
        sum += v[j];
    }
    return sum/K;
}

Bryant/O’Hallaron, pp. 479
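One way to compare the two access patterns is to count how often an access lands in a different memory block than the previous access. The sketch below is only an illustration: it assumes 32-byte blocks and 4-byte ints (8 ints per block), uses a small linear congruential generator as a stand-in for RandInt, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { INTS_PER_BLOCK = 8 };  /* assumption: 32-byte blocks, 4-byte int */

/* Sequential pattern of SumVec: index i = 0, 1, ..., n-1.
   Counts accesses that fall in a different block than the previous access. */
size_t block_changes_sequential(size_t n)
{
    size_t changes = 0;
    for (size_t i = 1; i < n; i++)
        if (i / INTS_PER_BLOCK != (i - 1) / INTS_PER_BLOCK)
            changes++;
    return changes;
}

/* Random pattern of Sampler: k indices drawn from [0, n-1].
   A small linear congruential generator stands in for RandInt. */
size_t block_changes_random(size_t n, size_t k, uint32_t seed)
{
    assert(n > 0);
    size_t changes = 0, prev = 0;
    uint32_t x = seed;
    for (size_t i = 0; i < k; i++) {
        x = x * 1103515245u + 12345u;
        size_t j = (x >> 16) % n;
        if (i > 0 && j / INTS_PER_BLOCK != prev / INTS_PER_BLOCK)
            changes++;
        prev = j;
    }
    return changes;
}
```

For a 1024-element vector, the sequential pattern changes block only once every eight accesses (127 times in total), while the random pattern changes block on almost every access. This is why SumVec has good spatial locality with respect to v and Sampler does not.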
Memory Hierarchy
Smaller, faster, and costlier (per byte) storage devices are at the top of the hierarchy; larger, slower, and cheaper (per byte) storage devices are at the bottom.
L0: Registers. CPU registers hold words retrieved from cache memory.
L1: On-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache.
L2: Off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from memory.
L3: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L4: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L5: Remote secondary storage (distributed file systems, Web servers).
Bryant/O’Hallaron, pp. 483
Caching Principle
Level k: Smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (in the figure: blocks 4, 9, 14, and 3).
Data is copied between levels in block-sized transfer units.
Level k+1: Larger, slower, cheaper storage device at level k+1 is partitioned into blocks (in the figure: blocks 0 to 15).
Bryant/O’Hallaron, pp. 484
Cache Misses
Cold Misses, or compulsory misses, occur the first time that a data item is referenced.
Conflict Misses occur when two memory references have to occupy the same cache line. They can occur even when the remainder of the cache is not in use.
Capacity Misses occur when there are no more free lines in the cache.
Simplest Cache: Direct Mapped
[Figure: a Memory with Memory Address 0 to F next to a 4 Byte Direct Mapped Cache with Cache Index 0 to 3.]
Location 0 can be occupied by data from:
– Memory location 0, 4, 8, ... etc.
– In general: any memory location whose 2 LSBs of the address are 0s
– Address<1:0> => cache index
Which one should we place in the cache?
How can we tell which one is in the cache?
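One common answer to the second question is to store, next to each cache line, a valid bit and the remaining upper address bits (a tag). A minimal sketch of the 4-byte direct-mapped cache along those lines follows; the names (`Cache`, `cache_access`) are hypothetical, and replacing the line on every miss is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { LINES = 4 };  /* 4-byte direct-mapped cache, one byte per line */

typedef struct {
    bool    valid[LINES];
    uint8_t tag[LINES];   /* upper address bits of the byte held in each line */
} Cache;

unsigned cache_index(uint8_t addr) { return addr & 0x3; }  /* Address<1:0> */
unsigned cache_tag(uint8_t addr)   { return addr >> 2; }   /* remaining bits */

/* Access one address; return true on a hit. On a miss the line is
   replaced, since a direct-mapped cache has only one candidate line. */
bool cache_access(Cache *c, uint8_t addr)
{
    assert(c);
    unsigned i = cache_index(addr);
    if (c->valid[i] && c->tag[i] == cache_tag(addr))
        return true;
    c->valid[i] = true;
    c->tag[i] = (uint8_t)cache_tag(addr);
    return false;
}
```

Accessing addresses 0, 4, 0, 4 in turn misses every time, because both addresses map to cache index 0 and keep evicting each other, even though the other three lines are never used.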
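1 KB Direct Mapped Cache, 32B blocks
For a 2 ** N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2 ** M)
[Figure: a 32-bit address split into Cache Tag (bits 31 to 10, example: 0x50), Cache Index (bits 9 to 5, ex: 0x01), and Byte Select (bits 4 to 0, ex: 0x00). The Cache Index selects one of 32 cache lines (0, 1, 2, 3, ...). Each line has a Valid Bit, a Cache Tag, and 32 bytes of Cache Data: Byte 0 to Byte 31 in line 0, Byte 32 to Byte 63 in line 1, and so on up to Byte 992 to Byte 1023 in line 31. The Cache Tag is stored as part of the cache “state”.]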
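The address split for this cache can be sketched as a few shifts and masks, with N = 10 (1 KB cache) and M = 5 (32-byte blocks). The type and function names are hypothetical; the example address 0x14020 is the one whose fields match the values in the figure (tag 0x50, index 0x01, byte select 0x00).

```c
#include <assert.h>
#include <stdint.h>

enum {
    N = 10,  /* cache size = 2**N = 1 KB */
    M = 5    /* block size = 2**M = 32 B */
};

typedef struct { uint32_t tag, index, byte_select; } Fields;

/* Split a 32-bit address into Cache Tag, Cache Index, and Byte Select. */
Fields split_address(uint32_t addr)
{
    Fields f;
    f.byte_select = addr & ((1u << M) - 1);              /* bits 4..0   */
    f.index       = (addr >> M) & ((1u << (N - M)) - 1); /* bits 9..5   */
    f.tag         = addr >> N;                           /* bits 31..10 */
    return f;
}

/* Inverse operation: rebuild the address from its three fields. */
uint32_t join_fields(Fields f)
{
    assert(f.index < (1u << (N - M)) && f.byte_select < (1u << M));
    return (f.tag << N) | (f.index << M) | f.byte_select;
}
```

The Cache Index is the N - M = 5 bits between the tag and the byte select, which is enough to address the 32 lines of the cache.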
Direct-mapped Cache
Clements pp. 346
Identifying sets in Direct-mapped Caches
Clements pp. 347
Operation of a Direct-mapped Cache
Clements pp. 348
Fully Associative Cache
Clements pp. 348
Two-way Set Associative Cache
N-way set associative: N entries for each Cache Index
– N direct mapped caches operate in parallel (N typically 2 to 4)
Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– The two tags in the set are compared in parallel
– Data is selected based on the tag result
[Figure: two banks, each with Valid, Cache Tag, and Cache Data (Cache Block 0, ...) columns, both addressed by the Cache Index. The tag from each bank is compared with the address tag (Adr Tag). The two compare results (Sel0, Sel1) are ORed to produce Hit and control a Mux that selects the Cache Block.]
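The lookup in a two-way set associative cache can be sketched in software as follows. The sizes and names (`SETS`, `Line`, `Set`, `lookup`) are assumptions for illustration, and the two tag comparisons that hardware performs in parallel are written here one after the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { SETS = 4, WAYS = 2 };  /* illustrative sizes */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t block;  /* stands in for the cache block's data */
} Line;

typedef struct { Line way[WAYS]; } Set;

/* Look up (index, tag). Returns true on a hit and stores the block in *out. */
bool lookup(const Set cache[SETS], unsigned index, uint32_t tag, uint32_t *out)
{
    assert(cache && out && index < SETS);
    const Set *s = &cache[index];                          /* Cache Index selects a set */
    bool sel0 = s->way[0].valid && s->way[0].tag == tag;   /* the two compares, done in */
    bool sel1 = s->way[1].valid && s->way[1].tag == tag;   /* parallel in hardware      */
    if (sel0 || sel1)                                      /* OR gate gives Hit         */
        *out = sel1 ? s->way[1].block : s->way[0].block;   /* mux picks the block       */
    return sel0 || sel1;
}
```

A block can sit in either way of its set, so two addresses with the same index but different tags no longer have to evict each other as they would in a direct-mapped cache.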
Set associative-mapped cache
Clements pp. 349
L1 and L2 Bus System
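[Figure: the CPU chip contains the Register file, ALU, L1 cache, and Bus interface. The Cache bus connects the Bus interface to the L2 cache, the System bus connects it to the I/O bridge, and the Memory bus connects the I/O bridge to Main memory.]
Bryant/O’Hallaron, pp. 488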
Cache Organization
[Figure: a cache is an array of S = 2^s sets (Set 0, Set 1, ..., Set S-1). Each set holds E lines. Each line has 1 valid bit, t tag bits, and a cache block of B = 2^b bytes (bytes 0, 1, ..., B-1).]
Cache size: C = B x E x S data bytes
Bryant/O’Hallaron, pp. 488
Address Partition
Address (bit m-1 down to bit 0): t bits | s bits | b bits
Tag (t bits): Compared with tags in the cache to find a match.
Set index (s bits): Used to find the set where the data might be found in the cache.
Block offset (b bits): Selects which word, inside the block, is referenced.
Bryant/O’Hallaron, pp. 488
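The general address partition and the cache size formula C = B x E x S can be sketched together. The struct and function names below are hypothetical, and the sketch assumes addresses of at most 64 bits.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned m;  /* address bits */
    unsigned s;  /* set index bits,    S = 2**s sets */
    unsigned b;  /* block offset bits, B = 2**b bytes per block */
    unsigned E;  /* lines per set */
} CacheGeom;

typedef struct { uint64_t tag, set, offset; } Parts;

/* t = m - (s + b) tag bits per line. */
unsigned tag_bits(CacheGeom g) { assert(g.s + g.b <= g.m); return g.m - g.s - g.b; }

/* C = B x E x S data bytes. */
uint64_t cache_size(CacheGeom g) { return ((uint64_t)1 << g.b) * g.E * ((uint64_t)1 << g.s); }

/* Split an address into tag, set index, and block offset. */
Parts partition(CacheGeom g, uint64_t addr)
{
    assert(g.m <= 64 && g.s + g.b < 64);
    Parts p;
    p.offset = addr & (((uint64_t)1 << g.b) - 1);
    p.set    = (addr >> g.b) & (((uint64_t)1 << g.s) - 1);
    p.tag    = addr >> (g.s + g.b);
    return p;
}
```

For example, with m = 32, s = 5, b = 5, E = 1 this describes a 1 KB direct-mapped cache with 22 tag bits. Keeping the same 1 KB of data but using E = 2 and s = 4 gives a two-way set associative cache with 23 tag bits.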
Multi-Level Cache Organization
[Figure: the CPU contains Regs, an L1 i-cache, and an L1 d-cache, followed by an L2 unified cache, Main memory, and Disk.]
Bryant/O’Hallaron, pp. 504