Lecture 17 - UCSB (strukov/ece154aFall2012/lecture17.pdf)
Adapted from instructor's supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK


Page 1: Lecture 17

Adapted from instructor's supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK

Page 2: SRAM / DRAM / Flash / RRAM / HDD

Page 5: Magnetic Hard Disk (HDD)

• Purpose
– Long-term, nonvolatile storage
– Lowest level in the memory hierarchy: slow, large, inexpensive

• General structure
– A rotating platter coated with a magnetic surface
– A moveable read/write head to access the information on the disk
(Figure: platter with Track and Sector labeled)

• Typical numbers
– 1 to 4 platters (each with 2 recordable surfaces) per disk, 1" to 3.5" in diameter
– Rotational speeds of 5,400 to 15,000 RPM
– 10,000 to 50,000 tracks per surface
• cylinder: all the tracks under the heads at a given point on all surfaces
– 100 to 500 sectors per track
• the sector is the smallest unit that can be read or written (typically 512 B)
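The "typical numbers" above multiply out to a rough disk capacity. The sketch below is illustrative only; the specific values are picks from the stated ranges, and `disk_capacity_bytes` is a hypothetical helper name:

```python
# Rough disk capacity from the slide's typical numbers.
# Capacity = platters x surfaces x tracks/surface x sectors/track x sector size.
def disk_capacity_bytes(platters, tracks_per_surface, sectors_per_track,
                        sector_bytes=512, surfaces_per_platter=2):
    return (platters * surfaces_per_platter * tracks_per_surface
            * sectors_per_track * sector_bytes)

# 4 platters, 50,000 tracks/surface, 500 sectors/track, 512 B sectors
cap = disk_capacity_bytes(4, 50_000, 500)
print(cap / 1e9, "GB")  # 102.4 GB
```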

Page 6: Magnetic Disk Characteristics

• Disk read/write time has four components:

1. Seek time: position the head over the proper track (3 to 13 ms avg)
• due to locality of disk references, the actual average seek time may be only 25% to 33% of the advertised number

2. Rotational latency: wait for the desired sector to rotate under the head (½ of 1/RPM converted to ms)
• 0.5/5400 RPM = 5.6 ms    to    0.5/15000 RPM = 2.0 ms

3. Transfer time: transfer a block of bits (one or more sectors) under the head to the disk controller's cache (70 to 125 MB/s are typical disk transfer rates in 2008)
• the disk controller's "cache" takes advantage of spatial locality in disk accesses
– cache transfer rates are much faster (e.g., 375 MB/s)

4. Controller time: the overhead the disk controller imposes in performing a disk I/O access (typically < 0.2 ms)

(Figure: disk with Track, Sector, Cylinder, Head, and Platter labeled, plus Controller + Cache)
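The four components above simply add. A sketch of the arithmetic, using illustrative values from the slide's ranges (`avg_access_time_ms` is a hypothetical helper):

```python
# Average disk access time = seek + rotational latency + transfer + controller.
def avg_access_time_ms(seek_ms, rpm, block_bytes, transfer_mb_s,
                       controller_ms=0.2):
    rotational_ms = 0.5 / rpm * 60 * 1000            # average half-rotation wait
    transfer_ms = block_bytes / (transfer_mb_s * 1e6) * 1000
    return seek_ms + rotational_ms + transfer_ms + controller_ms

# 6 ms seek, 15,000 RPM, one 512 B sector at 100 MB/s
t = avg_access_time_ms(6, 15000, 512, 100)
print(round(t, 2), "ms")  # ~8.21 ms; seek time dominates
```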

Page 7: How is the Hierarchy Managed?

• registers ↔ memory
– by the compiler (programmer?)

• cache ↔ main memory
– by the cache controller hardware

• main memory ↔ disks
– by the operating system (virtual memory)
– virtual-to-physical address mapping assisted by the hardware (TLB)
– by the programmer (files)

Page 8: Memory Systems that Support Caches

• The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways

• One-word-wide organization (one-word-wide bus and one-word-wide memory)

(Figure: on-chip CPU and Cache connected by a 32-bit data & 32-bit addr bus, one word per cycle, to DRAM Memory)

• Assume
1. 1 memory bus clock cycle to send the addr
2. 15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time), 5 memory bus clock cycles for the 2nd, 3rd, 4th words (column access time)
3. 1 memory bus clock cycle to return a word of data

• Memory-Bus to Cache bandwidth = number of bytes accessed from memory and transferred to cache/CPU per memory bus clock cycle

Page 9: One Word Wide Bus, One Word Blocks

• If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory

___  cycle to send address
___  cycles to read DRAM
___  cycle to return data
___  total clock cycles miss penalty

• Number of bytes transferred per clock cycle (bandwidth) for a single miss is

___  bytes per memory bus clock cycle

Page 10: One Word Wide Bus, One Word Blocks

• If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory

1    memory bus clock cycle to send address
15   memory bus clock cycles to read DRAM
1    memory bus clock cycle to return data
17   total clock cycles miss penalty

• Number of bytes transferred per clock cycle (bandwidth) for a single miss is

4/17 = 0.235 bytes per memory bus clock cycle
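The miss-penalty arithmetic can be checked with a few lines of code; this sketch hard-codes the slide's assumed timings (1 cycle for the address, 15 for the DRAM read, 1 to return the word), and the function name is a hypothetical label:

```python
# One-word-wide bus, one-word blocks: the whole miss is one DRAM access.
WORD_BYTES = 4

def miss_penalty_one_word_block():
    send_addr, read_dram, return_data = 1, 15, 1
    return send_addr + read_dram + return_data

penalty = miss_penalty_one_word_block()
print(penalty, round(WORD_BYTES / penalty, 3))  # 17 0.235
```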

Page 11: One Word Wide Bus, Four Word Blocks

• What if the block size is four words and each word is in a different DRAM row?

___  cycle to send 1st address
___  cycles to read DRAM
___  cycles to return last data word
___  total clock cycles miss penalty

• Number of bytes transferred per clock cycle (bandwidth) for a single miss is

___  bytes per clock

Page 12: One Word Wide Bus, Four Word Blocks

• What if the block size is four words and each word is in a different DRAM row?

1              cycle to send 1st address
4 × 15 = 60    cycles to read DRAM
1              cycles to return last data word
62             total clock cycles miss penalty

(Timing: four sequential 15-cycle DRAM row accesses)

• Number of bytes transferred per clock cycle (bandwidth) for a single miss is

(4 × 4)/62 = 0.258 bytes per clock

Page 13: One Word Wide Bus, Four Word Blocks

• What if the block size is four words and all words are in the same DRAM row?

___  cycle to send 1st address
___  cycles to read DRAM
___  cycles to return last data word
___  total clock cycles miss penalty

• Number of bytes transferred per clock cycle (bandwidth) for a single miss is

___  bytes per clock

Page 14: One Word Wide Bus, Four Word Blocks

• What if the block size is four words and all words are in the same DRAM row?

1                cycle to send 1st address
15 + 3×5 = 30    cycles to read DRAM
1                cycles to return last data word
32               total clock cycles miss penalty

(Timing: one 15-cycle row access followed by three 5-cycle column accesses)

• Number of bytes transferred per clock cycle (bandwidth) for a single miss is

(4 × 4)/32 = 0.5 bytes per clock
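The two four-word cases on Pages 12 and 14 differ only in how many of the four reads pay the full 15-cycle row access instead of the 5-cycle column access. A sketch with the slide's timings (the function name is a hypothetical label):

```python
# Four-word block miss penalty on a one-word-wide bus.
# Slide's assumptions: 1 cycle addr, 15-cycle row access,
# 5-cycle same-row column access, 1 cycle to return the last word.
def miss_penalty_four_words(same_row):
    send_addr, row, col, return_last = 1, 15, 5, 1
    if same_row:
        read = row + 3 * col   # 15 + 3*5 = 30
    else:
        read = 4 * row         # 4 * 15 = 60
    return send_addr + read + return_last

for same_row in (False, True):
    p = miss_penalty_four_words(same_row)
    print(p, round(16 / p, 3))  # 62 0.258, then 32 0.5
```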

Page 15: (DDR) SDRAM Operation

• After a row is read into the N × M SRAM register:
– Input CAS as the starting "burst" address along with a burst length
– Transfers a burst of data (ideally a cache block) from a series of sequential addresses within that row
– The memory bus clock controls transfer of successive words in the burst

(Figure: an N-rows × N-cols DRAM array with M bit planes; the Row Address selects a row, which is read into the N × M SRAM register; a +1 counter steps the Column Address to produce the M-bit output. Timing diagram: RAS with the Row Address, then CAS with the Col Address, then the 1st M-bit access through the 4th M-bit access within one cycle time, then the next Row Address)

Page 16: Interleaved Memory, One Word Wide Bus

• For a block size of four words:

___  cycle to send 1st address
___  cycles to read DRAM banks
___  cycles to return last data word
___  total clock cycles miss penalty

(Figure: on-chip CPU and Cache on a one-word bus to four interleaved DRAM memory banks, bank 0 through bank 3)

• Number of bytes transferred per clock cycle (bandwidth) for a single miss is

___  bytes per clock

Page 17: Interleaved Memory, One Word Wide Bus

• For a block size of four words:

1            cycle to send 1st address
15           cycles to read DRAM banks (the four 15-cycle bank reads overlap)
4 × 1 = 4    cycles to return last data word
20           total clock cycles miss penalty

• Number of bytes transferred per clock cycle (bandwidth) for a single miss is

(4 × 4)/20 = 0.8 bytes per clock
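With four banks the 15-cycle DRAM reads happen in parallel, so only one row-access latency is exposed before the words stream back one per bus cycle. A sketch under the slide's timing assumptions (function name hypothetical):

```python
# Interleaved memory: the banks overlap their 15-cycle reads, then
# words return one per bus cycle over the one-word-wide bus.
def miss_penalty_interleaved(words=4, banks=4, row=15):
    assert words <= banks      # one word per bank in this sketch
    send_addr = 1
    read = row                 # bank reads proceed in parallel
    return_words = words * 1   # one bus cycle per word returned
    return send_addr + read + return_words

p = miss_penalty_interleaved()
print(p, (4 * 4) / p)  # 20 0.8
```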

Page 18: DRAM Memory System Summary

• It's important to match the cache characteristics
– caches access one block at a time (usually more than one word)

• with the DRAM characteristics
– use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache

• with the memory-bus characteristics
– make sure the memory bus can support the DRAM access rates and patterns
– with the goal of increasing the Memory-Bus to Cache bandwidth

Page 19: Memory Hierarchy Summary

Page 20: 4 Questions for the Memory Hierarchy

• Q1: Where can an entry be placed in the upper level? (Entry placement)

• Q2: How is an entry found if it is in the upper level? (Entry identification)

• Q3: Which entry should be replaced on a miss? (Entry replacement)

• Q4: What happens on a write? (Write strategy)

Page 21: Q1 & Q2: Where can an entry be placed/found?

Scheme             | # of sets                    | Entries per set
Direct mapped      | # of entries                 | 1
Set associative    | (# of entries)/associativity | associativity (typically 2 to 16)
Fully associative  | 1                            | # of entries

Scheme                       | Location method                   | # of comparisons
Direct mapped                | index                             | 1
Set associative              | index the set; compare set's tags | degree of associativity
Fully associative            | compare all entries' tags         | # of entries
Separate lookup (page) table |                                   | 0
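The placement rules in the first table reduce to a little index arithmetic. A sketch (the cache and block sizes are illustrative assumptions, and `set_index` is a hypothetical helper):

```python
# Which set does a byte address map to? Direct mapped is the 1-way
# special case; fully associative is the single-set special case.
def set_index(addr, block_bytes, num_blocks, associativity):
    num_sets = num_blocks // associativity
    block_number = addr // block_bytes
    return block_number % num_sets

# 256-block cache with 16 B blocks
print(set_index(0x1234, 16, 256, 1))    # direct mapped: 256 sets
print(set_index(0x1234, 16, 256, 4))    # 4-way: 64 sets
print(set_index(0x1234, 16, 256, 256))  # fully associative: always set 0
```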

Page 22: Q3: Which entry should be replaced on a miss?

• Easy for direct mapped – only one choice

• Set associative or fully associative
– Random
– LRU (Least Recently Used)

• For a 2-way set associative cache, random replacement has a miss rate about 1.1 times higher than LRU

• LRU is too costly to implement for high levels of associativity (> 4-way) since tracking the usage information is costly

Page 23: Q4: What happens on a write?

• Write-through – The information is written to the entry in the current memory level and to the entry in the next level of the memory hierarchy
– Always combined with a write buffer so write waits to next-level memory can be eliminated (as long as the write buffer doesn't fill)

• Write-back – The information is written only to the entry in the current memory level. The modified entry is written to the next level of memory only when it is replaced.
– Need a dirty bit to keep track of whether the entry is clean or dirty
– Virtual memory systems always use write-back of dirty pages to disk

• Pros and cons of each?
– Write-through: read misses don't result in writes (so are simpler and cheaper), easier to implement
– Write-back: writes run at the speed of the cache; repeated writes require only one write to lower level
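The dirty-bit bookkeeping behind write-back can be sketched for a single cache entry (a toy model, not tied to any particular cache organization; the class and names are hypothetical):

```python
# Write-back policy for one cache entry: writes only set a dirty bit;
# the lower level is updated only when a dirty entry is evicted.
class WriteBackEntry:
    def __init__(self, tag, data):
        self.tag, self.data, self.dirty = tag, data, False

    def write(self, data):
        self.data = data
        self.dirty = True          # defer the lower-level write

    def evict(self, lower_level):
        if self.dirty:             # write back only if modified
            lower_level[self.tag] = self.data
        self.dirty = False

memory = {}
e = WriteBackEntry(tag=0x40, data=0)
e.write(1); e.write(2); e.write(3)  # repeated writes run at cache speed
e.evict(memory)
print(memory)  # {64: 3} -- one lower-level write for three cache writes
```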

Page 24: Summary: Improving Cache Performance

0. Reduce the time to hit in the cache
– smaller cache
– direct mapped cache
– smaller blocks
– for writes
• no write allocate – no "hit" on cache, just write to write buffer
• write allocate – to avoid two cycles (first check for hit, then write), pipeline writes via a delayed write buffer to cache

1. Reduce the miss rate
– bigger cache
– more flexible placement (increase associativity)
– larger blocks (16 to 64 bytes typical)
– victim cache – small buffer holding most recently discarded blocks

Page 25: Summary: Improving Cache Performance

2. Reduce the miss penalty
– smaller blocks
– use a write buffer to hold dirty blocks being replaced, so reads don't have to wait for the write to complete
– check write buffer (and/or victim cache) on read miss – may get lucky
– for large blocks, fetch critical word first
– use multiple cache levels – L2 cache not tied to CPU clock rate
– faster backing store / improved memory bandwidth
• wider buses
• memory interleaving, DDR SDRAMs

Page 26: Summary: The Cache Design Space

• Several interacting dimensions
– cache size
– block size
– associativity
– replacement policy
– write-through vs write-back
– write allocation

• The optimal choice is a compromise
– depends on access characteristics
• workload
• use (I-cache, D-cache, TLB)
– depends on technology / cost

• Simplicity often wins

(Figure: the design space sketched along Cache Size, Associativity, and Block Size axes; a curve from Bad to Good as Factor A is traded against Factor B from Less to More)