ACA Chapter 3
TRANSCRIPT
IT6030 Advanced Computer Architecture
Chapter 3
Memory and Storage Systems
Nguyen Kim Khanh
Department of Computer Engineering
School of Information and Communication Technology
Hanoi University of Science and Technology
Since 1980, CPU has outpaced DRAM ...

[Figure: performance (1/latency) vs. year; CPU improving 60% per year, DRAM 9% per year; the gap grew 50% per year]
How do architects address this gap?
Put small, fast cache memories between CPU and DRAM: create a memory hierarchy.
Memory Hierarchy
Take advantage of the principle of locality to:
- Present as much memory as in the cheapest technology
- Provide access at speed offered by the fastest technology

[Figure: processor (datapath, registers, on-chip cache) backed by a second-level cache (SRAM), main memory (DRAM/FLASH/PCM), secondary storage (disk/FLASH/PCM), and tertiary storage (tape/cloud storage)]

Level:        Registers  Cache (SRAM)  Main Memory  Secondary Storage     Tertiary Storage
Speed (ns):   1s         10s-100s      100s         10,000,000s (10s ms)  10,000,000,000s (10s sec)
Size (bytes): 100s       Ks-Ms         Ms           Gs                    Ts
The Principle of Locality

The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.

Two different types of locality:
- Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

For the last 15 years, HW has relied on locality for speed.
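A minimal C sketch (ours, not from the slides) showing both kinds of locality in one program:

```c
#include <stdio.h>

int main(void) {
    int a[1024];
    int sum = 0;

    /* Spatial locality: consecutive elements share cache blocks, so one
       miss brings in neighbors that are referenced soon after. */
    for (int i = 0; i < 1024; i++)
        a[i] = i;

    /* Temporal locality: sum, i, and the loop instructions themselves are
       touched on every iteration and stay resident in the cache. */
    for (int i = 0; i < 1024; i++)
        sum += a[i];

    printf("%d\n", sum);
    return 0;
}
```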
Memory Hierarchy: Apple iMac G5 (1.6 GHz)

                  Reg     L1 Inst  L1 Data  L2      DRAM    Disk
Size              1K      64K      32K      512K    256M    80G
Latency (cycles)  1       3        3        11      88      10^7
Latency (time)    0.6 ns  1.9 ns   1.9 ns   6.9 ns  55 ns   12 ms

Managed by compiler (registers); managed by hardware (caches); managed by OS, hardware, application (DRAM, disk).

Goal: illusion of a large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
iMac's PowerPC 970: all caches on-chip

[Die photo: registers (1K), L1 (64K instruction), L1 (32K data), 512K L2]
Memory Hierarchy: Terminology

Hit: data appears in some block in the upper level (example: Block X)
- Hit Rate: the fraction of memory accesses found in the upper level
- Hit Time: RAM access time + time to determine hit/miss

Miss: data needs to be retrieved from a block in the lower level (Block Y)
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
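These three terms combine into the standard average memory access time (AMAT) formula, which motivates the cache optimizations later in the chapter; the numbers below are assumed purely for illustration:

$$\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}$$

For example, with a 1 ns hit time, a 5% miss rate, and a 100 ns miss penalty: AMAT = 1 + 0.05 x 100 = 6 ns.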
Q1: Where can a block be placed in the upper level?

Block 12 placed in an 8-block cache: fully associative, direct mapped, 2-way set associative
- S.A. mapping = block number modulo number of sets

[Figure: 8-block cache (blocks 0-7) against a 32-block memory (blocks 0-31); fully mapped: block 12 can go anywhere; direct mapped: (12 mod 8) = 4; 2-way assoc: set (12 mod 4) = 0]
Sources of Cache Misses

Compulsory (cold start or process migration, first reference): first access to a block
- Cold fact of life: not a whole lot you can do about it
- Note: if you are going to run billions of instructions, compulsory misses are insignificant

Capacity: cache cannot contain all blocks accessed by the program
- Solution: increase cache size

Conflict (collision): multiple memory locations mapped to the same cache location
- Solution 1: increase cache size
- Solution 2: increase associativity

Coherence (invalidation): other process (e.g., I/O) updates memory
Q2: How is a block found if it is in the upper level?

[Figure: block address = Tag | Index, followed by Block offset; the index drives set select, the block offset drives data select]

- Index used to look up candidates in cache: index identifies the set
- If no candidates match, then declare cache miss
- Block is minimum quantum of caching: data select field used to select data within block
- Many caching applications don't have a data select field
Block Size and Spatial Locality

Block is the unit of transfer between the cache and memory.

[Figure: CPU address split into block address (tag, 32-b bits) and offset (b bits); 2^b = block size, a.k.a. line size (in bytes); example: 4-word block (Word0 Word1 Word2 Word3), b = 2]

Larger block sizes mean:
- less tag overhead
- exploit fast burst transfers from DRAM
- exploit fast burst transfers over wide busses

What are the disadvantages of increasing block size?
- Fewer blocks => more conflicts. Can waste bandwidth.
Review: Direct Mapped Cache

Direct mapped 2^N byte cache:
- The uppermost (32 - N) bits are always the Cache Tag
- The lowest M bits are the Byte Select (block size = 2^M)

Example: 1 KB direct mapped cache with 32 B blocks
- Tag checked to verify block; byte select chooses byte within block

[Figure: 32-bit address split into Cache Tag (bits 31-10, ex: 0x50), Cache Index (bits 9-5, ex: 0x01), and Byte Select (bits 4-0, ex: 0x00); cache array of Valid Bit, Cache Tag, and Cache Data rows holding Byte 0-Byte 31, Byte 32-Byte 63, ..., Byte 992-Byte 1023]
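A minimal C sketch of the address split for this exact configuration (5 offset bits, 5 index bits, 22 tag bits); the example address is ours, chosen to reproduce the slide's tag/index/byte values:

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5   /* 32 B blocks = 2^5           */
#define INDEX_BITS  5   /* 1 KB / 32 B = 32 blocks = 2^5 */

static uint32_t byte_select(uint32_t a) { return a & ((1u << OFFSET_BITS) - 1); }
static uint32_t cache_index(uint32_t a) { return (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
static uint32_t cache_tag(uint32_t a)   { return a >> (OFFSET_BITS + INDEX_BITS); }

int main(void) {
    uint32_t addr = 0x00014020;   /* arbitrary example address */
    /* Prints tag=0x50 index=0x1 byte=0x0, matching the slide's example. */
    printf("tag=0x%x index=0x%x byte=0x%x\n",
           (unsigned)cache_tag(addr), (unsigned)cache_index(addr),
           (unsigned)byte_select(addr));
    return 0;
}
```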
Review: Set Associative Cache

N-way set associative: N entries per cache index
- N direct mapped caches operate in parallel

Example: two-way set associative cache
- Cache Index selects a set from the cache
- Data is selected based on the tag result

[Figure: address split into Cache Tag (bits 31-9), Cache Index (bits 8-5), and Byte Select (bits 4-0); two ways, each with Valid, Cache Tag, and Cache Data (Cache Block 0, ...); two comparators feed an OR gate for Hit, and a mux (Sel1/Sel0) picks the matching way's Cache Block]
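The same lookup as a hedged C sketch: hardware probes both ways in parallel with two comparators, which a loop can only approximate. The sizes mirror the slide's figure (16 sets from the 4-bit index, 32 B blocks); the structure names are ours:

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS  2
#define SETS  16    /* 4-bit Cache Index (bits 8-5) */
#define BLOCK 32    /* 5-bit Byte Select            */

struct line { bool valid; uint32_t tag; uint8_t data[BLOCK]; };
static struct line cache[SETS][WAYS];

/* Returns the matching way (the mux select), or -1 on a cache miss. */
static int lookup(uint32_t addr) {
    uint32_t index = (addr / BLOCK) % SETS;
    uint32_t tag   = addr / (BLOCK * SETS);
    for (int w = 0; w < WAYS; w++)   /* the two "Compare" blocks */
        if (cache[index][w].valid && cache[index][w].tag == tag)
            return w;
    return -1;                       /* no candidate matched */
}
```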
Review: Fully Associative Cache

Fully associative: every block can hold any line
- Address does not include a cache index
- Compare Cache Tags of all cache entries in parallel

Example: with 32 B blocks, we need N 27-bit comparators
- Still have byte select to choose from within block

[Figure: address split into Cache Tag (bits 31-5, 27 bits long) and Byte Select (bits 4-0, ex: 0x01); every entry's tag is compared in parallel (=) against the incoming tag, entries holding Byte 0-Byte 31, Byte 32-Byte 63, ...]
Q3: Which block should be replaced on a miss?

Easy for direct mapped.
Set associative or fully associative:
- LRU (Least Recently Used): appealing, but hard to implement for high associativity
- Random: easy, but how well does it work?

Data cache miss rates, LRU vs. random:

Assoc:   2-way          4-way          8-way
Size     LRU     Ran    LRU     Ran    LRU     Ran
16K      5.2%    5.7%   4.7%    5.3%   4.4%    5.0%
64K      1.9%    2.0%   1.5%    1.7%   1.4%    1.5%
256K     1.15%   1.17%  1.13%   1.13%  1.12%   1.12%
Q4: What happens on a write?

                     Write-Through               Write-Back
Policy               Data written to cache       Write data only to the
                     block, also written to      cache block; update lower
                     lower-level memory          level when a block falls
                                                 out of the cache
Debug                Easy                        Hard
Can read misses
produce writes?      No                          Yes
Do repeated writes
make it to
lower level?         Yes                         No

Additional option: let writes to an un-cached address allocate a new cache line (write-allocate). A sketch of the two policies follows.
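A minimal C sketch of the two policies on a store hit; write_lower_level is a hypothetical helper standing in for the memory below:

```c
struct cache_line { unsigned data[8]; int dirty; };

void write_lower_level(struct cache_line *line, int word, unsigned val);

/* Write-through: every store is propagated below immediately. */
void store_write_through(struct cache_line *line, int word, unsigned val) {
    line->data[word] = val;
    write_lower_level(line, word, val);
}

/* Write-back: only the cache copy is updated; the dirty bit defers the
   lower-level write until the line falls out of the cache. */
void store_write_back(struct cache_line *line, int word, unsigned val) {
    line->data[word] = val;
    line->dirty = 1;
}
```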
Write Buffers for Write-Through Caches

[Figure: Processor -> Cache -> Write Buffer -> Lower Level Memory; the write buffer holds data awaiting write-through to lower level memory]

Q. Why a write buffer?
A. So CPU doesn't stall.

Q. Why a buffer, why not just one register?
A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for write buffer?
A. Yes! Drain buffer before next read, or check write buffers for match on reads.
5 Basic Cache Optimizations

Reducing miss rate:
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)

Reducing miss penalty:
4. Multilevel caches

Reducing hit time:
5. Giving reads priority over writes (e.g., read completes before earlier writes in write buffer)
RISC: The Integrated Systems View (Discussion of Papers)

- "The Case for the Reduced Instruction Set Computer", Dave Patterson and David Ditzel
- "Comments on 'The Case for the Reduced Instruction Set Computer'", Doug Clark and William Strecker
- "Retrospective on High-Level Computer Architecture", David Ditzel and David Patterson

In-class discussion of these papers.
What is virtual memory?

[Figure: virtual address = V page no. | offset (10-bit offset); the Page Table Base Reg indexes into the page table, which is located in physical memory; each entry holds V (valid), Access Rights, and PA; the physical address = P page no. | offset, mapping the Virtual Address Space onto the Physical Address Space]

Virtual memory => treat main memory as a cache for the disk.
- Terminology: blocks in this cache are called "pages"; typical size of a page: 1K - 8K
- The page table maps virtual page numbers to physical frames (PTE = Page Table Entry)
What is in a Page Table Entry (PTE)?
- Pointer to next-level page table or to actual page
- Permission bits: valid, read-only, read-write, write-only

Example: Intel x86 architecture PTE
- Address same format as on the previous slide (10, 10, 12-bit offset)
- Intermediate page tables are called "directories"

[Figure: PTE layout; bits 31-12 Page Frame Number (Physical Page Number), bits 11-9 free for the OS, then L, D, A, PCD, PWT, U, W, P in bits 7-0]

- P: Present (same as valid bit in other architectures)
- W: Writeable
- U: User accessible
- PWT: Page write transparent: external cache write-through
- PCD: Page cache disabled (page cannot be cached)
- A: Accessed: page has been accessed recently
- D: Dirty (PTE only): page has been modified recently
- L: L=1 => 4MB page (directory only); bottom 22 bits of virtual address serve as offset
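The flag bits above, written out as C masks (bit positions are as on the slide; the helper names are ours):

```c
#include <stdint.h>

#define PTE_P   (1u << 0)   /* Present (valid)                      */
#define PTE_W   (1u << 1)   /* Writeable                            */
#define PTE_U   (1u << 2)   /* User accessible                      */
#define PTE_PWT (1u << 3)   /* Page write transparent               */
#define PTE_PCD (1u << 4)   /* Page cache disabled                  */
#define PTE_A   (1u << 5)   /* Accessed recently                    */
#define PTE_D   (1u << 6)   /* Dirty (PTE only)                     */
#define PTE_L   (1u << 7)   /* 1 => 4MB page (directory entry only) */

/* Page Frame Number occupies bits 31-12. */
static inline uint32_t pte_frame(uint32_t pte)   { return pte & 0xFFFFF000u; }
static inline int      pte_present(uint32_t pte) { return (pte & PTE_P) != 0; }
```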
Three Advantages of Virtual Memory

Translation:
- Program can be given a consistent view of memory, even though physical memory is scrambled
- Makes multithreading reasonable (now used a lot!)
- Only the most important part of a program (the working set) must be in physical memory
- Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later

Protection:
- Different threads (or processes) protected from each other
- Different pages can be given special behavior (read only, invisible to user programs, etc.)
- Kernel data protected from user programs
- Very important for protection from malicious programs

Sharing:
- Can map the same physical page to multiple users (shared memory)
Large Address Space Support

[Figure: 32-bit virtual address = Virtual P1 index (10 bits) | Virtual P2 index (10 bits) | Offset (12 bits); PageTablePtr points to the first-level table (4-byte entries), whose entries point to second-level tables (4-byte entries); the selected PTE supplies the Physical Page #, concatenated with the offset to form the physical address]

- Single-level page table: with 4KB pages, a 32-bit address space needs 1M entries, and each process needs its own page table!
- Multi-level page table: can allow sparseness of the page table; portions of the table can be swapped to disk
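A C sketch of the 10/10/12 two-level walk, assuming (for simplicity) that page-table memory is directly addressable from the walker:

```c
#include <stdint.h>

#define PRESENT 0x1u

extern uint32_t *page_dir;   /* 1024 four-byte first-level entries */

/* Returns the physical address for va, or 0 if a level is not present
   (e.g., that portion of the table is unallocated or swapped out). */
uint32_t translate(uint32_t va) {
    uint32_t p1  = (va >> 22) & 0x3FF;   /* Virtual P1 index, 10 bits */
    uint32_t p2  = (va >> 12) & 0x3FF;   /* Virtual P2 index, 10 bits */
    uint32_t off = va & 0xFFF;           /* Offset, 12 bits           */

    uint32_t pde = page_dir[p1];
    if (!(pde & PRESENT)) return 0;

    uint32_t *table = (uint32_t *)(uintptr_t)(pde & 0xFFFFF000u);
    uint32_t pte = table[p2];
    if (!(pte & PRESENT)) return 0;

    return (pte & 0xFFFFF000u) | off;    /* Physical Page # | Offset */
}
```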
Translation Look-Aside Buffers (TLB)

A TLB is a cache on translations: fully associative, set associative, or direct mapped.

[Figure: CPU issues a VA to the TLB; on a hit, the PA goes straight to the cache; on a miss, translation walks the page table, refills the TLB, and then supplies the PA; cache misses continue to main memory]

TLBs are:
- Small: typically not more than 128 - 256 entries
- Fully associative
Caching Applied to Address Translation

[Figure: CPU sends a virtual address to the TLB; if cached, the physical address goes directly to physical memory for the data read or write; if not, the MMU translates, the result is saved in the TLB, and the access proceeds]

The question is one of page locality: does it exist?
- Instruction accesses spend a lot of time on the same page (since accesses are sequential)
- Stack accesses have definite locality of reference
- Data accesses have less page locality, but still some

Can we have a TLB hierarchy? Sure: multiple levels at different sizes/speeds.
What Actually Happens on a TLB Miss?

Hardware-traversed page tables:
- On TLB miss, hardware in the MMU looks at the current page table to fill the TLB (may walk multiple levels)
  - If the PTE is valid, the hardware fills the TLB and the processor never knows
  - If the PTE is marked invalid, it causes a Page Fault, after which the kernel decides what to do

Software-traversed page tables (like MIPS):
- On TLB miss, the processor receives a TLB fault
- Kernel traverses the page table to find the PTE
  - If the PTE is valid, it fills the TLB and returns from the fault
  - If the PTE is marked invalid, it internally calls the Page Fault handler

Most chipsets provide hardware traversal.
Modern operating systems tend to have more TLB faults since they use translation for many things.
- Examples: shared segments, user-level portions of an operating system
Clock Algorithm: Not Recently Used

[Figure: circular set of physical pages in memory, each with use and dirty bits (e.g., 1 0, 1 0, 0 1, 1 1, 0 0); a single clock hand advances only on page fault, checking for pages not used recently and marking pages as not used recently]

Clock algorithm: approximate LRU (an approximation to an approximation to MIN)
- Replace an old page, not the oldest page

Details: hardware "use" bit per physical page:
- Hardware sets the use bit on each reference
- If the use bit isn't set, the page has not been referenced in a long time

On page fault: advance clock hand (not real time) and check the use bit:
- 1 => used recently; clear it and leave the page alone
- 0 => selected candidate for replacement
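The hand-advance logic, as a short C sketch (the page count and structure are ours; the use bit is the hardware-set bit from the slide):

```c
#define NPAGES 64

struct ppage { int use; int dirty; };
static struct ppage pages[NPAGES];
static int hand = 0;   /* advances only on page fault */

/* Called on a page fault; returns the physical page to replace. */
int clock_pick_victim(void) {
    for (;;) {
        if (pages[hand].use) {
            pages[hand].use = 0;       /* used recently: second chance */
            hand = (hand + 1) % NPAGES;
        } else {
            int victim = hand;         /* not referenced recently */
            hand = (hand + 1) % NPAGES;
            return victim;
        }
    }
}
```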
Summary #2/3: Caches

The Principle of Locality:
- Programs access a relatively small portion of the address space at any instant of time
  - Temporal Locality: locality in time
  - Spatial Locality: locality in space

Three major categories of cache misses:
- Compulsory misses: sad facts of life. Example: cold start misses
- Capacity misses: increase cache size
- Conflict misses: increase cache size and/or associativity. Nightmare scenario: ping pong effect!

Write policy: write-through vs. write-back

Today CPU time is a function of (ops, cache misses) vs. just f(ops): this affects compilers, data structures, and algorithms.
Summary #3/3: TLB, Virtual Memory

Page tables map virtual addresses to physical addresses; TLBs are important for fast translation.
- TLB misses are significant in processor performance: funny times, as most systems can't access all of the second-level cache without TLB misses!

Caches, TLBs, and virtual memory are all understood by examining how they deal with 4 questions:
1) Where can a block be placed?
2) How is a block found?
3) What block is replaced on a miss?
4) How are writes handled?

Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy benefits, but computers remain insecure.

Prepare for debate + quiz on Wednesday.
I/O Systems

[Figure: Processor (with interrupts) and Cache attach to a Memory - I/O Bus, which connects Main Memory and I/O controllers for Disk, Disk, Graphics, and Network]
Example: Memory-Mapped Display Controller

Memory-mapped:
- Hardware maps control registers and display memory into the physical address space
  - Addresses set by hardware jumpers or programming at boot time
- Simply writing to display memory (also called the "frame buffer") changes the image on screen
  - Addr: 0x8000F000 - 0x8000FFFF
- Writing a graphics description to the command-queue area
  - Say, enter a set of triangles that describe some scene
  - Addr: 0x80010000 - 0x8001FFFF
- Writing to the command register may cause on-board graphics hardware to do something
  - Say, render the above scene
  - Addr: 0x0007F004
- Can protect with page tables

[Figure: physical address space with the Graphics Command Queue at 0x80010000-0x80020000, Display Memory at 0x8000F000, the Command register at 0x0007F004, and Status at 0x0007F000]
Hard Disk Drives

[Figure: read/write head and side view of the platter assembly; IBM/Hitachi Microdrive; Western Digital drive]
http://www.storagereview.com/guide/

Seagate Barracuda
- 2TB! 400 GB/in2
- 4 platters, 2 heads each
- 3.5" platters
- Perpendicular recording
- 7200 RPM
- 4.2 ms latency (?)
- 100 MB/sec transfer speed
- 32 MB cache
Properties of a Hard Magnetic Disk

[Figure: platters, with a track and a sector marked]

Properties:
- Independently addressable element: sector
  - The OS always transfers groups of sectors together, called "blocks"
- A disk can directly access any given block of information it contains (random access); any file can be accessed either sequentially or randomly
- A disk can be rewritten in place: it is possible to read/modify/write a block from the disk

Typical numbers:
- 500 to more than 20,000 tracks per surface
- 32 to 800 sectors per track
  - A sector is the smallest unit that can be read or written

Zoned bit recording:
- Constant bit density: more sectors on outer tracks
- Speed varies with track location
MBits per square inch: DRAM as % of Disk over time

[Figure: bar chart, 1974-1998, of DRAM areal density as a percentage of disk areal density (0%-50% axis); the labeled pairs are 0.2 v. 1.7 Mb/si, 9 v. 22 Mb/si, and 470 v. 3000 Mb/si]

source: New York Times, 2/23/98, page C3, "Makers of disk drives crowd even more data into even smaller spaces"
Nano-layered Disk Heads

- The special sensitivity of the disk head comes from the "Giant Magneto-Resistive effect" (GMR)
- IBM is (was) the leader in this technology

[Figure: disk head cross-section, with the coil for writing labeled]
Disk Figure of Merit: Areal Density

- Bits recorded along a track: metric is Bits Per Inch (BPI)
- Number of tracks per surface: metric is Tracks Per Inch (TPI)
- Disk designs brag about bit density per unit area
- Metric is Bits Per Square Inch: Areal Density = BPI x TPI

Year    Areal Density
1973    2
1979    8
1989    63
1997    3,090
2000    17,100
2006    130,000
2007    164,000
2009    400,000

[Figure: areal density vs. year, 1970-2010, on a log scale from 1 to 1,000,000]
Disk I/O Performance

[Figure: a user thread's request passes through a queue (OS paths) to the device controller and the disk; Response Time = Queue + Disk Service Time; companion graph plots response time (ms, 0-300) against throughput (utilization, % of total BW, 0%-100%)]

Performance of disk drive/file system:
- Metrics: Response Time, Throughput
- Contributing factors to latency:
  - Software paths (can be loosely modeled by a queue)
  - Hardware controller
  - Physical disk media
- Queuing behavior:
  - Can lead to big increases in latency as utilization approaches 100%
Magnetic Disk Characteristic

[Figure: platter stack with sector, track, cylinder, and head labeled]

Cylinder: all the tracks under the head at a given point on all surfaces.

Read/write data is a three-stage process:
- Seek time: position the head/arm over the proper track (into the proper cylinder)
- Rotational latency: wait for the desired sector to rotate under the read/write head
- Transfer time: transfer a block of bits (sector) under the read-write head

Disk Latency = Queueing Time + Controller time + Seek Time + Rotation Time + Xfer Time

[Figure: a request passes through the software queue (device driver) and the hardware controller before the media time (seek + rotation + transfer) produces the result]

Highest bandwidth: transfer a large group of blocks sequentially from one track.
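Plugging in assumed numbers for a 7200 RPM drive like the Barracuda above: average rotational latency is half a revolution,

$$\text{Rotation} = \frac{0.5}{7200/60\ \text{rev/s}} \approx 4.2\ \text{ms},$$

matching the 4.2 ms quoted earlier. With an assumed 8 ms seek, 0.1 ms of controller time, no queueing, and a 4 KB transfer at 100 MB/sec (0.04 ms), Disk Latency = 0 + 0.1 + 8 + 4.2 + 0.04, or roughly 12.3 ms.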
Array Reliability

- Reliability of N disks = reliability of 1 disk / N
- 50,000 hours / 70 disks = 700 hours
- Disk system MTTF: drops from 6 years to 1 month!
- Arrays (without redundancy) too unreliable to be useful!
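Spelling out the arithmetic behind the slide's numbers (the division assumes independent failures):

$$\text{MTTF}_{\text{array}} = \frac{\text{MTTF}_{\text{disk}}}{N} = \frac{50{,}000\ \text{hours}}{70} \approx 714\ \text{hours} \approx 1\ \text{month}$$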
Redundant Arrays of Disks

- Files are "striped" across multiple spindles
- Redundancy yields high data availability
  - Disks will fail
  - Contents reconstructed from data redundantly stored in the array
    - Capacity penalty to store it
    - Bandwidth penalty to update
- Techniques:
  - Mirroring/Shadowing (high capacity cost)
  - Horizontal Hamming Codes (overkill)
  - Parity & Reed-Solomon Codes
  - Failure Prediction (no capacity overhead!): the VaxSimPlus technique is controversial
- Hot spares support reconstruction in parallel with access: very high media availability can be achieved
Redundant Arrays of Disks
RAID 1: Disk Mirroring/Shadowing

[Figure: a recovery group in which each disk is fully duplicated onto its "shadow"]

- Each disk is fully duplicated onto its "shadow"; very high availability can be achieved
- Bandwidth sacrifice on write: logical write = two physical writes
- Reads may be optimized
- Most expensive solution: 100% capacity overhead
- Targeted for high I/O rate, high availability environments
Redundant Arrays of Disks
RAID 5+: High I/O Rate Parity

- A logical write becomes four physical I/Os
- Independent writes possible because of interleaved parity
- Reed-Solomon Codes ("Q") for protection during reconstruction
- Targeted for mixed applications

[Figure: data and parity interleaved across 5 disk columns, with increasing logical disk addresses running down the stripes; a stripe unit is one cell:
D0   D1   D2   D3   P
D4   D5   D6   P    D7
D8   D9   P    D10  D11
D12  P    D13  D14  D15
P    D16  D17  D18  D19
D20  D21  D22  D23  P
...]
Problems of Disk Arrays: Small Writes

RAID-5 small write algorithm: 1 logical write = 2 physical reads + 2 physical writes.

[Figure: stripe D0 D1 D2 D3 P; new data D0' is XORed with the old data D0 (1. Read) and the old parity P (2. Read) to produce the new parity P'; then D0' (3. Write) and P' (4. Write) are written, leaving the stripe D0' D1 D2 D3 P']
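The parity update itself is a per-byte XOR; a minimal C sketch (the function name and signature are ours):

```c
#include <stddef.h>
#include <stdint.h>

/* P' = P ^ D_old ^ D_new, applied over one stripe unit. Steps 1-2 of the
   slide (the reads) fetch d_old and parity; steps 3-4 (the writes) store
   d_new and the parity buffer updated here. */
void raid5_parity_update(uint8_t *parity, const uint8_t *d_old,
                         const uint8_t *d_new, size_t n) {
    for (size_t i = 0; i < n; i++)
        parity[i] ^= d_old[i] ^ d_new[i];
}
```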
System Availability: Orthogonal RAIDs

[Figure: an array controller fans out to several string controllers, each driving a string of disks; a data recovery group takes one disk from each string]

- Data Recovery Group: unit of data redundancy
- Redundant Support Components: fans, power supplies, controller, cables
- End to End Data Integrity: internal parity protected data paths
System-Level Availability

[Figure: fully dual redundant system: two hosts, each with an I/O controller, cross-connected to two array controllers and down to the disk recovery groups]

Goal: no single points of failure.
With duplicated paths, higher performance can be obtained when there are no failures.