

IT6030 Advanced Computer Architecture

Chapter 3
Memory and Storage Systems

Nguyen Kim Khanh
Department of Computer Engineering
School of Information and Communication Technology
Hanoi University of Science and Technology

    Since 1980, CPU has outpaced DRAM ...

[Figure: performance (1/latency) vs. year. CPU performance improved ~60% per year, DRAM only ~9% per year; the gap grew ~50% per year.]

How do architects address this gap? Put small, fast cache memories between CPU and DRAM: create a memory hierarchy.

Memory Hierarchy
Take advantage of the principle of locality to:
- Present as much memory as is available in the cheapest technology
- Provide access at the speed offered by the fastest technology

[Figure: the memory hierarchy. Processor (control, datapath, registers, on-chip cache) -> second-level cache (SRAM) -> main memory (DRAM/FLASH/PCM) -> secondary storage (disk/FLASH/PCM) -> tertiary storage (tape/cloud storage).
Speed (ns):   1s    | 10s-100s | 100s | 10,000,000s (10s ms) | 10,000,000,000s (10s sec)
Size (bytes): 100s  | Ks-Ms    | Ms   | Gs                   | Ts]

The Principle of Locality

The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.

Two different types of locality:
- Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

For the last 15 years, hardware has relied on locality for speed.
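As a concrete illustration (a minimal sketch, not from the slides), the loop below walks an array at consecutive addresses (spatial locality) while reusing the accumulator on every iteration (temporal locality):

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];
        long sum = 0;
        for (int i = 0; i < N; i++)
            a[i] = i;                 /* sequential addresses: spatial locality */
        for (int i = 0; i < N; i++)
            sum += a[i];              /* sum reused every iteration: temporal locality */
        printf("sum = %ld\n", sum);
        return 0;
    }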


Memory Hierarchy: Apple iMac G5 (1.6 GHz)

Level     Size   Latency (cycles, time)   Managed by
Reg       1K     1, 0.6 ns                compiler
L1 Inst   64K    3, 1.9 ns                hardware
L1 Data   32K    3, 1.9 ns                hardware
L2        512K   11, 6.9 ns               hardware
DRAM      256M   88, 55 ns                OS, hardware, application
Disk      80G    10^7, 12 ms              OS, hardware, application

Goal: the illusion of a large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

[Die photo: iMac's PowerPC 970, all caches on-chip: registers (1K), L1 (64K instruction), L1 (32K data), 512K L2.]

Memory Hierarchy: Terminology

- Hit: data appears in some block in the upper level (example: Block X)
  - Hit Rate: the fraction of memory accesses found in the upper level
  - Hit Time: time to access the upper level = RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (Block Y)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
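The slide stops short of combining these terms; the standard combination (a well-known formula, though not shown here) is the average memory access time. With assumed numbers of a 1 ns hit time, 5% miss rate, and 100 ns miss penalty:

    \text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}
                = 1 + 0.05 \times 100 = 6\ \text{ns}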


Q1: Where can a block be placed in the upper level?

Block 12 placed in an 8-block cache: fully associative, direct mapped, 2-way set associative.
Set-associative mapping = block number modulo number of sets.

[Figure: memory blocks 0-31 and an 8-block cache (blocks 0-7) shown three ways. Fully associative: block 12 can go in any of the 8 blocks. Direct mapped: (12 mod 8) = 4, so block 12 goes only in cache block 4. 2-way set associative: (12 mod 4) = 0, so block 12 goes in either block of set 0.]
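A minimal sketch of the mapping rule (the constants mirror the slide's 8-block example; nothing else is from the slides):

    #include <stdio.h>

    int main(void) {
        unsigned block = 12;           /* memory block number from the slide  */
        unsigned direct_sets = 8;      /* direct mapped: 8 sets of one block  */
        unsigned twoway_sets = 4;      /* 2-way: 8 blocks / 2 ways = 4 sets   */

        printf("direct mapped:   cache block %u\n", block % direct_sets); /* 12 mod 8 = 4 */
        printf("2-way set assoc: set %u\n", block % twoway_sets);         /* 12 mod 4 = 0 */
        /* fully associative: block 12 may go in any of the 8 cache blocks */
        return 0;
    }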

Sources of Cache Misses

- Compulsory (cold start or process migration, first reference): the first access to a block. Cold fact of life: not a whole lot you can do about it. Note: if you are going to run billions of instructions, compulsory misses are insignificant.
- Capacity: the cache cannot contain all blocks accessed by the program. Solution: increase cache size.
- Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity.
- Coherence (invalidation): another process (e.g., I/O) updates memory.

Q2: How is a block found if it is in the upper level?

[Figure: the block address splits into Tag | Index, followed by the block offset; the index drives set select, the offset drives data select.]

- Index is used to look up candidates in the cache; the index identifies the set
- If no candidates match, then declare a cache miss
- Block is the minimum quantum of caching
  - The data select field is used to select data within the block
  - Many caching applications don't have a data select field

Block Size and Spatial Locality

A block is the unit of transfer between the cache and memory.

[Figure: a 4-word block (Word0-Word3); the CPU address is split into the block address (32-b bits, the tag) and the offset (b bits), where 2^b = block size, a.k.a. line size, in bytes.]

Larger block sizes:
- less tag overhead
- exploit fast burst transfers from DRAM
- exploit fast burst transfers over wide busses

What are the disadvantages of increasing block size? Fewer blocks => more conflicts; can waste bandwidth.


Review: Direct Mapped Cache

Direct mapped 2^N byte cache:
- The uppermost (32 - N) bits are always the Cache Tag
- The lowest M bits are the Byte Select (block size = 2^M)

Example: 1 KB direct mapped cache with 32 B blocks
- Cache Tag: bits 31-10 (ex: 0x50), checked to verify the block
- Cache Index: bits 9-5 (ex: 0x01)
- Byte Select: bits 4-0 (ex: 0x00), chooses the byte within the block

[Figure: array of (valid bit, cache tag, cache data) entries; 32 blocks of bytes 0-31 each, covering bytes 0-1023; the tag 0x50 is stored at the indexed entry.]
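A minimal sketch of the slide's bit slicing (the address value is hypothetical, chosen so it reproduces the slide's tag/index/byte-select example):

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 5                /* 32 B block  = 2^5             */
    #define INDEX_BITS  5                /* 1 KB / 32 B = 32 blocks = 2^5 */

    int main(void) {
        uint32_t addr = 0x00014020u;     /* hypothetical 32-bit address */

        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        /* prints tag=0x50 index=0x1 offset=0x0 -- the slide's example values */
        printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
        return 0;
    }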

Review: Set Associative Cache

N-way set associative: N entries per cache index
- N direct mapped caches operate in parallel

Example: two-way set associative cache
- Cache Index (bits 8-5) selects a set from the cache; Cache Tag is bits 31-9, Byte Select bits 4-0
- The tags of the set's entries are compared in parallel
- Data is selected based on the tag result

[Figure: two banks of (valid, cache tag, cache data) indexed by the same cache index; two comparators drive Sel1/Sel0 of a 2-to-1 mux over the two cache blocks, and an OR of the compare results produces Hit and the selected Cache Block.]

Review: Fully Associative Cache

Fully associative: every block can hold any line
- The address does not include a cache index
- Compare the cache tags of all cache entries in parallel

Example: block size = 32 B; we need N 27-bit comparators
- Still have byte select to choose from within the block

[Figure: Cache Tag (27 bits long, bits 31-5) compared against every stored tag with parallel '=' comparators; Byte Select is bits 4-0 (ex: 0x01); data rows hold bytes 0-31, 32-63, ...]

Q3: Which block should be replaced on a miss?

- Easy for direct mapped
- Set associative or fully associative:
  - LRU (Least Recently Used): appealing, but hard to implement for high associativity
  - Random: easy, but how well does it work?

Miss rates:

         2-way           4-way           8-way
Size     LRU     Random  LRU     Random  LRU     Random
16K      5.2%    5.7%    4.7%    5.3%    4.4%    5.0%
64K      1.9%    2.0%    1.5%    1.7%    1.4%    1.5%
256K     1.15%   1.17%   1.13%   1.13%   1.12%   1.12%


Q4: What happens on a write?

                                   Write-Through              Write-Back
Policy                             Data written to the        Write data only to the
                                   cache block, also          cache block; update the
                                   written to lower-level     lower level when a block
                                   memory                     falls out of the cache
Debug                              Easy                       Hard
Do read misses produce writes?     No                         Yes
Do repeated writes make it
to the lower level?                Yes                        No

Additional option: let writes to an un-cached address allocate a new cache line (write-allocate).

Write Buffers for Write-Through Caches

[Figure: Processor -> Cache -> Lower-Level Memory, with a write buffer alongside holding data awaiting write-through to lower-level memory.]

Q. Why a write buffer?
A. So the CPU doesn't stall.

Q. Why a buffer, why not just one register?
A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Drain the buffer before the next read, or check the write buffer for a match on reads.
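A minimal sketch of the match-on-read option (the structure and all names are assumptions, not from the slides):

    #include <stdint.h>
    #include <stdio.h>

    #define WB_ENTRIES 4

    struct wb_entry { uint32_t addr; uint32_t data; int valid; };
    static struct wb_entry wb[WB_ENTRIES];

    /* Queue a write awaiting write-through to lower-level memory. */
    static void buffer_write(int slot, uint32_t addr, uint32_t data) {
        wb[slot].addr = addr; wb[slot].data = data; wb[slot].valid = 1;
    }

    /* Check the buffer on a read: returns 1 and forwards the buffered value
     * on an address match; otherwise the read goes to lower-level memory. */
    static int read_check(uint32_t addr, uint32_t *data) {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (wb[i].valid && wb[i].addr == addr) { *data = wb[i].data; return 1; }
        return 0;
    }

    int main(void) {
        buffer_write(0, 0x1000u, 42u);   /* write still sitting in the buffer */
        uint32_t v;
        if (read_check(0x1000u, &v))     /* RAW hazard caught: forward from buffer */
            printf("forwarded %u\n", v);
        return 0;
    }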

5 Basic Cache Optimizations

Reducing miss rate:
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)

Reducing miss penalty:
4. Multilevel caches
5. Giving reads priority over writes (e.g., a read completes before earlier writes in the write buffer)

RISC: The Integrated Systems View (Discussion of Papers)

- "The Case for the Reduced Instruction Set Computer", Dave Patterson and David Ditzel
- "Comments on 'The Case for the Reduced Instruction Set Computer'", Doug Clark and William Strecker
- "Retrospective on High-Level Computer Architecture", David Ditzel and David Patterson

In-class discussion of these papers.


What is virtual memory?

[Figure: a virtual address (virtual page number, 10-bit offset shown) indexes into the page table, which is located in physical memory and pointed to by the Page Table Base Register; each entry holds (V, access rights, PA) and yields the physical address (physical page number, offset), mapping the virtual address space onto the physical address space.]

- Virtual memory => treat main memory as a cache for the disk
- Terminology: blocks in this cache are called pages; typical page size: 1 KB - 8 KB
- The page table maps virtual page numbers to physical frames (PTE = Page Table Entry)

What is in a Page Table Entry (PTE)?

- A pointer to the next-level page table or to the actual page
- Permission bits: valid, read-only, read-write, write-only

Example: Intel x86 architecture PTE
- Address format as on the previous slide (10, 10, 12-bit offset)
- Intermediate page tables are called Directories

Layout (bits 31-12 | 11-9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0):
Page Frame Number (Physical Page Number) | Free (OS) | 0 | L | D | A | PCD | PWT | U | W | P

- P: Present (same as the valid bit in other architectures)
- W: Writeable
- U: User accessible
- PWT: Page write transparent: external cache write-through
- PCD: Page cache disabled (page cannot be cached)
- A: Accessed: page has been accessed recently
- D: Dirty (PTE only): page has been modified recently
- L: L=1 => 4 MB page (directory only); the bottom 22 bits of the virtual address then serve as the offset

Three Advantages of Virtual Memory

- Translation:
  - A program can be given a consistent view of memory, even though physical memory is scrambled
  - Makes multithreading reasonable (now used a lot!)
  - Only the most important part of the program (the working set) must be in physical memory
  - Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
- Protection:
  - Different threads (or processes) are protected from each other
  - Different pages can be given special behavior (read only, invisible to user programs, etc.)
  - Kernel data is protected from user programs
  - Very important for protection from malicious programs
- Sharing:
  - Can map the same physical page to multiple users (shared memory)

Large Address Space Support

[Figure: two-level translation. The virtual address = (10-bit virtual P1 index, 10-bit virtual P2 index, 12-bit offset). The PageTablePtr register locates the first-level table; the P1 index selects a 4-byte entry pointing to a second-level table; the P2 index selects a 4-byte entry holding the physical page number, which combines with the offset to form the physical address.]

- Single-level page table: with 4 KB pages for a 32-bit address space, 1M entries; each process needs its own page table!
- Multi-level page table: allows sparseness of the page table; portions of the table can be swapped to disk
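A minimal runnable sketch of the 10/10/12 two-level walk (physical memory is simulated as an array of frames; every name and value here is an assumption, except the P bit named on the PTE slide):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u
    #define P_BIT     0x1u                     /* Present bit, as in the x86 PTE */

    typedef uint32_t pte_t;
    static pte_t frames[16][PAGE_SIZE / sizeof(pte_t)];  /* simulated physical frames */

    /* Two-level 10/10/12 walk: returns 1 and fills *pa on success, 0 on a fault. */
    int translate(uint32_t dir_frame, uint32_t va, uint32_t *pa) {
        uint32_t p1  = (va >> 22) & 0x3FFu;    /* directory index (top 10 bits) */
        uint32_t p2  = (va >> 12) & 0x3FFu;    /* table index (next 10 bits)    */
        uint32_t off = va & 0xFFFu;            /* page offset (bottom 12 bits)  */

        pte_t pde = frames[dir_frame][p1];
        if (!(pde & P_BIT)) return 0;          /* directory entry not present   */

        pte_t pte = frames[pde >> 12][p2];
        if (!(pte & P_BIT)) return 0;          /* page not present: page fault  */

        *pa = (pte & 0xFFFFF000u) | off;       /* physical frame | offset       */
        return 1;
    }

    int main(void) {
        frames[0][0] = (1u << 12) | P_BIT;     /* directory entry 0 -> table in frame 1 */
        frames[1][1] = (5u << 12) | P_BIT;     /* table entry 1 -> page in frame 5      */

        uint32_t pa;
        if (translate(0, 0x00001234u, &pa))    /* VA 0x1234: p1=0, p2=1, off=0x234 */
            printf("PA = 0x%08x\n", pa);       /* prints PA = 0x00005234 */
        return 0;
    }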


Translation Look-Aside Buffers (TLB)

A TLB is a cache on translations: fully associative, set associative, or direct mapped.

[Figure: translation with a TLB. The CPU presents a VA to the TLB; on a hit, the PA goes to the cache (and on a cache miss, on to main memory); on a TLB miss, the translation unit walks the page table, the TLB is filled, and the access proceeds; data returns to the CPU.]

TLBs are:
- Small: typically not more than 128-256 entries
- Fully associative

Caching Applied to Address Translation

[Figure: the CPU sends a virtual address to the TLB; if the translation is cached, the physical address goes straight to memory for the data read or write; if not, the MMU translates by walking the page table in memory, then the access proceeds.]

The question is one of page locality: does it exist?
- Instruction accesses spend a lot of time on the same page (since accesses are sequential)
- Stack accesses have definite locality of reference
- Data accesses have less page locality, but still some

Can we have a TLB hierarchy? Sure: multiple levels at different sizes/speeds.

What Actually Happens on a TLB Miss?

- Hardware-traversed page tables:
  - On a TLB miss, hardware in the MMU looks at the current page table to fill the TLB (it may walk multiple levels)
  - If the PTE is valid, the hardware fills the TLB and the processor never knows
  - If the PTE is marked invalid, it causes a page fault, after which the kernel decides what to do
- Software-traversed page tables (like MIPS):
  - On a TLB miss, the processor receives a TLB fault
  - The kernel traverses the page table to find the PTE; if the PTE is valid, it fills the TLB and returns from the fault
  - If the PTE is marked invalid, it internally calls the page fault handler
- Most chipsets provide hardware traversal
- Modern operating systems tend to have more TLB faults, since they use translation for many things; examples: shared segments, user-level portions of an operating system
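A minimal sketch of the lookup-then-refill flow (fully associative search, as the TLB slide describes; the structure, names, and the stubbed page walk are assumptions):

    #include <stdint.h>
    #include <stdio.h>

    #define TLB_ENTRIES 8

    struct tlb_entry { uint32_t vpn, pfn; int valid; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Stub standing in for the page-table walk (hardware or software). */
    static uint32_t walk_page_table(uint32_t vpn) { return vpn + 100; }

    uint32_t lookup(uint32_t va) {
        uint32_t vpn = va >> 12, off = va & 0xFFFu;
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (tlb[i].pfn << 12) | off;      /* TLB hit */
        uint32_t pfn = walk_page_table(vpn);          /* TLB miss: walk, then refill */
        tlb[0] = (struct tlb_entry){ vpn, pfn, 1 };   /* naive replacement policy */
        return (pfn << 12) | off;
    }

    int main(void) {
        printf("PA = 0x%x\n", lookup(0x00002345u));   /* miss, walk, refill */
        printf("PA = 0x%x\n", lookup(0x00002345u));   /* hit */
        return 0;
    }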

Clock Algorithm: Not Recently Used

- Clock algorithm: approximate LRU (an approximation to an approximation to MIN); replace an old page, not the oldest page
- Single clock hand: advances only on a page fault! Checks for pages not used recently; marks pages as not used recently

[Figure: physical pages in memory arranged in a circle swept by the clock hand; the page table keeps (used, dirty) bits per page.]

Details:
- Hardware "use" bit per physical page: hardware sets the use bit on each reference; if the use bit isn't set, the page hasn't been referenced in a long time
- On page fault: advance the clock hand (not in real time); check the use bit: 1 => used recently, clear it and leave the page alone; 0 => selected candidate for replacement
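A minimal sketch of the sweep just described (the names and the initial use bits are assumptions, not from the slides):

    #include <stdio.h>

    #define NPAGES 8

    static int use[NPAGES] = {1, 0, 1, 1, 0, 1, 0, 0};  /* hardware-set use bits */
    static int hand = 0;

    /* Advance the hand, clearing use bits of recently used pages,
     * until a page with use == 0 is found: the replacement victim. */
    int pick_victim(void) {
        for (;;) {
            if (use[hand] == 0) {            /* not referenced since last sweep */
                int victim = hand;
                hand = (hand + 1) % NPAGES;
                return victim;
            }
            use[hand] = 0;                   /* used recently: clear and move on */
            hand = (hand + 1) % NPAGES;
        }
    }

    int main(void) {
        printf("evict page %d\n", pick_victim());   /* page 1 with the bits above */
        return 0;
    }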


Summary #2/3: Caches

- The Principle of Locality: programs access a relatively small portion of the address space at any instant of time
  - Temporal locality: locality in time
  - Spatial locality: locality in space
- Three major categories of cache misses:
  - Compulsory misses: sad facts of life; example: cold start misses
  - Capacity misses: increase cache size
  - Conflict misses: increase cache size and/or associativity; nightmare scenario: ping-pong effect!
- Write policy: write-through vs. write-back
- Today CPU time is a function of (ops, cache misses) rather than just f(ops): this affects compilers, data structures, and algorithms

Summary #3/3: TLB, Virtual Memory

- Page tables map virtual addresses to physical addresses; TLBs are important for fast translation
- TLB misses are significant in processor performance: funny times, as most systems can't access all of the 2nd-level cache without TLB misses!
- Caches, TLBs, and virtual memory are all understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) Which block is replaced on a miss? 4) How are writes handled?
- Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy benefits, but computers remain insecure

Prepare for debate + quiz on Wednesday.

I/O Systems

[Figure: Processor (with interrupts) and Cache sit on a Memory-I/O Bus along with Main Memory and I/O controllers for disk, disk, graphics, and network devices.]

Example: Memory-Mapped Display Controller

- Memory-mapped: hardware maps control registers and display memory into the physical address space
  - Addresses set by hardware jumpers or by programming at boot time
- Simply writing to display memory (also called the frame buffer) changes the image on screen
  - Addr: 0x8000F000 - 0x8000FFFF
- Writing a graphics description to the command-queue area, say a set of triangles that describe some scene
  - Addr: 0x80010000 - 0x8001FFFF
- Writing to the command register may cause on-board graphics hardware to do something, say render the above scene
  - Addr: 0x0007F004
- Can protect with page tables

[Figure: physical address space with the Graphics Command Queue at 0x80010000 (up to 0x80020000), Display Memory at 0x8000F000, the Command register at 0x0007F004, and the Status register at 0x0007F000.]
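A hypothetical sketch of driving such a controller from C (the addresses come from the slide; the command encoding, status bit, and all names are assumptions, and the code of course only runs on hardware that decodes these addresses):

    #include <stdint.h>

    #define FRAME_BUF   ((volatile uint8_t  *)0x8000F000u)  /* display memory   */
    #define CMD_QUEUE   ((volatile uint32_t *)0x80010000u)  /* command queue    */
    #define CMD_REG     ((volatile uint32_t *)0x0007F004u)  /* command register */
    #define STATUS_REG  ((volatile uint32_t *)0x0007F000u)  /* status register  */

    #define CMD_RENDER  1u     /* assumed command encoding */
    #define STATUS_BUSY 1u     /* assumed busy bit         */

    void draw(void) {
        FRAME_BUF[0] = 0xFF;            /* writing display memory changes the screen */
        CMD_QUEUE[0] = 0x00000003u;     /* e.g., a triangle-list descriptor (assumed) */
        *CMD_REG = CMD_RENDER;          /* kick the on-board graphics hardware */
        while (*STATUS_REG & STATUS_BUSY)
            ;                           /* poll the assumed busy bit */
    }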


Hard Disk Drives

[Figure: side view of the read/write head over the platters; drives pictured: Western Digital drive, IBM/Hitachi Microdrive. Source: http://www.storagereview.com/guide/]

Seagate Barracuda:
- 2 TB! 400 GB/in2
- 4 platters, 2 heads each; 3.5" platters
- Perpendicular recording
- 7200 RPM; 4.2 ms latency (?)
- 100 MB/sec transfer speed; 32 MB cache

Properties of a Hard Magnetic Disk

[Figure: platters with tracks and sectors.]

Properties:
- Independently addressable element: the sector; the OS always transfers groups of sectors together ("blocks")
- A disk can directly access any given block of information it contains (random access); any file can be accessed either sequentially or randomly
- A disk can be rewritten in place: it is possible to read/modify/write a block from the disk

Typical numbers:
- 500 to more than 20,000 tracks per surface
- 32 to 800 sectors per track; a sector is the smallest unit that can be read or written

Zoned bit recording:
- Constant bit density: more sectors on outer tracks
- Speed varies with track location

MBits per square inch: DRAM as % of Disk over time

[Figure: bar chart, 1974-1998, y-axis 0%-50%: DRAM areal density as a percentage of disk. Annotated points: 0.2 vs. 1.7 Mb/si; 9 vs. 22 Mb/si; 470 vs. 3000 Mb/si. Source: New York Times, 2/23/98, page C3, "Makers of disk drives crowd even more data into even smaller spaces".]


Nano-layered Disk Heads

- The special sensitivity of the disk head comes from the Giant Magneto-Resistive (GMR) effect
- IBM is (was) the leader in this technology

[Figure: head cross-section showing the nano-layered sensor and the coil for writing.]

Disk Figure of Merit: Areal Density

- Bits recorded along a track: metric is Bits Per Inch (BPI)
- Number of tracks per surface: metric is Tracks Per Inch (TPI)
- Disk designs brag about bit density per unit area: metric is bits per square inch: Areal Density = BPI x TPI

Year    Areal Density
1973    2
1979    8
1989    63
1997    3,090
2000    17,100
2006    130,000
2007    164,000
2009    400,000

[Figure: areal density vs. year, 1970-2010, log scale from 1 to 1,000,000.]

Disk I/O Performance

[Figure: user thread -> queue (OS paths) -> controller -> disk; response time = queue + disk service time. Plot: response time (ms, 0-300) vs. throughput (utilization, % of total BW, 0-100%): response time grows sharply as utilization approaches 100%.]

Performance of the disk drive/file system:
- Metrics: response time, throughput
- Contributing factors to latency:
  - Software paths (can be loosely modeled by a queue)
  - Hardware controller
  - Physical disk media
- Queuing behavior: can lead to a big increase in latency as utilization approaches 100%

Magnetic Disk Characteristic

- Cylinder: all the tracks under the heads at a given point on all surfaces

[Figure: sector, track, cylinder, head, and platter geometry.]

- Read/write data is a three-stage process:
  - Seek time: position the head/arm over the proper track (into the proper cylinder)
  - Rotational latency: wait for the desired sector to rotate under the read/write head
  - Transfer time: transfer a block of bits (sector) under the read/write head

Disk Latency = Queueing Time + Controller Time + Seek Time + Rotation Time + Xfer Time

[Figure: request -> software queue (device driver) -> hardware controller -> media time (seek + rot + xfer) -> result.]

- Highest bandwidth: transfer a large group of blocks sequentially from one track
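A worked instance of the latency equation (the 7200 RPM and 100 MB/s come from the Barracuda slide; the 5 ms seek, 0.1 ms controller time, empty queue, and 4 KB transfer are assumptions for illustration):

    \text{Rotation} = \frac{0.5\ \text{rev}}{7200/60\ \text{rev/s}} \approx 4.17\ \text{ms},
    \qquad
    \text{Xfer} = \frac{4\ \text{KB}}{100\ \text{MB/s}} \approx 0.04\ \text{ms}

    \text{Disk Latency} \approx 0 + 0.1 + 5 + 4.17 + 0.04 \approx 9.3\ \text{ms}

The rotation term also suggests where the Barracuda's quoted "4.2 ms latency" comes from: it is the average rotational latency of a 7200 RPM spindle.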


Array Reliability

- Reliability of N disks = reliability of 1 disk / N
- 50,000 hours / 70 disks = a ~700-hour disk system MTTF: drops from 6 years to 1 month!
- Arrays (without redundancy) are too unreliable to be useful!
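Spelling out the slide's arithmetic (50,000 hours is about 5.7 years, rounded to 6 on the slide):

    \text{MTTF}_{\text{array}} = \frac{\text{MTTF}_{\text{disk}}}{N}
    = \frac{50{,}000\ \text{hours}}{70\ \text{disks}}
    \approx 714\ \text{hours} \approx 1\ \text{month}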

Redundant Arrays of Disks

- Files are "striped" across multiple spindles
- Redundancy yields high data availability
- Disks will fail; contents are reconstructed from data redundantly stored in the array
  - Capacity penalty to store it
  - Bandwidth penalty to update
- Techniques:
  - Mirroring/shadowing (high capacity cost)
  - Horizontal Hamming codes (overkill)
  - Parity & Reed-Solomon codes
  - Failure prediction (no capacity overhead!): the VaxSimPlus technique is controversial
- Hot spares support reconstruction in parallel with access: very high media availability can be achieved

Redundant Arrays of Disks: RAID 1: Disk Mirroring/Shadowing

[Figure: a recovery group of mirrored disk pairs.]

- Each disk is fully duplicated onto its "shadow"; very high availability can be achieved
- Bandwidth sacrifice on write: logical write = two physical writes
- Reads may be optimized
- Most expensive solution: 100% capacity overhead
- Targeted for high I/O rate, high availability environments

Redundant Arrays of Disks: RAID 5+: High I/O Rate Parity

- Independent writes are possible because of the interleaved parity
- A logical write becomes four physical I/Os
- Reed-Solomon codes ("Q") for protection during reconstruction
- Targeted for mixed applications

[Figure: stripe units laid out across disk columns with rotated parity; logical addresses increase across each stripe:

D0   D1   D2   D3   P
D4   D5   D6   P    D7
D8   D9   P    D10  D11
D12  P    D13  D14  D15
P    D16  D17  D18  D19
D20  D21  D22  D23  P
...]


Problems of Disk Arrays: Small Writes

RAID-5 small write algorithm: 1 logical write = 2 physical reads + 2 physical writes.

[Figure: stripe D0 D1 D2 D3 P, with new data D0' arriving. (1. Read) old data D0 and (2. Read) old parity P; XOR the old data with the new data, then XOR the result into the old parity to produce P'; finally (3. Write) D0' and (4. Write) P', leaving the stripe D0' D1 D2 D3 P'.]
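A minimal sketch of the parity update shown above (block size and data are toy values): the new parity is computed from the old data and old parity alone, without reading D1-D3.

    #include <stdio.h>
    #include <stdint.h>

    #define BLK 8

    int main(void) {
        uint8_t d0_old[BLK] = {1, 2, 3, 4, 5, 6, 7, 8};   /* (1. Read) old data   */
        uint8_t p_old[BLK]  = {9, 9, 9, 9, 9, 9, 9, 9};   /* (2. Read) old parity */
        uint8_t d0_new[BLK] = {8, 7, 6, 5, 4, 3, 2, 1};   /* incoming small write */
        uint8_t p_new[BLK];

        for (int i = 0; i < BLK; i++)
            p_new[i] = p_old[i] ^ d0_old[i] ^ d0_new[i];  /* P' = P xor D0 xor D0' */

        /* (3. Write) d0_new and (4. Write) p_new would now go to disk. */
        printf("P'[0] = %u\n", p_new[0]);                 /* 9 ^ 1 ^ 8 = 0 */
        return 0;
    }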

System Availability: Orthogonal RAIDs

[Figure: an array controller fans out to multiple string controllers, each driving a string of disks; a data recovery group spans one disk from each string.]

- Data recovery group: unit of data redundancy
- Redundant support components: fans, power supplies, controller, cables
- End-to-end data integrity: internal parity-protected data paths

System-Level Availability

[Figure: fully dual-redundant configuration: two hosts, each with an I/O controller, cross-connected to two array controllers, with duplicated paths down to recovery groups of disks.]

- Goal: no single points of failure
- With duplicated paths, higher performance can be obtained when there are no failures