

IT6030 Advanced Computer Architecture

Chapter 3
Memory and Storage Systems

Nguyen Kim Khanh
Department of Computer Engineering
School of Information and Communication Technology
Hanoi University of Science and Technology

    Since 1980, CPU has outpaced DRAM ...

[Figure: performance (1/latency) vs. year. CPU performance improved ~60% per year, DRAM only ~9% per year; the gap grew ~50% per year.]

How do architects address this gap? Put small, fast cache memories between CPU and DRAM: create a memory hierarchy.

Memory Hierarchy
Take advantage of the principle of locality to:
- Present as much memory as is available in the cheapest technology
- Provide access at the speed offered by the fastest technology

[Figure: the memory hierarchy. Processor (control, datapath, registers, on-chip cache) -> second-level cache (SRAM) -> main memory (DRAM/FLASH/PCM) -> secondary storage (disk/FLASH/PCM) -> tertiary storage (tape/cloud storage).
Speed (ns):   1s    | 10s-100s | 100s | 10,000,000s (10s ms) | 10,000,000,000s (10s sec)
Size (bytes): 100s  | Ks-Ms    | Ms   | Gs                   | Ts]

The Principle of Locality

The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.

Two different types of locality:
- Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

For the last 15 years, hardware has relied on locality for speed.
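As a concrete illustration (a minimal sketch, not from the slides), the loop below walks an array at consecutive addresses (spatial locality) while reusing the accumulator on every iteration (temporal locality):

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];
        long sum = 0;
        for (int i = 0; i < N; i++)
            a[i] = i;                 /* sequential addresses: spatial locality */
        for (int i = 0; i < N; i++)
            sum += a[i];              /* sum reused every iteration: temporal locality */
        printf("sum = %ld\n", sum);
        return 0;
    }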


Memory Hierarchy: Apple iMac G5 (1.6 GHz)

Level     Size   Latency (cycles, time)   Managed by
Reg       1K     1, 0.6 ns                compiler
L1 Inst   64K    3, 1.9 ns                hardware
L1 Data   32K    3, 1.9 ns                hardware
L2        512K   11, 6.9 ns               hardware
DRAM      256M   88, 55 ns                OS, hardware, application
Disk      80G    10^7, 12 ms              OS, hardware, application

Goal: the illusion of a large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

[Die photo: iMac's PowerPC 970, all caches on-chip: registers (1K), L1 (64K instruction), L1 (32K data), 512K L2.]

Memory Hierarchy: Terminology

- Hit: data appears in some block in the upper level (example: Block X)
  - Hit Rate: the fraction of memory accesses found in the upper level
  - Hit Time: time to access the upper level = RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (Block Y)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
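The slide stops short of combining these terms; the standard combination (a well-known formula, though not shown here) is the average memory access time. With assumed numbers of a 1 ns hit time, 5% miss rate, and 100 ns miss penalty:

    \text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}
                = 1 + 0.05 \times 100 = 6\ \text{ns}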


Q1: Where can a block be placed in the upper level?

Block 12 placed in an 8-block cache: fully associative, direct mapped, 2-way set associative.
Set-associative mapping = block number modulo number of sets.

[Figure: memory blocks 0-31 and an 8-block cache (blocks 0-7) shown three ways. Fully associative: block 12 can go in any of the 8 blocks. Direct mapped: (12 mod 8) = 4, so block 12 goes only in cache block 4. 2-way set associative: (12 mod 4) = 0, so block 12 goes in either block of set 0.]
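A minimal sketch of the mapping rule (the constants mirror the slide's 8-block example; nothing else is from the slides):

    #include <stdio.h>

    int main(void) {
        unsigned block = 12;           /* memory block number from the slide  */
        unsigned direct_sets = 8;      /* direct mapped: 8 sets of one block  */
        unsigned twoway_sets = 4;      /* 2-way: 8 blocks / 2 ways = 4 sets   */

        printf("direct mapped:   cache block %u\n", block % direct_sets); /* 12 mod 8 = 4 */
        printf("2-way set assoc: set %u\n", block % twoway_sets);         /* 12 mod 4 = 0 */
        /* fully associative: block 12 may go in any of the 8 cache blocks */
        return 0;
    }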

Sources of Cache Misses

- Compulsory (cold start or process migration, first reference): the first access to a block. Cold fact of life: not a whole lot you can do about it. Note: if you are going to run billions of instructions, compulsory misses are insignificant.
- Capacity: the cache cannot contain all blocks accessed by the program. Solution: increase cache size.
- Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity.
- Coherence (invalidation): another process (e.g., I/O) updates memory.

Q2: How is a block found if it is in the upper level?

[Figure: the block address splits into Tag | Index, followed by the block offset; the index drives set select, the offset drives data select.]

- Index is used to look up candidates in the cache; the index identifies the set
- If no candidates match, then declare a cache miss
- Block is the minimum quantum of caching
  - The data select field is used to select data within the block
  - Many caching applications don't have a data select field

Block Size and Spatial Locality

A block is the unit of transfer between the cache and memory.

[Figure: a 4-word block (Word0-Word3); the CPU address is split into the block address (32-b bits, the tag) and the offset (b bits), where 2^b = block size, a.k.a. line size, in bytes.]

Larger block sizes:
- less tag overhead
- exploit fast burst transfers from DRAM
- exploit fast burst transfers over wide busses

What are the disadvantages of increasing block size? Fewer blocks => more conflicts; can waste bandwidth.


Review: Direct Mapped Cache

Direct mapped 2^N byte cache:
- The uppermost (32 - N) bits are always the Cache Tag
- The lowest M bits are the Byte Select (block size = 2^M)

Example: 1 KB direct mapped cache with 32 B blocks
- Cache Tag: bits 31-10 (ex: 0x50), checked to verify the block
- Cache Index: bits 9-5 (ex: 0x01)
- Byte Select: bits 4-0 (ex: 0x00), chooses the byte within the block

[Figure: array of (valid bit, cache tag, cache data) entries; 32 blocks of bytes 0-31 each, covering bytes 0-1023; the tag 0x50 is stored at the indexed entry.]
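A minimal sketch of the slide's bit slicing (the address value is hypothetical, chosen so it reproduces the slide's tag/index/byte-select example):

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 5                /* 32 B block  = 2^5             */
    #define INDEX_BITS  5                /* 1 KB / 32 B = 32 blocks = 2^5 */

    int main(void) {
        uint32_t addr = 0x00014020u;     /* hypothetical 32-bit address */

        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        /* prints tag=0x50 index=0x1 offset=0x0 -- the slide's example values */
        printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
        return 0;
    }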

Review: Set Associative Cache

N-way set associative: N entries per cache index
- N direct mapped caches operate in parallel

Example: two-way set associative cache
- Cache Index (bits 8-5) selects a set from the cache; Cache Tag is bits 31-9, Byte Select bits 4-0
- The tags of the set's entries are compared in parallel
- Data is selected based on the tag result

[Figure: two banks of (valid, cache tag, cache data) indexed by the same cache index; two comparators drive Sel1/Sel0 of a 2-to-1 mux over the two cache blocks, and an OR of the compare results produces Hit and the selected Cache Block.]

Review: Fully Associative Cache

Fully associative: every block can hold any line
- The address does not include a cache index
- Compare the cache tags of all cache entries in parallel

Example: block size = 32 B; we need N 27-bit comparators
- Still have byte select to choose from within the block

[Figure: Cache Tag (27 bits long, bits 31-5) compared against every stored tag with parallel '=' comparators; Byte Select is bits 4-0 (ex: 0x01); data rows hold bytes 0-31, 32-63, ...]

Q3: Which block should be replaced on a miss?

- Easy for direct mapped
- Set associative or fully associative:
  - LRU (Least Recently Used): appealing, but hard to implement for high associativity
  - Random: easy, but how well does it work?

Miss rates:

         2-way           4-way           8-way
Size     LRU     Random  LRU     Random  LRU     Random
16K      5.2%    5.7%    4.7%    5.3%    4.4%    5.0%
64K      1.9%    2.0%    1.5%    1.7%    1.4%    1.5%
256K     1.15%   1.17%   1.13%   1.13%   1.12%   1.12%


Q4: What happens on a write?

                                   Write-Through              Write-Back
Policy                             Data written to the        Write data only to the
                                   cache block, also          cache block; update the
                                   written to lower-level     lower level when a block
                                   memory                     falls out of the cache
Debug                              Easy                       Hard
Do read misses produce writes?     No                         Yes
Do repeated writes make it
to the lower level?                Yes                        No

Additional option: let writes to an un-cached address allocate a new cache line (write-allocate).

Write Buffers for Write-Through Caches

[Figure: Processor -> Cache -> Lower-Level Memory, with a write buffer alongside holding data awaiting write-through to lower-level memory.]

Q. Why a write buffer?
A. So the CPU doesn't stall.

Q. Why a buffer, why not just one register?
A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Drain the buffer before the next read, or check the write buffer for a match on reads.
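A minimal sketch of the match-on-read option (the structure and all names are assumptions, not from the slides):

    #include <stdint.h>
    #include <stdio.h>

    #define WB_ENTRIES 4

    struct wb_entry { uint32_t addr; uint32_t data; int valid; };
    static struct wb_entry wb[WB_ENTRIES];

    /* Queue a write awaiting write-through to lower-level memory. */
    static void buffer_write(int slot, uint32_t addr, uint32_t data) {
        wb[slot].addr = addr; wb[slot].data = data; wb[slot].valid = 1;
    }

    /* Check the buffer on a read: returns 1 and forwards the buffered value
     * on an address match; otherwise the read goes to lower-level memory. */
    static int read_check(uint32_t addr, uint32_t *data) {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (wb[i].valid && wb[i].addr == addr) { *data = wb[i].data; return 1; }
        return 0;
    }

    int main(void) {
        buffer_write(0, 0x1000u, 42u);   /* write still sitting in the buffer */
        uint32_t v;
        if (read_check(0x1000u, &v))     /* RAW hazard caught: forward from buffer */
            printf("forwarded %u\n", v);
        return 0;
    }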

5 Basic Cache Optimizations

Reducing miss rate:
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)

Reducing miss penalty:
4. Multilevel caches
5. Giving reads priority over writes (e.g., a read completes before earlier writes in the write buffer)

RISC: The Integrated Systems View (Discussion of Papers)

- "The Case for the Reduced Instruction Set Computer", Dave Patterson and David Ditzel
- "Comments on 'The Case for the Reduced Instruction Set Computer'", Doug Clark and William Strecker
- "Retrospective on High-Level Computer Architecture", David Ditzel and David Patterson

In-class discussion of these papers.


What is virtual memory?

[Figure: a virtual address (virtual page number, 10-bit offset shown) indexes into the page table, which is located in physical memory and pointed to by the Page Table Base Register; each entry holds (V, access rights, PA) and yields the physical address (physical page number, offset), mapping the virtual address space onto the physical address space.]

- Virtual memory => treat main memory as a cache for the disk
- Terminology: blocks in this cache are called pages; typical page size: 1 KB - 8 KB
- The page table maps virtual page numbers to physical frames (PTE = Page Table Entry)

What is in a Page Table Entry (PTE)?

- A pointer to the next-level page table or to the actual page
- Permission bits: valid, read-only, read-write, write-only

Example: Intel x86 architecture PTE
- Address format as on the previous slide (10, 10, 12-bit offset)
- Intermediate page tables are called Directories

Layout (bits 31-12 | 11-9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0):
Page Frame Number (Physical Page Number) | Free (OS) | 0 | L | D | A | PCD | PWT | U | W | P

- P: Present (same as the valid bit in other architectures)
- W: Writeable
- U: User accessible
- PWT: Page write transparent: external cache write-through
- PCD: Page cache disabled (page cannot be cached)
- A: Accessed: page has been accessed recently
- D: Dirty (PTE only): page has been modified recently
- L: L=1 => 4 MB page (directory only); the bottom 22 bits of the virtual address then serve as the offset

Three Advantages of Virtual Memory

- Translation:
  - A program can be given a consistent view of memory, even though physical memory is scrambled
  - Makes multithreading reasonable (now used a lot!)
  - Only the most important part of the program (the working set) must be in physical memory
  - Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
- Protection:
  - Different threads (or processes) are protected from each other
  - Different pages can be given special behavior (read only, invisible to user programs, etc.)
  - Kernel data is protected from user programs
  - Very important for protection from malicious programs
- Sharing:
  - Can map the same physical page to multiple users (shared memory)

Large Address Space Support

[Figure: two-level translation. The virtual address = (10-bit virtual P1 index, 10-bit virtual P2 index, 12-bit offset). The PageTablePtr register locates the first-level table; the P1 index selects a 4-byte entry pointing to a second-level table; the P2 index selects a 4-byte entry holding the physical page number, which combines with the offset to form the physical address.]

- Single-level page table: with 4 KB pages for a 32-bit address space, 1M entries; each process needs its own page table!
- Multi-level page table: allows sparseness of the page table; portions of the table can be swapped to disk
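A minimal runnable sketch of the 10/10/12 two-level walk (physical memory is simulated as an array of frames; every name and value here is an assumption, except the P bit named on the PTE slide):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u
    #define P_BIT     0x1u                     /* Present bit, as in the x86 PTE */

    typedef uint32_t pte_t;
    static pte_t frames[16][PAGE_SIZE / sizeof(pte_t)];  /* simulated physical frames */

    /* Two-level 10/10/12 walk: returns 1 and fills *pa on success, 0 on a fault. */
    int translate(uint32_t dir_frame, uint32_t va, uint32_t *pa) {
        uint32_t p1  = (va >> 22) & 0x3FFu;    /* directory index (top 10 bits) */
        uint32_t p2  = (va >> 12) & 0x3FFu;    /* table index (next 10 bits)    */
        uint32_t off = va & 0xFFFu;            /* page offset (bottom 12 bits)  */

        pte_t pde = frames[dir_frame][p1];
        if (!(pde & P_BIT)) return 0;          /* directory entry not present   */

        pte_t pte = frames[pde >> 12][p2];
        if (!(pte & P_BIT)) return 0;          /* page not present: page fault  */

        *pa = (pte & 0xFFFFF000u) | off;       /* physical frame | offset       */
        return 1;
    }

    int main(void) {
        frames[0][0] = (1u << 12) | P_BIT;     /* directory entry 0 -> table in frame 1 */
        frames[1][1] = (5u << 12) | P_BIT;     /* table entry 1 -> page in frame 5      */

        uint32_t pa;
        if (translate(0, 0x00001234u, &pa))    /* VA 0x1234: p1=0, p2=1, off=0x234 */
            printf("PA = 0x%08x\n", pa);       /* prints PA = 0x00005234 */
        return 0;
    }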


Translation Look-Aside Buffers (TLB)

A TLB is a cache on translations: fully associative, set associative, or direct mapped.

[Figure: translation with a TLB. The CPU presents a VA to the TLB; on a hit, the PA goes to the cache (and on a cache miss, on to main memory); on a TLB miss, the translation unit walks the page table, the TLB is filled, and the access proceeds; data returns to the CPU.]

TLBs are:
- Small: typically not more than 128-256 entries
- Fully associative

Caching Applied to Address Translation

[Figure: the CPU sends a virtual address to the TLB; if the translation is cached, the physical address goes straight to memory for the data read or write; if not, the MMU translates by walking the page table in memory, then the access proceeds.]

The question is one of page locality: does it exist?
- Instruction accesses spend a lot of time on the same page (since accesses are sequential)
- Stack accesses have definite locality of reference
- Data accesses have less page locality, but still some

Can we have a TLB hierarchy? Sure: multiple levels at different sizes/speeds.

What Actually Happens on a TLB Miss?

- Hardware-traversed page tables:
  - On a TLB miss, hardware in the MMU looks at the current page table to fill the TLB (it may walk multiple levels)
  - If the PTE is valid, the hardware fills the TLB and the processor never knows
  - If the PTE is marked invalid, it causes a page fault, after which the kernel decides what to do
- Software-traversed page tables (like MIPS):
  - On a TLB miss, the processor receives a TLB fault
  - The kernel traverses the page table to find the PTE; if the PTE is valid, it fills the TLB and returns from the fault
  - If the PTE is marked invalid, it internally calls the page fault handler
- Most chipsets provide hardware traversal
- Modern operating systems tend to have more TLB faults, since they use translation for many things; examples: shared segments, user-level portions of an operating system
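A minimal sketch of the lookup-then-refill flow (fully associative search, as the TLB slide describes; the structure, names, and the stubbed page walk are assumptions):

    #include <stdint.h>
    #include <stdio.h>

    #define TLB_ENTRIES 8

    struct tlb_entry { uint32_t vpn, pfn; int valid; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Stub standing in for the page-table walk (hardware or software). */
    static uint32_t walk_page_table(uint32_t vpn) { return vpn + 100; }

    uint32_t lookup(uint32_t va) {
        uint32_t vpn = va >> 12, off = va & 0xFFFu;
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (tlb[i].pfn << 12) | off;      /* TLB hit */
        uint32_t pfn = walk_page_table(vpn);          /* TLB miss: walk, then refill */
        tlb[0] = (struct tlb_entry){ vpn, pfn, 1 };   /* naive replacement policy */
        return (pfn << 12) | off;
    }

    int main(void) {
        printf("PA = 0x%x\n", lookup(0x00002345u));   /* miss, walk, refill */
        printf("PA = 0x%x\n", lookup(0x00002345u));   /* hit */
        return 0;
    }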

Clock Algorithm: Not Recently Used

- Clock algorithm: approximate LRU (an approximation to an approximation to MIN); replace an old page, not the oldest page
- Single clock hand: advances only on a page fault! Checks for pages not used recently; marks pages as not used recently

[Figure: physical pages in memory arranged in a circle swept by the clock hand; the page table keeps (used, dirty) bits per page.]

Details:
- Hardware "use" bit per physical page: hardware sets the use bit on each reference; if the use bit isn't set, the page hasn't been referenced in a long time
- On page fault: advance the clock hand (not in real time); check the use bit: 1 => used recently, clear it and leave the page alone; 0 => selected candidate for replacement
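A minimal sketch of the sweep just described (the names and the initial use bits are assumptions, not from the slides):

    #include <stdio.h>

    #define NPAGES 8

    static int use[NPAGES] = {1, 0, 1, 1, 0, 1, 0, 0};  /* hardware-set use bits */
    static int hand = 0;

    /* Advance the hand, clearing use bits of recently used pages,
     * until a page with use == 0 is found: the replacement victim. */
    int pick_victim(void) {
        for (;;) {
            if (use[hand] == 0) {            /* not referenced since last sweep */
                int victim = hand;
                hand = (hand + 1) % NPAGES;
                return victim;
            }
            use[hand] = 0;                   /* used recently: clear and move on */
            hand = (hand + 1) % NPAGES;
        }
    }

    int main(void) {
        printf("evict page %d\n", pick_victim());   /* page 1 with the bits above */
        return 0;
    }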


Summary #2/3: Caches

- The Principle of Locality: programs access a relatively small portion of the address space at any instant of time
  - Temporal locality: locality in time
  - Spatial locality: locality in space
- Three major categories of cache misses:
  - Compulsory misses: sad facts of life; example: cold start misses
  - Capacity misses: increase cache size
  - Conflict misses: increase cache size and/or associativity; nightmare scenario: ping-pong effect!
- Write policy: write-through vs. write-back
- Today CPU time is a function of (ops, cache misses) rather than just f(ops): this affects compilers, data structures, and algorithms

Summary #3/3: TLB, Virtual Memory

- Page tables map virtual addresses to physical addresses; TLBs are important for fast translation
- TLB misses are significant in processor performance: funny times, as most systems can't access all of the 2nd-level cache without TLB misses!
- Caches, TLBs, and virtual memory are all understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) Which block is replaced on a miss? 4) How are writes handled?
- Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy benefits, but computers remain insecure

Prepare for debate + quiz on Wednesday.

I/O Systems

[Figure: Processor (with interrupts) and Cache sit on a Memory-I/O Bus along with Main Memory and I/O controllers for disk, disk, graphics, and network devices.]

Example: Memory-Mapped Display Controller

- Memory-mapped: hardware maps control registers and display memory into the physical address space
  - Addresses set by hardware jumpers or by programming at boot time
- Simply writing to display memory (also called the frame buffer) changes the image on screen
  - Addr: 0x8000F000 - 0x8000FFFF
- Writing a graphics description to the command-queue area, say a set of triangles that describe some scene
  - Addr: 0x80010000 - 0x8001FFFF
- Writing to the command register may cause on-board graphics hardware to do something, say render the above scene
  - Addr: 0x0007F004
- Can protect with page tables

[Figure: physical address space with the Graphics Command Queue at 0x80010000 (up to 0x80020000), Display Memory at 0x8000F000, the Command register at 0x0007F004, and the Status register at 0x0007F000.]
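A hypothetical sketch of driving such a controller from C (the addresses come from the slide; the command encoding, status bit, and all names are assumptions, and the code of course only runs on hardware that decodes these addresses):

    #include <stdint.h>

    #define FRAME_BUF   ((volatile uint8_t  *)0x8000F000u)  /* display memory   */
    #define CMD_QUEUE   ((volatile uint32_t *)0x80010000u)  /* command queue    */
    #define CMD_REG     ((volatile uint32_t *)0x0007F004u)  /* command register */
    #define STATUS_REG  ((volatile uint32_t *)0x0007F000u)  /* status register  */

    #define CMD_RENDER  1u     /* assumed command encoding */
    #define STATUS_BUSY 1u     /* assumed busy bit         */

    void draw(void) {
        FRAME_BUF[0] = 0xFF;            /* writing display memory changes the screen */
        CMD_QUEUE[0] = 0x00000003u;     /* e.g., a triangle-list descriptor (assumed) */
        *CMD_REG = CMD_RENDER;          /* kick the on-board graphics hardware */
        while (*STATUS_REG & STATUS_BUSY)
            ;                           /* poll the assumed busy bit */
    }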


Hard Disk Drives

[Figure: side view of the read/write head over the platters; drives pictured: Western Digital drive, IBM/Hitachi Microdrive. Source: http://www.storagereview.com/guide/]

Seagate Barracuda:
- 2 TB! 400 GB/in2
- 4 platters, 2 heads each; 3.5" platters
- Perpendicular recording
- 7200 RPM; 4.2 ms latency (?)
- 100 MB/sec transfer speed; 32 MB cache

Properties of a Hard Magnetic Disk

[Figure: platters with tracks and sectors.]

Properties:
- Independently addressable element: the sector; the OS always transfers groups of sectors together ("blocks")
- A disk can directly access any given block of information it contains (random access); any file can be accessed either sequentially or randomly
- A disk can be rewritten in place: it is possible to read/modify/write a block from the disk

Typical numbers:
- 500 to more than 20,000 tracks per surface
- 32 to 800 sectors per track; a sector is the smallest unit that can be read or written

Zoned bit recording:
- Constant bit density: more sectors on outer tracks
- Speed varies with track location

MBits per square inch: DRAM as % of Disk over time

[Figure: bar chart, 1974-1998, y-axis 0%-50%: DRAM areal density as a percentage of disk. Annotated points: 0.2 vs. 1.7 Mb/si; 9 vs. 22 Mb/si; 470 vs. 3000 Mb/si. Source: New York Times, 2/23/98, page C3, "Makers of disk drives crowd even more data into even smaller spaces".]


Nano-layered Disk Heads

- The special sensitivity of the disk head comes from the Giant Magneto-Resistive (GMR) effect
- IBM is (was) the leader in this technology

[Figure: head cross-section showing the nano-layered sensor and the coil for writing.]

Disk Figure of Merit: Areal Density

- Bits recorded along a track: metric is Bits Per Inch (BPI)
- Number of tracks per surface: metric is Tracks Per Inch (TPI)
- Disk designs brag about bit density per unit area: metric is bits per square inch: Areal Density = BPI x TPI

Year    Areal Density
1973    2
1979    8
1989    63
1997    3,090
2000    17,100
2006    130,000
2007    164,000
2009    400,000

[Figure: areal density vs. year, 1970-2010, log scale from 1 to 1,000,000.]

Disk I/O Performance

[Figure: user thread -> queue (OS paths) -> controller -> disk; response time = queue + disk service time. Plot: response time (ms, 0-300) vs. throughput (utilization, % of total BW, 0-100%): response time grows sharply as utilization approaches 100%.]

Performance of the disk drive/file system:
- Metrics: response time, throughput
- Contributing factors to latency:
  - Software paths (can be loosely modeled by a queue)
  - Hardware controller
  - Physical disk media
- Queuing behavior: can lead to a big increase in latency as utilization approaches 100%

Magnetic Disk Characteristic

- Cylinder: all the tracks under the heads at a given point on all surfaces

[Figure: sector, track, cylinder, head, and platter geometry.]

- Read/write data is a three-stage process:
  - Seek time: position the head/arm over the proper track (into the proper cylinder)
  - Rotational latency: wait for the desired sector to rotate under the read/write head
  - Transfer time: transfer a block of bits (sector) under the read/write head

Disk Latency = Queueing Time + Controller Time + Seek Time + Rotation Time + Xfer Time

[Figure: request -> software queue (device driver) -> hardware controller -> media time (seek + rot + xfer) -> result.]

- Highest bandwidth: transfer a large group of blocks sequentially from one track
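A worked instance of the latency equation (the 7200 RPM and 100 MB/s come from the Barracuda slide; the 5 ms seek, 0.1 ms controller time, empty queue, and 4 KB transfer are assumptions for illustration):

    \text{Rotation} = \frac{0.5\ \text{rev}}{7200/60\ \text{rev/s}} \approx 4.17\ \text{ms},
    \qquad
    \text{Xfer} = \frac{4\ \text{KB}}{100\ \text{MB/s}} \approx 0.04\ \text{ms}

    \text{Disk Latency} \approx 0 + 0.1 + 5 + 4.17 + 0.04 \approx 9.3\ \text{ms}

The rotation term also suggests where the Barracuda's quoted "4.2 ms latency" comes from: it is the average rotational latency of a 7200 RPM spindle.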


Array Reliability

- Reliability of N disks = reliability of 1 disk / N
- 50,000 hours / 70 disks = a ~700-hour disk system MTTF: drops from 6 years to 1 month!
- Arrays (without redundancy) are too unreliable to be useful!
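Spelling out the slide's arithmetic (50,000 hours is about 5.7 years, rounded to 6 on the slide):

    \text{MTTF}_{\text{array}} = \frac{\text{MTTF}_{\text{disk}}}{N}
    = \frac{50{,}000\ \text{hours}}{70\ \text{disks}}
    \approx 714\ \text{hours} \approx 1\ \text{month}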

Redundant Arrays of Disks

- Files are "striped" across multiple spindles
- Redundancy yields high data availability
- Disks will fail; contents are reconstructed from data redundantly stored in the array
  - Capacity penalty to store it
  - Bandwidth penalty to update
- Techniques:
  - Mirroring/shadowing (high capacity cost)
  - Horizontal Hamming codes (overkill)
  - Parity & Reed-Solomon codes
  - Failure prediction (no capacity overhead!): the VaxSimPlus technique is controversial
- Hot spares support reconstruction in parallel with access: very high media availability can be achieved

Redundant Arrays of Disks: RAID 1: Disk Mirroring/Shadowing

[Figure: a recovery group of mirrored disk pairs.]

- Each disk is fully duplicated onto its "shadow"; very high availability can be achieved
- Bandwidth sacrifice on write: logical write = two physical writes
- Reads may be optimized
- Most expensive solution: 100% capacity overhead
- Targeted for high I/O rate, high availability environments

Redundant Arrays of Disks: RAID 5+: High I/O Rate Parity

- Independent writes are possible because of the interleaved parity
- A logical write becomes four physical I/Os
- Reed-Solomon codes ("Q") for protection during reconstruction
- Targeted for mixed applications

[Figure: stripe units laid out across disk columns with rotated parity; logical addresses increase across each stripe:

D0   D1   D2   D3   P
D4   D5   D6   P    D7
D8   D9   P    D10  D11
D12  P    D13  D14  D15
P    D16  D17  D18  D19
D20  D21  D22  D23  P
...]


Problems of Disk Arrays: Small Writes

RAID-5 small write algorithm: 1 logical write = 2 physical reads + 2 physical writes.

[Figure: stripe D0 D1 D2 D3 P, with new data D0' arriving. (1. Read) old data D0 and (2. Read) old parity P; XOR the old data with the new data, then XOR the result into the old parity to produce P'; finally (3. Write) D0' and (4. Write) P', leaving the stripe D0' D1 D2 D3 P'.]
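A minimal sketch of the parity update shown above (block size and data are toy values): the new parity is computed from the old data and old parity alone, without reading D1-D3.

    #include <stdio.h>
    #include <stdint.h>

    #define BLK 8

    int main(void) {
        uint8_t d0_old[BLK] = {1, 2, 3, 4, 5, 6, 7, 8};   /* (1. Read) old data   */
        uint8_t p_old[BLK]  = {9, 9, 9, 9, 9, 9, 9, 9};   /* (2. Read) old parity */
        uint8_t d0_new[BLK] = {8, 7, 6, 5, 4, 3, 2, 1};   /* incoming small write */
        uint8_t p_new[BLK];

        for (int i = 0; i < BLK; i++)
            p_new[i] = p_old[i] ^ d0_old[i] ^ d0_new[i];  /* P' = P xor D0 xor D0' */

        /* (3. Write) d0_new and (4. Write) p_new would now go to disk. */
        printf("P'[0] = %u\n", p_new[0]);                 /* 9 ^ 1 ^ 8 = 0 */
        return 0;
    }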

System Availability: Orthogonal RAIDs

[Figure: an array controller fans out to multiple string controllers, each driving a string of disks; a data recovery group spans one disk from each string.]

- Data recovery group: unit of data redundancy
- Redundant support components: fans, power supplies, controller, cables
- End-to-end data integrity: internal parity-protected data paths

System-Level Availability

[Figure: fully dual-redundant configuration: two hosts, each with an I/O controller, cross-connected to two array controllers, with duplicated paths down to recovery groups of disks.]

- Goal: no single points of failure
- With duplicated paths, higher performance can be obtained when there are no failures