computer architecture - halatsis/advanced_comp_arch/geniki_ superscalar processors decoding, ......


Post on 06-Mar-2018


  • University of Amsterdam

    CSP Computer Architecture

    Computer Architecture: A bottom-up perspective

    Andy Pimentel

    Computer Architecture Modeling & Simulation group

    andy@science.uva.nl

    Andy Pimentel p. 1/259

  • Course material

    Book: J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 3rd ed.

    Other nice book: D. Sima, T. Fountain and P. Kacsuk, Advanced Computer Architecture: A Design Space Approach

    Sheets available at website (http://www.science.uva.nl/andy/aca.html)

    Same for schedule, practical assignments, deadlines, etc.

  • Outline

    Memory hierarchy
      • DRAM
      • Caches: from concept to implementation

    Pipelined processors
      • Pipeline hazards
      • Some design space issues

    Modern superscalar processors
      • Decoding, dispatching, issuing and execution of instructions
      • Register renaming
      • Sequential consistency, exception handling
      • Branch prediction

  • Outline (cont'd)

    Application specific optimizations
      • SIMD instructions
      • Data prefetching

    Case studies
      • Compaq Alpha 21264, HP PA-8700, IBM POWER 4, Intel Pentium 4

    VLIW processors
      • Philips TriMedia
      • Intel/HP IA64 (Itanium 2)
      • Transmeta Crusoe

    Embedded processors

  • Outline (cont'd)

    Parallel computers
      • Interconnection networks
          • Topology, switching, routing, etc.
      • Memory hierarchy
          • Shared/distributed memory, cache coherency, etc.
      • Case studies

    Future directions
      • Super-speculative processors
      • Trace/Multiscalar processors
      • Simultaneous multithreading
      • I(ntelligent)RAMs
      • ...

  • Memory hierarchy: DRAM

    8 to 16 times slower than SRAM

    Denser than SRAM (e.g. SRAM needs about 6 transistors/cell)

    RAS/CAS addressing using time multiplexing

    Needs refreshing

    Cycle time roughly 2 times the access time

    Processor-memory speed gap is widening
      • Processors 50% to 100% faster/year (Moore's Law)
      • DRAM cycle time improves 7%/year
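To put the widening gap in numbers, the two growth rates from the slide can be compounded year over year. This is a minimal sketch: the function name `speed_gap` and the normalization (ratio 1.0 at year 0) are assumptions made for illustration, and the lower bound of 50%/year is used for processors.

```python
def speed_gap(years, cpu_rate=0.50, dram_rate=0.07):
    """Processor/DRAM speed ratio after `years`, normalized to 1.0 at year 0.

    cpu_rate and dram_rate are the yearly improvements quoted on the slide
    (50% for processors, 7% for DRAM cycle time).
    """
    return ((1.0 + cpu_rate) / (1.0 + dram_rate)) ** years


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"after {years:2d} years the gap has grown {speed_gap(years):.1f}x")
```

Even with the conservative processor rate, the ratio grows by roughly 1.4x per year, which is why the rest of the memory hierarchy matters.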

  • RAMs

    [Figure: a DRAM cell (a transistor and a storage capacitor, connected to the address line, bitline B and ground) next to an SRAM cell (transistors T1 to T6 and nodes C1 and C2, connected to the address line, a dc voltage, ground and two bitlines)]

  • DRAM (cont'd)

    RAS/CAS addressing

    [Figure: array of DRAM cells (1 transistor plus capacitor each); RAS selects a row, CAS selects a column within that row]

    Step 1: Row Address Select

    Step 2: Column Address Select (select bit)

    Refresh: read and write back a whole row
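The two addressing steps can be sketched as splitting one cell address into a row part and a column part that are sent one after the other over the same address pins. The array size (1024 rows by 1024 columns) and the name `ras_cas` are assumptions chosen for illustration, not values from the slides.

```python
ROW_BITS = 10  # assumed: 1024 rows
COL_BITS = 10  # assumed: 1024 columns per row


def ras_cas(addr):
    """Split a cell address into the (row, column) pair sent in steps 1 and 2."""
    row = addr >> COL_BITS               # step 1: sent with RAS
    col = addr & ((1 << COL_BITS) - 1)   # step 2: sent with CAS
    return row, col


if __name__ == "__main__":
    row, col = ras_cas((3 << COL_BITS) | 5)
    print(f"row {row}, column {col}")
```

Because the row and column halves share the pins, a chip with 2^20 cells needs only 10 address pins in this sketch instead of 20.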

  • DRAM (cont'd)

    Refresh time typically in the tens of milliseconds

    Number of refresh cycles dependent on number of rows

    Two types of refreshing: burst refresh and distributed refresh

    [Figure: timelines over one refresh time. Burst refresh performs all refresh cycles back to back; distributed refresh spreads the refresh cycles evenly over the refresh time.]
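The difference between the two refresh schemes comes down to simple arithmetic. The sketch below assumes a 64 ms refresh time, 4096 rows and a 60 ns row cycle; these numbers and the function names are illustrative assumptions, not values from the slides.

```python
def distributed_interval_ns(refresh_time_ns, num_rows):
    """Distributed refresh: time between two single-row refresh cycles."""
    return refresh_time_ns / num_rows


def burst_busy_ns(num_rows, row_cycle_ns):
    """Burst refresh: how long the DRAM is blocked while all rows are refreshed."""
    return num_rows * row_cycle_ns


def refresh_overhead(refresh_time_ns, num_rows, row_cycle_ns):
    """Fraction of time spent refreshing (the same for both schemes)."""
    return num_rows * row_cycle_ns / refresh_time_ns


if __name__ == "__main__":
    t_ref, rows, t_rc = 64_000_000, 4096, 60  # assumed example values
    print("distributed: one row every", distributed_interval_ns(t_ref, rows), "ns")
    print("burst: blocked for", burst_busy_ns(rows, t_rc), "ns per refresh time")
    print("overhead:", refresh_overhead(t_ref, rows, t_rc))
```

Both schemes spend the same total time refreshing; they differ only in whether that time arrives as one long stall or as many short ones.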

  • DRAM (cont'd)

    Improving bandwidth (not latency!) by exploiting spatial locality

      • One RAS, multiple CAS addresses
          • Fast page mode DRAMs
          • E(xtended) D(ata) O(utput) RAM
      • Burst mode DRAMs: one RAS and one CAS address per burst
          • Burst EDO RAM
          • SDRAM

    or by improving the interface: SDRAM and Rambus
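A rough timing model shows why these modes raise bandwidth without touching latency. The model is deliberately simplified and every timing value is a made-up assumption: `t_row` is the cost of sending a row address, `t_col` the cost of a column address plus data, and `t_beat` the cost of each further word in a burst.

```python
def plain_access_time(n, t_row, t_col):
    """Every access sends a row address and a column address."""
    return n * (t_row + t_col)


def page_mode_time(n, t_row, t_col):
    """One RAS, then one CAS per word in the same row."""
    return t_row + n * t_col


def burst_mode_time(n, t_row, t_col, t_beat):
    """One RAS and one CAS, then the remaining words stream out."""
    return t_row + t_col + (n - 1) * t_beat


if __name__ == "__main__":
    n, t_row, t_col, t_beat = 4, 30, 20, 10  # assumed values in ns
    print("plain:", plain_access_time(n, t_row, t_col), "ns")
    print("page mode:", page_mode_time(n, t_row, t_col), "ns")
    print("burst mode:", burst_mode_time(n, t_row, t_col, t_beat), "ns")
```

In all three functions the first word still costs `t_row + t_col`, so latency is unchanged; only the cost of the following words in the same row drops.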

  • DRAM (cont'd)

    [Figure: timing diagrams with RAS, CAS, Address and Data signals. EDO RAM: the address lines carry ROW, COL1, COL2, COL3 and the data lines return DATA1, DATA2, DATA3. 2-bit Burst EDO RAM: the address lines carry ROW, COL1, COL3 and the data lines return DATA1, DATA2, DATA3.]

  • DRAM (cont'd)

    SDRAM changed the interface from asynchronous to synchronous

    [Figure: clocked comparison of standard DRAM and synchronous DRAM, each with the stages Addr. latch, Decode, R/W and Output. Standard DRAM takes one address through all stages before accepting the next; synchronous DRAM overlaps the stages for Addr1 to Addr5.]

    Brought (a sort of) pipelining to DRAMs

    DDR-SDRAM (Double Data Rate) transfers data on both rising and falling clock edges
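The effect of overlapping the four stages named in the figure can be estimated with the usual pipeline formula. This sketch assumes every stage takes exactly one clock cycle and that the pipeline never stalls, which is an idealization; the function names are hypothetical.

```python
STAGES = ("addr latch", "decode", "read/write", "output")


def unpipelined_cycles(n_accesses, n_stages=len(STAGES)):
    """Standard DRAM: an access passes through all stages before the next starts."""
    return n_accesses * n_stages


def pipelined_cycles(n_accesses, n_stages=len(STAGES)):
    """Synchronous DRAM: a new access enters the first stage every clock."""
    if n_accesses == 0:
        return 0
    return n_stages + (n_accesses - 1)


if __name__ == "__main__":
    for n in (1, 5, 100):
        print(n, "accesses:", unpipelined_cycles(n), "vs", pipelined_cycles(n), "cycles")
```

A single access takes equally long either way, so latency does not improve; the gain is purely in throughput once several accesses are in flight.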

  • DRAM (cont'd)

    Rambus (RDRAM)

    Interface using a split-transaction (= packet-switched) bus (pipelining!)

    Separate row address control, column address control and (18-bit) data lines

    So, three transactions can be active at the same time

    High clock rate (400 MHz), but long latency

    Data can be transferred on both clock edges
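The clock rate, the double-edge transfers and the data width combine into a peak transfer rate. In this sketch the 400 MHz clock and the 18 data lines come from the slide; treating 16 of the 18 lines as payload (with the rest used for parity) is an assumption, and the function name is hypothetical.

```python
def peak_bytes_per_second(clock_hz, data_bits, transfers_per_clock=2):
    """Peak transfer rate: one transfer per clock edge across all data lines."""
    return clock_hz * transfers_per_clock * data_bits // 8


if __name__ == "__main__":
    clock_hz = 400_000_000
    print("raw, 18 lines:", peak_bytes_per_second(clock_hz, 18), "bytes/s")
    # Assumption: only 16 of the 18 lines carry payload data.
    print("payload, 16 lines:", peak_bytes_per_second(clock_hz, 16), "bytes/s")
```

This is a peak figure only; the long latency noted on the slide means a single isolated access sees far less.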

  • DRAM (cont'd)

    Interleaved memory: multiple banks

    Bank 0   Bank 1   Bank 2   Bank 3
       0        1        2        3
       4        5        6        7
       8        9       10       11

    Optimizes sequential accesses and can hide refresh cycles

    Problem: aliased accesses (strided accesses that keep hitting the same bank). Remedies:
      • Large number of banks (NEC SX/3, 128 banks!)
      • Prime number of banks
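The bank layout in the table corresponds to taking the address modulo the number of banks, which makes the aliasing problem easy to reproduce. This is a small sketch with hypothetical helper names; it counts how many of 12 strided accesses land in each bank.

```python
from collections import Counter


def bank_of(addr, num_banks):
    """Low-order interleaving: consecutive addresses go to consecutive banks."""
    return addr % num_banks


def bank_histogram(stride, num_banks, count=12):
    """How many of `count` accesses with the given stride hit each bank."""
    return Counter(bank_of(i * stride, num_banks) for i in range(count))


if __name__ == "__main__":
    print("stride 1, 4 banks:", dict(bank_histogram(1, 4)))
    print("stride 4, 4 banks:", dict(bank_histogram(4, 4)))  # aliased: one bank only
    print("stride 4, 5 banks:", dict(bank_histogram(4, 5)))  # prime bank count
```

With 4 banks a stride of 4 sends every access to bank 0, so the banks cannot overlap their work; with 5 banks the same stride spreads over all banks.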

  • Memory hierarchy: caches

    Performance gap between processor and main memory: apply caching (basically a poor man's solution)

    Caches are small and fast memories (close to the processor, typically SRAM)

    Nowadays, 2 (or 3) levels of cache between processor and main memory

    Caches are transparent to the user (important!...however...)


  • Caches (cont'd)

    Cache exploits locality in software

    Temporal locality: a referenced item tends to be referenced again soon
      • Instructions
      • Data??

    Spatial locality: items close to a referenced item tend to be referenced soon
      • Instructions + data

  • Caches (cont'd)

    Instruction, data or unified caches

    Address cache (TLB: Translation Lookaside Buffer)
      • Caches VA-to-PA translations
      • Split I + D TLBs or unified, sometimes 2 levels

    Three common implementations
      • Direct mapped
      • Fully associative
      • Set-associative

  • Caches (cont'd)

    Instruction and data caches store cache blocks (also called cache lines)

    Cache block layout: Tag | V | D | Data
      • Tag: higher-order address bits
      • V: valid bit
      • D: dirty bit
      • Data: typically 16 to 128 bytes

  • Cache implementations

    Direct mapped cache (often 2nd-level cache)

    [Figure: a 16-bit memory address split into a 9-bit tag, a 3-bit block index and a 4-bit byte offset. The block index selects one of 8 cache blocks (tag plus 16 bytes of data); the stored tag is compared with the address tag to decide a hit. Main memory is shown as Block0 to Block15 of 16 bytes each.]

    Simple hardware & high speed access

    Rigid mapping: many memory blocks map onto one cache block, so a large cache size is required
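The address split in the figure (9-bit tag, 3-bit block index, 4-bit byte offset) can be turned into a small hit/miss model. This is a minimal sketch that tracks tags only, not data; the names `split_address` and `DirectMappedCache` are hypothetical.

```python
OFFSET_BITS = 4  # 16 bytes per cache block
INDEX_BITS = 3   # 8 cache blocks; the remaining 9 bits form the tag


def split_address(addr):
    """Split a 16-bit address into (tag, block index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class DirectMappedCache:
    def __init__(self):
        # One stored tag per cache block; None means the block is invalid.
        self.tags = [None] * (1 << INDEX_BITS)

    def access(self, addr):
        """Return True on a hit; on a miss, replace the block at that index."""
        tag, index, _ = split_address(addr)
        hit = self.tags[index] == tag
        self.tags[index] = tag
        return hit


if __name__ == "__main__":
    cache = DirectMappedCache()
    # 0x0000 and 0x0080 share block index 0 but have different tags.
    for addr in (0x0000, 0x0004, 0x0080, 0x0000):
        print(hex(addr), "hit" if cache.access(addr) else "miss")
```

The last access misses even though 7 of the 8 blocks are still empty, which is the rigid-mapping drawback noted on the slide.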

  • Cache implementations (cont'd)

    Fully associative cache (e.g. TLB, branch history table)

    [Figure: a 16-bit memory address split into a 12-bit tag and a 4-bit byte offset. The address tag is compared with the tags of all 8 cache blocks (16 bytes of data each) to decide a hit. Main memory is shown as Block0 to Block15 of 16 bytes each.]

    Very flexible mapping (few conflicts)

    CAMs (Content Addressable Memory) are expensive, so caches are small; multimedia applications are often a killer for TLBs
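A fully associative cache can be modelled as a tag lookup over all blocks at once. The slides do not name a replacement policy, so least-recently-used (LRU) replacement is an assumption here, as is the class name `FullyAssociativeCache`. The sketch tracks tags only.

```python
from collections import OrderedDict


class FullyAssociativeCache:
    def __init__(self, num_blocks=8, offset_bits=4):
        self.num_blocks = num_blocks
        self.offset_bits = offset_bits
        # Keys are tags, ordered from least to most recently used.
        self.blocks = OrderedDict()

    def access(self, addr):
        """Return True on a hit; on a miss, evict the LRU block if the cache is full."""
        tag = addr >> self.offset_bits  # 12-bit tag for a 16-bit address
        if tag in self.blocks:
            self.blocks.move_to_end(tag)
            return True
        if len(self.blocks) >= self.num_blocks:
            self.blocks.popitem(last=False)  # assumed policy: evict least recently used
        self.blocks[tag] = True
        return False


if __name__ == "__main__":
    cache = FullyAssociativeCache()
    for addr in (0x0000, 0x0080, 0x0000):
        print(hex(addr), "hit" if cache.access(addr) else "miss")
```

Any block can hold any address, so 0x0000 and 0x0080 coexist and the third access hits. The dictionary lookup stands in for the parallel tag comparison that a CAM performs in hardware.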

  • Cache implementations (cont'd)

    Set-associative cache (often 1st-level cache)

    [Figure: a 16-bit memory address split into a 10-bit tag, a 2-bit set index and a 4-bit byte offset. The set index selects one set of cache blocks (16 bytes of data each); the address tag is compared with the tags in that set to decide a hit. Main memory is shown as Block0 to Block15 of 16 bytes each.]

    Performance similar to fully associative cache but less expensive
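The set-associative scheme combines both earlier ideas: the set index picks a set directly, and the tag is searched only within that set. The sketch below uses the 2-bit set index and 4-bit offset from the figure; two blocks per set (8 blocks over 4 sets) and LRU replacement within a set are assumptions, and the class name is hypothetical.

```python
class SetAssociativeCache:
    def __init__(self, num_sets=4, ways=2, offset_bits=4):
        self.num_sets = num_sets
        self.ways = ways  # assumed: 2 blocks per set
        self.offset_bits = offset_bits
        # Each set is a list of tags, least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit; on a miss, evict the LRU block of the selected set."""
        block = addr >> self.offset_bits
        index = block % self.num_sets   # 2-bit set index
        tag = block // self.num_sets    # remaining 10 bits
        tags = self.sets[index]
        hit = tag in tags
        if hit:
            tags.remove(tag)
        elif len(tags) == self.ways:
            tags.pop(0)  # assumed policy: evict least recently used
        tags.append(tag)
        return hit


if __name__ == "__main__":
    cache = SetAssociativeCache()
    # 0x0000, 0x0040 and 0x0080 all map to set 0.
    for addr in (0x0000, 0x0040, 0x0000, 0x0080, 0x0040):
        print(hex(addr), "hit" if cache.access(addr) else "miss")
```

Two conflicting blocks can now live side by side, which removes most of the conflict misses of a direct mapped cache while comparing only `ways` tags per access.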

  • Virtually vs physically addressed cache

    Virtually addressed

    [Figure: block diagram with CPU, I or D cache, MMU (VA to PA translation) and memory]

      • Parallel VA translation and cache lookup
      • Aliasing problem
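The aliasing problem can be shown with two virtual addresses that translate to the same physical address but select different cache blocks when the cache is indexed with virtual address bits. Everything concrete in this sketch is an assumption for illustration: 4 KiB pages, a 16 KiB direct mapped cache with 32-byte blocks, and a two-entry page table.

```python
PAGE_BITS = 12   # assumed: 4 KiB pages
LINE_BITS = 5    # assumed: 32-byte cache blocks
INDEX_BITS = 9   # assumed: 512 blocks, so the index uses address bits 5..13

# Hypothetical page table: two virtual pages share one physical page.
page_table = {0x10: 0x7, 0x21: 0x7}


def translate(va):
    """MMU: replace the virtual page number by the physical page number."""
    vpn = va >> PAGE_BITS
    offset = va & ((1 << PAGE_BITS) - 1)
    return (page_table[vpn] << PAGE_BITS) | offset


def cache_index(addr):
    """Block index taken from whichever address (VA or PA) is given."""
    return (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)


if __name__ == "__main__":
    va1 = (0x10 << PAGE_BITS) | 0x40
    va2 = (0x21 << PAGE_BITS) | 0x40
    print("same physical address:", translate(va1) == translate(va2))
    print("virtual index:", cache_index(va1), "vs", cache_index(va2))
    print("physical index:", cache_index(translate(va1)), "vs", cache_index(translate(va2)))
```

Because two index bits lie above the page offset, the same physical data can end up in two cache blocks under virtual indexing, and a write through one alias is not seen through the other. Indexing with the physical address avoids this, but the lookup then has to wait for the translation.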
