
  • Database Architectures for New Hardware

    a tutorial by Anastassia Ailamaki

    Database Group, Carnegie Mellon University

    http://www.cs.cmu.edu/~natassa

  • Focus of this tutorial

    DB workload execution on a modern computer

    [Chart: execution time (busy vs. idle), 0% to 100%, for an ideal workload, sequential scan and index scan (DSS), and OLTP]

    DBMS can run MUCH faster if they use new hardware efficiently

  • Trends in Processor Performance

    Scaling the number of transistors plus innovative microarchitecture delivers higher performance, despite technological hurdles!

    Processor speed doubles every 18 months

  • Trends in Memory (DRAM) Performance

    Memory capacity increases exponentially, but DRAM fabrication primarily targets density: speed increases only linearly

    Larger, but not correspondingly faster, memories

    [Charts: DRAM size by year of introduction, 64KB (1980) through 16MB, 64MB, 512MB, 4GB (2005); and DRAM speed trends 1980-1994, slowest/fastest RAS, CAS, and cycle time in ns for 64 Kbit through 64 Mbit chips]

  • The Memory/Processor Speed Gap

    A trip to memory = thousands of instructions!

    [Chart: processor cycles per instruction (10 on VAX/1980, 0.25 on PPro/1996, 0.0625 projected for 2010+) vs. cycles per access to DRAM (6 on VAX/1980, 80 on PPro/1996, and still climbing): the gap keeps widening]

  • New Hardware

    Caches trade off capacity for speed
    Exploit instruction/data locality
    Demand-fetch and wait for data

    [Figure: hierarchy from CPU through L1 (64K, ~1 clk) and L2 (2M, ~10 clk) / L3 (32M, ~100 clk) to main memory (4GB to 1TB, ~1000 clk)]

    [ADH99]: running the top 4 database systems, at most 50% CPU utilization

    But wait a minute... isn't I/O the bottleneck???

  • Modern Storage Managers

    Several decades of work to hide I/O:

    Asynchronous I/O + prefetch & postwrite: overlap I/O latency with useful computation

    Parallel data access: partition data on modern disk arrays [PAT88]

    Smart data placement / clustering: improve data locality, maximize parallelism, exploit hardware characteristics

    ...and larger main memories fit more data: 1MB in the 80's, 10GB today, TBs coming soon

    DB storage managers efficiently hide I/O latencies

  • Why should we (databasers) care?

    [Chart: cycles per instruction, theoretical minimum 0.33, Desktop/Engineering (SPECInt) 0.8, Decision Support (TPC-H) 1.4, Online Transaction Processing (TPC-C) 4]

    Database workloads under-utilize hardware

    New bottleneck: processor-memory delays

  • Breaking the Memory Wall

    Wish for a database architecture:
    that uses hardware intelligently
    that won't fall apart when new computers arrive
    that will adapt to alternate configurations

    Efforts from multiple research communities:
    Cache-conscious data placement and algorithms
    Instruction stream optimizations
    Novel database software architectures
    Novel hardware designs (covered briefly)

  • Detailed Outline

    Introduction and Overview

    New Hardware
      Execution Pipelines
      Cache memories

    Where Does Time Go?
      Measuring Time (Tools and Benchmarks)
      Analyzing DBs: Experimental Results

    Bridging the Processor/Memory Speed Gap
      Data Placement
      Access Methods
      Query Processing Algorithms
      Instruction Stream Optimizations
      Staged Database Systems
      Newer Hardware

    Hip and Trendy
      Query co-processing
      Databases on MEMStore

    Directions for Future Research

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
    Hip and Trendy
    Directions for Future Research

  • This Section's Goals

    Understand how a program is executed
      How new hardware parallelizes execution
      What the pitfalls are

    Understand why database programs do not take advantage of microarchitectural advances

    Understand memory hierarchies
      How they work
      What parameters affect program behavior
      Why they are important to database performance

  • Outline

    Introduction and Overview
    New Hardware
      Execution Pipelines
      Cache memories
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
    Hip and Trendy
    Directions for Future Research

  • Sequential Program Execution

    Sequential code: i1: xxxx; i2: xxxx; i3: xxxx is executed in order i1, then i2, then i3

    The precedences are overspecifications: sufficient, NOT necessary for correctness

    Instruction-level Parallelism (ILP): execute i1, i2, i3 side by side
      pipelining
      superscalar execution

    Modern processors do both!

  • Pipelined Program Execution

    Pipeline stages: fetch, decode, execute, memory, write

    [Figure: instructions 1-3 flow through the F/D/E/M/W stages across cycles t0-t5, each one stage behind the previous; the instruction stream is FETCHed, EXECUTEd, and RETIREd (results written)]

    Tpipeline = Tbase / 5

  • Pipeline Stalls (delays)

    Reason: dependencies between instructions, e.g.
      Inst1: r1 ← r2 + r3
      Inst2: r4 ← r1 + r2

    Read-after-write (RAW): Inst2 stalls in D and E until Inst1 writes r1

    Peak instructions-per-cycle (IPC) = 1 (CPI = 1); peak ILP = d, the pipeline depth

    DB programs: frequent data dependencies

  • Higher ILP: Superscalar Out-of-Order

    Issue up to n instructions per cycle: peak IPC = n (CPI = 1/n), peak ILP = d*n

    [Figure: instruction groups Inst1..n, Inst(n+1)..2n, Inst(2n+1)..3n move through the pipeline together, at most n per stage]

    Out-of-order (as opposed to "in-order") execution:
      Shuffle execution of independent instructions
      Retire instruction results using a reorder buffer
      DB: 1.5x faster than in-order [KPH98,RGA98]

    Limited ILP opportunity

  • Even Higher ILP: Branch Prediction

    Which instruction block to fetch? Evaluating a branch condition causes a pipeline stall:

      xxxx
      if C goto B
      A: xxxx ...
      B: xxxx ...

    IDEA: speculate the branch while evaluating C!
      Record branch history in a buffer; predict A (false: fetch A) or B (true: fetch B)
      If correct, saved a (long) delay!
      If incorrect, misprediction penalty = flush the pipeline, fetch the correct instruction stream

    Excellent predictors exist (97% accuracy!), but mispredictions are costlier in OOO:
      1 lost cycle = more than 1 missed instruction!

    DB programs: long code paths => mispredictions

  • Outline

    Introduction and Overview
    New Hardware
      Execution Pipelines
      Cache memories
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
    Hip and Trendy
    Directions for Future Research

  • Memory Hierarchy

    Make the common case fast
      common: temporal & spatial locality
      fast: smaller, more expensive memory

    Keep recently accessed blocks (temporal locality)
    Group data into blocks (spatial locality)

    [Figure: registers, caches, memory, disks; faster toward the top, larger toward the bottom]

    DB programs: >50% load/store instructions

  • Cache Contents

    Keep recently accessed blocks in "cache lines": address tag | state | data

    On a memory read:
      if the incoming address tag = a stored address tag, then
        HIT: return data
      else
        MISS: choose & displace a line in use, fetch the new (referenced) block from memory into that line, return data

    Important parameters: cache size, cache line size, cache associativity (a minimal lookup sketch follows below)
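    A minimal C sketch of this lookup logic, assuming a direct-mapped cache with 64-byte lines; all names and sizes are illustrative, not any real processor's implementation:

      #include <stdint.h>
      #include <string.h>

      #define NLINES 512
      #define LINE_BYTES 64

      typedef struct {
          int valid;
          uint64_t tag;                     /* stored address tag */
          uint8_t data[LINE_BYTES];
      } cache_line_t;

      static cache_line_t cache[NLINES];

      /* Return a pointer to the cached copy of the byte at addr. */
      uint8_t *cache_read(uint64_t addr, const uint8_t *memory)
      {
          uint64_t block = addr / LINE_BYTES;
          uint64_t line  = block % NLINES;   /* direct-mapped: one candidate line */
          uint64_t tag   = block / NLINES;
          cache_line_t *l = &cache[line];

          if (!(l->valid && l->tag == tag)) {           /* MISS */
              l->valid = 1;                             /* displace the line in use */
              l->tag = tag;                             /* fetch the referenced block */
              memcpy(l->data, memory + block * LINE_BYTES, LINE_BYTES);
          }
          return &l->data[addr % LINE_BYTES];           /* HIT falls through here */
      }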

  • Cache Associativity

    Associativity = the number of lines a block can go to (the set size); replacement within a set is LRU or random

    Fully-associative: a block goes in any frame
    Set-associative: a block goes in any frame of exactly one set
    Direct-mapped: a block goes in exactly one frame

    [Figure: an 8-frame cache organized fully-associatively, as 4 sets of 2 lines, and direct-mapped]

    lower associativity ⇒ faster lookup (see the address split sketched below)
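    To make the mapping concrete, here is how an address splits into offset, set, and tag for a hypothetical 4-way set-associative cache (32 KB, 64-byte lines, hence 128 sets); the constants are assumptions for illustration:

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
          const uint64_t addr = 0x7ffe1234;
          const unsigned line_bytes = 64, ways = 4, capacity = 32 * 1024;
          const unsigned nsets = capacity / (line_bytes * ways);    /* 128 */

          unsigned offset = addr & (line_bytes - 1);      /* byte within the line */
          unsigned set    = (addr / line_bytes) % nsets;  /* block may use any of the 4 frames here */
          uint64_t tag    = (addr / line_bytes) / nsets;  /* compared against all 4 frames in the set */

          printf("offset=%u set=%u tag=%llx (search %u frames)\n",
                 offset, set, (unsigned long long)tag, ways);
          return 0;
      }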

  • Miss Classification (3+1 C's)

    compulsory (cold): "cold miss" on the first access to a block (defined as: miss in an infinite cache)

    capacity: misses occur because the cache is not large enough (defined as: miss in a fully-associative cache)

    conflict: misses occur because of a restrictive mapping strategy; only in set-associative or direct-mapped caches (defined as: not attributable to compulsory or capacity)

    coherence: misses occur because of sharing among multiprocessors

    Cold misses are unavoidable; capacity, conflict, and coherence misses can be reduced

  • Lookups in Memory Hierarchy

    miss rate = # misses / # references

    [Figure: execution pipeline fed by split L1 I-cache and D-cache, backed by a unified L2 cache, backed by main memory]

    L1: split, 16-64K each; as fast as the processor (1 cycle)
    L2: unified, 512K-8M; an order of magnitude slower than L1
    Memory: unified, 512M-8GB; ~400 cycles (Pentium 4)
    (there may be more cache levels)

    Trips to memory are the most expensive ($$$)

  • Miss Penalty

    avg(access time) = t(hit) + miss rate * avg(miss penalty)

    where the miss penalty is the time to fetch and deliver the block

    L1D: low miss penalty if L2 hits (partly overlapped with OOO execution)
    L1I: in the critical execution path; cannot be overlapped with OOO execution
    L2: high penalty (a trip to memory)

    Modern caches: non-blocking

    DB: long code paths, large data footprints
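    A worked illustration with hypothetical values: if t(hit) = 1 cycle, miss rate = 5%, and avg(miss penalty) = 20 cycles, then avg(access time) = 1 + 0.05 * 20 = 2 cycles, so a 5% miss rate already doubles the effective access time.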

  • Typical Processor Microarchitecture

    [Figure: processor with I-Unit/E-Unit and registers, I-TLB/D-TLB, split L1 I-cache and D-cache, on-chip L2 cache (SRAM), off-chip L3 cache (SRAM), and main memory (DRAM)]

    TLB: Translation Lookaside Buffer (page table cache)

    Will assume a 2-level cache in this talk

  • Summary: New Hardware

    Fundamental goal in processor design: max ILP
      Pipelined, superscalar, speculative execution
      Out-of-order execution
      Non-blocking caches
      Dependencies in the instruction stream lower ILP

    Deep memory hierarchies
      Caches important for database performance
      Level 1 instruction cache in the critical execution path
      Trips to memory most expensive

    DB workloads on new hardware
      Too many load/store instructions
      Tight dependencies in the instruction stream
      Algorithms not optimized for cache hierarchies
      Long code paths
      Large instruction and data footprints

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
    Hip and Trendy
    Directions for Future Research

  • This Section's Goals

    Hardware takes time: how do we measure time?

    Understand how to efficiently analyze the microarchitectural behavior of database workloads
      Should we use simulators? When? Why?
      How do we use processor counters?
      Which tools are available for analysis?
      Which database systems/benchmarks to use?

    Survey experimental results on workload characterization

    Discover what matters for database performance

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
      Measuring Time (Tools and Benchmarks)
      Analyzing DBs: Experimental Results
    Bridging the Processor/Memory Speed Gap
    Hip and Trendy
    Directions for Future Research

  • Simulator vs. Real Machine

    Real machine:
      Limited to available hardware counters/events
      Limited to (real) hardware configurations
      Fast (real-life) execution: enables testing large & more realistic workloads
      Sometimes not repeatable
      Tool: performance counters

    Simulator:
      Can measure any event
      Can vary hardware configurations
      (Too) slow execution: often forces scaled-down/simplified workloads
      Always repeatable
      Tools: Virtutech Simics, SimOS, SimpleScalar, etc.

    In practice: real-machine experiments to locate problems, simulation to evaluate solutions

  • Hardware Performance Counters

    What are they?
      Special-purpose registers that keep track of programmable events
      Non-intrusive counts "accurately" measure processor events
      Software APIs handle event programming/overflow
      GUI interfaces built on top of the APIs provide higher-level analysis

    What can they count?
      Instructions, branch mispredictions, cache misses, etc.
      No standard set exists

    Issues that may complicate life:
      They provide only raw counts; analysis must be done by the user or by tools
      Made specifically for each processor; even processor families may have different interfaces
      Vendors are reluctant to support them, since they do not contribute to profit

  • Evaluating Behavior Using HW Counters

    Stall time (cycle) counters
      very useful for time breakdowns (e.g., instruction-related stall time)

    Event counters
      useful to compute ratios (e.g., # misses in the L1 data cache)

    Need to understand the counters before using them
      Often not easy from the documentation alone
      Best way: microbenchmarks (run programs with pre-computed event counts)
      E.g., strided accesses to an array, as sketched below
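    A minimal strided-access microbenchmark of the kind suggested above, for sanity-checking miss counters; sizes and the counter start/stop points are illustrative assumptions. With STRIDE at least the line size, each access should touch a new line, so the L1D miss counter should read roughly ARRAY_BYTES/STRIDE:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define ARRAY_BYTES (64 * 1024 * 1024)   /* much larger than any cache */
      #define STRIDE 64                        /* one access per 64-byte line */

      int main(void)
      {
          char *a = malloc(ARRAY_BYTES);
          long sum = 0;
          memset(a, 1, ARRAY_BYTES);           /* fault pages in before measuring */

          /* start counters here, via the processor's counter API */
          for (size_t i = 0; i < ARRAY_BYTES; i += STRIDE)
              sum += a[i];
          /* stop counters: expect ~ARRAY_BYTES/STRIDE L1D misses */

          printf("%ld\n", sum);                /* keep the loop from being optimized away */
          free(a);
          return 0;
      }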

  • Example: Intel PPro/PIII

    CPU_CLK_UNHALTED: cycles ("time")
    INST_RETIRED: instructions
    RESOURCE_STALLS: resource stalls
    PARTIAL_RAT_STALLS: dependence stalls
    IFU_MEM_STALL: instruction-related stalls
    IFU_IFETCH_MISS: L1 instruction misses
    ITLB_MISS: TLB misses
    BR_MISS_PRED_RETIRED: branch mispredictions
    BR_INST_DECODED: branches
    L2_LINES_IN: L2 misses
    DCU_LINES_IN: L1 data (L1D) misses
    DATA_MEM_REFS: L1 data (L1D) accesses

    Lots more detail, measurable events, and statistics
    Often more than one way to measure the same thing

  • Producing Time Breakdowns

    Determine benchmark/methodology (more later)
    Devise formulae to derive useful statistics
    Determine (and test!) software
      E.g., Intel VTune (GUI, sampling), or emon
      Publicly available & universal tools (e.g., PAPI [DMM04])

    Determine the time components T1...Tn
    Determine how to measure each using the counters
    Compute execution time as the sum (a sketch follows below)

    Verify model correctness
      Measure execution time (in # cycles)
      Ensure measured time = computed time (or almost)
      Validate computations using redundant formulae
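    A sketch of such a breakdown model in C, assuming the counter values were already read into the struct below; all names and the per-event penalties are hypothetical, machine-dependent estimates rather than documented constants:

      typedef struct {
          double cycles;           /* CPU_CLK_UNHALTED */
          double instr_stalls;     /* IFU_MEM_STALL (measured stall cycles) */
          double resource_stalls;  /* RESOURCE_STALLS */
          double br_mispred;       /* BR_MISS_PRED_RETIRED (event count) */
          double l1d_misses;       /* DCU_LINES_IN */
          double l2_misses;        /* L2_LINES_IN */
      } counters_t;

      #define BR_PENALTY   17.0    /* estimated cycles per event */
      #define L1D_PENALTY  10.0
      #define L2_PENALTY  100.0

      double stall_cycles(const counters_t *c)
      {
          double mem = c->l1d_misses * L1D_PENALTY
                     + c->l2_misses  * L2_PENALTY
                     + c->instr_stalls;               /* L1I stalls counted directly */
          return mem + c->br_mispred * BR_PENALTY + c->resource_stalls;
      }

      /* Computation = measured cycles minus stalls; this simple model ignores
       * overlap, so check that the sum tracks the measured execution time. */
      double computation_cycles(const counters_t *c)
      {
          return c->cycles - stall_cycles(c);
      }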

  • Execution Time Breakdown Formula [ADH99]

    Stalls = Memory + Branch Mispredictions + Hardware Resources

    First approximation: Execution Time = Computation + Stalls
    Refined: Execution Time = Computation + Stalls - Overlap

    Overlap opportunity, e.g.:
      Load A
      D = B + C
      Load E

  • Where Does Time Go (memory)? [ADH99]

    Memory stalls = Σn (stalls at cache level n)

    L1I: instruction lookup missed in L1I, hit in L2
    L1D: data lookup missed in L1D, hit in L2
    L2: instruction or data lookup missed in L1, missed in L2, hit in memory

  • What to Measure?

    Decision Support Systems (DSS: TPC-H)
      Complex queries, low concurrency
      Read-only (with rare batch updates)
      Sequential access dominates
      Repeatable (unit of work = query)

    On-Line Transaction Processing (OLTP: TPC-C, ODB)
      Transactions with simple queries, high concurrency
      Update-intensive
      Random access frequent
      Not repeatable (unit of work = 5s of execution after rampup)

    Often too complex to provide useful insight

  • Microbenchmarks [KPH98,ADH99,KP00,SAF04]

    What matters is the basic execution loops

    Isolate three basic operations:
      Sequential scan (no index)
      Random access on records (non-clustered index)
      Join (access on two tables)

    Vary parameters:
      selectivity, projectivity, # of attributes in the predicate
      join algorithm, isolated phases
      table size, record size, # of fields, type of fields

    Determine behavior and trends

    Microbenchmarks can efficiently mimic TPC microarchitectural behavior!
    Widely used to analyze query execution; excellent for microarchitectural analysis

  • On Which DBMS to Measure? [ADH02]

    Commercial DBMS are most realistic
      Difficult to set up; may need help from the companies

    Prototypes can evaluate techniques
      Shore [ADH01] (for PAX), PostgreSQL [TLZ97] (eval)

    Tricky: is prototype behavior similar to commercial DBMS?

    [Charts: execution time breakdown (computation, memory, branch mispredictions, resources) and memory stall time breakdown (L1/L2 data and instruction) for DBMS A, B, C, D, and Shore]

    Shore: YES!

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
      Measuring Time (Tools and Benchmarks)
      Analyzing DBs: Experimental Results
    Bridging the Processor/Memory Speed Gap
    Hip and Trendy
    Directions for Future Research

  • DB Execution Time Breakdown [ADH99,BGB98,BGN00,KPH98]

    PII Xeon, NT 4.0; four DBMS: A, B, C, D

    [Chart: execution time split into computation, memory, branch mispredictions, and resource stalls, for sequential scan (TPC-D) and index scan (TPC-C)]

    At least 50% of the cycles are spent on stalls; memory is the major bottleneck

    Branch mispredictions increase cache misses!

  • DSS/OLTP Basics: Cache Behavior [ADH99,ADH01]

    PII Xeon running NT 4.0, measured with performance counters; four commercial database systems: A, B, C, D

    [Charts: memory stall time breakdown (L1 data, L2 data, L1 instruction, L2 instruction) for table scan, clustered index scan, join (no index), and non-clustered index, per DBMS]

    Bottlenecks: data in L2, instructions in L1
    Random access (OLTP): L1I-bound

  • Why Not Increase L1I Size? [HA04]

    L1I is in the critical execution path: a slower L1I means a slower clock

    [Chart: cache sizes by year introduced, 1996-2004; the L1-I cache stays in the tens of KB while the max on-chip L2/L3 cache grows from ~100 KB toward 10 MB]

    Problem: a larger cache is typically a slower cache
      Not a big problem for L2

    L1I size is stable; L2 sizes keep increasing. What is the effect on performance?

  • Increasing L2 Cache Size [BGB98,KPH98]

    DSS: performance improves as the L2 cache grows
    Not as clear a win for OLTP on multiprocessors:
      Reduce cache size ⇒ more capacity/conflict misses
      Increase cache size ⇒ more coherence misses

    [Chart: % of L2 cache misses to dirty data in another processor's cache, for 1, 2, and 4 processors and 256KB/512KB/1MB L2 caches; the share grows with both]

    Larger L2: a trade-off for OLTP

  • Summary: Where Does Time Go?

    Goal: discover bottlenecks
      Hardware performance counters ⇒ time breakdown
      Tools available for access and analysis (+ simulators)
      Run commercial DBMS and equivalent prototypes
      Microbenchmarks offer valuable insight

    Database workloads: more than 50% stalls
      Mostly due to memory delays
      Cannot always reduce stalls by increasing cache size

    Crucial bottlenecks
      Data accesses to the L2 cache (esp. for DSS)
      Instruction accesses to the L1 cache (esp. for OLTP)

  • How to Address Bottlenecks

    D (D-cache data stalls): DBMS
    I (I-cache instruction stalls): DBMS + Compiler
    B (Branch mispredictions): Compiler + Hardware
    R (Hardware resources): Hardware

    Next: optimizing cache accesses

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
    Hip and Trendy
    Directions for Future Research

  • This Section's Goals

    Survey techniques to improve locality
      Relational data
      Access methods

    Survey new query processing algorithms

    Present a new database system architecture

    Briefly explain instruction stream optimizations

    Show how much a good understanding of the platform can achieve

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
      Data Placement
      Access Methods
      Query Processing
      Instruction Stream Optimizations
      Staged Database Systems
      Newer Hardware
    Hip and Trendy
    Directions for Future Research

  • Current Database Storage Managers

    Same layout on disk and in memory, despite a multi-level storage hierarchy:
      different devices at each level (CPU cache, main memory, non-volatile storage)
      different "optimal" access on each device

    Variable workloads and access patterns:
      OLTP: full-record access
      DSS: partial-record access
      no optimal "universal" layout

    Goal: reduce data traffic in the memory hierarchy

  • "Classic" Data Layout on Disk Pages

    NSM (N-ary Storage Model, or slotted pages):
      Records are stored sequentially
      Attributes of a record are stored together

    [Figure: relation R (RID, SSN, Name, Age) with records (1, 1237, Jane, 30), (2, 4322, John, 45), (3, 1563, Jim, 20), (4, 7658, Susan, 52), (5, 2534, Leon, 43), (6, 8791, Dan, 37), and its NSM page: PAGE HEADER followed by record headers RH1-RH4, each record's fields stored contiguously]

  • NSM in Memory Hierarchy

    [Figure: the NSM page is copied as-is from disk to main memory; in the CPU cache, each line (e.g., "4322 Jo", "hn 45 1563", "7658 Jim 20", "Susan 52") mixes fields of several attributes]

    BEST for a query that accesses all attributes (full-record access)

    For a query that evaluates only attribute "age" (partial-record access), e.g. select name from R where age > 50:
      Wastes I/O bandwidth (fixed page layout)
      Low spatial locality at the CPU cache

    Optimized for full-record access; slow partial-record access

  • Decomposition Storage Model (DSM) [CK85]

    Partition the original table into n 1-attribute sub-tables

    [Figure: table R (EID, Name, Age) with six records, split vertically into per-attribute sub-tables]

  • Decomposition Storage Model (DSM) [CK85]

    Partition the original table into n 1-attribute sub-tables
    Each sub-table is stored separately in 8KB NSM pages

    [Figure: three 8KB pages; R1 holds (RID, EID) pairs (1, 1237), (2, 4322), (3, 1563), ...; R2 holds (RID, Name) pairs (1, Jane), (2, John), (3, Jim), ...; R3 holds (RID, Age) pairs (1, 30), (2, 45), (3, 20), ...]

  • DSM in Memory Hierarchy

    [Figure: only the Age sub-table pages travel from disk to memory to the CPU cache; each cache line holds values of a single attribute]

    BEST for a query that touches one attribute (partial-record access), e.g. the "age" predicate of select name from R where age > 50
    Costly for full-record access: reconstructing the full record may incur random I/O

    Optimized for partial-record access; slow full-record access

  • Repairing NSM's Cache Performance

    We need a data placement that...
      Eliminates unnecessary memory accesses
      Improves inter-record locality
      Keeps a record's fields together
      Does not affect NSM's I/O performance

    and, most importantly, is...
      low-implementation-cost, high-impact

  • Partition Attributes Across (PAX) [ADH01]

    Idea: partition data within the page for spatial locality

    [Figure: the NSM page vs. the PAX page for the same records; in the PAX page, the header is followed by one "minipage" per attribute: the SSNs (1237, 4322, 1563, 7658), then the names (Jane, John, Jim, Susan), then the ages (30, 45, 20, 52)]

    (a toy PAX-style page layout is sketched below)
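    A toy fixed-schema PAX-style page in C, to make the minipage idea concrete; real PAX pages keep per-minipage offsets in the header and handle variable-length fields, so this layout is an illustrative assumption:

      #include <stdint.h>

      #define NREC 100                /* records per page, illustrative */

      typedef struct {
          uint16_t nrecords;          /* page header: # records stored */
          /* one minipage per attribute: values of the SAME attribute are
           * contiguous, so scanning Age touches only Age cache lines */
          uint32_t ssn[NREC];         /* minipage 1 */
          char     name[NREC][16];    /* minipage 2 */
          uint8_t  age[NREC];         /* minipage 3 */
      } pax_page_t;

      /* Predicate evaluation touches only the age minipage. */
      int count_older_than(const pax_page_t *p, uint8_t lo)
      {
          int n = 0;
          for (int i = 0; i < p->nrecords; i++)
              if (p->age[i] > lo)
                  n++;
          return n;
      }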

  • PAX in Memory Hierarchy [ADH01]

    [Figure: the PAX page travels unchanged from disk to main memory; only the age minipage is brought into the CPU cache, e.g. one line holding 30, 52, 45, 20]

    Partial-record access in memory: cheap
    Full-record access on disk: BEST (page contents do not change)
      e.g. select name from R where age > 50

    Optimizes CPU cache-to-memory communication; retains NSM's I/O behavior

  • PAX Performance Results (Shore)

    Query: select avg(ai) from R where aj >= Lo and aj <= Hi

    [Charts: NSM vs. PAX page layout; cache data stall cycles per record (L1 and L2 data stalls) and execution time breakdown in clock cycles per record (computation, memory, branch mispredictions, hardware resources); PAX sharply reduces the data stalls]

  • Dynamic PAX: Data Morphing [HP03]

    PAX random access: more cache misses per record
      Store attributes accessed together contiguously
      Dynamically update partitions as workloads change

    Optimize total cost based on cache misses
      Partition algorithms: naive & hill-climbing

    Fewer cache misses
      Better projectivity and scalability for index scan queries
      Up to 45% faster than NSM & 25% faster than PAX
      Same I/O performance as PAX and NSM

    Future work: how to handle conflicts?

  • Alternatively: Repair DSM's I/O Behavior

    We like DSM for partial-record access; we like NSM for full-record access

    Solution: Fractured Mirrors [RDS02]

    1. Get data placement right: a sparse B-Tree on ID clusters each attribute's values

    2. Faster record reconstruction: instead of record- or page-at-a-time, a chunk-based merge algorithm (sketched below):
      1. Read in segments of M pages (a "chunk") per partition
      2. Merge the segments in memory
      3. Requires (N*K)/M disk seeks
      4. For a memory budget of B pages, each partition gets B/N pages in a chunk

    [Chart: Lineitem (TPC-H) 1GB scan time in seconds vs. # of attributes; NSM is flat, page-at-a-time reconstruction grows steeply, chunk-merge stays close to NSM]
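    A rough C sketch of that chunk-merge loop under the stated budget; the I/O and record stitching are hidden behind hypothetical read_pages/emit_record helpers, and all partitions are assumed to have the same page count:

      #define N 15                        /* # of attribute partitions */
      #define B 1500                      /* memory budget in pages */

      void read_pages(int partition, long npages);   /* hypothetical I/O helper */
      void emit_record(long rid);                    /* hypothetical output helper */

      void chunk_merge(long records_per_page, long total_records)
      {
          long chunk_pages = B / N;       /* pages per partition per chunk */
          long done = 0;
          while (done < total_records) {
              /* read the next chunk_pages pages of EVERY partition
               * sequentially: one seek per partition per chunk */
              for (int p = 0; p < N; p++)
                  read_pages(p, chunk_pages);
              /* merge in memory: stitch the i-th value of each partition
               * back into the i-th full record */
              long in_chunk = chunk_pages * records_per_page;
              for (long i = 0; i < in_chunk && done < total_records; i++, done++)
                  emit_record(done);
          }
      }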

  • Fractured Mirrors

    3. Smart mirroring: keep one NSM copy and one DSM copy of the database, and route each query to the friendlier layout

    [Figure: Disk 1 and Disk 2, each holding an NSM copy and a DSM copy]

    Achieves 2-3x speedups on TPC-H
    Needs 2 copies of the database

    Future work:
      A new optimizer
      Smart buffer pool management
      Updates

  • Summary (no replication)

    Page layout: cache-memory performance / memory-disk performance, for full- and partial-record access:
      NSM: cache: good full-record, poor partial-record; disk: good full-record, poor partial-record
      DSM: cache: poor full-record, good partial-record; disk: poor full-record, good partial-record
      PAX: cache: good full-record, good partial-record; disk: good full-record, poor partial-record

    Need a new placement method:
      Efficient full- AND partial-record accesses
      Maximize utilization at all levels of the memory hierarchy

    Difficult!!! Different devices/access methods; different workloads on the same database

  • The Fates Storage Manager [SAG03,SSS04,SSS04a]

    IDEA: decouple the layout at each level!
      CPU cache / main memory: Clotho (operators, main-memory manager, buffer pool)
      main memory / non-volatile storage: Lachesis storage manager
      disk array: Atropos LV manager, with data placed directly via scatter/gather I/O

    Memory does not need to store full NSM pages

  • Clotho: Memory Stores PAX Minipages [SSS04]

    Example: select EID from R where AGE > 30

    [Figure: two on-disk PAX-like pages hold all attributes; the in-memory "skeleton" page, tailored to the query, holds only the EID and AGE minipages of both pages]

    On-disk page: PAX-like layout, aligned to block boundaries

    In-memory page: just the data you need
      Query-specific pages!
      Great cache performance

    Decoupled layout fits different hardware

    Low reconstruction cost: done at the I/O level, guaranteed by Lachesis and Atropos [SSS04]

    A new buffer pool manager handles sharing

  • CSM: Best-Case Performance of DSM and NSM [SSS04]

    Table: a1 ... a15 (float); query: select a1, ... from R where a1 < Hi

    [Chart: table scan time (s) vs. query payload (1-15 attributes) for NSM, DSM, PAX, and CSM; CSM tracks the better of DSM (narrow payloads) and NSM/PAX (wide payloads)]

    TPC-H: outperforms DSM by 20% to 2x
    TPC-C: comparable to NSM (6% lower throughput)

    CSM page layout: good full- and partial-record access at both the cache-memory and memory-disk levels

  • Data Placement: Summary

    Smart data placement increases spatial locality
      Research targets table (relation) data
      Goal: reduce the number of non-cold cache misses
      Techniques focus on grouping attributes into cache lines for quick access

    PAX, Data Morphing: cache optimization techniques
    Fractured Mirrors: cache-and-disk optimization
    Fates DB storage manager: independent data layout support across the entire memory hierarchy

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
      Data Placement
      Access Methods
      Query Processing
      Instruction Stream Optimizations
      Staged Database Systems
      Newer Hardware
    Hip and Trendy
    Directions for Future Research

  • Main-Memory Tree Indexes

    T-Trees: proposed in the mid-80s for MMDBs [LC86]
      Aim: balance space overhead with search time
      Uniform-memory-access assumption (no caches)

    Main-memory B+-Trees: better cache performance [RR99]
      Node width = cache line size (32-128B)
      Minimizes the number of cache misses per search
      But fanout is much lower than in traditional disk-based B-Trees, so the trees are now too deep

    How to make trees shallower?

  • Reducing Pointers for Larger Fanout [RR00]

    Cache-Sensitive B+-Trees (CSB+-Trees)
      Lay out a node's children contiguously
      Eliminate all but one child pointer
      Doubles the fanout of nonleaf nodes (struct sketch below)

    [Figure: a B+-Tree node with keys K1-K2 and per-child pointers vs. a CSB+-Tree node with keys K1-K4 and a single pointer to its contiguous group of children]

    35% faster tree lookups
    Update performance is 30% worse (splits)
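    A structural sketch of the pointer savings in C; key counts and sizes are illustrative (the original design assumes cache-line-sized nodes with small keys and pointers):

      #include <stdint.h>

      /* B+-Tree node: one pointer per child alongside the keys. */
      typedef struct bplus_node {
          uint16_t nkeys;
          int32_t  key[7];
          struct bplus_node *child[8];
      } bplus_node_t;

      /* CSB+-Tree node: all children are allocated contiguously, so ONE
       * pointer to the child group replaces the per-child pointers, and
       * the freed space roughly doubles the key fanout. */
      typedef struct csb_node {
          uint16_t nkeys;
          int32_t  key[14];
          struct csb_node *first_child;    /* child i lives at first_child + i */
      } csb_node_t;

      static inline csb_node_t *csb_child(const csb_node_t *n, int i)
      {
          return n->first_child + i;       /* contiguous layout: just an offset */
      }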

  • What Do We Do with Cold Misses?

    Answer: hide latencies using prefetching

    Prefetching enabled by:
      Non-blocking cache technology
      Prefetch assembly instructions (e.g., pref 0(r2); pref 4(r7); pref 0(r3); pref 8(r9))
      SGI R10000, Alpha 21264, Intel Pentium 4

    Prefetching hides cold cache miss latency
    Efficiently used in pointer-chasing lookups! (sketch below)
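    In C, such instructions are reachable through compiler intrinsics; a sketch using GCC/Clang's __builtin_prefetch to pull in all cache lines of a multi-line object before using it, the pattern the prefetching B+-Trees below rely on (the line size is an assumption):

      #include <stddef.h>

      #define LINE 64   /* assumed cache line size */

      /* Issue a prefetch for every line of an object of 'bytes' bytes. */
      static inline void prefetch_object(const void *p, size_t bytes)
      {
          const char *c = (const char *)p;
          for (size_t off = 0; off < bytes; off += LINE)
              __builtin_prefetch(c + off, 0 /* for read */, 3 /* keep cached */);
      }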

  • Prefetching B+-Trees (pB+-Trees) [CGM01]

    Idea: larger nodes
      Node size = multiple cache lines (e.g., 8 lines)
      Prefetch all lines of a node before searching it
      (Later corroborated by [HP03a])

    The cost to access a node increases only slightly
    Much shallower trees; no other changes required

    >2x better search AND update performance
    Approach complementary to CSB+-Trees!

  • Prefetching B+-Trees [CGM01]

    Goal: faster range scan

    Leaf-parent nodes contain the addresses of all their leaves
    Link the leaf-parent nodes together
    Use this structure to prefetch the leaf nodes

    pB+-Trees: 8x speedup over B+-Trees

  • Fractal Prefetching B+-Trees (fpB+-Trees) [CGM02]

    What if the B+-Tree does not fit in memory?

    Idea: combine memory & disk trees
      Embed cache-optimized trees in disk tree nodes
      fpB+-Trees optimize both cache AND disk
      Key compression to increase fanout [BMR01]

    Compared to disk-based B+-Trees: 80% faster in-memory searches with similar disk performance

  • Bulk Lookups: Buffer Index Accesses [ZR03a]

    Optimize data cache performance (similar technique in [PMH02])

    Idea: increase temporal locality by delaying (buffering) node probes until a group is formed

    Example: NLJ probe stream (r1, 10), (r2, 80), (r3, 15) against a tree with root and subtrees B, C:
      (r1, 10) is buffered before accessing B
      (r2, 80) is buffered before accessing C
      then B is accessed, and its buffered entries are divided among B's children

    3x speedup with enough concurrency

  • Access Methods: Summary

    Optimize B+-Tree pointer-chasing cache behavior
      Reduce node size to a few cache lines
      Reduce pointers for larger fanout (CSB+)
      "Next" pointers at the lowest non-leaf level for easy prefetching (pB+)
      Simultaneously optimize cache and disk (fpB+)
      Bulk searches: buffer index accesses

    Additional work:
      Cache-oblivious B-Trees [BDF00]: optimal bound on memory transfers, regardless of the # of memory levels, block sizes, or level speeds
      Survey of techniques for B-Tree cache performance [GL01]: collects heretofore-folkloric knowledge, e.g. key normalization/compression, alignment, separating keys/pointers

    Lots more to be done in this area; consider interference and scarce resources

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
      Data Placement
      Access Methods
      Query Processing
      Instruction Stream Optimizations
      Staged Database Systems
      Newer Hardware
    Hip and Trendy
    Directions for Future Research

  • Query Processing Algorithms

    Idea: adapt query processing algorithms to caches

    Related work includes:
      Improving data cache performance: sorting, join
      Improving instruction cache performance: DSS applications

  • Sorting: AlphaSort [NBC94]

    In-memory sorting / run generation

    Use quicksort rather than replacement selection
      Sequential rather than random access
      No cache misses once the sub-arrays fit in the cache

    Sort (key-prefix, pointer) pairs rather than whole records (sketch below)

    3x CPU speedup for the Datamation benchmark
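    A sketch of the (key-prefix, pointer) trick in C: quicksort small fixed-size pairs instead of whole records, so the sort touches far fewer cache lines; ties on the prefix would need a full-record comparison, omitted here:

      #include <stdint.h>
      #include <stdlib.h>

      typedef struct {
          uint32_t prefix;   /* first bytes of the sort key */
          void    *rec;      /* pointer to the full record */
      } sort_entry_t;

      static int cmp_entry(const void *a, const void *b)
      {
          uint32_t x = ((const sort_entry_t *)a)->prefix;
          uint32_t y = ((const sort_entry_t *)b)->prefix;
          return (x > y) - (x < y);
      }

      void sort_run(sort_entry_t *entries, size_t n)
      {
          qsort(entries, n, sizeof *entries, cmp_entry);
      }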

  • Hash Join

    Random accesses to the hash table
      Both when building AND when probing!!!

    Poor cache performance: ≥73% of user time is CPU cache stalls [CAG04]

    Approaches to improving cache performance:
      Cache partitioning: maximizes locality
      Prefetching: hides latencies

  • Reducing Non-Cold Misses [SKN94]

    Idea: cache partitioning (similar to I/O partitioning)
      Divide the relations into cache-sized partitions
      Fit the build partition and its hash table into the cache
      Avoid cache misses on hash table visits

    1/3 fewer cache misses, 9.3% speedup
    But >50% of the remaining misses are due to the partitioning overhead itself

  • Hash Joins in Monet

    Monet: a main-memory database system [B02]
      Vertically partitioned tuples (DSM)

    Joining two vertically partitioned relations:
      Join the two join-attribute arrays [BMK99,MBK00]
      Extract the other fields for the output relation [MBN04]

    [Figure: build and probe arrays joined, then payload extracted into the output relation]

  • Monet: Reducing Partition Cost [BMK99,MBK00]

    Join two arrays of simple fields (8-byte tuples)

    The original cache partitioning is single-pass:
      TLB thrashing if # partitions > # TLB entries
      Cache thrashing if # partitions > # lines in the cache

    Solution: multiple passes (radix-cluster [BMK99,MBK00])
      # partitions per pass is small
      Use different bits of the hashed keys for different passes, e.g. 2 bits per pass for a 2-pass partition (one pass is sketched below)

    Plus CPU optimizations: XOR instead of %, simple assignments instead of memcpy

    Up to 2.7x speedup on an Origin 2000; results most significant for small tuples
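    One pass of radix-clustering, sketched in C: partitioning on D bits of the (already hashed) key needs only 2^D output cursors, keeping the partition count below the TLB/cache limits noted above; the tuple layout is an illustrative assumption:

      #include <stdint.h>
      #include <stddef.h>

      #define D 6                           /* bits per pass: 64 partitions */
      #define FANOUT (1u << D)

      typedef struct { uint64_t key; uint64_t payload; } tuple_t;

      void radix_pass(const tuple_t *in, size_t n, tuple_t *out,
                      size_t start[FANOUT + 1], unsigned shift)
      {
          size_t count[FANOUT] = {0};
          size_t cursor[FANOUT];

          for (size_t i = 0; i < n; i++)                 /* 1: histogram */
              count[(in[i].key >> shift) & (FANOUT - 1)]++;
          start[0] = 0;                                  /* 2: prefix sums */
          for (unsigned p = 0; p < FANOUT; p++) {
              start[p + 1] = start[p] + count[p];
              cursor[p] = start[p];
          }
          for (size_t i = 0; i < n; i++) {               /* 3: scatter */
              unsigned p = (in[i].key >> shift) & (FANOUT - 1);
              out[cursor[p]++] = in[i];
          }
      }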

  • Monet: Extracting Payload [MBN04]

    Two ways to extract the payload:
      Pre-projection: copy fields during cache partitioning
      Post-projection: generate a join index, then extract the fields

    Monet: post-projection, using the radix-decluster algorithm for good cache performance

    Post-projection is good for DSM: up to 2x speedup compared to pre-projection
    Post-projection is not recommended for NSM: copying fields during cache partitioning is better

    Paper presented in this conference!

  • Optimizing Non-DSM Hash Joins [CAG04]

    Simplified probing algorithm:

      foreach probe tuple {
          (0) compute bucket number;
          (1) visit header;
          (2) visit cell array;
          (3) visit matching build tuple;
      }

    Each step chases a pointer (hash bucket headers, then hash cells holding (hash code, build tuple ptr), then the build partition), so each probe pays a chain of cache miss latencies

    Idea: exploit inter-tuple parallelism

  • Group Prefetching [CAG04]

      foreach group of probe tuples {
          foreach tuple in group {
              (0) compute bucket number;
              prefetch header;
          }
          foreach tuple in group {
              (1) visit header;
              prefetch cell array;
          }
          foreach tuple in group {
              (2) visit cell array;
              prefetch build tuple;
          }
          foreach tuple in group {
              (3) visit matching build tuple;
          }
      }

    [Figure: the cache miss latencies of a whole group overlap within each stage]

  • Software Pipelining [CAG04]

      Prologue;
      for j = 0 to N-4 do {
          tuple j+3: (0) compute bucket number; prefetch header;
          tuple j+2: (1) visit header; prefetch cell array;
          tuple j+1: (2) visit cell array; prefetch build tuple;
          tuple j:   (3) visit matching build tuple;
      }
      Epilogue;

    [Figure: the stages of consecutive tuples overlap in a pipeline, with a prologue filling it and an epilogue draining it]

  • Prefetching: Performance Results [CAG04]

    Both techniques exhibit similar performance; group prefetching is easier to implement

    [Chart: execution time (Mcycles) for baseline, group prefetching, and software-pipelined prefetching at 150-cycle and 1000-cycle processor-to-memory latencies; the prefetching variants' absolute numbers barely change as latency grows]

    9x speedup over the baseline at 1000 cycles

    Compared to cache partitioning:
      Cache partitioning is costly when tuples are large (>20B)
      Prefetching is about 50% faster than cache partitioning

  • DSS: Reducing I-Misses [PMA01,ZR04]

    Demand-pull execution model: one tuple at a time
      ABABABABABABABABAB...
      If the code footprint of A + B > the L1 instruction cache size: poor instruction cache utilization!

    Solution: process multiple tuples at an operator
      ABBBBBAAAAABBBBB...

    Modify operators to support blocks of tuples [PMA01]
    Or insert "buffer" operators between A and B [ZR04]:
      the "buffer" calls B multiple times, storing intermediate tuple pointers to serve A's requests
      no need to change the original operators (sketch below)

    12% speedup for simple TPC-H queries
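    A sketch of such a buffer operator in C; the get_next-style iterator interface is hypothetical, but the mechanism matches the description above: run B in a tight loop to fill a batch, then serve A from the batch so each operator's code stays cache-resident:

      #include <stddef.h>

      typedef struct tuple tuple_t;
      typedef tuple_t *(*get_next_fn)(void *op);

      typedef struct {
          void       *child;           /* operator B */
          get_next_fn child_next;
          tuple_t    *buf[1024];       /* buffered intermediate tuple pointers */
          size_t      n, pos;
      } buffer_op_t;

      tuple_t *buffer_get_next(buffer_op_t *op)
      {
          if (op->pos == op->n) {      /* refill: call B many times in a row */
              tuple_t *t;
              op->n = op->pos = 0;
              while (op->n < 1024 && (t = op->child_next(op->child)) != NULL)
                  op->buf[op->n++] = t;
              if (op->n == 0)
                  return NULL;         /* child exhausted */
          }
          return op->buf[op->pos++];   /* serve A from the batch */
      }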

  • Concurrency Control [CHK01]

    Multiple CPUs share a tree; lock coupling costs too much:
      Latching a node means writing, true even for readers!!!
      Coherence cache misses due to writes from different CPUs

    Solution: optimistic approach for readers (sketch below)
      Updaters still latch nodes
      Updaters also set node versions
      Readers check the version to ensure correctness

    Search throughput: 5x (= the no-locking case)
    Update throughput: 4x
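    A seqlock-style sketch of the reader side in C (memory fences and the updater's latch are elided, and the node layout is illustrative): readers never write shared state, they simply retry if the version moved underneath them:

      #include <stdint.h>

      typedef struct {
          volatile uint64_t version;   /* odd while an update is in progress */
          int keys[14];
          int nkeys;
      } node_t;

      /* Optimistic search of one node, without latching. */
      int optimistic_search(const node_t *n, int key, int *found)
      {
          uint64_t v;
          int pos;
          do {
              v = n->version;
              while (v & 1)            /* writer active: wait for an even version */
                  v = n->version;
              *found = 0;
              for (pos = 0; pos < n->nkeys; pos++)
                  if (n->keys[pos] == key) { *found = 1; break; }
          } while (n->version != v);   /* node changed during the read: retry */
          return pos;
      }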

  • Query Processing: Summary

    AlphaSort: use quicksort and (key-prefix, pointer) pairs

    Monet: MM-DBMS uses aggressive DSM
      Optimize partitioning with hierarchical radix-clustering
      Optimize post-projection with radix-declustering
      Many other optimizations

    Traditional hash joins: aggressive prefetching
      Efficiently hides data cache misses
      Robust performance even with future, longer latencies

    DSS I-misses: group computation (new "buffer" operator)

    B-tree concurrency control: reduce readers' latching

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
      Data Placement
      Access Methods
      Query Processing
      Instruction Stream Optimizations
      Staged Database Systems
      Newer Hardware
    Hip and Trendy
    Directions for Future Research

  • Instruction-Related Stalls

    25-40% of execution time [KPH98, HA04]

    Recall the importance of the instruction cache: it sits in the critical execution path, so I-cache delays are impossible to overlap

  • Call Graph Prefetching for DB Apps [APD03]

    Goal: improve DSS I-cache performance
    Idea: predict the next function call using a small cache

    Example: create_rec always calls find_, lock_, update_, and unlock_ page in the same order

    Experiments: Shore on the SimpleScalar simulator, running the Wisconsin benchmark

    Beneficial for predictable DSS instruction streams

  • DB Operators Using SIMD [ZR02]

    SIMD: Single Instruction, Multiple Data
      In modern CPUs, targeted at multimedia apps
      Example: Pentium 4, a 128-bit SIMD register holds four 32-bit values

    Assume data stored columnwise as a contiguous array of fixed-length numeric values (e.g., PAX)

    Scan example, original scan code:
      if x[n] > 10
          result[pos++] = x[n]

    SIMD 1st phase: compare (x[n+3], x[n+2], x[n+1], x[n]) against (10, 10, 10, 10) in one instruction; e.g. the values (8, 12, 6, 5) yield the bitmap vector (0, 1, 0, 0), producing 4 comparison results in parallel

  • DB Operators Using SIMD [ZR02]

    Scan example (cont'd)

    SIMD 2nd phase: if the bit vector == 0, continue; else copy all 4 results, increasing pos only where the bit == 1 (for (0, 1, 0, 0), keep just that one result)

    Parallel comparisons, fewer branches ⇒ fewer mispredictions

    Superlinear speedup in the degree of parallelism, but the code must be rewritten to use SIMD (intrinsics sketch below)
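    A sketch of the two-phase scan with SSE2 intrinsics (four 32-bit values per instruction); this is an illustrative reimplementation, not the paper's code:

      #include <emmintrin.h>
      #include <stddef.h>

      size_t simd_scan(const int *x, size_t n, int threshold, int *result)
      {
          __m128i thr = _mm_set1_epi32(threshold);
          size_t pos = 0;

          for (size_t i = 0; i + 4 <= n; i += 4) {
              __m128i v  = _mm_loadu_si128((const __m128i *)(x + i));
              __m128i gt = _mm_cmpgt_epi32(v, thr);             /* phase 1: 4 compares, no branch */
              int mask = _mm_movemask_ps(_mm_castsi128_ps(gt)); /* 4-bit comparison bitmap */
              if (mask == 0)                                    /* phase 2: one branch per group */
                  continue;
              for (int j = 0; j < 4; j++)                       /* copy qualifying values */
                  if (mask & (1 << j))
                      result[pos++] = x[i + j];
          }
          return pos;   /* the n % 4 tail is handled with scalar code, omitted */
      }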

    [Chart: aggregation over 1M records with 20% selectivity; elapsed time (ms) for SUM, COUNT, MAX, and MIN, each with and without SIMD, split into branch misprediction penalty and other cost]

  • STEPS: Cache-Resident OLTP [HA04]

    Targets instruction-cache performance for OLTP
    Exploits high transaction concurrency
    Synchronized Transactions through Explicit Processor Scheduling: multiplex concurrent transactions to exploit common code paths

    [Figure: before, each thread's code walk exceeds the I-cache capacity window before a context switch; after, the CPU context-switches at points where the code fits in the I-cache, so thread B reuses the instructions thread A just loaded]

    All capacity/conflict I-cache misses gone!

  • STEPS: Cache-Resident OLTP [HA04]

    The STEPS implementation runs full OLTP workloads (TPC-C)
      Groups threads per DB operator, then uses fast context-switching to reuse the instructions in the cache

    Full-system TPC-C implementation: 65% fewer L1-I misses, 40% speedup

    STEPS minimizes L1-I cache misses without increasing cache size

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
      Data Placement
      Access Methods
      Query Processing
      Instruction Stream Optimizations
      Staged Database Systems
      Newer Hardware
    Hip and Trendy
    Directions for Future Research

  • Thread-Based Concurrency Pitfalls [HA03]

    Current: queries Q1, Q2, Q3 time-share the CPU, context-switching at arbitrary points

    [Figure: timeline in which component loading time is re-paid at every context switch]

    Context loaded multiple times for each query
    No means to exploit overlapping work

  • Thread-Based Concurrency Pitfalls [HA03]

    Desired: context-switch at module boundaries, and load each context once for all queries

    [Figure: current vs. desired timelines; in the desired schedule, Q1, Q2, Q3 pass through each component back-to-back, paying its loading time once]

  • Staged Database Systems [HA03]

    A staged software design allows for:
      Cohort scheduling of queries, to amortize loading time
      Suspending at module boundaries, to maintain context

    Break the DBMS into stages
      Stages act as independent servers
      Queries exist in the form of "packets"

    Proposed query scheduling algorithms address the locality/wait-time trade-offs [HA02]

  • Staged Database Systems [HA03]

    [Figure: a query flows IN through connect, parser, and optimizer stages, then through execution stages (FSCAN, ISCAN, JOIN, SORT, AGGR) and OUT via send-results; each stage works against its own slice of L1/L2/memory]

    Optimizes instruction/data cache locality
    Naturally enables multi-query processing
    Highly scalable, fault-tolerant, trustworthy

  • Summary: Bridging the Gap

    Cache-aware data placement
      Eliminates unnecessary trips to memory
      Minimizes conflict/capacity misses
      Fates: decouple the memory layout from the storage layout

    What about compulsory (cold) misses?
      Can't avoid them, but can hide their latency with prefetching
      Techniques for B-trees, hash joins

    Staged Database Systems: a scalable future

    Addressing instruction stalls
      DSS: call graph prefetching, SIMD, group (buffer) operators
      OLTP: STEPS, a promising direction for any platform

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
    Newer Hardware
    Hip and Trendy
    Directions for Future Research

  • Current/Near-Future Multiprocessors

    Typical platforms:
      1. Chips with multiple cores
      2. Servers with multiple chips
      3. Memory shared across them

    Memory access:
      Traverses multiple hierarchies
      Large, non-uniform latencies

    [Figure: a multiprocessor server with several multi-core chips sharing memory modules]

    Programmer/software must hide or tolerate latency

  • Chip Multi-Processors (CMP)

    Two cores on one die, sharing the L2 cache (example: IBM Power4, Power5)

    Highly variable memory latency

    Speedup: OLTP 3x, DSS 2.3x on Piranha [BGM00]

  • Simultaneous Multi-Threading (SMT)

    Implements threads in a superscalar processor: keeps hardware state for multiple threads
      E.g.: Intel Pentium 4 (SMT), IBM Power5 (SMT & CMP: 2 cores * 2 threads)

    Speedup: OLTP 3x, DSS 0.5x (simulated) [LBE98]

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
    Hip and Trendy
      Query co-processing
      Databases on MEMStore
    Directions for Future Research

  • Optimizing Spatial Operations [SAA03]

    Spatial operations are computation-intensive
      Intersection, distance computation
      As the number of vertices per object grows, the cost grows

    Use the graphics card to increase speed
      Idea: use color blending to detect intersection
      Draw each polygon in gray; the intersected area turns black because of the color-mixing effect
      The algorithms cleverly use hardware features

    Intersection selection: up to 64% improvement using a graphics card

  • Fast Computation of DB Operations Using Graphics Processors [GLW04]

    Exploit graphics features for database operations: predicates, Boolean operations, aggregates

    Examples:
      Predicate (attribute > constant): test a set of pixels against a reference value, where pixel = attribute value and reference value = constant
      Aggregation (COUNT): count the number of pixels passing a test

    Good performance: e.g., over 2x improvement for predicate evaluations

    Promising! The peak performance of graphics processors increases 2.5-3 times a year

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
    Hip and Trendy
      Query co-processing
      Databases on MEMStore
    Directions for Future Research

  • MEMStore (MEMS*-based storage)

    On-chip mechanical storage: uses MEMS for media positioning

    [Figure: recording media (sled) moved by actuators over an array of read/write tips]

    * microelectromechanical systems

  • MEMStore (MEMS*-based storage)

    Disk drive: 60-200 GB capacity (4-40 GB portable), 100 cm3 volume, 10's of MB/s bandwidth, <10 ms latency (10-15 ms portable), a single read/write head

    MEMStore: 2-10 GB capacity, <1 cm3 volume, ~100 MB/s bandwidth, <1 ms latency, many parallel heads

    So how can MEMS help improve DB performance?

    * microelectromechanical systems

  • Two-Dimensional Database Access [SSA03,YAA03,YAA04]

    [Figure: a table's records x attributes grid, numbered in storage order, showing a placement that is sequential along BOTH the record dimension and the attribute dimension]

    Exploit the device's inherent parallelism

  • Two-Dimensional Database Access [SSA03]

    [Chart: scan time (s) for the full table and for payloads a1-a4, comparing NSM and MEMStore in row order and in attribute order; MEMStore matches the best layout in each case]

    Peak performance along both dimensions

  • Outline

    Introduction and Overview
    New Hardware
    Where Does Time Go?
    Bridging the Processor/Memory Speed Gap
    Hip and Trendy
    Directions for Future Research

  • Future Research Directions

    Rethink query optimization: with increasing complexity, cost-based optimization is not ideal

    Multiprocessors, and really new modular software architectures to fit new computers
      Current research in DB workloads only scratches the surface
      Optimize execution on multiple-core chips
      Exploit multithreaded processors

    Power-aware database systems: on embedded processors, laptops, etc.

    Automatic data placement and memory-layer optimization: one level should not need to know what the others do

    Auto-everything

    Aggressive use of hybrid processors

  • ACKNOWLEDGEMENTS

  • @Carnegie MellonDatabases

    121©2004 Anastassia Ailamaki

    Special thanks go to…

Shimin Chen, Minglong Shao, Stavros Harizopoulos, and Nikos Hardavellas for invaluable contributions to this talk
Steve Schlosser (MEMStore)
Ravi Ramamurthy (fractured mirrors)
Babak Falsafi and Chris Colohan (h/w architecture)

  • REFERENCES(used in presentation)

  • @Carnegie MellonDatabases

    123©2004 Anastassia Ailamaki

References: Where Does Time Go? (simulation only)

    [ADS02] Branch Behavior of a Commercial OLTP Workload on Intel IA32 Processors. M. Annavaram, T. Diep, J. Shen. International Conference on Computer Design: VLSI in Computers and Processors (ICCD), Freiburg, Germany, September 2002.

[SBG02] A Detailed Comparison of Two Transaction Processing Workloads. R. Stets, L.A. Barroso, and K. Gharachorloo. IEEE Annual Workshop on Workload Characterization (WWC), Austin, Texas, November 2002.

    [BGN00] Impact of Chip-Level Integration on Performance of OLTP Workloads. L.A. Barroso, K. Gharachorloo, A. Nowatzyk, and B. Verghese. IEEE International Symposium on High-Performance Computer Architecture (HPCA), Toulouse, France, January 2000.

    [RGA98] Performance of Database Workloads on Shared Memory Systems with Out-of-Order Processors. P. Ranganathan, K. Gharachorloo, S. Adve, and L.A. Barroso. International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), San Jose, California, October 1998.

    [LBE98] An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors. J. Lo, L.A. Barroso, S. Eggers, K. Gharachorloo, H. Levy, and S. Parekh. ACM International Symposium on Computer Architecture (ISCA), Barcelona, Spain, June 1998.

    [EJL96] Evaluation of Multithreaded Uniprocessors for Commercial Application Environments. R.J. Eickemeyer, R.E. Johnson, S.R. Kunkel, M.S. Squillante, and S. Liu. ACM International Symposium on Computer Architecture (ISCA), Philadelphia, Pennsylvania, May 1996.

  • @Carnegie MellonDatabases

    124©2004 Anastassia Ailamaki

References: Where Does Time Go? (real-machine/simulation)

    [RAD02] Comparing and Contrasting a Commercial OLTP Workload with CPU2000. J. Rupley II, M. Annavaram, J. DeVale, T. Diep and B. Black (Intel). IEEE Annual Workshop on Workload Characterization (WWC), Austin, Texas, November 2002.

    [CTT99] Detailed Characterization of a Quad Pentium Pro Server Running TPC-D. Q. Cao, J. Torrellas, P. Trancoso, J. Larriba-Pey, B. Knighten, Y. Won. International Conference on Computer Design (ICCD), Austin, Texas, October 1999.

[ADH99] DBMSs on a Modern Processor: Experimental Results. A. Ailamaki, D.J. DeWitt, M.D. Hill, D.A. Wood. International Conference on Very Large Data Bases (VLDB), Edinburgh, UK, September 1999.

    [KPH98] Performance Characterization of a Quad Pentium Pro SMP using OLTP Workloads. K. Keeton, D.A. Patterson, Y.Q. He, R.C. Raphael, W.E. Baker. ACM International Symposium on Computer Architecture (ISCA), Barcelona, Spain, June 1998.

    [BGB98] Memory System Characterization of Commercial Workloads. L.A. Barroso, K. Gharachorloo, and E. Bugnion. ACM International Symposium on Computer Architecture (ISCA), Barcelona, Spain, June 1998.

    [TLZ97] The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors. P. Trancoso, J. Larriba-Pey, Z. Zhang, J. Torrellas. IEEE International Symposium on High-Performance Computer Architecture (HPCA), San Antonio, Texas, February 1997.

  • @Carnegie MellonDatabases

    125©2004 Anastassia Ailamaki

References: Architecture-Conscious Data Placement

    [SSS04] Clotho: Decoupling memory page layout from storage organization. M. Shao, J. Schindler, S.W. Schlosser, A. Ailamaki, G.R. Ganger. International Conference on Very Large Data Bases (VLDB), Toronto, Canada, September 2004.

    [SSS04a] Atropos: A Disk Array Volume Manager for Orchestrated Use of Disks. J. Schindler, S.W. Schlosser, M. Shao, A. Ailamaki, G.R. Ganger. USENIX Conference on File and Storage Technologies (FAST), San Francisco, California, March 2004.

[YAA04] Declustering Two-Dimensional Datasets over MEMS-based Storage. H. Yu, D. Agrawal, and A. El Abbadi. International Conference on Extending Database Technology (EDBT), Heraklion, Crete, Greece, March 2004.

  • @Carnegie MellonDatabases

    126©2004 Anastassia Ailamaki

References: Architecture-Conscious Access Methods

    [ZR03a] Buffering Accesses to Memory-Resident Index Structures. J. Zhou and K.A. Ross. International Conference on Very Large Data Bases (VLDB), Berlin, Germany, September 2003.

[HP03a] Effect of node size on the performance of cache-conscious B+ Trees. R.A. Hankins and J.M. Patel. ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), San Diego, California, June 2003.

    [CGM02] Fractal Prefetching B+ Trees: Optimizing Both Cache and Disk Performance. S. Chen, P.B. Gibbons, T.C. Mowry, and G. Valentin. ACM International Conference on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.

    [GL01] B-Tree Indexes and CPU Caches. G. Graefe and P. Larson. International Conference on Data Engineering (ICDE), Heidelberg, Germany, April 2001.

    [CGM01] Improving Index Performance through Prefetching. S. Chen, P.B. Gibbons, and T.C. Mowry. ACM International Conference on Management of Data (SIGMOD), Santa Barbara, California, May 2001.

[BMR01] Main-memory index structures with fixed-size partial keys. P. Bohannon, P. McIlroy, and R. Rastogi. ACM International Conference on Management of Data (SIGMOD), Santa Barbara, California, May 2001.

    [BDF00] Cache-Oblivious B-Trees. M.A. Bender, E.D. Demaine, and M. Farach-Colton. Symposium on Foundations of Computer Science (FOCS), Redondo Beach, California, November 2000.

    [RR00] Making B+ Trees Cache Conscious in Main Memory. J. Rao and K.A. Ross. ACM International Conference on Management of Data (SIGMOD), Dallas, Texas, May 2000.

[RR99] Cache Conscious Indexing for Decision-Support in Main Memory. J. Rao and K.A. Ross. International Conference on Very Large Data Bases (VLDB), Edinburgh, UK, September 1999.

    [LC86] Query Processing in main-memory database management systems. T. J. Lehman and M. J. Carey. ACM International Conference on Management of Data (SIGMOD), 1986.

  • @Carnegie MellonDatabases

    127©2004 Anastassia Ailamaki

References: Architecture-Conscious Query Processing

[MBN04] Cache-Conscious Radix-Decluster Projections. S. Manegold, P.A. Boncz, N. Nes, M.L. Kersten. International Conference on Very Large Data Bases (VLDB), Toronto, Canada, September 2004.

    [GLW04] Fast Computation of Database Operations using Graphics Processors. N.K. Govindaraju, B. Lloyd, W. Wang, M. Lin, D. Manocha. ACM International Conference on Management of Data (SIGMOD), Paris, France, June 2004.

    [CAG04] Improving Hash Join Performance through Prefetching. S. Chen, A. Ailamaki, P. B. Gibbons, and T.C. Mowry. International Conference on Data Engineering (ICDE), Boston, Massachusetts, March 2004.

[ZR04] Buffering Database Operations for Enhanced Instruction Cache Performance. J. Zhou, K.A. Ross. ACM International Conference on Management of Data (SIGMOD), Paris, France, June 2004.

[SAA03] Hardware Acceleration for Spatial Selections and Joins. C. Sun, D. Agrawal, A.E. Abbadi. ACM International Conference on Management of Data (SIGMOD), San Diego, California, June 2003.

    [CHK01] Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems. S. K. Cha, S. Hwang, K. Kim, and K. Kwon. International Conference on Very Large Data Bases (VLDB), Rome, Italy, September 2001.

    [PMA01] Block Oriented Processing of Relational Database Operations in Modern Computer Architectures. S. Padmanabhan, T. Malkemus, R.C. Agarwal, A. Jhingran. International Conference on Data Engineering (ICDE), Heidelberg, Germany, April 2001.

[MBK00] What Happens During a Join? Dissecting CPU and Memory Optimization Effects. S. Manegold, P.A. Boncz, and M.L. Kersten. International Conference on Very Large Data Bases (VLDB), Cairo, Egypt, September 2000.

    [SKN94] Cache Conscious Algorithms for Relational Query Processing. A. Shatdal, C. Kant, and J.F. Naughton. International Conference on Very Large Data Bases (VLDB), Santiago de Chile, Chile, September 1994.

    [NBC94] AlphaSort: A RISC Machine Sort. C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D.B. Lomet. ACM International Conference on Management of Data (SIGMOD), Minneapolis, Minnesota, May 1994.

  • @Carnegie MellonDatabases

    128©2004 Anastassia Ailamaki

References: Instruction Stream Optimizations and DBMS Architectures

    [HA04] STEPS towards Cache-resident Transaction Processing. S. Harizopoulos and A. Ailamaki. International Conference on Very Large Data Bases (VLDB), Toronto, Canada, September 2004.

    [APD03] Call Graph Prefetching for Database Applications. M. Annavaram, J.M. Patel, and E.S. Davidson. ACM Transactions on Computer Systems, 21(4):412-444, November 2003.

    [SAG03] Lachesis: Robust Database Storage Management Based on Device-specific Performance Characteristics. J. Schindler, A. Ailamaki, and G. R. Ganger. International Conference on Very Large Data Bases (VLDB), Berlin, Germany, September 2003.

    [HA02] Affinity Scheduling in Staged Server Architectures. S. Harizopoulos and A. Ailamaki. Carnegie Mellon University, Technical Report CMU-CS-02-113, March, 2002.

    [HA03] A Case for Staged Database Systems. S. Harizopoulos and A. Ailamaki. Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, January 2003.

    [B02] Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. P. A. Boncz. Ph.D. Thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands, May 2002.

    [PMH02] Computation Regrouping: Restructuring Programs for Temporal Data Cache Locality. V.K. Pingali, S.A. McKee, W.C. Hseih, and J.B. Carter. International Conference on Supercomputing (ICS), New York, New York, June 2002.

    [ZR02] Implementing Database Operations Using SIMD Instructions. J. Zhou and K.A. Ross. ACM International Conference on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.

  • @Carnegie MellonDatabases

    129©2004 Anastassia Ailamaki

References: Newer Hardware

    [BWS03] Improving the Performance of OLTP Workloads on SMP Computer Systems by Limiting Modified Cache Lines. J.E. Black, D.F. Wright, and E.M. Salgueiro. IEEE Annual Workshop on Workload Characterization (WWC), Austin, Texas, October 2003.

[GH03] Technological impact of magnetic hard disk drives on storage systems. E. Grochowski and R.D. Halem. IBM Systems Journal, 42(2), 2003.

[DJN02] Shared Cache Architectures for Decision Support Systems. M. Dubois, J. Jeong, A. Nanda. Performance Evaluation, 49(1), September 2002.

    [G02] Put Everything in Future (Disk) Controllers. Jim Gray, talk at the USENIX Conference on File and Storage Technologies (FAST), Monterey, California, January 2002.

    [BGM00] Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. L.A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. International Symposium on Computer Architecture (ISCA). Vancouver, Canada, June 2000.

    [AUS98] Active disks: Programming model, algorithms and evaluation. A. Acharya, M. Uysal, and J. Saltz. International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), San Jose, California, October 1998.

    [KPH98] A Case for Intelligent Disks (IDISKs). K. Keeton, D. A. Patterson, J. Hellerstein. SIGMOD Record, 27(3):42--52, September 1998.

    [PGK88] A Case for Redundant Arrays of Inexpensive Disks (RAID). D. A. Patterson, G. A. Gibson, and R. H. Katz. ACM International Conference on Management of Data (SIGMOD), June 1988.

  • @Carnegie MellonDatabases

    130©2004 Anastassia Ailamaki

References: Methodologies and Benchmarks

[DMM04] Accurate Cache and TLB Characterization Using Hardware Counters. J. Dongarra, S. Moore, P. Mucci, K. Seymour, H. You. International Conference on Computational Science (ICCS), Krakow, Poland, June 2004.

[SAF04] DBmbench: Fast and Accurate Database Workload Representation on Modern Microarchitecture. M. Shao, A. Ailamaki, and B. Falsafi. Carnegie Mellon University Technical Report CMU-CS-03-161, 2004.

    [KP00] Towards a Simplified Database Workload for Computer Architecture Evaluations. K. Keeton and D. Patterson. IEEE Annual Workshop on Workload Characterization, Austin, Texas, October 1999.

  • @Carnegie MellonDatabases

    131©2004 Anastassia Ailamaki

    Useful Links

    Info on Intel Pentium4 Performance Counters: ftp://download.intel.com/design/Pentium4/manuals/25366814.pdf

AMD hardware performance counters: http://www.amd.com/us-en/Processors/DevelopWithAMD/

PAPI Performance Library: http://icl.cs.utk.edu/papi/
Intel® VTune™ Performance Analyzers: http://developer.intel.com/software/products/vtune/