parallel databases pres

Upload: uzaifa

Post on 09-Apr-2018

228 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/8/2019 Parallel Databases Pres

    1/19

    04/25/2005 Yan Huang - CSCI5330 DatabaseImplementation Parallel Database

    Parallel Databases

  • 8/8/2019 Parallel Databases Pres

    2/19

    04/25/2005 Yan Huang - CSCI5330 DatabaseImplementation Parallel Database

    Introduction

    A variety of hardware architectures allow multiple

    computers to share and access to data. A parallel

    database can allow access to a single database by

    users on multiple machines, with increased

    performance.

  • 8/8/2019 Parallel Databases Pres

    3/19

    04/25/2005 Yan Huang - CSCI5330 DatabaseImplementation Parallel Database

    Purpose

    Databases are growing increasingly large

    Large volumes of transaction data are collected and stored for

    later analysis.

    Multimedia objects like images are increasingly stored indatabases

  • 8/8/2019 Parallel Databases Pres

    4/19

    04/25/2005 Yan Huang - CSCI5330 DatabaseImplementation Parallel Database

    Large-scale parallel database

    Large-scale parallel database systems increasingly

    used for:

    storing large volumes of data

    providing high throughput for transaction processing

  • 8/8/2019 Parallel Databases Pres

    5/19

    04/25/2005 Yan Huang - CSCI5330 DatabaseImplementation Parallel Database

    Queries

    Queries are expressed in high level language (SQL)

    makes parallelization easier.

    Different queries can be run in parallel with each other.Concurrency control takes care of conflicts.

    Thus, databases naturally provide themselves to

    parallelism.

  • 8/8/2019 Parallel Databases Pres

    6/19

    04/25/2005 Yan Huang - CSCI5330 DatabaseImplementation Parallel Database

    Parallelism in Databases

    Data can be partitioned across multiple disks for

    parallel I/O.

    Individual relational operations

    (e.g., sort, join, aggregation)

    can be executed in parallel

    database.

  • 8/8/2019 Parallel Databases Pres

    7/19

    04/25/2005 Yan Huang - CSCI5330 DatabaseImplementation Parallel Database

    Parallel Database Architectures

  • 8/8/2019 Parallel Databases Pres

    8/19

    04/25/2005 Yan Huang - CSCI5330 DatabaseImplementation Parallel Database

    Parallel Database Architectures

    Shared memory

    Shared disk

    Shared nothing

    Hierarchical

  • 8/8/2019 Parallel Databases Pres

    9/19

    04/25/2005 Yan Huang - CSCI5330 DatabaseImplementation Parallel Database

    Parallel Database Architectures

    Shared memory -- processors share a common

    memory (Extremely efficient communication between

    processors).

    Shared disk -- processors share a common disk

    (Shared-disk systems can scale to a somewhat larger

    number of processors, but communication between

    processors is slower.)

  • 8/8/2019 Parallel Databases Pres

    10/19

    04/25/2005 Yan Huang - CSCI5330 DatabaseImplementation Parallel Database

    Shared nothing -- Each processor has exclusive

    access to its main memory.

    Hierarchical -- hybrid of the above architectures(Combines characteristics of shared-memory, shared-

    disk, and shared-nothing architectures.)

  • 8/8/2019 Parallel Databases Pres

    11/19

    04/25/2005 Yan Huang - CSCI5330 DatabaseImplementation Parallel Database

    Parallel Level

    A coarse-grain parallel machine

    consists of a small number of

    powerful processors

  • 8/8/2019 Parallel Databases Pres

    12/19

    04/25/2005 Yan Huang - CSCI5330 DatabaseImplementation Parallel Database

    Massively Parallel

    A massively parallel orfine grain parallel machineutilizes thousands of smaller processors.

  • 8/8/2019 Parallel Databases Pres

    13/19

    04/25/2005 Yan Huang - CSCI5330 Database

    Implementation Parallel Database

    Parallel System Performance Measure

    Speedup: = small system elapsed time

    large system elapsed time

    Scaleup: = small system small problem elapsed time

    big system big problem elapsed time

  • 8/8/2019 Parallel Databases Pres

    14/19

    04/25/2005 Yan Huang - CSCI5330 Database

    Implementation Parallel Database

    Database Performance Measures

    throughput --- the number of tasks that can becompleted in a given time interval.

    response time --- the amount of time it takes to

    complete a single task from the time it is submitted.

  • 8/8/2019 Parallel Databases Pres

    15/19

    04/25/2005 Yan Huang - CSCI5330 Database

    Implementation Parallel Database

    Factors Limiting Speedup and Scaleup

    Speedup and scaleup are often sublinear due to:

    Startup costs: Cost of starting up multiple processes

    may dominate computation time, if the degree of

    parallelism is high.

    Interference: Processes accessing shared resources(e.g.,system bus, disks, or locks) compete with each

    other

    Skew: Overall execution time determined by slowest of

    parallely executing tasks.

  • 8/8/2019 Parallel Databases Pres

    16/19

    04/25/2005 Yan Huang - CSCI5330 Database

    Implementation Parallel Database

    Parallel Database Issues

    Data Partitioning

    Parallel Query Processing

  • 8/8/2019 Parallel Databases Pres

    17/19

    04/25/2005 Yan Huang - CSCI5330 Database

    Implementation Parallel Database

    I/O Parallelism

    Horizontal partitioning tuples of a relation are dividedamong many disks such that each tuple resides on one

    disk.

    Partitioning techniques (number of disks = n):

    Round-robin Hash partitioning

    Range partitioning

  • 8/8/2019 Parallel Databases Pres

    18/19

    04/25/2005 Yan Huang - CSCI5330 Database

    Implementation Parallel Database

    Typical Database Query Types

    Sequential scan

    Point query

    Range query

  • 8/8/2019 Parallel Databases Pres

    19/19

    04/25/2005 Yan Huang - CSCI5330 Database

    Implementation Parallel Database

    Comparison of Partitioning Techniques

    Round

    Robin

    Hashing Range

    Sequential

    Scan

    Best/good

    parallelism

    Good Good

    Point Query Difficult Good for hash key Good for range

    vector

    Range Query Difficult Difficult Good for range

    vector