lecture 3 data warehouse structures

Upload: deepika-monga

Post on 05-Apr-2018

221 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    1/22

    Rajeev Tiwari

    Lecture 3

    Data warehouse structures

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    2/22

    [1] Data Mining Concepts and Techniques Jiawei Han and Micheline

    Kamber

    [2] http://www.daneil-lemire.com

    [3] http://www.kalmstrom.nu

    2

    References

    http://www.daneil-lemire.com/http://www.kalmstrom.nu/http://www.kalmstrom.nu/http://www.daneil-lemire.com/http://www.daneil-lemire.com/http://www.daneil-lemire.com/
  • 7/31/2019 Lecture 3 Data Warehouse Structures

    3/22

    What is Data Warehouse?

    o Defined in many different ways.

    A decision support database that is maintained separately from the

    organizations operational database.

    Support information processing by providing a solid platform of

    consolidated, historical data for analysis.

    o A data warehouse is a subject-oriented, integrated, time-variant,

    and nonvolatile collection of data in support of managements

    decision-making process.W. H. Inmon

    o Data warehousing:

    The process of constructing and using data warehouses

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    4/22

    Data Warehouse Subject Oriented

    o Organized around major subjects, such as customer, product,

    sales.

    o Focused on the modeling and analysis of data for decision makers,

    not on daily operations

    o Provide a simple and concise view around particular subject issues

    by excluding data that are not useful in the decision support

    process.

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    5/22

    Data Warehouse - Integrated

    5

    o Constructed by integrating multiple, heterogeneous data sources

    relational databases, flat files, on-line transaction records

    o Data cleaning and data integration techniques are applied.

    Ensure consistency in naming conventions, encoding structures,attribute measures, etc. among different data sources

    When data is moved to the warehouse, it is converted.

    o Eg: Sales data may be on RDB, customer information in flat files.

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    6/22

    Data Warehouse - Time Variant

    o The time horizon for the data warehouse is significantly longer than

    that of operational database systems

    Operational database: current value

    Data warehouse data: provide information from a historicalperspective (e.g., past 5-10 years)

    o Every key structure in the data warehouse

    Contains an element of time, explicitly or implicitly

    But the key of operational data may or may not contain time

    element

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    7/22

    Data Warehouse - Nonvolatile

    o A physically separate store of data, transformed from the operational

    environment

    o Operational update of data does not occur in the data warehouse

    environment

    Does not require transaction processing, recovery, and

    concurrency control mechanisms

    Requires only two operations in data accessing:

    initial loading of dataand access of data

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    8/22

    Heterogeneous Databases

    8

    o Consists of a set of interconnected, autonomous databases.

    o Objects in one database may differ from objects in other

    databases.o Information exchange across such databases is difficult.

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    9/22

    Data Warehouse vs. HeterogeneousDBMS

    9

    o Heterogeneous DBMS: A query driven approach

    Build wrappers/mediators on top of heterogeneous databases

    A meta-dictionary is used to translate the query into queries appropriate for

    individual heterogeneous sites.

    The results are integrated into a global answer set.

    This approach involves complex information filtering.

    Inefficient and potentially expensive.

    o Data warehouse: update-driven, high performance

    Information from heterogeneous sources is integrated in advance and stored

    in warehouses for direct query and analysis

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    10/22

    Operational DBMS

    10

    o They consist of tables with a set of attributes and stores alarge set of tuples.

    o They use the Entity-Relationship (ER) data model.

    o They are used to store transactional data.o They contain the most current information.

    o Thus known as Online Transaction Processing (OLTP)systems.

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    11/22

    Data Warehouse vs. Operational DBMS

    11

    o User and system orientation

    customer vs. market

    o Data contents

    current, detailed vs. historical, consolidated

    o Database design

    ER + application vs. star + subject

    o View

    current, local vs. evolutionary, integrated

    o Access patterns

    update vs. read-only but complex queries

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    12/22

    OLTP vs. OLAPOLTP( online transactionprocessing)

    OLAP(online analyticalprocessing)

    users clerk, IT professional knowledge worker

    function day to day operations decision support

    DB design application-oriented subject-oriented

    data current, up-to-date

    detailed, flat relationalisolated

    historical,

    summarized, multidimensionalintegrated, consolidated

    usage repetitive ad-hoc

    access read/write

    index/hash on prim. key

    lots of scans

    unit of work short, simple transaction complex query

    # records accessed tens millions

    #users thousands hundreds

    DB size 100MB-GB 100GB-TB

    metric transaction throughput query throughput, response

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    13/22

    Why Separate Data

    Warehouse?

    13

    o High performance for both systems

    DBMS - Tuned for Online Transaction Processing Systems

    Warehouse - Tuned for Online Analytical Processing systems

    involving complex OLAP queries

    Processing OLAP queries would degrade DBMS performance of

    operational tasks.

    o Decision support requires historical data which operational

    Databases do not typically maintain.o Decision Support requires consolidation of data from

    heterogeneous sources.

    o Solution

    To maintain separate database systems which support special

    primitives and structures suitable to store, access and process

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    14/22

    Multidimensional Data Model

    o A Data warehouse is based on multidimensional data model,which views data in the form of a data cube.

    o Data cube models n-D data, defined by dimensions and facts.

    Dimensions: They are entities with respect to which an

    organization wants to keep records such as items(item_name).

    Facts: It is a subject of decision oriented analysis such asdollars_sold or units_sold.

    Facts are numerical measures.

    Quantities by which we want to analyze relationshipbetween dimensions.

    Contains key to each of the related dimension tables.

    o A multidimensional data model is typically organized around a

    central theme, like sales, and is represented by a fact table.

    S l l f ti f

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    15/22

    Sales volume as a function ofproduct, Date, Country

    Total annualsales

    of TV in U.S.A.

    Date

    Country

    sum

    sumTV

    VCR

    PC

    1Qtr 2Qtr 3Qtr 4Qtr

    U.S.A

    Canada

    Mexico

    sum

    Dimensions: Product, Location, Time

    Hierarchical summarization paths

    Industry Region Year

    Category Country Quarter

    Product City Month

    Office Week

    Day

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    16/22

    Cube: A Lattice of Cuboids

    se

    all

    time item location supplier

    time,location

    time,supplier

    item,location

    item,supplier

    location,supplier

    time,item,supplier

    time,location,supplier

    item,location,supplier

    0-D(apex) cuboid

    1-D cuboids

    2-D cuboids

    3-D cuboids

    4-D(base) cuboid

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    17/22

    Schemas for Multidimensional

    Databases Multidimensional model exists in form of

    1. Star Schema: A fact table in the middle connected to a set ofdimension tables.

    time_key

    dayday_of_the_week

    month

    quarter

    year

    timetime_key

    item_key

    branch_key

    location_key

    units_sold

    dollars_sold

    avg_salesbranch_keybranch_name

    branch_type

    branch

    item_key

    item_name

    brand

    type

    supplier_type

    item

    location_key

    street

    city

    state_or_province

    country

    location

    Sales Fact Table

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    18/22

    The Star SchemaEmployee_DimEmployeeKey

    EmployeeID...

    Time_DimTimeKey

    TheDate...

    Product_DimProductKey

    ProductID...

    Customer_DimCustomerKey

    CustomerID

    ...

    Shipper_DimShipperKey

    ShipperID

    ...

    Sales_Fact

    TimeKeyEmployeeKeyProductKeyCustomerKeyShipperKey

    Sales Amount

    Unit Sales ...

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    19/22

    2. Snowflake schema: A refinement of star schema where somedimensional hierarchy is normalized into a set of smaller dimensiontables, forming a shape similar to snowflake.

    time_key

    day

    day_of_the_week

    month

    quarter

    year

    time

    branch_key

    branch_name

    branch_type

    branch

    time_key

    item_key

    branch_key

    location_key

    units_sold

    dollars_sold

    avg_sales

    item_keyitem_name

    brand

    type

    supplier_key

    item

    location_key

    street

    city_key

    location

    city_key

    city

    state_or_province

    country

    citySales Fact Table

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    20/22

    Snowflakes are conglomerations of frozen icecrystals which fall through the Earth's atmosphere.They begin as two snow crystals which develop

    when microscopic supercooled clouddroplets freeze.

    http://en.wikipedia.org/wiki/Earth's_atmospherehttp://en.wikipedia.org/wiki/Snowhttp://en.wikipedia.org/wiki/Supercoolinghttp://en.wikipedia.org/wiki/Freezinghttp://en.wikipedia.org/wiki/Freezinghttp://en.wikipedia.org/wiki/Supercoolinghttp://en.wikipedia.org/wiki/Snowhttp://en.wikipedia.org/wiki/Earth's_atmospherehttp://en.wikipedia.org/wiki/Earth's_atmospherehttp://en.wikipedia.org/wiki/Earth's_atmosphere
  • 7/31/2019 Lecture 3 Data Warehouse Structures

    21/22

    3. Fact Constellation: Multiple facts tables share dimension tables, viewed ascollection of stars, therefore called galaxy schema or fact constellation.

    qq

    time_keyday

    day_of_the_week

    month

    quarter

    year

    time

    branch_key

    branch_name

    branch_type

    branchlocation_key

    street

    city

    province_or_state

    country

    location

    time_key

    item_key

    branch_key

    location_key

    units_sold

    dollars_sold

    avg_sales

    item_key

    item_name

    brand

    type

    supplier_type

    item

    time_key

    item_key

    shipper_keyfrom_location

    to_location

    dollars_cost

    units_shipped

    shipper_key

    shipper_name

    location_key

    shipper_type

    shipperSales Fact Table

    Shipping Fact Table

  • 7/31/2019 Lecture 3 Data Warehouse Structures

    22/22

    THANKS