dwdm_lect2

Data Warehousing and Data Mining Sunil Paudel [email protected]

Upload: ajaykumar988

Post on 03-Sep-2015



  • Data Warehousing and Data Mining

    Sunil [email protected]

  • Outline

    Review of RDBMS

    OLAP Operations

  • DBMS Overview

    DBMS (Database Management Systems) are designed to achieve the following four main goals:

    1. Increase Data Independence: data and programs are independent; a change in the data does not affect user programs.

    2. Reduce Data Redundancy: data is stored only once; different applications share the same centralized data.

    3. Increase Data Security: authorize access to the database; place restrictions on the operations that may be performed on data.

    4. Maintain Data Integrity: the same data is used by many users.

  • RDBMS

    A relational database is a database that is perceived by the user as a collection of tables.

    This user view is independent of the actual way the data is stored.

    Tables are sets of data made up of rows and columns.

    Structure:

      • Very flexible -- create views
      • Keep the data secure (use views)
      • Relation between tables
      • Primary & Foreign Keys
      • Normalization

  • RDBMS- Normalization

    Normalization is the process of streamlining your tables and their relationships.

    1. First Normal Form (1NF)

      Action: Eliminate multiple values in a single (atomic) field and eliminate repeating groups.

      Rule: Each column must be a fact about ... the key.

    2. Second Normal Form (2NF)

      Action: Regroup columns that depend on only one part of the composite key.

      Rule: Each column must be a fact about ... the whole key.

    3. Third Normal Form (3NF)

      Action: Regroup non-key columns that represent a fact about another non-key column.

      Rule: Each column must be a fact about ... nothing but the key.
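
    The first two normalization steps can be sketched in plain Python. This is only an illustration under assumptions: the order data, the field names and the "id:name" encoding are invented here and do not come from the slides. 3NF would follow the same pattern for facts about non-key columns.

```python
# Hypothetical order data, invented for illustration.
# Unnormalized: the 'items' field holds several values at once.
orders = [
    {"order_id": 1, "customer": "Ann", "items": "P1:Pen,P2:Ink"},
    {"order_id": 2, "customer": "Bob", "items": "P1:Pen"},
]

# 1NF: one atomic value per field, one row per (order_id, product_id).
rows_1nf = []
for order in orders:
    for item in order["items"].split(","):
        product_id, product_name = item.split(":")
        rows_1nf.append(
            (order["order_id"], order["customer"], product_id, product_name)
        )

# 2NF: columns that depend on only one part of the composite key
# (order_id, product_id) are regrouped into their own tables.
order_table = {oid: cust for oid, cust, _, _ in rows_1nf}    # depends on order_id
product_table = {pid: name for _, _, pid, name in rows_1nf}  # depends on product_id
order_line = sorted({(oid, pid) for oid, _, pid, _ in rows_1nf})
```

    After the split, the customer name is stored once per order and the product name once per product, which is what the "whole key" rule asks for.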

  • Views and Joins

    Views are ways of looking at data from one or more tables.

    Tables can be related to each other by the data they hold (called joins).

    Join Strategies:

      • Cross Product
      • Inner Join
      • Outer Join
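
    The three join strategies and a view can be tried out with Python's built-in sqlite3 module. This is a minimal sketch; the store and sale tables and their rows are hypothetical and were made up for this example.

```python
import sqlite3

# Hypothetical tables, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE sale  (sale_id INTEGER PRIMARY KEY, store_id INTEGER, amount REAL);
    INSERT INTO store VALUES (1, 'North'), (2, 'South');
    INSERT INTO sale  VALUES (10, 1, 50.0), (11, 1, 20.0);
""")

# Cross product: every store paired with every sale (2 x 2 rows).
cross = conn.execute("SELECT * FROM store CROSS JOIN sale").fetchall()

# Inner join: only stores that have matching sales.
inner = conn.execute(
    "SELECT s.city, x.amount FROM store s "
    "INNER JOIN sale x ON x.store_id = s.store_id ORDER BY x.sale_id"
).fetchall()

# Left outer join: every store, with NULL (None) where no sale matches.
outer = conn.execute(
    "SELECT s.city, x.amount FROM store s "
    "LEFT OUTER JOIN sale x ON x.store_id = s.store_id "
    "ORDER BY s.store_id, x.sale_id"
).fetchall()

# A view is a stored query that can be read like a table.
conn.execute(
    "CREATE VIEW store_totals AS "
    "SELECT s.city, SUM(x.amount) AS total FROM store s "
    "JOIN sale x ON x.store_id = s.store_id GROUP BY s.city"
)
totals = conn.execute("SELECT * FROM store_totals").fetchall()
```

    The store without sales appears only in the outer join result, which is the practical difference between the inner and outer strategies.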

  • SQL

    SQL is divided into three major categories:

    1. DDL (Data Definition Language): used to create, modify or drop database objects.

    2. DML (Data Manipulation Language): used to select, insert, update or delete database data (records).

    3. DCL (Data Control Language): used to provide data object access control, e.g. connect to database, grant, revoke.
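
    A short sketch of the categories using Python's sqlite3 module. The product table is hypothetical. SQLite has no GRANT or REVOKE, so the DCL statement is shown only as a comment, and the user name in it is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: create and modify database objects.
conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE product ADD COLUMN price REAL")

# DML: insert, update, delete and select records.
conn.execute("INSERT INTO product VALUES (1, 'Battery', 2.5)")
conn.execute("INSERT INTO product VALUES (2, 'Charger', 9.0)")
conn.execute("UPDATE product SET price = 3.0 WHERE product_id = 1")
conn.execute("DELETE FROM product WHERE product_id = 2")
rows = conn.execute("SELECT product_id, name, price FROM product").fetchall()

# DCL: not available in SQLite. On a server DBMS a statement might look like
#   GRANT SELECT ON product TO some_user;
# (comment only, not executed; 'some_user' is a made-up name).

# DDL again: drop the object.
conn.execute("DROP TABLE product")
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```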

  • Multidimensional Data

    A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.

  • Sample Query

    Query: "What are the net sales, in terms of revenue and quantities of items sold,

      • per product,
      • per store and sales region,
      • per customer and customer sales area,
      • per day as well as aggregated over time,
      • over the last two weeks?"

    Evaluation entails viewing historical sales figures from multiple perspectives, such as:

      • Sales (overall)
      • Sales per product
      • Sales per store and per sales region
      • Sales per customer and customer sales area
      • Sales per day and aggregated over time
      • Sales and aggregated sales over given time periods

  • Representation of Query as a Cube

    Multi-dimensional data models can be presented using cubes, or using a mathematical notation that represents points in a multi-dimensional space, for example: QTY_SOLD = F(S,P,C,t)
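
    The notation QTY_SOLD = F(S,P,C,t) can be read as a lookup from a point in the multi-dimensional space to a value. The sketch below assumes that S, P, C and t stand for store, product, customer and time, which the slide does not spell out; all cell values are invented.

```python
# One value per point in the 4-dimensional space.
# Assumed axis order: (store, product, customer, time). Data is invented.
qty_sold = {
    ("store_1", "item_a", "cust_x", "2015-09-01"): 3,
    ("store_1", "item_b", "cust_y", "2015-09-01"): 1,
    ("store_2", "item_a", "cust_x", "2015-09-02"): 5,
}


def F(s, p, c, t):
    """Return the quantity sold at one point of the cube (0 if no sale)."""
    return qty_sold.get((s, p, c, t), 0)
```

    Storing only the non-empty points keeps the structure small when most combinations of store, product, customer and day have no sale.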

  • Query as a Cube : Usage

  • Hypercube Representation

    If more than three dimensions are present in the solution, the cube or 3D space representation is no longer usable.

    The principle of the cube can be extended to a hypercube.

    [Figure label: 4th Dimension]

  • Sample Multidimensional Representation

  • Six Basic Concepts of MDDM

    To build an initial multi-dimensional data model, the following six base elements have to be identified:

    1. Measures

    2. Dimensions

    3. Grains of dimensions and granularities of measures and facts

    4. Facts

    5. Dimension hierarchies

    6. Aggregation levels

  • 1. Measure

    A measure is a data item which information analysts use in their queries to measure the performance or behavior of a business process or a business object.

    Sample types of measures:

      • Quantities
      • Sizes
      • Amounts
      • Durations, delays
      • And so forth

    KPI (Key Performance Indicator) is a commonly used synonym for the most important measures of a business.

  • 2. Dimensions

    A dimension is an entity or a collection of related entities, used by information analysts to identify the context of the measures they work with. Examples: Product, Customer, Store, Time.

    Dimensions are referred to through so-called dimension keys.

    Dimensions contain:

      • Dimension entities
      • Dimension attributes
      • Dimension hierarchies

    As an example, the measure sales revenue only makes sense if we know its value for a specific item, a specific customer, on a given day and in a certain store. This gives the four dimensions: item, store, customer and time.

  • 3. Granularity

    The grain of a dimension is the lowest level of detail available within that dimension:

      • Product grain: Item
      • Customer grain: Customer
      • Store grain: Store
      • Time grain: Day

    The granularity of a measure is determined by the combination of the grains of all its dimensions.

    For example, the granularity of the measure QTY_SOLD is: (item, customer, store, day).

    Fine granularity enables fine-grained analysis, but it also has a big impact on the size of the Data Warehouse.
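
    The size impact can be estimated with simple arithmetic: the number of possible cells is the product of the number of members at each dimension's grain. The member counts below are invented for illustration, and the result is an upper bound, since in practice many combinations never occur.

```python
from math import prod

# Assumed (invented) number of members at the grain of each dimension.
fine_grain = {"item": 10_000, "customer": 50_000, "store": 100, "day": 365}

# Upper bound on cells at granularity (item, customer, store, day).
max_cells_fine = prod(fine_grain.values())

# Coarser time grain: one year seen as 12 months instead of 365 days.
coarse_grain = {"item": 10_000, "customer": 50_000, "store": 100, "month": 12}
max_cells_coarse = prod(coarse_grain.values())

# How many times smaller the coarse bound is.
shrink_factor = max_cells_fine / max_cells_coarse
```

    Moving the time grain from day to month shrinks the bound by a factor of about 30, at the price of losing per-day analysis.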

  • Granularity

    This example shows how fine granularity can reveal hidden information, such as some stores in a region performing better than others.

  • 4. Facts

    A fact is a collection of related measures and their associated dimensions, represented by the dimension keys. Example: Sales.

    A fact can represent a business object, a business transaction or an event which is used by the information analyst.

    Facts contain:

      • A Fact Identifier
      • Dimension Keys
      • Measures
      • Supportive Attributes
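
    One way to picture the four parts of a fact is as a record type. The sketch below is an assumption-laden illustration: the class name, the field names and the supportive attribute are hypothetical, chosen to match the Sales example with its four dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesFact:
    # Fact identifier
    fact_id: int
    # Dimension keys (one per dimension)
    item_key: int
    customer_key: int
    store_key: int
    day_key: int
    # Measures
    qty_sold: int
    revenue: float
    # Supportive attribute (hypothetical example)
    receipt_number: str = ""


# One invented sales fact.
fact = SalesFact(1, 101, 202, 303, 20150903, qty_sold=2, revenue=7.98)
```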

  • 5. Dimension Hierarchies

    Dimensions consist of one or more dimension hierarchies. Examples: Hierarchies in the Product Dimension:

      • Product Classification Hierarchy ("Merchandising Hierarchy")
      • Branding Hierarchy

    Each dimension hierarchy can include several aggregation levels.

    Examples: Aggregation Levels in the Product Classification Hierarchy

  • 6. Aggregation Levels

    Each dimension hierarchy can include several aggregation levels. For example:

      • Item: 4-pack Duracell AA Alkaline Batteries
      • Product: Duracell AA Alkaline Batteries
      • Sub-category: AA Alkaline Batteries
      • Category: Batteries
      • Department: Supplies

    Dimension hierarchies and aggregation levels are used by users when drilling up or down.
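
    Drilling up along such a hierarchy amounts to replacing each member by its parent and summing the measure. A minimal sketch follows; the 8-pack item and all quantities are invented, while the rest of the chain follows the example levels.

```python
# Each entry maps a member to its parent at the next aggregation level.
# The 8-pack item is a made-up sibling added for illustration.
parent = {
    "4-pack Duracell AA Alkaline Batteries": "Duracell AA Alkaline Batteries",
    "8-pack Duracell AA Alkaline Batteries": "Duracell AA Alkaline Batteries",
    "Duracell AA Alkaline Batteries": "AA Alkaline Batteries",
    "AA Alkaline Batteries": "Batteries",
    "Batteries": "Supplies",
}

# Invented quantities at the item level (the grain).
qty_by_item = {
    "4-pack Duracell AA Alkaline Batteries": 10,
    "8-pack Duracell AA Alkaline Batteries": 4,
}


def roll_up(totals):
    """Aggregate totals one level up the hierarchy."""
    result = {}
    for member, qty in totals.items():
        up = parent[member]
        result[up] = result.get(up, 0) + qty
    return result


by_product = roll_up(qty_by_item)
by_subcategory = roll_up(by_product)
```

    Drilling down is the reverse direction and needs the detailed data, since the item-level split cannot be recovered from the totals.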

  • Summary

  • Initial MDM- Example

    This shows the six base concepts as they apply to our Sales Query and the initial model that corresponds with that query.

  • Star Schema

    A star schema is a way to represent multidimensional data in a relational database.

    The star schema logical design, unlike the entity-relationship model, is specifically geared towards decision support applications.

    The fact table stores business data:

      • Generally several orders of magnitude larger than any dimension table
      • One key column joined to each dimension table
      • One or more data columns

    Multidimensional queries can be built by joining fact and dimension tables.

    Some products use this method to make a relational OLAP (ROLAP) system.

  • Star Schema- Example

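  • Logical Data Modeling: A Star Schema Example

    [Figure: star schema diagram. Tables and columns recoverable from the diagram:]

    Sales (fact table): time_key, branch_key, location_key, product_key, num_units, amount_usd

    Time: time_key, day, month, year

    Branch: branch_key, name, type

    Product: product_key, name, brand, type

    Supplier: supplier_key, name, type

    Location: location_key, city, state, country

    Relationships are drawn as 1 (dimension) to n (fact); one relationship in the diagram is labeled "???".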

    One-to-many relationships between the fact and dimensions. The fact-dimension relationships are certain. Dimensions in star models are often tightly coupled.
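
    A star schema of this shape can be built and queried with Python's sqlite3 module. The sketch below uses the fact columns and three of the dimensions from the example. The _dim suffix on the table names, the omission of Branch and all row values are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY, name TEXT, brand TEXT, type TEXT);
    CREATE TABLE location_dim (
        location_key INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);

    -- Fact table: one key column per dimension plus the measures.
    CREATE TABLE sales (
        time_key     INTEGER REFERENCES time_dim(time_key),
        product_key  INTEGER REFERENCES product_dim(product_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        num_units    INTEGER,
        amount_usd   REAL
    );

    -- Invented sample rows.
    INSERT INTO time_dim VALUES (1, 1, 9, 2015), (2, 2, 9, 2015);
    INSERT INTO product_dim VALUES
        (1, 'AA battery', 'BrandA', 'battery'),
        (2, 'Charger', 'BrandB', 'accessory');
    INSERT INTO location_dim VALUES (1, 'CityX', 'StateX', 'CountryX');
    INSERT INTO sales VALUES (1, 1, 1, 4, 8.0), (2, 1, 1, 2, 4.0), (2, 2, 1, 1, 9.0);
""")

# Multidimensional query: join the fact table to its dimensions,
# filter on one dimension attribute and group by others.
result = conn.execute("""
    SELECT t.month, p.brand, SUM(s.num_units), SUM(s.amount_usd)
    FROM sales s
    JOIN time_dim t     ON t.time_key = s.time_key
    JOIN product_dim p  ON p.product_key = s.product_key
    JOIN location_dim l ON l.location_key = s.location_key
    WHERE l.country = 'CountryX'
    GROUP BY t.month, p.brand
    ORDER BY p.brand
""").fetchall()
```

    Every dimension is exactly one join away from the fact table, which is what gives the schema its star shape.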

  • Snowflake Schema

    The snowflake model is a further normalized version of the star schema.

    When a dimension table contains data that is not always necessary for queries, too much data may be picked up each time a dimension table is accessed.

    To eliminate access to this data, it is kept in a separate table off the dimension, thereby making the star resemble a snowflake.
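
    The idea can be sketched with sqlite3 by moving rarely needed details out of a dimension into their own table. Hanging a supplier table off the product dimension is an assumption chosen for illustration, as are all row values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Rarely needed supplier details live in a separate table off the dimension.
    CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE product (
        product_key  INTEGER PRIMARY KEY,
        name         TEXT,
        brand        TEXT,
        supplier_key INTEGER REFERENCES supplier(supplier_key)
    );
    CREATE TABLE sales (
        product_key INTEGER REFERENCES product(product_key),
        num_units   INTEGER
    );
    INSERT INTO supplier VALUES (1, 'SupplierA', 'wholesale');
    INSERT INTO product VALUES (1, 'AA battery', 'BrandA', 1);
    INSERT INTO sales VALUES (1, 4), (1, 2);
""")

# Most queries stop at the product dimension ...
by_brand = conn.execute(
    "SELECT p.brand, SUM(s.num_units) FROM sales s "
    "JOIN product p ON p.product_key = s.product_key GROUP BY p.brand"
).fetchall()

# ... and only queries that need supplier data pay for the extra join.
by_supplier = conn.execute(
    "SELECT u.name, SUM(s.num_units) FROM sales s "
    "JOIN product p ON p.product_key = s.product_key "
    "JOIN supplier u ON u.supplier_key = p.supplier_key GROUP BY u.name"
).fetchall()
```

    The trade-off is a smaller dimension table for common queries against one more join for the queries that do need the outlying data.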

  • Typical OLAP Operations

    Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.

    Drill down (roll down): reverse of roll-up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.

    Slice and dice: project and select.

    Pivot (rotate): reorient the cube for visualization; 3D to a series of 2D planes.

    Other operations:

      • drill across: involving (across) more than one fact table
      • drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
      • rankings
      • time functions, e.g. time average
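
    The core operations can be sketched on a tiny in-memory cube. This is an illustration only: the cube contents, the axis order (product, city, month) and the function names are invented, and a real OLAP engine would implement these far more efficiently.

```python
from collections import defaultdict

# Tiny invented cube: (product, city, month) -> units sold.
cube = {
    ("pen", "CityX", "Jan"): 5,
    ("pen", "CityY", "Jan"): 3,
    ("ink", "CityX", "Jan"): 2,
    ("pen", "CityX", "Feb"): 4,
}


def roll_up(cube, keep):
    """Roll up by dimension reduction: keep only the listed axis positions."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


def slice_(cube, axis, member):
    """Slice: select a single member on one dimension."""
    return {k: v for k, v in cube.items() if k[axis] == member}


def dice(cube, allowed):
    """Dice: select a sub-cube; 'allowed' maps axis position -> set of members."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[axis] in members for axis, members in allowed.items())
    }


def pivot(cube, order):
    """Pivot: reorder the axes of every cell key."""
    return {tuple(k[i] for i in order): v for k, v in cube.items()}


by_product_month = roll_up(cube, keep=(0, 2))        # city dimension removed
jan = slice_(cube, axis=2, member="Jan")             # one month only
pens_in_x = dice(cube, {0: {"pen"}, 1: {"CityX"}})   # sub-cube on two axes
rotated = pivot(cube, order=(2, 0, 1))               # month becomes first axis
```

    Drill down has no function of its own here because it simply means going back to the more detailed cube that a roll-up was computed from.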