Data Warehouse by Amr Ali

Prepared by Amr Ali

Upload: amr-ali

Post on 17-Jul-2015

Category:

Data & Analytics


Agenda

Introduction

DWH Definitions

DWH Architecture

DWH Design Process

Types of Fact Tables

Types of Dimensions

Types of Data Marts

Introduction

Information is a very powerful asset that can provide significant benefits to any organization.

Organizations have vast amounts of data, but it is difficult to access and use.

Data is in many formats, exists on different platforms, and resides in different file and database structures.

To make valuable decisions without a data warehouse, an organization has to write hundreds of programs to extract, prepare, and integrate data for analysis and reporting.

Instead, you can implement a data warehouse and a BI system to get more insight from the data you own.

BI tools help you extract, transform, load, and integrate heterogeneous data sources into your DWH easily and efficiently.

Data Warehouse Definitions

Bill Inmon Definition

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.

The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in third normal form.

Ralph Kimball Definition

A data warehouse is a copy of transaction data specifically structured for query and analysis.

The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in a dimensional model.

Bill Inmon's Definition Explained

Subject-Oriented: A data warehouse is used to analyze a particular subject area.

Integrated: A data warehouse integrates data from multiple data sources.

Time-Variant: Historical data is kept in a data warehouse.

Non-volatile: Once data is in the data warehouse, it will not change. Historical data in a data warehouse should never be altered.

Data Warehouse vs. Database

Database                            Data Warehouse
Application oriented                Subject oriented
Entity-relationship (ER) diagram    Star/snowflake schema
Thousands of rows                   Millions/billions of rows
MB to GB                            Tens of GB to TBs
Transaction throughput              Response time
Detailed                            Summarized
Operational processing              Informational processing

Data Warehouse Architecture

Data Warehouse Stages

Stage 1: Offline Operational Databases

Stage 2: Offline Data Warehouse

Stage 3: Real-Time Data Warehouse

Stage 4: Integrated Data Warehouse

Data Warehouse Design Process

Identify the subject area (SA) of interest.

Identify the dimensions of this SA.

Identify the key performance indicators (KPIs) and measures of this SA.

Data Warehouse Modeling Concepts

Dimension: A category of information.

○ Time dimension, product dimension.

Attribute: A unique level within a dimension.

○ Month is an attribute in the time dimension.

Hierarchy: The specification of levels that represents the relationships between different attributes within a dimension.

○ Year → Quarter → Month → Day.

Fact Table: A table that contains the measures of interest, along with the foreign keys (FKs) of the dimension tables connected to the fact.

○ Sales amount would be such a measure.
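The concepts above can be illustrated with a minimal Python sketch. It assumes a handful of made-up fact rows (a day plus a sales amount) and derives the time attributes from each day; the names `sales`, `time_attributes`, and `roll_up` are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows: (day, sales_amount).
# The day is the lowest level of the Time dimension; sales_amount is the measure.
sales = [
    (date(2015, 1, 15), 100.0),
    (date(2015, 2, 3), 50.0),
    (date(2015, 4, 20), 70.0),
    (date(2015, 7, 17), 30.0),
]


def time_attributes(d):
    """Derive the attributes (levels) of the Time dimension for one day."""
    return {
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day": d.day,
    }


def roll_up(rows, levels):
    """Sum the sales_amount measure at the requested levels of the hierarchy."""
    totals = defaultdict(float)
    for d, amount in rows:
        attrs = time_attributes(d)
        key = tuple(attrs[level] for level in levels)
        totals[key] += amount
    return dict(totals)


# Moving up the hierarchy Year -> Quarter -> Month -> Day gives coarser totals.
by_quarter = roll_up(sales, ["year", "quarter"])
by_year = roll_up(sales, ["year"])
```

Here `by_quarter` groups the measure by (year, quarter), while `by_year` collapses everything into a single yearly total.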

Data Warehouse Data Models - RK Star Schema

In a star schema design, the fact table sits in the middle and is connected to the dimension lookup tables like a star.

Each dimension is represented as a single table.

[Figure: a fact table in the center, connected to the Time, Customer, Product, and Store dimensions]
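As a rough illustration of the star layout, the sketch below builds one fact table and four single-table dimensions in an in-memory SQLite database using Python's standard `sqlite3` module. All table names, column names, and sample rows are assumptions made for this example, not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension (hypothetical columns).
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, day TEXT, month INTEGER, year INTEGER);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store    (store_id INTEGER PRIMARY KEY, city TEXT);

-- The fact table holds the measure plus one foreign key per dimension.
CREATE TABLE fact_sales (
    time_id      INTEGER REFERENCES dim_time(time_id),
    customer_id  INTEGER REFERENCES dim_customer(customer_id),
    product_id   INTEGER REFERENCES dim_product(product_id),
    store_id     INTEGER REFERENCES dim_store(store_id),
    sales_amount REAL
);

INSERT INTO dim_time VALUES (1, '2015-07-16', 7, 2015), (2, '2015-07-17', 7, 2015);
INSERT INTO dim_customer VALUES (1, 'Customer A');
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_store VALUES (1, 'Cairo');
INSERT INTO fact_sales VALUES (1, 1, 1, 1, 900.0), (2, 1, 1, 1, 100.0), (2, 1, 2, 1, 250.0);
""")

# A typical star query: join the fact table to one dimension and aggregate.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

Every analytical query needs only one join per dimension it touches, which is the main appeal of keeping each dimension in a single table.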

Data Warehouse Data Models - RK Snowflake Schema

The snowflake schema is the normalized version of the star schema.

Example: a Time dimension that consists of two different hierarchies:

1. Year → Month → Day

2. Week → Day

[Figure: a fact table connected to the Time, Customer, Product, and Store dimensions]
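One possible way to picture the normalized Time dimension with its two hierarchies is the sketch below, again using an in-memory SQLite database through Python's `sqlite3` module. The table and column names (`dim_day`, `dim_month`, `dim_year`, `dim_week`) and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hierarchy 1: Year -> Month -> Day, one table per level.
CREATE TABLE dim_year  (year_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_month (month_id INTEGER PRIMARY KEY, month INTEGER,
                        year_id INTEGER REFERENCES dim_year(year_id));
-- Hierarchy 2: Week -> Day.
CREATE TABLE dim_week  (week_id INTEGER PRIMARY KEY, iso_week INTEGER);
-- The day level points at its parent in each hierarchy.
CREATE TABLE dim_day   (day_id INTEGER PRIMARY KEY, day TEXT,
                        month_id INTEGER REFERENCES dim_month(month_id),
                        week_id  INTEGER REFERENCES dim_week(week_id));
CREATE TABLE fact_sales (day_id INTEGER REFERENCES dim_day(day_id), sales_amount REAL);

INSERT INTO dim_year  VALUES (1, 2015);
INSERT INTO dim_month VALUES (1, 6, 1), (2, 7, 1);
INSERT INTO dim_week  VALUES (1, 27);
-- 2015-06-30 and 2015-07-01 fall in different months but the same ISO week.
INSERT INTO dim_day   VALUES (1, '2015-06-30', 1, 1), (2, '2015-07-01', 2, 1);
INSERT INTO fact_sales VALUES (1, 40.0), (2, 60.0);
""")

# Roll up along Year -> Month -> Day.
by_month = conn.execute("""
    SELECT m.month, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_day d   ON d.day_id = f.day_id
    JOIN dim_month m ON m.month_id = d.month_id
    GROUP BY m.month
    ORDER BY m.month
""").fetchall()

# Roll up along Week -> Day.
by_week = conn.execute("""
    SELECT w.iso_week, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_day d  ON d.day_id = f.day_id
    JOIN dim_week w ON w.week_id = d.week_id
    GROUP BY w.iso_week
""").fetchall()
```

Compared with a star schema, each roll-up now needs an extra join per hierarchy level, which is the usual trade-off of normalizing a dimension.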

Types of Measures

Additive: Facts that can be summed up across all of the dimensions in the fact table.

Semi-Additive: Facts that can be summed up across some of the dimensions in the fact table, but not the others.

Non-Additive: Facts that cannot be summed up across any of the dimensions present in the fact table.
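The three kinds of measures can be contrasted with a small Python sketch. The data is invented: daily account snapshots with a balance and a deposit amount, plus two sales rows with revenue and cost. Deposits, balances, and profit margin are used here as commonly cited examples of additive, semi-additive, and non-additive measures.

```python
# Hypothetical daily snapshot rows: (day, account, balance, deposits).
snapshots = [
    ("2015-07-16", "A", 100.0, 20.0),
    ("2015-07-16", "B", 300.0, 0.0),
    ("2015-07-17", "A", 150.0, 50.0),
    ("2015-07-17", "B", 300.0, 0.0),
]

# Additive: deposits can be summed across every dimension (days and accounts).
total_deposits = sum(row[3] for row in snapshots)

# Semi-additive: balances can be summed across accounts for a single day...
balance_on_17 = sum(row[2] for row in snapshots if row[0] == "2015-07-17")
# ...but summing one account's balance across days counts the same money twice.
misleading_sum_for_a = sum(row[2] for row in snapshots if row[1] == "A")

# Non-additive: a ratio such as profit margin cannot be summed at all.
# Hypothetical sales rows: (revenue, cost).
sales = [(100.0, 80.0), (300.0, 150.0)]
margins = [(revenue - cost) / revenue for revenue, cost in sales]
sum_of_margins = sum(margins)  # not a meaningful number

# The overall margin has to be recomputed from its additive components.
total_revenue = sum(revenue for revenue, _ in sales)
total_cost = sum(cost for _, cost in sales)
overall_margin = (total_revenue - total_cost) / total_revenue
```

A practical consequence is that ratios are usually stored as their additive numerator and denominator, and the ratio is computed after aggregation.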

Types of Fact Tables

Cumulative: Describes what has happened over a period of time.

Snapshot: Describes the state of things at a particular instant in time.

Factless Fact Table: A fact table that does not have any measures.
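A factless fact table may look odd at first, so here is a minimal Python sketch of one. It assumes a hypothetical class-attendance example in which each row holds only dimension keys (day, student, class); the useful numbers come from counting rows.

```python
from collections import Counter

# Hypothetical factless fact table: one row per attendance event.
# Each row contains only dimension keys (day, student, class) and no measure.
attendance = [
    ("2015-07-16", "student_1", "math"),
    ("2015-07-16", "student_2", "math"),
    ("2015-07-16", "student_1", "physics"),
    ("2015-07-17", "student_1", "math"),
]

# Analysis is done by counting rows per dimension value.
attendance_per_class = Counter(class_key for _, _, class_key in attendance)
attendance_per_day = Counter(day for day, _, _ in attendance)
```

The presence of a row is itself the fact: it records that an event occurred for that combination of dimension values.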

Types of Dimensions

Conformed Dimension: A dimension that has exactly the same meaning and content when referred to from different fact tables.

Junk Dimension: A single table holding a combination of different, unrelated attributes, used to avoid having a large number of foreign keys in the fact table.

Role-Playing Dimension

Slowly Changing Dimension: Applies to cases where an attribute of a dimension record varies over time.

Rapidly Changing Dimension: A dimension attribute that changes frequently is a rapidly changing attribute.
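For the slowly changing dimension case, one common way to handle a changing attribute is to keep every version of the record with validity dates, an approach often called "Type 2". The slides do not name a specific technique, so the Python sketch below is only an assumption about how it might be done; the row layout and the helper names `apply_change` and `city_as_of` are hypothetical.

```python
from datetime import date

# Hypothetical customer dimension. valid_to=None marks the current version.
dim_customer = [
    {"sk": 1, "customer_id": "C1", "city": "Cairo",
     "valid_from": date(2014, 1, 1), "valid_to": None},
]


def apply_change(dim, customer_id, new_city, change_date):
    """Close the current row and append a new version, so history is kept."""
    current = next(
        row for row in dim
        if row["customer_id"] == customer_id and row["valid_to"] is None
    )
    if current["city"] == new_city:
        return  # attribute did not change; nothing to do
    current["valid_to"] = change_date
    dim.append({
        "sk": max(row["sk"] for row in dim) + 1,  # new surrogate key
        "customer_id": customer_id,
        "city": new_city,
        "valid_from": change_date,
        "valid_to": None,
    })


def city_as_of(dim, customer_id, as_of):
    """Return the city that was valid for the customer on a given date."""
    for row in dim:
        if (row["customer_id"] == customer_id
                and row["valid_from"] <= as_of
                and (row["valid_to"] is None or as_of < row["valid_to"])):
            return row["city"]
    return None


apply_change(dim_customer, "C1", "Alexandria", date(2015, 7, 17))
```

After the change, a fact dated before 17 July 2015 still resolves to the old city, while later facts resolve to the new one.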

Types of Loading Dimension Tables

Conventional (Slow): All the constraints and keys are validated against the data before it is loaded; this way, data integrity is maintained.

Direct (Fast): All the constraints and keys are disabled before the data is loaded. Once the data is loaded, it is validated against all the constraints and keys. If data is found to be invalid or dirty, it is not included in the index, and all future processes skip this data.
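The difference between the two loading styles can be sketched in plain Python. This is a simplified simulation, not how any particular database loader works: the single constraint (a non-null, unique `id`) and the functions `is_valid`, `conventional_load`, and `direct_load` are assumptions made for the example.

```python
def is_valid(row, known_keys):
    """Hypothetical constraint check: the key must be non-null and not seen before."""
    return row["id"] is not None and row["id"] not in known_keys


def conventional_load(rows):
    """Validate each row before loading it; invalid rows never enter the table."""
    table, keys, rejected = [], set(), []
    for row in rows:
        if is_valid(row, keys):
            table.append(row)
            keys.add(row["id"])
        else:
            rejected.append(row)
    return table, rejected


def direct_load(rows):
    """Load everything with checks off, then validate and index only clean rows."""
    table = list(rows)  # bulk load, no constraint checks
    index, invalid_positions = {}, []
    for position, row in enumerate(table):
        if is_valid(row, index):
            index[row["id"]] = position
        else:
            invalid_positions.append(position)  # left out of the index
    return table, index, invalid_positions


incoming = [
    {"id": 1, "name": "Laptop"},
    {"id": None, "name": "Unknown"},   # violates the non-null rule
    {"id": 1, "name": "Duplicate"},    # violates the uniqueness rule
    {"id": 2, "name": "Desk"},
]

conventional_table, conventional_rejected = conventional_load(incoming)
direct_table, direct_index, direct_invalid = direct_load(incoming)
```

In the direct case the invalid rows are physically present in the table, but because they are missing from the index, later processing that goes through the index never touches them.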

Data Mart Types

Independent Data Mart: Created from operational systems, with a separate physical data store.

Logical Data Mart: Exists as a subset of the data warehouse; built over the data warehouse logically.

Dependent Data Mart: Created from a data warehouse into a separate physical data store; built over the data warehouse physically.
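The logical and dependent variants can be contrasted with a short sketch using Python's `sqlite3` module. Here a view stands in for a logical data mart and a copied table stands in for a dependent one; in practice the dependent mart would normally live in its own data store, so the single in-memory database, the table names, and the sample rows are all simplifying assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse_sales (region TEXT, product TEXT, sales_amount REAL);
INSERT INTO warehouse_sales VALUES
    ('EMEA', 'Laptop', 900.0),
    ('EMEA', 'Desk',   250.0),
    ('APAC', 'Laptop', 700.0);

-- Logical data mart: a view over the warehouse, with no storage of its own.
CREATE VIEW mart_emea_logical AS
    SELECT product, sales_amount FROM warehouse_sales WHERE region = 'EMEA';

-- Dependent data mart: a physical copy built from the warehouse.
CREATE TABLE mart_emea_dependent AS
    SELECT product, sales_amount FROM warehouse_sales WHERE region = 'EMEA';

-- A later change in the warehouse.
INSERT INTO warehouse_sales VALUES ('EMEA', 'Chair', 80.0);
""")

# The view reflects the new row immediately; the copy does not until it is reloaded.
logical_rows = conn.execute("SELECT COUNT(*) FROM mart_emea_logical").fetchone()[0]
dependent_rows = conn.execute("SELECT COUNT(*) FROM mart_emea_dependent").fetchone()[0]
```

The logical mart is always current but shares the warehouse's query load, while the dependent mart is isolated but needs a refresh process to stay in sync.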

Data Modeling: Conceptual, Logical, and Physical Data Models

Conceptual Model: Identifies the highest-level relationships between the different entities.

Logical Model: Describes the data in as much detail as possible, without regard to how it will be physically implemented in the database.

Physical Model: Represents how the model will be built in the database.

Questions?