data models for warehouse session-12/13 data management for decision support

Data Models for Warehouse

Session-12/13

Data Management for Decision Support

Data Models

Data models: relations, stars and snowflakes, cubes

Operators: slice and dice, roll-up, drill-down, pivoting, others

Star schemas are database schemas that exploit the structure of data for decision support queries. Queries in a DSS tend to:

- Examine a set of factual transactions (POS, customer events)
- Analyze the facts in a variety of ways (POS transactions by week, or by store)

For example, in a retail store the POS transaction is at the center, surrounded by:

- Product information: SKU, hierarchy (section, dept, BU)
- Time information: day, week, month, year
- Stores: Store-id, hierarchy (region, city, locality)
- Suppliers: Sup-id, location, discounts

[Figure: Sales Transactions at the center, linked to Products, Time, Suppliers, and Stores]

Information is split between two classes: factual information and reference information

FACT DATA

Fact data records information about factual events that occurred in the business: POS transactions, phone calls, banking transactions

Typically 70% of warehouse data is fact data

It is important to identify and define the structure right in the first place, as restructuring is an expensive process

The detailed content of a fact is derived from the business requirement

Recorded facts do not change, as they are events of the past

Dimension Data

Information that is used for analyzing the elemental data, for example, product hierarchy, time periods, customers, stores

It is the reference data used for analysis of Facts

Organizing the information in separate reference tables offers better query performance

It differs from Fact data as it changes over time, due to changes in business, reorganization

It should be structured to permit rapid changes

FACT and Dimensions

Fact table:
- Millions to billions of rows
- Multiple foreign keys
- Numeric
- Does not change

Dimension table:
- Tens to millions of rows
- One primary key
- Textual description
- Frequently modified

Decision Support Queries

Examples:
- Average number of sales of Haldiram per store over the last month (various types within the brand)
- Projected sales of Deepavali gift packs against the actual
- The top 20% of customers (by spending) over the last quarter
- The customers with an average balance in excess of Rs. 25000 for the past one year

==> Each of these queries is based on factual data

Decision Support Queries

Examples:
- Sales of Haldiram: POS Transaction (Quantity Sold, Product, Store, Date, Time, Revenue Realized)
- Customer Spend: Membership card Transaction (Customer-Id, Store, Transaction Value, Date and Time)
- Account Balance: Account transactions (Customer, AC number, type of transaction, amount)

Star Schema

The star schema is a data-modeling technique used to map multidimensional decision support into a relational database.

Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database.

Four components: facts, dimensions, attributes, attribute hierarchies

A Simple Star Schema

Facts
- Facts are numeric measurements (values) that represent a specific business aspect or activity.
- The fact table contains facts that are linked through their dimensions.
- Facts can be computed or derived at run-time (metrics).

Dimensions
- Dimensions are qualifying characteristics that provide additional perspectives to a given fact.
- Dimensions are stored in dimension tables.

Identifying Facts and Dimensions

1. Identify the elemental transactions
2. Determine the key dimensions
3. Check whether a fact is actually a dimension
4. Check whether a dimension is actually a fact

Identification: Step 1

Examine the enterprise model and identify the transactions that are of interest, driven by business requirement analysis

These will be transactions that describe events fundamental to the business, e.g., number of calls for telecom, account transactions in banking

For each potential fact, ask: is this information operated upon by a business process? Consider daily sales versus POS: even if the system reports daily sales, POS may be the fact

The limits of current recording should not influence warehouse design

Sector and business, each followed by its fact table:

Retail
- Sales: POS transaction
- Shrinkage: Stock movement and position

Retail Banking
- Customer profiling: Customer events
- Profitability: Account transactions

Insurance
- Product profitability: Claims and receipts

Telecom
- Call analysis: Call events
- Customer analysis: Customer events (install, disconnect, payment)

Identification: Step 2

Look at the logical model to find the entities associated with the entities in the fact table. List all such logically associated entities.

These are the candidate references. The task is to find the key dimension entities, which may not be directly associated.

For example, in retail banking the account transaction is a candidate fact table and the account is a candidate reference. The customer is only indirectly related to the transaction, yet is a better choice of dimension.

- Analyze account transactions by account?
- Analyze how customers use our services?

You store both relationships, but customer becomes a dimension

Identification: Step 3

Check that a candidate fact is not actually a denormalized dimension table. Consider the following:

house-details: cable laid, salesperson's visit, connected to the service, promotional material sent, subscription cancelled, …

- Home-details: candidate fact
- Operational events
- Report on number of connections quarter-to-date
- Time lag between laying and subscription

Identification: Step 4

Check that a dimension is not actually a fact. A lot depends on the DSS requirements:

- Customer can be a fact or a dimension
- Promotions can be a fact or a dimension

Ask questions using the other dimensions: by how many other dimensions can I view this entity?

- Can I view promotions by time?
- Can I view promotions by product?
- Can I view promotions by store?
- Can I view promotions by supplier?

If the answer to these questions is yes, then it is a fact

Attributes
- Each dimension table contains attributes. Attributes are often used to search, filter, or classify facts.
- Dimensions provide descriptive characteristics about the facts through their attributes.

Possible Attributes For Sales Dimensions

Three Dimensional View Of Sales

Slice And Dice View Of Sales

Attribute Hierarchies

Attributes within dimensions can be ordered in a well-defined attribute hierarchy.

The attribute hierarchy provides a top-down data organization that is used for two main purposes:

Aggregation

Drill-down/roll-up data analysis

A Location Attribute Hierarchy

Attribute Hierarchies In Multidimensional Analysis

Star Schema Representation

Facts and dimensions are normally represented by physical tables in the data warehouse database.

The fact table is related to each dimension table in a many-to-one (M:1) relationship.

Fact and dimension tables are related by foreign keys and are subject to the primary/foreign key constraints.
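
The many-to-one relationship and the key constraints can be tried out on a small scale. The following is a minimal sketch using Python's built-in sqlite3 module; the table names, column names, and sample rows are hypothetical and are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT, brand TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_desc TEXT, region TEXT);
CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, day TEXT, month TEXT);
-- Hypothetical fact table: one foreign key per dimension plus numeric measures.
CREATE TABLE sales_fact (
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key   INTEGER NOT NULL REFERENCES store_dim(store_key),
    period_key  INTEGER NOT NULL REFERENCES period_dim(period_key),
    units       INTEGER,
    dollars     REAL,
    PRIMARY KEY (product_key, store_key, period_key)
);
""")
conn.execute("INSERT INTO product_dim VALUES (1, 'Gift pack', 'BrandA')")
conn.execute("INSERT INTO store_dim VALUES (1, 'Main Street', 'North')")
conn.execute("INSERT INTO period_dim VALUES (1, '2024-01-01', '2024-01')")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 3, 30.0)")

# A fact row that points at a missing dimension row is rejected.
try:
    conn.execute("INSERT INTO sales_fact VALUES (99, 1, 1, 1, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

fact_rows = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(rejected, fact_rows)
```

Many fact rows may share one dimension row, which is the M:1 side of the relationship; the composite primary key here is one possible choice, not the only one.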

Star Schema For Sales

Orders Star Schema

The Multi-Dimensional Model

“Sales by product line over the past six months”

“Sales by store between 1990 and 1995”

[Figure: a fact table for measures (Prod Code, Time Code, Store Code, Sales Qty) surrounded by the dimension tables Store Info, Product Info, and Time Info. The code columns are the keys joining the fact table to the dimension tables; Sales Qty is the numerical measure.]

Dimensional Modeling

Dimensions are organized into hierarchies
- E.g., Time dimension: days → weeks → quarters
- E.g., Product dimension: product → product line → brand

Dimensions have attributes

[Figure: Dimension Hierarchies. Store dimension: Stores, District, Region. Product dimension: Products, Manufacturer.]

ROLAP: Dimensional Modeling Using Relational DBMS

- Special schema design: star, snowflake
- Special indexes: bitmap, multi-table join
- Special tuning: maximize query throughput
- Proven technology (relational model, DBMS); tends to outperform specialized MDDB, especially on large data sets
- Products: IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

MOLAP: Dimensional Modeling Using the Multi Dimensional Model

- MDDB: a special-purpose data model
- Facts stored in multi-dimensional arrays
- Dimensions used to index the array
- Sometimes on top of a relational DB
- Products: Pilot, Arbor Essbase, Gentia

Star Schema (in RDBMS)

Star Schema Example

Star Schema with Sample Data

The “Classic” Star Schema

- A single fact table, with detail and summary data
- Fact table primary key has only one key column per dimension
- Each key is generated
- Each dimension is a single table, highly denormalized

Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata

Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem

[Figure: the classic star schema]

Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price

Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr., Level

Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer, Level

Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Resolution, Sequence

The “Classic” Star Schema

The biggest drawback: dimension tables must carry a level indicator for every record and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district will be pulled from the fact table, resulting in error.

Example:

    SELECT A.STORE_KEY, A.PERIOD_KEY, A.dollars
    FROM Fact_Table A
    WHERE A.STORE_KEY IN (SELECT STORE_KEY
                          FROM Store_Dimension B
                          WHERE region = 'North' AND Level = 2)
    and etc...

Level is needed whenever aggregates are stored with detail facts.


The “Level” Problem

Level is a problem because it creates potential for error. If the query builder, human or program, forgets about it, perfectly reasonable-looking WRONG answers can occur.
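
The effect of a forgotten level constraint can be reproduced on a toy data set. This is a minimal sketch with Python's sqlite3 module; the rows, the lower-case table names, and the level coding (0 for an individual store, 2 for a region aggregate) are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dimension (store_key INTEGER PRIMARY KEY, store_desc TEXT,
                              region TEXT, level INTEGER);
CREATE TABLE fact_table (store_key INTEGER, period_key INTEGER, dollars REAL);
-- Assumed coding: level 0 = individual store, level 2 = region aggregate.
INSERT INTO store_dimension VALUES (1, 'Store 1', 'North', 0);
INSERT INTO store_dimension VALUES (2, 'Store 2', 'North', 0);
INSERT INTO store_dimension VALUES (3, 'North total', 'North', 2);
INSERT INTO fact_table VALUES (1, 1, 100.0);
INSERT INTO fact_table VALUES (2, 1, 50.0);
INSERT INTO fact_table VALUES (3, 1, 150.0);  -- pre-computed region aggregate
""")

def north_total(with_level):
    """Sum dollars for the North region, with or without the level constraint."""
    level_filter = "AND level = 0" if with_level else ""
    query = f"""
        SELECT SUM(a.dollars) FROM fact_table a
        WHERE a.store_key IN (SELECT store_key FROM store_dimension
                              WHERE region = 'North' {level_filter})
    """
    return conn.execute(query).fetchone()[0]

wrong = north_total(with_level=False)  # detail rows plus the aggregate row
right = north_total(with_level=True)   # detail rows only
print(wrong, right)  # 300.0 150.0
```

The unconstrained query runs without any error and returns a plausible number, which is what makes the mistake hard to notice.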

One alternative: the FACT CONSTELLATION model...

The “Fact Constellation” Schema

[Figure: the fact constellation schema]

Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price

District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.

Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer

Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Sequence

The “Fact Constellation” Schema

In the Fact Constellation, aggregate tables are created separately from the detail; therefore it is impossible to pick up, for example, Store detail when querying the District Fact Table.

Major Advantage: No need for the “Level” indicator in the dimension tables, since no aggregated data is stored with lower-level detail

Disadvantage: Dimension tables are still very large in some cases, which can slow performance; front-end must be able to detect existence of aggregate facts, which requires more extensive metadata
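
One way to picture a separate aggregate table is to derive it from the detail fact table with a grouped query. The sketch below uses Python's sqlite3 module; the sample rows and the reduced column lists are hypothetical, and real warehouses may populate aggregate tables by other means.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dimension (store_key INTEGER PRIMARY KEY, district_id TEXT);
CREATE TABLE fact_table (store_key INTEGER, product_key INTEGER,
                         period_key INTEGER, dollars REAL, units INTEGER);
INSERT INTO store_dimension VALUES (1, 'D1'), (2, 'D1'), (3, 'D2');
INSERT INTO fact_table VALUES (1, 10, 1, 100.0, 4), (2, 10, 1, 50.0, 2),
                              (3, 10, 1, 70.0, 3);
-- The district-level aggregate lives in its own table, keyed by district.
CREATE TABLE district_fact_table AS
SELECT s.district_id, f.product_key, f.period_key,
       SUM(f.dollars) AS dollars, SUM(f.units) AS units
FROM fact_table f JOIN store_dimension s ON s.store_key = f.store_key
GROUP BY s.district_id, f.product_key, f.period_key;
""")

district_rows = conn.execute(
    "SELECT district_id, product_key, period_key, dollars, units "
    "FROM district_fact_table ORDER BY district_id"
).fetchall()
print(district_rows)  # [('D1', 10, 1, 150.0, 6), ('D2', 10, 1, 70.0, 3)]
```

Because the district table holds no store keys at all, a district-level query cannot pick up store detail by accident.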


Another Alternative to “Level”

Fact Constellation is a good alternative to the Star, but when dimensions have very high cardinality, the sub-selects in the dimension tables can be a source of delay.

An alternative is to normalize the dimension tables by attribute level, with each smaller dimension table pointing to an appropriate aggregated fact table, the “Snowflake Schema” ...

The “Snowflake” Schema

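[Figure: the snowflake schema]

Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.

District table: District_ID, District Desc., Region_ID

Region table: Region_ID, Region Desc., Regional Mgr.

Store Fact Table: Dollars, Units, Price

District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

Region Fact Table: Dollars, Units, Price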

- No LEVEL in dimension tables
- Dimension tables are normalized by decomposing at the attribute level
- Each dimension table has one key for each level of the dimension's hierarchy
- The lowest level key joins the dimension table to both the fact table and the lower level attribute table

How does it work? The best way is for the query to be built by understanding which summary levels exist, and finding the proper snowflaked attribute tables, constraining there for keys, then selecting from the fact table.


Additional features: The original Store Dimension table, completely de-normalized, is kept intact, since certain queries can benefit by its all-encompassing content.

In practice, start with a Star Schema and create the “snowflakes” with queries. This eliminates the need to create separate extracts for each table, and referential integrity is inherited from the dimension table.

Advantage: Best performance when queries involve aggregation

Disadvantage: Complicated maintenance and metadata, explosion in the number of tables in the database
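
Creating the snowflaked tables with queries against the denormalized dimension might look as follows. This is a sketch only, using Python's sqlite3 module with invented sample rows and a shortened column list for the Store Dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dimension (
    store_key INTEGER PRIMARY KEY, store_desc TEXT,
    district_id TEXT, district_desc TEXT,
    region_id TEXT, region_desc TEXT);
INSERT INTO store_dimension VALUES
    (1, 'Store 1', 'D1', 'District one', 'R1', 'North'),
    (2, 'Store 2', 'D1', 'District one', 'R1', 'North'),
    (3, 'Store 3', 'D2', 'District two', 'R1', 'North');
-- Each snowflake table is a projection of the original dimension,
-- so no separate extract is needed.
CREATE TABLE district_dimension AS
    SELECT DISTINCT district_id, district_desc, region_id FROM store_dimension;
CREATE TABLE region_dimension AS
    SELECT DISTINCT region_id, region_desc FROM store_dimension;
""")

districts = conn.execute(
    "SELECT district_id, region_id FROM district_dimension ORDER BY district_id"
).fetchall()
regions = conn.execute("SELECT region_id, region_desc FROM region_dimension").fetchall()
print(districts)  # [('D1', 'R1'), ('D2', 'R1')]
print(regions)    # [('R1', 'North')]
```

The original store_dimension table is left untouched, so queries that benefit from the fully denormalized form can still use it.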


Advantages of ROLAP Dimensional Modeling

- Defines complex, multi-dimensional data with a simple model
- Reduces the number of joins a query has to process
- Allows the data warehouse to evolve with relatively low maintenance

HOWEVER! Star schema and relational DBMS are not the magic solution
- Query optimization is still problematic

Aggregates

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

Aggregates

Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

ans
date  sum
1     81
2     48
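
The two aggregate queries can be run against the sale table as given. The sketch below loads the six rows into an in-memory SQLite database through Python's sqlite3 module; the choice of SQLite and the ORDER BY clause are additions for the sake of a reproducible run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
     ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)],
)

# Total for a single day.
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# One total per day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(day1_total)  # 81
print(by_day)      # [(1, 81), (2, 48)]
```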

Another Example

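Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

Going from this table to the by-day totals is a rollup; going from the by-day totals to this table is a drill-down.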

Aggregates

Operators: sum, count, max, min, median, avg

“Having” clause

Using dimension hierarchy
- average by region (within store)
- maximum by month (within date)
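
A HAVING clause and an aggregate over a dimension hierarchy can both be sketched on the sale data. The example uses Python's sqlite3 module; the store table is a hypothetical dimension table, with store s1 placed in region A and stores s2 and s3 in region B as in the deck's hierarchy example, and the threshold of 50 is arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
INSERT INTO sale VALUES ('p1','s1',1,12), ('p2','s1',1,11), ('p1','s3',1,50),
                        ('p2','s2',1,8),  ('p1','s1',2,44), ('p1','s2',2,4);
-- Hypothetical dimension table carrying the store -> region hierarchy.
CREATE TABLE store (storeId TEXT PRIMARY KEY, region TEXT);
INSERT INTO store VALUES ('s1','A'), ('s2','B'), ('s3','B');
""")

# HAVING filters groups after aggregation: keep products whose total exceeds 50.
big_products = conn.execute(
    "SELECT prodId, sum(amt) FROM sale GROUP BY prodId HAVING sum(amt) > 50"
).fetchall()

# Average by region: join to the dimension and group on the higher level.
avg_by_region = conn.execute(
    "SELECT st.region, avg(s.amt) FROM sale s "
    "JOIN store st ON st.storeId = s.storeId "
    "GROUP BY st.region ORDER BY st.region"
).fetchall()

print(big_products)   # [('p1', 110)]
print(avg_by_region)
```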

ROLAP vs. MOLAP

ROLAP: Relational On-Line Analytical Processing

MOLAP: Multi-Dimensional On-Line Analytical Processing

The MOLAP Cube

Fact table view:

sale
prodId  storeId  amt
p1      s1       12
p2      s1       11
p1      s3       50
p2      s2       8

Multi-dimensional cube (dimensions = 2):

      s1   s2   s3
p1    12        50
p2    11   8
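
The step from the fact table view to the two-dimensional cube can be sketched in a few lines of plain Python. Representing the cube as a dictionary keyed by (prodId, storeId), with None for empty cells, is an assumption made for illustration; MOLAP products use their own array storage.

```python
# Fact table view: one row per (product, store) with a measure.
sale = [("p1", "s1", 12), ("p2", "s1", 11), ("p1", "s3", 50), ("p2", "s2", 8)]

# The dimension values index the cube.
products = sorted({p for p, _, _ in sale})
stores = sorted({s for _, s, _ in sale})

# Start with every cell empty, then fill in the cells that have a fact.
cube = {(p, s): None for p in products for s in stores}
for prod, store, amt in sale:
    cube[(prod, store)] = amt

# Lay the cube out as rows (products) by columns (stores).
grid = [[cube[(p, s)] for s in stores] for p in products]
print(stores)  # ['s1', 's2', 's3']
print(grid)    # [[12, None, 50], [11, 8, None]]
```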

3-D Cube

Fact table view: the sale table with prodId, storeId, date, amt

Multi-dimensional cube (dimensions = 3):

day 1
      s1   s2   s3
p1    12        50
p2    11   8

day 2
      s1   s2   s3
p1    44   4
p2

Example

[Figure: a cube with axes Time (M T W Th F S S), Product, and Store; one cell reads "56 units of bread sold in LA on M"]

Dimensions: Time, Product, Store

Attributes: Product (upc, price, …), Store …

Hierarchies:
- Product → Brand → …
- Day → Week → Quarter
- Store → Region → Country

Roll-ups: roll-up to week, roll-up to brand, roll-up to region

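Cube Aggregation: Roll-up

Example: computing sums

3-D cube:

day 1
      s1   s2   s3
p1    12        50
p2    11   8

day 2
      s1   s2   s3
p1    44   4
p2

Rollup over days:

      s1   s2   s3
p1    56   4    50
p2    11   8

Rollup of that table over products:

      s1   s2   s3
sum   67   12   50

Rollup of that table over stores:

      sum
p1    110
p2    19

Drill-down moves in the opposite direction, from the sums back to the detail.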

Cube Operators for Roll-up

Using the same cube and sums as in the roll-up example, a * in a position means that dimension has been summed over (the arguments are store, product, day):

sale(s1,*,*) = 67
sale(*,*,*) = 129
sale(s2,p2,*) = 8

Extended Cube

The cube is extended with a * member on every dimension, holding the sums.

day 1
      s1   s2   s3   *
p1    12        50   62
p2    11   8         19
*     23   8    50   81

day 2
      s1   s2   s3   *
p1    44   4         48
p2
*     44   4         48

day *
      s1   s2   s3   *
p1    56   4    50   110
p2    11   8         19
*     67   12   50   129

sale(*,p2,*) = 19
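
A small sketch of how such an extended cube could be computed: every fact row contributes to each cell obtained by replacing any subset of its coordinates with *. The function name and the dictionary representation are illustrative assumptions, not a description of how any particular OLAP product works.

```python
from itertools import product as cartesian

# Rows are (store, product, day, amt), matching the sale(store, product, day) notation.
rows = [("s1", "p1", 1, 12), ("s1", "p2", 1, 11), ("s3", "p1", 1, 50),
        ("s2", "p2", 1, 8), ("s1", "p1", 2, 44), ("s2", "p1", 2, 4)]

def extended_cube(rows):
    """Return a dict mapping (store, product, day) keys, with '*' wildcards, to sums."""
    cube = {}
    for store, prod, day, amt in rows:
        # Each coordinate is either kept or replaced by '*': 2**3 = 8 cells per row.
        for key in cartesian((store, "*"), (prod, "*"), (day, "*")):
            cube[key] = cube.get(key, 0) + amt
    return cube

cube = extended_cube(rows)
print(cube[("s1", "*", "*")])   # 67
print(cube[("*", "*", "*")])    # 129
print(cube[("s2", "p2", "*")])  # 8
print(cube[("*", "p2", "*")])   # 19
```

Cells with no underlying facts, such as ("s3", "p2", "*"), are simply absent from the dictionary.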

Aggregation Using Hierarchies

Store hierarchy: store → region → country
(store s1 in Region A; stores s2, s3 in Region B)

Rolling the 3-D cube up from store to region, over all days:

      region A   region B
p1    56         54
p2    11         8
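
The roll-up from store to region can be sketched in plain Python by mapping each store to its parent in the hierarchy before summing. The region_of dictionary encodes the assignment stated on the slide; the variable names are illustrative.

```python
from collections import defaultdict

# Fact rows: (product, store, day, amt).
sale = [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
        ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)]

# One level of the store hierarchy: store -> region.
region_of = {"s1": "A", "s2": "B", "s3": "B"}

# Replace each store by its region, then sum over days.
by_region = defaultdict(int)
for prod, store, day, amt in sale:
    by_region[(prod, region_of[store])] += amt

print(dict(by_region))
```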

Slicing

Fixing TIME = day 1 selects one slice of the 3-D cube:

      s1   s2   s3
p1    12        50
p2    11   8
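
Slicing amounts to fixing one dimension and keeping the remaining ones. A minimal sketch in plain Python, assuming the cube is held as a dictionary keyed by (product, store, day):

```python
# 3-D cube: (product, store, day) -> amt; empty cells are simply absent.
cube = {
    ("p1", "s1", 1): 12, ("p2", "s1", 1): 11, ("p1", "s3", 1): 50,
    ("p2", "s2", 1): 8, ("p1", "s1", 2): 44, ("p1", "s2", 2): 4,
}

def slice_cube(cube, day):
    """Fix the time dimension and return the remaining 2-D (product, store) cube."""
    return {(p, s): amt for (p, s, d), amt in cube.items() if d == day}

day1 = slice_cube(cube, 1)
print(day1)
```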

Slicing & Pivoting

Sales ($ millions) by store and product, with one column per day (d1, d2); only one column of values is recoverable:

Store s1
  Electronics  $5.2
  Toys         $1.9
  Clothing     $2.3
  Cosmetics    $1.1
Store s2
  Electronics  $8.9
  Toys         $0.75
  Clothing     $4.6
  Cosmetics    $1.5

After slicing on d1 and pivoting the stores into columns, Sales ($ millions):

Products     Store s1  Store s2
Electronics  $5.2      $8.9
Toys         $1.9      $0.75
Clothing     $2.3      $4.6
Cosmetics    $1.1      $1.5
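
Pivoting rearranges which dimension runs down the rows and which runs across the columns. The sketch below, in plain Python, turns the store-by-product listing into a product-by-store table; it assumes the recoverable sales values all belong to day d1, and the helper name is hypothetical.

```python
# (store, product, sales in $ millions), assumed to be the d1 slice.
sales_d1 = [
    ("s1", "Electronics", 5.2), ("s1", "Toys", 1.9),
    ("s1", "Clothing", 2.3), ("s1", "Cosmetics", 1.1),
    ("s2", "Electronics", 8.9), ("s2", "Toys", 0.75),
    ("s2", "Clothing", 4.6), ("s2", "Cosmetics", 1.5),
]

def pivot(rows):
    """Return {product: {store: amount}}, i.e. products as rows, stores as columns."""
    table = {}
    for store, product, amount in rows:
        table.setdefault(product, {})[store] = amount
    return table

pivoted = pivot(sales_d1)
for product, by_store in pivoted.items():
    print(f"{product:<12}{by_store['s1']:>6}{by_store['s2']:>6}")
```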

Summary of Operations

- Aggregation (roll-up): aggregate (summarize) data to the next higher dimension element
  e.g., total sales by city, year → total sales by region, year
- Navigation to detailed data (drill-down)
- Selection (slice): defines a subcube
  e.g., sales where city = ‘Gainesville’ and date = ‘1/15/90’
- Calculation and ranking
  e.g., top 3% of cities by average income
- Visualization operations (e.g., Pivot)
- Time functions
  e.g., time average

Query & Analysis Tools
- Query Building
- Report Writers (comparisons, growth, graphs, …)
- Spreadsheet Systems
- Web Interfaces
- Data Mining
