data models for warehouse session-12/13 data management for decision support
Post on 13-Jan-2016
Data Models for Warehouse
Session-12/13
Data Management for Decision Support
Data Models
  Data models: relations, stars and snowflakes, cubes
  Operators: slice and dice, roll-up, drill-down, pivoting, other
Data Models
Star schemas are database schemas that exploit the structure of data for decision support queries.
Queries in DSS tend to:
  Examine a set of factual transactions: POS, customer events
  Analyze facts in a variety of ways: POS transactions by week, or by store
For example, in a retail store the POS transaction is at the center, surrounded by:
  Product information: SKU, hierarchy (section, dept, BU)
  Time information: day, week, month, year
  Stores: store-id, hierarchy (region, city, locality)
  Suppliers: sup-id, location, discounts
Data Models
[Figure: Sales Transactions at the center, linked to Products, Time, Suppliers, and Stores]
Information is split between two classes: factual information and reference information.
FACT DATA
  Fact data records information on factual events that occurred in the business: POS, phone calls, banking transactions.
  Typically 70% of warehouse data is fact data.
  It is important to identify and define the structure right in the first place, as restructuring is an expensive process.
  The detailed content of a fact is derived from the business requirement.
  Recorded facts do not change, as they are events of the past.
Dimension Data
  Information that is used for analyzing the elemental data, for example product hierarchy, time periods, customers, stores.
  It is the reference data used for analysis of facts.
  Organizing the information in separate reference tables offers better query performance.
  It differs from fact data in that it changes over time, due to changes in the business and reorganization.
  It should be structured to permit rapid changes.
FACT and Dimensions
  Fact table                      Dimension table
  Millions to billions of rows    Tens to millions of rows
  Multiple foreign keys           One primary key
  Numeric                         Textual description
  Does not change                 Frequently modified
Decision Support Queries
Examples:
  Average number of sales of Haldiram per store over the last month (various types within the brand)
  Projected sales of Deepavali gift packs against the actual
  The top 20% of customers (by spending) over the last quarter
  The customers with an average balance in excess of Rs. 25000 for the past one year
==> Each of these queries is based on factual data.
Decision Support Queries
Examples:
  Sales of Haldiram: POS transaction (quantity sold, product, store, date, time, revenue realized)
  Customer spend: membership card transaction (customer-id, store, transaction value, date and time)
  Account balance: account transactions (customer, A/C number, type of transaction, amount)
Star Schema
The star schema is a data-modeling technique used to map multidimensional decision support data into a relational database.
Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database.
Four components: facts, dimensions, attributes, attribute hierarchies
A Simple Star Schema
Star Schema
Facts
  Facts are numeric measurements (values) that represent a specific business aspect or activity.
  The fact table contains facts that are linked through their dimensions.
  Facts can be computed or derived at run-time (metrics).
Dimensions
  Dimensions are qualifying characteristics that provide additional perspectives to a given fact.
  Dimensions are stored in dimension tables.
Identifying Facts and Dimensions
  Elemental transaction
  Determine key dimensions
  Check if a fact is a dimension
  Check if a dimension is a fact
Identification: Step 1
  Examine the enterprise model and identify the transactions that are of interest, driven by business requirement analysis.
  These will be transactions that describe events fundamental to the business, e.g., number of calls for telecom, account transactions in banking.
  For each potential fact, ask: is this information operated upon by a business process? Consider daily sales versus POS: even if the system reports daily sales, POS may be the FACT.
  The limits of current recording should not influence warehouse design.
Identification: Step 1
  Sector and business                    Fact table
  Retail: Sales                          POS transaction
  Retail: Shrinkage                      Stock movement and position
  Retail Banking: Customer profiling     Customer events
  Retail Banking: Profitability          Account transactions
  Insurance: Product profitability       Claims and receipts
  Telecom: Call analysis                 Call events
  Telecom: Customer analysis             Customer events (install, disconnect, payment)
Identification: Step 2
  Look at the logical model to find the entities associated with entities in the fact table. List all such logically associated entities.
  These are candidate references; the task is to find the key dimension entities, which may not be directly associated.
  For example, in retail banking the account transaction is a candidate fact table and the account is a candidate reference. The customer is only indirectly related to the transaction, yet may be the better choice of dimension.
    Analyze account transactions by account?
    Analyze how customers use our services?
  You store both relationships, but customer becomes a dimension.
Identification: Step 3
Check that the FACT is not actually a denormalized dimension table. Consider the following:
  house-details
    cable laid
    salesperson's visit
    connected to the service
    promotional material sent
    subscription cancelled
    …
  Home-details: candidate fact
  Operational events
  Report on number of connections, quarter-to-date
  Time lag between laying and subscription
Identification: Step 4
Check that the dimension is not actually a FACT. A lot depends on the DSS requirements:
  Customer can be a fact or a dimension
  Promotions can be a fact or a dimension
Ask questions using the other dimensions: by how many other dimensions can I view this entity?
  Can I view promotions by time?
  Can I view promotions by product?
  Can I view promotions by store?
  Can I view promotions by supplier?
If the answer to these questions is yes, then it is a FACT.
Star Schema
Attributes
  Each dimension table contains attributes. Attributes are often used to search, filter, or classify facts.
  Dimensions provide descriptive characteristics about the facts through their attributes.
Possible Attributes For Sales Dimensions
Three Dimensional View Of Sales
Slice And Dice View Of Sales
Star Schema
Attribute Hierarchies
Attributes within dimensions can be ordered in a well-defined attribute hierarchy.
The attribute hierarchy provides a top-down data organization that is used for two main purposes:
Aggregation
Drill-down/roll-up data analysis
A Location Attribute Hierarchy
Attribute Hierarchies In Multidimensional Analysis
Star Schema
Star Schema Representation
Facts and dimensions are normally represented by physical tables in the data warehouse database.
The fact table is related to each dimension table in a many-to-one (M:1) relationship.
Fact and dimension tables are related by foreign keys and are subject to the primary/foreign key constraints.
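The many-to-one relationship and the key constraints can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names (store_dim, product_dim, period_dim, sales_fact), the columns, and the sample rows are hypothetical and are not taken from the slides.

```python
import sqlite3

# Hypothetical star schema: three dimension tables and one fact table.
DDL = """
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_desc TEXT, region TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT, brand TEXT);
CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE sales_fact (
    store_key   INTEGER NOT NULL REFERENCES store_dim(store_key),
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    period_key  INTEGER NOT NULL REFERENCES period_dim(period_key),
    units       INTEGER,
    dollars     REAL,
    PRIMARY KEY (store_key, product_key, period_key)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(DDL)

# Made-up sample rows: one row per dimension, one fact row pointing at them.
conn.execute("INSERT INTO store_dim VALUES (1, 'Main Street', 'North')")
conn.execute("INSERT INTO product_dim VALUES (1, 'Gift pack', 'BrandX')")
conn.execute("INSERT INTO period_dim VALUES (1, '2016-01-13', '2016-01')")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 3, 29.97)")

def insert_orphan_fact():
    """Try to insert a fact row whose store_key has no matching dimension row."""
    try:
        conn.execute("INSERT INTO sales_fact VALUES (99, 1, 1, 1, 9.99)")
        return True
    except sqlite3.IntegrityError:
        return False

orphan_accepted = insert_orphan_fact()
```

Many fact rows may share one store_key, but each fact row points at exactly one row per dimension, and the orphan insert is rejected by the foreign key constraint.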
Star Schema For Sales
Orders Star Schema
The Multi-Dimensional Model
  “Sales by product line over the past six months”
  “Sales by store between 1990 and 1995”
[Figure: fact table for measures, with key columns Prod Code, Time Code, Store Code joining the fact table to the dimension tables, and numerical measure Sales Qty; dimension tables Store Info, Product Info, Time Info, . . .]
Dimensional Modeling
  Dimensions are organized into hierarchies
    E.g., Time dimension: days -> weeks -> quarters
    E.g., Product dimension: product -> product line -> brand
  Dimensions have attributes
Dimension Hierarchies
  Store dimension: Stores -> District -> Region -> Total
  Product dimension: Products -> Brand -> Manufacturer -> Total
ROLAP: Dimensional Modeling Using Relational DBMS
  Special schema design: star, snowflake
  Special indexes: bitmap, multi-table join
  Special tuning: maximize query throughput
  Proven technology (relational model, DBMS); tends to outperform specialized MDDB, especially on large data sets
  Products: IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
MOLAP: Dimensional Modeling Using the Multi Dimensional Model
  MDDB: a special-purpose data model
  Facts stored in multi-dimensional arrays
  Dimensions used to index the array
  Sometimes on top of a relational DB
  Products: Pilot, Arbor Essbase, Gentia
Star Schema (in RDBMS)
Star Schema Example
Star Schema with Sample Data
The “Classic” Star Schema
  A single fact table, with detail and summary data
  Fact table primary key has only one key column per dimension
  Each key is generated
  Each dimension is a single table, highly denormalized
Benefits: Easy to understand, easy to define hierarchies, reduces the number of physical joins, low maintenance, very simple metadata
Drawbacks: Summary data in the fact table yields poorer performance for summary levels; huge dimension tables are a problem
[Figure: classic star schema]
  Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
  Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr., Level
  Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Resolution, Sequence
  Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer, Level
The “Classic” Star Schema
The biggest drawback: dimension tables must carry a level indicator for every record, and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district, will be pulled from the fact table, resulting in error.
Example:
  SELECT A.STORE_KEY, A.PERIOD_KEY, A.dollars
  FROM Fact_Table A
  WHERE A.STORE_KEY IN (SELECT B.STORE_KEY
                        FROM Store_Dimension B
                        WHERE B.region = 'North' AND B.Level = 2)
  and etc...
Level is needed whenever aggregates are stored with detail facts.
The “Level” Problem
Level is a problem because it creates potential for error. If the query builder, human or program, forgets about it, perfectly reasonable looking WRONG answers can occur.
One alternative: the FACT CONSTELLATION model...
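A small sketch of how a forgotten level constraint can produce a plausible but wrong total, using Python's sqlite3. The sample rows and the level numbering (1 = store detail, 2 = district aggregate, 3 = region aggregate) are assumptions made for illustration; the slides do not say which number denotes which level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dimension (store_key INTEGER PRIMARY KEY, store_desc TEXT,
                              region TEXT, level INTEGER);
CREATE TABLE fact_table (store_key INTEGER, period_key INTEGER, dollars REAL);

-- Assumed level numbering: 1 = store detail, 2 = district aggregate, 3 = region aggregate
INSERT INTO store_dimension VALUES
    (1,  'Store A',            'North', 1),
    (2,  'Store B',            'North', 1),
    (10, 'District N1 total',  'North', 2),
    (20, 'North region total', 'North', 3);

-- Detail rows and pre-computed aggregates stored together in one fact table
INSERT INTO fact_table VALUES
    (1,  1, 100.0),
    (2,  1, 50.0),
    (10, 1, 150.0),
    (20, 1, 150.0);
""")

def north_dollars(level=None):
    """Total dollars for the North region, optionally constrained to one level."""
    sql = """SELECT SUM(a.dollars) FROM fact_table a
             WHERE a.store_key IN (SELECT b.store_key FROM store_dimension b
                                   WHERE b.region = 'North'"""
    params = []
    if level is not None:
        sql += " AND b.level = ?"
        params.append(level)
    sql += ")"
    return conn.execute(sql, params).fetchone()[0]

without_level = north_dollars()      # mixes detail rows with both aggregate rows
with_level = north_dollars(level=1)  # detail rows only
```

With these made-up rows the unconstrained query returns three times the true total, because the district and region aggregates are added on top of the store detail.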
The “Fact Constellation” Schema
[Figure: fact constellation schema]
  Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
  District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
  Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
  Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
  Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Sequence
  Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
The “Fact Constellation” Schema
In the Fact Constellation, aggregate tables are created separately from the detail; therefore it is impossible to pick up, for example, Store detail when querying the District Fact Table.
Major Advantage: No need for the “Level” indicator in the dimension tables, since no aggregated data is stored with lower-level detail
Disadvantage: Dimension tables are still very large in some cases, which can slow performance; the front-end must be able to detect the existence of aggregate facts, which requires more extensive metadata
Another Alternative to “Level”
Fact Constellation is a good alternative to the Star, but when dimensions have very high cardinality, the sub-selects in the dimension tables can be a source of delay.
An alternative is to normalize the dimension tables by attribute level, with each smaller dimension table pointing to an appropriate aggregated fact table, the “Snowflake Schema” ...
The “Snowflake” Schema
[Figure: snowflake schema]
  Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
  District attribute table: District_ID, District Desc., Region_ID
  Region attribute table: Region_ID, Region Desc., Regional Mgr.
  Store Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
  District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
  Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
The “Snowflake” Schema
  No LEVEL in dimension tables
  Dimension tables are normalized by decomposing at the attribute level
  Each dimension table has one key for each level of the dimension's hierarchy
  The lowest level key joins the dimension table to both the fact table and the lower level attribute table
How does it work? The best way is for the query to be built by understanding which summary levels exist, finding the proper snowflaked attribute tables, constraining there for keys, then selecting from the fact table.
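The query pattern just described (constrain the small attribute table for keys, then select from the matching aggregate fact table) might look like the following sqlite3 sketch. The table names region and region_fact, the manager names, and all sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked attribute table for the region level (made-up sample rows)
CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_desc TEXT, regional_mgr TEXT);
-- Aggregate fact table keyed at the region level
CREATE TABLE region_fact (region_id INTEGER, product_key INTEGER, period_key INTEGER,
                          dollars REAL, units INTEGER);

INSERT INTO region VALUES (1, 'North', 'Mgr N'), (2, 'South', 'Mgr S');
INSERT INTO region_fact VALUES
    (1, 1, 1, 150.0, 15),
    (1, 2, 1, 80.0,  4),
    (2, 1, 1, 60.0,  6);
""")

def region_dollars(region_desc):
    """Constrain the region attribute table for keys, then read the region fact table."""
    row = conn.execute(
        """SELECT SUM(f.dollars)
           FROM region_fact f
           WHERE f.region_id IN (SELECT r.region_id FROM region r
                                 WHERE r.region_desc = ?)""",
        (region_desc,),
    ).fetchone()
    return row[0]

north_total = region_dollars("North")
```

The sub-select runs against a table with one row per region rather than one row per store, and no level column is needed because the region fact table holds only region-level rows.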
The “Snowflake” Schema
Additional features: The original Store Dimension table, completely de-normalized, is kept intact, since certain queries can benefit by its all-encompassing content.
In practice, start with a Star Schema and create the “snowflakes” with queries. This eliminates the need to create separate extracts for each table, and referential integrity is inherited from the dimension table.
Advantage: Best performance when queries involve aggregation
Disadvantage: Complicated maintenance and metadata, explosion in the number of tables in the database
Advantages of ROLAP Dimensional Modeling
  Define complex, multi-dimensional data with a simple model
  Reduces the number of joins a query has to process
  Allows the data warehouse to evolve with relatively low maintenance
HOWEVER! Star schema and relational DBMS are not the magic solution
  Query optimization is still problematic
Aggregates
sale:
  prodId  storeId  date  amt
  p1      s1       1     12
  p2      s1       1     11
  p1      s3       1     50
  p2      s2       1     8
  p1      s1       2     44
  p1      s2       2     4
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
Result: 81
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
ans:
  date  sum
  1     81
  2     48
Another Example
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId
Result:
  prodId  date  amt
  p1      1     62
  p2      1     19
  p1      2     48
Going from the sale table to this result is a rollup; the reverse direction is a drill-down.
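The three aggregate queries can be run against the sale table with Python's sqlite3, as a quick check of the results quoted on the slides. The ORDER BY clauses are an addition here, included only to make the row order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
     ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)],
)

# Add up amounts for day 1
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date").fetchall()

# Add up amounts by day and product
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId").fetchall()
```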
Aggregates
  Operators: sum, count, max, min, median, avg
  “Having” clause
  Using dimension hierarchy
    average by region (within store)
    maximum by month (within date)
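A brief sketch of the HAVING clause and of aggregating along a dimension hierarchy, again with sqlite3. The store table is a hypothetical dimension table that encodes the store-to-region assignment used later in the slides (s1 in region A; s2 and s3 in region B), and the threshold of 20 is arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
INSERT INTO sale VALUES
    ('p1','s1',1,12), ('p2','s1',1,11), ('p1','s3',1,50),
    ('p2','s2',1,8),  ('p1','s1',2,44), ('p1','s2',2,4);
-- Hypothetical dimension table holding the store -> region level of the hierarchy
CREATE TABLE store (storeId TEXT PRIMARY KEY, region TEXT);
INSERT INTO store VALUES ('s1','A'), ('s2','B'), ('s3','B');
""")

# HAVING filters whole groups after aggregation (threshold chosen arbitrarily)
big_products = conn.execute(
    "SELECT prodId, sum(amt) FROM sale GROUP BY prodId HAVING sum(amt) > 20"
).fetchall()

# Average by region: join to the dimension table, then group at the higher level
avg_by_region = conn.execute(
    "SELECT st.region, avg(s.amt) FROM sale s "
    "JOIN store st ON st.storeId = s.storeId "
    "GROUP BY st.region ORDER BY st.region"
).fetchall()
```

WHERE would filter individual sale rows before grouping; HAVING is needed here because the condition is on the aggregated sum.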
ROLAP vs. MOLAP
  ROLAP: Relational On-Line Analytical Processing
  MOLAP: Multi-Dimensional On-Line Analytical Processing
The MOLAP Cube
Fact table view:
  sale:
    prodId  storeId  amt
    p1      s1       12
    p2      s1       11
    p1      s3       50
    p2      s2       8
Multi-dimensional cube (dimensions = 2):
        s1   s2   s3
  p1    12        50
  p2    11   8
3-D Cube
Fact table view:
  sale:
    prodId  storeId  date  amt
    p1      s1       1     12
    p2      s1       1     11
    p1      s3       1     50
    p2      s2       1     8
    p1      s1       2     44
    p1      s2       2     4
Multi-dimensional cube (dimensions = 3):
  day 1:
          s1   s2   s3
    p1    12        50
    p2    11   8
  day 2:
          s1   s2   s3
    p1    44   4
    p2
Example
[Figure: 3-D cube with axes Store (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread), and Time (M T W Th F S S); sample cell values 10, 34, 56, 32, 12, 56; e.g., 56 units of bread sold in LA on M]
Dimensions: Time, Product, Store
Attributes: Product (upc, price, …), Store …
Hierarchies:
  Product -> Brand -> …  (roll-up to brand)
  Day -> Week -> Quarter  (roll-up to week)
  Store -> Region -> Country  (roll-up to region)
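Cube Aggregation: Roll-up
Example: computing sums
  day 1:
          s1   s2   s3
    p1    12        50
    p2    11   8
  day 2:
          s1   s2   s3
    p1    44   4
    p2
  Summed over days:
          s1   s2   s3
    p1    56   4    50
    p2    11   8
  Summed over days and products:
          s1   s2   s3
    sum   67   12   50
  Summed over days and stores:
          sum
    p1    110
    p2    19
  Grand total: 129
  . . .
Rollup moves toward the totals; drill-down moves back toward the detailed cube.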
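Cube Operators for Roll-up
Each sum in the roll-up example is addressed by putting * in the aggregated positions:
  sale(s1,*,*): total for store s1 over all products and days (67)
  sale(s2,p2,*): total for store s2, product p2 over all days (8)
  sale(*,*,*): grand total (129)
  . . .
Totals over both days, with * rows and columns:
          s1   s2   s3   *
    p1    56   4    50   110
    p2    11   8         19
    *     67   12   50   129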
Extended Cube
  day 1:
          s1   s2   s3   *
    p1    12        50   62
    p2    11   8         19
    *     23   8    50   81
  day 2:
          s1   s2   s3   *
    p1    44   4         48
    p2
    *     44   4         48
  * (all days): the sums over both days
  sale(*,p2,*)
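One way to compute the extended cube is to add each fact to every cell obtained by replacing some of its coordinates with '*'. The sketch below does this in plain Python for the six sale rows; the function names extended_cube and sale are illustrative, and the coordinate order (store, product, day) follows the sale(s1,*,*) notation on the slides.

```python
from itertools import product

# Fact rows from the sale table: (storeId, prodId, date) -> amt
facts = {
    ("s1", "p1", 1): 12, ("s1", "p2", 1): 11, ("s3", "p1", 1): 50,
    ("s2", "p2", 1): 8,  ("s1", "p1", 2): 44, ("s2", "p1", 2): 4,
}

def extended_cube(facts):
    """Add every '*' (all-values) combination to the base cells by summing."""
    cube = {}
    for (store, prod, day), amt in facts.items():
        # Each coordinate either keeps its value or is generalized to '*'
        for key in product((store, "*"), (prod, "*"), (day, "*")):
            cube[key] = cube.get(key, 0) + amt
    return cube

cube = extended_cube(facts)

def sale(store, prod, day):
    """Look up a cell of the extended cube; empty cells read as 0."""
    return cube.get((store, prod, day), 0)
```

Each fact contributes to 2^3 = 8 cells, so the extended cube stays small for three dimensions but grows quickly as dimensions are added.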
Aggregation Using Hierarchies
Hierarchy: store -> region -> country
(store s1 in Region A; stores s2, s3 in Region B)
Rolling the day 1 and day 2 cubes up from store to region:
          region A   region B
    p1    56         54
    p2    11         8
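Rolling up along a hierarchy amounts to replacing each coordinate with its parent and summing the cells that collide. A minimal Python sketch, starting from the per-store totals over both days; the names roll_up and region_of are illustrative.

```python
from collections import defaultdict

# Cells summed over both days: (prodId, storeId) -> amt
by_store = {("p1", "s1"): 56, ("p1", "s2"): 4, ("p1", "s3"): 50,
            ("p2", "s1"): 11, ("p2", "s2"): 8}

# Store -> region level of the hierarchy, as stated on the slide
region_of = {"s1": "A", "s2": "B", "s3": "B"}

def roll_up(cells, parent_of):
    """Replace each store coordinate by its parent and sum the colliding cells."""
    rolled = defaultdict(int)
    for (prod, store), amt in cells.items():
        rolled[(prod, parent_of[store])] += amt
    return dict(rolled)

by_region = roll_up(by_store, region_of)
```

Applying roll_up again with a region-to-country mapping would give the next level of the hierarchy.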
Slicing
Fixing TIME = day 1 selects one slice of the 3-D cube:
          s1   s2   s3
    p1    12        50
    p2    11   8
Slicing & Pivoting
[Figure: Sales ($ millions) by Products, Store, and Time (d1, d2)]
Slice for d1:
  Store s1: Electronics $5.2, Toys $1.9, Clothing $2.3, Cosmetics $1.1
  Store s2: Electronics $8.9, Toys $0.75, Clothing $4.6, Cosmetics $1.5
Pivoted view for d1 (products by store):
                 Store s1   Store s2
    Electronics  $5.2       $8.9
    Toys         $1.9       $0.75
    Clothing     $2.3       $4.6
    Cosmetics    $1.1       $1.5
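Slicing and pivoting can be sketched in a few lines of plain Python on the sale rows used in the earlier slides: a slice fixes one dimension to a single value, and a pivot swaps the axes of the resulting 2-D view. The function names are illustrative.

```python
# Fact rows from the sale table: (prodId, storeId, date, amt)
rows = [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
        ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)]

def slice_by_date(rows, date):
    """Fix the time dimension to one value, leaving a product x store sub-cube."""
    return {(prod, store): amt for prod, store, d, amt in rows if d == date}

def pivot(cells):
    """Swap the two axes of a 2-D view: (row, col) becomes (col, row)."""
    return {(col, row): amt for (row, col), amt in cells.items()}

day1 = slice_by_date(rows, 1)   # keyed by (product, store)
day1_by_store = pivot(day1)     # keyed by (store, product)
```

The pivot changes only how the view is laid out; the cell values are untouched.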
Summary of Operations
  Aggregation (roll-up): aggregate (summarize) data to the next higher dimension element
    e.g., total sales by city, year -> total sales by region, year
  Navigation to detailed data (drill-down)
  Selection (slice) defines a subcube
    e.g., sales where city = 'Gainesville' and date = '1/15/90'
  Calculation and ranking
    e.g., top 3% of cities by average income
  Visualization operations (e.g., Pivot)
  Time functions
    e.g., time average
Query & Analysis Tools
  Query building
  Report writers (comparisons, growth, graphs, …)
  Spreadsheet systems
  Web interfaces
  Data mining