Data Warehousing and OLAP
Toon Calders (t.calders@tue.nl)
Motivation
• « Traditional » relational databases are geared towards online transaction processing (OLTP):
  • bank terminals
  • flight reservations
  • student administration
• Decision support systems have different requirements
Transaction Processing
| Transaction processing | Example: flight reservations |
|---|---|
| Operational setting | Ticket sales |
| Up-to-date data is critical | Do not sell a seat twice |
| Simple data | Reservation, date, name |
| Simple queries that only « touch » a small part of the database | Give flight details of X; list flights to Y |
Transaction Processing
• Database must support
  • simple data
    − tables
  • simple queries
    − select from where …
  • consistency & integrity: CRITICAL
  • concurrency
• Relational databases, Object-Oriented, Object-Relational
Decision Support
| Decision support | Example: flight company |
|---|---|
| Off-line setting | Evaluate the ROI of flights |
| « Historical » data | Flights of last year |
| Summarized data | # passengers per carrier for destination X |
| Integrate different databases | Passengers, fuel costs, maintenance info |
| Statistical queries | Average % of seats sold per month per destination |
PART I Concepts
Outline
Online Analytical Processing
• Data Warehouses
• Conceptual model: Data Cubes
• Query languages for supporting OLAP
  • SQL extensions
  • MDX
• Database Explosion Problem
Data Warehouse
A decision support DB that is maintained separately from the operational databases.
Why a separate data warehouse?
• Different functions
  − DBMS: tuned for OLTP
  − Warehouse: tuned for OLAP
• Different data
  − Decision support requires historical data
• Integration of data from heterogeneous sources
Three-Tier Architecture
[Figure: three-tier architecture. Data sources (operational DBs, other sources) feed the data storage tier (data warehouse, data marts, metadata, monitor & integrator) via extract, transform, load, and refresh; the OLAP engine tier (OLAP server, ROLAP server) serves the front-end tools (analysis, query/reporting, data mining).]
Three-Tier Architecture
• Extract-Transform-Load
  • Get data from different sources; data integration
  • Cleaning the data
  • Takes a significant part of the effort (up to 80%!)
• Refresh
  • Keep the data warehouse up to date when the source data changes
Three-Tier Architecture
• Data storage
  • Optimized for OLAP
  • Specialized data structures
  • Specialized indexing structures
• Data marts
  • common term for “less ambitious data warehouses”
  • Task oriented, departmental level
OLAP
• OLAP = OnLine Analytical Processing
  • Online = no waiting for answers
• OLAP system = system that supports analytical queries that are dimensional in nature.
Examples of Queries
• Flight company: evaluate ticket sales
  • give total, average, minimal, maximal amount
  • per date: week, month, year
  • by destination/source port/country/continent
  • by ticket type
  • by # of connections
  • …
Common Characteristics
• One (or a few) special attribute(s), e.g. amount → measure
• Other attributes select the relevant regions → dimensions
• Different levels of generality (month, year) → hierarchies
• Measurement data is summarized (sum, min, max, average) → aggregations
Supermarket Example
• Evaluate the sales of products
  • Product cost in $
  • Customer: ID, city, state, country, …
  • Store: chain, size, location, …
  • Product: brand, type, …
  • …
• What are the measure and dimensional attributes, and where are the hierarchies?
[Slide annotations: measure, dimensions, hierarchies]
Why Dimensions?
[Figure: a three-dimensional cube with axes customer, store, and product; each cell holds the cost in $]
• Multidimensional view on the data
Cross Tabulation
• Cross-tabulations are highly useful
  • Sales of clothes, June–August '06
| Date: month \ Product: color | Blue | Red | Orange | Total |
|---|---|---|---|---|
| June | 51 | 25 | 158 | 234 |
| July | 58 | 20 | 120 | 198 |
| August | 65 | 22 | 51 | 138 |
| Total | 174 | 67 | 329 | 570 |
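The margins of the cross-tab above can be recomputed from its inner cells. The following is a minimal Python sketch; the nested-dictionary layout and the helper name `margins` are assumptions made for illustration, only the cell values come from the slide.

```python
# Cell counts taken from the cross-tab above: sales[month][color].
sales = {
    "June":   {"Blue": 51, "Red": 25, "Orange": 158},
    "July":   {"Blue": 58, "Red": 20, "Orange": 120},
    "August": {"Blue": 65, "Red": 22, "Orange": 51},
}

def margins(table):
    """Return (row totals, column totals, grand total) of a nested dict."""
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for col, value in cells.items():
            col_totals[col] = col_totals.get(col, 0) + value
    return row_totals, col_totals, sum(row_totals.values())

row_totals, col_totals, grand_total = margins(sales)
```

The row totals aggregate over color, the column totals aggregate over month, and the grand total aggregates over both.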
Data Cubes
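• Extension of cross-tables to multiple dimensions
• Conceptual notion
| | Blue | Red | Orange | Total |
|---|---|---|---|---|
| June | 51 | 25 | 158 | 234 |
| July | 58 | 20 | 120 | 198 |
| August | 65 | 22 | 51 | 138 |
| Total | 174 | 67 | 329 | 570 |
• Row and column headers: the dimensions
• Inner cells: data points / 1st level of aggregation
• Total column: aggregated w.r.t. the X-dimension
• Total row: aggregated w.r.t. the Y-dimension
• Bottom-right cell: aggregated w.r.t. X and Y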
Data Cubes
[Figure: three-dimensional data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (Ireland, France, Germany, sum)]
Data Cubes
1) #TVs sold in 2Qtr?
2) Total sales in 3Qtr?
3) #products sold in France this year?
4) #PCs sold in Ireland in 2Qtr?
[Figure: the Date/Product/Country cube with the cells that answer questions 2, 3, and 4 marked]
(Question 1: in the back, at the bottom, second column)
Operations with Data Cubes
• What operations can you think of that an analyst might find useful? (e.g., store)
  • only look at stores in the Netherlands
  • look at cities instead of individual stores
  • look at the cross-table for product-date
  • restrict analysis to 2006, product O1
  • go back to a finer granularity at the store level
Roll-Up
• Move in one dimension from a lower (finer) level of granularity to a higher (coarser) one
  • store → city
  • city → country
  • product → product type
  • quarter → semester
[Figure: the Date/Product/Country cube before the roll-up (Date: 1Qtr, 2Qtr, 3Qtr, 4Qtr) and after rolling up Date to semesters (Date: 1st semester, 2nd semester)]
Drill-down
• Inverse operation of roll-up
• Move in one dimension from a higher (coarser) level of granularity to a lower (finer) one
  • city → store
  • country → city
  • product type → product
• Drill-through:
  • go back to the original, individual data records
Pivoting
• Change the dimensions that are “displayed”; select a cross-tab.
  • look at the cross-table for product-date
  • display cross-table for date-customer
| Sales: Country \ Date | 1st sem | 2nd sem | Total |
|---|---|---|---|
| Ireland | 20 | 23 | 43 |
| France | 126 | 138 | 264 |
| Germany | 56 | 48 | 104 |
| Total | 202 | 209 | 411 |
Slice & dice
• Roll-up on multiple dimensions at once
Select
• Select a part of the cube by restricting one or more dimensions
  • restrict analysis to Ireland and VCR
[Figure: the selected part of the cube, a single row over Date: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum]
Extended Aggregation
• SQL-92 aggregation is quite limited
  • Many useful aggregates are either very hard or impossible to specify
    − Data cube
    − Complex aggregates (median, variance)
    − Binary aggregates (correlation, regression curves)
    − Ranking queries (“assign each student a rank based on the total marks”)
• SQL:1999 OLAP extensions
Representing the Cube
• Special value « null » is used:
| Sales: Country \ Date | 1st sem | 2nd sem | Total |
|---|---|---|---|
| Ireland | 20 | 23 | 43 |
| France | 126 | 138 | 264 |
| Germany | 56 | 48 | 104 |
| Total | 202 | 209 | 411 |
| Date | Country | Sales |
|---|---|---|
| 1st semester | Ireland | 20 |
| 1st semester | France | 126 |
| 1st semester | Germany | 56 |
| 1st semester | null | 202 |
| 2nd semester | Ireland | 23 |
| 2nd semester | France | 138 |
| 2nd semester | Germany | 48 |
| 2nd semester | null | 209 |
| null | Ireland | 43 |
| null | France | 264 |
| null | Germany | 104 |
| null | null | 411 |
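The relational representation above can be derived mechanically from the six base tuples. The sketch below is an illustration in Python, not the slides' own method: it uses `None` where the slides use « null », and the function name `cube` is hypothetical.

```python
from itertools import product

# Base data from the cross-tab: (semester, country) -> sales.
base = {
    ("1st semester", "Ireland"): 20, ("1st semester", "France"): 126,
    ("1st semester", "Germany"): 56, ("2nd semester", "Ireland"): 23,
    ("2nd semester", "France"): 138, ("2nd semester", "Germany"): 48,
}

def cube(rows):
    """Add every tuple to all cells obtained by replacing any subset of
    its dimension values with None (the 'all' marker)."""
    cells = {}
    for dims, value in rows.items():
        for mask in product((False, True), repeat=len(dims)):
            key = tuple(None if hide else d for d, hide in zip(dims, mask))
            cells[key] = cells.get(key, 0) + value
    return cells

cells = cube(base)
```

Each base tuple contributes to four cells here (two dimensions, each either kept or replaced by `None`), which yields the twelve rows of the table.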
Group by Cube
• group by cube:
    select item-name, color, size, sum(number)
    from sales
    group by cube(item-name, color, size)
• Computes the union of eight different groupings of the sales relation:
    { (item-name, color, size), (item-name, color), (item-name, size), (color, size), (item-name), (color), (size), ( ) }
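The eight groupings are simply all subsets of the attribute list. A short Python sketch of that enumeration follows; the helper name `cube_groupings` is an assumption for illustration and is not part of SQL.

```python
from itertools import combinations

def cube_groupings(attrs):
    """All subsets of attrs, largest first: the groupings of GROUP BY CUBE."""
    return [combo
            for size in range(len(attrs), -1, -1)
            for combo in combinations(attrs, size)]

groupings = cube_groupings(("item-name", "color", "size"))
```

With n attributes this produces 2^n groupings, which is where the eight groupings for three attributes come from.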
Group by Cube
• The relational representation of the date-country-sales cube can be computed as follows:
    select semester as date, country, sum(sales)
    from sales
    group by cube(semester, country)
• grouping() and decode() can be applied to replace “null” by another constant:
  − decode(grouping(semester), 1, 'all', semester)
Group by Rollup
• The rollup construct generates the union of the groupings on every prefix of the specified list of attributes:
    select item-name, color, size, sum(number)
    from sales
    group by rollup(item-name, color, size)
• Generates the union of four groupings:
    { (item-name, color, size), (item-name, color), (item-name), ( ) }
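The four groupings are the prefixes of the attribute list, which a few lines of Python can enumerate. The helper name `rollup_groupings` is hypothetical and only serves this illustration.

```python
def rollup_groupings(attrs):
    """Every prefix of attrs, longest first: the groupings of GROUP BY ROLLUP."""
    return [tuple(attrs[:n]) for n in range(len(attrs), -1, -1)]

groupings = rollup_groupings(("item-name", "color", "size"))
```

A list of n attributes gives n + 1 groupings, compared with 2^n for cube.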
Group by Rollup
• Rollup can be used to generate aggregates at multiple levels.
• E.g., suppose itemcategory(item-name, category) gives the category of each item. Then
    select category, item-name, sum(number)
    from sales, itemcategory
    where sales.item-name = itemcategory.item-name
    group by rollup(category, item-name)
  gives a hierarchical summary by item-name and by category.
Group by Cube & Rollup
• Multiple rollups and cubes can be used in a single group by clause
  • Each generates a set of group by lists; the cross product of these sets gives the overall set of group by lists
Example
    select item-name, color, size, sum(number)
    from sales
    group by rollup(item-name), rollup(color, size)
generates the groupings
    { (item-name), ( ) } × { (color, size), (color), ( ) }
    = { (item-name, color, size), (item-name, color), (item-name), (color, size), (color), ( ) }
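The cross product in this example can be checked with a small Python sketch. Both helper names (`rollup_groupings`, `combine`) are assumptions for illustration; the sketch only enumerates grouping lists and does not evaluate any query.

```python
from itertools import product

def rollup_groupings(attrs):
    """Every prefix of attrs, longest first."""
    return [tuple(attrs[:n]) for n in range(len(attrs), -1, -1)]

def combine(*grouping_sets):
    """Cross product of grouping sets: pick one grouping from each set
    and concatenate the picks into a single group by list."""
    return [sum(choice, ()) for choice in product(*grouping_sets)]

groupings = combine(
    rollup_groupings(("item-name",)),
    rollup_groupings(("color", "size")),
)
```

Two groupings times three groupings gives the six group by lists shown above.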
MDX
• Multidimensional Expressions (MDX) is a query language for cubes
  • Supported by many data warehouses
  • Input and output are cubes
    SELECT { [Measures].[Store Sales] } ON COLUMNS,
           { [Date].[2002], [Date].[2003] } ON ROWS
    FROM Sales
    WHERE ( [Store].[USA].[CA] )
Implementation
• To make query answering more efficient: consolidate (materialize) all aggregations
• Early implementations used a multidimensional array.
  • Fast lookup of cell(prod. p, date d, prom. pr):
    − look up the index of p, the index of d, and the index of pr:
      index = (p × D × PR) + (d × PR) + pr
      (D = number of dates, PR = number of promotions)
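The index formula is the usual row-major linearization of a three-dimensional array. A minimal Python sketch, assuming zero-based indexes and hypothetical dimension sizes (4 dates, 2 promotions) chosen only for the example:

```python
def cell_index(p, d, pr, num_dates, num_promotions):
    """Row-major offset of cell (p, d, pr) in a flat array:
    index = (p x D x PR) + (d x PR) + pr."""
    return p * num_dates * num_promotions + d * num_promotions + pr

def cell_coords(index, num_dates, num_promotions):
    """Inverse of cell_index: recover (p, d, pr) from the flat offset."""
    p, rest = divmod(index, num_dates * num_promotions)
    d, pr = divmod(rest, num_promotions)
    return p, d, pr

# Hypothetical sizes: 4 dates and 2 promotions per product.
example = cell_index(2, 3, 1, num_dates=4, num_promotions=2)
```

Because the offset is computed arithmetically, a lookup needs no search once the three per-dimension indexes are known.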
Implementation
• Multidimensional array
  • obvious problem: sparse data
  • can easily be solved, though. Examples:
    − binary search tree, keyed on the index
    − hash table
Implementation
• However, people were very quickly confronted with the Data Explosion Problem
• Consolidating the summaries blows up the data enormously!
• The reasons are often misunderstood and confusing.
Data Explosion Problem
• Why?
Suppose:
  • n dimensions, every dimension has d values
  • $d^n$ possible tuples
  • Number of cells in the cube: $(d+1)^n$
  • So, this is not the problem: the cube has only a factor $(1 + 1/d)^n$ more cells than there are possible tuples
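To see why the extra « all » value per dimension is not the culprit, the two counts can be compared directly. A small Python sketch; d = 10 and n = 10 are taken from the example used later in the slides, and the function names are hypothetical.

```python
def possible_tuples(d, n):
    """Number of distinct base tuples: d values in each of n dimensions."""
    return d ** n

def cube_cells(d, n):
    """Number of cube cells: each dimension also has the value 'all'."""
    return (d + 1) ** n

ratio = cube_cells(10, 10) / possible_tuples(10, 10)
```

For these values the full cube is less than three times larger than the dense base data.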
Data Explosion Problem
• Why?
Suppose:
  • n dimensions, every dimension has d values
  • every dimension has a hierarchy
  • most extreme case: binary tree
    − $2d$ possibilities per dimension, so $2^n \times d^n$ cells
• Only a partial explanation (the factor $2^n$ comes from an extremely pathological case)
Data Explosion Problem
• Why?
  • The problem is that most data is not dense, but sparse.
  • Hence, not all $d^n$ combinations are possible.
• Example: 10 dimensions with 10 values
  • 10 000 000 000 possibilities
  • Suppose « only » 1 000 000 are present
Data Explosion Problem
• Example: 10 dimensions with 10 values
  • 10 000 000 000 possibilities
  • Suppose « only » 1 000 000 are present
• Every tuple increases the count of $2^{10}$ cells!
• With hierarchies, the effect is even worse!
  • If every hierarchy has 5 items: $5^{10}$ = 9 765 625 cells!
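The effect of sparsity can be observed on a scaled-down experiment. The following Python sketch is only an illustration: the parameters (6 dimensions, 10 values, 1 000 random tuples), the fixed seed, and the helper names are assumptions chosen so that it runs quickly, and no hierarchies are modelled.

```python
import random
from itertools import product

def count_cube_cells(tuples):
    """Number of distinct non-empty cube cells: each tuple touches the
    2^n cells obtained by replacing any subset of its values with None."""
    cells = set()
    for t in tuples:
        for mask in product((False, True), repeat=len(t)):
            cells.add(tuple(None if hide else v for v, hide in zip(t, mask)))
    return len(cells)

def random_tuples(count, n, d, seed=0):
    """Draw `count` distinct random tuples over n dimensions with d values."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < count:
        seen.add(tuple(rng.randrange(d) for _ in range(n)))
    return seen

N_DIMS, N_VALUES, N_TUPLES = 6, 10, 1000
tuples = random_tuples(N_TUPLES, N_DIMS, N_VALUES)
n_cells = count_cube_cells(tuples)
blow_up = n_cells / N_TUPLES  # materialized cells per stored tuple
```

With dense data the blow-up is bounded by the small factor $(1 + 1/d)^n$; with sparse data each tuple can contribute up to $2^n$ cells of its own, so `blow_up` can come out far larger.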
Summary Part I
• Data warehouses supporting OLAP for decision support
• Data cubes as a conceptual model
  • Measure, dimensions, hierarchy, aggregation
• Queries
  • Roll-up, drill-down, slice and dice, pivoting, …
• SQL:1999 extensions for supporting OLAP
• Straightforward implementation is problematic
PART II Technology
II.1 View materialization
II.2 Data Storage and Indexing
View materialization
Is it possible to get performance gains when only partially materializing the cube?
Which parts should we materialize in order to get the best performance?
V. Harinarayan, A. Rajaraman, and J. D. Ullman: Implementing data cubes efficiently. In: ACM SIGMOD 1996.
Overview
• Problem statement
• Formal model of computation
  • Partial order on queries
  • Cost model
• Greedy solution
• Performance guarantee
• Conclusion
The problem: informally
• Relation Sales
  • Dimensions: part, supplier, customer
  • Measure: sales
  • Aggregation: sum
• The data cube contains all aggregations on part, supplier, customer; e.g.:
  • (p1,s1,<all>,165)
  • (p1,s2,<all>,145)
Queries
• We are interested in answering the following type of queries efficiently:
  • ( part, supplier ):
      SELECT part, supplier, sum(sales) FROM Sales GROUP BY part, supplier
  • ( part ):
      SELECT part, sum(sales) FROM Sales GROUP BY part
  • ( supplier ):
      SELECT supplier, sum(sales) FROM Sales GROUP BY supplier
  • ( part, customer ):
      SELECT customer, part, sum(sales) FROM Sales GROUP BY customer, part
Queries
• The queries are characterized by their grouping attributes
• These queries select disjoint parts of the cube.
• The union of all these queries is the complete cube.
[Figure: the cube with axes part, customer, and supplier, shown four times: first unmarked, then with the parts selected by the queries (part,supplier,customer), (supplier,customer), and (customer) highlighted]
Materialization
• Materializing everything is impossible
• Cost of evaluating the following query directly:
    SELECT D1, …, Dk, sum(M) FROM R GROUP BY D1, …, Dk
  • in practice roughly linear in the size of R (worst case: |R| log(|R|))
• Materializing some queries can help the other queries too
Materialization
Example:
  • ( part, customer ):
      SELECT customer, part, sum(sales) FROM Sales GROUP BY customer, part
  • ( part ):
      SELECT part, sum(sales) FROM Sales GROUP BY part
• Nothing materialized: each of the two queries costs |Sales|.
• With ( part, customer ) materialized as PC: both queries can be answered from PC, so the cost of each drops from |Sales| to |PC|.
Example
| Query | Answer (rows) |
|---|---|
| (part,supplier,customer) | 6M |
| (part,customer) | 6M |
| (part,supplier) | 0.8M |
| (supplier,customer) | 6M |
| (part) | 0.2M |
| (supplier) | 0.01M |
| (customer) | 0.1M |
| () | 1 |
Example: nothing materialized
| Query | Answer (rows) | Cost |
|---|---|---|
| (part,supplier,customer) | 6M | 6M |
| (part,customer) | 6M | 6M |
| (part,supplier) | 0.8M | 6M |
| (supplier,customer) | 6M | 6M |
| (part) | 0.2M | 6M |
| (supplier) | 0.01M | 6M |
| (customer) | 0.1M | 6M |
| () | 1 | 6M |
Total cost: 48M
Example: some materialized
| Query | Answer (rows) | Cost |
|---|---|---|
| (part,supplier,customer) | 6M | 6M |
| (part,customer) | 6M | 6M |
| (part,supplier) | 0.8M | 0.8M |
| (supplier,customer) | 6M | 6M |
| (part) | 0.2M | 0.8M |
| (supplier) | 0.01M | 0.8M |
| (customer) | 0.1M | 0.1M |
| () | 1 | 0.1M |
Total cost: 20.6M
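Both totals can be reproduced with a short Python sketch. It assumes, as the cost columns suggest, that a query is answered from the smallest materialized view whose grouping attributes include those of the query, and that the materialized set in the second table is the top view plus (part,supplier) and (customer). Views are encoded as sets of the letters p, s, c; this encoding and the function names are illustrative choices.

```python
# Answer sizes in rows, from the table above (p = part, s = supplier, c = customer).
SIZES = {
    frozenset("psc"): 6_000_000, frozenset("pc"): 6_000_000,
    frozenset("ps"): 800_000, frozenset("sc"): 6_000_000,
    frozenset("p"): 200_000, frozenset("s"): 10_000,
    frozenset("c"): 100_000, frozenset(): 1,
}

def query_cost(query, materialized):
    """Size of the smallest materialized view that can answer the query,
    i.e. whose attribute set is a superset of the query's."""
    return min(SIZES[v] for v in materialized if query <= v)

def total_cost(materialized):
    """Sum of the costs of all eight group-by queries."""
    return sum(query_cost(q, materialized) for q in SIZES)

TOP = frozenset("psc")
cost_nothing = total_cost({TOP})
cost_some = total_cost({TOP, frozenset("ps"), frozenset("c")})
```

Under these assumptions `cost_nothing` is 48M rows and `cost_some` is 20.6M rows, matching the two totals.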
Research Questions
• What is the optimal set of views to materialize?
• How many views must we materialize to get reasonable performance?
• Given bounded space S, what is the optimal choice of views to materialize?
Central Question
• Given:
  • for every view its size
  • a number k
• Find:
  • which k materialized views give the most gain
• This problem is NP-complete → greedy algorithm
Partial order on queries
• $Q_1 \preceq Q_2$:
  • $Q_1$ can be answered using the query results of $Q_2$
  • A materialized view of $Q_2$ can be used to answer $Q_1$
• $\preceq$ is a partial order on the views
Partial order on queries
(part, supplier, customer)
(part, supplier) (part, customer) (supplier, customer)
(part) (supplier) (customer)
()
Partial order on queries
• Can be generalized to hierarchies:
  • (product, country, -) $\preceq$ (product, city, -)
  • (-, country, -) $\preceq$ (product, city, -)
  • (-, -, -) $\preceq$ (product, city, -)
  • (-, country, year) $\preceq$ (-, city, month)
• $(a_1, \ldots, a_n) \preceq (b_1, \ldots, b_n)$ if for all $i = 1 \ldots n$, $a_i$ is higher than or equal to $b_i$ in the hierarchy of the $i$-th dimension ($a_i$ is more general than $b_i$)
Cost Model
• Given a set of materialized views $S = \{V_1, \ldots, V_n\}$, the cost of query $Q$ is
  $\mathrm{cost}_S(Q) := \min\{\, |V_i| : i = 1 \ldots n,\ Q \preceq V_i \,\}$
• The benefit of view $W$ for computing $Q$ w.r.t. $S$:
  $B_W(Q \mid S) = \mathrm{cost}_S(Q) - \mathrm{cost}_{S \cup \{W\}}(Q)$
• Total benefit of $W$ w.r.t. $S$:
  $B_W(S) = \sum_Q B_W(Q \mid S)$
Cost Model: Example
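Here S = {psc, c} and W = ps:
| Q | $C_S$ | $C_{S \cup \{W\}}$ | $B_W$ |
|---|---|---|---|
| psc | 6M | 6M | 0 |
| ps | 6M | 0.8M | 5.2M |
| pc | 6M | 6M | 0 |
| p | 6M | 0.8M | 5.2M |
| s | 6M | 0.8M | 5.2M |
| c | 0.1M | 0.1M | 0 |
| () | 0.1M | 0.1M | 0 |
Lattice of views with their sizes:
psc (6M)
ps (0.8M) pc (6M) sc (6M)
p (0.2M) s (0.01M) c (0.1M)
() (1)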
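The benefit column can be recomputed from the view sizes in the lattice. The Python sketch below assumes S = {psc, c} and W = ps, which is what the cost columns of the example imply; the set-of-letters encoding and the function names are illustrative assumptions.

```python
# View sizes in rows, from the lattice above (p = part, s = supplier, c = customer).
SIZES = {
    frozenset("psc"): 6_000_000, frozenset("pc"): 6_000_000,
    frozenset("ps"): 800_000, frozenset("sc"): 6_000_000,
    frozenset("p"): 200_000, frozenset("s"): 10_000,
    frozenset("c"): 100_000, frozenset(): 1,
}

def cost(query, views):
    """cost_S(Q): size of the smallest view in S that can answer Q."""
    return min(SIZES[v] for v in views if query <= v)

def benefit_for(query, w, views):
    """B_W(Q | S) = cost_S(Q) - cost_{S u {W}}(Q)."""
    return cost(query, views) - cost(query, views | {w})

def benefit(w, views):
    """B_W(S): benefit of W summed over all queries."""
    return sum(benefit_for(q, w, views) for q in SIZES)

S = {frozenset("psc"), frozenset("c")}
W = frozenset("ps")
total_benefit = benefit(W, S)
```

Only the queries ps, p, and s become cheaper (5.2M each), so the total benefit of W is 15.6M rows under these assumptions.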
Greedy algorithm
• Input:
  • Size of every view and a constant k
• In each step:
  • Select the view with the most benefit
  • Add it to the result
• Until we have k views
Greedy Algorithm
Algorithm
S := { top view };
for i := 1 to k {
  select view W ∉ S such that B_W(S) is maximal;
  S := S ∪ {W};
}
return S;
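As an illustration, the greedy loop can be run on the part/supplier/customer lattice from the earlier example. This Python sketch is one possible reading of the pseudocode, not a reference implementation: the set-of-letters encoding of views and all function names are assumptions, and ties between equally good views are not treated specially.

```python
# View sizes in rows (p = part, s = supplier, c = customer).
SIZES = {
    frozenset("psc"): 6_000_000, frozenset("pc"): 6_000_000,
    frozenset("ps"): 800_000, frozenset("sc"): 6_000_000,
    frozenset("p"): 200_000, frozenset("s"): 10_000,
    frozenset("c"): 100_000, frozenset(): 1,
}

def cost(query, views):
    """Size of the smallest selected view that can answer the query."""
    return min(SIZES[v] for v in views if query <= v)

def benefit(w, views):
    """Total cost reduction over all queries when W is added to the views."""
    return sum(cost(q, views) - cost(q, views | {w}) for q in SIZES)

def greedy(k):
    """Start from the top view, then k times add the view with maximal benefit."""
    selected = {frozenset("psc")}
    picks = []
    for _ in range(k):
        candidates = [v for v in SIZES if v not in selected]
        best = max(candidates, key=lambda w: benefit(w, selected))
        selected.add(best)
        picks.append(best)
    return picks

picks = greedy(2)
```

With k = 2 the sketch picks (part,supplier) first and (customer) second, which is the materialized set used in the 20.6M example.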
Example
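[Figure: lattice of views with their sizes. Top: a (100); second level: b (50), c (75); third level: d (20), e (30), f (40); bottom: g (1), h (10)]
• E.g., the cost of constructing view b given only view a is 100
• If we choose to materialize b, the new cost of constructing view b is 50.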
First round
a
b c
d e f
g h
100
50 75
20 40
30
1 10
• Notice that not only b, but also d, e, g and h can be calculated from b
• So the total benefit is (100 – 50) x 5 = 250
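The benefit computation used here can be written compactly. The following is a sketch of the usual definition, assuming the notation C(W) for the cost of view W, C_S(Q) for the cheapest cost of answering Q from the already selected set S, and Q ⪯ W for "Q can be calculated from W"; these symbols are introduced for illustration:

```latex
B_W(S) \;=\; \sum_{Q \preceq W} \max\bigl(0,\; C_S(Q) - C(W)\bigr),
\qquad
C_S(Q) \;=\; \min_{V \in S,\; Q \preceq V} C(V)
```

For W = b and S = {a}, each of the five views b, d, e, g, h currently costs 100 and would cost 50, which gives 5 x (100 – 50) = 250.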
![Page 86: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/86.jpg)
Continue…
• Similarly, the benefit of materializing c is (100 – 75) x 5 = 125
Benefit
b 250
c 125
![Page 87: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/87.jpg)
Not yet finished…
• For e, Benefit = (100 – 30) x 3 = 210
Benefit
b 250
c 125
e 210
![Page 88: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/88.jpg)
Let’s choose b!
• For d and f, Benefit = (100 – 20) x 2 = 160 and (100 – 40) x 2 = 120
Benefit
b 250
c 125
d 160
e 210
f 120
![Page 89: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/89.jpg)
Next round?
• It seems we should choose e next, as it has the second-largest benefit.
• Let’s see what happens in the second round.
Benefit
b 250
c 125
d 160
e 210
f 120
![Page 90: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/90.jpg)
Second round!
• Now, only c and f get benefit if we materialize c (since e, g and h can be more efficiently calculated by using b)
• Benefit = (100 – 75) x 2 = 50
Benefit
c 50
![Page 91: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/91.jpg)
How about choosing f?
• If we choose f, we find that h can be calculated more cheaply from f than from b.
• Benefit = (100 – 40) + (50 – 40) = 70
Benefit
c 50
f 70
![Page 92: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/92.jpg)
Easy to work out others
• Benefit of d = (50 – 20) x 2 = 60
• Benefit of e = (50 – 30) x 3 = 60
• Benefit of g = 50 – 1 = 49
• Benefit of h = 50 – 10 = 40
![Page 93: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/93.jpg)
Observation
• In the first round, the benefit of choosing f (only 120) is far from the best choice (250)
• But in the second round, choosing f gives the maximum benefit!
1st round Benefit
b 250
c 125
d 160
e 210
f 120
2nd round Benefit
c 50
d 60
e 60
f 70
g 49
h 40
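The two rounds above can be reproduced with a short script. This is a minimal sketch of the greedy selection, not a definitive implementation: the names `COST`, `COVERS`, `current_cost`, `benefit` and `greedy` are hypothetical, and the `COVERS` sets (which views can be calculated from which) are an assumption inferred from the counts used in the benefit computations on the slides.

```python
# Cost of each view in the example lattice.
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

# Assumed: the views that can be calculated from each view (including itself),
# inferred from the multipliers 5, 5, 2, 3, 2 in the worked example.
COVERS = {
    "a": set("abcdefgh"),
    "b": set("bdegh"),
    "c": set("cefgh"),
    "d": set("dg"),
    "e": set("egh"),
    "f": set("fh"),
    "g": {"g"},
    "h": {"h"},
}


def current_cost(view, selected):
    """Cheapest cost of answering `view` from the already materialized views."""
    return min(COST[s] for s in selected if view in COVERS[s])


def benefit(candidate, selected):
    """Total cost reduction obtained by additionally materializing `candidate`."""
    return sum(
        max(0, current_cost(w, selected) - COST[candidate])
        for w in COVERS[candidate]
    )


def greedy(k, top="a"):
    """Start from the top view and add k views, each with maximal benefit."""
    selected = [top]
    for _ in range(k):
        candidates = [v for v in COST if v not in selected]
        if not candidates:
            break
        selected.append(max(candidates, key=lambda v: benefit(v, selected)))
    return selected


if __name__ == "__main__":
    print({v: benefit(v, ["a"]) for v in "bcdef"})        # first round
    print({v: benefit(v, ["a", "b"]) for v in "cdefgh"})  # second round
    print(greedy(2))
```

Under these assumptions the script picks b in the first round and f in the second, matching the tables above.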
![Page 94: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/94.jpg)
Overview
• Problem statement
− Formal model of computation
− Partial order on Queries
− Cost model
• Greedy solution
− Performance guarantee
• Conclusion
![Page 95: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/95.jpg)
Simple? Optimal?
• Trade off! This simple algorithm is not optimal in all cases!
• Consider the following case…
![Page 96: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/96.jpg)
Bad example
[Figure: lattice with top view a (200) and children b (100), c (99), d (100); below them, groups of 20 nodes, each group with total 1000]
![Page 97: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/97.jpg)
Bad example
• Choose c
• Benefit = (200 – 99) x (1 + 20 + 20) = 4141 = maximum
![Page 98: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/98.jpg)
Bad example
• Now choose either b or d (same benefit)
![Page 99: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/99.jpg)
Bad example
• How about these?
• Very expensive!!!
![Page 100: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/100.jpg)
Optimal solution should be…
• Only c is a little bit expensive.
![Page 101: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/101.jpg)
Some theoretical results
• It can be proved that we can get at least (e – 1 ) / e (which is about 63%) of the benefit of the optimal algorithm.
• When selecting k views, the bound is:
$B_{greedy} / B_{opt} \geq 1 - \left(\frac{k-1}{k}\right)^{k}$
• There are lattices for which this ratio is arbitrarily close to this bound.
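To get a feeling for the guarantee, the bound 1 − ((k − 1)/k)^k can be evaluated for a few values of k. This is a small illustrative sketch; the function name `greedy_bound` is hypothetical.

```python
import math


def greedy_bound(k):
    """Lower bound on B_greedy / B_opt when k views are selected."""
    return 1 - ((k - 1) / k) ** k


if __name__ == "__main__":
    for k in (1, 2, 3, 5, 10, 100):
        print(k, round(greedy_bound(k), 4))
    # The bound decreases with k towards (e - 1) / e.
    print("limit:", round((math.e - 1) / math.e, 4))
```

The bound is 1 for k = 1 (a single greedy choice is optimal) and decreases towards (e – 1) / e as k grows.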
![Page 102: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/102.jpg)
Extensions (1)
• Problem
− The views in a lattice are unlikely to have the same probability of being requested in a query.
• Solution:
− We can weight each benefit by its probability.
![Page 103: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/103.jpg)
Extensions (2)
• Problem
− Instead of asking for some fixed number (k) of views to materialize, we might instead allocate a fixed amount of space to views.
• Solution
− We can consider the “benefit of each view per unit space”.
![Page 104: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/104.jpg)
Conclusions Cube Materialization
• Materialization of views is an essential query optimization strategy for decision-support applications.
• There are good reasons to materialize part of the data cube, but not all of it.
• The lattice framework models multidimensional analysis very well.
![Page 105: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/105.jpg)
Conclusions Cube Materialization
• Finding the optimal solution is NP-hard.
• Introduction of the greedy algorithm
• The greedy algorithm works on this lattice and picks almost the right views to materialize.
• There exist cases in which the greedy algorithm fails to produce the optimal solution.
• But the greedy algorithm has guaranteed performance
• Extensions of the greedy algorithm.
![Page 106: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/106.jpg)
II.2 Data storage and indexing
• How is the data stored?
• Relational database (ROLAP)
• Specialized structures (MOLAP)
• How can we speed up computation?
• Indexing structures
− bitmap index
− join index
![Page 107: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/107.jpg)
Implementation
Current systems can be divided into three categories:
• ROLAP (Relational OLAP)
− OLAP supported on top of a relational database
• MOLAP (Multi-Dimensional OLAP)
− Use of special multi-dimensional data structures
• HOLAP (Hybrid OLAP)
− combination of the previous two
![Page 108: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/108.jpg)
ROLAP
• Typical database scheme:
− star schema: the fact table is central and links to the dimension tables
• Extensions:
− snowflake schema: dimensions have hierarchy/extra information attached
− star constellation: multiple star schemas sharing dimensions
![Page 109: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/109.jpg)
Example of a Star Schema
[Figure: star schema with a central fact table linked to six dimension tables]
• Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
• Order: Order No, Order Date
• Customer: Customer No, Customer Name, Customer Address, City
• Salesperson: SalespersonID, SalespersonName, City, Quota
• Product: ProductNO, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice
• Date: DateKey, Date
• City: CityName, State, Country
![Page 110: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/110.jpg)
Example of a Snowflake Schema
[Figure: snowflake schema; the Product, Date and City dimensions are further normalized into Category, Month and State tables]
• Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
• Order: Order No, Order Date
• Customer: Customer No, Customer Name, Customer Address, City
• Salesperson: SalespersonID, SalespersonName, City, Quota
• Product: ProductNO, ProdName, ProdDescr, CategoryName, UnitPrice
• Category: CategoryName, CategoryDescr
• Date: DateKey, Date, Month
• Month: Month, Year
• City: CityName, StateName
• State: StateName, Country
![Page 111: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/111.jpg)
Example of Fact Constellation
• Multiple fact tables share dimension tables
• Sales Fact Table: Time_key, Item_key, Branch_key, Location_key; measures: Unit_sold, Euros_sold, Avg_sales
• Shipping Fact Table: Time_key, Item_key, shipper_key, from_location, to_location; measures: Euros_sold, unit_shipped
• Time: time_key, day, day_of_the_week, month, quarter, year
• Item: item_key, item_name, brand, type, supplier_key
• Branch: branch_key, branch_name, branch_type
• Location: location_key, street, city, province/state, country
• shipper: shipper_key, shipper_name, location_key, shipper_type
![Page 112: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/112.jpg)
This Lecture
• How is the data stored?
• Relational database (ROLAP)
• Specialized structures (MOLAP)
• How can we speed up computation?
• Indexing structures
− bitmap index
− join index
![Page 113: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/113.jpg)
MOLAP
• Not on top of relational database
• most popular design
• specialized data structures
− Multicubes vs Hypercubes
• Not all subcubes are materialized
![Page 114: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/114.jpg)
Storing the cube
• User identifies set of sparse attributes S, and a set of dense attributes D.
• Index tree is constructed on sparse dimensions.
• Each leaf points to a multidimensional array indexed by D.
![Page 115: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/115.jpg)
Example
• product, store are sparse dimensions
• date and customer-type are dense
[Figure: index on the sparse dimensions (prod. p); its leaves (prod. p, store s1), (prod. p, store s2), … each point to an array over date and customer type like the following]

| date | 1time | ret | reg | Total |
|---|---|---|---|---|
| 1/1/07 | 51 | 25 | 158 | 234 |
| 2/1/07 | 58 | 20 | 120 | 198 |
| … | 65 | 22 | 51 | 138 |
| Total | 174 | 67 | 329 | 570 |
![Page 116: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/116.jpg)
Example
• product, store are sparse dimensions
• date and customer-type are dense
[Figure: same layout as on the previous slide, annotated. Index on the sparse dimensions: e.g., B-tree, R-tree, …]
![Page 117: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/117.jpg)
Example
• product, store are sparse dimensions
• date and customer-type are dense
[Figure: same layout as on the previous slides, annotated. Index on the sparse dimensions: e.g., B-tree, R-tree, …; leaves: 2D array, direct access]
![Page 118: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/118.jpg)
Example
• product, store are sparse dimensions
• date and customer-type are dense
[Figure: same layout as on the previous slides, annotated. Index on the sparse dimensions: e.g., B-tree, R-tree, …; leaves: 2D array, direct access; linked list]
![Page 119: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/119.jpg)
Queries
• Efficiency depends on:
• Does the index on the sparse dimensions fit into memory?
• Type of queries:
− Restrictions on all dimensions
− Restrictions only on dense dimensions
− Restrictions on only some sparse and dense dimensions
![Page 120: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/120.jpg)
Queries
• Selection on all attributes: (p,s1,ret,all)
![Page 121: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/121.jpg)
Queries
• Only on dense attributes: (-,-,ret,”2/1/07”)
![Page 122: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/122.jpg)
Queries
• Only some sparse and dense attributes: (-,s1,ret,”2/1/07”)
[Figure: same storage layout as in the example above, with sub-cubes for (prod. p, store s1), (prod. p, store s2) and (prod. p2, store s1)]
![Page 123: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/123.jpg)
Queries
• Only some sparse and dense attributes: (p,-,-,”2/1/07”)
[Figure: same storage layout as in the example above, with sub-cubes for (prod. p, store s1) and (prod. p, store s2)]
![Page 124: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/124.jpg)
Storing the Cube
• Dense combinations of dimensions can be stored in multi-dimensional arrays
• For every combination of sparse dimensions
− one sub-cube
• Sub-cubes indexed by sparse dimensions
− E.g., B-tree
• Order of the dimensions plays a role
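The storage scheme summarized above can be sketched in a few lines. This is an illustrative sketch only: a Python dict stands in for the index tree (B-tree, R-tree, …) on the sparse dimensions, nested lists stand in for the dense arrays, and the names `sparse_index`, `make_subcube` and `query` are hypothetical. The cell values are the two fully specified date rows of the example table.

```python
DATES = ["1/1/07", "2/1/07"]              # dense dimension 1
CUSTOMER_TYPES = ["1time", "ret", "reg"]  # dense dimension 2


def make_subcube():
    """Dense 2D array indexed by [date][customer type]."""
    return [
        [51, 25, 158],
        [58, 20, 120],
    ]


# Assumed stand-in for the index on the sparse dimensions (product, store):
# one sub-cube per existing combination.
sparse_index = {
    ("p", "s1"): make_subcube(),
    ("p", "s2"): make_subcube(),
}


def query(product=None, store=None, date=None, customer_type=None):
    """Sum the cells matching the restrictions; None means 'no restriction'."""
    if product is not None and store is not None:
        # Restriction on all sparse dimensions: a single index lookup.
        cube = sparse_index.get((product, store))
        subcubes = [] if cube is None else [cube]
    else:
        # Otherwise every matching sub-cube has to be visited.
        subcubes = [
            cube for (p, s), cube in sparse_index.items()
            if product in (None, p) and store in (None, s)
        ]
    rows = range(len(DATES)) if date is None else [DATES.index(date)]
    cols = (range(len(CUSTOMER_TYPES)) if customer_type is None
            else [CUSTOMER_TYPES.index(customer_type)])
    # Inside a sub-cube, cells are reached by direct array access.
    return sum(cube[r][c] for cube in subcubes for r in rows for c in cols)


if __name__ == "__main__":
    print(query("p", "s1", "2/1/07", "ret"))            # all dimensions restricted
    print(query(date="2/1/07", customer_type="ret"))    # only dense dimensions
    print(query(product="p", date="2/1/07"))            # some sparse, some dense
```

Queries that restrict every sparse dimension touch one sub-cube; queries that restrict only dense dimensions must visit every sub-cube, which is why the size of the sparse index matters.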
![Page 125: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/125.jpg)
This Lecture
• How is the data stored?
• Relational database (ROLAP)
• Multi-dimensional structure (MOLAP)
• How can we speed up computation?
• Indexing structures
− bitmap index
− join index
![Page 126: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/126.jpg)
Specialized Indexing Structures
• B-trees (covered in other courses)
• Bitmapped indices
• Join indices
• Spatial data structures
![Page 127: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/127.jpg)
Index Structures
• Indexing principle:
− mapping key values to records for associative direct access
• Most popular indexing technique in relational databases: B+-trees
• For multi-dimensional data, a large number of indexing techniques have been developed, e.g., R-trees
![Page 128: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/128.jpg)
Bitmap Indexes
• Bitmap index: an indexing technique that has attracted attention in multi-dimensional DB implementation

table

| Customer | City | Car |
|---|---|---|
| c1 | Detroit | Ford |
| c2 | Chicago | Honda |
| c3 | Detroit | Honda |
| c4 | Poznan | Ford |
| c5 | Paris | BMW |
| c6 | Paris | Nissan |
![Page 129: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/129.jpg)
Bitmap Indexes
• The index consists of bitmaps:

bitmaps for City

| rec | Chicago | Detroit | Paris | Poznan |
|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 | 0 |
| 6 | 0 | 0 | 1 | 0 |

bitmaps for Car

| rec | BMW | Ford | Honda | Nissan |
|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 1 |
![Page 130: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/130.jpg)
Bitmap Indexes
• Index on a particular column
• Index consists of a number of bit vectors (bitmaps)
• Each value in the indexed column has a bit vector (bitmap)
• The length of the bit vector is the number of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
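The definition above can be illustrated with the Customer table from the earlier slide. This is a minimal sketch, assuming Python integers as bit vectors (bit i stands for row i + 1); the names `build_bitmap_index`, `to_bits` and `matching_rows` are hypothetical helpers.

```python
ROWS = [
    ("c1", "Detroit", "Ford"),
    ("c2", "Chicago", "Honda"),
    ("c3", "Detroit", "Honda"),
    ("c4", "Poznan", "Ford"),
    ("c5", "Paris", "BMW"),
    ("c6", "Paris", "Nissan"),
]


def build_bitmap_index(rows, column):
    """One bitmap per distinct value in `column`; bit i is set for row i."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def to_bits(bitmap, n):
    """Render a bitmap as a string, first row leftmost."""
    return "".join("1" if (bitmap >> i) & 1 else "0" for i in range(n))


def matching_rows(bitmap, rows):
    """Rows whose bit is set in the bitmap."""
    return [row for i, row in enumerate(rows) if (bitmap >> i) & 1]


city_index = build_bitmap_index(ROWS, column=1)
car_index = build_bitmap_index(ROWS, column=2)

if __name__ == "__main__":
    print(to_bits(city_index["Detroit"], len(ROWS)))
    print(to_bits(car_index["Ford"], len(ROWS)))
    # Selection on two attributes is a bitwise AND of two bitmaps.
    answer = city_index["Detroit"] & car_index["Ford"]
    print(matching_rows(answer, ROWS))
```

A selection on two attributes becomes a single bitwise AND, which is the kind of query where bitmap indexes pay off.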
![Page 131: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/131.jpg)
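Bitmap Indexes
[Figure: an age index with one entry per age value (18, 19, 20, 21, …); each entry points to a bitmap over the data records]

data records

| id | name | age |
|---|---|---|
| 1 | joe | 20 |
| 2 | fred | 20 |
| 3 | sally | 21 |
| 4 | nancy | 20 |
| 5 | tom | 20 |
| 6 | pat | 25 |
| 7 | dave | 21 |
| 8 | jeff | 26 |
| … | … | … |

bitmaps
age = 20: 1101100000
age = 21: 0010001011

Query: Get people with age = 20 and name = “fred”
List for age = 20: 1101100000
List for name = “fred”: 0100000001
Answer is intersection: 0100000000

Well suited for domains with small cardinality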
![Page 132: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/132.jpg)
Bitmap Index
• Size of bitmaps can be further reduced
• use run-length encoding
1111000111100000001111000 is encoded as
4x1;3x0;4x1;7x0;4x1;3x0
• Can reduce the storage space significantly
• Logical operations can work directly on the encoding
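The encoding on this slide can be reproduced with a short sketch. The function names `rle_encode` and `rle_decode` are hypothetical, and the `NxB` notation with `;` separators simply follows the slide; real systems use more compact, word-aligned encodings.

```python
from itertools import groupby


def rle_encode(bits):
    """Encode a non-empty bit string as runs, e.g. '1110' -> '3x1;1x0'."""
    return ";".join(f"{len(list(run))}x{bit}" for bit, run in groupby(bits))


def rle_decode(encoded):
    """Inverse of rle_encode for a non-empty encoding."""
    runs = (run.split("x") for run in encoded.split(";"))
    return "".join(bit * int(count) for count, bit in runs)


if __name__ == "__main__":
    bitmap = "1111000111100000001111000"
    encoded = rle_encode(bitmap)
    print(encoded)
    print(rle_decode(encoded) == bitmap)
```

The longer the runs of equal bits, the larger the saving, which is typical for sparse bitmaps.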
![Page 133: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/133.jpg)
Bitmap Index – Summary
• With efficient hardware support for bitmap operations (AND, OR, XOR, NOT), a bitmap index offers better access methods for certain queries
− e.g., selection on two attributes
• Some commercial products have implemented bitmap indexes
• Works poorly for high-cardinality domains, since the number of bitmaps increases
• Difficult to maintain: reorganization is needed when relation sizes change (new bitmaps)
![Page 134: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/134.jpg)
This Lecture
• How is the data stored?
• Relational database (ROLAP)
• Specialized structures (MOLAP)
• How can we speed up computation?
• Indexing structures
− bitmap index
− join index
![Page 135: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/135.jpg)
Join Indexes
• Traditional indexes: value → rids.
• Join indices: tuples in the join → rids in the source tables.
• Data warehouse: values of dimensions of the star schema → rows in the fact table.
• Join indexes can span multiple dimensions
![Page 136: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/136.jpg)
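Join
• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE SALE.prodId = PRODUCT.id

sale

| prodId | storeId | date | amt |
|---|---|---|---|
| p1 | c1 | 1 | 12 |
| p2 | c1 | 1 | 11 |
| p1 | c3 | 1 | 50 |
| p2 | c2 | 1 | 8 |
| p1 | c1 | 2 | 44 |
| p1 | c2 | 2 | 4 |

product

| id | name | price |
|---|---|---|
| p1 | bolt | 10 |
| p2 | nut | 5 |

joinTb

| prodId | name | price | storeId | date | amt |
|---|---|---|---|---|---|
| p1 | bolt | 10 | c1 | 1 | 12 |
| p2 | nut | 5 | c1 | 1 | 11 |
| p1 | bolt | 10 | c3 | 1 | 50 |
| p2 | nut | 5 | c2 | 1 | 8 |
| p1 | bolt | 10 | c1 | 2 | 44 |
| p1 | bolt | 10 | c2 | 2 | 4 |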
![Page 137: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/137.jpg)
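Join Indexes

product (with join index)

| id | name | price | jIndex |
|---|---|---|---|
| p1 | bolt | 10 | r1, r3, r5, r6 |
| p2 | nut | 5 | r2, r4 |

sale

| rId | prodId | storeId | date | amt |
|---|---|---|---|---|
| r1 | p1 | c1 | 1 | 12 |
| r2 | p2 | c1 | 1 | 11 |
| r3 | p1 | c3 | 1 | 50 |
| r4 | p2 | c2 | 1 | 8 |
| r5 | p1 | c1 | 2 | 44 |
| r6 | p1 | c2 | 2 | 4 |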
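The join index from this slide can be built and used as follows. This is a hedged sketch with in-memory dicts standing in for the relations; `build_join_index` and `total_amount` are hypothetical names, not part of any particular system.

```python
# sale relation: rId -> (prodId, storeId, date, amt)
SALE = {
    "r1": ("p1", "c1", 1, 12),
    "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50),
    "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44),
    "r6": ("p1", "c2", 2, 4),
}

# product relation: id -> (name, price)
PRODUCT = {"p1": ("bolt", 10), "p2": ("nut", 5)}


def build_join_index(sale):
    """Map each product id to the rids of the sale rows that join with it."""
    index = {}
    for rid, (prod_id, _store, _date, _amt) in sale.items():
        index.setdefault(prod_id, []).append(rid)
    return index


def total_amount(product_name, product, sale, join_index):
    """Sum amt over the sales of a product, following the join index
    instead of scanning the whole sale relation."""
    total = 0
    for prod_id, (name, _price) in product.items():
        if name == product_name:
            total += sum(sale[rid][3] for rid in join_index.get(prod_id, []))
    return total


JOIN_INDEX = build_join_index(SALE)

if __name__ == "__main__":
    print(JOIN_INDEX)
    print(total_amount("bolt", PRODUCT, SALE, JOIN_INDEX))
```

In a star schema the same idea maps a dimension value directly to the matching fact-table rows, so the join does not have to be recomputed per query.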
![Page 138: Data Warehousing and OLAP Toon Calders t.calders@tue.nl](https://reader033.vdocuments.mx/reader033/viewer/2022061517/56649e4a5503460f94b3da64/html5/thumbnails/138.jpg)
Summary
• A data warehouse is a specialized database that supports analytical queries (OLAP queries)
• Data cube as conceptual model
• Implementation of Data Cube
− View selection problem
− Explosion problem
− ROLAP vs. MOLAP
− Indexing structures