![Page 1: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/1.jpg)
An Introduction to Data WarehousingAn Introduction to Data Warehousing
Yannis KotidisAT&T Labs-Research
![Page 2: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/2.jpg)
Yannis Kotidis 2
Roadmap
• What is the data warehouse
• Multi-dimensional data modeling
• Data warehouse design– the star schema, bitmap indexes
• The Data Cube operator– semantics and computation
• Aggregate View Selection
• Other Issues
![Page 3: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/3.jpg)
Yannis Kotidis 3
What is Data Warehouse?
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon
• 3 Billion market worldwide [1999 figure, olapreport.com]
– Retail industries: user profiling, inventory management– Financial services: credit card analysis, fraud detection– Telecommunications: call analysis, fraud detection
![Page 4: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/4.jpg)
Yannis Kotidis 4
Data Warehouse Initiatives
• Organized around major subjects: customer, product, sales – integrate multiple, heterogeneous data sources
– exclude data that are not useful in the decision support process
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Large time horizon for trend analysis (current and past data)
• Non-Volatile store– physically separate store from the operational environment
![Page 5: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/5.jpg)
Yannis Kotidis 5
Data Warehouse Architecture
• Extract data from operational data sources– clean, transform
• Bulk load/refresh– warehouse is offline
• OLAP-server provides multidimensional view
• Multidimensional-olap (Essbase, oracle express)
• Relational-olap(Redbrick, Informix, Sybase,
SQL server)
![Page 6: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/6.jpg)
Yannis Kotidis 6
Why not Using Existing DB?
• Operational databases are for On Line Transaction Processing– automate day-to-day operations (purchasing, banking etc)
– transactions access (and modify!) a few records at a time
– database design is application oriented
– metric: transactions/sec
• Data Warehouse is for On Line Analytical Processing (OLAP)– complex queries that access millions of records
– long scans would interfere with normal operations
– need historical data for trend analysis
– metric: query response time
![Page 7: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/7.jpg)
Yannis Kotidis 7
Examples of OLAP
• Comparisons (this period v.s. last period)– Show me the sales per store for this year and compare it to that of
the previous year to identify discrepancies
• Ranking and statistical profiles (top N/bottom N)– Show me sales, profit and average call volume per day for my 10
most profitable salespeople
• Custom consolidation (market segments, ad hoc groups)– Show me an abbreviated income statement by quarter for the last
four quarters for my northeast region operations
![Page 8: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/8.jpg)
Yannis Kotidis 8
Multidimensional Modeling
• Example: compute total sales volume per product and store
Product Total Sales 1 2 3 4
1 $454 - - $925
2 $468 $800 - -
3 $296 - $240 - Sto
re
4 $652 - $540 $745
Product
Store
800
![Page 9: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/9.jpg)
Yannis Kotidis 9
Dimensions and Hierarchiespr
oduc
tcit
y
month
category region year
product country quarter
state month week
city day
store
PRODUCT LOCATION TIMENY
DVD
Aug
ust
Sales of DVDs in NY in August
• A cell in the cube may store values (measurements) relative to the combination of the labeled dimensions
DIMENSIONS
![Page 10: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/10.jpg)
Yannis Kotidis 10
Common OLAP Operations
• Roll-up: move up the hierarchy– e.g given total sales per city, we
can roll-up to get sales per state
• Drill-down: move down the hierarchy– more fine-grained aggregation
category region year
product country quarter
state month week
city day
store
PRODUCT LOCATION TIME
![Page 11: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/11.jpg)
Yannis Kotidis 11
Pivoting
• Pivoting: aggregate on selected dimensions– usually 2 dims (cross-tabulation)
Product Sales
1 2 3 4 ALL
1 454 - - 925 1379
2 468 800 - - 1268
3 296 - 240 - 536
4 652 - 540 745 1937
Sto
re
ALL 1870 800 780 1670 5120
![Page 12: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/12.jpg)
Yannis Kotidis 13
Roadmap
• What is the data warehouse
• Multi-dimensional data modeling
• Data warehouse design– the star schema, bitmap indexes
• The Data Cube operator– semantics and computation
• Aggregate View Selection
• Other Issues
![Page 13: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/13.jpg)
Yannis Kotidis 15
The Star Schema
time_keydayday_of_the_weekmonthquarteryear
TIME
location_keystorestreet_addresscitystatecountryregion
LOCATION
SALES
product_keyproduct_namecategorybrandcolorsupplier_name
PRODUCT
time_key
product_key
location_key
units_sold
amount{measures
![Page 14: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/14.jpg)
Yannis Kotidis 16
Advantages of Star Schema
• Facts and dimensions are clearly depicted– dimension tables are relatively static, data is loaded (append
mostly) into fact table(s)
– easy to comprehend (and write queries)
“Find total sales per product-category in our stores in Europe”
SELECT PRODUCT.category, SUM(SALES.amount)
FROM SALES, PRODUCT,LOCATION
WHERE SALES.product_key = PRODUCT.product_key
AND SALES.location_key = LOCATION.location_key
AND LOCATION.region=“Europe”
GROUP BY PRODUCT.category
![Page 15: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/15.jpg)
Yannis Kotidis 17
Star Schema Query Processing
time_keydayday_of_the_weekmonthquarteryear
TIME
location_keystorestreet_addresscity statecountryregion
LOCATION
SALES
product_keyproduct_namecategorybrandcolorsupplier_name
PRODUCT
time_key
product_key
location_key
units_sold
amount{measures
JOIN
JOIN
Sregion=“Europe”
Pcategory
![Page 16: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/16.jpg)
Yannis Kotidis 18
Bitmap Index
• Each value in the column has a bit vector: – The i-th bit is set if the i-th row of the base table has the value for the
indexed column
– The length of the bit vector: # of records in the base table
location_key RegionL1 AsiaL2 EuropeL3 AsiaL4 AmericaL5 Europe
Asia Europe America1 0 00 1 01 0 00 0 10 1 0
LOCATION Index on Region
![Page 17: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/17.jpg)
Yannis Kotidis 19
R102
R117
R118
R124
Join-Index
• Join index relates the values of the dimensions of a star schema to rows in the fact table.
– a join index on region maintains for each distinct region a list of ROW-IDs of the tuples recording the sales in the region
• Join indices can span multiple dimensions OR– can be implemented as bitmap-
indexes (per dimension)
– use bit-op for multiple-joins
SALES
region = Africaregion = Americaregion = Asiaregion = Europe
LOCATION
1
1
1
1
![Page 18: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/18.jpg)
Yannis Kotidis 20
Unresolved: Coarse-grain Aggregations
• “Find total sales per product-category in our stores in Europe”
– Join-index will prune ¾ of the data (uniform sales), but the remaining ¼ is still large (several millions transactions)
• Index is unclustered
• High-level aggregations are expensive!!!!! region
country
state
city
store
LOCATON
Long Query Response Times
Pre-computation is necessary
![Page 19: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/19.jpg)
Yannis Kotidis 21
Unresolved: Multiple Simultaneous Aggregates
Product Sales
1 2 3 4 ALL
1 454 - - 925 1379
2 468 800 - - 1268
3 296 - 240 - 536
4 652 - 540 745 1937
Sto
re
ALL 1870 800 780 1670 5120
Cross-Tabulation (products/store)
Sub-totals per store
Sub-totals per productTotal sales
4 Group-bys here:(store,product)(store)(product)()
Need to write 4 queries!!!
![Page 20: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/20.jpg)
Yannis Kotidis 22
The Data Cube Operator (Gray et al)
• All previous aggregates in a single query:
SELECT LOCATION.store, SALES.product_key, SUM (amount)
FROM SALES, LOCATION
WHERE SALES.location_key=LOCATION.location_key
CUBE BY SALES.product_key, LOCATION.store
Challenge: Optimize Aggregate Computation
![Page 21: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/21.jpg)
Yannis Kotidis 23
Store Product_key sum(amout)1 1 4541 4 9252 1 4682 2 8003 1 2963 3 2404 1 6254 3 2404 4 7451 ALL 13792 ALL 12683 ALL 5364 ALL 1937ALL 1 1870ALL 2 800ALL 3 780ALL 4 1670ALL ALL 5120
Relational View of Data Cube
Product Sales
1 2 3 4 ALL
1 454 - - 925 1379
2 468 800 - - 1268
3 296 - 240 - 536
4 652 - 540 745 1937
Sto
re
ALL 1870 800 780 1670 5120
SELECT LOCATION.store, SALES.product_key, SUM (amount)
FROM SALES, LOCATION
WHERE SALES.location_key=LOCATION.location_key
CUBE BY SALES.product_key, LOCATION.store
![Page 22: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/22.jpg)
Yannis Kotidis 26
Data Cube Computation
• Model dependencies among the aggregates:
most detailed “view”
can be computed from view (product,store,quarter) by summing-up all quarterly sales
product,store,quarter
product storequarter
none
store,quarterproduct,quarter product, store
![Page 23: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/23.jpg)
Yannis Kotidis 27
Computation Directives
• Hash/sort based methods (Agrawal et. al. VLDB’96)1. Smallest-parent
2. Cache-results
3. Amortize-scans
4. Share-sorts
5. Share-partitions
product,store,quarter
product storequarter
none
store,quarterproduct,quarter product, store
![Page 24: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/24.jpg)
Yannis Kotidis 28
Alternative Array-based Approach
• Model data as a sparse multidimensional array– partition array into chunks (a small sub-cube which fits in memory).
– fast addressing based on (chunk_id, offset)
• Compute aggregates in “multi-way” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost.
BWhat is the best traversing order to do multi-way aggregation?
![Page 25: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/25.jpg)
Yannis Kotidis 29
Roadmap
• What is the data warehouse
• Multi-dimensional data modeling
• Data warehouse design– the star schema, bitmap indexes
• The Data Cube operator– semantics and computation
• Aggregate View Selection
• Other Issues
![Page 26: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/26.jpg)
Yannis Kotidis 30
View Selection Problem
• Use some notion of benefit per view
• Limit: disk space
product,store,quarter
product storequarter
none
store,quarterproduct,quarter product, store
Hanarayan et al SIGMOD’96:
Pick views greedily until space is filled
)()(,:
))()((),(uCuCvuu
vSSv
uCuCSvB
![Page 27: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/27.jpg)
Yannis Kotidis 32
Problem Generalization
• Materialize and maintain the right subset of views with respect to the workload and the available resources
• What is the workload?– “Farmers” v.s. “Explorers” [Inmon99]
– Pre-compiled queries (report generating tools, data mining)
– Ad-hoc analysis (unpredictable)
• What are the resources?– Disk space (getting cheaper)
– Update window (getting smaller)
![Page 28: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/28.jpg)
Yannis Kotidis 33
View Selection Problem
• Selection is based on a workload estimate (e.g. logs) and a given constraint (disk space or update window)
• NP-hard, optimal selection can not be computed > 4-5 dimensions– greedy algorithms (e.g. [Harinarayan96]) run at least in polynomial time in
the number of views i.e exponential in the number of dimensions!!!• Optimal selection can not be approximated [Karloff99]
– greedy view selection can behave arbitrary bad
• Alternatives: use query result caching techniques and reuse prior computations (WATCHMAN, DynaMat)
![Page 29: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/29.jpg)
Yannis Kotidis 34
Dynamic View Management (DynaMat)
• Maintain a cache of View Fragments (DynaMat)
• Engage cached data whenever possible for incoming queries
– e.g. infer monthly sales out of pre-computed daily sales
• Exploit dependencies among the views to maintain the best subset of them within the given update window
Aggregate
Locator
Query
Interface
Admission
Control
View Pool
User
DW base tables
![Page 30: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/30.jpg)
Yannis Kotidis 35
The Space & Time Bounds
• Pool utilization increases between updates
• Space bound: new results compete with stored aggregates for the limited space
• Time bound: results are evicted from the pool due to the update-time window
Time bound
Space bound
![Page 31: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/31.jpg)
Yannis Kotidis 36
Many Other Issues
• Fact+Dimension tables in the DW are views of tables stored in the sources; lots of view maintenance problems– correctly reflect asynchronous changes at the sources
– making views self-maintainable
• Approximate v.s exact answers– Interactive queries (on-line aggregation)
• show running-estimates + confidence intervals
– histogram, wavelet, sampling based techniques (e.g. AQUA)
![Page 32: An Introduction to Data Warehousing Yannis Kotidis AT&T Labs-Research](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649e155503460f94affe78/html5/thumbnails/32.jpg)
Yannis Kotidis 37
The End
• Thank you!