CS 245 Notes131
CS 245: Database System Principles
Notes 13: Data Warehousing
Hector Garcia-Molina (Some modifications by Chris Olston)
Recall …
● Three approaches to information integration:
  ◆ Federated databases (did teaser)
  ◆ Data warehousing (next)
  ◆ Mediation
Warehousing
● Growing industry: $8 billion in 1998
● Range from desktop to huge:
  ◆ Walmart: 900-CPU, 2,700 disk, 23 TB Teradata system
● Lots of buzzwords, hype
  ◆ slice & dice, rollup, MOLAP, pivot, ...
Outline
● What is a data warehouse?
● Why a warehouse?
● Models & operations
● Implementing a warehouse
● Future directions
What is a Warehouse?
● Collection of diverse data
  ◆ subject oriented
  ◆ aimed at executive, decision maker
  ◆ often a copy of operational data
  ◆ with value-added data (e.g., summaries, history)
  ◆ integrated
  ◆ time-varying
  ◆ non-volatile
● Collection of tools
  ◆ gathering data
  ◆ cleansing, integrating, ...
  ◆ querying, reporting, analysis
  ◆ data mining
  ◆ monitoring, administering warehouse
Warehouse Architecture
[Figure: warehouse architecture. Labeled components: Client, Client; Query & Analysis; Warehouse; Metadata; Integration; Source, Source, Source]
Motivating Examples
● Forecasting
● Comparing performance of units
● Monitoring, detecting fraud
● Visualization
Why a Warehouse?
● Two Approaches:
  ◆ Mediation (Lazy)
  ◆ Warehouse (Eager)
[Figure: two Sources and a "?"]
Mediation Approach
[Figure: mediation architecture. Labeled components: Client, Client; Mediator; Wrapper, Wrapper, Wrapper; Source, Source, Source]
Advantages of Warehousing
● High query performance
● Queries not visible outside warehouse
● Local processing at sources unaffected
● Can operate when sources unavailable
● Easier to query data not stored in a DBMS
● Extra information at warehouse
  ◆ Modify, summarize (store aggregates)
  ◆ Add historical information
Advantages of Mediation
● No need to copy data
  ◆ less storage
  ◆ no need to purchase data
● More up-to-date data
● Query needs can be unknown
● Only query interface needed at sources
● May be less draining on sources
OLTP vs. OLAP
● OLTP: On Line Transaction Processing
  ◆ Describes processing at operational sites
● OLAP: On Line Analytical Processing
  ◆ Describes processing at warehouse
OLTP vs. OLAP
OLTP                                      OLAP
● Mostly updates                          ● Mostly reads
● Many small transactions                 ● Queries long, complex
● MB-TB of data                           ● GB-TB of data
● Raw data                                ● Summarized, consolidated data
● Clerical users                          ● Decision-makers, analysts as users
● Up-to-date data
● Consistency, recoverability critical
Data Marts
● Smaller warehouses
● Span part of organization
  ◆ e.g., marketing (customers, products, sales)
● Do not require enterprise-wide consensus
  ◆ but long term integration problems?
Warehouse Models & Operators
● Data Models
  ◆ relations
  ◆ stars & snowflakes
  ◆ cubes
● Operators
  ◆ slice & dice
  ◆ roll-up, drill down
  ◆ pivoting
  ◆ other
Star
customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
product
prodId  name  price
p1      bolt  10
p2      nut   5
store
storeId  city
c1       nyc
c2       sfo
c3       la
sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50
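Star Schema
sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
product(prodId, name, price)
store(storeId, city)
[Figure: sale drawn alongside customer, product, and store, which it references through custId, prodId, and storeId]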
Terms
● Fact table (here: sale)
● Dimension tables (here: customer, product, store)
● Measures (here: qty, amt)
sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
product(prodId, name, price)
store(storeId, city)
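As an illustration of the star schema above, the following sketch loads the slide's sample rows into an in-memory SQLite database and joins the fact table with two dimension tables. The column types, the use of Python's sqlite3 module, and the choice of query (total amt per store city and product name) are assumptions made for this example; the slides do not prescribe them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
-- Fact table: one row per sale, with a key into each dimension table.
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (53, "joe", "10 main", "sfo"),
    (81, "fred", "12 main", "sfo"),
    (111, "sally", "80 willow", "la"),
])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("o100", "1/7/97", 53, "p1", "c1", 1, 12),
    ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
    ("105", "3/8/97", 111, "p1", "c3", 5, 50),
])

# Aggregate the measure amt, grouped by attributes of two dimensions.
rows = conn.execute("""
    SELECT store.city, product.name, SUM(sale.amt)
    FROM sale
    JOIN store   ON sale.storeId = store.storeId
    JOIN product ON sale.prodId  = product.prodId
    GROUP BY store.city, product.name
    ORDER BY store.city, product.name
""").fetchall()
print(rows)
```

The query touches the fact table once and reaches each dimension through a single key join, which is the access pattern a star schema is laid out for.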
Dimension Hierarchies
store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy
sType
tId  size   location
t1   small  downtown
t2   large  suburbs
city
cityId  pop  regId
sfo     1M   north
la      5M   south
region
regId  weather
north  fog / heat
south  heat
[Figure: hierarchy in which store refers to sType and to city, and city refers to region]
⇒ snowflake schema
⇒ constellations
Cube
Fact table view:
sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8
Multi-dimensional cube:
        c1   c2   c3
p1      12        50
p2      11   8
dimensions = 2
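The correspondence between the fact table view and the 2-D cube can be sketched in a few lines. The nested-dictionary representation and the helper name `to_cube` are assumptions for illustration only; real MOLAP systems use their own array storage.

```python
from collections import defaultdict

# Fact table rows from the slide: (prodId, storeId, amt).
fact = [("p1", "c1", 12), ("p2", "c1", 11), ("p1", "c3", 50), ("p2", "c2", 8)]

def to_cube(rows):
    """Turn (prodId, storeId, amt) rows into a cube indexed as cube[prodId][storeId]."""
    cube = defaultdict(dict)
    for prod, store, amt in rows:
        # Add up amounts in case several rows fall into the same cell.
        cube[prod][store] = cube[prod].get(store, 0) + amt
    return {prod: cells for prod, cells in cube.items()}

cube = to_cube(fact)
stores = sorted({store for _, store, _ in fact})
for prod in sorted(cube):
    # Cells with no sales stay empty (None), as in the slide's grid.
    print(prod, [cube[prod].get(store) for store in stores])
```

Each dimension of the cube corresponds to one key column of the fact table, and each cell holds the measure.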
3-D Cube
Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
Multi-dimensional cube:
day 1   c1   c2   c3
p1      12        50
p2      11   8
day 2   c1   c2   c3
p1      44   4
p2
dimensions = 3
Aggregates
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
Result: 81
Aggregates
sale: same six-row fact table (prodId, storeId, date, amt) as on the previous slide
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
ans
date  sum
1     81
2     48
Another Example
sale: same six-row fact table (prodId, storeId, date, amt) as on the previous slides
• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId
sale
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48
rollup: from the (prodId, date) sums to the by-day sums
drill-down: from the by-day sums to the (prodId, date) sums
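The three aggregate queries on the sale fact table can be checked against the slide's data with a short script. This is a sketch that assumes Python's sqlite3 module and integer column types; the ORDER BY clauses are added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Add up amounts for day 1.
total_day1 = conn.execute("SELECT SUM(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product (the drill-down of the by-day result).
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(total_day1)
print(by_day)
print(by_day_product)
```

Summing the by-day-and-product rows per day gives the by-day result again, which is the rollup direction.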
Aggregates
● Operators: sum, count, max, min, median, avg
● “Group by” clause
● “Having” clause
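Cube Aggregation
Example: computing sums
day 1   c1   c2   c3
p1      12        50
p2      11   8
day 2   c1   c2   c3
p1      44   4
p2
Sum over days:
        c1   c2   c3
p1      56   4    50
p2      11   8
Sum over days and products:
        c1   c2   c3
sum     67   12   50
Sum over days and stores:
        sum
p1      110
p2      19
Sum over everything: 129
. . .
rollup: toward the totals; drill-down: back toward the detailed cubes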
Cube Operators
[Figure: the day 1 and day 2 cubes with their sums over days (p1: 56, 4, 50; p2: 11, 8), per store (67, 12, 50), per product (p1: 110; p2: 19), and overall (129); cells are labeled in sale(store, product, date) notation]
sale(c1,*,*) = 67
sale(c2,p2,*) = 8
sale(*,*,*) = 129
Extended Cube
day 1   c1   c2   c3   *
p1      12        50   62
p2      11   8         19
*       23   8    50   81
day 2   c1   c2   c3   *
p1      44   4         48
p2
*       44   4         48
*       c1   c2   c3   *
p1      56   4    50   110
p2      11   8         19
*       67   12   50   129
sale(*,p2,*) = 19
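One straightforward way to obtain every cell of the extended cube, including the "*" entries, is to let each fact row contribute to all combinations of its coordinates with "*". The sketch below assumes the sale(store, product, date) argument order and a plain dictionary keyed by coordinate tuples; it is an illustration, not a description of how a warehouse engine computes cubes.

```python
from collections import defaultdict
from itertools import product

# Fact rows as (storeId, prodId, date, amt), matching sale(store, product, date).
fact = [
    ("c1", "p1", 1, 12), ("c1", "p2", 1, 11), ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8), ("c1", "p1", 2, 44), ("c2", "p1", 2, 4),
]

def extended_cube(rows):
    """Sum amt into every cell obtained by replacing any subset of coordinates with '*'."""
    cube = defaultdict(int)
    for *coords, amt in rows:
        # Each row feeds 2^d cells, where d is the number of dimensions.
        for mask in product([False, True], repeat=len(coords)):
            key = tuple("*" if starred else c for c, starred in zip(coords, mask))
            cube[key] += amt
    return dict(cube)

sale = extended_cube(fact)
print(sale[("c1", "*", "*")])
print(sale[("c2", "p2", "*")])
print(sale[("*", "p2", "*")])
print(sale[("*", "*", "*")])
```

Because every row touches 2^d cells, this brute-force approach grows quickly with the number of dimensions, which is one reason the choice of what to materialize matters.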
Pivoting
Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
Multi-dimensional cube:
day 1   c1   c2   c3
p1      12        50
p2      11   8
day 2   c1   c2   c3
p1      44   4
p2
Summed over days:
        c1   c2   c3
p1      56   4    50
p2      11   8
Query & Analysis Tools
● Query Building
● Report Writers (comparisons, growth, graphs,…)
● Spreadsheet Systems
● Web Interfaces
● Data Mining
Other Operations
● Time functions
  ◆ e.g., time average
● Computed Attributes
  ◆ e.g., commission = sales * rate
● Text Queries
  ◆ e.g., find documents with words X AND B
  ◆ e.g., rank documents by frequency of words X, Y, Z
Data Mining
● Clustering
● Association Rules
● Also:
  ◆ Decision trees, time-series, …
Clustering
[Figure: clusters of points plotted against the axes age, income, education]
Another Example: Text
● Each document is a vector
  ◆ e.g., <100110...> contains words 1,4,5,...
● Clusters contain “similar” documents
● Useful for understanding, searching documents
[Figure: document clusters labeled international news, sports, business]
Issues
● Given desired number of clusters?
● Finding “best” clusters
● Are clusters semantically meaningful?
  ◆ e.g., “yuppies” cluster?
● Using clusters for disk storage
Association Rule Mining
sales records (market-basket data):
transaction id  customer id  products bought
tran1           cust33       p2, p5, p8
tran2           cust45       p5, p8, p11
tran3           cust12       p1, p9
tran4           cust40       p5, p8, p11
tran5           cust12       p2, p9
tran6           cust12       p9
• Trend: Products p5, p8 often bought together
• Trend: Customer 12 likes product p9
Association Rule
● Rule: {p1, p3, p8}
● Support: number of baskets where these products appear
● High-support set: support ≥ threshold s
● Problem: find all high support sets
Finding High-Support Pairs
● Baskets(basket, item)
● SELECT I.item, J.item, COUNT(I.basket)
  FROM Baskets I, Baskets J
  WHERE I.basket = J.basket AND
        I.item < J.item
  GROUP BY I.item, J.item
  HAVING COUNT(I.basket) >= s;
WHY?
Example
Baskets:
basket  item
t1      p2
t1      p5
t1      p8
t2      p5
t2      p8
t2      p11
...     ...
Item pairs per basket:
basket  item1  item2
t1      p2     p5
t1      p2     p8
t1      p5     p8
t2      p5     p8
t2      p5     p11
t2      p8     p11
...     ...    ...
check if count ≥ s
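The self-join query for high-support pairs can be run against the six sample transactions from the Association Rule Mining slide. The sketch assumes Python's sqlite3 module, a threshold of s = 2 chosen for this example, and an added ORDER BY for deterministic output. One plausible reading of the slide's "WHY?" is that it refers to the condition I.item < J.item, which keeps an item from being paired with itself and keeps each pair from being counted in both orders.

```python
import sqlite3

# Market-basket data from the slides: transaction id -> products bought.
transactions = {
    "tran1": ["p2", "p5", "p8"],
    "tran2": ["p5", "p8", "p11"],
    "tran3": ["p1", "p9"],
    "tran4": ["p5", "p8", "p11"],
    "tran5": ["p2", "p9"],
    "tran6": ["p9"],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Baskets (basket TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO Baskets VALUES (?, ?)",
    [(basket, item) for basket, items in transactions.items() for item in items],
)

s = 2  # support threshold (an assumption for this example)
pairs = conn.execute("""
    SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= ?
    ORDER BY I.item, J.item
""", (s,)).fetchall()
print(pairs)
```

Items are compared as text here, so "p11" sorts before "p5"; the pair still appears exactly once, only with its two items in a different order than on the slide.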
Issues
● Performance for size 2 rules
[Figure: the Baskets table (basket, item), labeled "big", next to the table of item pairs per basket (basket, item1, item2), labeled "even bigger!"]
● Performance for size k rules
Implementing a Warehouse
● Monitoring: Sending data from sources
● Integrating: Loading, cleansing,...
● Processing: Query processing, indexing, ...
● Managing: Metadata, Design, ...
Monitoring
● Source Types: relational, flat file, IMS, WWW, news-wire, …
● Incremental vs. Periodic Refresh
customer
id   name   address    city
53   joe    10 main    sfo
81   fred   12 main    sfo
111  sally  80 willow  la    (new)
Monitoring Techniques
● Periodic snapshots
● Database triggers
● Log shipping
● Data shipping (replication service)
● Transaction shipping
● Polling (queries to source)
● Screen scraping
● Application level monitoring
⇒ Advantages & Disadvantages!!
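Of the techniques listed, periodic snapshots lend themselves to a compact illustration: the monitor compares the previous snapshot of a source table with the current one and ships only the differences. The dictionary-based snapshot format and the function name `snapshot_diff` are assumptions for this sketch; the customer rows are the ones from the Monitoring slide.

```python
def snapshot_diff(old, new):
    """Compare two snapshots keyed by record id; return (inserted, deleted, updated)."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, deleted, updated

# Previous snapshot of the customer table: id -> (name, address, city).
old_snapshot = {
    53: ("joe", "10 main", "sfo"),
    81: ("fred", "12 main", "sfo"),
}
# Current snapshot: one new customer has appeared at the source.
new_snapshot = {**old_snapshot, 111: ("sally", "80 willow", "la")}

inserted, deleted, updated = snapshot_diff(old_snapshot, new_snapshot)
print(inserted, deleted, updated)
```

The trade-off relative to triggers or log shipping is that the source needs no special support, but every refresh reads both snapshots in full.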
Monitoring Issues
● Frequency
  ◆ periodic: daily, weekly, …
  ◆ triggered: on “big” change, lots of changes, ...
● Data transformation
  ◆ convert data to uniform format
  ◆ remove & add fields (e.g., add date to get history)
● Standards (e.g., ODBC)
● Gateways
Integration
● Data Cleaning
● Data Loading
● Derived Data
[Figure: warehouse architecture. Labeled components: Client, Client; Query & Analysis; Warehouse; Metadata; Integration; Source, Source, Source]
Data Cleaning
● Migration (e.g., yen → dollars)
● Scrubbing: use domain-specific knowledge (e.g., social security numbers)
● Fusion (e.g., mail list, customer merging)
[Figure: customer1(Joe) from the billing DB and customer2(Joe) from the service DB are fused into merged_customer(Joe)]
● Auditing: discover rules & relationships (like data mining)
Loading Data
● Incremental vs. periodic refresh
● Off-line vs. on-line
● Frequency of loading
  ◆ At night, 1x a week/month, continuously
● Parallel/Partitioned load
Derived Data
● Derived Warehouse Data
  ◆ indexes
  ◆ aggregates
  ◆ materialized views (next slide)
● When to update derived data?
● Incremental vs. periodic refresh
Materialized Views
● Define new warehouse relations using SQL expressions
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
product
id  name  price
p1  bolt  10
p2  nut   5
joinTb (does not exist at any source)
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
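A materialized view such as joinTb can be imitated by storing the result of a query as a table. The sketch below assumes Python's sqlite3 module; SQLite has no built-in materialized views, so CREATE TABLE ... AS SELECT stands in for one here, and the stored table would have to be refreshed by the warehouse whenever sale or product changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
CREATE TABLE product (id TEXT PRIMARY KEY, name TEXT, price INTEGER);
""")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])

# Store the join result physically; later queries read joinTb without re-joining.
conn.execute("""
    CREATE TABLE joinTb AS
    SELECT s.prodId, p.name, p.price, s.storeId, s.date, s.amt
    FROM sale s JOIN product p ON s.prodId = p.id
""")

rows = conn.execute(
    "SELECT * FROM joinTb ORDER BY date, storeId, prodId"
).fetchall()
for row in rows:
    print(row)
```

Queries that need product names or prices alongside sales can now scan joinTb directly, at the cost of extra storage and of keeping it up to date.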
Processing
● ROLAP servers vs. MOLAP servers
● Index Structures
● What to Materialize?
● Algorithms
[Figure: warehouse architecture. Labeled components: Client, Client; Query & Analysis; Warehouse; Metadata; Integration; Source, Source, Source]
ROLAP Server
● Relational OLAP Server
[Figure: components labeled tools, utilities, ROLAP server, relational DBMS]
sale
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48
Special indices, tuning; schema is “denormalized”
MOLAP Server
● Multi-Dimensional OLAP Server
[Figure: components labeled M.D. tools, utilities, multi-dimensional server; note: could also sit on relational DBMS]
[Figure: Sales cube with axes Product (milk, soda, eggs, soap), City (A, B), Date (1, 2, 3, 4)]
Index Structures
● Traditional Access Methods
  ◆ B-trees, hash tables, R-trees, grids, …
● Popular in Warehouses
  ◆ inverted lists
  ◆ bit map indexes
  ◆ join indexes
  ◆ text indexes
Inverted Lists
age index (node keys): [20 23]; [18 19] [20 21 22] [23 25 26]
inverted lists:
  age 20: r4, r18, r34, r35
  age 21: r5, r19, r37, r40
data records:
rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26
. . .
Using Inverted Lists
● Query:
  ◆ Get people with age = 20 and name = “fred”
● List for age = 20: r4, r18, r34, r35
● List for name = “fred”: r18, r52
● Answer is intersection: r18
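The conjunctive query over inverted lists reduces to a list intersection. The sketch below uses the lists from the slides; the dictionary layout of the two indexes and the helper name `intersect` are assumptions for illustration.

```python
# Inverted lists from the slides: attribute value -> record ids.
age_index = {
    20: ["r4", "r18", "r34", "r35"],
    21: ["r5", "r19", "r37", "r40"],
}
name_index = {"fred": ["r18", "r52"]}

def intersect(list_a, list_b):
    """Return the record ids that occur in both lists, in list_a's order."""
    members_b = set(list_b)
    return [rid for rid in list_a if rid in members_b]

# Get people with age = 20 and name = "fred".
answer = intersect(age_index[20], name_index["fred"])
print(answer)
```

Only the record ids in the answer need to be fetched from the data file, so the query never touches records that fail either condition.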
Bit Maps
age index (node keys): [20 23]; [18 19] [20 21 22] [23 25 26]
bit maps:
  age 20: 110110000
  age 21: 0010001011
data records:
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26
. . .
Using Bit Maps
● Query:
  ◆ Get people with age = 20 and name = “fred”
● List for age = 20: 1101100000
● List for name = “fred”: 0100000001
● Answer is intersection: 0100000000
● Good if domain cardinality small
● Bit vectors can be compressed
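With bit maps, the same conjunctive query becomes a bitwise AND. The sketch below takes the two 10-bit vectors from the slide as strings and uses Python integers for the AND; the string representation and the helper name `bitmap_and` are assumptions for illustration.

```python
def bitmap_and(a, b):
    """Bitwise AND of two equal-length bit strings, returned as a bit string."""
    if len(a) != len(b):
        raise ValueError("bitmaps must have the same length")
    # Convert to integers, AND them, and pad back to the original width.
    return format(int(a, 2) & int(b, 2), f"0{len(a)}b")

age_20 = "1101100000"     # bit i is set when record i has age = 20
name_fred = "0100000001"  # bit i is set when record i has name = "fred"

result = bitmap_and(age_20, name_fred)
# Positions are counted from 1, left to right, to match the record ids.
matches = [pos + 1 for pos, bit in enumerate(result) if bit == "1"]
print(result, matches)
```

Record 2 is the only match, which agrees with the data records (fred, age 20).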
Join
• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT …
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
product
id  name  price
p1  bolt  10
p2  nut   5
joinTb
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
Join Indexes
product (with join index)
id  name  price  jIndex
p1  bolt  10     r1,r3,r5,r6
p2  nut   5      r2,r4
sale
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4
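The jIndex column can be derived with one pass over the sale table, collecting for each product the ids of the sale records that join with it. The tuple layout and the helper name `build_join_index` are assumptions for this sketch.

```python
# sale rows from the slide: (rId, prodId, storeId, date, amt).
sale = [
    ("r1", "p1", "c1", 1, 12), ("r2", "p2", "c1", 1, 11),
    ("r3", "p1", "c3", 1, 50), ("r4", "p2", "c2", 1, 8),
    ("r5", "p1", "c1", 2, 44), ("r6", "p1", "c2", 2, 4),
]

def build_join_index(rows):
    """Map each prodId to the ids of the sale records that join with it."""
    index = {}
    for rid, prod_id, *_ in rows:
        index.setdefault(prod_id, []).append(rid)
    return index

j_index = build_join_index(sale)
print(j_index)
```

With the index in place, finding all sales of a product is a lookup followed by direct record fetches instead of a scan of sale.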
What to Materialize?
● Store in warehouse results useful for common queries
● Example:
[Figure: the day 1 and day 2 cubes with their aggregates: sums over days (p1: 56, 4, 50; p2: 11, 8), sums per store (67, 12, 50), sums per product (p1: 110; p2: 19), and total sales (129); the figure carries the labels "total sales" and "materialize"]
Materialization Factors
● Type/frequency of queries
● Query response time
● Storage cost
● Update cost
Cube Aggregates Lattice
city, product, date
city, product    city, date    product, date
city    product    date
all
[Figure: the day 1 and day 2 cubes with their aggregates (sums over days, sums per store: 67, 12, 50, and the overall total: 129), drawn next to the lattice]
Need an algorithm to decide what to materialize
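The lattice itself is easy to enumerate: its nodes are the subsets of the dimensions, and for aggregates such as sum a view can be computed from any materialized view that groups by a superset of its attributes. The sketch below only builds the lattice and answers that "which views could serve this query" question; it is not an algorithm for choosing what to materialize, and the function names are assumptions for illustration.

```python
from itertools import combinations

dims = ("city", "product", "date")

def lattice(dimensions):
    """Return the group-by sets level by level, from all dimensions down to 'all' (the empty set)."""
    return [
        [frozenset(combo) for combo in combinations(dimensions, k)]
        for k in range(len(dimensions), -1, -1)
    ]

def usable_sources(view, materialized):
    """Materialized views that group by a superset of the requested view's attributes."""
    return [m for m in materialized if view <= m]

levels = lattice(dims)
for level in levels:
    print([", ".join(sorted(node)) or "all" for node in level])

materialized = [frozenset({"city", "date"}), frozenset({"product", "date"})]
print(usable_sources(frozenset({"city"}), materialized))
```

A selection algorithm would weigh, for each candidate node, the queries it can serve against its storage and update cost, which are the factors listed on the Materialization Factors slide.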
Managing
● Metadata
● Warehouse Design
● Tools
[Figure: warehouse architecture. Labeled components: Client, Client; Query & Analysis; Warehouse; Metadata; Integration; Source, Source, Source]
Metadata
● Administrative
  ◆ definition of sources, tools, ...
  ◆ schemas, dimension hierarchies, …
  ◆ rules for extraction, cleaning, …
  ◆ refresh, purging policies
  ◆ user profiles, access control, ...
● Business
  ◆ business terms & definition
  ◆ data ownership, charging
● Operational
  ◆ data lineage
  ◆ data currency (e.g., active, archived, purged)
  ◆ use stats, error reports, audit trails
Design
● What data is needed?
● Where does it come from?
● How to clean data?
● How to represent in warehouse (schema)?
● What to summarize?
● What to materialize?
● What to index?
Tools
● Development
  ◆ design & edit: schemas, views, scripts, rules, queries, reports
● Planning & Analysis
  ◆ what-if scenarios (schema changes, refresh rates), capacity planning
● Warehouse Management
  ◆ performance monitoring, usage patterns, exception reporting
● System & Network Management
  ◆ measure traffic (sources, warehouse, clients)
● Workflow Management
  ◆ “reliable scripts” for cleaning & analyzing data
Current State of Industry
● Extraction and integration done off-line
  ◆ Usually in large, time-consuming batches
● Everything copied at warehouse
  ◆ Not selective about what is stored
  ◆ Query benefit vs storage & update cost
● Query optimization aimed at OLTP
  ◆ High throughput instead of fast response
  ◆ Process whole query before displaying anything
Future Directions
● Better performance
● Larger warehouses
● Easier to use
● What are companies & research labs working on?
Research (1)
● Incremental Maintenance
● Data Consistency
● Data Expiration
● Recovery
● Data Quality
● Error Handling (Back Flush)
Research (2)
● Rapid Monitor Construction
● Temporal Warehouses
● Materialization & Index Selection
● Data Fusion
● Data Mining
● Integration of Text & Relational Data
Conclusions
● Massive amounts of data and complexity of queries will push limits of current warehouses
● Need better systems:
  ◆ easier to use
  ◆ provide quality information