recall … cs 245: database system principlesi.stanford.edu/~ullman/dbsi/sum01/notes/notes13.pdf ·...

CS 245 Notes131

CS 245: Database System Principles

Notes 13: Data Warehousing

Hector Garcia-Molina (Some modifications by Chris Olston)


Recall …

● Three approaches to information integration:
  ◆ Federated databases (did teaser)
  ◆ Data warehousing (next)
  ◆ Mediation


Warehousing

● Growing industry: $8 billion in 1998
● Range from desktop to huge:
  ◆ Walmart: 900-CPU, 2,700 disk, 23TB Teradata system
● Lots of buzzwords, hype
  ◆ slice & dice, rollup, MOLAP, pivot, ...


Outline

● What is a data warehouse?
● Why a warehouse?
● Models & operations
● Implementing a warehouse
● Future directions


What is a Warehouse?

● Collection of diverse data
  ◆ subject oriented
  ◆ aimed at executive, decision maker
  ◆ often a copy of operational data
  ◆ with value-added data (e.g., summaries, history)
  ◆ integrated
  ◆ time-varying
  ◆ non-volatile


What is a Warehouse?

● Collection of tools
  ◆ gathering data
  ◆ cleansing, integrating, ...
  ◆ querying, reporting, analysis
  ◆ data mining
  ◆ monitoring, administering warehouse


Warehouse Architecture

[Figure: layered architecture, top to bottom: Clients; Query & Analysis; Warehouse (with Metadata); Integration; Sources]


Motivating Examples

● Forecasting
● Comparing performance of units
● Monitoring, detecting fraud
● Visualization


Why a Warehouse?

● Two Approaches:
  ◆ Mediation (Lazy)
  ◆ Warehouse (Eager)


Mediation Approach

[Figure: Clients query a Mediator, which reaches each source through a Wrapper]


Advantages of Warehousing

● High query performance
● Queries not visible outside warehouse
● Local processing at sources unaffected
● Can operate when sources unavailable
● Easier to query data not stored in a DBMS
● Extra information at warehouse
  ◆ Modify, summarize (store aggregates)
  ◆ Add historical information


Advantages of Mediation

● No need to copy data
  ◆ less storage
  ◆ no need to purchase data
● More up-to-date data
● Query needs can be unknown
● Only query interface needed at sources
● May be less draining on sources


OLTP vs. OLAP

● OLTP: On Line Transaction Processing
  ◆ Describes processing at operational sites
● OLAP: On Line Analytical Processing
  ◆ Describes processing at warehouse


OLTP vs. OLAP

OLTP                                      OLAP
● Mostly updates                          ● Mostly reads
● Many small transactions                 ● Queries long, complex
● MB-TB of data                           ● GB-TB of data
● Raw data                                ● Summarized, consolidated data
● Clerical users                          ● Decision-makers, analysts as users
● Up-to-date data
● Consistency, recoverability critical


Data Marts

● Smaller warehouses
● Spans part of organization
  ◆ e.g., marketing (customers, products, sales)
● Do not require enterprise-wide consensus
  ◆ but long term integration problems?


Warehouse Models & Operators

● Data Models
  ◆ relations
  ◆ stars & snowflakes
  ◆ cubes
● Operators
  ◆ slice & dice
  ◆ roll-up, drill down
  ◆ pivoting
  ◆ other


customer  custId  name   address    city
          53      joe    10 main    sfo
          81      fred   12 main    sfo
          111     sally  80 willow  la

product   prodId  name  price
          p1      bolt  10
          p2      nut   5

store     storeId  city
          c1       nyc
          c2       sfo
          c3       la

sale      orderId  date    custId  prodId  storeId  qty  amt
          o100     1/7/97  53      p1      c1       1    12
          o102     2/7/97  53      p2      c1       2    11
          105      3/8/97  111     p1      c3       5    50


Star Schema

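[Figure: star with the sale table at the center, linked to customer, product and store]

sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
product(prodId, name, price)
store(storeId, city)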


[Figure: the same star schema, annotated]

● Fact table: sale
● Dimension tables: customer, product, store
● Measures: qty, amt


Dimension Hierarchies

store   storeId  cityId  tId  mgr
        s5       sfo     t1   joe
        s7       sfo     t2   fred
        s9       la      t1   nancy

city    cityId  pop  regId
        sfo     1M   north
        la      5M   south

region  regId  weather
        north  fog / heat
        south  heat

sType   tId  size   location
        t1   small  downtown
        t2   large  suburbs

[Figure: hierarchy store → sType, and store → city → region]

⇒ snowflake schema
⇒ constellations


Fact table view:

sale  prodId  storeId  amt
      p1      c1       12
      p2      c1       11
      p1      c3       50
      p2      c2       8

Multi-dimensional cube:

      c1   c2   c3
p1    12        50
p2    11    8

dimensions = 2


3-D Cube

Fact table view:

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Multi-dimensional cube:

day 2   c1   c2   c3
p1      44    4
p2

day 1   c1   c2   c3
p1      12        50
p2      11    8

dimensions = 3
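The step from the fact table view to the cube view can be sketched in a few lines. This is only an illustration, not how a warehouse stores cubes: the name `build_cube` and the choice of a dictionary keyed by (prodId, storeId, date) are assumptions made for the example; the rows are the six SALE tuples shown on this slide.

```python
from collections import defaultdict

# Fact-table rows copied from the slide: (prodId, storeId, date, amt).
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def build_cube(rows):
    """Map each (prodId, storeId, date) coordinate to its summed amt."""
    cube = defaultdict(int)
    for prod, store, date, amt in rows:
        cube[(prod, store, date)] += amt
    return dict(cube)


cube = build_cube(SALE)

# A filled cell and an empty cell (empty cells are simply absent).
print(cube.get(("p1", "c1", 2)))
print(cube.get(("p2", "c3", 1)))
```

Each of the three key components is one dimension of the cube; a sparse dictionary is used here because most cells of the example cube are empty.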


Aggregates

• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE
          WHERE date = 1


Aggregates

• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE
          GROUP BY date

ans  date  sum
     1     81
     2     48


Another Example

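• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE
          GROUP BY date, prodId

sale  prodId  date  amt
      p1      1     62
      p2      1     19
      p1      2     48

[Arrows: drill-down goes from the by-day answer to this by-day, by-product table; rollup goes the other way]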
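The aggregate queries on these slides can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the in-memory database and the column types are assumptions, the rows are the six SALE tuples of the 3-D cube example, and `prodId` is included in the last SELECT list so that each output row is labeled with its product.

```python
import sqlite3

# SALE rows from the 3-D cube example: (prodId, storeId, date, amt).
ROWS = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", ROWS)

# Add up amounts for day 1.
day1_total = conn.execute(
    "SELECT SUM(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Rolled-up level: one row per day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Drilled-down level: one row per (day, product).
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(day1_total)
print(by_day)
print(by_day_product)
```

Dropping `prodId` from the GROUP BY list is the rollup step; adding it back is the drill-down step.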


Aggregates

● Operators: sum, count, max, min, median, avg
● “Group by” clause
● “Having” clause


Cube Aggregation

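Example: computing sums

day 2   c1   c2   c3
p1      44    4
p2

day 1   c1   c2   c3
p1      12        50
p2      11    8

Rollup over date:

        c1   c2   c3
p1      56    4   50
p2      11    8

Rollup over product:

        c1   c2   c3
sum     67   12   50

Rollup over store:

        sum
p1      110
p2       19

[Arrows: rollup moves toward the summed tables; drill-down moves back toward the full cube]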


Cube Operators

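[Figure: the cube and aggregate tables of the Cube Aggregation slide, with cells labeled in cube notation, where * means "summed over this dimension"]

sale(c1,*,*)    the c1 entry of the sum row (67)
sale(c2,p2,*)   the (p2, c2) entry of the product × store table (8)
sale(*,*,*)     the grand total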


Extended Cube

day 2   c1   c2   c3   *
p1      44    4        48
p2
*       44    4        48

day 1   c1   c2   c3   *
p1      12        50   62
p2      11    8        19
*       23    8   50   81

*       c1   c2   c3   *
p1      56    4   50   110
p2      11    8        19
*       67   12   50   129

sale(*,p2,*) is the p2 entry of the * column in the last table (19)
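A small sketch of how the extended cube could be computed from the fact table: every row contributes its amount to each cell obtained by replacing any subset of its coordinates with `*`. The function name `extended_cube` and the argument order (store, product, date), read off the `sale(c1,*,*)` notation, are assumptions made for this example.

```python
from collections import defaultdict
from itertools import product

# (storeId, prodId, date, amt), ordered to match the assumed
# sale(store, product, date) notation.
SALE = [
    ("c1", "p1", 1, 12),
    ("c1", "p2", 1, 11),
    ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8),
    ("c1", "p1", 2, 44),
    ("c2", "p1", 2, 4),
]


def extended_cube(rows):
    """Return sums for every cell, where '*' stands for 'all values'."""
    ext = defaultdict(int)
    for store, prod, date, amt in rows:
        # Each dimension is either kept or replaced by '*': 2**3 = 8 cells.
        for key in product((store, "*"), (prod, "*"), (date, "*")):
            ext[key] += amt
    return dict(ext)


ext = extended_cube(SALE)
print(ext[("c1", "*", "*")])
print(ext[("c2", "p2", "*")])
print(ext[("*", "p2", "*")])
print(ext[("*", "*", "*")])
```

Precomputing all of these cells trades storage and update cost for query speed, which is the materialization question taken up later in the notes.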


Pivoting

[Figure: the fact table view, the multi-dimensional cube (day 1 and day 2 slices), and the two-dimensional product × store table below]

        c1   c2   c3
p1      56    4   50
p2      11    8


Query & Analysis Tools

● Query Building
● Report Writers (comparisons, growth, graphs, …)
● Spreadsheet Systems
● Web Interfaces
● Data Mining


Other Operations

● Time functions
  ◆ e.g., time average
● Computed Attributes
  ◆ e.g., commission = sales * rate
● Text Queries
  ◆ e.g., find documents with words X AND Y
  ◆ e.g., rank documents by frequency of words X, Y, Z


Data Mining

● Clustering
● Association Rules
● Also:
  ◆ Decision trees, time-series, …


Clustering

[Figure: scatter plot with axes income and education]


Another Example: Text

● Each document is a vector
  ◆ e.g., <100110...> contains words 1, 4, 5, ...
● Clusters contain “similar” documents
● Useful for understanding, searching documents

[Figure: document clusters labeled international news, sports, business]


Issues

● Given desired number of clusters?
● Finding “best” clusters
● Are clusters semantically meaningful?
  ◆ e.g., “yuppies” cluster?
● Using clusters for disk storage


Association Rule Mining

sales records (market-basket data):

transaction id  customer id  products bought
tran1           cust33       p2, p5, p8
tran2           cust45       p5, p8, p11
tran3           cust12       p1, p9
tran4           cust40       p5, p8, p11
tran5           cust12       p2, p9
tran6           cust12       p9

• Trend: Products p5, p8 often bought together
• Trend: Customer 12 likes product p9


Association Rule

● Rule: {p1, p3, p8}
● Support: number of baskets where these products appear
● High-support set: support ≥ threshold s
● Problem: find all high support sets
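The definition of support can be checked directly on the six sales records of the Association Rule Mining slide. This is a minimal sketch; the helper names `support` and `is_high_support` are assumptions made for the example.

```python
# Baskets from the market-basket example: transaction id -> products bought.
BASKETS = {
    "tran1": {"p2", "p5", "p8"},
    "tran2": {"p5", "p8", "p11"},
    "tran3": {"p1", "p9"},
    "tran4": {"p5", "p8", "p11"},
    "tran5": {"p2", "p9"},
    "tran6": {"p9"},
}


def support(itemset, baskets):
    """Number of baskets that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for bought in baskets.values() if items <= bought)


def is_high_support(itemset, baskets, s):
    """True when the itemset appears in at least s baskets."""
    return support(itemset, baskets) >= s


print(support({"p5", "p8"}, BASKETS))
print(support({"p1", "p3", "p8"}, BASKETS))
print(is_high_support({"p5", "p8"}, BASKETS, 3))
```

Counting one given itemset is cheap; the hard part named on the slide is finding all high-support sets without enumerating every candidate.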


Finding High-Support Pairs

● Baskets(basket, item)
● SELECT I.item, J.item, COUNT(I.basket)
  FROM Baskets I, Baskets J
  WHERE I.basket = J.basket AND
        I.item < J.item
  GROUP BY I.item, J.item
  HAVING COUNT(I.basket) >= s;


Example

basket  item
t1      p2
t1      p5
t1      p8
t2      p5
t2      p8
t2      p11
...     ...

After the self-join:

basket  item1  item2
t1      p2     p5
t1      p2     p8
t1      p5     p8
t2      p5     p8
t2      p5     p11
t2      p8     p11
...     ...    ...

Then group by (item1, item2) and check if count ≥ s
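The high-support-pairs query can be run as written with Python's `sqlite3` module. In this sketch the threshold s is passed as a query parameter, the wrapper `high_support_pairs` is a hypothetical name, and only the six Baskets rows actually shown on the slide are loaded (the elided rows are not guessed). Items are compared as text, so `p11` sorts before `p5`; this changes the order within a pair, not which pairs are found.

```python
import sqlite3

# Only the rows shown on the slide: (basket, item).
BASKET_ROWS = [
    ("t1", "p2"), ("t1", "p5"), ("t1", "p8"),
    ("t2", "p5"), ("t2", "p8"), ("t2", "p11"),
]

PAIR_QUERY = """
SELECT I.item, J.item, COUNT(I.basket)
FROM Baskets I, Baskets J
WHERE I.basket = J.basket AND I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(I.basket) >= ?
ORDER BY I.item, J.item
"""


def high_support_pairs(rows, s):
    """Return (item1, item2, support) for pairs in at least s baskets."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Baskets (basket TEXT, item TEXT)")
    conn.executemany("INSERT INTO Baskets VALUES (?, ?)", rows)
    result = conn.execute(PAIR_QUERY, (s,)).fetchall()
    conn.close()
    return result


print(high_support_pairs(BASKET_ROWS, 2))
print(len(high_support_pairs(BASKET_ROWS, 1)))
```

The condition `I.item < J.item` keeps each pair once and drops pairs of an item with itself.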


Issues

● Performance for size 2 rules
  [Figure: the Baskets(basket, item) table is marked "big"; the self-joined table (basket, item1, item2) is marked "even bigger!"]
● Performance for size k rules


Implementing a Warehouse

● Monitoring: Sending data from sources
● Integrating: Loading, cleansing, ...
● Processing: Query processing, indexing, ...
● Managing: Metadata, Design, ...


Monitoring

● Source Types: relational, flat file, IMS, WWW, news-wire, …
● Incremental vs. Periodic Refresh

customer  id   name   address    city
          53   joe    10 main    sfo
          81   fred   12 main    sfo
          111  sally  80 willow  la     ← new
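One way to obtain the incremental change (the "new" row above) when a source only offers full copies is to compare two successive snapshots. The sketch below assumes each snapshot fits in memory as a dictionary keyed by customer id; `snapshot_diff` is a hypothetical helper, not a technique prescribed by these notes.

```python
def snapshot_diff(old, new):
    """Compare two snapshots keyed by id.

    Returns three dictionaries: inserted rows, deleted rows, and rows whose
    values changed (with their new values).
    """
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {
        k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]
    }
    return inserted, deleted, updated


# Customer table before and after the new row arrives.
OLD_SNAPSHOT = {
    53: ("joe", "10 main", "sfo"),
    81: ("fred", "12 main", "sfo"),
}
NEW_SNAPSHOT = {
    53: ("joe", "10 main", "sfo"),
    81: ("fred", "12 main", "sfo"),
    111: ("sally", "80 willow", "la"),
}

inserted, deleted, updated = snapshot_diff(OLD_SNAPSHOT, NEW_SNAPSHOT)
print(inserted)
```

With an incremental refresh only the `inserted`, `deleted` and `updated` rows are sent to the warehouse; a periodic refresh would resend the whole table.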


Monitoring Techniques

● Periodic snapshots
● Database triggers
● Log shipping
● Data shipping (replication service)
● Transaction shipping
● Polling (queries to source)
● Screen scraping
● Application level monitoring


Monitoring Issues

● Frequency
  ◆ periodic: daily, weekly, …
  ◆ triggered: on “big” change, lots of changes, ...
● Data transformation
  ◆ convert data to uniform format
  ◆ remove & add fields (e.g., add date to get history)
● Standards (e.g., ODBC)
● Gateways


Integration

● Data Cleaning
● Data Loading
● Derived Data

[Figure: warehouse architecture diagram repeated (Clients, Query & Analysis, Warehouse, Integration, Metadata)]


Data Cleaning

● Migration (e.g., yen → dollars)
● Scrubbing: use domain-specific knowledge (e.g., social security numbers)
● Fusion (e.g., mail list, customer merging)
● Auditing: discover rules & relationships (like data mining)

[Figure: customer1(Joe) from the billing DB and customer2(Joe) from the service DB are fused into merged_customer(Joe)]


Loading Data

● Incremental vs. periodic refresh
● Off-line vs. on-line
● Frequency of loading
  ◆ At night, 1x a week/month, continuously
● Parallel/Partitioned load


Derived Data

● Derived Warehouse Data
  ◆ indexes
  ◆ aggregates
  ◆ materialized views (next slide)
● When to update derived data?
● Incremental vs. periodic refresh


Materialized Views

● Define new warehouse relations using SQL expressions

product  id  name  price
         p1  bolt  10
         p2  nut   5

joinTb   prodId  name  price  storeId  date  amt
         p1      bolt  10     c1       1     12
         p2      nut   5      c1       1     11
         p1      bolt  10     c3       1     50
         p2      nut   5      c2       1     8
         p1      bolt  10     c1       2     44
         p1      bolt  10     c2       2     4

(joinTb does not exist at any source)
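A rough illustration of defining joinTb with a SQL expression, using Python's `sqlite3` module. SQLite has no materialized-view statement, so `CREATE TABLE ... AS SELECT` stands in for materialization here; the resulting table is a one-time copy that is not refreshed when `sale` or `product` change. The sale rows are taken from the 3-D cube example, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id TEXT, name TEXT, price INTEGER)")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [("p1", "bolt", 10), ("p2", "nut", 5)],
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ],
)

# Store the join result as a new warehouse relation.
conn.execute("""
    CREATE TABLE joinTb AS
    SELECT s.prodId AS prodId, p.name AS name, p.price AS price,
           s.storeId AS storeId, s.date AS date, s.amt AS amt
    FROM sale AS s JOIN product AS p ON s.prodId = p.id
""")

join_rows = conn.execute(
    "SELECT * FROM joinTb ORDER BY date, storeId, prodId"
).fetchall()
for row in join_rows:
    print(row)
```

Queries that need product names next to sales can now read joinTb directly instead of repeating the join, at the price of keeping the copy up to date.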


Processing

● ROLAP servers vs. MOLAP servers
● Index Structures
● What to Materialize?
● Algorithms

[Figure: warehouse architecture diagram repeated (Clients, Query & Analysis, Warehouse, Integration, Metadata)]


ROLAP Server

● Relational OLAP Server

[Figure: utilities and a ROLAP server on top of a relational DBMS]

sale  prodId  date  sum
      p1      1     62
      p2      1     19
      p1      2     48

Special indices, tuning; schema is “denormalized”


MOLAP Server

● Multi-Dimensional OLAP Server

[Figure: M.D. tools and utilities on top of a multi-dimensional server, which could also sit on a relational DBMS; the data is drawn as a cube with axes Product, City and Date (1 2 3 4)]


Index Structures

● Traditional Access Methods
  ◆ B-trees, hash tables, R-trees, grids, …
● Popular in Warehouses
  ◆ inverted lists
  ◆ bit map indexes
  ◆ join indexes
  ◆ text indexes


Inverted Lists

age index: 20, 21, 22, 23, 25, 26

inverted lists:
  20 → r4, r18, r34, r35
  21 → r5, r19, r37, r40

data records:

rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26


Using Inverted Lists

● Query:
  ◆ Get people with age = 20 and name = “fred”
● List for age = 20: r4, r18, r34, r35
● List for name = “fred”: r18, r52
● Answer is intersection: r18
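The query above can be replayed with a small sketch. The age lists are built from the eight data records shown on the Inverted Lists slide; the list for name = "fred" is copied from this slide as given (it mentions r52, which is not among the records shown). The helper names are assumptions made for the example.

```python
from collections import defaultdict

# Data records shown on the Inverted Lists slide: (rId, name, age).
RECORDS = [
    ("r4", "joe", 20), ("r18", "fred", 20), ("r19", "sally", 21),
    ("r34", "nancy", 20), ("r35", "tom", 20), ("r36", "pat", 25),
    ("r5", "dave", 21), ("r41", "jeff", 26),
]


def build_inverted_lists(records, position):
    """Map each value of the chosen field to the list of record ids."""
    lists = defaultdict(list)
    for record in records:
        lists[record[position]].append(record[0])
    return dict(lists)


def intersect(list_a, list_b):
    """Record ids present in both lists, in the order of list_a."""
    members = set(list_b)
    return [rid for rid in list_a if rid in members]


age_lists = build_inverted_lists(RECORDS, 2)
fred_list = ["r18", "r52"]  # as given on the slide
answer = intersect(age_lists[20], fred_list)
print(age_lists[20])
print(answer)
```

Only the two lists are read to answer the conjunctive query; the data records themselves are fetched afterwards, and only for the ids in the intersection.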


Bit Maps

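age index: 20, 21, 22, 23, 25, 26

bit maps (one bit per record; the vectors cover ten records, of which eight are shown):
  20 → 1101100000
  21 → 0010001011

data records:

id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26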


Using Bit Maps

● Query:
  ◆ Get people with age = 20 and name = “fred”
● List for age = 20: 1101100000
● List for name = “fred”: 0100000001
● Answer is intersection (bitwise AND): 0100000000
● Good if domain cardinality small
● Bit vectors can be compressed
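The intersection of two bit maps is a bitwise AND. A minimal sketch using the two ten-bit vectors given above, held as strings and converted to Python integers for the AND; the helper names `and_bitmaps` and `matching_positions` are assumptions made for the example.

```python
def and_bitmaps(a, b):
    """Bitwise AND of two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("bit maps must cover the same number of records")
    result = int(a, 2) & int(b, 2)
    # Pad with leading zeros so the result keeps one bit per record.
    return format(result, "0{}b".format(len(a)))


def matching_positions(bitmap):
    """1-based record positions whose bit is set."""
    return [i + 1 for i, bit in enumerate(bitmap) if bit == "1"]


AGE_20 = "1101100000"
NAME_FRED = "0100000001"

answer = and_bitmaps(AGE_20, NAME_FRED)
print(answer)
print(matching_positions(answer))
```

One vector is needed per distinct value, which is why the slide says bit maps pay off when the domain cardinality is small.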


• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT …

product  id  name  price
         p1  bolt  10
         p2  nut   5

joinTb   prodId  name  price  storeId  date  amt
         p1      bolt  10     c1       1     12
         p2      nut   5      c1       1     11
         p1      bolt  10     c3       1     50
         p2      nut   5      c2       1     8
         p1      bolt  10     c1       2     44
         p1      bolt  10     c2       2     4


Join Indexes

product  id  name  price  jIndex (join index)
         p1  bolt  10     r1, r3, r5, r6
         p2  nut   5      r2, r4

sale     rId  prodId  storeId  date  amt
         r1   p1      c1       1     12
         r2   p2      c1       1     11
         r3   p1      c3       1     50
         r4   p2      c2       1     8
         r5   p1      c1       2     44
         r6   p1      c2       2     4
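A sketch of how the jIndex column could be built and then used to fetch the sales of one product without scanning the whole sale table. The dictionary layout and the names `build_join_index` and `sales_of` are assumptions made for this illustration.

```python
# product: id -> (name, price); sale: rId -> (prodId, storeId, date, amt).
PRODUCT = {"p1": ("bolt", 10), "p2": ("nut", 5)}
SALE = {
    "r1": ("p1", "c1", 1, 12),
    "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50),
    "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44),
    "r6": ("p1", "c2", 2, 4),
}


def build_join_index(product, sale):
    """For each product id, list the ids of the sale rows that join with it."""
    index = {prod_id: [] for prod_id in product}
    for rid, row in sale.items():
        if row[0] in index:
            index[row[0]].append(rid)
    return index


def sales_of(prod_id, join_index, sale):
    """Fetch the joining sale rows directly through the index."""
    return [sale[rid] for rid in join_index[prod_id]]


J_INDEX = build_join_index(PRODUCT, SALE)
print(J_INDEX)
print(sales_of("p2", J_INDEX, SALE))
```

The index precomputes the join once; the cost is that it has to be maintained whenever sale rows are added or removed.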


What to Materialize?

● Store in warehouse results useful for common queries
● Example:

[Figure: the 3-D cube (day 1 and day 2 slices) and the aggregates derived from it, each a candidate to materialize]

        c1   c2   c3
p1      56    4   50
p2      11    8

        c1   c2   c3
sum     67   12   50

        sum
p1      110
p2       19

total sales


Materialization Factors

● Type/frequency of queries
● Query response time
● Storage cost
● Update cost


Cube Aggregates Lattice

                 city, product, date

   city, product     city, date     product, date

        city          product          date

[Figure: each lattice node corresponds to one aggregate of the cube, e.g. the full 3-D cube for (city, product, date) and the product × store table for (city, product)]

Need an algorithm to decide what to materialize
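The lattice can be enumerated mechanically: every subset of the dimensions is one possible group-by. The sketch below lists the nodes and counts how many rows each aggregate would hold for the six-row example, which is one input a selection algorithm could use. It does not choose what to materialize, and it also lists the empty grouping (the grand total), which the lattice drawing omits. The names `lattice` and `view_size` are assumptions.

```python
from itertools import combinations

DIMS = ("city", "product", "date")

# The six example sales, with the store ids c1..c3 as the city dimension.
SALE = [
    {"city": "c1", "product": "p1", "date": 1, "amt": 12},
    {"city": "c1", "product": "p2", "date": 1, "amt": 11},
    {"city": "c3", "product": "p1", "date": 1, "amt": 50},
    {"city": "c2", "product": "p2", "date": 1, "amt": 8},
    {"city": "c1", "product": "p1", "date": 2, "amt": 44},
    {"city": "c2", "product": "p1", "date": 2, "amt": 4},
]


def lattice(dims):
    """All group-by sets, from the full set down to the empty set."""
    return [
        combo
        for k in range(len(dims), -1, -1)
        for combo in combinations(dims, k)
    ]


def view_size(rows, group_by):
    """Number of rows the aggregate grouped by these dimensions would have."""
    return len({tuple(row[d] for d in group_by) for row in rows})


SIZES = {node: view_size(SALE, node) for node in lattice(DIMS)}
for node, size in SIZES.items():
    print(node, size)
```

Any node can be computed from a materialized node above it in the lattice, so sizes like these indicate how much work a materialized ancestor would save.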


Managing

● Metadata
● Warehouse Design
● Tools

[Figure: warehouse architecture diagram repeated (Clients, Query & Analysis, Warehouse, Integration, Metadata)]


Metadata

● Administrative
  ◆ definition of sources, tools, ...
  ◆ schemas, dimension hierarchies, …
  ◆ rules for extraction, cleaning, …
  ◆ refresh, purging policies
  ◆ user profiles, access control, ...


Metadata

● Business
  ◆ business terms & definition
  ◆ data ownership, charging
● Operational
  ◆ data lineage
  ◆ data currency (e.g., active, archived, purged)
  ◆ use stats, error reports, audit trails


Design

● What data is needed?
● Where does it come from?
● How to clean data?
● How to represent in warehouse (schema)?
● What to summarize?
● What to materialize?
● What to index?


Tools

● Development
  ◆ design & edit: schemas, views, scripts, rules, queries, reports
● Planning & Analysis
  ◆ what-if scenarios (schema changes, refresh rates), capacity planning
● Warehouse Management
  ◆ performance monitoring, usage patterns, exception reporting
● System & Network Management
  ◆ measure traffic (sources, warehouse, clients)
● Workflow Management
  ◆ “reliable scripts” for cleaning & analyzing data


Current State of Industry

● Extraction and integration done off-line
  ◆ Usually in large, time-consuming batches
● Everything copied at warehouse
  ◆ Not selective about what is stored
  ◆ Query benefit vs storage & update cost
● Query optimization aimed at OLTP
  ◆ High throughput instead of fast response
  ◆ Process whole query before displaying anything

Future Directions

● Better performance
● Larger warehouses
● Easier to use
● What are companies & research labs working on?


Research (1)

● Incremental Maintenance
● Data Consistency
● Data Expiration
● Recovery
● Data Quality
● Error Handling (Back Flush)


Research (2)

● Rapid Monitor Construction
● Temporal Warehouses
● Materialization & Index Selection
● Data Fusion
● Data Mining
● Integration of Text & Relational Data


Conclusions

● Massive amounts of data and complexity of queries will push limits of current warehouses
● Need better systems:
  ◆ easier to use
  ◆ provide quality information
