dbms2001notes 10: information integration1 principles of database management systems 10: information...

32
DBMS2001 Notes 10: Information Int egration 1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

Upload: brice-wilcox

Post on 05-Jan-2016

218 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

1

Principles of Database Management Systems

10: Information Integration

Pekka KilpeläinenUniversity of Kuopio

Page 2: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

2

Information Integration

• Large family of applications concerned with combining data from heterogeneous information sources

• Our goal: Get an overview of the – modes (or approaches, or architectures)– problems, and– some applications

of information integration

Page 3: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

3

Modes of information integration

• Federated databases– collection of independent, co-operating sources

• Warehousing– single DB containing physical copies of data

from several sources

• Mediation– system providing an integrated view to the

sources (non-materialized, virtual database)

Page 4: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

4

Example An integrated DB for dealer DBs of Aardvark

Automobile Co, to support locating requested models and market analysis

DB of Dealer1:

Cars(serialNo,model, color, autoTrans,cdPlayer, …) ---------------------------------------------------------- | 123 | BMV 1 | blue| yes | yes | … | 456 | MB 50 | red | yes | no | …

. . .

Page 5: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

5

Example (cont.)DB of Dealer2:

Autos(serial, model, color)

------------------------------ | 323 | MZ-6 | BL | | 555 | MB 50| Rd |

. . .Options(serial, option)

---------------------------------- | 323 | manualTrans | | 323 | cdPlayer |

. . .

Page 6: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

6

Problems of information integration

• Different names – Cars/Autos, serialNo/serial, …– renaming straightforward

• Different data types– variable vs. fixed-length strings vs. integers vs.

• Conceptual differences– Are trucks included in ’Cars’?– Are station wagons classified as minivans?

• Treatment of missing values– represent with NULLs or some default values?

Page 7: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

7

Information integration architectures

1. Federated database– one-to-one connections between independent

databases

DB1 DB2

DB3 DB4

– Requires n(n-1) interfaces to fully connect n systems together

Page 8: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

8

Dealer1 wants to search requested cars also from the DB of Dealer2, for requests stored in relation NeededCars(model,color,autoTrans)

• Needs to query relations Autos(serial, model, color) and Options(serial, option)

of Dealer2 (with the following embedded SQL):

Example (of federation)

yes/no

Page 9: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

9

for (each tuple (:m, :c, :a) in NeededCars) {

if (:a=’yes’) { // autoTrans wanted

SELECT serial FROM Autos, Options WHERE Autos.serial = Options.serial AND

option=’autoTrans’ AND

model = :m AND color = :c;

} else { // manual transmission wanted SELECT serial FROM Autos WHERE model = :m AND color = :c AND

NOT EXISTS ( SELECT * FROM OptionsWHERE serial = Autos.serial

ANDoption = ’autoTrans’ );

}

}

Page 10: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

10

2. Data Warehouse• Database formed by extracting and

combining data from several sources– E.g., sales data for a supermarket chain– generally read-only

Warehouse

Source Source Source

Extractor Extractor Extractor

Combiner

Page 11: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

11

Create a warehouse of Dealer1 and Dealer2 databases with combined schema CarWH(serNo,model,color,autoTrans,dealer)

• Extract Dealer1’s data:INSERT INTO CarWH(serNo,model,color,

autoTrans,dealer)SELECT serialNo,model,color,

autoTrans,’dealer1’ FROM Cars;

Example (Data warehouse of Aadvark Co.)

• Extract Dealer2’s data:

Page 12: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

12

INSERT INTO CarWH(serNo,model,color, autoTrans,dealer)

SELECT serial,model,color, ’yes’,’dealer2’

FROM Autos, Options WHERE Autos.serial = Options.serial

AND option=’autoTrans’;

INSERT INTO CarWH(serNo,model,color, autoTrans,dealer)

SELECT serial,model,color, ’no’,’dealer2’

FROM Autos WHERE NOT EXISTS (

SELECT * FROM Options WHERE serial = Autos.serial AND

option = ’autoTrans’ );

Page 13: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

13

3. Mediators• Support views combining data from

several sources

Client Client

Wrapper Wrapper Wrapper

Mediator

Source Source Source

Page 14: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

14

Integrate Dealer1 and Dealer2 databases as a virtual database with schema CarMed(serNo,model,color,autoTrans,dealer)

• Suppose user of the mediator gives query Q:SELECT serNo, model FROM CarMed WHERE color=‘red’ ;

• Translation by wrapper of Dealer1 easy:Q1: SELECT serialNo AS serNo, model

FROM Cars WHERE color=‘red’ ;• Query Q2 on relation Autos of Dealer2 similarly• Return UNION of results to Q1 and Q2

Example (mediation of dealer sources)

Page 15: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

15

Extractors and wrappers• An extactor for a warehouse needs

– one or more queries to get data from source• SQL, or other supported language

– communication mechanisms for• passing queries to source• receiving responces from the source• passing data to the warehouse

• A wrapper for a mediator more complex– translates arbitrary queries for the source

Page 16: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

16

An aside: XML wrappers

• What if the source does not provide a query language?– Need to translate a piece of extracted

data (file/report/message/document/…)– developing and maintaining translation

programs is tedious => a wrapper specification language XW (XML Wrapper; Univ. of Kuopio, 2001-2002)

Page 17: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

17

XML in data exchange

• XML an industry standard for textual representation of structured data (W3C Recommendation 1998)– XML-based protocols developed for

e-business, medical messages, … – old message formats need to be

converted to XML– this translation addressed by XW

Page 18: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

18

XW wrapping

AA x1

x2

BBy1 y2

z1 z2

<part-a> <part-a> <e1> <e1>x1x1</e1> </e1>

<e2> <e2>x2x2</e2></e2></part-a></part-a><part-b><part-b> <line-1> <line-1> <d1> <d1>y1y1</d1> </d1>

<d2> <d2>y2y2</d2></d2> </line-1> </line-1> <d3> <d3>z2z2</d3></d3></part-b></part-b>

XW-engineXW-engine

<xw:wrapper … ><xw:wrapper … > … … </xw:wrapper></xw:wrapper>

ssource ource datadata

wwrapprapper er descriptiondescription

result result documentdocument

Page 19: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

19

Back to DB wrappers: Query templates• Translating queries of mediator to queries on

the source possible through templates– with parameters (place-holders for constants of

the original query)– Notation T => S:

template T is translated to source query S

• Example: Translating color-based selections on mediated view to queries on Dealer1 DB

Page 20: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

20

SELECT * FROM CarMed WHERE color = ‘$c’ ;

=>

SELECT serialNo AS serNo, model, color, autoTrans, ‘dealer1’

FROM Cars WHERE color = ‘$c’ ;

• Similar templates for model and autoTrans– supporting all combinations of n attributes requires

2n templates!– Can do with fewer by applying filtering (see later)

Example (of query templates)

Page 21: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

21

Wrapper generators

• Templates defining a wrapper translated into code of a wrapper– by a wrapper generator– How?

• Two parts of a wrapper– table of query patterns specified by

templates– a fixed driver

Page 22: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

22

Wrapper-generator architecture

Source

queries results

Wrapper generator

Driver

tableWrapper

TemplatesQueries from

mediator Results

Page 23: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

23

Filtering

• Often impossible to write templates for each possible form of mediator query

• Possible to wrap more complex queries– if some template returns a superset of the

original query (by fixing only a part of conditions), then

– filter the result of the template by the additional query conditions

Page 24: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

24

How to evaluate query SELECT * FROM CarMed

WHERE color=‘blue’ AND model=‘Gobi’ ;if we have only the below template?

SELECT * FROM CarMed WHERE color = ‘$c’ ;

=>

SELECT serialNo AS serNo, model, color,

autoTrans, ‘dealer1’ FROM Cars WHERE color = ‘$c’ ;

Example (of filtering by wrapper)

Page 25: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

25

1. Apply template with $c = ‘blue’ and store the result:

INSERT INTO Temp(serNo,model,color,autoTrans,dealer)

SELECT serialNo AS serNo, model, color, autoTrans, ‘dealer1’

FROM Cars WHERE color = ‘blue’ ;

Example (of filtering cont.)

2. Filter out Gobis from relation Temp:

SELECT * FROM Temp WHERE model = ‘Gobi’ ;

Page 26: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

26

Warehouses and their applications

• Warehouses– aimed at executive, decision maker,

analyst– often a copy of operational data

• Advantages– High query performance– Local processing at sources unaffected– Can operate when sources unavailable

Page 27: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

27

Applications of warehouses (1)• OLAP (On-line Analytic Processing)

– finding important patterns or trends from data stored at the warehouse

– involves complex queries on large parts of the database

– Contrast: OLTP (On Line Transaction Processing)• Describes processing at operational sites,

not at the warehouse

Page 28: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

28

OLTP vs. OLAP

• Reads & updates• Many small

transactions• MB-TB of data• Raw, up-to-date data• Clerical users• Consistency,

recoverability critical

• Mostly reads• Queries long,

complex• GB-TB of data• Summarized data,

updated periodically• Decision-makers,

analysts as users

OLTP OLAP

Page 29: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

29

• Data mining– active area of R&D since 1990’s– analysis of large amounts of data for

finding interesting regularities or exceptions

• Why?– To get valuable information out of huge

data collections of research commerce and industry

Applications of warehouses (2)

Page 30: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

30

Example (data mining)

• “Association rules” (IBM 1993)– way to analyse co-occurrence of values – E.g. the market-basket analysis:

bread, cheese butter (53%, 74%)

confidence: how many buyers of bread&cheese also bought butter?

support: how many market-baskets contain bread, cheese and butter

Page 31: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

31

What’s new in Data Mining?• Statistics:

– methods to test hypotheses on data

• Data mining– automatic generation of hypotheses

among hundreds or thousands of explaining variables

• OLAP:– user-driven examination of data

Page 32: DBMS2001Notes 10: Information Integration1 Principles of Database Management Systems 10: Information Integration Pekka Kilpeläinen University of Kuopio

DBMS2001 Notes 10: Information Integration

32

Summary

• We’ve had an overview of information integration

• Different approaches– federation, warehousing, mediation

• Applications of data warehouses– OLAP– data mining