Data Integration: A Status Report Alon Halevy University of Washington, Seattle BTW 2003


Page 1: Data Integration: A Status Report

Data Integration: A Status Report

Alon Halevy
University of Washington, Seattle

BTW 2003

Page 2: Data Integration: A Status Report

February 27th, 2003 BTW 2003

Data Integration Report
• Recent progress
  • Mediation languages
  • Query processing (XML and other)
  • Commercial
• Current challenges
  • Flexible architectures: peer-data mgmt.
  • Getting to the root of semantic heterogeneity: schema mapping.

Page 3: Data Integration: A Status Report

Data Integration Systems

[Figure: the mybooks.com mediated schema (Books, Inventory, Orders, Shipping, Reviews) above sources reached over the WAN and the Internet. Source labels: East, West, ..., FedEx, UPS, Orders, Customer Reviews, alt.books.reviews, NY Times, ..., Morgan-Kaufman, Prentice-Hall.]

• This is one possible architecture (virtual integration).
• Only the logical mediated schema is central. Data stays at the sources.

Page 4: Data Integration: A Status Report

Motivation and Activity
• Application areas of data integration:
  • Enterprise information integration ($$)
  • The government
  • Data sources on the web
  • Scientific data sharing.
• Many research projects:
  • Mine: Information Manifold, Tukwila, LSD.
• Companies:
  • Many startups, big guys getting in.

Page 5: Data Integration: A Status Report

Outline
• Recent progress
  • Mediation languages
  • Adaptive Query processing
  • XML data management
  • Commercial
• Current challenges
  • Flexible architectures: peer-data mgmt.
  • Getting to the root of semantic heterogeneity: schema mapping.
  • Crossing the Structure Chasm.

Page 6: Data Integration: A Status Report

Mediation Languages

Goal: a language for specifying semantic relationships.

[Figure: query Q over the Mediated Schema; queries Q’ over each of five Sources.]

Page 7: Data Integration: A Status Report

Global-as-View (GAV)

[Figure: Mediated Schema (Title, Actor, …) over five Sources, R1 R2 R3 R4 R5.]

Create view Actor AS
  R1
  Union
  Select A,B From S2
  Union
  …
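The GAV definition on this slide can be tried out as an ordinary SQL view. The sketch below is illustrative only: it uses Python's built-in sqlite3 module, and the column names and sample rows for R1 and S2 are assumptions, not part of the talk.

```python
import sqlite3

# In-memory database standing in for two sources. The column names and
# sample rows are assumptions made for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1 (title TEXT, actor TEXT);
    CREATE TABLE S2 (A TEXT, B TEXT);
    INSERT INTO R1 VALUES ('Metropolis', 'Brigitte Helm');
    INSERT INTO S2 VALUES ('Metropolis', 'Brigitte Helm');
    INSERT INTO S2 VALUES ('Vertigo', 'James Stewart');

    -- GAV: the mediated relation Actor is defined as a view over the sources.
    CREATE VIEW Actor AS
        SELECT title, actor FROM R1
        UNION
        SELECT A AS title, B AS actor FROM S2;
""")

# A query over the mediated schema is answered by unfolding the view.
rows = conn.execute(
    "SELECT title, actor FROM Actor WHERE title = ? ORDER BY actor",
    ("Metropolis",),
).fetchall()
total = conn.execute("SELECT COUNT(*) FROM Actor").fetchone()[0]
print(rows, total)
```

Because the view uses UNION, the tuple reported by both sources appears once in the mediated relation.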

Page 8: Data Integration: A Status Report

Local-as-View (LAV)

[Figure: Mediated Schema (Title, Actor, …) over five Sources, R1 R2 R3 R4 R5.]

Create View R1 as
  Select title, name
  From Title Join Actor
  Where Year>1970

Create View R5 as
  Select *
  From Movie
  Where lang='German'

(GLAV)
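One thing LAV descriptions make possible is pruning sources that cannot contribute to a query. The following is a minimal sketch under strong assumptions: each source description is reduced to the selection predicates shown on the slide (Year > 1970 for R1, lang = German for R5), the query is an equality selection on year and/or lang, and the question of which mediated relations a source covers is ignored. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceDescription:
    """Selection predicates taken from a LAV view definition (simplified)."""
    name: str
    min_year: Optional[int] = None  # exclusive lower bound on Year, if any
    lang: Optional[str] = None      # required language, if any


SOURCES = [
    SourceDescription("R1", min_year=1970),  # ... Where Year > 1970
    SourceDescription("R5", lang="German"),  # ... Where lang = 'German'
]


def relevant_sources(sources, year=None, lang=None):
    """Return names of sources whose description does not contradict the query."""
    result = []
    for source in sources:
        if year is not None and source.min_year is not None and year <= source.min_year:
            continue  # the source only holds later years
        if lang is not None and source.lang is not None and lang != source.lang:
            continue  # the source only holds another language
        result.append(source.name)
    return result


print(relevant_sources(SOURCES, year=1950))                 # R1 is pruned
print(relevant_sources(SOURCES, year=1980, lang="French"))  # R5 is pruned
```

A real LAV reformulation algorithm answers queries using views and must also reason about joins and projections; this sketch covers only the pruning step.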

Page 9: Data Integration: A Status Report

Adaptive Query Processing
• Problem: no stats, network unstable
  • Cannot ‘Plan and then execute’
  • Need to adapt plan during execution.
• Idea already in Ingres (1976)
• Proposed before data integration:
  • Cole and Graefe (choose nodes)
  • Kabra and DeWitt (mid-query re-opt).
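To make "adapt the plan during execution" concrete, here is a small generic sketch. It is not any of the systems named above: it simply evaluates a conjunction of filters batch by batch, records how often each filter passes, and re-orders the filters between batches so that the most selective one runs first. All names are hypothetical.

```python
def adaptive_filter(rows, predicates, batch_size=100):
    """Apply all predicates to rows, re-ordering them between batches.

    predicates is a list of (name, function) pairs. The observed pass rates
    are conditional on the current order, so they are only rough estimates.
    """
    stats = {name: {"seen": 0, "passed": 0} for name, _ in predicates}
    order = list(predicates)
    output = []

    def pass_rate(pred):
        s = stats[pred[0]]
        return s["passed"] / s["seen"] if s["seen"] else 1.0

    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            keep = True
            for name, test in order:
                stats[name]["seen"] += 1
                if test(row):
                    stats[name]["passed"] += 1
                else:
                    keep = False
                    break
            if keep:
                output.append(row)
        # Re-plan: run the predicate with the lowest observed pass rate first.
        order.sort(key=pass_rate)
    return output, [name for name, _ in order]


rows = list(range(1000))
predicates = [
    ("even", lambda x: x % 2 == 0),
    ("multiple_of_100", lambda x: x % 100 == 0),
]
result, final_order = adaptive_filter(rows, predicates)
print(result, final_order)
```

No statistics are needed up front: the initial order is arbitrary, and the executor corrects it after the first batch.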

Page 10: Data Integration: A Status Report

Convergent Query Processing
[Zack Ives, Ph.D 2002, U. Penn]
• Processor starts with initial plan
  • Monitors execution, accumulating stats.
• Switches plan when a better one found
  • Reuses intermediate results.
  • Final, cleanup phase.
• Possible transformation types:
  • Plan partitioning, data partitioning, low-level rescheduling.
  • Can be aggressive (e.g., with aggregations).

Page 11: Data Integration: A Status Report

XML Query Processing
• XML facilitates integration.
• Mediator query processor may manipulate XML directly.
• Progress on:
  • Publishing to XML, XML views on relations
  • Physical algebras for manipulating XML
  • Optimization of XQuery.

Page 12: Data Integration: A Status Report

The Commercial World
• Some startups:
  • Nimble, MetaMatrix, Calixa, Enosys, …
• Big guys making announcements:
  • IBM, BEA, MS, (Oracle still being defiant).
• Progress: analysts have a buzzword: EII.
• Challenges:
  • Integration with EAI?
  • Yet another middleware?
  • Horizontal vs. vertical?

Page 13: Data Integration: A Status Report

Outline
• Recent progress
  • Mediation languages
  • Adaptive Query processing
  • XML data management
  • Commercial
• Current challenges
  • Flexible architectures: peer-data mgmt.
  • Getting to the root of semantic heterogeneity: schema mapping.

Page 14: Data Integration: A Status Report

Peer Data-Management
• PDMS: a network of peers
• Peers can:
  • Export base data
  • Provide views on base data
  • Serve as logical mediators for other peers
• A peer can be both a server and a client.
• Semantic relationships are specified locally (between small sets of peers).

Page 15: Data Integration: A Status Report

Network of Mappings (Piazza)

[Figure: peers UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, and Berlin connected by GAV, LAV, and GLAV mappings, with query labels Q, Q’, Q’’.]

Page 16: Data Integration: A Status Report

Advantages of PDMS
• No need for a central mediated schema.
• Can map data opportunistically, as is most convenient.
• Queries are posed using the peer’s schema.
  • Answers come from anywhere in the system.
• Semantic Web.
• This is not P2P file sharing.
  • Data has rich semantics
  • Membership is not as dynamic.

Page 17: Data Integration: A Status Report

Schema Mediation

[Figure: peers UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, and Berlin connected by GAV, LAV, and GLAV mappings, with query labels Q, Q’, Q’’.]

When can LAV and GAV be combined to form such a network structure? [ICDE-03], [WWW-03 for XML]

Page 18: Data Integration: A Status Report

Query Optimization

[Figure: peers UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, and Berlin, with query labels Q, Q’, Q’’.]

Problems:
• redundant paths
• expensive reformulation.

Possible solution:
• Pre-compose some paths
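The "redundant paths" problem can be seen by enumerating the mapping paths between two peers. The sketch below is only an illustration: the peer names come from the slides, but the edges in MAPPINGS are invented, since the transcript does not preserve which peers are actually connected.

```python
# Hypothetical mapping edges between the peers named on the slides.
MAPPINGS = {
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["DBLP"],
    "Berlin": ["Leipzig"],
    "Leipzig": ["Saarbruecken"],
    "Saarbruecken": ["DBLP"],
    "DBLP": ["CiteSeer"],
}


def simple_paths(graph, start, goal, path=None):
    """Enumerate all cycle-free mapping paths from start to goal."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for neighbor in graph.get(start, []):
        if neighbor not in path:  # never revisit a peer on the same path
            paths.extend(simple_paths(graph, neighbor, goal, path))
    return paths


paths_to_dblp = simple_paths(MAPPINGS, "UW", "DBLP")
print(paths_to_dblp)
```

Each path is a separate chain of reformulations of the same query. When several paths reach the same peer, some of that work may be redundant, which is one motivation for pre-composing frequently used paths.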

Page 19: Data Integration: A Status Report

Mapping Composition
• Incredibly subtle! [w/ Madhavan]
• In general, composition can be an infinite set of GLAV formulas.
• Results:
  • Finite in many cases
  • Even when infinite, often has finite, useful encoding.
• Hence, compositions can usually be pre-optimized.

Page 20: Data Integration: A Status Report

Management of Updates [w/ Mork, Gribble]

[Figure: peers UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, and Berlin, with query labels Q, Q’, Q’’.]

Problem: when updates are generated, we don’t know who will use them.

Solution:
• Represent updates as first-class citizens
• Complement with boosters
• Rules for usage.

Page 21: Data Integration: A Status Report

Other Research Issues

[Figure: peers UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, and Berlin, with query labels Q, Q’, Q’’.]

• Intelligent data placement
• Management of mapping networks
• Improving networks: finding additional connections.
• Indexing of views

Page 22: Data Integration: A Status Report

Schema Matching/Mapping
• Given:
  • S1 and S2: a pair of schemas/DTDs/ontologies, …
  • Possibly, accompanying data instances
  • Additional domain knowledge
• Find:
  • A match between S1 and S2
    • A set of correspondences between the terms.
  • Ultimately, a mapping
    • Should enable translating data between the schemas.

Page 23: Data Integration: A Status Report

Example: House Listings

[Figure: two house-listing schemas, each rooted at "house". Element labels: location, view, address, front, back, num-baths, full-baths, half-baths, Water view, Lake, Mountains. Legend: 1-1 mapping; non 1-1 mapping?]
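The difference between a 1-1 and a non 1-1 correspondence shows up as soon as data has to be translated. The sketch below is a guess at what the example might look like in code. It assumes that address corresponds 1-1 to location, and that num-baths is derived from full-baths and half-baths by simple addition. Both the direction of translation and the addition are assumptions, since the figure itself is lost.

```python
def translate_listing(listing):
    """Translate a listing from the (assumed) source schema to the target one.

    Source fields: address, full_baths, half_baths.
    Target fields: location, num_baths.
    """
    return {
        # 1-1 mapping: one source element maps to one target element.
        "location": listing["address"],
        # Non 1-1 mapping: two source elements combine into one target
        # element. Summing them is an assumption made for illustration.
        "num_baths": listing["full_baths"] + listing["half_baths"],
    }


translated = translate_listing(
    {"address": "12 Lake St", "full_baths": 2, "half_baths": 1}
)
print(translated)
```

A set of correspondences alone would say only that num-baths relates to full-baths and half-baths. The mapping also has to say how they combine.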

Page 24: Data Integration: A Status Report

Motivations
• Heart of any data sharing architecture
  • Virtual, warehouse, messaging, web services, semantic web
  • Translation of legacy data, EAI, …
• Key operator in model management
  • Algebra for manipulating models of data
  • See [Bernstein, CIDR-03], Melnik et al. [SIGMOD 03].
• Currently, a bottleneck. Done mostly by hand.

Page 25: Data Integration: A Status Report

Approaches to Matching
• Matching is hard because the schema does not fully capture the semantics.
• Many techniques proposed. They consider similarities in:
  • Attribute names (synonyms)
  • Data values, data types
  • Relationships between columns
  • Structural similarities
• Anything a human expert would try!
• Hence, let’s try to simulate a human.

Page 26: Data Integration: A Status Report

Philosophy of Solutions
• Effective schema matching requires a principled combination of techniques.
• Like human experts, the matcher should improve over time
  • Learn from seeing many schemas, matches.
• LSD [Doan, Ph.D 2002, U. of Illinois]
• COMA [Do et al.]
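A combination of techniques can be sketched with two very simple matchers, one on attribute names and one on data values, whose scores are averaged. This is not how LSD or COMA work internally. It is a minimal illustration using Python's standard difflib module, with made-up sample values and equal weights chosen arbitrarily.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity of two attribute names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_similarity(values_a, values_b):
    """Jaccard overlap of two sets of sample data values."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def combined_score(elem_a, elem_b, w_name=0.5, w_value=0.5):
    """Weighted combination of the two matchers (weights are arbitrary)."""
    (name_a, values_a), (name_b, values_b) = elem_a, elem_b
    return (w_name * name_similarity(name_a, name_b)
            + w_value * value_similarity(values_a, values_b))


def best_match(elem, candidates):
    """Name of the candidate element with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(elem, c))[0]


# Each element is (attribute name, sample values); the values are invented.
location = ("location", ["Seattle, WA", "Portland, OR"])
candidates = [
    ("address", ["Seattle, WA", "Miami, FL"]),
    ("num-baths", ["2", "3"]),
]
print(best_match(location, candidates))
```

On names alone, "location" is closer to "num-baths" than to "address". The overlap in data values is what tips the decision, which is the point of combining evidence.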

Page 27: Data Integration: A Status Report

Corpus Based Solution
[Madhavan, Bernstein, Chen, Halevy, Shenoy]
• Collect a corpus of schemas and matches.
• Learn from the corpus:
  • Create a classifier for every corpus element
  • Use multi-strategy learning.
• Given S1 and S2:
  • Compare each schema element to corpus elements.
  • If two elements’ similarity vectors are close, then maybe they match each other.
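The similarity-vector idea can be illustrated with a toy corpus. In the sketch below, a plain Jaccard overlap on descriptive tokens stands in for the learned per-element classifiers, which is a large simplification. The corpus contents, token sets, and function names are all invented for illustration.

```python
import math

# Toy corpus: each corpus element is summarized by a set of tokens.
CORPUS = {
    "address": {"street", "city", "zip"},
    "price": {"usd", "amount", "cost"},
    "phone": {"tel", "number", "area"},
}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_vector(tokens, corpus=CORPUS):
    """Similarity of one schema element to every corpus element."""
    return [jaccard(set(tokens), corpus[name]) for name in sorted(corpus)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Elements of two hypothetical schemas S1 and S2.
s1_location = similarity_vector({"street", "city"})
s2_addr = similarity_vector({"city", "zip"})
s2_cost = similarity_vector({"usd", "amount"})

print(cosine(s1_location, s2_addr))  # close vectors: candidate match
print(cosine(s1_location, s2_cost))  # orthogonal vectors: no evidence
```

The two elements that match share only one token directly, yet their vectors coincide because both resemble the same corpus element. The corpus acts as an intermediary between the two schemas.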

Page 28: Data Integration: A Status Report

Learning from Corpus vs. Learning from the schemas

[Figure: Shipping Domain. Y-axis: Recall (0 to 1). X-axis: Schema Pairs (P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b). Series: MKB, BASIC.]

Page 29: Data Integration: A Status Report

Finding Different Matches

[Figure: Shipping Domain. Y-axis: Avg Number of Matches (-15 to 15). X-axis: Schema Pairs (P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b). Series: Extra Matches, Missed Matches.]

Page 30: Data Integration: A Status Report

Other Corpus Based Tools
• Conjecture: a corpus of schemas can be the basis for many useful tools.
• Auto-complete:
  • I start creating a schema (or show sample data), and the tool suggests a completion.
• Query reformulation:
  • I ask a query using my terminology, and it gets reformulated appropriately.
  • Improving structured queries over structured web sites (and focused crawling, a la BINGO!)
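The auto-complete idea might look roughly like the following. This is a speculative sketch, not a description of an existing tool: corpus relations that share attributes with the partial schema vote for their remaining attributes, weighted by the size of the overlap. The corpus and the function name are made up.

```python
from collections import Counter


def suggest_attributes(partial, corpus, top_k=3):
    """Suggest attributes to add to a partially written relation.

    corpus is a list of attribute lists, one per relation seen before.
    """
    partial = set(partial)
    votes = Counter()
    for attributes in corpus:
        overlap = len(partial & set(attributes))
        if overlap == 0:
            continue  # unrelated relation, no vote
        for attr in attributes:
            if attr not in partial:
                votes[attr] += overlap
    # Highest vote first; ties broken alphabetically for determinism.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [attr for attr, _ in ranked[:top_k]]


corpus = [
    ["title", "author", "isbn"],
    ["title", "isbn", "price"],
    ["id", "name"],
]
suggestions = suggest_attributes(["title"], corpus)
print(suggestions)
```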

Page 31: Data Integration: A Status Report

The Corpus
• Contents:
  • Schemas, ontologies, meta-data, data, queries.
• Sample statistics:
  • How often does a word appear as a relation name?
  • When it does, what tend to be the attribute names?
  • What other tables are there? What are the foreign keys?
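The first two sample statistics are straightforward to compute once the corpus is in hand. A minimal sketch, assuming each schema is represented as a dictionary from relation name to attribute list (a representation chosen here for illustration, with invented contents):

```python
from collections import Counter, defaultdict

# Hypothetical corpus: one dictionary per schema.
CORPUS = [
    {"book": ["title", "author", "isbn"], "order": ["id", "isbn", "qty"]},
    {"book": ["title", "isbn", "price"]},
    {"customer": ["id", "name"], "order": ["id", "customer_id"]},
]


def corpus_statistics(corpus):
    """Count relation names, and attribute names per relation name."""
    relation_counts = Counter()
    attribute_counts = defaultdict(Counter)
    for schema in corpus:
        for relation, attributes in schema.items():
            relation_counts[relation] += 1
            attribute_counts[relation].update(attributes)
    return relation_counts, attribute_counts


relation_counts, attribute_counts = corpus_statistics(CORPUS)
print(relation_counts["book"])                 # how often "book" names a relation
print(attribute_counts["book"].most_common())  # its typical attributes
```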

Page 32: Data Integration: A Status Report

Conclusion: Crossing the Structure Chasm
• Data authoring, querying and sharing is everywhere; done by novices too.
• Semantic web: the extreme example.

[Figure labels: Corpus of schemas; schema mapping.]

Page 33: Data Integration: A Status Report

Some References
• www.cs.washington.edu/homes/alon
• Piazza: WebDB01, ICDE03, WWW03
• The Structure Chasm: CIDR-03
• Mediation surveys:
  • VLDB Journal 01
  • Lenzerini, PODS 02 tutorial.
• Schema matching:
  • Rahm and Bernstein, VLDB Journal 01.