Data Integration: A Status Report

Alon Halevy
University of Washington, Seattle

BTW 2003

February 27th, 2003

Data Integration Report

• Recent progress
  • Mediation languages
  • Query processing (XML and other)
  • Commercial
• Current challenges
  • Flexible architectures: peer-data management
  • Getting to the root of semantic heterogeneity: schema mapping

[Figure: the mybooks.com mediated schema (Books, Inventory, Orders, Shipping, Reviews) above sources reached over the Internet and a WAN. Source labels: Morgan-Kaufman, Prentice-Hall, ...; East, West, ...; Orders; FedEx, UPS; Customer Reviews, NY Times, alt.books.reviews.]

Data Integration Systems

• This is one possible architecture (virtual integration).
• Only the logical mediated schema is central. Data stays at the sources.


Motivation and Activity

• Application areas of data integration:
  • Enterprise information integration ($$)
  • The government
  • Data sources on the web
  • Scientific data sharing
• Many research projects:
  • Mine: Information Manifold, Tukwila, LSD
• Companies:
  • Many startups; the big guys are getting in


Outline

• Recent progress
  • Mediation languages
  • Adaptive query processing
  • XML data management
  • Commercial
• Current challenges
  • Flexible architectures: peer-data management
  • Getting to the root of semantic heterogeneity: schema mapping
• Crossing the Structure Chasm


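Mediation Languages

Goal: a language for specifying semantic relationships between the mediated schema and the sources.

[Figure: a mediated schema above five sources; a query Q is posed on the mediated schema, with a query Q' at each source.]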


Global-as-View (GAV)

[Figure: a mediated schema (Title, Actor, …) above five sources R1, R2, R3, R4, R5.]

Create View Actor AS
  R1
  Union
  Select A, B From S2
  Union
  …
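
A minimal sketch of the GAV idea, using Python's built-in sqlite3 module as a stand-in for a mediator. The mediated relation Actor is defined as a view over two sources, so a query on Actor is answered by unfolding the view. The column layouts of R1 and S2 and the sample rows are assumptions made for illustration; they are not from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical sources with different column names.
    CREATE TABLE R1 (title TEXT, name TEXT);
    CREATE TABLE S2 (A TEXT, B TEXT);
    INSERT INTO R1 VALUES ('Metropolis', 'Brigitte Helm');
    INSERT INTO S2 VALUES ('Casablanca', 'Ingrid Bergman');

    -- GAV: the mediated relation is a view (here a union) over the sources.
    CREATE VIEW Actor AS
        SELECT title, name FROM R1
        UNION
        SELECT A AS title, B AS name FROM S2;
""")

def actors_in(title):
    """Query the mediated relation; SQLite unfolds the view over R1 and S2."""
    rows = conn.execute(
        "SELECT name FROM Actor WHERE title = ? ORDER BY name", (title,)
    )
    return [name for (name,) in rows]
```

Calling actors_in('Casablanca') reads from S2 without the caller knowing that the source exists. Adding a source under GAV means editing the view definition.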


Local-as-View (LAV)

[Figure: a mediated schema (Title, Actor, …) above five sources R1, R2, R3, R4, R5.]

Create View R1 as
  Select title, name
  From Title Join Actor
  Where Year > 1970

Create View R5 as
  Select *
  From Movie
  Where lang = 'German'

(GLAV)
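
A toy sketch of one use of LAV descriptions: deciding which sources can contribute to a query. Each source is described as a selection over the mediated relation Movie, and a source is skipped when its description contradicts the query. This handles equality conditions only and is far simpler than real algorithms for answering queries using views. The sources R6 and R7 and the dictionary encoding are invented for illustration.

```python
# Hypothetical LAV descriptions: each source is a selection over the mediated
# relation Movie, written as attribute -> required value.
SOURCE_DESCRIPTIONS = {
    "R5": {"lang": "German"},   # from the slide: Where lang = 'German'
    "R6": {"lang": "French"},   # invented source, for illustration only
    "R7": {},                   # invented source with no restriction
}

def relevant_sources(query, descriptions=SOURCE_DESCRIPTIONS):
    """Return the sources whose description does not contradict the query.

    `query` maps attributes to required values (equality conditions only)."""
    relevant = []
    for source, constraints in sorted(descriptions.items()):
        contradicts = any(
            attr in query and query[attr] != value
            for attr, value in constraints.items()
        )
        if not contradicts:
            relevant.append(source)
    return relevant
```

Under LAV, adding a source only adds one description; no existing definition changes.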


Adaptive Query Processing

• Problem: no statistics, unstable network.
• Cannot ‘plan and then execute’.
• Need to adapt the plan during execution.
• Idea already in Ingres (1976).
• Proposed before data integration:
  • Cole and Graefe (choose nodes)
  • Kabra and DeWitt (mid-query re-optimization)


Convergent Query Processing
[Zack Ives, Ph.D 2002, U. Penn]

• Processor starts with an initial plan.
• Monitors execution, accumulating statistics.
• Switches plans when a better one is found.
• Reuses intermediate results.
• Final cleanup phase.
• Possible transformation types:
  • Plan partitioning, data partitioning, low-level rescheduling.
• Can be aggressive (e.g., with aggregations).
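
A toy sketch of the monitor-and-switch loop, assuming the simplest possible plan: a chain of filters. It is not the convergent query processing algorithm itself, only an illustration of collecting statistics during execution and changing the plan mid-stream. The function name and the reorder_every parameter are hypothetical.

```python
def adaptive_filter(rows, predicates, reorder_every=100):
    """Apply all predicates to each row, periodically re-ordering them so the
    predicate with the lowest observed pass rate runs first."""
    order = list(range(len(predicates)))
    seen = [0] * len(predicates)      # rows each predicate was evaluated on
    passed = [0] * len(predicates)    # rows each predicate let through
    output = []
    for count, row in enumerate(rows, start=1):
        keep = True
        for i in order:
            seen[i] += 1
            if predicates[i](row):
                passed[i] += 1
            else:
                keep = False
                break
        if keep:
            output.append(row)
        if count % reorder_every == 0:
            # Switch plans: most selective predicate first; a predicate that
            # has not been evaluated yet is treated as passing everything.
            order.sort(key=lambda i: passed[i] / seen[i] if seen[i] else 1.0)
    return output, order
```

The result set is the same under any ordering; only the amount of work per row changes as the statistics accumulate.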


XML Query Processing

• XML facilitates integration.
• The mediator's query processor may manipulate XML directly.
• Progress on:
  • Publishing to XML, XML views on relations
  • Physical algebras for manipulating XML
  • Optimization of XQuery


The Commercial World

• Some startups:
  • Nimble, MetaMatrix, Calixa, Enosys, …
• Big guys making announcements:
  • IBM, BEA, MS (Oracle still being defiant)
• Progress: analysts have a buzzword: EII.
• Challenges:
  • Integration with EAI?
  • Yet another middleware?
  • Horizontal vs. vertical?


Outline

• Recent progress
  • Mediation languages
  • Adaptive query processing
  • XML data management
  • Commercial
• Current challenges
  • Flexible architectures: peer-data management
  • Getting to the root of semantic heterogeneity: schema mapping


Peer Data-Management

• PDMS (peer data-management system): a network of peers.
• Peers can:
  • Export base data
  • Provide views on base data
  • Serve as logical mediators for other peers
• A peer can be both a server and a client.
• Semantic relationships are specified locally (between small sets of peers).

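Network of Mappings (Piazza)

[Figure: a network of peers (UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, Berlin) connected by GAV, LAV, and GLAV mappings; a query Q posed at one peer appears as Q' and Q'' at other peers.]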


Advantages of PDMS

• No need for a central mediated schema.
• Can map data opportunistically, as is most convenient.
• Queries are posed using the peer's schema.
• Answers come from anywhere in the system.
• Semantic Web.
• This is not P2P file sharing:
  • Data has rich semantics.
  • Membership is not as dynamic.
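
A minimal sketch of how a query posed in one peer's schema might reach other peers by following mappings transitively. It assumes the simplest kind of mapping, a one-to-one attribute renaming, which is much weaker than the GAV/LAV/GLAV mappings in Piazza. The peer names come from the figure; the attribute names and the mappings themselves are invented.

```python
from collections import deque

# Hypothetical peer mappings: (from_peer, to_peer) -> attribute renaming.
MAPPINGS = {
    ("UW", "Stanford"): {"title": "paper_title", "author": "writer"},
    ("Stanford", "DBLP"): {"paper_title": "name", "writer": "creator"},
    ("UW", "CiteSeer"): {"title": "doc_title"},
}

def reformulate(start_peer, attributes, mappings=MAPPINGS):
    """Rewrite a list of attributes for every peer reachable from start_peer.

    Breadth-first over the mapping graph; an attribute that a mapping does
    not cover is dropped for that peer."""
    rewritten = {start_peer: list(attributes)}
    queue = deque([start_peer])
    while queue:
        peer = queue.popleft()
        for (src, dst), renaming in mappings.items():
            if src == peer and dst not in rewritten:
                rewritten[dst] = [
                    renaming[a] for a in rewritten[peer] if a in renaming
                ]
                queue.append(dst)
    return rewritten
```

The query is written once, in UW's vocabulary, and DBLP is reached through Stanford even though no direct UW-to-DBLP mapping exists.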

Schema Mediation

[Figure: peer network (UW, Stanford, DBLP, CiteSeer, Berlin) with GAV, LAV, and GLAV mappings and the queries Q, Q', Q''.]

When can LAV and GAV be combined to form such a network structure? [ICDE-03], [WWW-03 for XML]

Query Optimization

[Figure: peer network (UW, Stanford, DBLP, CiteSeer, Berlin) with the queries Q, Q', Q''.]

Problems:
• Redundant paths
• Expensive reformulation

Possible solution:
• Pre-compose some paths


Mapping Composition

• Incredibly subtle! [w/ Madhavan]
• In general, the composition can be an infinite set of GLAV formulas.
• Results:
  • Finite in many cases.
  • Even when infinite, it often has a finite, useful encoding.
• Hence, compositions can usually be pre-optimized.
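
A sketch of pre-composing a path in the trivial special case where each mapping is a one-to-one attribute renaming. Composing GLAV formulas is far harder, as the slide stresses; this only shows what it means to replace a two-hop path by a single stored mapping. The function and the attribute names are hypothetical.

```python
def compose(first, second):
    """Compose two attribute renamings: apply `first`, then `second`.

    Attributes that `second` does not cover are dropped from the result."""
    return {a: second[b] for a, b in first.items() if b in second}

# Hypothetical two-hop path, pre-composed into one direct mapping.
uw_to_stanford = {"title": "paper_title", "author": "writer"}
stanford_to_dblp = {"paper_title": "name"}
uw_to_dblp = compose(uw_to_stanford, stanford_to_dblp)
```

Here uw_to_dblp maps only title, because author has no counterpart on the second hop.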

Management of Updates
[w/ Mork, Gribble]

[Figure: peer network (UW, Stanford, DBLP, CiteSeer, Berlin) with the queries Q, Q', Q''.]

Problem: when updates are generated, we don't know who will use them.

Solution:
• Represent updates as first-class citizens.
• Complement with boosters.
• Rules for usage.

Other Research Issues

[Figure: peer network (UW, Stanford, DBLP, CiteSeer, Berlin) with the queries Q, Q', Q''.]

• Intelligent data placement
• Management of mapping networks
• Improving networks: finding additional connections
• Indexing of views


Schema Matching/Mapping

• Given:
  • S1 and S2: a pair of schemas/DTDs/ontologies, …
  • Possibly, accompanying data instances
  • Additional domain knowledge
• Find:
  • A match between S1 and S2: a set of correspondences between the terms.
  • Ultimately, a mapping: it should enable translating data between the schemas.

Example: House Listings

[Figure: two house schemas. Elements shown: house, location, view, address, front, back, num-baths, full-baths, half-baths, Water view, Lake, Mountains. Labels: 1-1 mapping; non 1-1 mapping?]


Motivations

• Heart of any data sharing architecture:
  • Virtual, warehouse, messaging, web services, semantic web
  • Translation of legacy data, EAI, …
• Key operator in model management:
  • Algebra for manipulating models of data
  • See [Bernstein, CIDR-03], Melnik et al. [SIGMOD 03].
• Currently, a bottleneck. Done mostly by hand.


Approaches to Matching

• Matching is hard because the schema does not fully capture the semantics.
• Many techniques proposed. They consider similarities in:
  • Attribute names (synonyms)
  • Data values, data types
  • Relationships between columns
  • Structural similarities
  • Anything a human expert would try!
• Hence, let's try to simulate a human.
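
A minimal sketch combining two of the signals listed above, attribute names and data types, using difflib from the Python standard library. It is a toy baseline, not one of the matchers discussed in the talk. The schema encoding (attribute name to type string) and the 0.6 threshold are assumptions.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """String similarity in [0, 1] between two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_attributes(schema1, schema2, threshold=0.6):
    """Pair each attribute of schema1 with its most similar one in schema2.

    Schemas are dicts of attribute name -> type. A pair is kept only if the
    types agree and the name similarity reaches the threshold."""
    matches = {}
    for a1, t1 in schema1.items():
        candidates = [
            (name_similarity(a1, a2), a2)
            for a2, t2 in schema2.items()
            if t1 == t2
        ]
        if candidates:
            score, best = max(candidates)
            if score >= threshold:
                matches[a1] = best
    return matches
```

A name-only matcher misses synonyms and non 1-1 cases such as num-baths versus full-baths plus half-baths, which is part of why a combination of techniques is needed.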


Philosophy of Solutions

• Effective schema matching requires a principled combination of techniques.
• Like human experts, the matcher should improve over time:
  • Learn from seeing many schemas and matches.
• LSD [Doan, Ph.D 2002, U. of Illinois]
• COMA [Do et al.]


Corpus Based Solution
[Madhavan, Bernstein, Chen, Halevy, Shenoy]

• Collect a corpus of schemas and matches.
• Learn from the corpus:
  • Create a classifier for every corpus element.
  • Use multi-strategy learning.
• Given S1 and S2:
  • Compare each schema element to corpus elements.
  • If two elements' similarity vectors are close, then maybe they match each other.
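
A rough sketch of the similarity-vector step. Each schema element is scored against every corpus element, and two elements are proposed as a match when their score vectors are close under cosine similarity. A plain string ratio stands in for the learned per-element classifiers, and the corpus contents, the use of cosine, and the 0.9 threshold are all assumptions made for illustration.

```python
import math
from difflib import SequenceMatcher

# Hypothetical corpus element names; the talk does not list corpus contents.
CORPUS = ["address", "price", "bathrooms", "city"]

def similarity_vector(element, corpus=CORPUS):
    """Score an element against every corpus element.

    A string ratio is a stand-in for the per-element learned classifiers."""
    return [SequenceMatcher(None, element.lower(), c).ratio() for c in corpus]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def maybe_match(e1, e2, threshold=0.9):
    """Propose a match when the two similarity vectors are close."""
    return cosine(similarity_vector(e1), similarity_vector(e2)) >= threshold
```

The point of the indirection is that two elements with unrelated names can still be matched if they behave alike with respect to the corpus.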


Learning from Corpus vs. Learning from the Schemas

[Figure: chart for the Shipping Domain. x-axis: schema pairs (P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b); y-axis: recall, from 0 to 1; series: MKB and BASIC.]


Finding Different Matches

[Figure: chart for the Shipping Domain. x-axis: schema pairs (P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b); y-axis: average number of matches, from -15 to 15; series: Extra Matches and Missed Matches.]


Other Corpus Based Tools

• Conjecture: a corpus of schemas can be the basis for many useful tools.
• Auto-complete:
  • I start creating a schema (or show sample data), and the tool suggests a completion.
• Query reformulation:
  • I ask a query using my terminology, and it gets reformulated appropriately.
• Improving structured queries over structured web sites (and focused crawling, a la BINGO!)


The Corpus

• Contents:
  • Schemas, ontologies, meta-data, data, queries.
• Sample statistics:
  • How often does a word appear as a relation name?
  • When it does, what tend to be the attribute names?
  • What other tables are there? What are the foreign keys?
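
A small sketch of the first two statistics, computed with collections.Counter over a made-up corpus. The corpus layout (one dict per schema, relation name to attribute list) and the helper names are hypothetical. The second function shows how such counts could drive a schema auto-complete suggestion.

```python
from collections import Counter, defaultdict

# Hypothetical corpus: each schema maps relation names to attribute lists.
CORPUS = [
    {"book": ["title", "author", "isbn"], "order": ["id", "isbn", "date"]},
    {"book": ["title", "isbn", "price"]},
    {"author": ["name", "email"], "book": ["title", "author"]},
]

def relation_statistics(corpus=CORPUS):
    """Count relation names, and attribute names per relation name."""
    relation_counts = Counter()
    attribute_counts = defaultdict(Counter)
    for schema in corpus:
        for relation, attributes in schema.items():
            relation_counts[relation] += 1
            attribute_counts[relation].update(attributes)
    return relation_counts, attribute_counts

def suggest_attributes(relation, k=1, corpus=CORPUS):
    """Suggest the k attributes most often seen with this relation name."""
    _, attribute_counts = relation_statistics(corpus)
    return [attr for attr, _ in attribute_counts[relation].most_common(k)]
```

The same counting pattern extends to co-occurring tables and foreign keys if the corpus records them.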


Conclusion: Crossing the Structure Chasm

• Data authoring, querying and sharing is everywhere; done by novices too.
• Semantic web: the extreme example.

[Figure labels: corpus of schemas; schema mapping.]


Some References

• http://www.cs.washington.edu/homes/alon
• Piazza: WebDB01, ICDE03, WWW03
• The Structure Chasm: CIDR-03
• Mediation surveys:
  • VLDB Journal 01
  • Lenzerini, PODS 02 tutorial
• Schema matching:
  • Rahm and Bernstein, VLDB Journal 01
