crossing the structure chasm alon halevy university of washington, seattle ucla, april 15, 2004

Crossing the Structure Chasm

Alon Halevy

University of Washington, Seattle

UCLA, April 15, 2004

The Structure Chasm

Authoring Creating a schema

Writing text

Querying keywordsUsing someone else’s schema

Data sharing Easy Committees, standardsBut we can pose

complex queries

Why is This a Problem?Databases used to be isolated and administered only by experts.Today’s applications call for large-scale data sharing: Big science (bio-medicine, astrophysics, …) Government agencies Large corporations The web (over 100,000 searchable data sources)

The vision: Content authoring by anyone, anywhere Powerful database-style querying Use relevant data from anywhere to answer the query The Semantic Web

Fundamental problem: reconciling different models of the world.

OutlineTwo motivating scenarios: A web of structured data Personal data management

A tour of recent data sharing architectures Data integration systems Peer-data management systems

The algorithmic problems: Query reformulation Reconciling semantic heterogeneity

Reconsidering authoring and querying challenges

Large-Scale Scientific Data Sharing

UW

UW Microbiology

UCLA GeneticsUW Genome Sciences

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

OMIMHUGO

Swiss-Prot

GeneClinics

Non-urgent Applications

UW

California IRS

Employer Tax Reports

1040 DBIRS

Fidelity

County real-estate DB

B of A

NY IRS

Personal Data Management

HTMLMail &

calendar

Cites

Event

Message

Document

Web Page

Presentation

Cached

SoftcopySoftcopySender,

Recipients

Organizer, Participants

Person

Paper

Author

Homepage

Author

Data is organized by application

[Semex: Sigurdsson, Nemes, H.]

Papers Files Presentations

Finding Publications

Person: A. HalevyPerson: Dan SuciuPerson: Maya RodrigPerson: Steven GribblePerson: Zachary Ives

Publication: What Can Peer-to-Peer Do for Databases, and Vice Versa

Publication

Bernstein

Following Associations (1)

“A survey of approaches to automatic schema matching”

“Corpus-based schema matching”

“Database management for peer-to-peer computing: A vision”

“Matching schemas by learning from others”

“A survey of approaches to automatic schema matching”

“Corpus-based schema matching”

“Database management for peer-to-peer computing: A vision”

“Matching schemas by learning from others”

Publication

Bernstein


Publication

Bernstein

Cited by

Publication

Citations


Cited Authors

Bernstein

Publication


PIM Data Sharing Challenges

Need to combine data from multiple applications/ sources.

After initial set of concepts are given, extend and personalize concept hierarchy,share (parts) of our data with others, incorporate external data into our view.

Need also Instance level reconciliation: Alon Halevy, A. Halevy, Alon Y. Levy – same guy!



The algorithmic problems: Query reformulation Reconciling semantic heterogeneity


Data Integration

Goal: provide a uniform interface to a set of autonomous data sources.New abstraction layer over multiple sources. Many research projects (DB & AI) Mine: Information Manifold, Tukwila, BioMediator Cal: Garlic (IBM), Ariadne (USC), XMAS (UCSD),…

Recent “Enterprise Information Integration” industry: Startups: Nimble, Enosys, Composite, MetaMatrix Products from big players: BEA, IBM

Relational Abstraction LayerSchema: the template for data.

Queries:

SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … …

SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 …

Students: Takes:

CID Name Quarter CSE444 Databases fall CSE541 Operating systems winter

Courses:

SELECT C.nameFROM Students S, Takes T, Courses CWHERE S.name=“Mary” and S.ssn = T.ssn and T.cid = C.cid

SELECT C.nameFROM Students S, Takes T, Courses CWHERE S.name=“Mary” and S.ssn = T.ssn and T.cid = C.cid

Data Integration: Higher-level Abstraction

Mediated Schema

Q

Q1 Q2 Q3SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … …









… …

Semantic mappings

Mediated Schema

OMIMSwiss-Prot

HUGO GO

Gene-Clinics

EntrezLocus-Link

GEO

Entity

Sequenceable Entity

GenePhenotypeStructured Vocabulary

Experiment

ProteinNucleotide Sequence

Microarray Experiment

Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?

www.biomediator.orgTarczy-Hornoch, MorkTarczy-Hornoch, Mork

http://www.biomediator.org/

Semantic Mappings

BooksAndMusicTitleAuthorPublisherItemIDItemTypeSuggestedPriceCategoriesKeywords

Books TitleISBNPriceDiscountPriceEdition

CDs AlbumASINPriceDiscountPriceStudio

BookCategoriesISBNCategory

CDCategoriesASINCategory

ArtistsASINArtistNameGroupName

AuthorsISBNFirstNameLastName

Inventory Database A

Inventory Database B

Differences in: Names in schema Attribute grouping

Coverage of databases Granularity and format of attributes

Key Issues

Mediated Schema

Q

Q’ Q’ Q’SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … …









… …

Formalism for mappings Reformulation algorithms

How will we create them?

Beyond Data Integration

Mediated schema is a bottleneck for large-scale data sharing

It’s hard to create, maintain, and agree upon.

Peer Data Management Systems

UW

Stanford

DBLP

UCLA UCSD

CiteSeer

UC BerkeleyQ

Q1

Q2Q6

Q5

Q4

Q3Mappings specified locally

Map to most convenient nodes

Queries answered by traversing semantic paths.

Piazza: [Tatarinov, H., Ives, Suciu, Mork]

PDMS-Related Projects

Hyperion (Toronto)

PeerDB (Singapore)

Local relational models (Trento)

Edutella (Hannover, Germany)

Semantic Gossiping (EPFL Zurich)

Raccoon (UC Irvine)

Orchestra (U. Penn)

A Few Comments about Commerce

Until 5 years ago: Data integration = Data warehousing.

Since then: A wave of startups:

Nimble, MetaMatrix, Calixa, Composite, Enosys Big guys made announcements (IBM, BEA). [Delay] Big guys released products.

Success: analysts have new buzzword – EII New addition to acronym soup (with EAI).

Lessons: Performance was fine. Need management tools.

Data Integration: Before

Mediated Schema

SourceSource Source Source Source

Q

Q’ Q’ Q’ Q’ Q’

XML Query

User Applications

Lens™ File InfoBrowser™Software

Developers Kit

NIMBLE™ APIs

Front-End

XML

Lens Builder™Lens Builder™

Management Tools

Management Tools

Integration Builder

Integration Builder

Security T

ools

Data Administrator

Data Administrator

Data Integration: After

Concordance Developer

Concordance Developer

Integration

Layer

Nimble Integration Engine™

Compiler Executor

MetadataServerCache

Relational Data Warehouse/ Mart

Legacy Flat File Web Pages

Common XML View

Sound Business Models

Explosion of intranet and extranet information80% of corporate information is unmanagedBy 2004 30X more enterprise data than 1999The average company: maintains 49 distinct

enterprise applications spends 35% of total IT

budget on integration-related efforts

1995 1997 1999 2001 2003 2005

Enterprise Information

Source: Gartner, 1999



The algorithmic problems Query reformulation Reconciling semantic heterogeneity


Languages for Schema Mapping

Mediated Schema

SourceSource Source Source Source

Q

Q’ Q’ Q’ Q’ Q’

GAV LAV GLAV

GLAV Mappings

Book: ISBN, Title, Genre, Year

R1aR1b R2 R3 R4

Author: ISBN, Name

R1a(isbn, title,n), R1b(isbn, genre,n) Book(isbn, title, genre, year), Author(isbn, n), year < 1970

Books before 1970

R5

Query Reformulation

Book: ISBN, Title, Genre, Year

R1 R2 R3 R4 R5

Author: ISBN, Name

Books before 1970 Humor books

Query: Find authors of humor books

Plan: R1 Join R5

R5(x,y) :- Book(x,y,”Humor”)

Answering Queries Using Views

Formal Problem: can we use previously answered queries to answer a new query? Challenge: need to invert query expression.

Results depend on: Query language used for sources and queries, Open-world vs. Closed-world assumption Allowable access patterns to the sources MiniCon [Pottinger and H., 2001]: scales to

thousands of sources.

Every commercial DBMS implements some version of answering queries using views.

Some Open Research Issues Managing large networks of mappings:

• Consistency• Trust

Improving networks: finding additional mappings

Indexing:Heterogeneous data across the networkCaching:Where? What?

UW

Stanford

DBLP

UCLA UCSD

CiteSeer

UC Berkeley



The algorithmic problems Query reformulation Reconciling semantic heterogeneity


Semantic Mappings

BooksAndMusicTitleAuthorPublisherItemIDItemTypeSuggestedPriceCategoriesKeywords

Books TitleISBNPriceDiscountPriceEdition

CDs AlbumASINPriceDiscountPriceStudio

BookCategoriesISBNCategory

CDCategoriesASINCategory

ArtistsASINArtistNameGroupName

AuthorsISBNFirstNameLastName

Inventory Database A

Inventory Database B

Need mappings in every data sharing architecture

“Standards are great, but there are too many.”

Why is it so Hard?Schemas never fully capture their intended meaning:We need to leverage any additional information

we may have.

A human will always be in the loop.Goal is to improve designer’s productivity.Solution must be extensible.

Two cases for schema matching:Find a map to a common mediated schema.Find a direct mapping between two schemas.

Typical Matching HeuristicsWe build a model for every element from multiple sources of evidences in the schemas Schema element names

BooksAndCDs/Categories ~ BookCategories/Category Descriptions and documentation

ItemID: unique identifier for a book or a CD ISBN: unique identifier for any book

Data types, data instances DateTime Integer, addresses have similar formats

Schema structure All books have similar attributes

Models consider only the two schemas.

In isolation, techniques are incomplete or brittle:Need principled combination.

Using Past ExperienceMatching tasks are often repetitive Humans improve over time at matching. A matching system should improve too!

LSD: Learns to recognize elements of mediated schema. [Doan, Domingos, H., SIGMOD-01, MLJ-03]

Doan: 2003 ACM Distinguished Dissertation Award.

Mediated Schema

data sources

Mediated Schema

listed-price $250,000 $110,000 ...

address price agent-phone description

Example: Matching Real-Estate Sources

location Miami, FL Boston, MA ...

phone(305) 729 0831(617) 253 1429 ...

commentsFantastic houseGreat location ...

realestate.com

location listed-price phone comments

Schema of realestate.com

If “fantastic” & “great”

occur frequently in data values =>

description

Learned hypotheses

price $550,000 $320,000 ...

contact-phone(278) 345 7215(617) 335 2315 ...

extra-infoBeautiful yardGreat beach ...

homes.com

If “phone” occurs in the name =>

agent-phone

Mediated schema

Learning Source Descriptions

We learn a classifier for each element of the mediated schema.Training examples are provided by the given mappings.Multi-strategy learning:Base learners: name, instance, descriptionCombine using stacking.

Accuracy of 70-90% in experiments.Learning about the mediated schema.

Corpus-Based Schema Matching[Madhavan, Doan, Bernstein, H.]

Can we use previous experience to match two new schemas?

Learn about a domain?

CDs Categories Artists

Items

Artists

Authors Books

Music

Information

Litreture

Publisher

Authors

Corpus of Schemas and MatchesCorpus of Schemas and MatchesReuse extracted knowledgeto match new schemas

Learn general purpose knowledge

Classifier for every corpus element

Exploiting The Corpus

Given an element s S and t T, how do we determine if s and t are similar?

The PIVOT Method: Elements are similar if they are similar to the same

corpus concepts

The AUGMENT Method: Enrich the knowledge about an element by

exploiting similar elements in the corpus.

Pivot: measuring (dis)agreement

Pk= Probability (s ~ ck )

Interpretation I(s) = element s Schema S

Compute interpretations w.r.t. corpus

# concepts in corpus

Similarity(I(s), I(t))

I(s)

I(t)s t

S T

Interpretation captures how similar an element is to each corpus concept Compared using cosine distance.

Augmenting element models

Search similar corpus concepts Pick the most similar ones from the interpretation

Build augmented models Robust since more training data to learn from

Compare elements using the augmented models

s

S

Schema

ElementModel

Name:Instances:Type:…

M’s

Search similar corpus concepts

Build augmented models

s e ff

e

Corpus of known schemas and mappings

Experimental Results

Five domains: Auto and real estate: webforms Invsmall and inventory: relational schemas Nameaddr: real xml schemas

Performance measure: F-Measure:

Precision and recall are measured in terms of the matches predicted.

recallprecision

recallprecisionf

+=

**2

Comparison over domains

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

auto real estate invsmall inventory nameaddr

Average FMeasure

direct augment pivot

Corpus based techniques perform better in all the domains

“Tough” schema pairs

Significant improvement in difficult to match schema pairs

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

auto real estate invsmall inventory nameaddr

Average F-Measure


Mixed corpus

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

auto + re + invsmall difficult auto + invsmall

Average F-Measure


Corpus with schemas from different domains can also be useful

Other Corpus Based ToolsA corpus of schemas can be the basis for many useful tools: Mirror the success of corpora in IR and NLP?

Back to the structure chasm: Authoring and querying.

Auto-complete: I start creating a schema (or show sample data),

and the tool suggests a completion.

Formulating queries on new databases: I ask a query using my terminology, and it gets

reformulated appropriately.

ConclusionVision: data authoring, querying and sharing by everyone, everywhere.

Need to make it easier to enjoy the benefits of structured data.

Challenge: reconciling semantic heterogeneity

CorpusOf

schemas

schemamapping

Some References

www.cs.washington.edu/homes/alonPiazza: ICDE03, WWW03, VLDB-03The Structure Chasm: CIDR-03Surveys on schema matching languages: Halevy, VLDB Journal 01 Lenzerini, PODS 2002

Semi-automatic schema matching: Rahm and Bernstein, VLDB Journal 01.

Teaching integration to undergraduates: SIGMOD Record, September, 2003.

crossing the structure chasm alon halevy university of washington, seattle ucla, april 15, 2004

Documents

external data

largescale data sharing

data integration goal

searchable data sources

pim data sharing challenges

ny irs slide

halevy person

peer computing