
Post on 22-Dec-2015


Source Discovery and Schema Mapping for Data Integration

Li Xu

Brigham Young University

Funded by NSF, BYU Data Extraction Group

Data Integration

Find houses with four bedrooms priced under $200,000

[Figure: a Mediator with a global schema sits above wrappers for three sources (homes.com, realestate.com, homeseekers.com), each with its own source schema (source schema 1, 2, 3).]
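The architecture on this slide can be sketched as follows. This is only an illustration, assuming each wrapper translates its source's records into the global schema before the mediator filters them. The `Wrapper` and `Mediator` classes, the field names (`bedrooms`, `price`), and the sample records are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

House = Dict[str, object]  # a record in the (hypothetical) global schema


@dataclass
class Wrapper:
    """Translates one source's records into the global schema."""
    name: str
    fetch: Callable[[], Iterable[dict]]   # returns records in the source schema
    to_global: Callable[[dict], House]    # source schema -> global schema

    def records(self) -> List[House]:
        return [self.to_global(r) for r in self.fetch()]


class Mediator:
    """Answers a query over the global schema by asking every wrapper."""

    def __init__(self, wrappers: List[Wrapper]):
        self.wrappers = wrappers

    def query(self, predicate: Callable[[House], bool]) -> List[House]:
        results = []
        for w in self.wrappers:
            for rec in w.records():
                if predicate(rec):
                    results.append({**rec, "source": w.name})
        return results


# Made-up sample data for two of the sources named on the slide.
homes = Wrapper(
    "homes.com",
    fetch=lambda: [{"beds": 4, "cost": 185000}, {"beds": 3, "cost": 150000}],
    to_global=lambda r: {"bedrooms": r["beds"], "price": r["cost"]},
)
homeseekers = Wrapper(
    "homeseekers.com",
    fetch=lambda: [{"BR": "4", "asking": "$210,000"}],
    to_global=lambda r: {
        "bedrooms": int(r["BR"]),
        "price": int(r["asking"].replace("$", "").replace(",", "")),
    },
)

mediator = Mediator([homes, homeseekers])
# "Find houses with four bedrooms priced under $200,000"
answers = mediator.query(lambda h: h["bedrooms"] == 4 and h["price"] < 200000)
```

Each source keeps its own field names; only the `to_global` function of a wrapper needs to know them.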

Problems

• How to Recognize Applicable Information Sources for an Application?

• How to Specify Mappings between the Source Schemas and the Global Schema?

• How to Reformulate User Queries?

• How to Merge Data from Heterogeneous Sources?

• …

Recognizing Ontology-Applicable HTML Documents

Application Ontology

How to specify an application?

Applicable HTML Documents

• Multiple-Record Documents

• Single-Record Documents

• HTML Forms

How to distinguish an applicable HTML document?

Multiple-Record Documents

Document 1: Car Ads

Document 2: Items for Sale or Rent

Single-Record Document

HTML Forms

Information hidden under the HTML form

Recognition Heuristics

• h1+: Densities

• h2: Expected Values

• h3: Grouping

How to measure the applicability of an HTML document for an application?

h1+: Densities

Document 1: Car Ads

Document 2: Items for Sale or Rent
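The slide does not define how a density is computed. One plausible reading, assumed here, is the fraction of a document's characters that are covered by matches of the ontology's recognizers. The regular expressions and sample strings below are toy stand-ins, not the recognizers used in the talk.

```python
import re
from typing import Dict

# Hypothetical recognizers; a real application ontology would supply these.
RECOGNIZERS: Dict[str, str] = {
    "Year": r"\b(?:19|20)\d{2}\b",
    "Price": r"\$\d+(?:,\d{3})*",
    "PhoneNr": r"\b\d{3}-\d{4}\b",
}


def density(text: str, recognizers: Dict[str, str] = RECOGNIZERS) -> float:
    """Fraction of characters covered by at least one recognizer match."""
    if not text:
        return 0.0
    covered = [False] * len(text)
    for pattern in recognizers.values():
        for m in re.finditer(pattern, text):
            for i in range(m.start(), m.end()):
                covered[i] = True
    return sum(covered) / len(text)


car_ad = "1998 Honda Accord, $4,500, call 555-1234"
other = "Garage sale this Saturday, everything must go"
```

Under this assumption, `density(car_ad)` is well above `density(other)`, which is zero because no recognizer matches.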

h2: Expected Values

Document 1: Car Ads

Year: 3, Make: 2, Model: 3, Mileage: 1, Price: 1, Feature: 15, PhoneNr: 3

Document 2: Items for Sale or Rent

Year: 1, Make: 0, Model: 0, Mileage: 1, Price: 0, Feature: 0, PhoneNr: 4

<Year:0.98, Make:0.93, Model:0.91, Mileage:0.45, Price:0.80, Feature:2.10, PhoneNr:1.15>
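The slide does not say how a document's counts are compared with the vector of expected values. One common choice, assumed here purely for illustration, is cosine similarity between the two vectors. The numbers are the ones shown on the slide; the `cosine` helper is hypothetical.

```python
import math
from typing import Sequence

OBJECT_SETS = ["Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr"]

expected = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]  # vector from the slide
doc1_car_ads = [3, 2, 3, 1, 1, 15, 3]                  # counts for Document 1
doc2_items = [1, 0, 0, 1, 0, 0, 4]                     # counts for Document 2


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


sim1 = cosine(expected, doc1_car_ads)
sim2 = cosine(expected, doc2_items)
```

With this measure the car-ads document scores roughly 0.90 and the items-for-sale document roughly 0.47, so the first looks much closer to what the ontology expects.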

h3: Grouping (of 1-Max Object Sets)

Document 1: Car Ads

Year, Make, Model, Price, Year, Model, Year, Make, Model, Mileage, …

[Figure: braces mark three groups in this sequence.]

Document 2: Items for Sale or Rent

Year, Mileage, …, Mileage, Year, Price, Price, …

[Figure: braces mark two groups in this sequence.]
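A grouping measure along these lines can be sketched as follows. The sketch assumes that the sequence of recognized 1-max object sets is cut into consecutive groups of a fixed size and that a group scores higher the more distinct object sets it contains. The group size of four and the scoring formula are assumptions, not details given on the slide.

```python
from typing import Sequence


def grouping_score(labels: Sequence[str], group_size: int) -> float:
    """Average fraction of distinct labels per complete group."""
    groups = [
        labels[i:i + group_size]
        for i in range(0, len(labels) - group_size + 1, group_size)
    ]
    if not groups:
        return 0.0
    distinct = sum(len(set(g)) for g in groups)
    return distinct / (len(groups) * group_size)


# First eight labels of Document 1 as listed on the slide.
doc1_prefix = ["Year", "Make", "Model", "Price", "Year", "Model", "Year", "Make"]
# Made-up contrast sequence with heavy repetition (not from the slide).
repetitive = ["Year", "Year", "Price", "Price"]

score_doc1 = grouping_score(doc1_prefix, 4)    # (4 + 3) / 8
score_rep = grouping_score(repetitive, 4)      # 2 / 4
```

A document whose records each mention every object set once scores close to 1, while a document that repeats the same few object sets scores lower.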

Classification Problem

• Subtasks
– Multiple Records
– Singleton Record
– Application Form

• Learning Algorithm: Decision Tree C4.5
– (h1+0, h1+1, …, h2, h3, Positive)
– (h1+0, h1+1, …, h2, h3, Negative)

How to construct recognition rules for an application?

Experiments: Car Ads and Obituaries

• Training Sets
– Car Ads (Yes | No)
• 143 | 363
• 614 | 636
• 50 | 69
– Obituaries (Yes | No)
• 68 | 135
• 50 | 69
• 62 | 135

• Test Sets
– Car Ads (40 | 40): Precision 95%, Recall 98%, F-measure 96%
– Obituaries (40 | 40): Precision 95%, Recall 95%, F-measure 95%

Link Analysis

Form Filling

Form Filling (Cont.)

Incorrect Positive Response

Motorcycle

[Figure labels: Year, Make, Price, Mileage, PhoneNr, Feature.]

Historical Figure

[Figure labels: Deceased Name, Death Date, Birth Date, Age, Relationship, Relative Name.]

Automating Schema Mapping for Data Integration

Schema Mapping

[Figure: source and target schema graphs, each rooted at Car. Node labels: Car, Year, Cost, Style, Feature, Phone, Miles, Mileage, Make, Model, Make & Model, Color, Body Type.]

Schema Mapping for Populated Schemas

• Central Idea: Exploit All Data & Metadata

• Matching Possibilities (Facets)
– Attribute Names
– Data-Value Characteristics
– Expected Data Values
– Data-Dictionary Information
– Structural Properties

The Approach

• Input:
– Two Graphs, S and T
– Data Instances for S and T
– Lightweight Domain Ontology

• Output:
– A Source-to-Target Mapping between S and T
• Should enable translating data instances from S to T.
– Direct and Many Indirect Matches
• (t, s)
• (t, s’ <= )

• Framework
– Individual Facet Matching
– Combination of Individual Matchers

Attribute Names

• Target and Source Attributes
– T : A
– S : B

• WordNet

• C4.5 Decision Tree: feature selection, trained on schemas in DB books
– f0: same word
– f1: synonym
– f2: sum of distances to a common hypernym root
– f3: number of different common hypernym roots
– f4: sum of the number of senses of A and B
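The first three features above can be illustrated with a small sketch. It uses a hand-made synonym list and hypernym table as stand-ins for WordNet, and it takes f2 to be the smallest sum of distances from A and B to any shared hypernym. Both the toy data and that reading of f2 are assumptions. f3 and f4 are left out because they need real WordNet sense data.

```python
from typing import Dict, List, Optional, Set

# Toy stand-ins for WordNet (hypothetical data, not real WordNet output).
SYNONYMS: List[Set[str]] = [{"cost", "price"}, {"car", "auto", "automobile"}]
HYPERNYM: Dict[str, str] = {  # child -> parent
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "artifact",
    "phone": "artifact",
}


def ancestors(word: str) -> Dict[str, int]:
    """Map the word and each of its hypernyms to its distance from the word."""
    dist = {word: 0}
    steps = 0
    while word in HYPERNYM:
        word = HYPERNYM[word]
        steps += 1
        dist[word] = steps
    return dist


def name_features(a: str, b: str) -> Dict[str, object]:
    """Compute f0 (same word), f1 (synonym), f2 (hypernym distance sum)."""
    a, b = a.lower(), b.lower()
    f0 = a == b
    f1 = any(a in s and b in s for s in SYNONYMS)
    da, db = ancestors(a), ancestors(b)
    common = set(da) & set(db)
    f2: Optional[int] = min((da[w] + db[w] for w in common), default=None)
    return {"f0": f0, "f1": f1, "f2": f2}


car_truck = name_features("Car", "truck")
cost_price = name_features("Cost", "price")
```

Feature vectors like these, labeled as match or non-match, are the kind of input a decision-tree learner such as C4.5 would be trained on.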

WordNet Rule

[Figure: decision tree whose tests include the number of different common hypernym roots of A and B, the sum of distances of A and B to a common hypernym, and the sum of the number of senses of A and B.]

Data-Value Characteristics

• C4.5 Decision Tree

• Features
– Numeric data (Mean, variation, standard deviation, …)
– Alphanumeric data (String length, numeric ratio, space ratio)
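A minimal sketch of how such features might be computed for one attribute's values. It assumes "variation" means population variance, "numeric ratio" means the share of digit characters, and "space ratio" means the share of whitespace characters. These definitions are guesses, since the slide only names the features.

```python
import statistics
from typing import Dict, List


def numeric_features(values: List[float]) -> Dict[str, float]:
    """Summary statistics for a non-empty column of numeric values."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
    }


def alphanumeric_features(values: List[str]) -> Dict[str, float]:
    """Length and character-class ratios for a column of string values."""
    total_chars = sum(len(v) for v in values)
    if total_chars == 0:
        return {"avg_length": 0.0, "numeric_ratio": 0.0, "space_ratio": 0.0}
    digits = sum(ch.isdigit() for v in values for ch in v)
    spaces = sum(ch.isspace() for v in values for ch in v)
    return {
        "avg_length": total_chars / len(values),
        "numeric_ratio": digits / total_chars,
        "space_ratio": spaces / total_chars,
    }


num = numeric_features([2, 4, 4, 4, 5, 5, 7, 9])
alpha = alphanumeric_features(["F150", "A4"])
```

Two attributes whose value sets yield similar feature vectors are candidates for a match even when their names differ.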

Expected Data Values

• Concepts and Relationships

• Data Recognizers
– CarMake
• “ford”
• “honda”
• …
– CarModel
• “accord”
• “mustang”
• “taurus”
• …

Target: Make & Model (CarMake . CarModel)
Ford Mustang, Ford Taurus, Ford F150, …

Source: Brand (CarMake)
Acura, Audi, BMW, …

Source: Model (CarModel)
Legend, Mustang, A4, …
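The idea on this slide can be sketched as follows: data recognizers label the values of each attribute, and an attribute whose values are recognized as a make followed by a model suggests an indirect match to the pair of source attributes. The vocabularies, the 0.8 threshold, and the function names are hypothetical.

```python
from typing import List, Optional

# Tiny, hypothetical recognizer vocabularies built from the slide's examples.
CAR_MAKES = {"ford", "honda", "acura", "audi", "bmw"}
CAR_MODELS = {"accord", "mustang", "taurus", "f150", "legend", "a4"}
CONCEPTS = ("CarMake", "CarModel", "CarMake . CarModel")


def recognize(value: str) -> Optional[str]:
    """Label one value as CarMake, CarModel, 'CarMake . CarModel', or None."""
    tokens = value.lower().split()
    if len(tokens) == 1:
        if tokens[0] in CAR_MAKES:
            return "CarMake"
        if tokens[0] in CAR_MODELS:
            return "CarModel"
    if len(tokens) == 2 and tokens[0] in CAR_MAKES and tokens[1] in CAR_MODELS:
        return "CarMake . CarModel"
    return None


def column_concept(values: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return the concept recognized in at least `threshold` of the values."""
    if not values:
        return None
    labels = [recognize(v) for v in values]
    for concept in CONCEPTS:
        if labels.count(concept) / len(values) >= threshold:
            return concept
    return None


target_make_model = ["Ford Mustang", "Ford Taurus", "Ford F150"]
source_brand = ["Acura", "Audi", "BMW"]
source_model = ["Legend", "Mustang", "A4"]
```

Here the target attribute is recognized as `CarMake . CarModel` while the two source attributes are recognized as `CarMake` and `CarModel`, which points to matching Make & Model with the concatenation of Brand and Model.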

Structure Matching

[Figure: target and source schema graphs. Node labels: House, MLS, Bedrooms, Golf course, Water front, Agent, Name, Fax, Address, Street, City, State; Basic_features, beds, SQFT, MLS, agent, location_description, name, fax, phone, location, Address.]

{House, MLS} vs. {MLS}

[Figure: target and source schema graphs. Node labels: House, MLS, Bedrooms, Golf course, Water front, Address, Street, City, State; Basic_features, beds, SQFT, MLS, location_description, location.]

{House, MLS} vs. {MLS}

[Figure: target and source schema graphs. Node labels: House, MLS, Bedrooms, Golf course, Water front, Address, Street, City, State; Basic_features, beds, SQFT, MLS, location_description, location. Primed nodes: House’, Address1’.]

{House, MLS} vs. {MLS}

[Figure: target and source schema graphs. Node labels: House, MLS, Bedrooms, Golf course, Water front, Address, Street, City, State; Basic_features, beds, SQFT, MLS, location_description, location. Primed nodes: House’, Golf course’, Water front’, Address1’, Street1’, City1’, State1’.]

{Agent} vs. {agent}

[Figure: target and source schema graphs. Node labels: Agent, Name, Fax, Address, Street, City, State; agent, name, fax, phone, address.]

{Agent} vs. {agent}

[Figure: target and source schema graphs. Node labels: Agent, Name, Fax, Address, Street, City, State; agent, name, fax, phone, address. Primed nodes: Address2’, Street2’, City2’, State2’.]

Inter-Relationship Set

[Figure: target and source schema graphs. Node labels: House, MLS, Bedrooms, Agent, Golf course, Water front, Name, Fax, Address, Street, City, State; MLS, agent. Primed node: House’.]

Example: Source-To-Target Mapping

[Figure: schema graph with node labels House’, Golf course’, Water front’, MLS, beds, agent, name, fax, Address1’, Address2’, Address’, Street’, City’, State’.]

Target-based Integration and Query System (TIQS)

• Definition: I = (T, {Si}, {Mi})

• Phases
– Design (Source-to-Target Mappings {Mi})
– Query Processing (Rule Unfolding)

Query Reformulation

• Query
– House-Bedrooms(x, 4) :- House-Bedrooms(x, 4), House-Golf_course(x, “Yes”), House-Water_front(x, “Yes”)

[Figure: schema graph with node labels House’, Golf course’, Water front’, MLS, beds, agent, name, fax, Address1’, Address2’, Address’, Street’, City’, State’.]
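Rule unfolding for a query like the one above can be sketched as a substitution: each target predicate in the query body is replaced by the source atoms that the mapping defines for it. The mapping rules and the source predicate names (`src.beds`, `src.golf_course`, `src.water_front`) are hypothetical. The talk does not give the actual rules.

```python
from typing import Dict, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # (predicate name, arguments)

# Hypothetical source-to-target mapping rules: target predicate -> source atoms.
# "$0", "$1", ... stand for the arguments of the target atom being unfolded.
MAPPING: Dict[str, List[Atom]] = {
    "House-Bedrooms": [("src.beds", ("$0", "$1"))],
    "House-Golf_course": [("src.golf_course", ("$0", "$1"))],
    "House-Water_front": [("src.water_front", ("$0", "$1"))],
}


def unfold(body: List[Atom], mapping: Dict[str, List[Atom]]) -> List[Atom]:
    """Rewrite a query body over the target schema into source atoms."""
    result: List[Atom] = []
    for pred, args in body:
        if pred not in mapping:
            raise KeyError(f"no mapping rule for target predicate {pred!r}")
        for src_pred, src_args in mapping[pred]:
            # Replace each "$i" placeholder with the i-th query argument.
            bound = tuple(
                args[int(a[1:])] if a.startswith("$") else a for a in src_args
            )
            result.append((src_pred, bound))
    return result


query_body: List[Atom] = [
    ("House-Bedrooms", ("x", "4")),
    ("House-Golf_course", ("x", '"Yes"')),
    ("House-Water_front", ("x", '"Yes"')),
]
unfolded = unfold(query_body, MAPPING)
```

Because the mapping is written once per source at design time, answering a new query only requires this mechanical substitution.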

TIQS (Cont.)

• User Queries
– Logic Rules
– Maximal and Sound Query Answers

• Advantages
– Rule Unfolding
– Scalability

Experimental Results

Application (Number of Schemes)   Precision (%)   Recall (%)   F (%)   Number Matches   Number Correct   Number Incorrect
Faculty Member (5)                100             100          100     540              540              0
Course Schedule (5)               99              93           96      490              454              6
Real Estate (5)                   90              94           92      876              820              92

Data borrowed from Univ. of Washington [DDH, SIGMOD01]

Indirect Matches: (precision 87%, recall 94%, F-measure 90%)

Rough Comparison with U of W Results

• Course Schedule – Accuracy: ~71%

• Real Estate (2 tests) – Accuracy: ~75%

• Faculty Member – Accuracy: ~92%
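The percentages in the results table are consistent with the usual definitions, assuming precision is Number Correct divided by the sum of Number Correct and Number Incorrect, recall is Number Correct divided by Number Matches, and F is the harmonic mean of the two. The slide does not state these formulas; the sketch below simply checks that they reproduce the table.

```python
from typing import Tuple


def metrics(matches: int, correct: int, incorrect: int) -> Tuple[int, int, int]:
    """Return (precision, recall, F) as whole percentages."""
    proposed = correct + incorrect
    precision = correct / proposed if proposed else 0.0
    recall = correct / matches if matches else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * x) for x in (precision, recall, f_measure))


faculty = metrics(matches=540, correct=540, incorrect=0)
course = metrics(matches=490, correct=454, incorrect=6)
real_estate = metrics(matches=876, correct=820, incorrect=92)
```

For the three rows this gives (100, 100, 100), (99, 93, 96), and (90, 94, 92), matching the table.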

Conclusion

• A Robust and Flexible Approach to Check Applicability of HTML Documents

• A Composite Approach to Automate Schema Mapping
– Direct Matches
– Indirect Matches

• An Approach that Combines Advantages of Basic Approaches to Data Integration

Future Work

• Test More Applications and Data to Evaluate the Approaches

• Extend Training Classifiers for Applicability Checking

• Further Automate Schema Mapping

• Automate Ontology Mapping on the Semantic Web

• Automate Mapping between XML Documents

• …

Thanks!

Questions?