information integration

Information Integration on the Web (SA-2) 1Kambhampati & Knoblock

Information Integration

4/5


Information Integration on the Web

AAAI Tutorial (SA2)

Rao Kambhampati & Craig Knoblock

Monday July 22nd 2007. 9am-1pmSlides f

or Parts 1 an

d 5 are

Available in hard

copy at the

Front of the room


Information Integration• Combining information from multiple

autonomous information sources– And answering queries using the combined information

• Many Applications– WWW:

• Comparison shopping• Portals integrating data from multiple sources• B2B, electronic marketplaces• Mashups, service composion

– Science informatics• Integrating genomic data, geographic data, archaeological

data, astro-physical data etc. – Enterprise data integration

• An average company has 49 different databases and spends 35% of its IT dollars on integration efforts

Information Integration on the Web (SA-2) 5

Deployed information integration systems…

• Travel sites: Kayak, Expedia etc..• Google Base, DBPedia• Map Mashups• Libra; Citeseer; Google-squared etc

Kambhampati & Knoblock

Information Integration on the Web (MA-1) 6Kambhampati & Knoblock

Deep Web as a Motivation for II

• The surface web of crawlable pages is only a part of the overall web.

• There are many databases on the web– How many?– 25 million (or more)– Containing more than 80% of the accessible

content• Mediator-based information integration is

needed.


Services

Webpages

Structureddata

Sensors(streamingData)

~25M

Subbarao Kambhampati

There are three models, but they really are parts of the same specturm.SOme of the challenges are common, some are different.


Blind Men & the Elephant: Differing views on Information Integration

Database View• Integration of

autonomous structured data sources

• Challenges: Schema mapping, query reformulation, query processing

Web service view• Combining/

composing information provided by multiple web-sources

• Challenges: learning source descriptions; source mapping, record linkage etc.

IR/NLP view• Computing

textual entailment from the information in disparate web/text sources

• Challenges: Information Extraction


Services

Webpages

Structureddata


--Search Model----Materialize the pages--crawl &index them--include them in search results

--Mediator Model----Design a mediator--Reformulate queries--Access sources--Collate results

--Warehouse Model----Get all the data and put into a local DB--Support structured queries on the warehouse

~25M

Datasource

Datasource

Datasource

Relational database (warehouse)

User queries

Data extractionprograms

Data cleaning/scrubbing

OLAP / Decision support/Data cubes/ data mining

Datasource

wrapper

Datasource

wrapper

Datasource

wrapper

Mediator:User queries

Mediated schema

Data sourcecatalog

Reformulation engine

optimizer

Execution engineWhich data

model?




Dimensions of Variation• Conceptualization of (and approaches to)

information integration vary widely based on – Type of data sources: being integrated (text; structured;

images etc.)– Type of integration: vertical vs. horizontal vs. both– Level of up-front work: Ad hoc vs. pre-orchestrated – Control over sources: Cooperative sources vs.

Autonomous sources– Type of output: Giving answers vs. Giving pointers– Generality of Solution: Task-specific (Mashups) vs.

Task-independent (Mediator architectures)


Dimensions: Type of Data Sources

• Data sources can be– Structured (e.g. relational data)

• Schema mapping (GAV/LAV)• Precise integration (Could I have done better by

going to the sources myself?)– Text oriented

• Information Extraction

– Mixed


Dimensions: Type of Output(Pointers vs. Answers)

• The cost-effective approach may depend on the quality guarantees we would want to give.

• At one extreme, it is possible to take a “web search” perspective—provide potential answer pointers to keyword queries– Materialize the data records in the sources as HTML pages and

add them to the index• Give it a sexy name: Surfacing the deep web

• At the other, it is possible to take a “database/knowledge base” perspective– View the individual records in the data sources as assertions in a

knowledge base and support inference over the entire knowledge. • Extraction, Alignment etc. needed


Dimensions: Vertical vs. Horizontal

• Vertical: Sources being integrated are all exporting same type of information. The objective is to collate their results– Eg. Meta-search engines, comparison shopping, bibliographic

search etc. – Challenges: Handling overlap, duplicate detection, source

selection• Horizontal: Sources being integrated are exporting

different types of information– E.g. Composed services, Mashups, – Challenges: Handling “joins”

• Both..


Dimensions: Level of Up-front workAd hoc vs. Pre-orchestrated

• Fully Query-time II (blue sky for now)– Get a query from the user

on the mediator schema– Go “discover” relevant data

sources– Figure out their “schemas”– Map the schemas on to the

mediator schema– Reformulate the user query

into data source queries– Optimize and execute the

queries– Return the answers

• Fully pre-fixed II– Decide on the only query

you want to support– Write a (java)script that

supports the query by accessing specific (pre-determined) sources, piping results (through known APIs) to specific other sources

• Examples include Google Map Mashups

E.g. We may start with known sources and their known schemas, do hand-mapping and support automated reformulation and optimization

(most interesting action is

“in between”)


Dimensions: Control over Sources(Cooperative vs. Autonomous)

• Cooperative sources can (depending on their level of kindness)– Export meta-data (e.g. schema) information– Provide mappings between their meta-data and other

ontologies – Could be done with Semantic Web standards…

– Provide unrestricted access– Examples: Distributed databases; Sources following semantic

web standards• …for uncooperative sources all this information

has to be gathered by the mediator – Examples: Most current integration scenarios on the web


[Figure courtesy Halevy et. Al.]

25

Collated Challenges• Source Selection

– Which sources are likely to be most relevant to answer the query?

• Source quality• Source overlap

• Schema Mapping– How to map schemas of the

sources to the mediator?• Record Linkage

– How to couple data items that refer to the same entity

• Information quality– Data with null values– Inconsistent data

• Query reformulation– To convert query from the

mediator schema to the source schema

– To bring out relevant data despite incompleteness and inconsistency


Services

Webpages

Structureddata


~25M

Rao’s Solution…




Idea: Consider “endorsement” (a la A/H or PageRank) But there are no explicit links….


Services

Webpages

Structureddata


~25M

Rao’s Solution…



Case Study: Query Rewriting in QPIAD

Rewritten queriesQ1’: Model=A4Q2’: Model=Z4Q3’: Model=Boxster

Given a query Q:(Body style=Convt) retrieve all relevant tuplesId Make Model Year Body

1 Audi A4 2001 Convt

2 BMW Z4 2002 Convt

3 Porsche

Boxster 2005 Convt

4 BMW Z4 2003 Null

5 Honda Civic 2004 Null

6 Toyota Camry 2002 Sedan

7 Audi A4 2006 Null

Id Make Model Year Body

1 Audi A4 2001 Convt

2 BMW Z4 2002 Convt

3 Porsche

Boxster 2005 Convt

Base Result Set

AFD: Model~> Body style

Id Make Model Year Body Confidence

4 BMW Z4 2003 Null 0.7

7 Audi A4 2006 Null 0.3

Ranked Relevant Uncertain Answers

Re-order queries based on Estimated Precision

We can select top K rewritten queries using F-measureF-Measure = (1+α)*P*R/(α*P+R)P – Estimated PrecisionR – Estimated Recall based on P and Estimated Selectivity


Services

Webpages

Structureddata


~25M

Rao’s Solution…



Imprecise Queries ?

Want a ‘sedan’ priced around $7000

A Feasible Query

Make =“Toyota”, Model=“Camry”,

Price ≤ $7000

What about the price of a Honda Accord?

Is there a Camry for $7100?

Solution: Support Imprecise Queries

………

1998$6500Camry

Toyota

2000$6700Camry

Toyota

2001$7000Camry

Toyota

1999$7000Camry

Toyota

Concept (Value) SimilarityConcept: Any distinct attribute value pair.

E.g. Make=Toyota• Visualized as a selection query binding a

single attribute• Represented as a supertuple

Concept Similarity: Estimated as the percentage of correlated values common to two given concepts

where v1,v2 Є Aj, i ≠ j and Ai, Aj Є R

• Measured as the Jaccard Similarity among supertuples representing the concepts

5995:4, 6500:3, 4000:6Price

2000:6,1999:5 2001:2,……YearCamry: 3, Corolla: 4,….Model

ST(QMake=Toyota)

Supertuple for Concept Make=Toyota

ImpreciseQuery

Q Map: Convert“like” to “=”

Qpr = Map(Q)

Use Base Set as set ofrelaxable selectionqueries

Using AFDs findrelaxation order

Derive Extended Setbyexecuting relaxed queries

Use Concept similarityto measure tuplesimilarities

Prune tuples belowthreshold

Return Ranked Set

Derive BaseSet Abs

Abs = Qpr(R)

)))(,2()),(,1(()2,1( AivaluesvCorrelatedAivaluesvCorrelatedyCommonalitvvSimilarity

J accardSim(A,B) =BABA

Concept (Value) Similarity Graph

Ford

Chevrolet

Toyota

Honda

DodgeNissan

BMW

0.25

0.16

0.110.15

0.12

0.22

Information Integration on the Web (SA-2) 47

Things beyond this were not covered in Spring 2010

Kambhampati & Knoblock

48Kambhampati & Knoblock

Review of Topic 3: Finding, Representing & Exploiting

StructureGetting Structure:

Allow structure specification languages XML? [More structured than text and less structured than databases]If structure is not explicitly specified (or is obfuscated), can we extract it? Wrapper generation/Information Extraction

Using Structure: For retrieval: Extend IR techniques to use the additional structure For query processing: (Joins/Aggregations etc) Extend database techniques to use the partial structure For reasoning with structured knowledge Semantic web ideas..

Structure in the context of multiple sources: How to align structure How to support integrated querying on pages/sources (after alignment)

information integration

Documents

web sa

web ma

deep web

invisible web

overall web

information integration

integration efforts

multiple websourceschallenges