information integration
Information Integration on the Web. AAAI Tutorial (SA2), Rao Kambhampati & Craig Knoblock. Monday July 22nd 2007, 9am-1pm. Slides for Parts 1 and 5 are available in hardcopy at the front of the room.
Information Integration on the Web (SA-2) 1Kambhampati & Knoblock
Information Integration
4/5
Information Integration on the Web
AAAI Tutorial (SA2)
Rao Kambhampati & Craig Knoblock
Monday July 22nd 2007, 9am-1pm
Slides for Parts 1 and 5 are available in hardcopy at the front of the room
Information Integration
• Combining information from multiple autonomous information sources
  – And answering queries using the combined information
• Many Applications
  – WWW:
    • Comparison shopping
    • Portals integrating data from multiple sources
    • B2B, electronic marketplaces
    • Mashups, service composition
  – Science informatics
    • Integrating genomic data, geographic data, archaeological data, astro-physical data etc.
  – Enterprise data integration
    • An average company has 49 different databases and spends 35% of its IT dollars on integration efforts
Deployed information integration systems…
• Travel sites: Kayak, Expedia etc.
• Google Base, DBPedia
• Map Mashups
• Libra; Citeseer; Google-squared etc.
Deep Web as a Motivation for II
• The surface web of crawlable pages is only a part of the overall web.
• There are many databases on the web
  – How many?
  – 25 million (or more)
  – Containing more than 80% of the accessible content
• Mediator-based information integration is needed.
[Figure: sources on the web: services, webpages, structured data, sensors (streaming data); ~25M]
Blind Men & the Elephant: Differing views on Information Integration
Database view
• Integration of autonomous structured data sources
• Challenges: Schema mapping, query reformulation, query processing
Web service view
• Combining/composing information provided by multiple web-sources
• Challenges: learning source descriptions; source mapping, record linkage etc.
IR/NLP view
• Computing textual entailment from the information in disparate web/text sources
• Challenges: Information Extraction
[Figure: three ways to handle the ~25M sources on the web (services, webpages, structured data, sensors with streaming data)]
• Search Model
  – Materialize the pages
  – Crawl & index them
  – Include them in search results
• Mediator Model
  – Design a mediator
  – Reformulate queries
  – Access sources
  – Collate results
• Warehouse Model
  – Get all the data and put it into a local DB
  – Support structured queries on the warehouse
[Figure: warehouse architecture. Components: data sources; data extraction programs; data cleaning/scrubbing; relational database (warehouse); user queries; OLAP / decision support / data cubes / data mining]
[Figure: mediator architecture. Components: data sources, each behind a wrapper; a mediator that receives user queries and contains the mediated schema, data source catalog, reformulation engine, optimizer and execution engine. Annotation: Which data model?]
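The mediator model (reformulate queries, access sources, collate results) can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: the `Wrapper` and `Mediator` classes, the two in-memory "dealer" sources, and the attribute mappings. It assumes each source is described by a simple attribute-renaming map from the mediated schema, which is far simpler than the GAV/LAV mappings a real mediator would use, and it omits the optimizer entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


def table_source(rows: List[Row]) -> Callable[[Row], List[Row]]:
    """Stand-in for a remote source: answers equality queries over its own schema."""
    def fetch(source_query: Row) -> List[Row]:
        return [r for r in rows
                if all(r.get(k) == v for k, v in source_query.items())]
    return fetch


@dataclass
class Wrapper:
    name: str
    mapping: Dict[str, str]            # mediated attribute -> source attribute
    fetch: Callable[[Row], List[Row]]  # access path to the source

    def covers(self, query: Row) -> bool:
        # Catalog check: can this source bind every attribute in the query?
        return all(attr in self.mapping for attr in query)

    def reformulate(self, query: Row) -> Row:
        # Rewrite a mediated-schema query into the source's vocabulary.
        return {self.mapping[attr]: value for attr, value in query.items()}

    def to_mediated(self, row: Row) -> Row:
        # Translate a source row back into the mediated schema.
        return {med: row[src] for med, src in self.mapping.items() if src in row}


class Mediator:
    def __init__(self, catalog: List[Wrapper]) -> None:
        self.catalog = catalog

    def answer(self, query: Row) -> List[Row]:
        results: List[Row] = []
        for wrapper in self.catalog:
            if not wrapper.covers(query):
                continue  # skip sources that cannot answer the query
            source_rows = wrapper.fetch(wrapper.reformulate(query))
            results.extend(wrapper.to_mediated(r) for r in source_rows)
        return results  # collated answers, in catalog order


# Two hypothetical sources that export the same information under different schemas.
dealer_a = Wrapper("dealer_a", {"Make": "make", "Model": "model"},
                   table_source([{"make": "Toyota", "model": "Camry"},
                                 {"make": "Honda", "model": "Civic"}]))
dealer_b = Wrapper("dealer_b", {"Make": "manufacturer", "Model": "name"},
                   table_source([{"manufacturer": "Toyota", "name": "Corolla"}]))

mediator = Mediator([dealer_a, dealer_b])
answers = mediator.answer({"Make": "Toyota"})
```

The user only ever sees the mediated attributes `Make` and `Model`; the per-source names stay inside the wrappers.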
Dimensions of Variation
• Conceptualization of (and approaches to) information integration vary widely based on
  – Type of data sources being integrated (text; structured; images etc.)
  – Type of integration: vertical vs. horizontal vs. both
  – Level of up-front work: Ad hoc vs. pre-orchestrated
  – Control over sources: Cooperative sources vs. Autonomous sources
  – Type of output: Giving answers vs. Giving pointers
  – Generality of Solution: Task-specific (Mashups) vs. Task-independent (Mediator architectures)
Dimensions: Type of Data Sources
• Data sources can be
  – Structured (e.g. relational data)
    • Schema mapping (GAV/LAV)
    • Precise integration (Could I have done better by going to the sources myself?)
  – Text oriented
    • Information Extraction
  – Mixed
Dimensions: Type of Output (Pointers vs. Answers)
• The cost-effective approach may depend on the quality guarantees we would want to give.
• At one extreme, it is possible to take a “web search” perspective: provide potential answer pointers to keyword queries
  – Materialize the data records in the sources as HTML pages and add them to the index
    • Give it a sexy name: Surfacing the deep web
• At the other, it is possible to take a “database/knowledge base” perspective
  – View the individual records in the data sources as assertions in a knowledge base and support inference over the entire knowledge.
    • Extraction, Alignment etc. needed
Dimensions: Vertical vs. Horizontal
• Vertical: Sources being integrated are all exporting the same type of information. The objective is to collate their results
  – E.g. Meta-search engines, comparison shopping, bibliographic search etc.
  – Challenges: Handling overlap, duplicate detection, source selection
• Horizontal: Sources being integrated are exporting different types of information
  – E.g. Composed services, Mashups
  – Challenges: Handling “joins”
• Both..
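The difference between the two kinds of integration can be made concrete with a small sketch: vertical integration behaves like a union with duplicate detection, horizontal integration like a join. The function names, the sample bibliographic and real-estate rows, and the use of an exact-match key for duplicate detection are all assumptions made for illustration; real duplicate detection usually needs approximate matching.

```python
def integrate_vertical(sources, key):
    """Collate rows from sources exporting the same kind of information,
    dropping duplicates that share the same key value."""
    seen = set()
    merged = []
    for rows in sources:
        for row in rows:
            if row[key] not in seen:
                seen.add(row[key])
                merged.append(row)
    return merged


def integrate_horizontal(left, right, key):
    """Combine sources exporting different kinds of information
    by joining them on a shared attribute."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Vertical: two hypothetical bibliographic sources with overlapping content.
bib_a = [{"title": "Paper X", "year": 2001}, {"title": "Paper Y", "year": 2003}]
bib_b = [{"title": "Paper Y", "year": 2003}, {"title": "Paper Z", "year": 2005}]
collated = integrate_vertical([bib_a, bib_b], key="title")

# Horizontal: a hypothetical mashup of house listings with school data.
listings = [{"address": "1 Main St", "price": 250000}]
schools = [{"address": "1 Main St", "school": "Lincoln"}]
mashup = integrate_horizontal(listings, schools, key="address")
```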
Dimensions: Level of Up-front Work (Ad hoc vs. Pre-orchestrated)
• Fully Query-time II (blue sky for now)
  – Get a query from the user on the mediator schema
  – Go “discover” relevant data sources
  – Figure out their “schemas”
  – Map the schemas on to the mediator schema
  – Reformulate the user query into data source queries
  – Optimize and execute the queries
  – Return the answers
• Fully pre-fixed II
  – Decide on the only query you want to support
  – Write a (java)script that supports the query by accessing specific (pre-determined) sources, piping results (through known APIs) to specific other sources
    • Examples include Google Map Mashups
(most interesting action is “in between”)
E.g. We may start with known sources and their known schemas, do hand-mapping and support automated reformulation and optimization
Dimensions: Control over Sources (Cooperative vs. Autonomous)
• Cooperative sources can (depending on their level of kindness)
  – Export meta-data (e.g. schema) information
  – Provide mappings between their meta-data and other ontologies
  – Could be done with Semantic Web standards…
  – Provide unrestricted access
  – Examples: Distributed databases; Sources following semantic web standards
• …for uncooperative sources all this information has to be gathered by the mediator
  – Examples: Most current integration scenarios on the web
[Figure courtesy Halevy et al.]
Collated Challenges
• Source Selection
  – Which sources are likely to be most relevant to answer the query?
    • Source quality
    • Source overlap
• Schema Mapping
  – How to map schemas of the sources to the mediator?
• Record Linkage
  – How to couple data items that refer to the same entity
• Information quality
  – Data with null values
  – Inconsistent data
• Query reformulation
  – To convert query from the mediator schema to the source schema
  – To bring out relevant data despite incompleteness and inconsistency
[Figure repeated: sources on the web (services, webpages, structured data, sensors with streaming data; ~25M), annotated “Rao’s Solution…”]
Idea: Consider “endorsement” (à la A/H or PageRank). But there are no explicit links…
[Figure repeated: sources on the web (services, webpages, structured data, sensors with streaming data; ~25M), annotated “Rao’s Solution…”]
Case Study: Query Rewriting in QPIAD
Given a query Q: (Body style=Convt) retrieve all relevant tuples

Id  Make     Model    Year  Body
1   Audi     A4       2001  Convt
2   BMW      Z4       2002  Convt
3   Porsche  Boxster  2005  Convt
4   BMW      Z4       2003  Null
5   Honda    Civic    2004  Null
6   Toyota   Camry    2002  Sedan
7   Audi     A4       2006  Null

Base Result Set

Id  Make     Model    Year  Body
1   Audi     A4       2001  Convt
2   BMW      Z4       2002  Convt
3   Porsche  Boxster  2005  Convt

AFD: Model ~> Body style

Rewritten queries
Q1’: Model=A4
Q2’: Model=Z4
Q3’: Model=Boxster

Ranked Relevant Uncertain Answers

Id  Make  Model  Year  Body  Confidence
4   BMW   Z4     2003  Null  0.7
7   Audi  A4     2006  Null  0.3

Re-order queries based on Estimated Precision
We can select top K rewritten queries using F-measure
F-Measure = (1+α)*P*R/(α*P+R)
P – Estimated Precision
R – Estimated Recall based on P and Estimated Selectivity
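The case study can be replayed in a few lines. This is a minimal sketch, not QPIAD's implementation: the table is the one from the slide, but the `EST_PRECISION` values are simply read off the slide's Confidence column and treated as per-model precision estimates (an assumption, since how they are learned is not shown here), and all function names are hypothetical.

```python
# Toy table from the slide; None stands for a Null body style.
CARS = [
    (1, "Audi", "A4", 2001, "Convt"),
    (2, "BMW", "Z4", 2002, "Convt"),
    (3, "Porsche", "Boxster", 2005, "Convt"),
    (4, "BMW", "Z4", 2003, None),
    (5, "Honda", "Civic", 2004, None),
    (6, "Toyota", "Camry", 2002, "Sedan"),
    (7, "Audi", "A4", 2006, None),
]

# Assumed estimates of P(Body=Convt | Model), taken from the Confidence column.
EST_PRECISION = {"Z4": 0.7, "A4": 0.3}


def f_measure(p, r, alpha):
    """F-Measure = (1+alpha)*P*R / (alpha*P + R), as written on the slide."""
    denom = alpha * p + r
    return (1 + alpha) * p * r / denom if denom else 0.0


def base_result_set(rows, body):
    """Certain answers: tuples whose Body value matches the query."""
    return [row for row in rows if row[4] == body]


def rewritten_queries(base):
    """AFD Model ~> Body style: one rewritten query per distinct Model
    seen in the base result set, in order of first appearance."""
    models = []
    for row in base:
        if row[2] not in models:
            models.append(row[2])
    return models


def uncertain_answers(rows, models, precision):
    """Issue rewritten queries in decreasing estimated precision and
    return the Null-bodied tuples they retrieve, with their confidence."""
    ordered = sorted(models, key=lambda m: precision.get(m, 0.0), reverse=True)
    answers = []
    for model in ordered:
        for row in rows:
            if row[2] == model and row[4] is None:
                answers.append((row, precision.get(model, 0.0)))
    return answers


base = base_result_set(CARS, "Convt")
queries = rewritten_queries(base)
ranked = uncertain_answers(CARS, queries, EST_PRECISION)
```

The Honda Civic tuple (Id 5) also has a Null body style but is never retrieved, because no rewritten query mentions its model.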
[Figure repeated: sources on the web (services, webpages, structured data, sensors with streaming data; ~25M), annotated “Rao’s Solution…”]
Imprecise Queries?
Want a ‘sedan’ priced around $7000
A Feasible Query: Make=“Toyota”, Model=“Camry”, Price ≤ $7000

Make    Model  Price  Year
Toyota  Camry  $7000  1999
Toyota  Camry  $7000  2001
Toyota  Camry  $6700  2000
Toyota  Camry  $6500  1998
…       …      …      …

What about the price of a Honda Accord?
Is there a Camry for $7100?
Solution: Support Imprecise Queries
Concept (Value) Similarity
Concept: Any distinct attribute value pair. E.g. Make=Toyota
• Visualized as a selection query binding a single attribute
• Represented as a supertuple
Concept Similarity: Estimated as the percentage of correlated values common to two given concepts

$$\mathrm{Similarity}(v_1, v_2) = \mathrm{Commonality}\big(\mathrm{Correlated}(v_1, \mathrm{values}(A_i)),\ \mathrm{Correlated}(v_2, \mathrm{values}(A_i))\big)$$

where $v_1, v_2 \in A_j$, $i \neq j$ and $A_i, A_j \in R$
• Measured as the Jaccard Similarity among supertuples representing the concepts

$$\mathrm{JaccardSim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Supertuple for Concept Make=Toyota: ST(Q_Make=Toyota)
Model   Camry: 3, Corolla: 4, …
Year    2000: 6, 1999: 5, 2001: 2, …
Price   5995: 4, 6500: 3, 4000: 6

[Figure: processing flow for an imprecise query Q]
• Map: Convert “like” to “=”; Qpr = Map(Q)
• Derive Base Set Abs; Abs = Qpr(R)
• Use Base Set as set of relaxable selection queries
• Using AFDs find relaxation order
• Derive Extended Set by executing relaxed queries
• Use Concept similarity to measure tuple similarities
• Prune tuples below threshold
• Return Ranked Set
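A small sketch of how supertuples could be compared. The Toyota supertuple uses the counts shown on the slide; the Honda supertuple is entirely hypothetical. Two further assumptions are made here and are not taken from the slides: supertuples are treated as bags, so the Jaccard coefficient is computed over value counts (sum of minima over sum of maxima), and the per-attribute scores are combined by a plain unweighted average.

```python
from collections import Counter


def bag_jaccard(a, b):
    """Jaccard similarity over bags: |A ∩ B| / |A ∪ B| with multiplicities."""
    keys = set(a) | set(b)
    inter = sum(min(a[k], b[k]) for k in keys)
    union = sum(max(a[k], b[k]) for k in keys)
    return inter / union if union else 0.0


def concept_similarity(st1, st2):
    """Average the bag-Jaccard score over every attribute of the supertuples
    (unweighted average: an assumption made for this sketch)."""
    attrs = set(st1) | set(st2)
    if not attrs:
        return 0.0
    total = sum(bag_jaccard(st1.get(a, Counter()), st2.get(a, Counter()))
                for a in attrs)
    return total / len(attrs)


# Supertuple for Make=Toyota, using the counts shown on the slide.
st_toyota = {
    "Model": Counter({"Camry": 3, "Corolla": 4}),
    "Year": Counter({2000: 6, 1999: 5, 2001: 2}),
    "Price": Counter({5995: 4, 6500: 3, 4000: 6}),
}

# Hypothetical supertuple for Make=Honda, invented for illustration.
st_honda = {
    "Model": Counter({"Accord": 5, "Civic": 2}),
    "Year": Counter({2000: 4, 1999: 5, 2001: 1}),
    "Price": Counter({5995: 2, 6500: 3, 7000: 1}),
}

sim = concept_similarity(st_toyota, st_honda)
```

Because the two makes share no model names, all of the similarity in this toy example comes from overlapping years and prices.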
Concept (Value) Similarity Graph
[Figure: graph over the makes Ford, Chevrolet, Toyota, Honda, Dodge, Nissan and BMW, with edge similarities 0.25, 0.16, 0.11, 0.15, 0.12 and 0.22]
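The imprecise-query flow described earlier (map to a precise query, derive the base set, relax, score, prune, rank) can be sketched end to end on the Camry rows. This is a simplified illustration under several assumptions: the relaxation order in `RELAX_ORDER` is fixed by hand rather than derived from AFDs, categorical values are compared by exact match as a stand-in for concept similarity, numeric closeness is a simple relative difference, and row 5 is a hypothetical tuple added only to show pruning.

```python
ROWS = [
    {"Id": 1, "Make": "Toyota", "Model": "Camry", "Price": 6500, "Year": 1998},
    {"Id": 2, "Make": "Toyota", "Model": "Camry", "Price": 6700, "Year": 2000},
    {"Id": 3, "Make": "Toyota", "Model": "Camry", "Price": 7000, "Year": 2001},
    {"Id": 4, "Make": "Toyota", "Model": "Camry", "Price": 7000, "Year": 1999},
    # Hypothetical row, added to show a tuple falling below the threshold.
    {"Id": 5, "Make": "Toyota", "Model": "Camry", "Price": 4000, "Year": 1995},
]

# Assumed order in which attributes are dropped; the slides derive it from AFDs.
RELAX_ORDER = ["Year", "Price", "Make"]


def select(rows, constraints):
    """Precise selection: every constrained attribute must match exactly."""
    return [r for r in rows if all(r[a] == v for a, v in constraints.items())]


def relaxed_queries(base_tuple, order):
    """Treat a base tuple as a fully bound query and weaken it step by step."""
    constraints = {a: v for a, v in base_tuple.items() if a != "Id"}
    for attr in order:
        constraints.pop(attr, None)
        yield dict(constraints)


def similarity(query, row):
    """Average per-attribute similarity between the query and a tuple."""
    scores = []
    for attr, wanted in query.items():
        got = row[attr]
        if isinstance(wanted, (int, float)):
            scores.append(max(0.0, 1.0 - abs(wanted - got) / wanted))
        else:
            scores.append(1.0 if got == wanted else 0.0)
    return sum(scores) / len(scores)


def answer_imprecise(query, rows, order, threshold):
    base = select(rows, query)           # "like" mapped to "=": the base set
    extended = {}
    for base_tuple in base:              # extended set from relaxed queries
        for relaxed in relaxed_queries(base_tuple, order):
            for row in select(rows, relaxed):
                extended[row["Id"]] = row
    scored = [(similarity(query, row), row["Id"]) for row in extended.values()]
    kept = [(s, i) for s, i in scored if s >= threshold]   # prune
    return sorted(kept, key=lambda pair: (-pair[0], pair[1]))  # rank


QUERY = {"Model": "Camry", "Price": 7000}   # Model like Camry, Price like 7000
ranked = answer_imprecise(QUERY, ROWS, RELAX_ORDER, threshold=0.9)
```

With this threshold the $6700 and $6500 Camrys are returned after the exact matches, while the $4000 one is pruned.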
Things beyond this were not covered in Spring 2010
Review of Topic 3: Finding, Representing & Exploiting Structure
Getting Structure:
• Allow structure specification languages: XML? [More structured than text and less structured than databases]
• If structure is not explicitly specified (or is obfuscated), can we extract it? Wrapper generation/Information Extraction
Using Structure:
• For retrieval: Extend IR techniques to use the additional structure
• For query processing (Joins/Aggregations etc.): Extend database techniques to use the partial structure
• For reasoning with structured knowledge: Semantic web ideas..
Structure in the context of multiple sources:
• How to align structure
• How to support integrated querying on pages/sources (after alignment)