Querying for relations from the semi-structured Web
Sunita Sarawagi
IIT Bombay
http://www.cse.iitb.ac.in/~sunita
Contributors
Rahul Gupta, Girija Limaye, Prashant Borole, Rakesh Pimplikar, Aditya Somani
Web Search
Mainstream web search: the user issues keyword queries to a search engine and gets back a ranked list of documents. 15 glorious years of squeezing all of the user's search needs into this least common denominator.

Structured web search: the user issues natural-language or structured queries to a search engine and gets back point answers and record sets. Many challenges in understanding both query and content; 15 years of slow but steady progress.
The Quest for Structure: Vertical Structured Search Engines

Structure = a domain-specific schema:
- Shopping: ShopBot (Etzioni et al., 1997): product name, manufacturer, price
- Publications: CiteSeer (Lawrence, Giles, et al., 1998): paper title, author name, email, conference, year
- Jobs: FlipDog, WhizBang! Labs (Mitchell et al., 2000): company name, job title, location, requirements
- People: DBLife (Doan, 2007): name, affiliations, committees served, talks delivered

These triggered much research on extraction and IR-style search of structured data (BANKS '02).
Horizontal Structured Search

Domain-independent structure: a small, generic set of structured primitives over entities, types, relationships, and properties:
- <Entity> IsA <Type>: Mysore is a city
- <Entity> Has <Property>: <City> has average rainfall <Value>
- <Entity1> <related-to> <Entity2>: <Person> born-in <City>; <Person> CEO-of <Company>
Types of Structured Search

- Structured databases (ontologies), created manually (Cyc) or semi-automatically (YAGO). Examples: True Knowledge (2009), Wolfram Alpha (2009).
- The web annotated with structured elements. Queries: keywords + structured annotations, e.g. <Physicist> +cosmos. Open-domain structure extraction and annotation of web docs (2005 onward).
Users, Ontologies, and the Web

- Users are from Venus: bi-syllabic, impatient, and believers in mind-reading.
- Ontologies are from Mars: one structure to fit all.
- Web content creators are from some other galaxy: they ignore ontologies and let the search engines bring the users.
What is Missed in Ontologies

- The trivial, the transient, and the textual.
- Procedural knowledge: what do I do on an error?
- The huge body of invaluable text of various types: reviews, literature, commentaries, videos.
- Context: by stripping knowledge to its skeletal form, the context that is so valuable for search is lost.

As long as queries are unstructured, the redundancy and variety of unstructured sources is invaluable.
Structured Annotations in HTML

- IsA annotations: KnowItAll (2004)
- Open-domain relationships: TextRunner (Banko 2007)
- Ontological annotations: SemTag and Seeker (2003); Wikipedia annotations (Wikify! 2007, CSAW 2009)

All of these view documents as a sequence of tokens, and it is challenging to ensure high accuracy.
WWT: Table queries over the semi-structured web
Queries in WWT

Two modes: query by content, and query by description.
Query by content (example rows):
  Alan Turing | Turing Machine
  E. F. Codd | Relational Databases

  Desh | Late night
  Bhairavi | Morning
  Patdeep | Afternoon

Query by description (example column headers):
  Inventor | Computer science concept | Year
  Indian states | Airport | City
Answer: a table with ranked rows

  Person | Concept/Invention
  Alan Turing | Turing Machine
  Seymour Cray | Supercomputer
  E. F. Codd | Relational Databases
  Tim Berners-Lee | WWW
  Charles Babbage | Babbage Engine
Attempt 1: keyword search to find structured records, with the query "computer science concept inventor year". The results are verbose articles, not structured tables; the only document with an unstructured list has just some of the desired records; the desired records are spread across many documents. The correct answer is not one click away.
The only list in one of the retrieved pages.
A highly relevant Wikipedia table is not retrieved in the top-k. The ideal answer should be integrated from these incomplete sources.
Attempt 2: Include samples in the query

Known examples are added to the keywords: "alan turing machine codd relational database". Now documents relevant only to the keywords are retrieved, and the ideal answer is still spread across many documents.
WWT Architecture

[Figure: the WWT architecture. Offline, record sources extracted from the Web are stored in a content+context index and annotated against an ontology. At query time, the Query Builder turns the query table into a keyword query that retrieves sources L1,…,Lk from the index; Type Inference uses the type-system hierarchy and a resolver builder to construct resolvers; the Extractor (a record labeler with CRF models) produces tables T1,…,Tk; the Consolidator (cell and row resolvers driven by list statistics) merges them into a consolidated table; and the Ranker computes row and cell scores to return the final consolidated table to the user.]
Offline: Annotating to an Ontology

Annotate table cells with entity nodes and table columns with type nodes.

[Figure: a fragment of the ontology (All → movies → Indian_films, English_films, 2008_films, Terrorism_films; All → People → Entertainers), with entities such as A_Wednesday, Black&White, Coffee_house (film), Coffee_house (Loc), Indian_directors, and an example table whose cells mention A_Wednesday, Black&White, and Coffee_house.]
Challenges
- Ambiguity of entity names: "Coffee house" is both a movie name and a place name.
- Noisy mentions of entity names: Black&White versus Black and White.
- Multiple labels: the YAGO ontology has 2.2 types per entity on average.
- Missing type links in the ontology, so the least common ancestor cannot be used. Missing link: Black&White to 2008_films. Not a missing link: 1920 to Terrorism_films.
- Scale: YAGO has 1.9 million entities and 200,000 types.
A Unified Approach

A graphical model jointly labels cells and columns to maximize a sum of scores, where y_cj is the entity label of cell c in column j and y_j is the type label of column j:
- Score(y_cj): string similarity between c and y_cj.
- Score(y_j): string similarity between the header of column j and y_j.
- Score(y_j, y_cj): for a subsumed entity, inversely proportional to the distance between them; for an outside entity, the fraction of overlapping entities between y_j and the immediate parent of y_cj.

This handles missing links: the overlap of 2008_films with 2007_films is zero, but with Indian_films it is non-zero. (A sketch of these scores in code follows the figure.)
[Figure: ontology fragment (movies → Indian_films, English_films, Terrorism_films) illustrating a column type y_j with subsumed entities y_1j, y_2j and an outside entity y_3j.]
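A minimal sketch of the three score components, assuming a hypothetical ontology interface `onto` (with a subsumption test, tree distance, parent lookup, and entity-overlap statistics) and a generic `string_sim` function; none of these names come from the paper.

```python
# Sketch of the score components of the joint labeling model.
# `string_sim` and the `onto` interface are assumed helpers,
# not the paper's actual API.

def cell_score(cell_text, entity_name, string_sim):
    """Score(y_cj): string similarity between a cell and its entity label."""
    return string_sim(cell_text, entity_name)

def column_score(header_text, type_name, string_sim):
    """Score(y_j): string similarity between a column header and its type."""
    return string_sim(header_text, type_name)

def compatibility_score(type_node, entity, onto):
    """Score(y_j, y_cj): a subsumed entity scores inversely with its
    ontology distance from the column type; an outside entity falls back
    on the entity overlap between the type and the entity's immediate
    parent, which tolerates missing links (e.g. Black&White -> 2008_films)."""
    if onto.subsumes(type_node, entity):
        return 1.0 / (1.0 + onto.distance(type_node, entity))
    return onto.entity_overlap(type_node, onto.parent(entity))
```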
WWT Architecture (revisited)

[Figure: the architecture diagram again; the component discussed next is the Extractor.]
Extraction: Content Queries

Extracting the query columns from list records.

Query Q:
  Cornell University | Ithaca
  State University of New York | Stony Brook
  New York University | New York

A source Li:
  New York University (NYU), New York City, founded in 1831. Columbia University, founded in 1754 as King's College. Binghamton University, Binghamton, established in 1946. State University of New York, Stony Brook, New York, founded in 1957. Syracuse University, Syracuse, New York, established in 1870. State University of New York, Buffalo, established in 1846. Rensselaer Polytechnic Institute (RPI) at Troy.

Lists are often human-generated.
Extraction

Given the source list Li above, a rule-based extractor is insufficient, and a statistical extractor needs training data. Generating that is also not easy!

Extracted table columns:
  Cornell University | Ithaca
  State University of New York | Stony Brook
  New York University | New York
Extraction: Labeled Data Generation

Lists are unlabeled, but labeled records are needed to train a CRF. A fast but naïve approach for generating labeled records follows.

Query (about colleges in NY):
  New York University | New York
  Monroe College | Brighton
  State University of New York | Stony Brook

Fragment of a relevant list source:
  New York Univ. in NYC
  Columbia University in NYC
  Monroe Community College in Brighton
  State University of New York in Stony Brook, New York.
First, look in the list for matches of every query cell. Note the ambiguity: there is another match for "New York University", and another for "New York".
Then greedily map each query row to the best match in the list.
Problems with this hard approach:
- The hard matching criterion has significantly low recall: segments are missed, and natural clues like Univ = University are not used, so some segments stay unmapped (hurting recall), are wrongly mapped, or are assumed to be 'Other'.
- Greedy matching can lead to really bad mappings.
Generating labeled data: Soft approach

Query rows:
  New York University | New York
  Monroe College | Brighton
  State University of New York | Stony Brook

Source list:
  New York Univ. in NYC
  Columbia University in NYC
  Monroe Community College in Brighton
  State University of New York in Stony Brook, New York.

Compute a match score for each (query row, source row) pair: the score of the best segmentation of the source row into the query columns. The score of a segment s for column c is the probability that cell c of the query row is the same as segment s, computed by the Resolver module based on the type of the column.

[Figure: edge weights between query rows and source rows, e.g. 0.9, 0.3, 1.8 for one query row and 2.0, 0.7, 0.3 for another.]
Then compute the maximum-weight matching between query rows and source rows, which is better than greedily choosing the best match for each row (greedy matching is shown in red in the original figure). Soft string matching significantly increases the number of labeled candidates, vastly improving recall and leading to better extraction models.
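A compact sketch of both steps under stated simplifications: `cell_match_prob` is a crude token-Jaccard stand-in for WWT's type-aware Resolver, and the dynamic program assumes the query columns appear in order and cover the whole source row, which the real system does not require.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cell_match_prob(query_cell, segment):
    """Crude stand-in for the Resolver's type-specific probability that
    `segment` denotes the same value as `query_cell` (token Jaccard)."""
    a, b = set(query_cell.lower().split()), set(segment.lower().split())
    return len(a & b) / max(len(a | b), 1)

def best_segmentation_score(query_row, tokens):
    """Score of the best segmentation of a source row's tokens into the
    query columns, via dynamic programming over token positions."""
    n, m = len(tokens), len(query_row)
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):          # prefix tokens[:i]
        for j in range(1, m + 1):      # first j query columns used
            for k in range(j - 1, i):  # last segment = tokens[k:i]
                seg = " ".join(tokens[k:i])
                cand = best[k][j - 1] + cell_match_prob(query_row[j - 1], seg)
                best[i][j] = max(best[i][j], cand)
    return max(best[n][m], -1e9)       # -1e9: row too short to segment

def match_rows(query_rows, source_rows):
    """Maximum-weight matching of query rows to source rows; better than
    greedily taking each query row's best match, as on the slide."""
    weights = np.array([[best_segmentation_score(q, s.split())
                         for s in source_rows] for q in query_rows])
    qi, si = linear_sum_assignment(-weights)   # negate to maximize
    return list(zip(qi.tolist(), si.tolist()))
```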
Extractor

Train a CRF on the generated labeled data. Feature set: delimiters and HTML tokens in a window around labeled segments, plus alignment features. Multiple sources are trained collectively. A minimal training sketch follows.
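A minimal sketch of training such an extractor on the generated labels, here using the third-party sklearn-crfsuite package (an assumption; it is not the paper's implementation). Alignment features and collective training across sources are omitted for brevity.

```python
import sklearn_crfsuite

def token_features(tokens, i):
    """Features in a window around each token: the token itself, its
    neighbors, and simple delimiter / HTML cues."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_delim": tok in {",", ";", "|", ".", "(", ")"},
        "is_html": tok.startswith("<") and tok.endswith(">"),
    }
    if i > 0:
        feats["prev"] = tokens[i - 1].lower()
    if i + 1 < len(tokens):
        feats["next"] = tokens[i + 1].lower()
    return feats

def train_crf(records, labels):
    """records: token sequences from the list source; labels: per-token
    column labels (e.g. 'College', 'City', 'Other') from the soft matcher."""
    X = [[token_features(toks, i) for i in range(len(toks))] for toks in records]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X, labels)
    return crf
```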
Experiments

Aim: reconstruct Wikipedia tables from only a few sample rows. Sample queries:
- TV series: character name, actor name, season
- Oil spills: tanker, region, time
- Golden Globe Awards: actor, movie, year
- Dadasaheb Phalke Awards: person, year
- Parrots: common name, scientific name, family
Experiments: Dataset

- Corpus: 16M lists from 500M pages of a web crawl; 45% of the lists retrieved by an index probe are irrelevant.
- Query workload: 65 queries, with ground truth hand-labeled by 10 users over 1300 lists; 27% of the queries are not answerable with one list (the difficult ones).
- The true consolidated table consists of 75% of the Wikipedia table plus 25% new rows not present in Wikipedia.
Extraction Performance

The benefits of soft training-data generation, alignment features, and staged extraction show up in F1 score: more than 80% F1 with just three query records.
Queries in WWT (revisited)

Query by content has been covered; next is query by description, where the query gives column descriptions such as Inventor | Computer science concept | Year, or Indian states | Airport | City.
Extraction: Description Queries

Example table with non-informative or missing headers:
  Lithium | 3
  Sodium | 11
  Beryllium | 4

Challenges: non-informative headers, or no headers at all. Context is used to get at relevant tables, along with ontological annotations. The context is the union of the text around the table, the headers, and ontology labels when present.
[Figure: a fragment of the ontology (All → Chemical_elements → Metals (Alkali, Non alkali) and Non_Metals (Gas, Non-gas), with entities such as Lithium, Sodium, Aluminium, Hydrogen, Carbon), used to label the example table's first column as Chemical element / Alkali.]
Joint Labeling of Table Columns

Given candidate tables T1, T2, …, Tn and query columns q1, q2, …, qm, the task is to label the columns of each Ti with {q1, q2, …, qm, none} so as to maximize the sum of these scores:
- Score(T, j, qk) = ontology-type match + header string match with qk.
- Score(T, *, qk) = match of the description of T with qk.
- Score(T, j, T', j', qk) = content overlap of column j of table T with column j' of table T' when both are labeled qk.

Inference is performed in a graphical model, solved via belief propagation; a sketch of the objective follows.
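A sketch that just evaluates this objective for one candidate labeling; the node, description, and overlap scorers are assumed callables, and the paper optimizes over labelings with belief propagation rather than by enumeration.

```python
def labeling_score(labeling, node_score, desc_score, overlap_score):
    """labeling: {(table, column): query_column or None}
    node_score(t, j, q)         -> Score(T, j, qk)
    desc_score(t, q)            -> Score(T, *, qk)
    overlap_score(t, j, t2, j2) -> Score(T, j, T', j', qk)"""
    labeled = [(tj, q) for tj, q in labeling.items() if q is not None]
    total = sum(node_score(t, j, q) + desc_score(t, q)
                for (t, j), q in labeled)
    for a in range(len(labeled)):
        for b in range(a + 1, len(labeled)):
            (t1, j1), q1 = labeled[a]
            (t2, j2), q2 = labeled[b]
            if q1 == q2 and t1 != t2:  # same query label across two tables
                total += overlap_score(t1, j1, t2, j2)
    return total
```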
WWT Architecture (revisited)

[Figure: the architecture diagram again; the component discussed next is the Consolidator.]
Step 3: Consolidation

Merging the extracted tables into one, merging duplicates.

Table 1:
  Cornell University | Ithaca
  State University of New York | Stony Brook
  New York University | New York City
  Binghamton University | Binghamton

+ Table 2:
  SUNY | Stony Brook
  New York University (NYU) | New York
  RPI | Troy
  Columbia University | New York
  Syracuse University | Syracuse

= Consolidated table:
  Cornell University | Ithaca
  State University of New York OR SUNY | Stony Brook
  New York University OR New York University (NYU) | New York City OR New York
  Binghamton University | Binghamton
  RPI | Troy
  Columbia University | New York
  Syracuse University | Syracuse
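A toy sketch of this merge step: each extracted row is folded into the first consolidated row whose match probability clears a threshold, keeping variant strings joined with OR as above. The match function is meant to be the resolver's P(RowMatch), sketched after the next slide; the threshold is an illustrative assumption.

```python
def consolidate(rows, match_prob, threshold=0.8):
    """rows: extracted rows as lists of cell strings ('' if missing).
    match_prob(a, b) -> probability that rows a and b are the same."""
    merged, support = [], []
    for row in rows:
        for i, variants in enumerate(merged):
            rep = [v[0] if v else "" for v in variants]  # representative row
            if match_prob(rep, row) >= threshold:
                for vs, cell in zip(variants, row):
                    if cell and cell not in vs:
                        vs.append(cell)                  # keep the variant
                support[i] += 1
                break
        else:
            merged.append([[c] if c else [] for c in row])
            support.append(1)
    return ([[" OR ".join(vs) for vs in variants] for variants in merged],
            support)
```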
Consolidation

Challenge: resolving whether two rows are the same in the face of extraction errors and missing columns, in an open domain, with no training data. Our approach: a specially designed Bayesian network with interpretable and generalizable parameters.
Resolver

A Bayesian network computes P(RowMatch | rows q, r) from per-cell match probabilities P(ith cell match | qi, ri). The parameters of the cell-level probabilities are set automatically using list statistics, derived from user-supplied type-specific similarity functions.
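A toy version of this network: RowMatch is the parent of the per-cell match variables, and Bayes' rule combines the per-cell evidence. The similarity functions stand in for the user-supplied type-specific ones, and the 0.5 prior is an illustrative assumption rather than a learned list statistic.

```python
def row_match_prob(q, r, cell_sims, prior=0.5):
    """P(RowMatch | q, r), treating cells as independent given RowMatch.
    cell_sims[i](qi, ri) in [0, 1] plays the role of
    P(observed similarity of cell i | same underlying value)."""
    p_same, p_diff = prior, 1.0 - prior
    for qi, ri, sim in zip(q, r, cell_sims):
        if not qi or not ri:
            continue                 # missing column: no evidence either way
        s = sim(qi, ri)
        p_same *= s                  # likelihood under RowMatch = 1
        p_diff *= 1.0 - s            # likelihood under RowMatch = 0
    z = p_same + p_diff
    return p_same / z if z > 0 else prior
```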
Ranking

Factors for ranking:
- Relevance: membership in overlapping sources.
- Support from multiple sources.
- Completeness: the importance of the columns present; penalize records with only common 'spam' columns like City and State.
- Correctness: extraction confidence.
  School | Location | State | Merged-row confidence | Support
  - | - | NY | 0.99 | 9
  - | NYC | New York | 0.95 | 7
  New York Univ. OR New York University | New York City OR New York | New York | 0.85 | 4
  University of Rochester OR Univ. of Rochester | Rochester | New York | 0.50 | 2
  University of Buffalo | Buffalo | New York | 0.70 | 2
  Cornell University | Ithaca | New York | 0.76 | 1
Relevance ranking on set membership:
- Weighted-sum approach: the score of a source table t is s(t) = the fraction of query rows in t, and the relevance of a consolidated row r is the sum of s(t) over the tables t that contain r.
- Graph-walk approach: a random walk from rows to table nodes, starting from the query rows, with random restarts to the query rows. (A sketch of the weighted-sum variant follows the figure.)
[Figure: graph linking query rows, consolidated rows, and source-table nodes, over which the random walk runs.]
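A direct sketch of the weighted-sum variant; the data-structure names are illustrative.

```python
def relevance_scores(tables, query_rows, row_tables):
    """tables: {table_id: set of consolidated row ids it contributed}
    query_rows: set of row ids corresponding to the query records
    row_tables: {row_id: set of table ids containing that row}
    s(t) = fraction of query rows in t; relevance(r) = sum of s(t) over
    the tables t in which r appears."""
    s = {t: len(rows & query_rows) / max(len(query_rows), 1)
         for t, rows in tables.items()}
    return {r: sum(s[t] for t in ts) for r, ts in row_tables.items()}
```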
Ranking Criteria

Score(row r) = graph relevance of r × importance of the columns C present in r (high if C functionally determines the others) × the sum of cell extraction confidences, where each cell's confidence is a noisy-OR of its extraction confidences from the individual CRFs.
  School | Location | State | Merged-row confidence | Support
  New York Univ. OR New York University (0.90) | New York City OR New York (0.95) | New York (0.98) | 0.85 | 4
  University of Buffalo (0.88) | Buffalo (0.99) | New York (0.99) | 0.70 | 2
  Cornell University (0.92) | Ithaca (0.95) | New York (0.99) | 0.76 | 1
  University of Rochester OR Univ. of Rochester (0.80) | Rochester (0.95) | New York (0.99) | 0.50 | 2
  - | - | NY (0.99) | 0.99 | 9
  - | NYC (0.98) | New York (0.98) | 0.95 | 7
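A sketch of this score combination: per-cell confidences from the individual source CRFs are merged with a noisy-OR, and the three factors are multiplied as on the slide. Names are illustrative.

```python
def noisy_or(confidences):
    """Combine one cell's extraction confidences from individual CRFs:
    P(cell correct) = 1 - prod_i (1 - p_i)."""
    p_all_wrong = 1.0
    for p in confidences:
        p_all_wrong *= 1.0 - p
    return 1.0 - p_all_wrong

def row_score(graph_relevance, column_importance, cells):
    """Score(row) = graph relevance x column importance x extraction
    confidence, where `cells` is a list of per-cell confidence lists
    (one confidence per source CRF that extracted the cell)."""
    extraction = sum(noisy_or(c) for c in cells)
    return graph_relevance * column_importance * extraction
```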
Overall Performance

To justify the sophisticated consolidation and resolution, compare against: (a) processing only the magically known single best list, so that no consolidation or resolution is required; and (b) simple consolidation with no merging of approximate duplicates. WWT achieves more than 55% recall and beats both, with a bigger gain on difficult queries (results reported for all queries and for difficult queries separately).
Running Time

Under 30 seconds with 3 query records. This can be improved by processing sources in parallel. Variance is high because the running time depends on the number of columns, record length, etc.
Related Work

- Google Squared: developed independently and launched in May 2009. The user provides a keyword query, e.g. "list of Italian joints in Manhattan", and the schema is inferred. Technical details are not public.
- Prior methods for extraction and resolution assume labeled data or pre-trained parameters. We generate labeled data and automatically train the resolver parameters from the list sources.
Summary

- Structured web search and the role of non-text, partially structured web sources.
- The WWT system: domain-independent and online, with structure interpretation at query time; relies heavily on unsupervised statistical learning.
- Techniques: a graphical model for table annotation; a soft approach to generating labeled data; collective column labeling for descriptive queries; a Bayesian network for resolution and consolidation; graph-walk relevance plus confidence from a probabilistic extractor for ranking.
What Next?

- Designing plans for non-trivial ways of combining sources.
- Better ranking and user-interaction models.
- Expanding the query set: aggregate queries (tables are rich in quantities) and point queries (attribute-value and relationship queries).
- The interplay between the semi-structured web and ontologies: augmenting one with the other.
- Quantifying the information in structured sources vis-à-vis text sources on typical query workloads.