Structured Querying of Web Text:
A Technical Challenge
Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko
University of Washington
Asilomar, CA
January 9, 2007
Structured Queries, Unstructured Data
“Show me some people, what they invented, and the years they died”
q(?a, ?b, ?c) :- invented(?a, ?b), died-in(?a, <year> ?c)

a            b                  c      prob
Kepler       log books          1630   .7902
Heisenberg   matrix mechanics   1976   .7897
Galileo      telescope          1642   .7395
Newton       calculus           1727   .7366
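
A minimal sketch of how a query like the one above could be evaluated over a probabilistic fact table, assuming each answer is scored by the product of the probabilities of the facts it joins. The toy facts, their probabilities, and the `is_year` helper are hypothetical and only for illustration; a real system would check the `<year>` constraint against extracted type data rather than a string test.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    obj1: str
    pred: str
    obj2: str
    prob: float


# Toy facts; the probabilities are illustrative, not extractor output.
FACTS = [
    Fact("Edison", "invented", "light bulb", 0.97),
    Fact("Edison", "died-in", "1931", 0.90),
    Fact("Galileo", "invented", "telescope", 0.80),
    Fact("Galileo", "died-in", "1642", 0.95),
    Fact("Galileo", "died-in", "Arcetri", 0.85),  # not a year, filtered out
]


def is_year(value: str) -> bool:
    """Hypothetical stand-in for the <year> type constraint."""
    return value.isdigit() and len(value) == 4


def query(facts):
    """q(?a, ?b, ?c) :- invented(?a, ?b), died-in(?a, <year> ?c)."""
    invented = [f for f in facts if f.pred == "invented"]
    died = [f for f in facts if f.pred == "died-in" and is_year(f.obj2)]
    answers = [
        (i.obj1, i.obj2, d.obj2, i.prob * d.prob)
        for i in invented
        for d in died
        if i.obj1 == d.obj1  # join on ?a
    ]
    # Highest-probability answers first.
    return sorted(answers, key=lambda row: row[3], reverse=True)


for row in query(FACTS):
    print(row)
```

The nested loop is a plain join on `?a`; a database would do the same work with an index on the first column.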
ExDB
[Figure: ExDB architecture. Web pages (e.g. "… In 1877, Edison invented the phonograph. Although he …") feed extractors that populate the Facts, Types, and Synonyms tables in an RDBMS. Query middleware answers queries such as invented(Edison ?e, ?i).]

Facts
Obj1     Pred       Obj2         prob
Edison   invented   light bulb   0.97
Morgan   born-in    1837         0.85

Types
Type        Instance   prob
scientist   Einstein   0.99
city        Seattle    0.92

Synonyms
Pred1      Pred2        prob
invented   did-invent   0.85
invented   created      0.72

1. Run extractors
2. Populate data model
3. Queries
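
One way the query middleware might use the Synonyms table is to expand the query predicate before matching facts. The sketch below assumes a matched fact is weighted by the synonym probability; the slides do not say how the real middleware combines these scores, and the function names `expand_predicate` and `ask` are hypothetical. The "created phonograph" fact and its probability are made up for illustration.

```python
FACTS = [
    ("Edison", "invented", "light bulb", 0.97),
    ("Edison", "created", "phonograph", 0.80),  # hypothetical fact
    ("Morgan", "born-in", "1837", 0.85),
]

SYNONYMS = [
    ("invented", "did-invent", 0.85),
    ("invented", "created", 0.72),
]


def expand_predicate(pred, synonyms):
    """Map the query predicate and its synonyms to a weight.

    The query predicate itself gets weight 1.0; each synonym gets
    its probability from the synonym table.
    """
    weights = {pred: 1.0}
    for pred1, pred2, prob in synonyms:
        if pred1 == pred:
            weights[pred2] = max(weights.get(pred2, 0.0), prob)
    return weights


def ask(obj1, pred, facts, synonyms):
    """Answer pred(obj1, ?x), also matching synonyms of pred."""
    weights = expand_predicate(pred, synonyms)
    rows = [
        (o2, p * weights[pr])
        for o1, pr, o2, p in facts
        if o1 == obj1 and pr in weights
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


print(ask("Edison", "invented", FACTS, SYNONYMS))
```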
Information Extraction
Each concept has an IE mechanism

Example                                  Description        IE technique
invented(Edison, phonograph)             Arity-2 fact       TextRunner
<scientist> Einstein                     Type (hypernymy)   KnowItAll
has-invented = invented                  Synonymy           DIRT
invented ⊆ discovered                    ID (troponymy)     ?
FD: has-capital(x, y) → has-capital(y)   FD (rule)          ?
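
To make the "arity-2 fact" row concrete, here is a toy pattern-based extractor. It is not TextRunner, KnowItAll, or DIRT, and it does not reflect how those systems work; it is a regular expression over a fixed verb list, assumed here purely for illustration. The name `extract` and the pattern are hypothetical.

```python
import re

# Capitalized subject, a verb from a small fixed list, then "the <object>"
# up to the next punctuation mark. Far cruder than a real extractor.
PATTERN = re.compile(r"\b([A-Z][a-z]+) (invented|created) the ([^.,;]+)")


def extract(text):
    """Return (obj1, pred, obj2) triples found in the text."""
    return [
        (m.group(1), m.group(2), m.group(3).strip())
        for m in PATTERN.finditer(text)
    ]


print(extract("It was big news when Edison invented the light bulb."))
print(extract("In 1877 Edison created the light bulb."))
```

A pattern like this misses most phrasings and assigns no probability, which is part of why each concept in the table has its own dedicated IE technique.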
Populate Data Model
Use extractions to fill tables

Facts
Obj1     Pred       Obj2         prob
Edison   invented   light bulb   0.97
Morgan   born-in    1837         0.85

Types
Type        Instance   prob
scientist   Einstein   0.99
city        Boston     0.92

Synonyms
Pred1      Pred2        prob
invented   did-invent   0.85
invented   created      0.72

IDs
Inclusion   Includer     prob
invented    discovered   0.81
Seattle     Washington   0.65

FDs
LHS             RHS          prob
capital(x, y)   capital(y)   0.77

Source sentences:
It was big news when Edison invented the light bulb.
He visited cities such as Boston and New York.
We all know that Edison invented the light bulb. … In 1877 Edison created the light bulb.
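
The tables above map directly onto relational storage. The following sketch uses Python's built-in `sqlite3` module as a stand-in for the RDBMS; the table and column names are assumptions chosen to mirror the slide, not the prototype's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE facts    (obj1 TEXT, pred TEXT, obj2 TEXT, prob REAL);
    CREATE TABLE types    (type_name TEXT, instance TEXT, prob REAL);
    CREATE TABLE synonyms (pred1 TEXT, pred2 TEXT, prob REAL);
    """
)

# Rows taken from the slide's example tables.
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [("Edison", "invented", "light bulb", 0.97),
     ("Morgan", "born-in", "1837", 0.85)],
)
conn.executemany(
    "INSERT INTO types VALUES (?, ?, ?)",
    [("scientist", "Einstein", 0.99), ("city", "Boston", 0.92)],
)
conn.executemany(
    "INSERT INTO synonyms VALUES (?, ?, ?)",
    [("invented", "did-invent", 0.85), ("invented", "created", 0.72)],
)

# A single-predicate lookup: what did Edison invent?
rows = conn.execute(
    "SELECT obj2, prob FROM facts WHERE obj1 = ? AND pred = ? "
    "ORDER BY prob DESC",
    ("Edison", "invented"),
).fetchall()
print(rows)
```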
Query Processing
For non-projecting queries, we can compute top-k answers
  Combination function is the product of probabilities
For projecting queries, we compute the disjunction of m probabilistic events
  In general NP-hard, so we approximate using the panel of experts
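
A small sketch of the two combination rules. The conjunction is the product of probabilities, as stated on the slide. For the disjunction, the code below covers only the easy special case where the m events are assumed independent; it is not the approximation the prototype uses for the general case, and all function names are hypothetical.

```python
import heapq
import math


def conjunction(probs):
    """Probability that all joined facts hold: product of probabilities."""
    return math.prod(probs)


def top_k(answers, k):
    """Return the k best (score, row) pairs.

    `answers` is an iterable of (row, [probabilities of the joined facts]).
    """
    scored = ((conjunction(ps), row) for row, ps in answers)
    return heapq.nlargest(k, scored)


def disjunction_independent(probs):
    """Probability that at least one event holds, ASSUMING independence.

    Correlated events need a different method; this shortcut does not
    apply to them.
    """
    return 1.0 - math.prod(1.0 - p for p in probs)


answers = [(("a",), [0.9, 0.5]), (("b",), [0.8, 0.9])]
print(top_k(answers, 1))
print(disjunction_independent([0.5, 0.5]))
```

Projection merges several derivations of the same output tuple, which is why a disjunction appears there and not in the non-projecting case.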
Related Work
Query Systems: CIMple (CIDR07), AVATAR (DEBul06), Liu, Dong, Halevy (WebDB06), Gubanov and Bernstein (WebDB06)
Extraction: Sarawagi (VLDB06 and others), Etzioni (WWW04), …
Probabilistic DBs: MYSTIQ, Trio, …
Deep web, reference reconciliation, …
Our prototype
Web crawl: 90M pages
Facts: 338M tuples, 102M objects
Types: 6.6M instances
Synonyms: 17k pairs
No IDs or FDs yet
Most queries in ~30 seconds
Built on DB2 with custom middleware; we want to try a compressed C-store