Structured Querying of Web Text: A Technical Challenge Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko University of Washington Asilomar, CA January 9, 2007

Structured Querying of Web Text:

A Technical Challenge

Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko

University of Washington

Asilomar, CA

January 9, 2007

Structured Queries, Unstructured Data

“Show me some people, what they invented, and the years they died”

q(?a, ?b, ?c) :- invented(?a, ?b), died-in(?a, <year> ?c)

a           b                 c     prob
Kepler      log books         1630  .7902
Heisenberg  matrix mechanics  1976  .7897
Galileo     telescope         1642  .7395
Newton      calculus          1727  .7366
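To make the query above concrete, here is a minimal sketch of how a conjunctive query of this shape could be evaluated over extracted tuples. It is an illustration under stated assumptions, not the system's actual implementation: the `FACTS` and `TYPES` contents and their probabilities are hypothetical, `run_query` is a made-up helper name, and multiplying the type probability into the score is an assumption (the slides only say that the combination function is a product of probabilities).

```python
from itertools import product

# Hypothetical extracted facts: (obj1, pred, obj2, prob). Probabilities are illustrative.
FACTS = [
    ("Newton", "invented", "calculus", 0.9),
    ("Newton", "died-in", "1727", 0.8),
    ("Galileo", "invented", "telescope", 0.5),
    ("Galileo", "died-in", "1642", 0.5),
    ("Galileo", "died-in", "Arcetri", 0.9),  # not a year, so the type constraint drops it
]

# Hypothetical type assertions: (type, instance) -> prob.
TYPES = {("year", "1727"): 1.0, ("year", "1642"): 1.0}


def run_query(facts, types):
    """Evaluate q(?a, ?b, ?c) :- invented(?a, ?b), died-in(?a, <year> ?c)."""
    invented = [f for f in facts if f[1] == "invented"]
    died_in = [f for f in facts if f[1] == "died-in"]
    answers = []
    for (a1, _, b, p1), (a2, _, c, p2) in product(invented, died_in):
        type_prob = types.get(("year", c))
        if a1 != a2 or type_prob is None:
            continue  # join on ?a failed, or ?c is not known to be a year
        # Assumption: score an answer by the product of all probabilities involved.
        answers.append((a1, b, c, p1 * p2 * type_prob))
    # Highest-scoring answers first.
    return sorted(answers, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for row in run_query(FACTS, TYPES):
        print(row)
```

The nested loop is a plain cross product with a filter; a real system would push this join into the database.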

ExDB

1. Run extractors
2. Populate data model
3. Queries

Web (sample text): “… surprising. In 1877, Edison invented the phonograph. Although he …”

Facts

Obj1    Pred      Obj2        prob
Edison  invented  light bulb  0.97
Morgan  born-in   1837        0.85

Types

Type       Instance  prob
scientist  Einstein  0.99
city       Seattle   0.92

Synonyms

Pred1     Pred2       prob
invented  did-invent  0.85
invented  created     0.72

RDBMS

Query middleware

invented(Edison ?e, ?i)
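One way the Synonyms table could be used at query time is to expand a predicate into its synonyms before matching facts. The sketch below is an assumption about how such middleware might behave, not a description of ExDB itself: `expand_predicate` and `query` are hypothetical helpers, the "created phonograph" fact and its probability are invented for illustration, synonym pairs are treated as one-directional, and weighting a match by the synonym probability is a guess.

```python
# Facts and synonyms in the shape of the tables above.
FACTS = [
    ("Edison", "invented", "light bulb", 0.97),
    ("Edison", "created", "phonograph", 0.80),  # hypothetical fact for illustration
    ("Morgan", "born-in", "1837", 0.85),
]
SYNONYMS = [
    ("invented", "did-invent", 0.85),
    ("invented", "created", 0.72),
]


def expand_predicate(pred, synonyms):
    """Return {predicate: weight}; the original predicate gets weight 1.0."""
    weights = {pred: 1.0}
    for pred1, pred2, prob in synonyms:
        if pred1 == pred:
            weights[pred2] = max(weights.get(pred2, 0.0), prob)
    return weights


def query(obj1, pred, facts, synonyms):
    """Find obj2 values for pred(obj1, ?x), also matching synonyms of pred."""
    weights = expand_predicate(pred, synonyms)
    rows = [
        (o2, prob * weights[p])  # assumption: discount by the synonym probability
        for o1, p, o2, prob in facts
        if o1 == obj1 and p in weights
    ]
    return sorted(rows, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    print(query("Edison", "invented", FACTS, SYNONYMS))
```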

Information Extraction

Each concept has an IE mechanism

Example                                 Description       IE technique
invented(Edison, phonograph)            Arity-2 fact      TextRunner
<scientist> Einstein                    Type (hypernymy)  KnowItAll
has-invented = invented                 Synonymy          DIRT
invented ⊆ discovered                   ID (troponymy)    ?
FD: has-capital(x, y) → has-capital(y)  FD (rule)         ?

Populate Data Model

Use extractions to fill tables

Facts

Obj1    Pred      Obj2        prob
Edison  invented  light bulb  0.97
Morgan  born-in   1837        0.85

Types

Type       Instance  prob
scientist  Einstein  0.99
city       Boston    0.92

Synonyms

Pred1     Pred2       prob
invented  did-invent  0.85
invented  created     0.72

IDs

Inclusion  Includer    prob
invented   discovered  0.81
Seattle    Washington  0.65

FDs

LHS            RHS         prob
capital(x, y)  capital(y)  0.77

Example source sentences:

It was big news when Edison invented the light bulb.

He visited cities such as Boston and New York.

We all know that Edison invented the light bulb. … In 1877 Edison created the light bulb.
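The five tables above map naturally onto relational tables with a probability column. The following is a minimal sketch of that layout using Python's built-in `sqlite3` module as a stand-in database; the table and column names and the `build_db` helper are assumptions made for illustration, and the rows are the sample rows from this slide.

```python
import sqlite3


def build_db():
    """Create an in-memory database holding the five sample tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE facts    (obj1 TEXT, pred TEXT, obj2 TEXT, prob REAL);
        CREATE TABLE types    (type TEXT, instance TEXT, prob REAL);
        CREATE TABLE synonyms (pred1 TEXT, pred2 TEXT, prob REAL);
        CREATE TABLE ids      (inclusion TEXT, includer TEXT, prob REAL);
        CREATE TABLE fds      (lhs TEXT, rhs TEXT, prob REAL);
        """
    )
    conn.executemany(
        "INSERT INTO facts VALUES (?, ?, ?, ?)",
        [("Edison", "invented", "light bulb", 0.97), ("Morgan", "born-in", "1837", 0.85)],
    )
    conn.executemany(
        "INSERT INTO types VALUES (?, ?, ?)",
        [("scientist", "Einstein", 0.99), ("city", "Boston", 0.92)],
    )
    conn.executemany(
        "INSERT INTO synonyms VALUES (?, ?, ?)",
        [("invented", "did-invent", 0.85), ("invented", "created", 0.72)],
    )
    conn.executemany(
        "INSERT INTO ids VALUES (?, ?, ?)",
        [("invented", "discovered", 0.81), ("Seattle", "Washington", 0.65)],
    )
    conn.execute("INSERT INTO fds VALUES (?, ?, ?)", ("capital(x, y)", "capital(y)", 0.77))
    conn.commit()
    return conn


if __name__ == "__main__":
    db = build_db()
    # Look up what Edison invented, most probable first.
    for row in db.execute(
        "SELECT obj2, prob FROM facts WHERE obj1 = ? AND pred = ? ORDER BY prob DESC",
        ("Edison", "invented"),
    ):
        print(row)
```

Keeping the probability as an ordinary column lets the query layer combine scores in SQL or in middleware.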

Query Processing

For non-projecting queries, we can compute top-k queries
    The combination function is the product of probabilities

For projecting queries, we compute the disjunction of m probabilistic events
    In general this is NP-hard, so we approximate using the panel of experts
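The two cases can be illustrated with a short sketch. The product and top-k parts follow the slide directly. The disjunction shown here assumes the m events are independent, which is the easy special case and not the general problem the slide calls hard; the panel-of-experts approximation is not specified in the slides and is not reproduced here. All function names are hypothetical.

```python
import heapq
from math import prod


def conjunction(probs):
    """Score one answer as the product of the probabilities of its tuples."""
    return prod(probs)


def disjunction_independent(probs):
    """P(at least one of m events), assuming the events are independent."""
    return 1.0 - prod(1.0 - p for p in probs)


def top_k(answers, k):
    """Return the k highest-scoring (answer, score) pairs, best first."""
    return heapq.nlargest(k, answers, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Non-projecting: each answer is a single derivation.
    scored = [("a", conjunction([0.5, 0.5])), ("b", conjunction([0.9, 1.0]))]
    print(top_k(scored, 1))
    # Projecting: one answer supported by two derivations.
    print(disjunction_independent([0.5, 0.5]))
```

When derivations share tuples the independence assumption fails, which is where an approximation becomes necessary.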

Related Work

Query Systems:
    CIMple (CIDR07), AVATAR (DEBul06)
    Liu, Dong, Halevy (WebDB06)
    Gubanov and Bernstein (WebDB06)

Extraction: Sarawagi (VLDB06 and others), Etzioni (WWW04), …

Probabilistic DBs: MYSTIQ, Trio, …

Deep web, reference reconciliation, …

Our prototype

Web crawl: 90M pages
Facts: 338M tuples, 102M objects
Types: 6.6M instances
Synonyms: 17k pairs
No IDs or FDs yet
Most queries in ~30 seconds
Built on DB2 with custom middleware; we want to try a compressed C-store