
The Web’s Many Models

Michael J. Cafarella University of Michigan

AKBCMay 19, 2010


Web Information Extraction

Much recent research in information extractors that operate over Web pages:
- Snowball (Agichtein and Gravano, 2001)
- TextRunner (Banko et al., 2007)
- Yago (Suchanek et al., 2007)
- WebTables (Cafarella et al., 2008)
- DBPedia, ExDB, Freebase (make use of IE data)

Web crawl + domain-independent IE should allow comprehensive Web KBs with:
- Very high, “web-style” recall
- “More-expressive-than-search” query processing

But where is it?

Web Information Extraction: Omnivore

“Extracting and Querying a Comprehensive Web Database.” Michael Cafarella. CIDR 2009. Asilomar, CA.

- Suggested remedies for data ingestion and user interaction
- This talk says why the ideas in that paper might already be out of date, and gives alternative ideas
- If there are mistakes here, then you have a chance to save me years of work!

Outline

- Introduction
- Data Ingestion
  - Previously: Parallel Extraction
  - Alternative: The Data-Centric Web
- User Interaction
  - Previously: Model Generation for Output
  - Alternative: Data Integration as UI
- Conclusion

Parallel Extraction

Previous hypothesis:
- Many data models for interesting data, e.g., relational tables, E/R graphs, etc.
- Should build large integration infrastructure to consume many extraction streams

Database Construction (1)

Start with a single large Web crawl


Database Construction (2)

Each of k extractors emits output that:
- Has an extractor-dependent model
- Has an extractor-and-Web-page-dependent schema

Database Construction (3)

For each extractor output, unfold into common entity-relation model
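The unfolding step can be sketched in a few lines. This is an illustrative reconstruction, not the actual Omnivore code; `unfold_table` and the sample schema are hypothetical stand-ins for one extractor's table-style output.

```python
# Hypothetical sketch: unfold one extractor's table output into a common
# entity-relation model by turning each row into (entity, attribute,
# value) triples under the table's page-dependent schema.

def unfold_table(rows, schema, key_col=0):
    """Turn each table row into (entity, attribute, value) triples."""
    triples = []
    for row in rows:
        entity = row[key_col]
        for col, value in enumerate(row):
            if col != key_col:
                triples.append((entity, schema[col], value))
    return triples

# Example: a WebTables-style extraction with its own two-column schema.
rows = [("serge abiteboul", "inria"), ("gustavo alonso", "eth zurich")]
schema = ("name", "affiliation")
triples = unfold_table(rows, schema)
# -> [('serge abiteboul', 'affiliation', 'inria'),
#     ('gustavo alonso', 'affiliation', 'eth zurich')]
```

Once every extractor's output shares this one triple vocabulary, the later unification step can operate over a single common model.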


Database Construction (4)

Unify results


Database Construction (5)

Emit final database


Potential Problems

Pressing problems:
- Recall
- Simple intra-source reconciliation
- Time

Tables and entities are probably OK for now; many data sources (DBPedia, Facebook, IMDB) already match one of these two models pretty well.

One possible different direction: the Data-Centric Web (addresses recall only).

The Data-Centric Web

(a sequence of figure slides; the illustrations are not preserved in this transcript)

Data-Centric Lists

Lists of Data-Centric Entities give hints:
- About what the target entity contains
- That all members of the set are DCEs, or not
- That members of the set belong to a class or type (e.g., program committee members)

Build the Data-Centric Web

1. Download the Web
2. Train classifiers to detect DCEs, DCLs
3. Filter out all pages that fail both tests
4. Use lists to fix up incorrect Data-Centric Entity classifications
5. Run attribute/value extractors on DCEs

Yields an E/R dataset, for insertion into DBPedia, YAGO, etc.

In progress now… with student Ashwin Balakrishnan; entity detector >95% accuracy.
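Steps 2–4 above can be sketched as a small pipeline. This is a toy illustration under stated assumptions: `is_dce`, `is_dcl`, and `list_members` are hypothetical stand-ins for the actual classifiers and link extractor, which this talk does not describe in detail.

```python
# Toy sketch of steps 2-4: classify pages, filter, then use Data-Centric
# Lists as evidence to repair missed entity classifications.

def build_data_centric_web(pages, is_dce, is_dcl, list_members):
    dces = {p for p in pages if is_dce(p)}   # steps 2-3: keep DCE pages...
    dcls = [p for p in pages if is_dcl(p)]   # ...and DCL pages
    for lst in dcls:                         # step 4: lists as evidence
        members = list_members(lst)
        # If most members of a list are known DCEs, the remaining members
        # probably are too, even when the classifier missed them.
        if members and sum(m in dces for m in members) / len(members) > 0.5:
            dces.update(members)
    return dces, dcls

# Toy run: the classifier misses page "c", but its list repairs that.
pages = ["a", "b", "c", "list1"]
dces, dcls = build_data_centric_web(
    pages,
    is_dce=lambda p: p in {"a", "b"},
    is_dcl=lambda p: p == "list1",
    list_members=lambda lst: ["a", "b", "c"],
)
# "c" is now classified as a DCE via its list membership
```

The majority threshold here is invented for illustration; any real detector would weigh classifier confidence as well.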

Research Question 1

How many useful entities…
- Lack a page in the Data-Centric Web? (That means no homepage, no Amazon page, no public Facebook page, etc.)
- AND are otherwise well-described enough online that IE can recover an entity-centric view?

Put differently: does every entity worth extracting already have a homepage on the Web?

Research Question 2

Does a single real-world entity have more than one “authoritative” URL? Note that Wikipedia provides pretty minimal assistance in choosing the right entity, but does a good job.

Outline

- Introduction
- Data Ingestion
  - Previously: Parallel Extraction
  - Alternative: The Data-Centric Web
- User Interaction
  - Previously: Model Generation for Output
  - Alternative: Data Integration as UI
- Conclusion

Model Generation for Output

Previous hypothesis:
- Many different user applications built against a single back-end database
- The difficult task is translating from the back-end data model to the application’s data model

Query Processing (1)

Query arrives at system


Query Processing (2)

Entity-relation database processor yields entity results


Query Processing (3)

Query Renderer chooses appropriate output schema


Query Processing (4)

User corrections are logged and fed into later iterations of db construction
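The four query-processing steps above can be condensed into a short sketch. All names here (`er_db`, `choose_schema`, the sample data) are illustrative stand-ins, not the system's actual interfaces.

```python
# Minimal sketch of the query-processing loop: (1) a query arrives,
# (2) the entity-relation store yields entity results, (3) a renderer
# chooses an output schema, (4) user corrections are logged for the next
# iteration of database construction. All names are hypothetical.

def answer_query(query, er_db, choose_schema):
    entities = er_db.get(query, [])            # (2) entity dicts
    schema = choose_schema(query, entities)    # (3) pick output schema
    rows = [[e.get(a) for a in schema] for e in entities]
    return schema, rows

corrections = []  # (4) fed into later db-construction iterations

db = {"vldb pc": [{"name": "serge abiteboul", "affil": "inria"}]}
schema, rows = answer_query("vldb pc", db, lambda q, es: ["name", "affil"])
# rows == [['serge abiteboul', 'inria']]
corrections.append(("vldb pc", "affil: INRIA"))
```

The point of the loop is the feedback edge: the correction log closes the cycle back to database construction rather than staying inside the query path.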


Potential Problems

Many plausible front-end applications, but none yet totally compelling and novel:
- Ad- and search-driven ones: not novel
- Freebase, Wolfram Alpha: not compelling
- Raw input to learners: useful, but not an end-user application

Need to explore possible applications rather than build multi-app infrastructure.

One possible different direction: data integration as a user primitive.

Data Integration as UI

Can we combine tables to create new data sources? Many existing “mashup” tools ignore the realities of Web data:
- A lot of useful data is not in XML
- The user cannot know all sources in advance
- Transient integrations
- Dirty data

Interaction Challenge

Try to create a database of all “VLDB program committee members”

Octopus

Provides a “workbench” of data integration operators to build the target database:
- Most operators are not correct/incorrect, but high/low quality (like search)
- Also, prosaic traditional operators
- Originally ran on WebTables data [VLDB 2009, Cafarella, Khoussainova, Halevy]

Walkthrough - Operator #1

SEARCH(“VLDB program committee members”) returns candidate tables:

  serge abiteboul    inria
  anastassia ail…    carnegie…
  gustavo alonso     etz zurich
  …                  …

  serge abiteboul    inria
  michael adiba      …grenoble
  antonio albano     …pisa
  …                  …

Walkthrough - Operator #2: Recover relevant data

CONTEXT() is applied to each retrieved table:

  serge abiteboul    inria
  michael adiba      …grenoble
  antonio albano     …pisa
  …                  …

  serge abiteboul    inria
  anastassia ail…    carnegie…
  gustavo alonso     etz zurich
  …                  …

CONTEXT() recovers values from each table’s source page, here the PC year:

  serge abiteboul    inria        1996
  michael adiba      …grenoble    1996
  antonio albano     …pisa        1996
  …                  …            …

  serge abiteboul    inria        2005
  anastassia ail…    carnegie…    2005
  gustavo alonso     etz zurich   2005
  …                  …            …

Walkthrough - Union: Combine datasets

Union() merges the two CONTEXT-augmented tables into one:

  serge abiteboul    inria        1996
  michael adiba      …grenoble    1996
  antonio albano     …pisa        1996
  serge abiteboul    inria        2005
  anastassia ail…    carnegie…    2005
  gustavo alonso     etz zurich   2005
  …                  …            …

Walkthrough - Operator #3: Add column to data

Similar to “join”, but the join target is a topic:

EXTEND(“publications”, col=0)

  serge abiteboul    inria        1996   “Large Scale P2P Dist…”
  michael adiba      …grenoble    1996   “Exploiting bitemporal…”
  antonio albano     …pisa        1996   “Another Example of a…”
  serge abiteboul    inria        2005   “Large Scale P2P Dist…”
  anastassia ail…    carnegie…    2005   “Efficient Use of the…”
  gustavo alonso     etz zurich   2005   “A Dynamic and Flexible…”
  …                  …            …      …

- The user has integrated data sources with little effort
- No wrappers; the data was never intended for reuse
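The Union and EXTEND steps above can be mimicked with toy operators. These are illustrative stand-ins, not the real Octopus operators, which also rank, clean, and align candidate tables; the `pubs` lookup is invented for the example.

```python
# Toy versions of Union and EXTEND showing only the data flow.

def union(tables):
    # Concatenate rows; the real operator also aligns columns across tables.
    return [row for t in tables for row in t]

def extend(table, lookup, col=0):
    # "Join by topic": append a per-entity value keyed on column `col`,
    # rather than joining against one fixed table.
    return [row + (lookup.get(row[col], ""),) for row in table]

pc_1996 = [("serge abiteboul", "inria", 1996)]
pc_2005 = [("gustavo alonso", "etz zurich", 2005)]
pubs = {"serge abiteboul": "Large Scale P2P Dist…"}  # invented lookup

extended = extend(union([pc_1996, pc_2005]), pubs, col=0)
# The first row gains a publication; rows with no match get an empty cell.
```

Note how the operators compose: each takes and returns plain tables, which is what lets a user chain them interactively without writing wrappers.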

CONTEXT Algorithms

Input: a table and its source page. Output: data values to add to the table.

SignificantTerms sorts terms in the source page by “importance” (tf-idf).
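A SignificantTerms-style ranking can be sketched with a generic tf-idf scorer. This assumes precomputed corpus document frequencies and is not the algorithm's actual implementation; the statistics below are invented for illustration.

```python
import math
from collections import Counter

# Generic tf-idf ranking in the spirit of SignificantTerms: score each of
# a source page's terms by its frequency on the page times its rarity in
# the corpus, then return the top k.

def significant_terms(page_terms, corpus_dfs, n_docs, k=5):
    tf = Counter(page_terms)
    scores = {t: tf[t] * math.log(n_docs / (1 + corpus_dfs.get(t, 0)))
              for t in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# A term frequent on the page but rare in the corpus ranks highest:
terms = ["vldb", "vldb", "1996", "the", "the", "the"]
dfs = {"the": 900, "vldb": 3, "1996": 40}   # corpus document frequencies
top = significant_terms(terms, dfs, n_docs=1000, k=2)
# -> ['vldb', '1996']
```

Terms like "vldb" and "1996" surfacing this way is exactly what lets CONTEXT attach the PC year to every row of a recovered table.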

Related View Partners

Looks for different “views” of the same data.

CONTEXT Experiments

(results figure not preserved in this transcript)

Data Integration as UI

Compelling for database researchers, but will large numbers of people use it?

Conclusion

- Automatic Web KBs are rapidly progressing
- Recall is still not good enough for many tasks, but progress is rapid
- It is not clear what those tasks should be, and progress there is much slower
  - Difficult to predict what’s useful
  - Sometimes difficult to write a “new app” paper
- Omnivore’s approach was not wrong, but it did not directly address these problems
