
The Web’s Many Models

Michael J. Cafarella University of Michigan

AKBCMay 19, 2010


Web Information Extraction

Much recent research in information extractors that operate over Web pages:
- Snowball (Agichtein and Gravano, 2001)
- TextRunner (Banko et al., 2007)
- Yago (Suchanek et al., 2007)
- WebTables (Cafarella et al., 2008)
- DBPedia, ExDB, Freebase (make use of IE data)

Web crawl + domain-independent IE should allow comprehensive Web KBs with:
- Very high, “web-style” recall
- “More-expressive-than-search” query processing

But where is it?

Web Information Extraction: Omnivore

“Extracting and Querying a Comprehensive Web Database.” Michael Cafarella. CIDR 2009. Asilomar, CA.

- Suggested remedies for data ingestion and user interaction
- This talk says why the ideas in that paper might already be out of date, and gives alternative ideas
- If there are mistakes here, then you have a chance to save me years of work!

Outline

- Introduction
- Data Ingestion
  - Previously: Parallel Extraction
  - Alternative: The Data-Centric Web
- User Interaction
  - Previously: Model Generation for Output
  - Alternative: Data Integration as UI
- Conclusion

Parallel Extraction

Previous hypothesis:
- Many data models for interesting data, e.g., relational tables, E/R graphs, etc.
- Should build large integration infrastructure to consume many extraction streams

Database Construction (1)

Start with a single large Web crawl


Database Construction (2)

Each of k extractors emits output that:
- Has an extractor-dependent model
- Has an extractor-and-Web-page-dependent schema

Database Construction (3)

For each extractor output, unfold into common entity-relation model
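The unfolding step can be sketched in a few lines. This is an illustrative reconstruction, not the actual Omnivore code; `unfold_table` and the sample schema are hypothetical stand-ins for one extractor's table-style output.

```python
# Hypothetical sketch: unfold one extractor's table output into a common
# entity-relation model by turning each row into (entity, attribute,
# value) triples under the table's page-dependent schema.

def unfold_table(rows, schema, key_col=0):
    """Turn each table row into (entity, attribute, value) triples."""
    triples = []
    for row in rows:
        entity = row[key_col]
        for col, value in enumerate(row):
            if col != key_col:
                triples.append((entity, schema[col], value))
    return triples

# Example: a WebTables-style extraction with its own two-column schema.
rows = [("serge abiteboul", "inria"), ("gustavo alonso", "eth zurich")]
schema = ("name", "affiliation")
triples = unfold_table(rows, schema)
# -> [('serge abiteboul', 'affiliation', 'inria'),
#     ('gustavo alonso', 'affiliation', 'eth zurich')]
```

Once every extractor's output shares this one triple vocabulary, the later unification step can operate over a single common model.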


Database Construction (4)

Unify results


Database Construction (5)

Emit final database


Potential Problems

Pressing problems:
- Recall
- Simple intra-source reconciliation
- Time

Tables and entities are probably OK for now; many data sources (DBPedia, Facebook, IMDB) already match one of these two models pretty well.

One possible different direction: the Data-Centric Web (addresses recall only).

The Data-Centric Web

(a sequence of figure slides; the illustrations are not preserved in this transcript)

Data-Centric Lists

Lists of Data-Centric Entities give hints:
- About what the target entity contains
- That all members of the set are DCEs, or not
- That members of the set belong to a class or type (e.g., program committee members)

Build the Data-Centric Web

1. Download the Web
2. Train classifiers to detect DCEs, DCLs
3. Filter out all pages that fail both tests
4. Use lists to fix up incorrect Data-Centric Entity classifications
5. Run attribute/value extractors on DCEs

Yields an E/R dataset, for insertion into DBPedia, YAGO, etc.

In progress now… with student Ashwin Balakrishnan; entity detector >95% accuracy.
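Steps 2–4 above can be sketched as a small pipeline. This is a toy illustration under stated assumptions: `is_dce`, `is_dcl`, and `list_members` are hypothetical stand-ins for the actual classifiers and link extractor, which this talk does not describe in detail.

```python
# Toy sketch of steps 2-4: classify pages, filter, then use Data-Centric
# Lists as evidence to repair missed entity classifications.

def build_data_centric_web(pages, is_dce, is_dcl, list_members):
    dces = {p for p in pages if is_dce(p)}   # steps 2-3: keep DCE pages...
    dcls = [p for p in pages if is_dcl(p)]   # ...and DCL pages
    for lst in dcls:                         # step 4: lists as evidence
        members = list_members(lst)
        # If most members of a list are known DCEs, the remaining members
        # probably are too, even when the classifier missed them.
        if members and sum(m in dces for m in members) / len(members) > 0.5:
            dces.update(members)
    return dces, dcls

# Toy run: the classifier misses page "c", but its list repairs that.
pages = ["a", "b", "c", "list1"]
dces, dcls = build_data_centric_web(
    pages,
    is_dce=lambda p: p in {"a", "b"},
    is_dcl=lambda p: p == "list1",
    list_members=lambda lst: ["a", "b", "c"],
)
# "c" is now classified as a DCE via its list membership
```

The majority threshold here is invented for illustration; any real detector would weigh classifier confidence as well.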

Research Question 1

How many useful entities…
- Lack a page in the Data-Centric Web? (That means no homepage, no Amazon page, no public Facebook page, etc.)
- AND are otherwise well-described enough online that IE can recover an entity-centric view?

Put differently: does every entity worth extracting already have a homepage on the Web?

Research Question 2

Does a single real-world entity have more than one “authoritative” URL? Note that Wikipedia provides pretty minimal assistance in choosing the right entity, but does a good job.

Outline

- Introduction
- Data Ingestion
  - Previously: Parallel Extraction
  - Alternative: The Data-Centric Web
- User Interaction
  - Previously: Model Generation for Output
  - Alternative: Data Integration as UI
- Conclusion

Model Generation for Output

Previous hypothesis:
- Many different user applications built against a single back-end database
- The difficult task is translating from the back-end data model to the application’s data model

Query Processing (1)

Query arrives at system


Query Processing (2)

Entity-relation database processor yields entity results


Query Processing (3)

Query Renderer chooses appropriate output schema


Query Processing (4)

User corrections are logged and fed into later iterations of db construction
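The four query-processing steps above can be condensed into a short sketch. All names here (`er_db`, `choose_schema`, the sample data) are illustrative stand-ins, not the system's actual interfaces.

```python
# Minimal sketch of the query-processing loop: (1) a query arrives,
# (2) the entity-relation store yields entity results, (3) a renderer
# chooses an output schema, (4) user corrections are logged for the next
# iteration of database construction. All names are hypothetical.

def answer_query(query, er_db, choose_schema):
    entities = er_db.get(query, [])            # (2) entity dicts
    schema = choose_schema(query, entities)    # (3) pick output schema
    rows = [[e.get(a) for a in schema] for e in entities]
    return schema, rows

corrections = []  # (4) fed into later db-construction iterations

db = {"vldb pc": [{"name": "serge abiteboul", "affil": "inria"}]}
schema, rows = answer_query("vldb pc", db, lambda q, es: ["name", "affil"])
# rows == [['serge abiteboul', 'inria']]
corrections.append(("vldb pc", "affil: INRIA"))
```

The point of the loop is the feedback edge: the correction log closes the cycle back to database construction rather than staying inside the query path.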


Potential Problems

Many plausible front-end applications, but none yet totally compelling and novel:
- Ad- and search-driven ones: not novel
- Freebase, Wolfram Alpha: not compelling
- Raw input to learners: useful, but not an end-user application

Need to explore possible applications rather than build multi-app infrastructure.

One possible different direction: data integration as a user primitive.

Data Integration as UI

Can we combine tables to create new data sources? Many existing “mashup” tools ignore the realities of Web data:
- A lot of useful data is not in XML
- The user cannot know all sources in advance
- Transient integrations
- Dirty data

Interaction Challenge

Try to create a database of all “VLDB program committee members”

Octopus

Provides a “workbench” of data integration operators to build the target database:
- Most operators are not correct/incorrect, but high/low quality (like search)
- Also, prosaic traditional operators
- Originally ran on WebTables data [VLDB 2009, Cafarella, Khoussainova, Halevy]

Walkthrough - Operator #1

SEARCH(“VLDB program committee members”) returns candidate tables:

  serge abiteboul    inria
  anastassia ail…    carnegie…
  gustavo alonso     etz zurich
  …                  …

  serge abiteboul    inria
  michael adiba      …grenoble
  antonio albano     …pisa
  …                  …

Walkthrough - Operator #2: Recover relevant data

CONTEXT() is applied to each retrieved table:

  serge abiteboul    inria
  michael adiba      …grenoble
  antonio albano     …pisa
  …                  …

  serge abiteboul    inria
  anastassia ail…    carnegie…
  gustavo alonso     etz zurich
  …                  …

CONTEXT() recovers values from each table’s source page, here the PC year:

  serge abiteboul    inria        1996
  michael adiba      …grenoble    1996
  antonio albano     …pisa        1996
  …                  …            …

  serge abiteboul    inria        2005
  anastassia ail…    carnegie…    2005
  gustavo alonso     etz zurich   2005
  …                  …            …

Walkthrough - Union: Combine datasets

Union() merges the two CONTEXT-augmented tables into one:

  serge abiteboul    inria        1996
  michael adiba      …grenoble    1996
  antonio albano     …pisa        1996
  serge abiteboul    inria        2005
  anastassia ail…    carnegie…    2005
  gustavo alonso     etz zurich   2005
  …                  …            …

Walkthrough - Operator #3: Add column to data

Similar to “join”, but the join target is a topic:

EXTEND(“publications”, col=0)

  serge abiteboul    inria        1996   “Large Scale P2P Dist…”
  michael adiba      …grenoble    1996   “Exploiting bitemporal…”
  antonio albano     …pisa        1996   “Another Example of a…”
  serge abiteboul    inria        2005   “Large Scale P2P Dist…”
  anastassia ail…    carnegie…    2005   “Efficient Use of the…”
  gustavo alonso     etz zurich   2005   “A Dynamic and Flexible…”
  …                  …            …      …

- The user has integrated data sources with little effort
- No wrappers; the data was never intended for reuse
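The Union and EXTEND steps above can be mimicked with toy operators. These are illustrative stand-ins, not the real Octopus operators, which also rank, clean, and align candidate tables; the `pubs` lookup is invented for the example.

```python
# Toy versions of Union and EXTEND showing only the data flow.

def union(tables):
    # Concatenate rows; the real operator also aligns columns across tables.
    return [row for t in tables for row in t]

def extend(table, lookup, col=0):
    # "Join by topic": append a per-entity value keyed on column `col`,
    # rather than joining against one fixed table.
    return [row + (lookup.get(row[col], ""),) for row in table]

pc_1996 = [("serge abiteboul", "inria", 1996)]
pc_2005 = [("gustavo alonso", "etz zurich", 2005)]
pubs = {"serge abiteboul": "Large Scale P2P Dist…"}  # invented lookup

extended = extend(union([pc_1996, pc_2005]), pubs, col=0)
# The first row gains a publication; rows with no match get an empty cell.
```

Note how the operators compose: each takes and returns plain tables, which is what lets a user chain them interactively without writing wrappers.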

CONTEXT Algorithms

Input: a table and its source page. Output: data values to add to the table.

SignificantTerms sorts terms in the source page by “importance” (tf-idf).
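A SignificantTerms-style ranking can be sketched with a generic tf-idf scorer. This assumes precomputed corpus document frequencies and is not the algorithm's actual implementation; the statistics below are invented for illustration.

```python
import math
from collections import Counter

# Generic tf-idf ranking in the spirit of SignificantTerms: score each of
# a source page's terms by its frequency on the page times its rarity in
# the corpus, then return the top k.

def significant_terms(page_terms, corpus_dfs, n_docs, k=5):
    tf = Counter(page_terms)
    scores = {t: tf[t] * math.log(n_docs / (1 + corpus_dfs.get(t, 0)))
              for t in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# A term frequent on the page but rare in the corpus ranks highest:
terms = ["vldb", "vldb", "1996", "the", "the", "the"]
dfs = {"the": 900, "vldb": 3, "1996": 40}   # corpus document frequencies
top = significant_terms(terms, dfs, n_docs=1000, k=2)
# -> ['vldb', '1996']
```

Terms like "vldb" and "1996" surfacing this way is exactly what lets CONTEXT attach the PC year to every row of a recovered table.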

Related View Partners

Looks for different “views” of the same data.

CONTEXT Experiments

(results figure not preserved in this transcript)

Data Integration as UI

Compelling for database researchers, but will large numbers of people use it?

Conclusion

- Automatic Web KBs are rapidly progressing
- Recall is still not good enough for many tasks, but progress is rapid
- It is not clear what those tasks should be, and progress there is much slower
  - Difficult to predict what’s useful
  - Sometimes difficult to write a “new app” paper
- Omnivore’s approach was not wrong, but it did not directly address these problems
