Data and Information Extraction on the Web

Gestione delle Informazioni su Web (Web Information Management) - 2009/2010

Tommaso Teofili

tommaso [at] apache [dot] org

Monday, 12 April 2010

Agenda

Search

Goals

Problems

Data extraction

Information extraction

Mixing things together


Search - Goals

Find what we are looking for

Quickly

Easily

Get suggestions for other interesting, related content

Turn results into useful knowledge


What are you looking for?

Problems when googling

Where to search for what we are looking for

How to write good queries (e.g., which relations hold between terms?)

How to evaluate whether a query is good


Search sources

Redundant, heterogeneous, widespread, public, noisy, free, sometimes standard, semi-structured, linked, reachable...

in one word: the Web


Focused search sources

Target the sources that are relevant to the desired domain

Where possible, filter out the unclean and fragmented ones

Choose the most standard and well structured ones


Fragmented sources

Structured sources

Data extraction

Automatically collect data from the Web

Crawl data from domain specific sources

Aggregate homogeneous data (e.g., using equivalence classes)

Save (portions of downloaded) data to a convenient separate storage (DB, file system, repository, etc.)


Data extraction - Crawling

From scratch (good luck!)

Leveraging existing facilities (wget, HtmlUnit, Selenium, Apache HttpClient, Ning’s Async HttpClient, etc.)

Playing with existing projects (RoadRunner, Webpipe, Apache Nutch, etc.)
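
To make the "from scratch" option concrete, here is a minimal sketch of a breadth-first crawler in Python, using only the standard library. It is not how Nutch or any of the tools above work. The names `LinkParser`, `crawl`, and `FAKE_SITE`, and the `example.org` URLs, are assumptions made for illustration. The `fetch` argument is injected so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=50):
    """Breadth-first crawl that stays on the seed's host.

    `fetch` is any callable mapping a URL to its HTML text (None on failure).
    Returns a dict {url: html} of the downloaded pages.
    """
    host = urlparse(seed).netloc
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve, drop fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# A tiny in-memory "site" standing in for real HTTP responses (hypothetical).
FAKE_SITE = {
    "http://example.org/": '<a href="/teams/italy">Italy</a> <a href="http://other.example/">x</a>',
    "http://example.org/teams/italy": '<a href="/players/totti">Totti</a> <a href="/">home</a>',
    "http://example.org/players/totti": "<p>Francesco Totti</p>",
}

pages = crawl("http://example.org/", FAKE_SITE.get)
print(sorted(pages))
```

For real use, `fetch` could wrap `urllib.request.urlopen`. Keeping it as a parameter makes it easy to add politeness delays, caching, or a different HTTP client later.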


Data extraction - HttpClient

Data extraction - HtmlUnit

Data extraction - Aggregating

Downloaded resources can be assigned to equivalence classes

The crawling process implicitly defines page classes and automatically assigns each page to one of them

Relations between page classes

RoadRunner, Webpipe, etc.
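
As a rough illustration of assigning pages to equivalence classes, the sketch below groups pages whose HTML shares exactly the same set of root-to-element tag paths. This is a deliberately crude stand-in and is not the algorithm used by RoadRunner or Webpipe. The names `PathCollector`, `signature`, and `group_by_structure`, and the sample pages, are assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag and so must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Records the set of root-to-element tag paths found in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def signature(html):
    """Structural fingerprint of a page: its set of tag paths."""
    collector = PathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def group_by_structure(pages):
    """Put pages with identical signatures in the same equivalence class."""
    classes = defaultdict(list)
    for url, html in pages.items():
        classes[signature(html)].append(url)
    return sorted(sorted(urls) for urls in classes.values())


# Hypothetical pages: two share the "players" template, one is a "teams" page.
sample_pages = {
    "/players/totti": "<html><body><h1>Francesco Totti</h1><table><tr><td>Italy</td></tr></table></body></html>",
    "/players/other": "<html><body><h1>Another Player</h1><table><tr><td>Italy</td></tr></table></body></html>",
    "/teams/italy": "<html><body><h1>Italy</h1><ul><li>Totti</li><li>Other</li></ul></body></html>",
}

page_classes = group_by_structure(sample_pages)
print(page_classes)
```

Exact signature equality is brittle on real sites, where pages of the same class differ in optional blocks. A similarity threshold over the path sets would be a natural next step.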


Data extraction - EC

“players” class

“teams indexes” class

“coaches” class

“teams” class


Data extraction - Relevance

What do we really need?

Depending on the specific domain

Not every page in every class is relevant

We could be interested only in a subset of the found page classes


Data extraction - Example

We may be interested in retrieving only information regarding players (Player class)


Data extraction - Problems

Server unavailability and error responses (HTTP 404, 403, 503, etc.)

Security and bandwidth filters (don’t get your crawler machine IP banned!)

Client unavailability (memory and storage space are unlimited only in theory)

Encoding

Legal issues

...
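
Two of the problems above, temporary server errors and encoding, can be handled defensively. The sketch below retries transient failures with exponential backoff and decodes a response body with fallbacks. `TransientError`, `fetch_with_retry`, `decode_body`, and the flaky fetcher are all hypothetical names introduced here, and the retry policy is only one reasonable choice.

```python
import time


class TransientError(Exception):
    """Raised by a fetcher for failures worth retrying (e.g. HTTP 503)."""


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying transient failures with exponential backoff.

    Returns whatever fetch returns, or None once all attempts have failed.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except TransientError:
            if attempt < retries - 1:
                sleep(delay * 2 ** attempt)  # wait delay, 2*delay, 4*delay, ...
    return None


def decode_body(raw, declared=None, fallbacks=("utf-8", "latin-1")):
    """Decode bytes using the declared charset first, then common fallbacks."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("utf-8", errors="replace")


# Simulated server that fails twice before answering in Latin-1.
attempts = []


def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise TransientError("503 Service Unavailable")
    return "Città".encode("latin-1")


waits = []  # record the backoff delays instead of really sleeping
body = fetch_with_retry(flaky_fetch, "http://example.org/", sleep=waits.append)
text = decode_body(body)
print(text, waits)
```

Waiting between attempts also helps with the bandwidth-filter problem, since a crawler that hammers a failing server is more likely to get banned.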


From Data to Information

Data vs Information

Data

Raw

Semi-structured

Mixed content

Immutable

Navigation oriented

Information

Clean

Structured

Focused

Managed

Domain oriented


From Data to Information

We have crawled a lot of data

We may have some rough structure (page classes and relations)

We want to pick only what we need


Information extraction - Pruning

We want to filter out at least:

Banners, advertisement, etc.

Headers/Footers

Navigation bars/Search boxes

Everything else unrelated to the content

We may use XPath
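
A minimal sketch of XPath-based pruning follows, using the limited XPath subset supported by Python's `xml.etree.ElementTree`. It assumes the page is well-formed XHTML; real-world HTML usually needs a tolerant parser first. The page markup, the `id` and `class` values, and the names `NOISE_XPATHS` and `prune` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page with content surrounded by noise.
PAGE = """<html><body>
<div id="header">Site logo</div>
<div class="nav"><a href="/">Home</a></div>
<div id="content"><h1>Francesco Totti</h1><p>Team: Italy</p></div>
<div class="banner">Buy now!</div>
<div id="footer">Copyright</div>
</body></html>"""

# XPath expressions selecting the blocks we want to filter out.
NOISE_XPATHS = [
    ".//div[@id='header']",
    ".//div[@id='footer']",
    ".//div[@class='nav']",
    ".//div[@class='banner']",
]


def prune(root, xpaths):
    """Remove every element matched by one of the XPath expressions."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for xpath in xpaths:
        for node in root.findall(xpath):
            parent_of[node].remove(node)
    return root


root = prune(ET.fromstring(PAGE), NOISE_XPATHS)
remaining = [div.get("id") for div in root.iter("div")]
text = " ".join(" ".join(root.itertext()).split())
print(remaining, text)
```

A blacklist like this must be maintained per source. The alternative is a whitelist that keeps only the content node, which is simpler when the content container is easy to identify.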


Information extraction

Once we have extracted content

We are now interested in getting useful information from it -> knowledge

Look for matches between the extracted data and our domain model


Information extraction - Example

Navigate XML (HTML DOM) nodes with XPath

Navigate content and find specific “parts” (nodes or sub-trees)

Tag such “parts” as objects or properties inside a (specific) domain model

May need to traverse the DOM multiple times


Information extraction - Name


Information extraction - Date of Birth


Information extraction - Team


Information extraction - Example

A Player (taken from the Player page class)

with a name, a date of birth, and a team

We now know that “Francesco Totti” is a Player on the “Italy” team and was born on “27/09/1976”

We can apply such XPaths to all PageClass instances and get information about each player
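
The example above can be sketched as one XPath per property, applied to a page of the Player class and mapped onto a domain object. The page layout, the XPath expressions, and the names `Player`, `PLAYER_XPATHS`, and `extract_player` are assumptions for illustration. Only the extracted values (name, team, date of birth) come from the slides. The sketch assumes well-formed XHTML and uses the XPath subset of `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Player:
    name: str
    date_of_birth: date
    team: str


# One XPath per property of the domain object (hypothetical page layout).
PLAYER_XPATHS = {
    "name": ".//div[@id='content']/h1",
    "date_of_birth": ".//td[@class='dob']",
    "team": ".//td[@class='team']/a",
}


def extract_player(xhtml):
    """Apply the XPaths to one page of the Player class."""
    root = ET.fromstring(xhtml)
    raw = {}
    for prop, xpath in PLAYER_XPATHS.items():
        value = root.findtext(xpath)
        if value is None:  # the template changed or the page is in another class
            raise ValueError(f"no match for {prop!r}: {xpath}")
        raw[prop] = value.strip()
    return Player(
        name=raw["name"],
        date_of_birth=datetime.strptime(raw["date_of_birth"], "%d/%m/%Y").date(),
        team=raw["team"],
    )


PAGE = """<html><body><div id="content">
<h1>Francesco Totti</h1>
<table><tr><td class="dob">27/09/1976</td><td class="team"><a href="/teams/italy">Italy</a></td></tr></table>
</div></body></html>"""

player = extract_player(PAGE)
print(player)
```

Failing loudly when an XPath matches nothing is a deliberate choice here: on frequently changing sources, a silent `None` tends to turn into corrupted extracted data.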


Information extraction - Wrapper

Context navigation

RoadRunner

Webpipe

Statistical analysis

ExAlg

Other...


Information extraction - Problems

Poorly structured sources

Frequently changing sources

False positives

Corrupted extracted data


False positives

Information extraction - Relevance

Using wrappers we can get a lot of information

We could rank what is relevant in:

the “page” context

the domain model

For efficiency and “reasoning” purposes


Information extraction - Metadata

Stream extracted information into our domain model

Extracted information -> Metadata

Populated domain objects contain

interesting semantics

relations


Store Metadata

DB (with classic relational schema)

Filesystem (XML)

Key-Value repository

Index

Triple Store

...


Query enriched data

Exploit the semantics of the acquired metadata to build SQL-like queries (over the attributes and relations of our domain model) on previously unstructured data

Extract hidden knowledge by querying the aggregated metadata


Sample queries

Get “young players”

SELECT * FROM giocatore g WHERE g.dob > '1993-01-01'

Aggregate queries

Find the average age in each team

Find the average age of World Cup players
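
The sample queries can be tried against a relational store with Python's built-in `sqlite3` module. The sketch below is only one possible storage choice. The table name `giocatore` follows the query above, while the column layout, the two made-up records, and the reference date are assumptions. Dates are stored as ISO 8601 text, which compares correctly as plain strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE giocatore (name TEXT, dob TEXT, team TEXT)")
conn.executemany(
    "INSERT INTO giocatore VALUES (?, ?, ?)",
    [
        ("Francesco Totti", "1976-09-27", "Italy"),
        ("Player A", "1994-03-15", "Italy"),    # made-up record
        ("Player B", "1980-06-01", "Brazil"),   # made-up record
    ],
)

# "Young players": everyone born after 1 January 1993.
young = conn.execute(
    "SELECT name FROM giocatore WHERE dob > '1993-01-01' ORDER BY name"
).fetchall()

# Average age (in years) per team, measured at a fixed reference date.
REFERENCE = "2010-04-12"
avg_age_by_team = conn.execute(
    """
    SELECT team, ROUND(AVG((julianday(?) - julianday(dob)) / 365.25), 1)
    FROM giocatore
    GROUP BY team
    ORDER BY team
    """,
    (REFERENCE,),
).fetchall()

print(young)
print(avg_age_by_team)
```

Dropping the `GROUP BY team` clause gives the average age over all stored players, which corresponds to the "World Cup players" query if the table holds only World Cup squads.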


Information extraction on the Web


References

http://www.w3.org/TR/xpath/

http://www.w3.org/DOM/

http://www.dia.uniroma3.it/db/roadRunner/

http://www.slideshare.net/n0on3/exalg-overview

http://www.ricercaitaliana.it/prin/unita_op-2006093591_002.htm

http://incubator.apache.org/uima/downloads/releaseDocs/2.3.0-incubating/docs/html/overview_and_setup/overview_and_setup.html

http://en.wikipedia.org/wiki/Web_scraping

http://www.alchemyapi.com/api/scrape/



















