Data and Information Extraction on the Web

Data and Information Extraction on the Web - Gestione delle Informazioni su Web (Information Management on the Web) 2009/2010 - Tommaso Teofili, tommaso [at] apache [dot] org - Monday, 12 April 2010

Upload: tommaso-teofili

Post on 15-Jan-2015

DESCRIPTION

Slides about "Information and Data Extraction on the Web" for "Information management on the Web" course at DIA (Computer Science Department) of Roma Tre University

TRANSCRIPT

Page 1: Data and Information Extraction on the Web

Data and Information Extraction on the Web

Gestione delle Informazioni su Web (Information Management on the Web) - 2009/2010

Tommaso Teofili

tommaso [at] apache [dot] org

Monday, 12 April 2010

Page 2: Data and Information Extraction on the Web

Agenda

Search

Goals

Problems

Data extraction

Information extraction

Mixing things together

Page 3: Data and Information Extraction on the Web

Search - Goals

Find what we are looking for

Quickly

Easily

Get suggestions for other interesting, related content

Turn results into useful knowledge

Page 4: Data and Information Extraction on the Web

What are you looking for?

Page 5: Data and Information Extraction on the Web

Problems when googling

Where to search for what we are looking for

How to write good queries (e.g., which relations between terms?)

How to evaluate when a query is good

Page 6: Data and Information Extraction on the Web

Search sources

Redundant, heterogeneous, widespread, public, noisy, free, sometimes standard, semi-structured, linked, reachable...

in one word: the Web

Page 7: Data and Information Extraction on the Web

Focused search sources

Target the sources that are interesting for the desired domain

Where possible, filter out the unclean and fragmented ones

Choose the most standard and well structured ones

Page 8: Data and Information Extraction on the Web

Fragmented sources

Page 9: Data and Information Extraction on the Web

Structured sources

Page 10: Data and Information Extraction on the Web

Data extraction

Automatically collect data from the Web

Crawl data from domain specific sources

Aggregate homogeneous data (e.g., using equivalence classes)

Save (portions of downloaded) data to a convenient separate storage (DB, file system, repository, etc.)

Page 11: Data and Information Extraction on the Web

Data extraction - Crawling

From scratch (good luck!)

Leveraging existing facilities (wget, HtmlUnit, Selenium, Apache HttpClient, Ning’s Async HttpClient, etc.)

Playing with existing projects (RoadRunner, Webpipe, Apache Nutch, etc.)
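
To give an idea of what the "from scratch" option involves, here is a minimal sketch of a breadth-first crawler using only the Python standard library. It is not how any of the tools listed above work internally. The function names (`fetch`, `crawl`), the `max_pages` limit and the regex-based link extraction are assumptions made for illustration; a real crawler would also honour robots.txt and use a proper HTML parser.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

# Naive link extraction (assumption: good enough for a sketch, not for real HTML).
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def fetch(url):
    """Default fetcher: download a page body as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl restricted to the seed's host; returns {url: html}."""
    host = urlparse(seed).netloc
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # unreachable page: skip it and keep crawling
        pages[url] = html
        for href in HREF_RE.findall(html):
            link = urljoin(url, href).split("#")[0]  # resolve and drop fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing the fetcher as a parameter keeps the crawl logic testable without touching the network: `crawl("http://example.org/", fetch_page=my_fake)` works with any function that maps a URL to HTML text.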

Page 12: Data and Information Extraction on the Web

Data extraction - HttpClient

Page 13: Data and Information Extraction on the Web

Data extraction - HtmlUnit

Page 14: Data and Information Extraction on the Web

Data extraction - Aggregating

Downloaded resources can be assigned to equivalence classes

The crawling process inherently defines the page classes to which pages are automatically assigned

Relations between page classes

RoadRunner, Webpipe, etc.
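
As a rough illustration of equivalence classes, the sketch below groups pages whose HTML has the same set of tag paths (for example `html/body/table/tr/td`). This exact-match criterion is an assumption chosen for brevity; it is not the algorithm used by RoadRunner or Webpipe, which tolerate differences between pages of the same class. The names `TagPathCollector`, `signature` and `group_by_structure` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths found in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def signature(html):
    """Structural signature of a page: the frozen set of its tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def group_by_structure(pages):
    """Group {url: html} into lists of URLs sharing the same signature."""
    classes = defaultdict(list)
    for url, html in pages.items():
        classes[signature(html)].append(url)
    return list(classes.values())
```

With this criterion two player pages built from the same template end up in one class, while a team page with a different layout forms another.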

Page 15: Data and Information Extraction on the Web

Data extraction - Equivalence classes (EC)

Page 16: Data and Information Extraction on the Web

Data extraction - Equivalence classes (EC)

“players” class

“teams indexes” class

“coaches” class

“teams” class

Page 17: Data and Information Extraction on the Web

Data extraction - Relevance

What do we really need?

It depends on the specific domain

Not every page in every class is necessarily relevant

We could be interested only in a subset of the found page classes

Page 18: Data and Information Extraction on the Web

Data extraction - Example

We may be interested in retrieving only information regarding players (Player class)

Page 19: Data and Information Extraction on the Web

Data extraction - Problems

Server unavailability and unexpected responses (HTTP 404, 403, 303, etc.)

Security and bandwidth filters (don’t get your crawler machine IP banned!)

Client unavailability (memory and storage space are unlimited only in theory)

Encoding

Legal issues

...
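
One common way to soften the first two problems is to retry failed requests with growing pauses, so that a temporarily unavailable server is not hammered. The sketch below is only an illustration under assumptions: `fetch_politely`, its parameters and the doubling delay are hypothetical choices, and `fetch_page` is any function that maps a URL to text and raises `OSError` on failure.

```python
import time


def fetch_politely(url, fetch_page, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch_page(url) up to `retries` times, waiting longer after each failure."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch_page(url)
        except OSError as error:
            last_error = error
            if attempt < retries - 1:
                # Exponential backoff: delay, 2 * delay, 4 * delay, ...
                sleep(delay * 2 ** attempt)
    raise last_error
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` is what actually throttles the crawler.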

Page 20: Data and Information Extraction on the Web

From Data to Information

Page 21: Data and Information Extraction on the Web

Data vs Information

Data

Raw

Semi-structured

Mixed content

Immutable

Navigation oriented

Information

Clean

Structured

Focused

Managed

Domain oriented

Page 22: Data and Information Extraction on the Web

From Data to Information

We have crawled a lot of data

We possibly have some rough structure (page classes and relations)

We want to pick only what we need

Page 23: Data and Information Extraction on the Web

Information extraction - Pruning

We want to filter out at least:

Banners, advertisement, etc.

Headers/Footers

Navigation bars/Search boxes

Everything else not related to the content

We may use XPath
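
A minimal sketch of XPath-based pruning, using the limited XPath subset supported by Python's `xml.etree.ElementTree`. It assumes the page is well-formed XHTML and that the unwanted blocks can be recognised by `id` or `class`; the expressions in `PRUNE_XPATHS` are hypothetical and would have to be written for each real source.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath expressions for the blocks we want to drop.
PRUNE_XPATHS = [
    ".//div[@id='header']",
    ".//div[@id='footer']",
    ".//div[@class='banner']",
    ".//ul[@class='nav']",
]


def prune(root, xpaths=PRUNE_XPATHS):
    """Remove every element matched by one of the XPaths; returns the same tree."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for xpath in xpaths:
        for node in root.findall(xpath):
            parent = parent_of.get(node)
            if parent is not None:
                parent.remove(node)  # also drops the text right after the node
    return root
```

After pruning, `"".join(root.itertext())` returns only the text of what is left, which is the content we want to pass on to the next step.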

Page 24: Data and Information Extraction on the Web

Information extraction - Pruning

Page 25: Data and Information Extraction on the Web

Information extraction - Pruning

Page 26: Data and Information Extraction on the Web

Information extraction

Once we have extracted content

We are now interested in getting useful information from it -> knowledge

Look for matches between the extracted data and our domain model

Page 27: Data and Information Extraction on the Web

Information extraction - Example

Navigate XML (HTML DOM) nodes with XPath

Navigate content and find specific “parts” (nodes or sub-trees)

Tag such “parts” as objects or properties inside a (specific) domain model

May need to traverse the DOM multiple times
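
The steps above can be sketched as follows: one XPath per property, each result tagged as a field of a domain object. The page layout, the `class` attributes and the expressions in `PLAYER_XPATHS` are assumptions made for illustration, not the XPaths of any real site; the sketch also assumes well-formed XHTML, since it relies on `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical XPaths: one per property of the Player domain object.
PLAYER_XPATHS = {
    "name": ".//h1[@class='player-name']",
    "date_of_birth": ".//td[@class='dob']",
    "team": ".//td[@class='team']/a",
}


@dataclass
class Player:
    name: str
    date_of_birth: str
    team: str


def extract_player(xhtml):
    """Apply each XPath to the page and map the matched text to a Player."""
    root = ET.fromstring(xhtml)
    values = {}
    for field, xpath in PLAYER_XPATHS.items():
        node = root.find(xpath)  # one DOM traversal per property
        if node is None:
            raise ValueError(f"no node matches {xpath!r}")
        values[field] = "".join(node.itertext()).strip()
    return Player(**values)
```

Because the XPaths live in a dictionary separate from the extraction code, the same function can be run over every page of the class, and only the dictionary needs updating when the source layout changes.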

Page 28: Data and Information Extraction on the Web

Information extraction - Name

Page 29: Data and Information Extraction on the Web

Information extraction - Date of Birth

Page 30: Data and Information Extraction on the Web

Information extraction - Team

Page 31: Data and Information Extraction on the Web

Information extraction - Example

A Player (taken from the Player pageclass)

with name, date of birth and belonging to a team

We now know that “Francesco Totti” is a Player of “Italy” team and was born on “27/09/1976”

We can apply such XPaths to all PageClass instances and get information about each player

Page 32: Data and Information Extraction on the Web

Information extraction - Wrapper

Context navigation

RoadRunner

Webpipe

Statistical analysis

ExAlg

Other...

Page 33: Data and Information Extraction on the Web

Information extraction - Problems

Not well structured sources

Frequently changing sources

False positives

Corrupted extracted data
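
One possible defence against false positives and corrupted values is to validate each extracted record before it is stored. The sketch below is an assumption about what such a check could look like: the field names, the `dd/mm/yyyy` date format and the function `validate_player` are hypothetical.

```python
from datetime import date, datetime

REQUIRED_FIELDS = ("name", "date_of_birth", "team")


def validate_player(record, today=None):
    """Return a list of problems found in an extracted record (empty if none)."""
    today = today or date.today()
    problems = []
    for field in REQUIRED_FIELDS:
        if not (record.get(field) or "").strip():
            problems.append(f"missing {field}")
    dob_text = (record.get("date_of_birth") or "").strip()
    if dob_text:
        try:
            dob = datetime.strptime(dob_text, "%d/%m/%Y").date()
        except ValueError:
            # Typical false positive: the XPath matched a node with other content.
            problems.append(f"unparseable date of birth: {dob_text!r}")
        else:
            if dob > today:
                problems.append("date of birth is in the future")
    return problems
```

Records with a non-empty problem list can be discarded or logged, which also gives an early signal that a source has changed its layout.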

Page 34: Data and Information Extraction on the Web

False positives

Page 35: Data and Information Extraction on the Web

Information extraction - Relevance

Using wrappers we can get a lot of information

We could rank what is relevant in:

the “page” context

the domain model

For efficiency and “reasoning” purposes

Page 36: Data and Information Extraction on the Web

Information extraction - Relevance

Page 37: Data and Information Extraction on the Web

Information extraction - Metadata

Stream extracted information into our domain model

Extracted information -> Metadata

Populated domain objects contain

interesting semantics

relations

Page 38: Data and Information Extraction on the Web

Store Metadata

DB (with classic relational schema)

Filesystem (XML)

Key-Value repository

Index

Triple Store

...
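
To make the triple store option concrete, here is a toy in-memory sketch in which every piece of metadata is a (subject, predicate, object) triple and queries are patterns with wildcards. The class `TripleStore`, its methods and the identifiers such as `player:totti` are hypothetical; a real triple store would use RDF and a query language such as SPARQL.

```python
class TripleStore:
    """Toy in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        return sorted(
            triple
            for triple in self.triples
            if (subject is None or triple[0] == subject)
            and (predicate is None or triple[1] == predicate)
            and (obj is None or triple[2] == obj)
        )
```

For example, after `store.add("player:totti", "playsFor", "team:italy")`, the call `store.match(predicate="playsFor", obj="team:italy")` lists every stored player of that team, which is the kind of relation a populated domain model carries.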

Page 39: Data and Information Extraction on the Web

Query enriched data

Exploit the semantics of the acquired metadata to build SQL-like queries (using the attributes and relations of our domain model) on previously unstructured data

Extract hidden knowledge by querying the aggregated metadata

Page 40: Data and Information Extraction on the Web

Sample queries

Get “young players”

SELECT * FROM giocatore g WHERE g.dob > '1993-01-01'

Aggregate queries

Find the average age in each team

Find the average age of World Cup players
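
The sample queries can be tried out with an in-memory SQLite database, as in the sketch below. The schema (a single `giocatore` table with `name`, `dob` and `team` columns), the ISO `YYYY-MM-DD` date format and the helper names are assumptions; age is approximated as days divided by 365.25.

```python
import sqlite3


def build_db(players):
    """Create an in-memory database from (name, dob, team) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE giocatore (name TEXT, dob TEXT, team TEXT)")
    conn.executemany("INSERT INTO giocatore VALUES (?, ?, ?)", players)
    return conn


def young_players(conn, born_after="1993-01-01"):
    """Names of players born after the given ISO date."""
    rows = conn.execute(
        "SELECT name FROM giocatore WHERE dob > ? ORDER BY name", (born_after,)
    )
    return [name for (name,) in rows]


def average_age_by_team(conn, on_date="2010-04-12"):
    """Average age in years per team, measured on the given ISO date."""
    query = """
        SELECT team, AVG((julianday(?) - julianday(dob)) / 365.25)
        FROM giocatore
        GROUP BY team
        ORDER BY team
    """
    return {team: round(age, 1) for team, age in conn.execute(query, (on_date,))}
```

Storing dates as ISO strings keeps the comparison `dob > ?` correct, because lexicographic and chronological order coincide for that format. Dropping the `GROUP BY` clause gives the average over all players.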

Page 41: Data and Information Extraction on the Web

Information extraction on the Web

Page 42: Data and Information Extraction on the Web

References

http://www.w3.org/TR/xpath/

http://www.w3.org/DOM/

http://www.dia.uniroma3.it/db/roadRunner/

http://www.slideshare.net/n0on3/exalg-overview

http://www.ricercaitaliana.it/prin/unita_op-2006093591_002.htm

http://incubator.apache.org/uima/downloads/releaseDocs/2.3.0-incubating/docs/html/overview_and_setup/overview_and_setup.html

http://en.wikipedia.org/wiki/Web_scraping

http://www.alchemyapi.com/api/scrape/
