Slides about "Information and Data Extraction on the Web" for the "Information management on the Web" course at DIA (Computer Science Department) of Roma Tre University
Data and Information Extraction on the Web
Information Management on the Web - 2009/2010
Tommaso Teofili
tommaso [at] apache [dot] org
Monday, 12 April 2010
Agenda
Search
Goals
Problems
Data extraction
Information extraction
Mixing things together
Search - Goals
Find what we are looking for
Quickly
Easily
Have suggestions on other interesting related stuff
Turn results into useful knowledge
What are you looking for?
Problems when googling
Where to search for what we are looking for
How to write good queries (e.g., relations between terms?)
How to evaluate whether a query is good
Search sources
Redundant, heterogeneous, widespread, public, noisy, free, sometimes standard, semi-structured, linked, reachable...
in one word: the Web
Focused search sources
Address interesting sources for the desired domain
Where possible, filter out the unclean and fragmented ones
Choose the most standard and well structured ones
Fragmented sources
Structured sources
Data extraction
Automatically collect data from the Web
Crawl data from domain specific sources
Aggregate homogeneous data (e.g., using equivalence classes)
Save (portions of downloaded) data to a convenient separate storage (DB, file system, repository, etc.)
Data extraction - Crawling
From scratch (good luck!)
Leveraging existing facilities (wget, HtmlUnit, Selenium, Apache HttpClient, Ning’s Async HttpClient, etc.)
Playing with existing projects (RoadRunner, Webpipe, Apache Nutch, etc.)
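To give an idea of what the "from scratch" option involves, here is a minimal sketch of a breadth-first crawler that stays on the seed's host. It uses only the Python standard library; the `crawl` function, the `LinkExtractor` class and the example.org URLs are names made up for illustration, and the download step is passed in as a callable so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=50):
    """Breadth-first crawl restricted to the seed's host.

    `fetch` is any callable mapping a URL to an HTML string,
    or to None when the page cannot be downloaded.
    """
    host = urlparse(seed).netloc
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop fragments.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# A tiny fake site (hypothetical URLs) standing in for real HTTP downloads.
site = {
    "http://example.org/": '<a href="/teams/italy.html">Italy</a>',
    "http://example.org/teams/italy.html":
        '<a href="/players/totti.html">Totti</a> <a href="http://other.org/">ads</a>',
    "http://example.org/players/totti.html": "<h1>Francesco Totti</h1>",
}
pages = crawl("http://example.org/", site.get)
```

In a real run `fetch` would wrap an HTTP client; the existing facilities listed above take care of cookies, redirects, JavaScript and connection pooling, which this sketch ignores.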
Data extraction - HttpClient
Data extraction - HtmlUnit
Data extraction - Aggregating
Downloaded resources can be assigned to equivalence classes
The crawling process implicitly defines the page classes to which pages belong
Relations between page classes
RoadRunner, Webpipe, etc.
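One simple way to picture equivalence classes is to group pages that share the same HTML structure. The sketch below is only an illustration under that assumption: it is not how RoadRunner or Webpipe work, and the `signature` and `group_by_structure` helpers and the sample pages are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Records the set of root-to-node tag paths seen in a page."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def signature(html):
    """Structural fingerprint of a page: its set of tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def group_by_structure(pages):
    """Groups URLs whose pages have exactly the same signature."""
    classes = defaultdict(list)
    for url, html in pages.items():
        classes[signature(html)].append(url)
    return list(classes.values())


# Made-up sample pages: two player pages and one team page.
pages = {
    "/players/totti": "<html><body><h1>Totti</h1><table><tr><td>1976</td></tr></table></body></html>",
    "/players/buffon": "<html><body><h1>Buffon</h1><table><tr><td>1978</td></tr></table></body></html>",
    "/teams/italy": "<html><body><h1>Italy</h1><ul><li>Totti</li><li>Buffon</li></ul></body></html>",
}
classes = group_by_structure(pages)
```

Requiring identical signatures is brittle on real sites, where pages of the same class differ in optional blocks; a similarity threshold between signatures would be a more forgiving variant.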
Data extraction - EC
“players” class
“teams indexes” class
“coaches” class
“teams” class
Data extraction - Relevance
What do we really need?
Depending on the specific domain
Not all pages in all classes could be relevant
We could be interested only in a subset of the found page classes
Data extraction - Example
We may be interested in retrieving only information regarding players (Player class)
Data extraction - Problems
Unavailable or unexpected server responses (HTTP 404, 403, 303, etc.)
Security and bandwidth filters (don’t get your crawler machine’s IP banned!)
Client unavailability (memory and storage space are unlimited only in theory)
Encoding
Legal issues
...
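Some of these problems can be softened in the download step itself. The following sketch retries transient failures with exponential backoff and gives up immediately on 403/404; the `fetch_with_retry` helper, the `FetchError` exception and the choice of which status codes count as permanent are assumptions made for illustration, and redirects such as 303 are assumed to be followed by the underlying HTTP client.

```python
import time


class FetchError(Exception):
    """Raised when a URL cannot be retrieved."""


def fetch_with_retry(url, fetch, attempts=3, delay=1.0, sleep=time.sleep):
    """Calls fetch(url) -> (status, body) and retries transient failures.

    2xx returns the body, 403/404 raise at once, anything else
    (e.g. 5xx) is retried after delay, 2*delay, 4*delay, ... seconds.
    """
    for attempt in range(attempts):
        status, body = fetch(url)
        if 200 <= status < 300:
            return body
        if status in (403, 404):
            raise FetchError(f"{url}: permanent failure {status}")
        if attempt < attempts - 1:
            # Back off before retrying, which also keeps the crawler polite.
            sleep(delay * 2 ** attempt)
    raise FetchError(f"{url}: gave up after {attempts} attempts")


# Simulated server: one temporary error, then success (no real network).
responses = iter([(503, ""), (200, "<html>ok</html>")])
waits = []
body = fetch_with_retry(
    "http://example.org/", lambda url: next(responses), sleep=waits.append
)
```

Passing `sleep` as a parameter keeps the sketch testable; a production crawler would also want a per-host delay between successful requests.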
From Data to Information
Data vs Information
Data
Rough
Semi-structured
Mixed content
Immutable
Navigation oriented
Information
Clean
Structured
Focused
Managed
Domain oriented
From Data to Information
We have crawled a lot of data
We may have some rough structure (page classes and relations)
We want to pick only what we need
Information extraction - Pruning
We want to filter out at least:
Banners, advertisement, etc.
Headers/Footers
Navigation bars/Search boxes
Everything else not related to the content
We may use XPath
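As a sketch of XPath-based pruning, the snippet below removes boilerplate elements from a well-formed page using the limited XPath subset of Python's `xml.etree.ElementTree`. The element ids and classes in `NOISE_XPATHS`, the `prune` helper and the sample page are all hypothetical; real-world HTML is rarely well-formed XML and would need a tolerant parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath expressions for the boilerplate of one page class.
NOISE_XPATHS = [
    ".//div[@id='header']",
    ".//div[@id='footer']",
    ".//div[@class='banner']",
    ".//ul[@class='nav']",
    ".//form",
]


def prune(root, xpaths=NOISE_XPATHS):
    """Removes, in place, every element matched by one of the XPaths."""
    for xpath in xpaths:
        for node in root.findall(xpath):
            # ElementTree nodes do not know their parent, so rebuild the
            # child -> parent map (the tree shrinks as we prune).
            parents = {child: parent for parent in root.iter() for child in parent}
            parent = parents.get(node)
            if parent is not None:
                parent.remove(node)
    return root


page = ET.fromstring(
    "<html><body>"
    "<div id='header'>World Cup 2010</div>"
    "<ul class='nav'><li>Home</li></ul>"
    "<div id='content'><h1>Francesco Totti</h1></div>"
    "<div class='banner'>Buy tickets</div>"
    "<div id='footer'>Contacts</div>"
    "</body></html>"
)
prune(page)
remaining = [div.get("id") for div in page.iter("div")]
```

After pruning, only the content block is left for the information extraction step.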
Information extraction
Once we have extracted content
We are now interested in getting useful information from it -> knowledge
Look for matches between the extracted data and our domain model
Information extraction - Example
Navigate XML (HTML DOM) nodes with XPath
Navigate content and find specific “parts” (nodes or sub-trees)
Tag such “parts” as objects or properties inside a (specific) domain model
May need to traverse the DOM multiple times
Information extraction - Name
Information extraction - Date of Birth
Information extraction - Team
Information extraction - Example
A Player (taken from the Player pageclass)
with name, date of birth and belonging to a team
We now know that “Francesco Totti” is a Player of “Italy” team and was born on “27/09/1976”
We can apply such XPaths to all PageClass instances and get information about each player
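The Player example can be sketched as one XPath per property, applied to a pruned page of the Player page class. The XPath expressions, the markup of the sample page and the `Player` and `extract_player` names are assumptions for illustration (the original slides showed the real paths as screenshots); the values are the ones quoted above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Player:
    name: str
    date_of_birth: date
    team: str


# Hypothetical XPaths, one per property of the domain object.
PLAYER_XPATHS = {
    "name": ".//div[@id='content']/h1",
    "date_of_birth": ".//td[@class='dob']",
    "team": ".//td[@class='team']/a",
}


def extract_player(root):
    """Applies each XPath to the page and builds a Player from the results."""
    fields = {}
    for field, xpath in PLAYER_XPATHS.items():
        node = root.find(xpath)
        if node is None or not (node.text or "").strip():
            raise ValueError(f"no value found for {field!r}")
        fields[field] = node.text.strip()
    # Dates on the sample page are written as dd/mm/yyyy.
    fields["date_of_birth"] = datetime.strptime(
        fields["date_of_birth"], "%d/%m/%Y"
    ).date()
    return Player(**fields)


page = ET.fromstring(
    "<html><body><div id='content'>"
    "<h1>Francesco Totti</h1>"
    "<table><tr><td class='dob'>27/09/1976</td>"
    "<td class='team'><a href='/teams/italy.html'>Italy</a></td></tr></table>"
    "</div></body></html>"
)
player = extract_player(page)
```

Running `extract_player` over every page of the class yields one populated `Player` per page, as long as the pages really share the same structure.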
Information extraction - Wrapper
Context navigation
RoadRunner
Webpipe
Statistical analysis
ExAlg
Other...
Information extraction - Problems
Not well structured sources
Frequently changing sources
False positives
Corrupted extracted data
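A cheap defence against false positives and corrupted values is to validate each extracted record against what the domain model expects before storing it. The checks below (a name pattern, a dd/mm/yyyy date in an arbitrary 1900-2010 range, a set of known teams) and the `validate_player` helper are illustrative assumptions, not rules taken from the slides.

```python
import re
from datetime import datetime

# Letters only, with single spaces, apostrophes or hyphens between words.
NAME_PATTERN = re.compile(r"^[^\W\d_]+(?:[ '\-][^\W\d_]+)*$")


def validate_player(record, known_teams):
    """Returns the list of problems found in a record (empty if it looks fine)."""
    problems = []
    if not NAME_PATTERN.match(record.get("name", "")):
        problems.append("name does not look like a person name")
    try:
        dob = datetime.strptime(record.get("date_of_birth", ""), "%d/%m/%Y")
        if not 1900 <= dob.year <= 2010:  # arbitrary plausibility range
            problems.append("date of birth out of plausible range")
    except ValueError:
        problems.append("date of birth is not a dd/mm/yyyy date")
    if record.get("team") not in known_teams:
        problems.append("team is not in the domain model")
    return problems


teams = {"Italy"}
good = {"name": "Francesco Totti", "date_of_birth": "27/09/1976", "team": "Italy"}
# What a wrapper might return after the page layout changed (made-up values).
bad = {"name": "Login | Register", "date_of_birth": "N/A", "team": "Sponsors"}

good_problems = validate_player(good, teams)
bad_problems = validate_player(bad, teams)
```

A sudden rise in rejected records for one page class is also a useful signal that the source has changed and the wrapper needs to be updated.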
False positives
Information extraction - Relevance
Using wrappers we can get a lot of information
We could rank what is relevant in:
the “page” context
the domain model
For efficiency and “reasoning” purposes
Information extraction - Metadata
Stream extracted information into our domain model
Extracted information -> Metadata
Populated domain objects contain
interesting semantics
relations
Store Metadata
DB (with classic relational schema)
Filesystem (XML)
Key-Value repository
Index
Triple Store
...
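To make the options concrete, the sketch below stores the same extracted object in three of the shapes listed above: a relational row, a key-value entry and a set of triples. The table layout, key format and predicate names are assumptions chosen for illustration; an in-memory SQLite database and plain Python containers stand in for real stores.

```python
import json
import sqlite3

# One extracted domain object (values from the Player example).
player = {"id": "totti", "name": "Francesco Totti", "dob": "1976-09-27", "team": "Italy"}

# 1. Relational: one row in a table (in-memory SQLite database).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE player (id TEXT PRIMARY KEY, name TEXT, dob TEXT, team TEXT)")
db.execute("INSERT INTO player VALUES (:id, :name, :dob, :team)", player)

# 2. Key-value: the whole object serialized under a single key.
kv_store = {f"player:{player['id']}": json.dumps(player)}

# 3. Triples: one (subject, predicate, object) statement per property.
triples = [
    (player["id"], key, value) for key, value in player.items() if key != "id"
]
```

The relational form favours SQL-like queries over attributes, the key-value form favours fast lookup of whole objects, and triples make relations between objects explicit.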
Query enriched data
Exploit the semantics of the acquired metadata to build SQL-like queries (using the attributes and relations of our domain model) on previously unstructured data
Extract hidden knowledge by querying aggregated metadata
Sample queries
Get “young players”
SELECT * FROM giocatore g WHERE g.dob > '1993-01-01'
Aggregate queries
Find the average age in each team
Find the average age of World Cup players
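These queries can be tried on an in-memory SQLite database. In the sketch below the `giocatore` table layout, the two "Sample Player" rows and the reference date used to compute ages are made up for illustration; dates are stored as ISO strings so that they compare correctly as text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE giocatore (name TEXT, dob TEXT, team TEXT)")
db.executemany(
    "INSERT INTO giocatore VALUES (?, ?, ?)",
    [
        ("Francesco Totti", "1976-09-27", "Italy"),
        ("Sample Player A", "1994-03-01", "Italy"),  # made-up row
        ("Sample Player B", "1990-06-11", "Spain"),  # made-up row
    ],
)

# "Young players": born after 1 January 1993.
young = db.execute(
    "SELECT name FROM giocatore g WHERE g.dob > '1993-01-01'"
).fetchall()

# Age in whole years at a reference date: year difference, minus one
# if the birthday has not yet occurred in the reference year.
AGE = (
    "CAST(strftime('%Y', :ref) AS INTEGER) - CAST(strftime('%Y', dob) AS INTEGER)"
    " - (strftime('%m-%d', :ref) < strftime('%m-%d', dob))"
)
ref = {"ref": "2010-06-11"}  # hypothetical reference date

# Average age in each team.
avg_by_team = dict(
    db.execute(f"SELECT team, AVG({AGE}) FROM giocatore GROUP BY team", ref).fetchall()
)

# Average age over all stored players.
avg_overall = db.execute(f"SELECT AVG({AGE}) FROM giocatore", ref).fetchone()[0]
```

None of these questions could be asked of the raw pages; they become possible only once the extracted values are typed and attached to the domain model.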
Information extraction on the Web
References
http://www.w3.org/TR/xpath/
http://www.w3.org/DOM/
http://www.dia.uniroma3.it/db/roadRunner/
http://www.slideshare.net/n0on3/exalg-overview
http://www.ricercaitaliana.it/prin/unita_op-2006093591_002.htm
http://incubator.apache.org/uima/downloads/releaseDocs/2.3.0-incubating/docs/html/overview_and_setup/overview_and_setup.html
http://en.wikipedia.org/wiki/Web_scraping
http://www.alchemyapi.com/api/scrape/