onlineinfo2012 - scraping

DATA LIBERATION: Opening Up Data by Hook or by Crook - Data Scraping, Linkage and the Value of a Good Identifier

Tony Hirst, Department of Communication and Systems, The Open University

Upload: tony-hirst

Post on 27-Jan-2015


DESCRIPTION

Is open data disruptive to data vendors/verticals in the information industry? How can scrapers turn data published as information on the web or in PDFs back into structured data? What business models or publications are built from scraped data?

TRANSCRIPT

Page 1: Onlineinfo2012 - Scraping

DATA LIBERATION

Opening Up Data by Hook or by Crook - Data Scraping, Linkage and the Value of a Good Identifier

Tony Hirst

Department of Communication and Systems

The Open University

Page 2: Onlineinfo2012 - Scraping

data NOT information

Craft by Vicky Hugheston

Page 3: Onlineinfo2012 - Scraping

[Disruptive Innovation?]

Page 4: Onlineinfo2012 - Scraping
Page 5: Onlineinfo2012 - Scraping

“First” generation: data catalogues

Page 6: Onlineinfo2012 - Scraping

Breathing life into data…

Page 7: Onlineinfo2012 - Scraping

=importData("CSV_URL")

Google Sheets
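
For readers working outside a spreadsheet, the same idea (point at a CSV URL, get rows back) can be sketched in Python. This is only an illustration, not the slide's method: the function names `import_data` and `parse_csv` and the sample data are hypothetical, and the demonstration parses an inline string so that it runs without a network connection.

```python
import csv
import io
from urllib.request import urlopen


def parse_csv(text):
    """Parse CSV text, using the first row as column names."""
    return list(csv.DictReader(io.StringIO(text)))


def import_data(csv_url):
    """Fetch a CSV file from a URL and return its rows as dicts.

    Hypothetical helper; assumes the file is UTF-8 encoded.
    """
    with urlopen(csv_url) as response:
        text = response.read().decode("utf-8")
    return parse_csv(text)


# Offline demonstration with made-up data instead of a live URL.
sample = "council,spend\nA,100\nB,250\n"
rows = parse_csv(sample)
print(rows)
```

Every value comes back as a string, so numeric columns still need converting before any arithmetic.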

Page 8: Onlineinfo2012 - Scraping

the spreadsheet becomes

A DATABASE
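
One way to picture a sheet of rows behaving like a database is to load the rows into a table and query them. The sketch below uses Python's built-in SQLite module as a stand-in; it is not the spreadsheet's own query mechanism, and the table name, column names and figures are made up for illustration.

```python
import sqlite3

# Made-up rows, as they might appear in a sheet: council, year, amount.
rows = [("A", 2011, 100), ("A", 2012, 150), ("B", 2012, 250)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spend (council TEXT, year INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO spend VALUES (?, ?, ?)", rows)

# Ask a question of the data rather than scrolling through it.
totals = conn.execute(
    "SELECT council, SUM(amount) FROM spend GROUP BY council ORDER BY council"
).fetchall()
print(totals)
```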

Page 9: Onlineinfo2012 - Scraping

Google Charts

Visualisation API

Page 10: Onlineinfo2012 - Scraping

Google Charts

Visualisation API

Page 11: Onlineinfo2012 - Scraping

Google Charts

Visualisation API

Page 12: Onlineinfo2012 - Scraping

“Second” generation: data management systems

Page 13: Onlineinfo2012 - Scraping

DMS – Data Management System

Page 14: Onlineinfo2012 - Scraping

BUT

Page 15: Onlineinfo2012 - Scraping

There’s lots more data that’s locked up in web pages…

Page 16: Onlineinfo2012 - Scraping

Scraping…

Page 17: Onlineinfo2012 - Scraping
Page 18: Onlineinfo2012 - Scraping

“grabbing web content in a machine readable format and then processing it for your own purposes”

Page 19: Onlineinfo2012 - Scraping
Page 20: Onlineinfo2012 - Scraping

DIY API

Page 21: Onlineinfo2012 - Scraping
Page 22: Onlineinfo2012 - Scraping

Original HTML web page

Accessible web page

Extract Information -> data

Page 23: Onlineinfo2012 - Scraping

Recreating the database that was used to populate a (templated) page
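
A templated page repeats the same markup for every record, so a scraper can walk that markup and rebuild the rows. The following is a minimal sketch using only the Python standard library. The page structure (a `div` with class `record` holding `span` elements whose class names act as column names) and the class `RecordParser` are assumptions invented for this example; a real page needs its own rules.

```python
from html.parser import HTMLParser


class RecordParser(HTMLParser):
    """Rebuild records from a hypothetical templated page layout."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being filled, if any
        self._field = None    # column name of the span being read

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "record":
            self._current = {}
        elif tag == "span" and self._current is not None and cls:
            self._field = cls

    def handle_data(self, data):
        if self._field is not None:
            so_far = self._current.get(self._field, "")
            self._current[self._field] = so_far + data

    def handle_endtag(self, tag):
        if tag == "span" and self._field is not None:
            value = self._current.get(self._field, "")
            self._current[self._field] = value.strip()
            self._field = None
        elif tag == "div" and self._current is not None:
            self.records.append(self._current)
            self._current = None


# Made-up page fragment standing in for a fetched web page.
page = """
<div class="record"><span class="name">Alpha</span><span class="town">Leeds</span></div>
<div class="record"><span class="name">Beta</span><span class="town">York</span></div>
"""
parser = RecordParser()
parser.feed(page)
records = parser.records
print(records)
```

The sketch assumes records are not nested inside one another; nested `div` elements inside a record would need extra bookkeeping.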

Page 24: Onlineinfo2012 - Scraping
Page 25: Onlineinfo2012 - Scraping
Page 26: Onlineinfo2012 - Scraping
Page 27: Onlineinfo2012 - Scraping
Page 28: Onlineinfo2012 - Scraping
Page 29: Onlineinfo2012 - Scraping

Implied semantics

Page 30: Onlineinfo2012 - Scraping

…quick’n’dirty

=importHTML("pageURL","table",N)
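
The same "give me the Nth table on this page" idea can be sketched in Python with the standard library. This is an illustration only: `TableParser` and `import_html_table` are hypothetical names, the page fragment is made up, and the index is 1-based to mirror the spreadsheet formula. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, n):
    """Return the n-th table (counting from 1) found in the HTML text."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables[n - 1]


# Made-up page containing two tables.
page = """
<table><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Ann</td><td>7</td></tr>
</table>
"""
second = import_html_table(page, 2)
print(second)
```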

Page 31: Onlineinfo2012 - Scraping
Page 32: Onlineinfo2012 - Scraping
Page 33: Onlineinfo2012 - Scraping
Page 34: Onlineinfo2012 - Scraping
Page 35: Onlineinfo2012 - Scraping
Page 36: Onlineinfo2012 - Scraping
Page 37: Onlineinfo2012 - Scraping

PDF scraping

Page 38: Onlineinfo2012 - Scraping
Page 39: Onlineinfo2012 - Scraping

Scrapers

Views

Scraper SQLite database

SQLite database Scraper

Page 40: Onlineinfo2012 - Scraping
Page 41: Onlineinfo2012 - Scraping
Page 42: Onlineinfo2012 - Scraping
Page 43: Onlineinfo2012 - Scraping

Sometimes the data is spread across different files…

Page 44: Onlineinfo2012 - Scraping
Page 45: Onlineinfo2012 - Scraping

Row based aggregation
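
Row based aggregation can be pictured as stacking files that share the same columns into one longer table. The sketch below assumes exactly that (identical column headings in every file); the file names, contents and the helper `stack_rows` are made up, and an extra `source` column records where each row came from.

```python
import csv
import io

# Made-up file contents keyed by a hypothetical file name.
files = {
    "2011.csv": "council,amount\nA,100\nB,200\n",
    "2012.csv": "council,amount\nA,150\n",
}


def stack_rows(named_csv_texts):
    """Append the rows of each CSV text into a single list of dicts."""
    combined = []
    for name, text in named_csv_texts.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source"] = name  # remember which file the row came from
            combined.append(row)
    return combined


combined = stack_rows(files)
print(len(combined), combined[-1])
```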

Page 46: Onlineinfo2012 - Scraping

Sometimes the data is spread across different websites…

Page 47: Onlineinfo2012 - Scraping

…Normalisation…

Page 48: Onlineinfo2012 - Scraping
Page 49: Onlineinfo2012 - Scraping

Data Enrichment

Page 50: Onlineinfo2012 - Scraping

Column Additions/Annotations

Page 51: Onlineinfo2012 - Scraping
Page 52: Onlineinfo2012 - Scraping

Sometimes the data is split across different files…

Page 53: Onlineinfo2012 - Scraping

Column based merge
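
A column based merge widens each row with columns taken from a second file, matching rows on a shared key. A minimal sketch, assuming both files carry a `code` column with the same values; the data and the helper `merge_by_key` are hypothetical.

```python
def merge_by_key(left_rows, right_rows, key):
    """Extend each left row with the columns of the matching right row.

    Left rows without a match are kept unchanged (a left join).
    """
    right_index = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        extra = right_index.get(row[key], {})
        merged.append({**row, **extra})
    return merged


# Made-up rows from two files describing the same things.
spend = [{"code": "A", "amount": 100}, {"code": "B", "amount": 250}]
people = [{"code": "A", "population": 1000}]

merged = merge_by_key(spend, people, "code")
print(merged)
```

Row "B" has no partner in the second file, so it comes through without a `population` column; spotting such gaps is often the first sign that the keys need cleaning.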

Page 54: Onlineinfo2012 - Scraping
Page 55: Onlineinfo2012 - Scraping

-> Data cleansing

Page 56: Onlineinfo2012 - Scraping

Clustering…
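
Clustering groups values that are probably different spellings of the same thing. The sketch below is loosely modelled on the key-collision "fingerprint" idea found in tools such as OpenRefine, but it is a simplified assumption rather than that tool's exact algorithm: it lowercases, strips ASCII punctuation, and sorts the unique words so that word order no longer matters. The names and sample values are made up.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalised key for comparison."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a fingerprint; keep groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


names = [
    "Open University",
    "open university.",
    "University, Open",
    "Oxford University",
]
clusters = cluster(names)
print(clusters)
```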

Page 57: Onlineinfo2012 - Scraping

OpenRefine

http://mashe.hawksey.info/2012/11/mining-and-openrefineing-jiscmail-a-look-at-oer-discuss/

/via Martin Hawksey/@mhawksey

Page 58: Onlineinfo2012 - Scraping

OpenRefine

Page 59: Onlineinfo2012 - Scraping

OpenRefine

Page 60: Onlineinfo2012 - Scraping

“Finessing” a common identifier

Page 61: Onlineinfo2012 - Scraping

Common identifiers (common KEYS) make it MUCH easier to JOIN datasets by column
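
Once two datasets share a key, joining them is a one-line query. A minimal sketch using Python's built-in SQLite module; the tables, the `book_id` key and all values are placeholders invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (book_id TEXT PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE loans (book_id TEXT, times_borrowed INTEGER)")

# Placeholder identifiers; any shared, stable key would do.
conn.executemany(
    "INSERT INTO titles VALUES (?, ?)",
    [("id-1", "Book One"), ("id-2", "Book Two")],
)
conn.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [("id-1", 12), ("id-2", 3)],
)

# The common key does the work of lining the two datasets up.
joined = conn.execute(
    "SELECT t.title, l.times_borrowed "
    "FROM titles AS t JOIN loans AS l ON t.book_id = l.book_id "
    "ORDER BY t.book_id"
).fetchall()
print(joined)
```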

Page 62: Onlineinfo2012 - Scraping

Book Title -> ISBN

Page 63: Onlineinfo2012 - Scraping

I am “psychemedia” on Twitter, Delicious, SlideShare, Flickr, etc. etc.

Page 64: Onlineinfo2012 - Scraping
Page 65: Onlineinfo2012 - Scraping

Reconciliation…

Page 66: Onlineinfo2012 - Scraping

OpenRefine

Page 67: Onlineinfo2012 - Scraping

OpenRefine

Page 68: Onlineinfo2012 - Scraping

OpenRefine

Page 69: Onlineinfo2012 - Scraping

OpenRefine

Page 70: Onlineinfo2012 - Scraping
Page 71: Onlineinfo2012 - Scraping

Linked Data™

Page 72: Onlineinfo2012 - Scraping
Page 73: Onlineinfo2012 - Scraping

So who speaks SPARQL?

Diners - Journal Canteen by avlxyz

Page 74: Onlineinfo2012 - Scraping

You DON’T have to….

Page 75: Onlineinfo2012 - Scraping

Just think about how one piece of data might be related to another through a common means of addressing them…

Page 76: Onlineinfo2012 - Scraping

http://ouseful.info

@psychemedia