scraping talk public

Getting data from the web for research

Andrew Whitby, 27 February 2014

Web data projects I’ve worked on

(Table columns: A project…, Website, Data items, Scrape, API)

Examining the global trade in music
– Website: various websites incl. Wikipedia, Musicbrainz
– Data items: 8 million chart entries, ~50k unique artists

Analysing promotion techniques for artists in foreign markets
– Website: a social network
– Data items: 5k users with 2+ million user preferences (similar to ‘likes’)

Investigation of data skills
– Website: university course database
– Data items: 20,000 courses

Modelling political orientation of various organisations*
– Website: Twitter
– Data items: tens of thousands of followers

* Not at Nesta

Do you really need to scrape?

Bulk download: Some sites make their data available as a download. Check!

Use an API: A programming interface designed to expose data directly.

Manually collect the data: for up to 100s of items, this can be quicker (intern, contract researcher?)

Contact the site owner: For smaller sites this can be surprisingly effective.

Scrape the website: Do this as a last resort.

(The options above run from easiest at the top to hardest at the bottom.)

Can it be scraped?

Structured or semi-structured = Scraping

Unstructured text = A different problem

Scrapers

Web 101

• Clients (your browser) send requests to servers (e.g. www.nesta.org.uk) using HyperText Transfer Protocol (HTTP)

• Depending on the request, the server might return
– A web page, in HTML
– An image (e.g. a PNG or JPG)
– Some data, as XML or JSON
– Etc

• Scraping and APIs both use HTTP
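The request/response cycle described above can be tried out locally. The following is a minimal sketch using only the Python standard library: a toy server (the hypothetical `DemoHandler`, not any real site) returns HTML for one path and JSON for another, and a client fetches both over HTTP. The paths `/page` and `/data` are made up for illustration.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: returns HTML for /page and JSON for any other path."""

    def do_GET(self):
        if self.path == "/page":
            body = b"<html><body><h1>Hello</h1></body></html>"
            content_type = "text/html"
        else:
            body = json.dumps({"greeting": "Hello"}).encode("utf-8")
            content_type = "application/json"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(host, port, path):
    """Send an HTTP GET request; return (status, content type, body text)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        body = response.read().decode("utf-8")
        return response.status, response.getheader("Content-Type"), body
    finally:
        conn.close()


def run_demo():
    """Start the toy server on a free local port and fetch both resources."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_port
        return fetch("127.0.0.1", port, "/page"), fetch("127.0.0.1", port, "/data")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for status, content_type, body in run_demo():
        print(status, content_type, body)
```

The same client code serves both cases: only the content type of the response tells you whether you received a page to scrape or data to parse.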

So how does web scraping work?

• In the (good) old days web pages were very simple, handcrafted, marked-up text

• Now most are automatically generated from databases of content according to templates, so they naturally have a repetitive structure

• Scraping exploits the regularities of this (semi-) structure to extract data using text-manipulation algorithms

Scraping example: Nesta People

Ordinary URL that you would browse to

Extraneous information, formatting, etc

The data you actually want: either as a table or list here, or possibly as a link to a page-per-item

Pagination, e.g. <<First <Prev 1 2 3 Next> Last>>

Scraping example: under the bonnet


[Figure: the page and its source for two entries (Adam, Albert), annotated as follows]

– Start of an entry
– End of an entry
– Photo link
– Link to Albert’s main page
– Name text
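To make the annotations above concrete, here is a minimal sketch of extracting the name, page link and photo link from each entry using Python's standard `html.parser`. The markup in `SAMPLE_HTML` is invented for illustration (the real page's tags and class names are not shown in these slides), so the tag and class choices are assumptions.

```python
from html.parser import HTMLParser

# Hypothetical markup: one <li class="person"> per entry.
SAMPLE_HTML = """
<ul class="people">
  <li class="person">
    <a href="/users/adam"><img src="/photos/adam.jpg"><span class="name">Adam</span></a>
  </li>
  <li class="person">
    <a href="/users/albert"><img src="/photos/albert.jpg"><span class="name">Albert</span></a>
  </li>
</ul>
"""


class PeopleParser(HTMLParser):
    """Collects one dict per entry by tracking where each entry starts and ends."""

    def __init__(self):
        super().__init__()
        self.people = []
        self.current = None   # the entry being read, if any
        self.in_name = False  # True while inside the name element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "person":
            # Start of an entry
            self.current = {"name": "", "link": None, "photo": None}
        elif self.current is None:
            return  # extraneous markup outside any entry
        elif tag == "a" and self.current["link"] is None:
            self.current["link"] = attrs.get("href")   # link to the main page
        elif tag == "img":
            self.current["photo"] = attrs.get("src")   # photo link
        elif tag == "span" and attrs.get("class") == "name":
            self.in_name = True

    def handle_data(self, data):
        if self.in_name and self.current is not None:
            self.current["name"] += data.strip()       # name text

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_name = False
        elif tag == "li" and self.current is not None:
            # End of an entry
            self.people.append(self.current)
            self.current = None


def scrape_people(html):
    parser = PeopleParser()
    parser.feed(html)
    return parser.people


if __name__ == "__main__":
    for person in scrape_people(SAMPLE_HTML):
        print(person)
```

Because the page is generated from a template, the same few rules work for every entry; libraries such as BeautifulSoup offer a more convenient way to write them.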

Scraping: legal considerations

• Jurisdiction issues

• Laws that have been relied upon
– Contract: terms of service
– Copyright law
– EU Database Directive (research exemption?)
– US Computer Fraud & Abuse Act
– US Digital Millennium Copyright Act

• Case law
– Unsettled - conflicting decisions

Bottom line: this is a grey area and not without legal risk. (Also: I’m not a lawyer, this is not legal advice.)

Scraping: ethical considerations

• Remember, the site wasn’t designed for this purpose: be sympathetic to the site owner

• Avoid putting an unreasonable burden on the site
– Some run on massive datacentres, others a single machine.
– Rule of thumb: don’t scrape multiple items in parallel

• Ask permission if you can
– But be realistic, and remember a lot of web traffic is scraping (Google, Bing, etc)

• Observe robots.txt
– But this is (probably) not legally binding either way

This is before even thinking about privacy (if user data involved)

Scraping courtesy: robots.txt

If this file exists it will be at http://sitename.com/robots.txt
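As an illustration of observing robots.txt, the sketch below uses Python's standard `urllib.robotparser` to check whether a URL may be fetched. The rules in `SAMPLE_ROBOTS_TXT` and the user agent name `my-research-bot` are made up for the example; for a real site you would load the live file instead (see the comment in the code).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # For a live site, use instead:
    #   parser.set_url("http://sitename.com/robots.txt"); parser.read()
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

if __name__ == "__main__":
    agent = "my-research-bot"  # hypothetical name for your scraper
    print(rules.can_fetch(agent, "http://sitename.com/people"))
    print(rules.can_fetch(agent, "http://sitename.com/private/data"))
    print(rules.crawl_delay(agent))
```

Checking `can_fetch` before each request, and waiting at least `crawl_delay` seconds between requests when the site specifies one, covers the basic courtesy.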

Scraping: practical issues

Sites may reject connections, or challenge your humanity with CAPTCHAs

Getting around limits

The simple options
– Slow down requests, introduce random delays
– Use ‘user agent’ to pretend to be human

The serious option
– Tor (“the onion router”)
– Anonymises your network location.
– Ethical consideration though
  • Tor is a fragile community with better uses
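The first of the simple options (slowing down and adding random delays) might look like the sketch below. The function names, the default delay values and the `fetch` callable are all assumptions for illustration; `fetch` stands in for whatever function actually downloads a page.

```python
import random
import time


def polite_delays(n_requests, base_seconds=2.0, jitter_seconds=1.0, rng=random):
    """Return one delay per request: a fixed base plus random jitter."""
    return [base_seconds + rng.uniform(0, jitter_seconds) for _ in range(n_requests)]


def fetch_all(urls, fetch, sleep=time.sleep, **delay_options):
    """Fetch URLs one at a time (never in parallel), pausing between requests."""
    results = []
    delays = polite_delays(len(urls), **delay_options)
    for index, (url, delay) in enumerate(zip(urls, delays)):
        if index > 0:
            sleep(delay)  # no need to wait before the very first request
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in fetch function so the sketch runs without touching any website.
    pages = fetch_all(
        ["page1", "page2", "page3"],
        fetch=lambda url: "contents of " + url,
        base_seconds=0.1,
        jitter_seconds=0.1,
    )
    print(pages)
```

Passing `sleep` and `fetch` in as arguments keeps the pacing logic separate from the downloading, which also makes it easy to check without waiting.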

These aren’t the droids you’re looking for

If these don’t work, give up. If they’ve gone to this much trouble to prevent scraping, they’re more likely to get upset and possibly take action against you.

http://arstechnica.com/security/2014/01/hackers-use-amazon-cloud-to-scrape-mass-number-of-linkedin-member-profiles/

APIs (Application Programming Interfaces)

How do APIs work?

• Way of extracting structured data from a web site or service
– A service intentionally made available by the data owner

• Just a set of rules for communicating / exchanging data
– Request is usually made as a specially-constructed web address
– Response is usually encoded as JSON or XML

• You can access an API:
– directly in your browser (good for testing)
– using a tool like curl
– by programming it directly
– by using a ‘wrapper’ in your language of choice (Python, Ruby, Java, etc)
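As a rough sketch of "programming it directly": the request is a URL built from parameters, and the response is JSON that parses straight into data structures. The endpoint `https://api.example.com/search`, its parameters and the response fields below are all invented for illustration and do not describe any real API.

```python
import json
from urllib.parse import urlencode


def build_request_url(base_url, **params):
    """Build the specially-constructed web address for an API request."""
    return base_url + "?" + urlencode(sorted(params.items()))


# A made-up JSON response, standing in for what a server might send back.
SAMPLE_RESPONSE = '{"items": [{"name": "Example Ltd", "number": "01234567"}], "total": 1}'


def parse_response(text):
    """Turn the JSON response into a list of (name, number) pairs."""
    data = json.loads(text)
    return [(item["name"], item["number"]) for item in data["items"]]


if __name__ == "__main__":
    print(build_request_url("https://api.example.com/search", q="example ltd", page=1))
    print(parse_response(SAMPLE_RESPONSE))
```

Compared with scraping, there is no markup to pick apart: the fields arrive already named and structured.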

An API is a set of rules

API example: Companies House

A RESTful request using HTTP with data returned in JSON format

Specially constructed URL (‘request’)

Structured, unformatted data returned (‘response’)

API example: Companies House

The same data rendered in a human-friendly web format.

Formatted, human-friendly page returned

APIs: legal issues

• Situation is simpler/safer than scraping

• Publishing an API means a data provider is encouraging use, and explicitly controlling the amount of data you can collect

• With an API you are more likely to have to expressly agree to something (“clickwrap”); with a paid API you’ll have a formal contract

APIs: ethical issues

• As with scraping, avoid putting an unreasonable burden on the site

• But often API owners will be explicit about what a reasonable burden is
– This may be voluntary
– Or enforced via a ‘rate limit’

• Easier for API owner to enforce, so responsibility is shifted somewhat

APIs: practical issues

• APIs will often be ‘rate limited’: that is, a limit is imposed on how many requests you can make per minute/hour.

• This can increase the elapsed time it takes to collect large quantities of information
– But often free registration will increase your rate limit
– And paid accounts may increase it further
– Don’t try to work around this any other way

• APIs may not provide all the same fields web users see – they are often designed for third-party apps rather than research
– In which case, scraping may be an option
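One way to stay within a rate limit is to pace requests on the client side. The sketch below is an assumption about how one might do this, not a description of any particular API: the hypothetical `RateLimiter` class allows at most `max_calls` requests in any window of `period` seconds and sleeps when the allowance is used up.

```python
import time


class RateLimiter:
    """Allow at most `max_calls` calls in any window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = []  # timestamps of calls still inside the window

    def _forget_old_calls(self, now):
        self.calls = [t for t in self.calls if now - t < self.period]

    def wait(self):
        """Call before each request; blocks until a request is allowed."""
        now = self.clock()
        self._forget_old_calls(now)
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window has expired.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self._forget_old_calls(now)
        self.calls.append(now)


if __name__ == "__main__":
    limiter = RateLimiter(max_calls=2, period=1.0)
    for i in range(4):
        limiter.wait()
        print("request", i)
```

The real limit (for example, requests per minute or per hour) comes from the API's documentation; plug those numbers into `max_calls` and `period`.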

DIY web data access

Point and click
– Scraping: Import.io, Yahoo Dapper, Yahoo Pipes, various browser extensions (e.g. Chrome Scraper), Kimono?
– API access: Scraperwiki (Twitter)

Some code
– Scraping: Scraperwiki, Morph.io, Krake.io
– API access: Scraperwiki

Lots of code
– Scraping: Scrapy, BeautifulSoup
– API access: your language of choice (Python+Requests is good)

Also see this list of non-code scraping things to try, courtesy of a pair of US journalists: http://michelleminkoff.com/2012/02/27/teaching-materials-from-nicar-2012/

Contracted web data access

• How much:
– e.g. ScraperWiki: $3-10k upfront, $200-500 per month

• Think about
– How will you receive/analyse the data?
– What is the time period of interest?
– Is it a well-known API (e.g. Twitter) or something exotic (e.g. Douban)?

Case study

Data: Twitter public API (~800 users, 1m tweets over Jan-Oct 2012, plus network snapshots at 3 times)

Cost: £10-15k

Time: months (limitations of API history + rate limits)

Issues

• Lack of transparency/documentation about data processing decisions (what’s in, what’s out) – getting from complex to flat data structures

• Need for iteration, constant communication

• Data collection skills may not coexist with report-writing skills
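To illustrate what "getting from complex to flat data structures" involves, here is a minimal sketch that flattens a nested record into a single row of columns. The record in `SAMPLE_TWEET` is a made-up example loosely shaped like API output, and the treatment of lists (keeping only a count) is one arbitrary choice among many, which is exactly the kind of decision worth documenting.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            # A processing decision: keep only how many items there were.
            flat[full_key + "_count"] = len(value)
        else:
            flat[full_key] = value
    return flat


# Hypothetical nested record, for illustration only.
SAMPLE_TWEET = {
    "id": 1,
    "text": "hello",
    "user": {"screen_name": "example", "followers_count": 10},
    "entities": {"hashtags": ["a", "b"]},
}

if __name__ == "__main__":
    print(flatten(SAMPLE_TWEET))
```

Every branch in `flatten` decides what is in and what is out of the final dataset, so it is worth asking a contractor to write these rules down.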

Summary

1. Consider your non-scraping options
2. A legally grey area - be aware of this
3. If you scrape, scrape ethically
4. Scraping starts simply, but can get complicated
5. Life is easier with open APIs

Glossary

API key: a secret string that you use to identify yourself to an API

CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart

HTML (HyperText Markup Language): the language in which web pages are constructed

HTTP (HyperText Transfer Protocol): the communications protocol that is used to transfer web pages from the server to your browser. APIs use this too

JSON: a very simple data format based on the JavaScript language, that is quite readable to humans too

rate limit: a limit on how frequently you can make requests to the API

REST: a popular semantic approach to using HTTP for APIs

XML: a more complex data format that predates JSON
