Posted on 13-Sep-2014
Getting data from the web for research
Andrew Whitby, 27 February 2014
Web data projects I’ve worked on (project / website / data items):
• Examining the global trade in music: various websites incl. Wikipedia, MusicBrainz; 8 million chart entries, ~50k unique artists
• Analysing promotion techniques for artists in foreign markets: a social network; 5k users with 2+ million user preferences (similar to ‘likes’)
• Investigation of data skills: university course database; 20,000 courses
• Modelling political orientation of various organisations*: Twitter; 10ks of followers
* Not at Nesta
Do you really need to scrape?
From easiest to hardest:
1. Bulk download: some sites make their data available as a download. Check!
2. Use an API: a programming interface designed to expose data directly.
3. Manually collect the data: for up to 100s of items, this can be quicker (intern, contract researcher?)
4. Contact the site owner: for smaller sites this can be surprisingly effective.
5. Scrape the website: do this as a last resort.
Can it be scraped?
Structured or semi-structured = scraping
Unstructured text = a different problem
Scrapers
Web 101
• Clients (your browser) send requests to servers (e.g. www.nesta.org.uk) using HyperText Transfer Protocol (HTTP)
• Depending on the request, the server might return:
– A web page, in HTML
– An image (e.g. a PNG or JPG)
– Some data, as XML or JSON
– Etc.
• Scraping and APIs both use HTTP
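To make the request/response cycle concrete, here is a minimal sketch of what an HTTP exchange looks like as raw text. The host, path, and response body are made up for illustration; real requests are normally composed for you by a library.

```python
import json

# A minimal HTTP GET request, as the raw text a client (your browser)
# sends to a server. Host and path are hypothetical.
request = (
    "GET /people HTTP/1.1\r\n"
    "Host: www.example.org\r\n"
    "Accept: text/html\r\n"
    "\r\n"
)

# A canned response a server might send back: status line, headers,
# a blank line, then the body. Here the body is JSON, but it could
# equally be HTML or an image.
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"name": "Nesta", "url": "http://www.nesta.org.uk"}'
)

# Split the response into head and body at the blank line.
head, _, body = response.partition("\r\n\r\n")
status_line = head.split("\r\n")[0]
data = json.loads(body)

print(status_line)   # HTTP/1.1 200 OK
print(data["name"])  # Nesta
```

Both scraping and API access boil down to exchanges like this; the difference is whether the body comes back as HTML for humans or as structured data.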
So how does web scraping work?
• In the (good) old days web pages were very simple, handcrafted, marked-up text
• Now most are automatically generated from databases of content according to templates, so they naturally have a repetitive structure
• Scraping exploits the regularities of this (semi-) structure to extract data using text-manipulation algorithms
Scraping example: Nesta People
• An ordinary URL that you would browse to
• Extraneous information, formatting, etc.
• The data you actually want: either as a table or list here, or possibly as a link to a page-per-item
• Pagination, e.g. <<First <Prev 1 2 3 Next> Last>>
Scraping example: under the bonnet
Annotations on the page’s HTML source:
• Start of an entry / end of an entry
• Photo link
• Link to Albert’s main page
• Name text (e.g. “Adam”, “Albert”)
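A sketch of how those annotations translate into code, using Python’s standard-library HTML parser. The HTML fragment is made up to mimic the kind of repetitive per-person entry a people-listing page might have; the real Nesta markup will differ.

```python
from html.parser import HTMLParser

# A made-up fragment imitating a repetitive people listing:
# each entry has a photo link and a name linked to a per-person page.
html = """
<div class="person"><img src="/photos/adam.jpg">
  <a href="/people/adam">Adam</a></div>
<div class="person"><img src="/photos/albert.jpg">
  <a href="/people/albert">Albert</a></div>
"""

class PeopleParser(HTMLParser):
    """Collect (name text, link) pairs from <a> tags in each entry."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.href = None
        self.people = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                      # start of the name link
            self.in_link = True
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        if self.in_link and data.strip():   # the name text itself
            self.people.append((data.strip(), self.href))

    def handle_endtag(self, tag):
        if tag == "a":                      # end of the name link
            self.in_link = False

parser = PeopleParser()
parser.feed(html)
print(parser.people)  # [('Adam', '/people/adam'), ('Albert', '/people/albert')]
```

In practice a library like BeautifulSoup (mentioned later) does this boilerplate for you, but the idea is the same: exploit the page’s repetitive structure to pick out the fields you want.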
Scraping: legal considerations
• Jurisdiction issues
• Laws that have been relied upon:
– Contract: terms of service
– Copyright law
– EU Databases Directive (research exemption?)
– US Computer Fraud & Abuse Act
– US Digital Millennium Copyright Act
• Case law:
– Unsettled; conflicting decisions
Bottom line: this is a grey area and not without legal risk. (Also: I’m not a lawyer, and this is not legal advice.)
Scraping: ethical considerations
• Remember, the site wasn’t designed for this purpose: be sympathetic to the site owner
• Avoid putting an unreasonable burden on the site
– Some sites run on massive datacentres, others on a single machine
– Rule of thumb: don’t scrape multiple items in parallel
• Ask permission if you can
– But be realistic, and remember a lot of web traffic is scraping (Google, Bing, etc.)
• Observe robots.txt
– But this is (probably) not legally binding either way
This is all before even thinking about privacy (if user data is involved)
Scraping courtesy: robots.txt
If this file exists, it will be at http://sitename.com/robots.txt
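Python’s standard library can read robots.txt for you. The file below is a made-up example of the kind of rules a site might publish; in real use you would point the parser at the site’s actual robots.txt URL rather than feeding it a string.

```python
import urllib.robotparser

# A sample robots.txt (invented for illustration): all crawlers are
# barred from /private/ and asked to wait 10 seconds between requests.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# 'my-research-bot' is a hypothetical crawler name.
print(rp.can_fetch("my-research-bot", "http://sitename.com/people"))     # True
print(rp.can_fetch("my-research-bot", "http://sitename.com/private/x"))  # False
print(rp.crawl_delay("my-research-bot"))                                 # 10
```

Checking this before each scrape is cheap, and honouring the declared crawl delay is an easy way to be courteous even if robots.txt is not legally binding.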
Scraping: practical issues
Sites may reject connections, or challenge your humanity with CAPTCHAs
Getting around limits
• The simple options:
– Slow down requests, introduce random delays
– Use a ‘user agent’ header to pretend to be a human browser
• The serious option: Tor (“the onion router”)
– Anonymises your network location
– But there is an ethical consideration: Tor is a fragile community with better uses
These aren’t the droids you’re looking for
If these don’t work, give up. If a site has gone to this much trouble to prevent scraping, its owners are more likely to get upset and possibly take action against you.
APIs (Application Programming Interfaces)
How do APIs work?
• A way of extracting structured data from a web site or service
– A service intentionally made available by the data owner
• Just a set of rules for communicating / exchanging data
– A request is usually made as a specially-constructed web address
– The response is usually encoded as JSON or XML
• You can access an API:
– directly in your browser (good for testing)
– using a tool like curl
– by programming against it directly
– by using a ‘wrapper’ in your language of choice (Python, Ruby, Java, etc.)
An API is a set of rules
API example: Companies House
A RESTful request using HTTP with data returned in JSON format
Specially constructed URL (‘request’)
Structured, unformatted data returned (‘response’)
API example: Companies House
The same data rendered in a human-friendly web format.
Formatted, human-friendly page returned
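A sketch of the request/response pair in code. The URL shape and the JSON field names below are assumptions standing in for whatever the Companies House service actually returns; a real script would fetch the URL (e.g. with the Requests library) instead of using a canned body.

```python
import json

# A specially constructed URL (the 'request'). The endpoint pattern is
# a guess at the kind shown on the slide, not a documented API.
base = "http://data.companieshouse.gov.uk/doc/company"
company_number = "00000006"
url = "%s/%s.json" % (base, company_number)

# A canned 'response' body standing in for what the server would return.
# Field names are made up for illustration.
response_body = '{"company_number": "00000006", "company_name": "EXAMPLE LTD"}'

data = json.loads(response_body)   # structured, unformatted data
print(url)
print(data["company_name"])        # EXAMPLE LTD
```

Note how little work the “parsing” takes compared with scraping HTML: the data arrives already structured, which is the whole appeal of an API.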
APIs: legal issues
• The situation is simpler/safer than scraping
• Publishing an API means a data provider is encouraging use, and explicitly controlling the amount of data you can collect
• With an API you are more likely to have to expressly agree to something (“clickwrap”); with a paid API you’ll have a formal contract
APIs: ethical issues
• As with scraping, avoid putting an unreasonable burden on the site
• But often API owners will be explicit about what a reasonable burden is
– This may be voluntary
– Or enforced via a ‘rate limit’
• Easier for API owner to enforce, so responsibility is shifted somewhat
APIs: practical issues
• APIs will often be ‘rate limited’: that is, a limit is imposed on how many requests you can make per minute/hour
• This can increase the elapsed time it takes to collect large quantities of information
– But often free registration will increase your rate limit
– And paid accounts may increase it further
– Don’t try to work around this any other way
• APIs may not provide all the same fields web users see; they are often designed for third-party apps rather than research
– In which case, scraping may be an option
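Respecting a rate limit in code usually means waiting and retrying rather than working around it. This sketch uses a simulated API that refuses every third call, the way a real service might return HTTP 429 (“Too Many Requests”); the fake call and the retry parameters are invented for illustration.

```python
import time

# Simulated rate-limited API: every third call is refused, standing in
# for a real server returning HTTP 429.
calls = {"n": 0}

def fake_api_call():
    calls["n"] += 1
    if calls["n"] % 3 == 0:
        return {"status": 429}              # rate limit hit
    return {"status": 200, "data": calls["n"]}

def fetch_with_backoff(max_retries=3, wait=0.01):
    """Retry after a rate-limit response, waiting longer each time."""
    for attempt in range(max_retries):
        resp = fake_api_call()
        if resp["status"] != 429:
            return resp
        time.sleep(wait * (2 ** attempt))   # exponential backoff
    raise RuntimeError("rate limit persists; stop rather than work around it")

results = [fetch_with_backoff() for _ in range(4)]
print([r["data"] for r in results])  # [1, 2, 4, 5]
```

In a real collection job the wait would be seconds or minutes, matched to the limit the API documents; the elapsed-time cost is simply the price of large collections.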
DIY web data access
• Point and click
– Scraping: Import.io, Yahoo Dapper, Yahoo Pipes, various browser extensions (e.g. Chrome Scraper), Kimono?
– API access: ScraperWiki (Twitter)
• Some code
– Scraping: ScraperWiki, Morph.io, Krake.io
– API access: ScraperWiki
• Lots of code
– Scraping: Scrapy, BeautifulSoup
– API access: your language of choice (Python + Requests is good)
Also see this list of non-code scraping tools to try, courtesy of a pair of US journalists: here
Contracted web data access
• How much: e.g. ScraperWiki: $3-10k upfront, $200-500 per month
• Think about:
– How will you receive/analyse the data?
– What is the time period of interest?
– Is it a well-known API (e.g. Twitter) or something exotic (e.g. Douban)?
Case study
Data: Twitter public API (~800 users, 1m tweets over Jan-Oct 2012, plus network snapshots at 3 times)
Cost: £10-15k
Time: months (limitations of API history + rate limits)
Issues:
• Lack of transparency/documentation about data processing decisions (what’s in, what’s out); getting from complex to flat data structures
• Need for iteration and constant communication
• Data collection skills may not coexist with report-writing skills
Summary
1. Consider your non-scraping options
2. A legally grey area: be aware of this
3. If you scrape, scrape ethically
4. Scraping starts simply, but can get complicated
5. Life is easier with open APIs
Glossary
API key a secret string that you use to identify yourself to an API
CAPTCHA Completely Automated Public Turing Test to tell Computers and Humans Apart
HTML (HyperText Markup Language) the language in which web pages are constructed
HTTP (HyperText Transfer Protocol) the communications protocol that is used to transfer web pages from the server to your browser. APIs use this too
JSON (JavaScript Object Notation) a very simple data format based on the JavaScript language, that is quite readable to humans too
rate limit a limit on how frequently you can make requests to the API
REST (Representational State Transfer) a popular architectural approach to using HTTP for APIs
XML a more complex data format that predates JSON