Intro to Web Scraping with Python
Vancouver PyLadies Meetup, March 7, 2016
Maris Lemba


TRANSCRIPT

Page 1: Intro to web scraping with Python

Intro to Web Scraping with Python

Vancouver PyLadies Meetup, March 7, 2016

Maris Lemba

Page 2: Intro to web scraping with Python


Before we begin

● Do consider whether the site you are interested in allows scraping by examining its robots.txt file
  – This robots exclusion standard is used by web servers to communicate with web crawlers and other robots
  – Located in the top-level directory of the web server
  – Can specify which directories the site owner does not want robots to access
  – Can have different access rules for different robots
● If you intend to publish the output from your scraping, consider whether it is legal to do so
  – It is generally OK to re-publish “factual data”

    # Sample robots.txt file
    User-agent: Googlebot
    Disallow: /folder/

    User-agent: *
    Disallow: /photos/
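As a minimal sketch of checking these rules programmatically, the standard library's urllib.robotparser (one of the urllib submodules covered later in this talk) can read a robots.txt file. The example.com URLs are placeholders, and the expected results assume the sample rules above:

    import urllib.robotparser

    # Read the site's robots.txt (placeholder URL)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()

    # can_fetch(useragent, url) tells us whether the rules allow a fetch.
    # Assuming example.com served the sample rules above:
    print(rp.can_fetch("*", "http://www.example.com/photos/me.jpg"))  # False: /photos/ is disallowed
    print(rp.can_fetch("*", "http://www.example.com/results.html"))   # True: not disallowed for *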

Page 3: Intro to web scraping with Python


Basic components of a web page

● HTML – a set of markup tags for describing web documents
● CSS – a style sheet language for describing the presentation of an HTML document
● JavaScript – a popular client-side programming language used for web applications

[The slide shows a sample HTML source alongside what it looks like when rendered in the browser.]

If we want to scrape information from the “Day of the week” table, we can locate it in the HTML inside the <table> ... </table> element.

If there are multiple tables, we can use attributes such as id to identify the right table.
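The original samplepage.html from the slides is not preserved in this transcript, so here is a hypothetical stand-in with the features described above (a “Day of the week” table carrying an id attribute); the BeautifulSoup example later in the talk parses this same file:

    # Hypothetical reconstruction of the samplepage.html shown on the slide
    sample_html = """
    <html>
    <head><title>Sample page</title></head>
    <body>
    <h1>My sample page</h1>
    <table id="days">
      <tr><th>Day of the week</th><th>Abbreviation</th></tr>
      <tr><td>Monday</td><td>Mon</td></tr>
      <tr><td>Tuesday</td><td>Tue</td></tr>
      <tr><td>Wednesday</td><td>Wed</td></tr>
    </table>
    </body>
    </html>
    """

    # Save it so the later parsing examples have a file to work with
    with open("samplepage.html", "w") as f:
        f.write(sample_html)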

Page 4: Intro to web scraping with Python


How the web works

[The slide shows two schematics, adapted from https://en.wikipedia.org/wiki/Ajax_%28programming%29, contrasting the conventional model of a web application with the Ajax model. In the conventional model, the user interface in the web browser sends an HTTP request to the server-based system (web server, database, data handling, etc.) and receives HTML + CSS data back. In the Ajax model, an Ajax “engine” (JavaScript) sits between the user interface and the server on the client side: the UI makes JavaScript calls to the engine, the engine exchanges HTTP requests with the web server, and the server responds with XML/JSON, HTML/CSS, or JavaScript data that the engine uses to update the page without a full reload.]

Page 5: Intro to web scraping with Python


Scraping ad hoc vs at scale

● Use case 1 (basic scraping): I need to scrape a few pages, maybe crawl a small site. I can wait to have all data retrievals done sequentially.
  – This is what we will focus on in this talk
  – Two scenarios:
    ● All the data you need is in the HTML files you get from the server
    ● The page is dynamically generated using Ajax

● Use case 2: I am crawling large sites periodically. I need task scheduling, asynchronous retrieval, pipeline automation, and lots of other tools to make this easier.
  – You may want to look into a crawling/scraping framework like Scrapy

Page 6: Intro to web scraping with Python


Agenda

● We'll walk through examples for the two scenarios under “use case 1”
  – We will use two web sites, pacificroadrunners.ca/firsthalf/ and ultrasignup.com, to scrape the results for all years of two running races:
    ● Pacific Road Runners First Half Marathon
    ● Chuckanut 50K

Page 7: Intro to web scraping with Python


Basic scraping: static HTML

● Tasks:
  – Inspect the HTML source to locate where in the Document Object Model (DOM) the data of interest resides
    ● Right-click on the page in the browser window and select “View Page Source”
  – Retrieve the HTML file using the built-in urllib library (note that the interface to this library has changed between Python 2 and 3)
  – Parse the HTML file and extract the data of interest, using either a built-in parser like html.parser, or third-party libraries such as lxml, html5lib or BeautifulSoup

● Which library you choose for parsing depends on several factors (see the snippet after this list):
  – how sloppy your HTML is: if your HTML is “incorrect”, the parsers may handle it differently
  – whether you want to maximize speed (choose lxml)
  – what external dependencies you can live with (lxml is a binding to the popular C libraries libxml2 and libxslt)
  – how “easy” you want the interface to be (BeautifulSoup has an easy-to-get-started API and great documentation with examples)
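To make the “sloppy HTML” point concrete, here is the classic comparison from the BeautifulSoup documentation: the same invalid fragment is repaired differently by each parser (output shown in comments):

    from bs4 import BeautifulSoup

    # Each parser repairs the invalid fragment "<a></p>" differently
    print(BeautifulSoup("<a></p>", "html.parser"))
    # <a></a>
    print(BeautifulSoup("<a></p>", "lxml"))
    # <html><body><a></a></body></html>
    print(BeautifulSoup("<a></p>", "html5lib"))
    # <html><head></head><body><a></a></body></html>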

Page 8: Intro to web scraping with Python


urllib library

● High-level standard library for retrieving data across the Web (not just HTML files)
● The urllib library was split into several submodules in Python 3 for working with URLs:
  – urllib.request, urllib.parse, urllib.error, urllib.robotparser
● The urllib.request module provides methods for opening URLs

● The urllib.request.urlopen() method is similar to the built-in function open() for opening files
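A minimal Python 3 sketch of retrieving a page with urllib.request (the UTF-8 encoding is an assumption; a real page may declare a different one):

    from urllib.request import urlopen

    # urlopen() works much like open(); read() returns bytes,
    # so decode them to get the HTML as a string
    with urlopen("http://pacificroadrunners.ca/firsthalf/") as response:
        html = response.read().decode("utf-8")

    print(html[:200])  # peek at the start of the page source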

Page 9: Intro to web scraping with Python


BeautifulSoup library

● Python third-party library for extracting data from HTML and XML files
● Works with html.parser, lxml and html5lib
● Provides ways to navigate, search and modify the parse tree, selecting elements by their position in the parse tree, tag name, tag attributes or CSS classes, using regular expressions, user-defined functions, etc.

● Excellent tutorial with examples at http://www.crummy.com/software/BeautifulSoup/bs4/doc/

● Supports Python 2.7 and 3

Page 10: Intro to web scraping with Python


Quick intro to BeautifulSoup

● The BeautifulSoup constructor creates an object that represents the original document as a nested data structure
● The simplest way to address a tag is to use its name
● The find() and find_all() methods let you search for an element; find() returns the first element found, and find_all() a list of elements

● You can access a tag's attributes by treating the tag like a dictionary

● You can extract text from the element

Parsing samplepage.html with BeautifulSoup
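The code from this slide is not preserved, but a sketch along these lines, run against the hypothetical samplepage.html reconstructed earlier, demonstrates each of the points above:

    from bs4 import BeautifulSoup

    # Build the parse tree from the stand-in sample page
    with open("samplepage.html") as f:
        soup = BeautifulSoup(f, "html.parser")

    # Address a tag by its name
    print(soup.title)                 # <title>Sample page</title>

    # find() returns the first match, find_all() a list of matches
    table = soup.find("table", id="days")
    rows = table.find_all("tr")

    # Access a tag's attributes as if it were a dictionary
    print(table["id"])                # days

    # Extract text from elements
    for row in rows:
        print([cell.get_text() for cell in row.find_all(["th", "td"])])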

Page 11: Intro to web scraping with Python


Example: Results tab of Vancouver First Half race

The main Results page has links to yearly results

Page 12: Intro to web scraping with Python


HTML source inspection: front page of results

Find the links in the HTML source file corresponding to the rows of yearly results you see in the browser

Note that the links we are interested in are all inside <div class="entry-content"> ... </div>

The links to each year's page are inside an <a> tag, as the value of the href attribute

The text field of the <a> ... </a> element is a string containing the year of the race

Page 13: Intro to web scraping with Python


Extract all links to individual year's results

Note that the collected links are relative; we can append them to the main URL to get absolute links
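The slide's code is not preserved; a sketch of the idea might look like this (the results page path and link layout are assumptions based on the description above):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    base_url = "http://pacificroadrunners.ca/firsthalf/"

    # Fetch and parse the main Results page (path assumed)
    with urlopen(base_url + "results/") as response:
        soup = BeautifulSoup(response.read(), "html.parser")

    # The yearly links all sit inside <div class="entry-content">;
    # each <a> element's text is the year, and its href is relative
    year_links = {}
    for a in soup.find("div", class_="entry-content").find_all("a", href=True):
        year_links[a.get_text(strip=True)] = base_url + a["href"]  # relative -> absolute

    print(year_links)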

Page 14: Intro to web scraping with Python


HTML source inspection: individual year

Find the table of results in the DOM. This is the table we want, with each runner's info in <tr> ... </tr>.

Each participant's entry is a table row <tr> ... </tr>. The text (participant's placement, name, finishing time, etc.) is inside <td> ... </td> elements nested in the table row.

Page 15: Intro to web scraping with Python


Scrape results for all years

First five finishers of the 2016 race on the web page

Let's print out the first five rows of the 2016 entry in our dictionary
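The slide's code is not preserved; continuing the sketch above (reusing year_links and the imports), the scrape over all years might look like:

    # Pull each year's result table into a dictionary keyed by year
    results = {}
    for year, url in year_links.items():
        with urlopen(url) as response:
            soup = BeautifulSoup(response.read(), "html.parser")
        table = soup.find("table")  # assumes the results table is the first table on the page
        results[year] = [
            [td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")
        ]

    # Print the first five rows of the 2016 entry
    for row in results["2016"][:5]:
        print(row)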

Page 16: Intro to web scraping with Python


What is Ajax

● Name comes from Asynchronous JavaScript and XML, even though XML is no longer the only format for data exchange

● Refers to a group of web technologies used to implement web applications that communicate with the server in the background without interfering with the display of the page or reloading the entire page

● For web scraping, this means that the data you are interested in may not be in the source HTML file you receive from the server but is generated on the client side

● You can use browser development tools (“inspect element”) to inspect the Ajax-generated HTML that your browser displays

Page 17: Intro to web scraping with Python


Second example: ultrasignup.com

So far, the web site looks similar to the previous example. We are interested in scraping the results for all years of the race.

Page 18: Intro to web scraping with Python


Results page for an individual year

This is a sample page for 2014 results. It looks like the results are in a neat table.

Page 19: Intro to web scraping with Python


Inspecting the DOM: the result table is not filled out in the source HTML

However, the <table id="list"> element is empty when we view the source HTML file. This means we can't simply extract the result table from the HTML file that the server sent us, as we did in the previous example.

We see all the result rows in <table id="list"> ... </table> when inspecting the element in browser developer tools such as Mozilla Firebug (which show the generated HTML)

Again, the details of the results are within the <td> ... </td> elements inside each <tr> ... </tr> element

Page 20: Intro to web scraping with Python


Scraping dynamically generated HTML

● Your scraping system needs to be able to execute JavaScript and either retrieve the results directly, or generate the HTML as seen in the browser and scrape data from that

● A popular choice for the latter is to use Selenium, a suite of tools for browser automation
  – Python's binding to the Selenium suite is the selenium library
  – Selenium supports several browsers (Firefox, Chrome, etc.), but for scraping, a headless non-graphical browser like PhantomJS is a good choice
  – Once Selenium has generated the HTML after running the JavaScript, you can either access it as a string and use it as input to BeautifulSoup for data extraction, or you can navigate the DOM and find elements within Selenium

Page 21: Intro to web scraping with Python


Quick intro to selenium

● Selenium WebDriver accepts commands, sends them to the browser, and retrieves results
● You can navigate and search the DOM:
  – find_element_by_id('my_id')
  – find_elements_by_tag_name('table')
  – find_elements_by_class_name('my_class_name')
  – find_elements_by_xpath()
    ● For example, find_elements_by_xpath("//table[@id='list']") will find a table with the id "list"
    ● Tutorial on XPath syntax: www.w3schools.com/xsl/xpath_intro.asp
  – and other methods (using CSS selectors, link text, etc.)
● You can click on links using WebDriver's click() method
● You can retrieve the generated HTML source by calling .page_source on the webdriver instance and parse it outside of Selenium

● More on Selenium with Python:

http://selenium-python.readthedocs.org/index.html
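A minimal sketch tying these pieces together; example.com is a placeholder target, used here only because it reliably serves a simple page with an <h1> element:

    from selenium import webdriver

    # PhantomJS is headless, so no browser window opens
    driver = webdriver.PhantomJS()
    driver.get("http://www.example.com/")  # placeholder URL

    # Search the DOM with one of the methods listed above
    heading = driver.find_element_by_tag_name("h1")
    print(heading.text)

    # Retrieve the generated HTML source for parsing outside Selenium
    html = driver.page_source
    driver.quit()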

Page 22: Intro to web scraping with Python


Scraping ultrasignup race results using Selenium and PhantomJS
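The slide's code is not preserved; a hedged reconstruction of the approach, with a made-up event id (the real Chuckanut 50K result pages on ultrasignup.com would have to be looked up), might look like:

    from selenium import webdriver
    from bs4 import BeautifulSoup

    url = "http://ultrasignup.com/results_event.aspx?did=12345"  # hypothetical event id

    driver = webdriver.PhantomJS()  # headless browser runs the page's JavaScript
    driver.implicitly_wait(10)      # poll up to 10 s for elements to appear
    driver.get(url)

    # Block until the Ajax-generated result table exists, then hand the
    # rendered page to BeautifulSoup and close the browser
    driver.find_element_by_xpath("//table[@id='list']")
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()

    # Collect each result row's <td> texts
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find("table", id="list").find_all("tr")
    ]

    # As the next slide notes, the first two rows are not finisher results
    for row in rows[2:14]:  # first 12 finishers
        print(row)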

Page 23: Intro to web scraping with Python


And it works!

Note that the first two lists (rows in the web page table) in our dictionary entry for a year do not actually contain finisher results. Let's ignore those, and print out the subsequent results for the first 12 finishers.

This table is from the 2014 results page for comparison: the results match

Page 24: Intro to web scraping with Python


Additional resources for scraping projects

Additional third-party libraries you may want to look at:

● requests library

– Filling out forms/login information before accessing data and managing cookies

● mechanize library

– Navigating through forms, session and cookie handling

– Does not yet support Python 3

– Does not run JavaScript

● scrapy library

– Framework for web crawling and scraping: managing crawling, asynchronous scheduling and processing of requests, cookie and session handling, built-in support for data extraction, etc.

– Does not yet support Python 3

Page 25: Intro to web scraping with Python


References for further reading

● BeautifulSoup
  – http://www.crummy.com/software/BeautifulSoup/bs4/doc/

● Selenium with Python

– http://selenium-python.readthedocs.org/

● General web scraping
  – Web Scraping with Python by Ryan Mitchell, O'Reilly 2015