Overview of Python web scraping tools

Overview of Python web scraping tools Maik Röder Barcelona Python Meetup Group 17.05.2012 Friday, May 18, 2012

Upload: maikroeder

Post on 30-Aug-2014


DESCRIPTION

A talk I gave at the Barcelona Python Meetup May 2012.

TRANSCRIPT

Page 1: Overview of python web scraping tools

Overview of Python web scraping tools

Maik Röder

Barcelona Python Meetup Group

17.05.2012

Friday, May 18, 2012

Page 2: Overview of python web scraping tools

Data Scraping

• Automated Process

• Explore and download pages

• Grab content

• Store in a database or in a text file
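The steps above can be sketched with the standard library alone. This is a minimal illustration, assuming Python 3; the inline `PAGE` string stands in for a downloaded page, and the `LinkCollector` class, the `store_links` helper and the `links` table are hypothetical names that do not come from the talk.

```python
import sqlite3
from html.parser import HTMLParser

# Stand-in for a downloaded page (hypothetical content).
PAGE = '<html><body><a href="/a.html">A</a> <a href="/b.html">B</a></body></html>'


class LinkCollector(HTMLParser):
    """Grab content: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def store_links(conn, links):
    """Store: write the collected links to a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT)")
    conn.executemany("INSERT INTO links (href) VALUES (?)", [(link,) for link in links])
    conn.commit()


collector = LinkCollector()
collector.feed(PAGE)

conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
store_links(conn, collector.links)
stored = [row[0] for row in conn.execute("SELECT href FROM links ORDER BY rowid")]
```

In a real scraper the links found this way would feed back into the download step, which is what makes the process automated.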


Page 3: Overview of python web scraping tools

urlparse

• Manipulate URL strings

urlparse.urlparse()
urlparse.urljoin()
urlparse.urlunparse()
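A short sketch of the three calls, assuming Python 3, where the Python 2 `urlparse` module lives on as `urllib.parse`. The `example.com` URL is a made-up placeholder.

```python
from urllib.parse import urljoin, urlparse, urlunparse

url = "http://www.example.com/history/index.html?year=2007#top"

# Split the URL into its named components.
parts = urlparse(url)

# Resolve a relative link against the page it was found on.
absolute = urljoin(url, "../about.html")

# Change one component and put the URL back together.
rebuilt = urlunparse(parts._replace(query="year=2008"))
```

`urljoin()` is the call a crawler needs most often, since links found in a page are frequently relative to that page.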


Page 4: Overview of python web scraping tools

urllib

• Download data through different protocols

• HTTP, FTP, ...

urllib.parse()
urllib.urlopen()
urllib.urlretrieve()
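A minimal download sketch, assuming Python 3, where `urlopen()` and `urlretrieve()` are found in `urllib.request`. To keep it runnable without network access it reads a `file:` URL pointing at a temporary file created on the spot; the file names are hypothetical, and an `http:` or `ftp:` URL would be passed in the same way.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen, urlretrieve

# Create a small local file so the example needs no network access.
workdir = Path(tempfile.mkdtemp())
source = workdir / "page.html"
source.write_text("<html><body>Hello</body></html>", encoding="utf-8")
url = source.as_uri()  # a file:// URL

# urlopen() returns a file-like response object.
with urlopen(url) as response:
    body = response.read().decode("utf-8")

# urlretrieve() copies the resource to a local file
# and returns a (filename, headers) pair.
target = workdir / "copy.html"
filename, headers = urlretrieve(url, str(target))
```

The Python 3 documentation describes `urlretrieve()` as a legacy interface, so `urlopen()` is probably the safer default for new code.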


Page 5: Overview of python web scraping tools

Scrape a web site

• Example: http://www.wunderground.com/


Page 7: Overview of python web scraping tools

BeautifulSoup

• HTML/XML parser

• designed for quick turnaround projects like screen-scraping

• http://www.crummy.com/software/BeautifulSoup


Page 8: Overview of python web scraping tools

BeautifulSoup

from BeautifulSoup import *

a = BeautifulSoup(d).findAll('a')

[x['href'] for x in a]


Page 9: Overview of python web scraping tools

Faster BeautifulSoup

from BeautifulSoup import *

p = SoupStrainer('a')

a = BeautifulSoup(d, parseOnlyThese=p)

[x['href'] for x in a]
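The two slides above use the BeautifulSoup 3 API. A sketch of the same link extraction with the successor package `bs4`, assuming it is installed: `findAll` becomes `find_all` and the `parseOnlyThese` argument becomes `parse_only`. The HTML in `d` is a made-up stand-in for a downloaded page.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page content.
d = (
    "<html><body><p>Intro</p>"
    '<a href="/a.html">A</a>'
    '<p><a href="/b.html">B</a></p>'
    "</body></html>"
)

# Parse the whole document, then search it.
soup = BeautifulSoup(d, "html.parser")
all_links = [a["href"] for a in soup.find_all("a")]

# Build a tree from the <a> tags only; the rest of the page is skipped.
only_a = SoupStrainer("a")
strained = BeautifulSoup(d, "html.parser", parse_only=only_a)
strained_links = [a["href"] for a in strained.find_all("a")]
```

Both approaches return the same links; the strainer only reduces how much of the document is turned into a tree.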


Page 10: Overview of python web scraping tools

Inspect the Element

• Inspect the Maximum temperature


Page 11: Overview of python web scraping tools

Find the node

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(d)
>>> attrs = {'class':'nobr'}
>>> nobrs = soup.findAll(attrs=attrs)
>>> temperature = nobrs[3].span.string
>>> print temperature
23
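The same lookup can be sketched with `bs4` under Python 3, assuming the package is installed. The markup below is invented for illustration and is not the real wunderground.com page, so the index into `nobrs` differs from the slide.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the lookup in the slide.
d = """
<table>
  <tr><td class="nobr"><span>25</span> C</td></tr>
  <tr><td class="nobr"><span>23</span> C</td></tr>
</table>
"""

soup = BeautifulSoup(d, "html.parser")

# Find every element whose class attribute is "nobr".
nobrs = soup.find_all(attrs={"class": "nobr"})

# Take the text of the <span> inside the second match.
temperature = nobrs[1].span.string
```

Selecting by position, as with `nobrs[1]` here or `nobrs[3]` in the slide, is fragile: it breaks as soon as the site adds or removes a matching element.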


Page 12: Overview of python web scraping tools

htmllib.HTMLParser

• Interesting only for historical reasons

• based on sgmllib


Page 13: Overview of python web scraping tools

html5lib

• Using the custom simpletree format

• a built-in DOM-ish tree type (pythonic idioms)

from html5lib import parse
from html5lib import treebuilders
e = treebuilders.simpletree.Element
i = parse(d)
a = [x for x in i if isinstance(x, e) and x.name == 'a']
[x.attributes['href'] for x in a]


Page 14: Overview of python web scraping tools

lxml

• Library for processing XML and HTML

• Based on C libraries

sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev

• Extends the ElementTree API

• e.g. with XPath


Page 15: Overview of python web scraping tools

lxml

from lxml import etree
t = etree.parse('t.xml')
for node in t.xpath('//a'):
    node.tag
    node.get('href')
    node.items()
    node.text
    node.getparent()
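Since lxml extends the ElementTree API, most of the calls above can be tried with the standard library's `xml.etree.ElementTree`. The sketch below is that stdlib substitute, not lxml itself: it supports only a limited XPath subset through `findall()`, and it has neither the `xpath()` method nor `getparent()`, which lxml adds. The XML content is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, parsed from a string instead of 't.xml'.
doc = ET.fromstring(
    "<root>"
    "<a href='/a.html'>A</a>"
    "<p><a href='/b.html'>B</a></p>"
    "</root>"
)

# ".//a" is the ElementTree counterpart of the XPath "//a" used with lxml.
links = []
for node in doc.findall(".//a"):
    links.append((node.tag, node.get("href"), node.text, list(node.items())))
```

Code written against this subset should carry over to `lxml.etree` largely unchanged, at which point full XPath expressions become available.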


Page 16: Overview of python web scraping tools

twill

• Simple

• No JavaScript

• http://twill.idyll.org

• Some more interesting concepts

• Pages, Scenarios

• State Machines


Page 17: Overview of python web scraping tools

twill

• Commonly used methods:

go()
code()
show()
showforms()
formvalue() (or fv())
submit()


Page 18: Overview of python web scraping tools

Twill

>>> from twill import commands as twill
>>> from twill import get_browser
>>> twill.go('http://www.google.com')
>>> twill.showforms()
>>> twill.formvalue(1, 'q', 'Python')
>>> twill.showforms()
>>> twill.submit()
>>> get_browser().get_html()


Page 20: Overview of python web scraping tools

Twill

>>> twill.config("acknowledge_equiv_refresh", "false")
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'


Page 21: Overview of python web scraping tools

mechanize

• Stateful programmatic web browsing

• navigation history

• HTML form state

• cookies

• ftp:, http: and file: URL schemes

• redirections

• proxies

• Basic and Digest HTTP authentication


Page 24: Overview of python web scraping tools

Selenium

• http://seleniumhq.org

• Support for JavaScript


Page 25: Overview of python web scraping tools

Selenium

from selenium import webdriver
from selenium.common.exceptions \
    import NoSuchElementException
from selenium.webdriver.common.keys \
    import Keys
import time


Page 27: Overview of python web scraping tools

PhantomJS

• http://www.phantomjs.org/
