Overview of Python web scraping tools

Maik Röder

Barcelona Python Meetup Group

17.05.2012

Friday, May 18, 2012

Data Scraping

• Automated Process

• Explore and download pages

• Grab content

• Store in a database or in a text file
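The steps above can be sketched end to end without touching the network. This is a minimal illustration, assuming Python 3 and only the standard library; the `PAGE` string, the `LinkCollector` class and the `links` table are hypothetical stand-ins, not part of the original talk.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded page.
PAGE = """
<html><body>
  <a href="/history/2007/5/17">17 May</a>
  <a href="/history/2007/5/18">18 May</a>
  <p>No link here</p>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Grab step: collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def grab_links(html):
    """Return all link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def store_links(conn, links):
    """Store step: write the links into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT)")
    conn.executemany(
        "INSERT INTO links (href) VALUES (?)", [(href,) for href in links]
    )
    conn.commit()


# An in-memory database keeps the sketch free of side effects.
conn = sqlite3.connect(":memory:")
store_links(conn, grab_links(PAGE))
stored = [row[0] for row in conn.execute("SELECT href FROM links ORDER BY rowid")]
print(stored)
```

In a real scraper the `PAGE` string would come from a download step, and the database would be a file rather than `:memory:`.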


urlparse

• Manipulate URL strings

urlparse.urlparse()
urlparse.urljoin()
urlparse.urlunparse()
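A short sketch of the three functions in use. It assumes Python 3, where the `urlparse` module lives on as `urllib.parse`; the relative link `../18/DailyHistory.html` is a made-up example.

```python
from urllib.parse import urlparse, urljoin, urlunparse

url = "http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html"

# Split the URL into its six components.
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# Resolve a (hypothetical) relative link against the current page.
next_day = urljoin(url, "../18/DailyHistory.html")
print(next_day)

# Rebuild a URL from modified components: keep scheme and host, reset the path.
home = urlunparse(parts._replace(path="/"))
print(home)
```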


urllib

• Download data through different protocols

• HTTP, FTP, ...

urllib.urlopen()
urllib.urlretrieve()

• Note: urllib.parse is the Python 3 module that replaces urlparse, not a function in Python 2's urllib
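The two download calls can be tried offline by pointing them at a `file:` URL. This is a sketch assuming Python 3, where both functions live in `urllib.request`; the temporary file and its content are invented for the example.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen, urlretrieve

with tempfile.TemporaryDirectory() as tmp:
    # Create a small local "page" so that no network access is needed.
    source = Path(tmp) / "page.html"
    source.write_text("<html><body>23</body></html>", encoding="utf-8")
    url = source.as_uri()  # a file:// URL

    # urlopen returns a file-like response object.
    with urlopen(url) as response:
        body = response.read().decode("utf-8")

    # urlretrieve copies the resource to a local file.
    target = Path(tmp) / "copy.html"
    urlretrieve(url, str(target))
    copied = target.read_text(encoding="utf-8")

print(body)
```

The same two calls accept `http:` and `ftp:` URLs, which is the point of the slide: the protocol is chosen by the URL scheme.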


Scrape a web site

• Example: http://www.wunderground.com/



Preparation

>>> from StringIO import StringIO
>>> from urllib2 import urlopen
>>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
>>> p = f.read()
>>> d = StringIO(p)
>>> f.close()








BeautifulSoup

• HTML/XML parser

• designed for quick turnaround projects like screen-scraping

• http://www.crummy.com/software/BeautifulSoup






BeautifulSoup

from BeautifulSoup import *

a = BeautifulSoup(d).findAll('a')

[x['href'] for x in a]


Faster BeautifulSoup

from BeautifulSoup import *

p = SoupStrainer('a')

a = BeautifulSoup(d, parseOnlyThese=p)

[x['href'] for x in a]


Inspect the Element

• Inspect the Maximum temperature


Find the node

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(d)
>>> attrs = {'class':'nobr'}
>>> nobrs = soup.findAll(attrs=attrs)
>>> temperature = nobrs[3].span.string
>>> print temperature
23


htmllib.HTMLParser

• Interesting only for historical reasons

• based on sgmllib


html5lib

• Using the custom simpletree format

• a built-in DOM-ish tree type (pythonic idioms)

from html5lib import parse
from html5lib import treebuilders
e = treebuilders.simpletree.Element
i = parse(d)
# Walk the parsed tree i (not the input d) and compare names with ==
a = [x for x in i if isinstance(x, e) and x.name == 'a']
[x.attributes['href'] for x in a]


lxml

• Library for processing XML and HTML

• Based on C libraries

sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev

• Extends the ElementTree API

• e.g. with XPath


lxml

from lxml import etree
t = etree.parse('t.xml')
for node in t.xpath('//a'):
    node.tag
    node.get('href')
    node.items()
    node.text
    node.getparent()  # lowercase in lxml; getParent() does not exist
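The same element API can be explored with the standard library's `xml.etree.ElementTree`, which lxml extends. The sketch below assumes Python 3 and swaps in ElementTree for lxml so that it runs without the C libraries. The `DOC` string is a hypothetical stand-in for `t.xml`. ElementTree supports only a limited XPath subset and has no `getparent()`, so a parent map is built by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed document standing in for 't.xml'.
DOC = """<root>
  <p><a href="http://twill.idyll.org">twill</a></p>
  <p><a href="http://seleniumhq.org" class="ext">Selenium</a></p>
</root>"""

root = ET.fromstring(DOC)

# ElementTree has no getparent(); map each child to its parent instead.
parent_of = {child: parent for parent in root.iter() for child in parent}

rows = []
for node in root.findall(".//a"):  # limited XPath subset, unlike lxml's xpath()
    rows.append((node.tag, node.get("href"), node.text, parent_of[node].tag))

for row in rows:
    print(row)
```

With lxml installed, `node.getparent()` and full XPath expressions via `xpath()` replace the parent map and `findall()`.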


twill

• Simple

• No JavaScript

• http://twill.idyll.org

• Some more interesting concepts

• Pages, Scenarios

• State Machines



twill

• Commonly used methods:

go()
code()
show()
showforms()
formvalue() (or fv())
submit()


Twill

>>> from twill import commands as twill
>>> from twill import get_browser
>>> twill.go('http://www.google.com')
>>> twill.showforms()
>>> twill.formvalue(1, 'q', 'Python')
>>> twill.showforms()
>>> twill.submit()
>>> get_browser().get_html()



Twill - acknowledge_equiv_refresh

>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
...
twill.errors.TwillException: infinite refresh loop discovered; aborting.
Try turning off acknowledge_equiv_refresh...



Twill

>>> twill.config("acknowledge_equiv_refresh", "false")
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'




















mechanize

• Stateful programmatic web browsing

• navigation history

• HTML form state

• cookies

• ftp:, http: and file: URL schemes

• redirections

• proxies

• Basic and Digest HTTP authentication


mechanize - robots.txt

>>> import mechanize
>>> browser = mechanize.Browser()
>>> browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt










mechanize - robots.txt

• Do not handle robots.txt

browser.set_handle_robots(False)

• Do not handle http-equiv headers

browser.set_handle_equiv(False)

browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
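To see what the robots.txt check amounts to before switching it off, the rules can be evaluated by hand. This is a sketch assuming Python 3's `urllib.robotparser`; the rules, the `my-scraper` user agent and the `example.com` URLs are hypothetical and do not reflect any real site's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, given as a list of lines.
RULES = """\
User-agent: *
Disallow: /history/
""".splitlines()

parser = RobotFileParser()
parser.parse(RULES)  # parse the rules directly instead of downloading them

# Ask whether a given user agent may fetch a given URL.
allowed_home = parser.can_fetch("my-scraper", "http://www.example.com/")
allowed_history = parser.can_fetch(
    "my-scraper",
    "http://www.example.com/history/airport/BCN/2007/5/17/DailyHistory.html",
)
print(allowed_home, allowed_history)
```

A disallowed path is what a robots-aware client reports as a 403 "request disallowed by robots.txt"; disabling the check means the scraper no longer honors the site's stated rules.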










Selenium

• http://seleniumhq.org

• Support for JavaScript



Selenium

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time


Selenium

>>> browser = webdriver.Firefox()
>>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
>>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
>>> browser.close()
>>> print a
23








PhantomJS

• http://www.phantomjs.org/


