Web Scraping with Python
DESCRIPTION
Data Wranglers DC December meetup: http://www.meetup.com/Data-Wranglers-DC/events/151563622/

There's a lot of data sitting on websites just waiting to be combined with the data you have sitting on your servers. During this talk, Robert Dempsey will show you how to create a dataset using Python by scraping websites for the data you want.

TRANSCRIPT
Web Scraping With Python
Robert Dempsey
There is a lot of data provided freely on the Internet. Not all data is free, however, and not all site owners allow you to scrape data from their sites. ALWAYS check a website's terms of service BEFORE scraping it. Be responsible, and stay within legal limits at all times.
Important Disclaimer
Data Wranglers LinkedIn Group: where the discussions happen.
If you have a question – ask it. Be polite and courteous to others.
Turn your cell phones to vibrate when you come to the meeting. You know more than you think. At some point, I'd like you to share something you've learned with us so we can all benefit from it.
Group Rules
Twitter Hashtag
#dwdc
Wireless Network: Logik_guest
Password: logik1234
Connecting to the Internet
www.fminer.com
www.websundew.com
www.visualwebripper.com
screen-scraper.com
XPath
XPath Helper, by Adam Sadovsky
XPath Finder
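Once one of these helper extensions gives you an XPath expression, you can try it out directly in Python. A minimal sketch using the standard library's xml.etree.ElementTree, which supports a limited XPath subset (the document here is a made-up sample):

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample document (ElementTree requires valid XML)
doc = """
<html>
  <body>
    <table id="companies">
      <tr><td>Acme Corp</td><td>Software</td></tr>
      <tr><td>Globex</td><td>Energy</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(doc)

# ElementTree supports a subset of XPath: here, the first <td> of each row
names = [td.text for td in root.findall(".//table[@id='companies']/tr/td[1]")]
print(names)  # ['Acme Corp', 'Globex']
```

For full XPath 1.0 support against real-world (often malformed) HTML, lxml is the usual choice.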
Our method: BeautifulSoup4 + Python libraries
Scrapy: an application framework (you still have to write code), http://scrapy.org
DIY Scraper - Python
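A minimal sketch of the DIY BeautifulSoup4 approach. The HTML is an inline sample so the snippet runs without a network connection; the tag and class names are invented for illustration:

```python
from bs4 import BeautifulSoup

# Inline sample page standing in for a fetched document;
# in practice you would download the HTML with urllib or similar
html = """
<html><body>
  <h1>Meetups</h1>
  <ul class="events">
    <li><a href="/e/1">Web Scraping</a></li>
    <li><a href="/e/2">Data Wrangling</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull out every event link inside the .events list with a CSS selector
for link in soup.select("ul.events a"):
    print(link["href"], link.get_text())
```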
Bare metal: Nokogiri + Mechanize
Frameworks:
Upton: https://github.com/propublica/upton
Wombat: https://github.com/felipecsl/wombat
DIY Scraper - Ruby
Browser Extensions For Scraping
Scraper
https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd
Grabbing The Full Monty
SiteSucker: sitesucker.us
Wget: http://www.gnu.org/s/wget/
CSS sprites
Honeypots
IP blocking
CAPTCHAs
Login requirements
Ad popups
The Ways Websites Try To Block Us
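None of these blocks can be coded around directly, but basic courtesy measures, a descriptive User-Agent and randomized delays between requests, reduce the chance of tripping IP blocking in the first place. A sketch using the standard library; the header value and delay range are arbitrary examples:

```python
import random
import time
import urllib.request

# A descriptive User-Agent identifies your scraper to site owners;
# the value here is a placeholder
HEADERS = {"User-Agent": "dwdc-scraper/0.1 (contact: you@example.com)"}

def polite_request(url):
    """Build a request with our headers, pausing a random interval first."""
    time.sleep(random.uniform(1.0, 3.0))  # 1-3 second pause between hits
    return urllib.request.Request(url, headers=HEADERS)

req = polite_request("http://example.com/")
print(req.get_header("User-agent"))
```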
NetShade: http://raynersoftware.com/netshade/
WinGate: http://www.wingate.com/
Continuum.io: Anaconda http://continuum.io/downloads
BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
pip install beautifulsoup4 (or easy_install beautifulsoup4)
unicodecsv
pip install unicodecsv
Installs
1. Find the webpage(s) you want
2. Get the path to the data using XPath or CSS selectors
3. Write the code
4. Test
5. Scrape
6. Export to CSV
7. Enjoy your data!
General Steps
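The steps above can be sketched end to end in a few lines. The sample HTML, table structure, and CSS selector are assumptions standing in for whatever page you found in step 1:

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample HTML standing in for the page you want to scrape
html = """
<table class="rankings">
  <tr><td>1</td><td>Acme Corp</td></tr>
  <tr><td>2</td><td>Globex</td></tr>
</table>
"""

def scrape_to_csv(page_html, out):
    """Select the rows, pull the cell text, and write CSV."""
    soup = BeautifulSoup(page_html, "html.parser")
    writer = csv.writer(out)
    writer.writerow(["rank", "company"])
    for row in soup.select("table.rankings tr"):
        writer.writerow([td.get_text() for td in row.select("td")])

buf = io.StringIO()       # a real run would open a file instead
scrape_to_csv(html, buf)
print(buf.getvalue())
```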
1. Ensure you’ve installed the extension
2. Log in to Google Docs (this is where the data goes)
3. Open the URL: http://www.inc.com/inc5000/list
4. Highlight the first line
5. Right-click and select “Scrape Similar”
6. Verify the data in the window that pops up
7. Click the “Export to Google Docs…” button
8. Voila!
#1: Scraping the Inc. 5000 with Scraper
Only works with data in a tabular format
Only exports to Google Docs
Works on one page at a time
Suggestion: keep the scraping window open, go to the next page, and click “Scrape” again.
Notes On Scraper
BeautifulSoup
A toolkit for dissecting a document and extracting what you need
Automatically converts incoming documents to Unicode and outgoing documents to UTF-8
Sits on top of popular Python parsers like lxml and html5lib
Examples http://www.crummy.com/software/BeautifulSoup/bs4/doc/
#2: Using Python to Scrape Pages
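The Unicode conversion is automatic: hand BeautifulSoup raw bytes and you get Python Unicode strings back. A small sketch with a made-up snippet containing accented characters:

```python
from bs4 import BeautifulSoup

# Bytes as they might arrive over the wire, here UTF-8 encoded
raw = "<div><p>Caf\u00e9 Cr\u00e8me</p><p>Second</p></div>".encode("utf-8")

# BeautifulSoup accepts the raw bytes and hands back Unicode strings
soup = BeautifulSoup(raw, "html.parser")

paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)  # ['Café Crème', 'Second']
```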
1. Import your libraries
2. Take a LinkedIn URL as input
3. Build an opener
4. Create the soup using BS4
5. Extract the company description and specialties
6. Clean up the rest of the data
7. Extract the website, type, founded, industry, and company size if they exist; otherwise set them to “N/A”
8. Output to CSV
9. Sleep some random number of seconds and milliseconds
Scraping LinkedIn Company Pages - PseudoCode
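A sketch of the extract-or-“N/A” and CSV steps from the pseudocode. The class names and page structure here are entirely hypothetical (LinkedIn's real markup differs and changes), and the sample HTML stands in for a fetched page so the snippet runs offline:

```python
import csv
import io

from bs4 import BeautifulSoup

FIELDS = ["website", "type", "founded", "industry", "company size"]

def extract_company(page_html):
    """Soup the page and pull each field, defaulting to N/A (steps 4-7)."""
    soup = BeautifulSoup(page_html, "html.parser")
    record = {}
    for field in FIELDS:
        # Hypothetical markup: each fact in a <li> whose class is the field name
        node = soup.find("li", class_=field.replace(" ", "-"))
        record[field] = node.get_text(strip=True) if node else "N/A"
    return record

# Made-up sample page; a real run would fetch the LinkedIn URL instead
sample = """
<ul>
  <li class="website">http://example.com</li>
  <li class="industry">Software</li>
</ul>
"""

record = extract_company(sample)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(record)   # step 8: output to CSV
print(out.getvalue())
```

Fields missing from the page come out as “N/A”, matching step 7 of the pseudocode.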
https://github.com/rdempsey/dwdc
Get The Code
Contacting Rob
Email: [email protected]
Twitter: rdempsey
LinkedIn: robertwdempsey