Web Scraping With Python Robert Dempsey

Upload: robert-dempsey

Post on 26-Aug-2014


DESCRIPTION

Data Wranglers DC December meetup: http://www.meetup.com/Data-Wranglers-DC/events/151563622/ There's a lot of data sitting on websites just waiting to be combined with data you have sitting on your servers. During this talk, Robert Dempsey will show you how to create a dataset using Python by scraping websites for the data you want.

TRANSCRIPT

Page 1: Web Scraping With Python

Web Scraping With Python

Robert Dempsey

Page 2: Web Scraping With Python

Important Disclaimer

There is a lot of data provided freely on the Internet. Not all data is free, and not all site owners allow you to scrape data from their sites. ALWAYS check the terms of service for a website BEFORE scraping it. Be responsible, and stay within legal limits at all times.

Page 6: Web Scraping With Python

Data Wranglers LinkedIn Group

Where the discussions happen.

Page 7: Web Scraping With Python

Group Rules

If you have a question, ask it.
Be polite and courteous to others.
Turn your cell phones to vibrate when you come to the meeting.
You know more than you think. At some point, I'd like you to share with us something you've learned so we can all benefit from it.

Page 9: Web Scraping With Python

Twitter Hashtag

#dwdc

Page 10: Web Scraping With Python

Connecting to the Internet

Wireless Network: Logik_guest
Password: logik1234

Page 13: Web Scraping With Python

www.fminer.com

Page 14: Web Scraping With Python

www.websundew.com

Page 15: Web Scraping With Python

www.visualwebripper.com

Page 16: Web Scraping With Python

screen-scraper.com

Page 18: Web Scraping With Python

XPath

XPath Helper (Adam Sadovsky)

XPath finder

Page 19: Web Scraping With Python

DIY Scraper - Python

Our method: BeautifulSoup4 + Python libraries
Scrapy: Application framework (you still have to code) http://scrapy.org

Page 20: Web Scraping With Python

DIY Scraper - Ruby

Bare Metal = Nokogiri + Mechanize
Frameworks
Upton: https://github.com/propublica/upton
Wombat: https://github.com/felipecsl/wombat

Page 21: Web Scraping With Python

Browser Extensions For Scraping

Scraper

https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd

Page 22: Web Scraping With Python

Grabbing The Full Monty

SiteSucker: sitesucker.us

Wget: http://www.gnu.org/s/wget/

Page 23: Web Scraping With Python

The Ways Websites Try To Block Us

CSS Sprites
Honeypots
IP blocking
Captcha
Login
Ad popups

Page 25: Web Scraping With Python

NetShade: http://raynersoftware.com/netshade/

WinGate: http://www.wingate.com/

Page 28: Web Scraping With Python

Installs

Continuum.io: Anaconda
http://continuum.io/downloads

BeautifulSoup
http://www.crummy.com/software/BeautifulSoup/
pip install beautifulsoup4
easy_install beautifulsoup4

Unicodecsv
pip install unicodecsv
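A quick way to confirm the installs worked is to ask Python whether each package can be located. This is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for illustration, and `bs4` and `unicodecsv` are the import names of the two pip packages listed on this slide.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level import names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # beautifulsoup4 is imported as "bs4"; unicodecsv keeps its own name.
    missing = missing_packages(["bs4", "unicodecsv"])
    if missing:
        print("Still need to install:", ", ".join(missing))
    else:
        print("All set.")
```

Note that `unicodecsv` was mainly useful on Python 2; on Python 3 the built-in `csv` module already handles Unicode text.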

Page 29: Web Scraping With Python

General Steps

Find the webpage(s) you want
Get the path to the data using XPath or CSS selectors
Write the code
Test
Scrape
Export to CSV
Enjoy your data!
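The general steps can be sketched end to end on a small, made-up page. To stay self-contained, this example uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup; real pages usually need BeautifulSoup or lxml instead. The HTML snippet, the `companies` table id, and the column names are all invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The "webpage": a made-up, well-formed snippet standing in for a real page.
PAGE = """
<html><body>
  <table id="companies">
    <tr><td class="name">Acme</td><td class="city">Washington</td></tr>
    <tr><td class="name">Globex</td><td class="city">Arlington</td></tr>
  </table>
</body></html>
"""


def scrape_rows(html):
    """Locate each table row by its path and pull out the cell text."""
    root = ET.fromstring(html)
    rows = []
    # Limited XPath: every <tr> directly inside the table with id "companies".
    for tr in root.findall(".//table[@id='companies']/tr"):
        name = tr.findtext("td[@class='name']", default="N/A")
        city = tr.findtext("td[@class='city']", default="N/A")
        rows.append({"name": name, "city": city})
    return rows


def to_csv(rows):
    """Export the scraped rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "city"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape_rows(PAGE)
csv_text = to_csv(rows)
print(csv_text)
```

Keeping the scraping and the export in separate functions makes the "Test" step easier: the path expressions can be checked against a saved copy of the page before any live scraping happens.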

Page 30: Web Scraping With Python

#1: Scraping the Inc. 5000 with Scraper

1. Ensure you’ve installed the extension
2. Log in to Google Docs (this is where the data goes)
3. Open the URL: http://www.inc.com/inc5000/list
4. Highlight the first line
5. Right-click and select “Scrape Similar”
6. Verify the data in the window that pops up
7. Click the “Export to Google Docs…” button
8. Voila!

Page 31: Web Scraping With Python

Notes On Scraper

Only works with data in a tabular format
Only exports to Google Docs
Works on one page at a time

Suggestion: Keep the scraping window open, go to the next page, click “Scrape” again.

Page 32: Web Scraping With Python

#2: Using Python to Scrape Pages

BeautifulSoup
A toolkit for dissecting a document and extracting what you need.
Automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
Sits on top of popular Python parsers like lxml and html5lib.

Examples: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
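The encoding behaviour described on this slide can be seen in a short sketch. It assumes `beautifulsoup4` is installed and uses Python's built-in `html.parser` so that no extra parser is required; the page content and the `name` class are made up for illustration.

```python
from bs4 import BeautifulSoup

# A made-up page encoded as ISO-8859-1, with the encoding declared in a <meta> tag.
raw_bytes = (
    '<html><head><meta charset="iso-8859-1"></head>'
    '<body><p class="name">Café Example</p></body></html>'
).encode("iso-8859-1")

# "html.parser" ships with Python; "lxml" or "html5lib" can be named instead if installed.
soup = BeautifulSoup(raw_bytes, "html.parser")

# Incoming bytes are converted to Unicode text.
name = soup.find("p", class_="name").get_text()

# Outgoing documents are encoded as UTF-8 by default.
utf8_bytes = soup.encode()

print(name)
print(type(utf8_bytes).__name__)
```

Declaring the encoding in the sample page keeps the detection deterministic; on pages without a declaration, BeautifulSoup falls back to guessing, which can be less reliable for short inputs.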

Page 33: Web Scraping With Python

Scraping LinkedIn Company Pages - PseudoCode

1. Import your libraries
2. Take a LinkedIn URL as input
3. Build an opener
4. Create the soup using BS4
5. Extract the company description and specialties
6. Clean up the rest of the data
7. Extract the website, type, founded, industry, and company size if they exist, otherwise set them to “N/A”
8. Output to CSV
9. Sleep some random number of seconds & milliseconds
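The pseudocode can be approximated with a few small helpers. This is an illustrative sketch, not the code from the talk's repository: the CSS class names, the `example-scraper/0.1` user agent, and all function names are hypothetical, the sample HTML is invented, and Python 3's built-in `csv` module stands in for `unicodecsv`. No network request is made.

```python
import csv
import io
import random
import time
import urllib.request

from bs4 import BeautifulSoup

# Hypothetical output fields mapped to made-up CSS classes; a real page will differ.
FIELDS = {
    "website": "website",
    "type": "type",
    "founded": "founded",
    "industry": "industry",
    "company_size": "company-size",
}


def build_opener(user_agent="example-scraper/0.1"):
    """Build an opener that sends a User-Agent header with each request."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def extract_fields(html):
    """Create the soup, then read each field or fall back to "N/A"."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, css_class in FIELDS.items():
        tag = soup.find(class_=css_class)
        record[field] = tag.get_text(strip=True) if tag else "N/A"
    return record


def to_csv_line(record):
    """Serialize one record as a single CSV line."""
    buffer = io.StringIO()
    csv.DictWriter(buffer, fieldnames=list(FIELDS)).writerow(record)
    return buffer.getvalue()


def polite_sleep(low=1.0, high=3.0):
    """Pause for a random, fractional number of seconds and return the delay."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay


# Usage on a made-up snippet; missing fields come back as "N/A".
sample = (
    '<div><span class="website">http://example.com</span>'
    '<span class="founded">2009</span></div>'
)
record = extract_fields(sample)
line = to_csv_line(record)
print(line)
```

In a full script, the opener's `open(url).read()` would supply the HTML passed to `extract_fields`, and `polite_sleep()` would run between requests so that pages are not fetched in rapid succession.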

Page 34: Web Scraping With Python

Get The Code

https://github.com/rdempsey/dwdc

Page 38: Web Scraping With Python

Contacting Rob

Email: [email protected]
Twitter: rdempsey
LinkedIn: robertwdempsey