Web Scraping with Python
DESCRIPTION
Data Wranglers DC December meetup: http://www.meetup.com/Data-Wranglers-DC/events/151563622/

There's a lot of data sitting on websites just waiting to be combined with the data you have sitting on your servers. During this talk, Robert Dempsey will show you how to create a dataset using Python by scraping websites for the data you want.

TRANSCRIPT
Web Scraping With Python
Robert Dempsey
There is a lot of data provided freely on the Internet. Not all data is free, however, and not all site owners allow you to scrape data from their sites. ALWAYS check a website's terms of service BEFORE scraping it. Be responsible, and stay within legal limits at all times.
Important Disclaimer
Data Wranglers LinkedIn Group: where the discussions happen.
If you have a question – ask it. Be polite and courteous to others.
Turn your cell phones to vibrate when you come to the meeting. You know more than you think. At some point, I'd like you to share something you've learned with us so we can all benefit from it.
Group Rules
Twitter Hashtag
#dwdc
Wireless Network: Logik_guest
Password: logik1234
Connecting to the Internet
www.fminer.com
www.websundew.com
www.visualwebripper.com
screen-scraper.com
XPath
XPath Helper, by Adam Sadovsky
XPath Finder
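Once one of these helper extensions gives you an XPath expression, you can try it out directly in Python. A minimal sketch using the standard library's xml.etree.ElementTree, which supports a limited XPath subset (the document here is a made-up sample):

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample document (ElementTree requires valid XML)
doc = """
<html>
  <body>
    <table id="companies">
      <tr><td>Acme Corp</td><td>Software</td></tr>
      <tr><td>Globex</td><td>Energy</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(doc)

# ElementTree supports a subset of XPath: here, the first <td> of each row
names = [td.text for td in root.findall(".//table[@id='companies']/tr/td[1]")]
print(names)  # ['Acme Corp', 'Globex']
```

For full XPath 1.0 support against real-world (often malformed) HTML, lxml is the usual choice.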
Our method: BeautifulSoup4 + Python libraries
Scrapy: an application framework (you still have to write code), http://scrapy.org
DIY Scraper - Python
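A minimal sketch of the DIY BeautifulSoup4 approach. The HTML is an inline sample so the snippet runs without a network connection; the tag and class names are invented for illustration:

```python
from bs4 import BeautifulSoup

# Inline sample page standing in for a fetched document;
# in practice you would download the HTML with urllib or similar
html = """
<html><body>
  <h1>Meetups</h1>
  <ul class="events">
    <li><a href="/e/1">Web Scraping</a></li>
    <li><a href="/e/2">Data Wrangling</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull out every event link inside the .events list with a CSS selector
for link in soup.select("ul.events a"):
    print(link["href"], link.get_text())
```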
Bare metal: Nokogiri + Mechanize
Frameworks:
Upton: https://github.com/propublica/upton
Wombat: https://github.com/felipecsl/wombat
DIY Scraper - Ruby
Browser Extensions For Scraping
Scraper
https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd
Grabbing The Full Monty
SiteSucker: sitesucker.us
Wget: http://www.gnu.org/s/wget/
CSS sprites
Honeypots
IP blocking
CAPTCHAs
Login requirements
Ad popups
The Ways Websites Try To Block Us
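None of these blocks can be coded around directly, but basic courtesy measures, a descriptive User-Agent and randomized delays between requests, reduce the chance of tripping IP blocking in the first place. A sketch using the standard library; the header value and delay range are arbitrary examples:

```python
import random
import time
import urllib.request

# A descriptive User-Agent identifies your scraper to site owners;
# the value here is a placeholder
HEADERS = {"User-Agent": "dwdc-scraper/0.1 (contact: you@example.com)"}

def polite_request(url):
    """Build a request with our headers, pausing a random interval first."""
    time.sleep(random.uniform(1.0, 3.0))  # 1-3 second pause between hits
    return urllib.request.Request(url, headers=HEADERS)

req = polite_request("http://example.com/")
print(req.get_header("User-agent"))
```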
NetShade: http://raynersoftware.com/netshade/
WinGate: http://www.wingate.com/
Continuum.io: Anaconda http://continuum.io/downloads
BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
pip install beautifulsoup4 (or easy_install beautifulsoup4)
unicodecsv
pip install unicodecsv
Installs
1. Find the webpage(s) you want
2. Get the path to the data using XPath or CSS selectors
3. Write the code
4. Test
5. Scrape
6. Export to CSV
7. Enjoy your data!
General Steps
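The steps above can be sketched end to end in a few lines. The sample HTML, table structure, and CSS selector are assumptions standing in for whatever page you found in step 1:

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample HTML standing in for the page you want to scrape
html = """
<table class="rankings">
  <tr><td>1</td><td>Acme Corp</td></tr>
  <tr><td>2</td><td>Globex</td></tr>
</table>
"""

def scrape_to_csv(page_html, out):
    """Select the rows, pull the cell text, and write CSV."""
    soup = BeautifulSoup(page_html, "html.parser")
    writer = csv.writer(out)
    writer.writerow(["rank", "company"])
    for row in soup.select("table.rankings tr"):
        writer.writerow([td.get_text() for td in row.select("td")])

buf = io.StringIO()       # a real run would open a file instead
scrape_to_csv(html, buf)
print(buf.getvalue())
```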
1. Ensure you’ve installed the extension
2. Log in to Google Docs (this is where the data goes)
3. Open the URL: http://www.inc.com/inc5000/list
4. Highlight the first line
5. Right-click and select “Scrape Similar”
6. Verify the data in the window that pops up
7. Click the “Export to Google Docs…” button
8. Voila!
#1: Scraping the Inc. 5000 with Scraper
Only works with data in a tabular format
Only exports to Google Docs
Works on one page at a time
Suggestion: keep the scraping window open, go to the next page, and click “Scrape” again.
Notes On Scraper
BeautifulSoup
A toolkit for dissecting a document and extracting what you need
Automatically converts incoming documents to Unicode and outgoing documents to UTF-8
Sits on top of popular Python parsers like lxml and html5lib
Examples http://www.crummy.com/software/BeautifulSoup/bs4/doc/
#2: Using Python to Scrape Pages
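The Unicode conversion is automatic: hand BeautifulSoup raw bytes and you get Python Unicode strings back. A small sketch with a made-up snippet containing accented characters:

```python
from bs4 import BeautifulSoup

# Bytes as they might arrive over the wire, here UTF-8 encoded
raw = "<div><p>Caf\u00e9 Cr\u00e8me</p><p>Second</p></div>".encode("utf-8")

# BeautifulSoup accepts the raw bytes and hands back Unicode strings
soup = BeautifulSoup(raw, "html.parser")

paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)  # ['Café Crème', 'Second']
```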
1. Import your libraries
2. Take a LinkedIn URL as input
3. Build an opener
4. Create the soup using BS4
5. Extract the company description and specialties
6. Clean up the rest of the data
7. Extract the website, type, founded, industry, and company size if they exist; otherwise set them to “N/A”
8. Output to CSV
9. Sleep some random number of seconds and milliseconds
Scraping LinkedIn Company Pages - PseudoCode
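A sketch of the extract-or-“N/A” and CSV steps from the pseudocode. The class names and page structure here are entirely hypothetical (LinkedIn's real markup differs and changes), and the sample HTML stands in for a fetched page so the snippet runs offline:

```python
import csv
import io

from bs4 import BeautifulSoup

FIELDS = ["website", "type", "founded", "industry", "company size"]

def extract_company(page_html):
    """Soup the page and pull each field, defaulting to N/A (steps 4-7)."""
    soup = BeautifulSoup(page_html, "html.parser")
    record = {}
    for field in FIELDS:
        # Hypothetical markup: each fact in a <li> whose class is the field name
        node = soup.find("li", class_=field.replace(" ", "-"))
        record[field] = node.get_text(strip=True) if node else "N/A"
    return record

# Made-up sample page; a real run would fetch the LinkedIn URL instead
sample = """
<ul>
  <li class="website">http://example.com</li>
  <li class="industry">Software</li>
</ul>
"""

record = extract_company(sample)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(record)   # step 8: output to CSV
print(out.getvalue())
```

Fields missing from the page come out as “N/A”, matching step 7 of the pseudocode.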
https://github.com/rdempsey/dwdc
Get The Code
Contacting Rob
Email: [email protected]
Twitter: rdempsey
LinkedIn: robertwdempsey