

Network Programming in Python

Charles Severance
www.dr-chuck.com

Unless otherwise noted, the content of this course material is licensed under a Creative Commons Attribution 3.0 License. http://creativecommons.org/licenses/by/3.0/.

Copyright 2009, Charles Severance

What is Web Scraping?

• Web scraping is when a program or script pretends to be a browser, retrieves web pages, extracts information from them, and then moves on to retrieve more web pages.

• Search engines scrape web pages - we call this “spidering the web” or “web crawling”

http://en.wikipedia.org/wiki/Web_scraping
http://en.wikipedia.org/wiki/Web_crawler

[Diagram: the scraping program plays the role of a browser, sending GET requests to the server and receiving HTML back, one page after another.]
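To make the GET/HTML exchange in the diagram concrete, here is a minimal sketch that issues a single GET request using Python 2's httplib module; the slides themselves use urllib (shown later), so this is only an illustration of what happens underneath, and www.dr-chuck.com is just an example host.

import httplib

# Connect to the web server and ask for its home page
conn = httplib.HTTPConnection("www.dr-chuck.com")
conn.request("GET", "/")
response = conn.getresponse()
html = response.read()       # the HTML the server sends back
print response.status, len(html), "characters of HTML"
conn.close()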


Why Scrape?

• Pull data - particularly social data - who links to whom?

• Get your own data back out of some system that has no “export capability”

• Monitor a site for new information

• Spider the web to make a database for a search engine

Scraping Web Pages

• There is some controversy about web page scraping and some sites are a bit snippy about it.

• Google: facebook scraping block

• Republishing copyrighted information is not allowed

• Violating terms of service is not allowed

http://www.facebook.com/terms.php

HTML and HTTP in Python
Using urllib


Using urllib to retrieve web pages

http://docs.python.org/lib/module-urllib.html
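As a minimal sketch of the call pattern this slide refers to (Python 2, matching the rest of the deck, with www.dr-chuck.com as an example URL), retrieving a page with urllib looks like this:

import urllib

f = urllib.urlopen("http://www.dr-chuck.com/")   # open the URL like a file
contents = f.read()                              # read the whole page as one string
f.close()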

• You get the entire web page when you do f.read() - lines are separated by a “newline” character “\n”


• We can split the contents into lines using the split() function


Page 4: Network Programming in Python - University of Michigan...Why Scrape? •Pull data - particularly social data - who links to who? •Get your own data back out of some system that has

• Splitting the contents on the newline character gives us a nice list where each entry is a single line

• We can easily write a for loop to look through the lines

>>> print len(contents)
95328
>>> lines = contents.split("\n")
>>> print len(lines)
2244
>>> print lines[3]
<style type="text/css">
>>>

for line in lines:
    # Do something for each line

A Simple Web “Browser”: returl.py

import urllib

url = raw_input("Enter a URL: ")
print "Retrieving:", url
f = urllib.urlopen(url)

contents = f.read()
print "Retrieved", len(contents), "characters"
f.close()

lines = contents.split('\n')
print "Retrieved", len(lines), "lines"

python returl.py
Enter a URL: http://www.dr-chuck.com/
Retrieving: http://www.dr-chuck.com/
Retrieved 104867 characters
Retrieved 2401 lines

python returl.py
Enter a URL: http://www.appenginelearn.com/
Retrieving: http://www.appenginelearn.com/
Retrieved 6769 characters
Retrieved 175 lines


python returl.py
Enter a URL: www.dr-chuck.com
Retrieving: www.dr-chuck.com
Traceback (most recent call last):
  File "returl.py", line 6, in <module>
    f = urllib.urlopen(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 87, in urlopen
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 203, in open
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 461, in open_file
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 475, in open_local_file
IOError: [Errno 2] No such file or directory: 'www.dr-chuck.com'
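The traceback happens because a URL without a scheme such as http:// makes urllib.urlopen treat the string as a local file name. One way to guard against that, sketched here as an assumption rather than something the slides show, is to prepend the scheme when it is missing:

url = raw_input("Enter a URL: ")
# If the user typed "www.dr-chuck.com", add the missing scheme
if not url.startswith("http://") and not url.startswith("https://"):
    url = "http://" + url
print "Retrieving:", url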

Parsing HTML

Counting Anchor Tags

• We want to look through a web page and see how many lines have the string “<a”

• We don’t care if there is more than one per line - we are looking for a rough number
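That rough, line-based count is what the script below computes. If an exact count of anchor tags mattered, one small variation (not in the original slides) would be to use the string count() method on each line instead of find():

count = 0
for line in lines:
    count = count + line.count("<a")   # count every occurrence, not just one per line
print count, "occurrences of <a"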


<body>
 <div id="header">
  <h1><a href="index.htm">AppEngineLearn</a></h1>
  <ul>
   <li><a href="/ezlaunch.htm">EZ-Launch</a></li>
   <li><a href=http://oreilly.com/catalog/9780596800697/ target=_new> Book</a></li>
   <li><a href=http://www.dr-chuck.com/ target=_new> Author</a></li>
   <li><a href=http://www.pythonlearn.com/>Python</a></li>
   <li><a href=http://code.google.com/appengine/ target=_new> App Engine</a></li>
  </ul>
 </div>

import urllib

url = raw_input("Enter a URL: ")
print "Retrieving:", url
f = urllib.urlopen(url)
contents = f.read()
print "Retrieved", len(contents), "characters"
f.close()
lines = contents.split('\n')
print "Retrieved", len(lines), "lines"

count = 0
for line in lines:
    if line.find("<a") >= 0 :
        count = count + 1

print count, "lines with <a tag"




python hrefs.py
Enter a URL: http://www.dr-chuck.com/
Retrieving: http://www.dr-chuck.com/
Retrieved 104739 characters
Retrieved 2405 lines
63 lines with <a tag

python hrefs.py
Enter a URL: http://www.appenginelearn.com/
Retrieving: http://www.appenginelearn.com/
Retrieved 6769 characters
Retrieved 175 lines
33 lines with <a tag

Summary

• Python can easily retrieve data from the web and use its powerful string parsing capabilities to sift through that data and make “sense” of the information

• We can build a simple directed web-spider for our own purposes (a minimal sketch follows this summary)

• Make sure that we do not violate the terms and conditions of a web site and that we do not use copyrighted material improperly
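As a hedged illustration of the “simple directed web-spider” idea above, the sketch below combines the pieces from this deck (urllib plus string handling) with a regular expression to pull out href values. The regular expression, the page limit, and the starting URL are assumptions made here for illustration; they are not part of the original slides, and real HTML would normally call for a proper parser.

import urllib
import re
from urlparse import urljoin

# Hypothetical starting point - any page would do
todo = ["http://www.appenginelearn.com/"]
seen = {}

# Visit at most a handful of pages so the spider stays "directed"
while todo and len(seen) < 5:
    url = todo.pop(0)
    if url in seen:
        continue
    seen[url] = True

    print "Retrieving:", url
    f = urllib.urlopen(url)
    contents = f.read()
    f.close()

    # Very rough link extraction - good enough for a sketch, not for real HTML parsing
    for href in re.findall('href="?([^"> ]+)', contents):
        todo.append(urljoin(url, href))

print "Visited", len(seen), "pages"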