Network Programming in Python
Charles Severance
www.dr-chuck.com
Unless otherwise noted, the content of this course material is licensed under a Creative Commons Attribution 3.0 License.
http://creativecommons.org/licenses/by/3.0/
Copyright 2009, Charles Severance
What is Web Scraping?
• When a program or script pretends to be a browser and retrieves web pages, looks at those web pages, extracts information and then looks at more web pages.
• Search engines scrape web pages - we call this “spidering the web” or “web crawling”
http://en.wikipedia.org/wiki/Web_scraping
http://en.wikipedia.org/wiki/Web_crawler
[Diagram: the scraping program sends GET requests to the server, and the server responds with HTML, repeatedly]
Why Scrape?
• Pull data - particularly social data - who links to whom?
• Get your own data back out of some system that has no “export capability”
• Monitor a site for new information
• Spider the web to make a database for a search engine
Scraping Web Pages
• There is some controversy about web page scraping and some sites are a bit snippy about it.
• Google: facebook scraping block
• Republishing copyrighted information is not allowed
• Violating terms of service is not allowed
http://www.facebook.com/terms.php
HTML and HTTP in Python
Using urllib
Using urllib to retrieve web pages
http://docs.python.org/lib/module-urllib.html
• You get the entire web page when you do f.read() - lines are separated by a “newline” character “\n”
• We can split the contents into lines using the split() function
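The read-then-split pattern can be sketched on a small sample string standing in for the page contents (shown in modern Python 3 syntax; the slides use Python 2 print statements):

```python
# A small HTML sample standing in for what f.read() would return
contents = "<html>\n<head>\n<title>Test</title>\n</head>\n</html>"

# split("\n") turns the single string into a list, one entry per line
lines = contents.split("\n")
print(len(lines))   # 5
print(lines[2])     # <title>Test</title>
```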
• Splitting the contents on the newline character gives us a nice list where each entry is a single line
• We can easily write a for loop to look through the lines
>>> print len(contents)
95328
>>> lines = contents.split("\n")
>>> print len(lines)
2244
>>> print lines[3]
<style type="text/css">
>>>
for line in lines:
    # Do something for each line
A Simple Web “Browser”
returl.py
import urllib
url = raw_input("Enter a URL: ")
print "Retrieving:", url
f = urllib.urlopen(url)
contents = f.read()
print "Retrieved", len(contents), "characters"
f.close()
lines = contents.split('\n')
print "Retrieved", len(lines), "lines"
python returl.py
Enter a URL: http://www.dr-chuck.com/
Retrieving: http://www.dr-chuck.com/
Retrieved 104867 characters
Retrieved 2401 lines
python returl.py
Enter a URL: http://www.appenginelearn.com/
Retrieving: http://www.appenginelearn.com/
Retrieved 6769 characters
Retrieved 175 lines
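For reference, a minimal sketch of the same program in modern Python 3, where `urllib.urlopen` has moved to `urllib.request.urlopen` and the response is bytes that must be decoded (the `data:` URL in the example is illustrative, used here so the sketch runs without network access):

```python
import urllib.request

def retrieve(url):
    """Fetch a URL and report its size, like returl.py does."""
    # In Python 3, urlopen lives in urllib.request and returns bytes
    with urllib.request.urlopen(url) as f:
        contents = f.read().decode("utf-8", errors="replace")
    lines = contents.split("\n")
    print("Retrieved", len(contents), "characters")
    print("Retrieved", len(lines), "lines")
    return contents

# data: URLs let us demonstrate the flow without touching the network
page = retrieve("data:text/html,<html>hi</html>")
```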
python returl.py
Enter a URL: www.dr-chuck.com
Retrieving: www.dr-chuck.com
Traceback (most recent call last):
  File "returl.py", line 6, in <module>
    f = urllib.urlopen(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 87, in urlopen
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 203, in open
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 461, in open_file
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 475, in open_local_file
IOError: [Errno 2] No such file or directory: 'www.dr-chuck.com'
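The traceback above happens because a URL without a scheme ("www.dr-chuck.com") is treated as a local file name. One defensive sketch, not part of the original slides, is to prepend "http://" when no scheme is present (Python 3's `urllib.parse` shown):

```python
from urllib.parse import urlparse

def normalize_url(url):
    # If the user typed "www.example.com", assume they meant http://
    if not urlparse(url).scheme:
        url = "http://" + url
    return url

print(normalize_url("www.dr-chuck.com"))        # http://www.dr-chuck.com
print(normalize_url("http://www.dr-chuck.com")) # unchanged
```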
Parsing HTML
Counting Anchor Tags
• We want to look through a web page and see how many lines have the string “<a”
• We don’t care if there is more than one per line - we are looking for a rough number
<body>
  <div id="header">
    <h1><a href="index.htm">AppEngineLearn</a></h1>
    <ul>
      <li><a href="/ezlaunch.htm">EZ-Launch</a></li>
      <li><a href=http://oreilly.com/catalog/9780596800697/ target=_new>Book</a></li>
      <li><a href=http://www.dr-chuck.com/ target=_new>Author</a></li>
      <li><a href=http://www.pythonlearn.com/>Python</a></li>
      <li><a href=http://code.google.com/appengine/ target=_new>App Engine</a></li>
    </ul>
  </div>
import urllib
url = raw_input("Enter a URL: ")
print "Retrieving:", url
f = urllib.urlopen(url)
contents = f.read()
print "Retrieved", len(contents), "characters"
f.close()
lines = contents.split('\n')
print "Retrieved", len(lines), "lines"

count = 0
for line in lines:
    if line.find("<a") >= 0 :
        count = count + 1

print count, "lines with <a tag"
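The counting loop can be packaged as a function and checked against a fixed snippet (a Python 3 sketch; the slides' Python 2 version works the same way):

```python
def count_anchor_lines(contents):
    # Count lines containing "<a" at least once; a line with several
    # anchors still counts once, matching the rough count the slides want
    count = 0
    for line in contents.split("\n"):
        if line.find("<a") >= 0:
            count = count + 1
    return count

sample = ('<h1><a href="/">Home</a></h1>\n'
          '<p>no links here</p>\n'
          '<a href="/x">x</a> <a href="/y">y</a>')
print(count_anchor_lines(sample))   # 2 (third line has two anchors, counts once)
```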
python hrefs.py
Enter a URL: http://www.dr-chuck.com/
Retrieving: http://www.dr-chuck.com/
Retrieved 104739 characters
Retrieved 2405 lines
63 lines with <a tag
python hrefs.py
Enter a URL: http://www.appenginelearn.com/
Retrieving: http://www.appenginelearn.com/
Retrieved 6769 characters
Retrieved 175 lines
33 lines with <a tag
Summary
• Python can easily retrieve data from the web and use its powerful string parsing capabilities to sift through the information and make “sense” of it
• We can build a simple directed web-spider for our own purposes
• Make sure we do not violate a website's terms and conditions and do not use copyrighted material improperly
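A first step toward that directed spider is pulling out the actual href values rather than just counting anchor lines. A minimal sketch using Python 3's standard-library `html.parser` (a module the slides don't cover):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags as the parser walks the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<h1><a href="index.htm">Home</a></h1><a href="/ezlaunch.htm">EZ</a>')
print(collector.links)   # ['index.htm', '/ezlaunch.htm']
```

A real spider would then resolve each collected link against the page's URL and feed the new pages back through the same loop.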