python and web data extraction: introduction2016/07/02  · – web scraping with python: collecting...

35
Python and Web Data Extraction: Introduction Alvin Zuyin Zheng [email protected] http://community.mis.temple.edu/zuyinzheng/

Upload: others

Post on 16-Jul-2020

29 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

PythonandWebDataExtraction:Introduction

Alvin Zuyin [email protected]

http://community.mis.temple.edu/zuyinzheng/

Page 2: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Outline

• Overview• StepsinWebScraping– FetchingaWebpage– Downloadthewebpage– Extractinginformation fromthewebpage– Storinginformationinafile

• Tutorial2:ExtractingTextualDatafrom10-K

Page 3: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Webscrapingtypicallyconsistof

Step1.Fetchingawebpage

Step2.Downloadingthe

webpage(Optional)

Step3.Extractinginformationfromthewebpage

Step4.Storinginformationinafile

Page 4: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Example:10-K

URL:https://www.sec.gov/Archives/edgar/data/1288776/000165204416000012/goog10-k2015.htm

Page 5: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Example:TablewithLinks

URL:https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=GOOG&type=10-K&dateb=&owner=exclude&count=100

https://www.sec.gov/edgar/searchedgar/companysearch.html

Page 6: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Outline

• Overview• StepsinWebScraping– FetchingaWebpage– Downloadingthewebpage– Extractinginformation fromthewebpage– Storinginformationinafile

• Tutorial2:ExtractingTextualDatafrom10-K

Page 7: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

FetchingaWebpage• Usetheurllib2 packagetoopenawebpage– Donotneedtoinstallmanually

>>> import urllib2

>>> urlLink = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=GOOG&type=10-K&dateb=&owner=exclude&count=100"

>>> pageRequest = urllib2.Request(urlLink)

>>> pageOpen = urllib2.urlopen(pageRequest)

>>> pageRead = pageOpen.read()

>>>

Page 8: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Outline

• Overview• StepsinWebScraping– FetchingaWebpage– Downloadingthewebpage– Extractinginformation fromthewebpage– Storinginformationinafile

• Tutorial2:ExtractingTextualDatafrom10-K

Page 9: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

DownloadingaWebpage

• Weoftenwanttodownload thewebpagesbecause– Wewanttolimitwebrequests– Websitesmaychangeovertime– Wewanttoreplicateresearch

>>> os.chdir('/Users/alvinzuyinzheng/Dropbox/PythonWorkshop/scripts/') #Change your working directory

>>> htmlname = "goog10-k2015.htm"

>>> htmlfile = open(htmlname, "wb")

>>> htmlfile.write(pageRead)

>>> htmlfile.close()

Page 10: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Outline

• Overview• StepsinWebScraping– FetchingaWebpage– Downloadthewebpage– Extractinginformation fromthewebpage– Storinginformationinafile

• Tutorial2:ExtractingTextualDatafrom10-K

Page 11: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

WhatisHTML

• Whenperformingwebdataextraction,wedealwithHTMLfiles– HyperTextMarkupLanguage

• HTMLspecifies asetof tags thatidentifystructureandcontenttypeofwebpages– Tags:surroundedbyanglebrackets<>– mosttagscomeinpairs,markingabeginningandending

• E.g., <title> and</title> enclosethetitleofapage

Page 12: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

HTMLLayout

<html>

<head><title>HTML examples</title></head>

<body><h1>HTML demonstration</h1><p>This is a sample HTML page.</p></body>

</html>

html_example.html Openinabrowser:

MoreonHTMLtags:checkHTMLtutorialfromW3schools.

Page 13: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

ViewHTMLSourceCode

• ToinspecttheHTMLpageindetails,youcandooneofthefollowing:

– InFirefox/Chrome:Rightclick>ViewPageSource

– OpentheHTMLfileinatexteditor(eg,Notepad++)

Page 14: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Example:10-K

Viewthewebpageinbrowser

Page 15: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Example:10-KHTMLsourcecode:

Page 16: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

InspectElements

• ToinspectaspecificelementontheHTMLpage,youcandooneofthefollowing:

• InChrome:Rightclickontheelement> Inspect

• InFirefox:Rightclickontheelement> InspectElement

Page 17: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Example:TablewithLinks

Page 18: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Example:TablewithLinksHTMLsourcecode:

Page 19: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

WaystoExtractDatafromHTML

• Thebs4(BeautifulSoup)Package– UsedforpullingdataoutofHTMLandXMLfiles

• There(regularexpression)Package– CanbeusedforbothHTMLandplaintextfiles

Page 20: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Thebs4(BeautifulSoup)Package

• Installingthepackageinyourcommandlineinterface:pip install beautifulsoup4

• ImportthepackageinPythonfrom bs4 import BeautifulSoup

VisitheretolearnmoreaboutBeautifulSoup:https://www.crummy.com/software/BeautifulSoup/

Page 21: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Example:ExtractingLinksfromaTable

Page 22: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

FetchingtheWebpagewithurllib2

>>> import urllib2

>>> urlLink = “https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=GOOG&type=10-K&dateb=&owner=exclude&count=100"

>>> pageRequest = urllib2.Request(urlLink)

>>> pageOpen = urllib2.urlopen(pageRequest)

>>> pageRead = pageOpen.read()

>>>

InPython:

https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=GOOG&type=10-K&dateb=&owner=exclude&count=100

Page 23: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

ExtractingtheLinkswithBeautifulSoup

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup(pageRead,"html.parser")

>>> table = soup.find("table",{"class":"tableFile2"})

>>> links = []

>>> for row in table.findAll("tr"):

... cells = row.findAll("td")

... if len(cells)==5:

... link=cells[1].find("a",{"id":"documentsbutton"})

... docLink="https://www.sec.gov"+link['href']

... links.append(docLink)

InPython:

Page 24: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

ExtractingtheLinkswithBeautifulSoup

https://www.sec.gov/Archives/edgar/data/1288776/000119312516520367/0001193125-16-520367-index.htm

https://www.sec.gov/Archives/edgar/data/1288776/000165204416000012/0001652044-16-000012-index.htm

https://www.sec.gov/Archives/edgar/data/1288776/000128877615000008/0001288776-15-000008-index.htm

https://www.sec.gov/Archives/edgar/data/1288776/000128877614000020/0001288776-14-000020-index.htm

……

Whatwewillget:

Page 25: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

• WetalkedaboutRegularExpressions– Powerfultextmanipulationtoolforsearching,replacing,andparsingtextpatterns

• InPython,youneedtoloadthe“re”package

ExtractingTextualDataUsingre

>>> import re

Page 26: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Example:Item1of10-K

<div style="text-align:left;font-size:10pt;"><font style="font-family:Arial;font-size:10pt;font-weight:bold;">ITEM&nbsp;1.</font>

</div>

Inspectelement “ITEM1.”

HTMLsourcecode:

Havingthesubtitle “ITEM1.”inboldmakessurethatitis inthesubtitle,notinmaintext

Page 27: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

ExtractingTextualDataUsingre# assume we have pre-processed the webpage

# and the page content is stored in a variable “page”

>>> import re

>>> regex="bold;\">\s*Item 1\.(.+?)bold;\">\s*Item 1A\."

>>> match = re.search(regex, page, flags=re.IGNORECASE)

#returns everything between “Item 1.” and “Item 1A.”

>>> match.group(1)

Page 28: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

AnatomyoftheREPattern• Weusedthefollowingpattern:

'bold;\">\s*Item 1\.(.+?)bold;\">\s*Item 1A\.'

“Item1.” “Item1A.”

(.+?)representseverything elseinbetween thatwillbeextracted

• Whatdoeseachelementmean?RegularExpression CorrespondingText

\" "

\s Whitespace(suchasspace,tab,newline)

* Repeatstheprecedingcharacterzeroormoretimes

\. .(dot)

Page 29: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

MoreBasicPatternsSymbols Meaning^ Matches the beginning of a line$ Matches the end of the line. (dot) Matches any character but a whitespace\s Matches a single whitespace* Repeats a character zero or more times+ Repeats a character one or more times? Repeats a character zero or 1 times[xyz] Matches any of x, y, z\w Matches a letter or digit or underbar\d Matches a digit [0-9]() Indicates where string extraction is

to start and end

Page 30: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Outline

• Overview• StepsinWebScraping– FetchingaWebpage– Downloadthewebpage– Extractinginformation fromthewebpage– Storinginformationinafile

• Tutorial2:ExtractingTextualDatafrom10-K

Page 31: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Storinginformationtoacsvfile# Previously we have a list of links extracted

# and stored in a variable “links”

>>> import csv

>>> csvOutput = open("IndexLinks.csv", "wb")

>>> csvWriter = csv.writer(csvOutput, quoting = csv.QUOTE_NONNUMERIC)

>>> for link in links:

... csvWriter.writerow([link])

>>> csvOutput.close()

Page 32: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Outline

• Overview• StepsinWebScraping– FetchingaWebpage– Downloadthewebpage– Extractinginformation fromthewebpage– Storinginformationinafile

• Tutorial2:ExtractingTextualDatafrom10-K

Page 33: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Tutorial2:ExtractingTextualDatafrom10-K

• InstalltheBeautifulSouppackagepip install beautifulsoup4

• Downloadthefollowingfilesfromourwebsite,andputthemintothesamefolder• 1GetIndexLinks.py• 2Get10kLinks.py• 3DownLoadHTML.py• 4ReadHTML.py• CompanyList.csv

Page 34: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

Tutorial2:ExtractingTextualDatafrom10-K

• ChangingWorkingDirectory– Foreachofthefourscripts,changetheworkingdirectorytowhereyouputthecompanylist(CompanyList.csv)bychangingthefollowingline:

os.chdir(‘/Users/alvinzuyinzheng/Dropbox/PythonWorkshop/scripts/')

• RuneachPythonScriptone-by-one

Page 35: Python and Web Data Extraction: Introduction2016/07/02  · – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell) – Mining the Social Web: Data Mining

OtherResources

• Books:– WebScrapingwithPython:CollectingDatafromtheModernWeb(byRyanMitchell)

– MiningtheSocialWeb:DataMiningFacebook,Twitter,LinkedIn,Google+,GitHub,andMore(byMatthewA.Russell)

• BeautifulSoupDocumentation:– https://www.crummy.com/software/BeautifulSoup/bs4/doc/