![Page 1: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/1.jpg)
Data Acquisition: Companies & Wharton Data
Basic web scraping
Using APIs
Session 3
Wharton Summer Tech Camp
![Page 2: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/2.jpg)
Set up problems
• Mac
  – Mostly no problems, thanks to the Linux-like environment and great support
• Windows on MobaXterm
  – You can use apt-cyg to install everything
  – apt-cyg install python
  – apt-cyg install idle
  – apt-cyg install idlex
![Page 3: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/3.jpg)
REGEX CHALLENGE!
• 3 regex challenges
• 1 is from a well-known t-shirt joke (if you know this, don't say anything)
• 2 are song lyrics (I tried to find well-known songs)
• Raise your hand to say the answer
![Page 4: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/4.jpg)
a t-shirt people wear
r"(bb|[^b]{2})"
Difficulty: *
Hint: Phrase
![Page 5: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/5.jpg)
a t-shirt people wear
r"(bb|[^b]{2})"
"To be or not to be"
Difficulty: *
Hint: Phrase
![Page 6: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/6.jpg)
Challenge 2
Difficulty: *****
Hint: This is literally the entire lyric for the song
r"(\w+ [a-z]{3} w..ld ){144}"
![Page 7: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/7.jpg)
Challenge 2
Difficulty: ****
Hint: This is literally the entire lyric for the song
Hint 2: It's a song by the music duo who created the latest Record of the Year
r"(ar\w{4} [a-z]{3} w..ld ){144}"
![Page 8: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/8.jpg)
Challenge 2
Difficulty: *****
Hint: This is literally the entire lyric for the song
Hint 2: It's a song by the music duo who created the latest Record of the Year
r"(\w+ [a-z]{3} w..ld ){144}"
Around the World – by Daft Punk
![Page 9: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/9.jpg)
Challenge 3
Difficulty: **
Hint: Lyric of an old song
r"ah, ((ba ){4} (bar){2}a an{2} \s)+"
![Page 10: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/10.jpg)
Challenge 3
Difficulty: **
r"ah, ((ba ){4} (bar){2}a an{2} \s)+"
Ah, Ba ba ba ba Barbara Ann~ Ah, Ba ba ba ba Barbara Ann~
![Page 11: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/11.jpg)
Song Phrases
Ever since I learned regex, I have been thinking that many Daft Punk songs are optimized for regex.
Lyrics for a song in its entirety with this one simple regex:
• r"(Around the world ){144}" – Around the World
• r"((buy|use|break|fix|trash|change) it )+ now upgrade it" – Technologic
• r"(((work|make|do|makes|more) (it|us|than) (harder|better|faster|stronger|ever))+ hour after our work is never over. \s)+" – Harder, Better, Faster, Stronger
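The "entire song in one regex" claim can be checked in Python with the standard re module. This is a minimal sketch: the lyric string is built here for illustration (the phrase repeated with a trailing space), not taken from an official transcription, and is_entire_lyric is a hypothetical helper name.

```python
import re

# Pattern from the slide: the phrase repeated exactly 144 times.
AROUND = re.compile(r"(Around the world ){144}")

def is_entire_lyric(pattern, text):
    """Return True only if the pattern matches the whole text, not just part of it."""
    return pattern.fullmatch(text) is not None

# Hypothetical lyric text: the phrase (with trailing space) 144 times.
lyrics = "Around the world " * 144

print(is_entire_lyric(AROUND, lyrics))                     # True
print(is_entire_lyric(AROUND, lyrics + "again"))           # False
print(is_entire_lyric(AROUND, "Around the world " * 143))  # False
```

fullmatch is used instead of search so that extra or missing repetitions are rejected rather than silently accepted.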
![Page 12: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/12.jpg)
THE BIGGEST concern for doctoral students doing empirical work (year 2-4):
"WHERE AND HOW DO I GET THE DATA?!"
Mr. Data: “I believe what you are experiencing is frustration”
![Page 13: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/13.jpg)
Data sources
1. Companies
2. Wharton Organizations
3. Scraping Web
4. APIs: application programming interface
![Page 14: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/14.jpg)
DATA SOURCES
1. Companies – HARD, UNIQUE
  – Hardest, but once you get a good company, you are set for a paper or two or more…
2. Wharton Organizations
  – (WRDS) (EASY, COMMON - great for auxiliary data) Other people can also easily access this data, so it has probably been used already
  – (WCAI) (EASY, UNIQUE) The data is actually pretty great, and only a few select teams get it after a proposal review process
3. Scraping Web (WGET/REGEX/tools) – MEDIUM, MEDIUM
  – Relatively easy but painful for big projects, and sometimes not allowed by the website
4. APIs: application programming interface – EASY, COMMON
  – Easy but restricted to what the company made available
![Page 15: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/15.jpg)
Resources for Public Data
• There are many lists of lists for public data
• Find a link to a list of lists for data on the course website under "resources for learning"
• If you have a good source, please email me so I can link it on the web
![Page 16: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/16.jpg)
Companies
![Page 17: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/17.jpg)
Quick tips
• Don't be afraid to contact random companies
• Attend conferences and network like an MBA - think of it like a game
• Send a short 2-3 page proposal suggesting a research collaboration
• Read about the company you are contacting and make sure to offer something that interests the company
• Low success probability – among the many proposals I've sent (about 30+ if you count emails):
  – Mostly no response
  – 1 company I had been working with for 10 months just decided to drop the ball because the CTO changed twice
  – 4 gave very easy data – not useful or suitable for research
  – 2 gave very useful data that I am currently working with
  – 1 company is disputing the NDA
• NDAs: you can request help from the UPenn legal team here
  – https://medley05.isc-seo.upenn.edu/researchInventory/jsp/fast2.do?bhcp=1
![Page 18: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/18.jpg)
NDAs are super important
• A horror story I heard
  – A student worked with a company for 1+ year, and then the company decided that the result was too good to publish. It wanted the result kept as a trade secret/IP.
  – The NDA that was signed was bad.
  – No publication.
  – Most NDAs are OK but some are not. If yours is bad, get help from that link and negotiate.
  – Look out for "work for hire" type NDAs
![Page 19: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/19.jpg)
Wharton Specific
![Page 20: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/20.jpg)
Wharton Specific
You probably heard about these organizations at the Wharton doctoral orientation.
• WRDS: Wharton Research Data Services
  – https://wrds-web.wharton.upenn.edu/wrds/
• WCAI: Wharton Customer Analytics Initiative
  – http://www.wharton.upenn.edu/wcai/
• Other organizations exist, but mostly for conferences and not for data.
  – http://www.wharton.upenn.edu/faculty/research-centers-and-initiativ.cfm
![Page 21: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/21.jpg)
Basic Web Scraping
![Page 22: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/22.jpg)
Caveats
• I spent time writing and testing scraping code for this course: you input a list of music artists in CSV format, and the script queries allmusic.com to obtain information such as the genres associated with the artists.
• Written in March 2013.
• In July, it broke because allmusic.com updated their website…
• This is one problem with scraping. You never know when it will stop working, and then you have to rewrite it.
![Page 23: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/23.jpg)
Outline of basic scraping
1. CRAWLING: Instead of using a web browser, use scripts to access HTML (XML, etc.), or crawl through a website recursively and download all the HTML or text files or whatever you need. (WGET, Python, or any language such as PHP)
2. PATTERN SEARCHING: The researcher looks at the raw HTML output, finds where the required data is, and figures out what the pattern is. (Web Developer's Toolbox in Firefox)
3. EXTRACTION: Use a text-extraction tool to extract the information and store it! (If it's in a structured format such as XML, use the appropriate tools for that format.) (REGEX, Apache Lucene, SED, AWK, etc.)
4. Go publish papers with the data
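The first three steps can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not the course script: the genre markup, GENRE_PATTERN, and the sample page are hypothetical, and crawl is shown but not called so the sketch runs offline.

```python
import re
import urllib.request

def crawl(url):
    """Step 1 (crawling): download the raw HTML of one page as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

# Step 2 (pattern searching) is done by eye in the browser. Suppose inspection
# showed each genre inside <a class="genre">...</a> (hypothetical markup).
GENRE_PATTERN = re.compile(r'<a class="genre">([^<]+)</a>')

def extract_genres(html):
    """Step 3 (extraction): pull every genre name out of the page text."""
    return [name.strip() for name in GENRE_PATTERN.findall(html)]

# Hypothetical page content standing in for crawl(some_url).
sample_html = (
    '<ul><li><a class="genre">Electronic</a></li>'
    '<li><a class="genre"> House </a></li></ul>'
)
print(extract_genres(sample_html))  # ['Electronic', 'House']
```

If the site changes its markup, only GENRE_PATTERN needs to be rewritten, which is exactly the maintenance cost mentioned in the caveats.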
![Page 24: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/24.jpg)
Alternatives
• Want something easier or with a GUI?
  – MOZENDA: Wharton has a license and it's cheap
• More advanced scraping
  – We will cover this next week with Scrapy
• There are many other tools and packages for this.
  – http://en.wikipedia.org/wiki/Web_crawler
  – http://stackoverflow.com/questions/419235/anyone-know-of-a-good-python-based-web-crawler-that-i-could-use
![Page 25: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/25.jpg)
Tools used in our examples
• WGET + Python
• REGEX
• HTML/DOM inspector
  – Firefox has the Web Developer's Toolbox, an add-on you can download.
  – This is useful for finding the pattern of the data you want to extract
![Page 26: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/26.jpg)
Scraping Example 1
• Facebook SEC filing exploration
  – Purpose: Exploration before research
  – What this toy example is doing: Get the SEC filings for Facebook and extract certain parts
  – I am interested in reading a few words before and after each mention of "shares"
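The "few words before and after" idea can be sketched as follows. This is not the course's extractPhrase.py; words_around, its window parameter, and the sample sentence are hypothetical names and data chosen for illustration.

```python
def words_around(text, keyword="shares", window=3):
    """Return each occurrence of keyword with up to `window` words on each side."""
    words = text.split()
    snippets = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation so "shares," still counts as a match.
        if word.strip(".,;:()\"'").lower() == keyword:
            start = max(0, i - window)
            snippets.append(" ".join(words[start:i + window + 1]))
    return snippets

# Hypothetical sentence standing in for the text of a downloaded filing.
sample = "The company issued 100 shares of Class A common stock to employees."
print(words_around(sample))  # ['company issued 100 shares of Class A']
```

Splitting on whitespace keeps the sketch short; a regex-based version would handle line breaks and HTML remnants in real filings more carefully.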
![Page 27: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/27.jpg)
DOWNLOAD HTMLS/TXT/JPG/ETC
• WGET
"GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc."
Fire up edgarFBarchive.sh and extractPhrase.py
![Page 28: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/28.jpg)
WGET FB’s SEC filings
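wget -r -l1 -H -t1 -nd -N -np -A.txt -e robots=off http://www.sec.gov/Archives/edgar/data/1326801/

-r -l1 -H -np   These options tell wget to download recursively: -r recurses, -l1 limits the depth to one level, -H allows other hosts, and -np never ascends to the parent directory.
-t1             Try each download only once.
-nd             No directories: keep the downloaded files in one folder.
-N              Only download files newer than the local copies.
-A.txt          Only download .txt files.
-e robots=off   Ignore robots.txt (avoid this option if wget works without it; if you do use it, make sure to also use the --wait option, or your IP may get banned).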
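Since the course pairs WGET with Python, the same command can be assembled as an argument list and handed to subprocess. This is a sketch: build_wget_command and the 2-second delay are assumptions for illustration, and it deliberately leaves out -e robots=off while adding --wait to space out requests.

```python
def build_wget_command(url, suffix=".txt", wait_seconds=2):
    """Assemble the wget argument list from the slide, plus a polite delay."""
    return [
        "wget",
        "-r", "-l1",                    # recurse, but only one level deep
        "-H",                           # allow following links to other hosts
        "-t1",                          # try each file once
        "-nd",                          # no directories: save everything in one folder
        "-N",                           # skip files that are not newer than the local copy
        "-np",                          # never ascend to the parent directory
        "-A" + suffix,                  # accept only files with this suffix
        "--wait=" + str(wait_seconds),  # pause between requests
        url,
    ]

command = build_wget_command("http://www.sec.gov/Archives/edgar/data/1326801/")
print(" ".join(command))
```

To actually run it, pass the list to subprocess.run(command, check=True); nothing is downloaded by the sketch itself.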
![Page 29: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/29.jpg)
Caveats
• WGET only works well for certain websites. You can use it to download all photos, etc. But if your script makes too many requests, the site may ban your IP. You can specify delays between requests.
• Once a website gets fancy, you have to use other tools such as PHP or Python packages
  – ASP
  – POST (as opposed to the GET method in HTTP)
  – JavaScript-generated sites
  – AJAX sites
• This is a toy example for learning. You can still use this method for simple scraping, but consider learning pro tools (we'll cover the basics of one such tool next week)
![Page 30: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/30.jpg)
Scraping Example 2
• Jambase.com concert venues
  – This example takes a list of artists and queries jambase.com to get concert venue information.
  – Another toy example
![Page 31: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/31.jpg)
![Page 32: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/32.jpg)
Fire up getConcertVenue.py
![Page 33: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/33.jpg)
API (Application Programming Interface)
![Page 34: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/34.jpg)
Programmable Web
• programmableweb.com
  – Search engine for freely available APIs online
  – http://blog.programmableweb.com/2012/02/15/40-real-estate-apis-zillow-trulia-walk-score/
  – Usage examples
• Usually, you have to apply for API keys from the website or the company offering the data
• Mostly free (limited queries)
![Page 35: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/35.jpg)
Idea behind API
1. You obtain a key from the company offering the data
2. Make requests for data
  – Many different ways, depending on the API
3. The company's server returns the data
4. Data analysis
![Page 36: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/36.jpg)
Commonly Used Protocol in API
• REST (REpresentational State Transfer)
  – Guidelines for client-server interaction for exchanging data, as opposed to the alternative, SOAP
• I recommend this funny explanation of REST vs SOAP (diagram involving Martin Lawrence)
  – http://stackoverflow.com/questions/209905/representational-state-transfer-rest-and-simple-object-access-protocol-soap
• Based on HTTP
• You request data via the HTTP GET method (http://www.w3schools.com/tags/ref_httpmethods.asp) and the server gives you the data
  – HTTP-URL?QueryStrings
  – QueryStrings: Field=Value pairs separated by &
  – E.g. http://www.youtube.com/watch?v=5pidokakU4I&t=0m38s
  – v: stands for video = some value
  – t: stands for start time = some value
• Usual data formats
  – XML: eXtensible Markup Language http://www.w3schools.com/xml/
  – JSON: JavaScript Object Notation http://www.w3schools.com/json/
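The URL?Field=Value&Field=Value structure can be taken apart and put back together with Python's standard urllib.parse module. A minimal sketch using the YouTube URL from the slide:

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "http://www.youtube.com/watch?v=5pidokakU4I&t=0m38s"

# Split the URL, then decode the query string into field -> list of values.
parts = urlsplit(url)
fields = parse_qs(parts.query)
print(parts.path)  # /watch
print(fields)      # {'v': ['5pidokakU4I'], 't': ['0m38s']}

# Going the other way: build the query string from a dict of fields.
rebuilt = "http://www.youtube.com/watch?" + urlencode({"v": "5pidokakU4I", "t": "0m38s"})
print(rebuilt == url)  # True
```

parse_qs returns lists because a field may legally appear more than once in a query string.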
![Page 37: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/37.jpg)
XML Example
<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Columbine</COMMON>
    <BOTANICAL>Aquilegia canadensis</BOTANICAL>
    <ZONE>3</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$9.37</PRICE>
    <AVAILABILITY>030699</AVAILABILITY>
  </PLANT>
</CATALOG>
Many XML-related packages: http://wiki.python.org/moin/PythonXml
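One of those packages ships with Python: xml.etree.ElementTree. As a small sketch, here is a shortened copy of the plant catalog parsed into a name-to-price dictionary (the prices variable and the choice of fields are for illustration only):

```python
import xml.etree.ElementTree as ET

# Shortened copy of the catalog from the slide.
catalog_xml = """
<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <PRICE>$2.44</PRICE>
  </PLANT>
  <PLANT>
    <COMMON>Columbine</COMMON>
    <PRICE>$9.37</PRICE>
  </PLANT>
</CATALOG>
"""

root = ET.fromstring(catalog_xml.strip())

prices = {}
for plant in root.findall("PLANT"):
    name = plant.findtext("COMMON")
    # Drop the leading "$" so the price can be stored as a number.
    prices[name] = float(plant.findtext("PRICE").lstrip("$"))

print(prices)  # {'Bloodroot': 2.44, 'Columbine': 9.37}
```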
![Page 38: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/38.jpg)
JSON Example (just like python)
newObject = {
  "first": "Ted",
  "last": "Logan",
  "age": 17,
  "sex": "M",
  "salary": 0,
  "registered": false,
  "interests": ["Van Halen", "Being Excellent", "Partying"]
}
Main Python module: import json
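A quick sketch of the json module on the object above. It also shows the one place where JSON is not quite "just like python": JSON's false becomes Python's False when loaded, and goes back to false when dumped.

```python
import json

# The JSON object from the slide, as the text an API would send back.
text = (
    '{"first": "Ted", "last": "Logan", "age": 17, "sex": "M", "salary": 0, '
    '"registered": false, "interests": ["Van Halen", "Being Excellent", "Partying"]}'
)

person = json.loads(text)  # JSON text -> Python dict

print(person["first"], person["age"])  # Ted 17
print(person["registered"])            # False
print(person["interests"][0])          # Van Halen

# Python dict -> JSON text again.
print(json.dumps({"registered": person["registered"]}))  # {"registered": false}
```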
![Page 39: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/39.jpg)
Yahoo Finance Data Example
![Page 40: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/40.jpg)
Python Package Wrapper
• Yahoo provides a simple web interface for anyone to download stock information via URL
  – http://finance.yahoo.com/d/quotes.csv?s=%s&f=%s
  – s: symbol, e.g. "GOOG"
  – f: stat (e.g. l1 means last trade price)
• http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=l1
• More info here
  – http://www.gummy-stuff.org/Yahoo-data.htm (ordered to take down)
  – http://web.archive.org/web/20140325063520/http://www.gummy-stuff.org/Yahoo-data.htm
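Filling in the two %s slots is all a wrapper has to do before making the request. A minimal sketch (quote_url is a hypothetical helper, not part of any package); it only builds the URL, since this Yahoo endpoint may no longer be served:

```python
from urllib.parse import urlencode

BASE_URL = "http://finance.yahoo.com/d/quotes.csv"

def quote_url(symbol, stat):
    """Build the query URL: s is the ticker symbol, f is the stat code."""
    return BASE_URL + "?" + urlencode({"s": symbol, "f": stat})

print(quote_url("GOOG", "l1"))
# http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=l1
```

Using urlencode instead of string formatting means symbols containing characters such as ^ or & are escaped correctly.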
![Page 41: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/41.jpg)
This Wrapper Package does it for you
• ystockquote
  – https://pypi.python.org/pypi/ystockquote/0.2.3
  – https://github.com/cgoldberg/ystockquote
• See the simple source code to learn
• Open up ystock.py
![Page 42: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/42.jpg)
Example: YQL
• http://developer.yahoo.com/yql/
• APIs are written by individual companies and support different I/O and usually different languages.
• Yahoo Query Language is a simple interface that Yahoo has made available to developers, combining several APIs
• "Yahoo! Query Language (YQL) enables you to access Internet data with SQL-like commands."
• Apply for your API key
  – http://developer.yahoo.com/yql/
![Page 43: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/43.jpg)
Our example: BBYOPEN
• https://bbyopen.com/bbyopen-apis-overview
• Retail information
  – Archive query - Returns a single file containing all attributes for all items exposed by the given API
  – Basic query - Returns information about a single item
  – Advanced query - Returns information about one or more items according to your specifications
  – Store availability query - Returns information about products available at specific stores
• Best Buy provides this API
• API overview
  – https://developer.bestbuy.com/get-started
![Page 44: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/44.jpg)
Basic Query
Basic query structure
http://api.remix.bestbuy.com/API/Item.Format?show=&apiKey=Key
API - One of {products, stores, reviews, categories}
Item - The value of the fundamental attribute for the selected API:
  o products - sku
  o stores - storeId
  o reviews - id
  o categories - id
Format - One of {xml, json}
show= - (optional) The item attributes you want displayed
Key - Your API key
Note: show= and Key can be specified in either order.
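The template above can be turned into a small URL builder. This is a sketch, not the course's bestbuyAPI.py: basic_query_url is a hypothetical helper, the SKU and key are placeholders, the comma-separated form of show= is an assumption, and the URL follows the slide's template literally (the live API may require additional path components).

```python
from urllib.parse import urlencode

VALID_APIS = {"products", "stores", "reviews", "categories"}
VALID_FORMATS = {"xml", "json"}

def basic_query_url(api, item, fmt, api_key, show=None):
    """Fill in API/Item.Format?show=...&apiKey=... from the slide's template."""
    if api not in VALID_APIS:
        raise ValueError("unknown API: " + api)
    if fmt not in VALID_FORMATS:
        raise ValueError("unknown format: " + fmt)
    params = {}
    if show:
        # Assumption: multiple attributes are joined with commas.
        params["show"] = ",".join(show)
    params["apiKey"] = api_key
    query = urlencode(params, safe=",")  # keep commas readable in the URL
    return "http://api.remix.bestbuy.com/{}/{}.{}?{}".format(api, item, fmt, query)

# Placeholder SKU and key, for illustration only.
print(basic_query_url("products", "1234567", "json", "YOUR_KEY", show=["sku", "name"]))
# http://api.remix.bestbuy.com/products/1234567.json?show=sku,name&apiKey=YOUR_KEY
```

Validating api and fmt up front catches typos locally instead of spending one of the limited API queries on a bad request.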
![Page 45: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/45.jpg)
Basic Query Examples
![Page 46: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/46.jpg)
API example
• Open up bestbuyAPI.py
![Page 47: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/47.jpg)
Lab session
• For the next 10-15 minutes, choose your favorite website and try to scrape a few items
• We’ll do this again with scrapy
![Page 48: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/48.jpg)
Data isn't impossibly hard to get after all. There are many routes, but it could take a LONG time (especially if you are going the company route). START EARLY and you'll get that data.
DATA!
![Page 49: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp](https://reader035.vdocuments.mx/reader035/viewer/2022081418/56649c9b5503460f94959584/html5/thumbnails/49.jpg)
Next Session
• Hugh will be speaking about HPCC
• After that, we will learn the basics of Scrapy
• Brush up on your HTML and look into XPath
  – w3schools.com is the best
• Intro into Big Data and Empirical Business Research