Data Science at Scale by Sarah Guido
![Page 1: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/1.jpg)
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido
Spark Summit Europe 2015
![Page 2: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/2.jpg)
Overview
• About me/Bitly
• Spark overview
• Using Spark for data science
• When it works, it’s great! When it works…
![Page 3: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/3.jpg)
About me
• Data Scientist at Bitly
• NYC Python/PyGotham co-organizer
• O’Reilly Media author
• @sarah_guido
![Page 4: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/4.jpg)
About this talk
• This talk is:
  – Description of my workflow
  – Exploration of within-Spark tools
• This talk is not:
  – In-depth exploration of algorithms
  – Building new tools on top of Spark
  – Any sort of ground truth for how you should be using Spark
![Page 5: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/5.jpg)
A bit of background
• Need for big data analysis tools
• MapReduce for exploratory data analysis ==
• Iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!
![Page 6: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/6.jpg)
Bitly data!
• Legit big data
• 1 hour of decodes is 10 GB
• 1 day is 240 GB
• 1 month is ~7 TB
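The daily and monthly figures follow from the hourly one by simple scaling. A quick check, assuming a 30-day month and decimal units (1 TB = 1000 GB), neither of which the slide states:

```python
# Back-of-the-envelope check of the decode volumes quoted on the slide.
# Assumptions (not from the talk): 30-day month, 1 TB = 1000 GB.
GB_PER_HOUR = 10

gb_per_day = GB_PER_HOUR * 24
tb_per_month = gb_per_day * 30 / 1000

print(f"{gb_per_day} GB/day, {tb_per_month} TB/month")
```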
![Page 7: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/7.jpg)
Why Spark?
• Fast. Really fast.
• Distributed scientific tools
• Python! (Sometimes.)
• Cutting edge technology
• AWS/EMR/S3
![Page 8: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/8.jpg)
Setting up the workflow
• Spark journey
  – Hadoop server: 1.2 – Python
  – EMR: 1.3 – Python
  – EMR: 1.4 – Python/Scala
  – EMR: 1.5 – Scala
![Page 9: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/9.jpg)
Let’s set the stage…
• Understanding user behavior
• How do I extract, explore, and model a subset of our data using Spark?
![Page 10: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/10.jpg)
Data
{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4 Safari/600.4.10", "c": "US", "nk": 0, "tz": "America/Los_Angeles", "g": "1HfTjh8", "h": "1HfTjh7", "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why-health-care-tech-is-still-so-bad.html?smid=tw-share", "t": 1427288425, "cy": "Seattle"}
![Page 11: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/11.jpg)
Data processing
• Problem: I want to retrieve NYT decodes
• Solution: well, there are two…
• Spark 1.3
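The code from the slides that follow did not survive in this transcript. As a rough illustration of the filtering step only, here is a plain-Python sketch of a predicate that keeps NYT decodes. It assumes that the "u" field holds the destination URL (as the sample record suggests) and that "NYT" means a host of nytimes.com or one of its subdomains; the function name and sample lines are hypothetical.

```python
import json
from urllib.parse import urlparse


def is_nyt_decode(line: str) -> bool:
    """Return True if a JSON decode record points at a nytimes.com URL."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False  # skip malformed lines instead of failing the whole job
    if not isinstance(record, dict):
        return False
    url = record.get("u")  # assumption: "u" is the destination URL
    if not isinstance(url, str):
        return False
    host = urlparse(url).hostname or ""
    return host == "nytimes.com" or host.endswith(".nytimes.com")


# Hypothetical input: one NYT decode, one non-NYT decode, one malformed line.
sample_lines = [
    '{"c": "US", "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/'
    'why-health-care-tech-is-still-so-bad.html?smid=tw-share", "t": 1427288425}',
    '{"c": "US", "u": "http://example.com/not-nytimes.com", "t": 1427288426}',
    "not json",
]

nyt_decodes = [json.loads(line) for line in sample_lines if is_nyt_decode(line)]
print(len(nyt_decodes))
```

In PySpark, a predicate like this could plausibly be passed to an RDD's `filter` after reading the raw lines with `textFile`; whether the talk did it this way is not recoverable from the transcript.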
![Page 12: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/12.jpg)
Data processing
![Page 13: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/13.jpg)
Data processing
![Page 14: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/14.jpg)
Data processing
• SparkSQL: 8 minutes
• Pure Spark: 4 minutes!!!
![Page 15: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/15.jpg)
Data processing
![Page 16: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/16.jpg)
Topic modeling
• Problem: we have so many links but no way to classify them into certain kinds of content
• Solution: LDA (latent Dirichlet allocation)
  – Sort of – compare to other solutions
• Spark 1.4
![Page 17: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/17.jpg)
Topic modeling
• LDA in Spark
  – Generative model
  – Several different methods
  – Term frequency vector as input
• “Note: LDA is still an experimental feature under active development...”
![Page 18: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/18.jpg)
Topic modeling
![Page 19: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/19.jpg)
Topic modeling
• Term frequency vector
| Document / Term | python | data | hot dogs | baseball | zoo |
|---|---|---|---|---|---|
| doc_1 | 1 | 3 | 0 | 0 | 0 |
| doc_2 | 0 | 0 | 4 | 1 | 0 |
| doc_3 | 4 | 0 | 0 | 0 | 5 |
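To make the table above concrete, here is a minimal plain-Python sketch that builds the same term frequency vectors. The document contents are hypothetical, chosen only to reproduce the counts shown, and "hot dogs" is treated as a single term; how the talk actually tokenized text is not shown in the transcript.

```python
from collections import Counter

# Hypothetical documents, already split into terms, that reproduce the table.
docs = {
    "doc_1": ["python"] + ["data"] * 3,
    "doc_2": ["hot dogs"] * 4 + ["baseball"],
    "doc_3": ["python"] * 4 + ["zoo"] * 5,
}

# A fixed vocabulary order so every vector has the same layout.
vocabulary = ["python", "data", "hot dogs", "baseball", "zoo"]


def term_frequency_vector(terms, vocab):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(terms)
    return [counts[term] for term in vocab]


tf_vectors = {
    name: term_frequency_vector(terms, vocabulary)
    for name, terms in docs.items()
}

for name, vector in tf_vectors.items():
    print(name, vector)
```

Each list corresponds to one row of the table. To feed such counts to Spark's LDA, they would typically be wrapped in MLlib vectors (for example with `Vectors.dense` from `pyspark.mllib.linalg`) and paired with a document ID.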
![Page 20: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/20.jpg)
Topic modeling
![Page 21: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/21.jpg)
Topic modeling
![Page 22: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/22.jpg)
Trend Detection
• Tell our clients when a particular piece of content is trending
• Transition to Scala
• Workflow improvement
• EMR + Spark 1.5 + Jupyter + Scala!
![Page 23: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/23.jpg)
Trend Detection
![Page 24: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/24.jpg)
Trend Detection
![Page 25: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/25.jpg)
Architecture
• Right now: not in production
  – Buy-in
• Streaming applications for parts of the app
• Python or Scala?
  – Scala by force
![Page 26: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/26.jpg)
Some issues
• Hadoop servers
• JVM
• gzip
• 1.4/resource allocation/EMR
• Lack of documentation
![Page 27: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/27.jpg)
Where to go next?
• Spark in production!
• Use for various parts of our app
• Use for R&D and prototyping purposes, with the potential to expand into the product
![Page 28: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/28.jpg)
Resources/Source Material
• spark.apache.org - documentation
• Databricks blog
• Cloudera blog
• Other Spark users!
![Page 29: Data Science at Scale by Sarah Guido](https://reader036.vdocuments.mx/reader036/viewer/2022070603/5871103d1a28abac6d8b5937/html5/thumbnails/29.jpg)
Thanks!!
@sarah_guido