@MargrietGr
Margriet Groenendijk, PhD
Developer Advocate for IBM Cloud Data Services
O’Reilly Software Architecture Conference
San Francisco, 16 November 2016
Cloud Architectures for Data Science
About me
• Developer Advocate at IBM Cloud Data Services, UK
  • Data science
  • Python, Spark, R, Cloudant, dashDB
• Research Fellow at University of Exeter, UK
  • Worked with very large observational datasets and the output of global-scale climate models
• PhD at Vrije Universiteit Amsterdam, the Netherlands
  • Explored large observational datasets of carbon uptake by forests
A Brief History of Data Science
• Computer Science
• Data Technology
• Visualization
• Mathematics
• Statistics
http://www.datascienceassn.org/content/history-data-science
https://blog.rjmetrics.com/2015/10/05/how-many-data-scientists-are-there/
How many Data Scientists are there?
https://whatsthebigdata.com/2015/11/08/top-skills-and-backgrounds-of-data-scientists-on-linkedin/
Toolbox
http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png
Data Science is a Team Effort
Data Engineers
Data Scientists
Business Analysts
App Developers
Data
Data Science Workflow
• Define Question
• Find Data
• Explore Data
• Clean Data
• Visualize and Summarize Data
• Create Predictive Models
• Present Results
RDDs: Resilient Distributed Datasets
• Data does not have to fit on a single machine
• Data is separated into partitions
• Creation of RDDs
  • Load an external dataset
  • Distribute a collection of objects
• Transformations construct a new RDD from a previous one (lazy!)
• Actions compute a result based on an RDD
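The difference between lazy transformations and actions can be sketched without a Spark installation. The block below is a toy model in plain Python, not Spark's API: `MiniRDD` is a hypothetical class invented for illustration, and it only borrows the method names that PySpark uses (`parallelize`, `map`, `filter`, `collect`, `count`). Transformations just record what to do; nothing is computed until an action runs.

```python
class MiniRDD:
    """Toy stand-in for an RDD: partitions plus a list of pending transformations."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions
        self.ops = tuple(ops)

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        # "Distribute a collection of objects": split it into contiguous chunks.
        items = list(items)
        size = max(1, -(-len(items) // num_partitions))  # ceiling division
        return cls([items[i:i + size] for i in range(0, len(items), size)])

    # Transformations: return a new MiniRDD, compute nothing yet.
    def map(self, fn):
        return MiniRDD(self.partitions, self.ops + (("map", fn),))

    def filter(self, predicate):
        return MiniRDD(self.partitions, self.ops + (("filter", predicate),))

    def _compute(self, partition):
        # Apply the recorded transformations to one partition, as a lazy iterator.
        items = iter(partition)
        for kind, fn in self.ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: walk every partition and produce a result.
    def collect(self):
        return [x for part in self.partitions for x in self._compute(part)]

    def count(self):
        return sum(1 for part in self.partitions for _ in self._compute(part))


calls = []  # records every element the mapped function actually sees


def square(x):
    calls.append(x)
    return x * x


rdd = MiniRDD.parallelize(range(1, 7), num_partitions=3)
even_squares = rdd.map(square).filter(lambda x: x % 2 == 0)
print(len(calls))              # still 0: the transformations have not run
print(even_squares.collect())  # the action triggers the computation
print(len(calls))              # square has now been called once per element
```

In real PySpark the same chain would start from a SparkContext, for example `sc.parallelize(...)` or `sc.textFile(...)`, and the partitions would live on different workers rather than in one Python list.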
Run Spark locally in a Python notebook
https://www.continuum.io/downloads
http://spark.apache.org/downloads.html
Create a new kernel to use in a Jupyter notebook
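One possible way to create such a kernel is to write a `kernel.json` file that points a normal Python kernel at the local Spark download. The sketch below only builds and writes that file. The paths, the display name, and the helper names `build_kernel_spec` and `write_kernel_spec` are assumptions for illustration; the py4j zip file name depends on the Spark version, so it is passed in rather than guessed.

```python
import json
import os
import tempfile
from pathlib import Path


def build_kernel_spec(spark_home, py4j_zip, python_exe="python", master="local[2]"):
    """Return a kernel.json dictionary for a Python kernel that can import pyspark."""
    spark_python = os.path.join(spark_home, "python")
    return {
        "display_name": "PySpark (local)",  # hypothetical name shown in the notebook UI
        "language": "python",
        # Jupyter substitutes {connection_file} when it launches the kernel.
        "argv": [python_exe, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": spark_home,
            # Make the pyspark package and its py4j dependency importable.
            "PYTHONPATH": os.pathsep.join([spark_python, py4j_zip]),
            "PYSPARK_SUBMIT_ARGS": "--master {} pyspark-shell".format(master),
        },
    }


def write_kernel_spec(spec, kernel_dir):
    """Write the spec as kernel.json inside kernel_dir and return the file path."""
    kernel_dir = Path(kernel_dir)
    kernel_dir.mkdir(parents=True, exist_ok=True)
    path = kernel_dir / "kernel.json"
    path.write_text(json.dumps(spec, indent=2))
    return path


if __name__ == "__main__":
    # Placeholder paths: replace them with the location of the Spark download.
    spec = build_kernel_spec("/path/to/spark", "/path/to/spark/python/lib/py4j-src.zip")
    target = Path(tempfile.mkdtemp()) / "pyspark-local"
    print(write_kernel_spec(spec, target).read_text())
```

The resulting directory can then be registered with `jupyter kernelspec install --user <directory>`. With this setup the notebook does not get a ready-made `sc`; the SparkContext is created in the first cell.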
Jupyter Notebooks!
• Server-client application to edit and run notebook documents via a web browser
• Cells with:
  • Code
  • Figures and tables
  • Rich text elements
• Different kernels: Python, R, Scala, Spark
In the Cloud:
Define Question
What will the weather be next weekend?
https://unsplash.com/search/autumn?photo=LSF8WGtQmn8
https://unsplash.com/search/rain?photo=19tQv51x4-A
Explore Data
Python packages
• requests and json
  • API credentials and latitude/longitude of San Francisco
  • JSON data returned
• pandas, numpy and datetime
  • Convert JSON to a pandas DataFrame (table with multiple indices)
  • Add time as index
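The JSON-to-DataFrame step might look roughly like the sketch below. It skips the API call, so no credentials are needed, and parses an inline sample instead. The field names (`forecasts`, `valid_time`, `temp`, `rain_probability`) and all values are invented for illustration and are not the weather API's real schema; `forecast_to_frame` is a hypothetical helper.

```python
import json

import pandas as pd

# Made-up payload standing in for the JSON an API call would return.
SAMPLE_RESPONSE = """
{"forecasts": [
  {"valid_time": "2016-11-19T15:00:00", "temp": 16, "rain_probability": 40},
  {"valid_time": "2016-11-19T09:00:00", "temp": 13, "rain_probability": 10},
  {"valid_time": "2016-11-19T21:00:00", "temp": 12, "rain_probability": 70}
]}
"""


def forecast_to_frame(payload, time_field="valid_time"):
    """Parse a JSON string of forecast records into a DataFrame indexed by time."""
    records = json.loads(payload)["forecasts"]
    df = pd.DataFrame(records)
    # Convert the timestamp strings to datetimes and use them as the index.
    df[time_field] = pd.to_datetime(df[time_field])
    return df.set_index(time_field).sort_index()


forecast = forecast_to_frame(SAMPLE_RESPONSE)
print(forecast)
```

With a live service, the string passed to `forecast_to_frame` would be the `text` of a `requests.get(...)` response, requested with the API credentials and the latitude/longitude of interest.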
Weather forecast for San Francisco
https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/
Visualize Data
Python packages
• pandas - rolling mean
• matplotlib
• Basemap
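As a small illustration of the rolling mean, the sketch below smooths an hourly temperature series with pandas. The temperatures are made-up values, the three-hour window is an arbitrary choice, and `smooth` is a hypothetical helper name.

```python
import pandas as pd


def smooth(series, window=3):
    """Trailing rolling mean: each value is the mean of the last `window` points."""
    return series.rolling(window=window).mean()


# Six hourly readings with invented values.
hours = pd.date_range("2016-11-19", periods=6, freq=pd.Timedelta(hours=1))
temps = pd.Series([10.0, 12.0, 17.0, 13.0, 11.0, 15.0], index=hours)

smoothed = smooth(temps)
print(pd.DataFrame({"temp": temps, "rolling_mean": smoothed}))
```

The first `window - 1` entries are NaN because a full window is not yet available. In a notebook, calling `smoothed.plot()` draws the result with matplotlib.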
Weather map - example for UK
https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/
Python packages
• matplotlib
• Basemap
• itertools
• urllib
Weather, Twitter and Sentiment
• Where to find the data?
• Where to store the data?
• Where to analyse the data?
• Quick tools to explore
• Watson Tone Analyser
Emotion
Language style
Social propensities
Analyze how you are coming across to others
Simpler Workflow
Weather Company Data
crontab -e
0 23 * * * /path/to/file/do_something.sh
python do_something.py
Tweets
Weather
Sentiment
Watson Tone Analyser
Insights for Twitter
Cloudant NoSQL
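The crontab entry above runs a script every day at 23:00. A minimal sketch of what `do_something.py` could do is shown below: fetch the weather, fetch tweets, score their tone, and store everything as one document. The service calls are passed in as plain functions, so the example runs without credentials. All function names, the record layout, and the stub values are assumptions for illustration, not the speaker's actual script or any service's real API.

```python
from datetime import datetime, timezone


def build_record(place, fetch_weather, fetch_tweets, score_tone, now=None):
    """Combine weather, tweets and tone scores into one JSON-style document."""
    now = now or datetime.now(timezone.utc)
    tweets = fetch_tweets(place)
    return {
        "_id": f"{place}:{now:%Y-%m-%d}",  # one document per place and day
        "place": place,
        "collected_at": now.isoformat(),
        "weather": fetch_weather(place),
        "tweets": [{"text": text, "tone": score_tone(text)} for text in tweets],
    }


def run_once(place, fetch_weather, fetch_tweets, score_tone, store, now=None):
    """One scheduled run: build the document and hand it to the storage function."""
    record = build_record(place, fetch_weather, fetch_tweets, score_tone, now)
    store(record)
    return record


if __name__ == "__main__":
    # Stubs with invented values standing in for the weather, Twitter,
    # tone analysis and database services.
    saved = []
    run_once(
        "San Francisco",
        fetch_weather=lambda place: {"temp": 14, "conditions": "rain"},
        fetch_tweets=lambda place: ["Rain again...", "Love this city"],
        score_tone=lambda text: {"joy": 0.5},
        store=saved.append,
    )
    print(saved[0]["_id"], len(saved[0]["tweets"]))
```

Keeping the service calls as parameters means the same script can be tested locally with stubs and then wired to the real clients, with `store` writing the document to the database.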
PixieDust
https://github.com/ibm-cds-labs/pixiedust
Simpler Workflow
PixieDust: an Open Source Library that simplifies and improves Jupyter Python Notebooks
• PackageManager
• Visualizations
• Cloud Integration
• Scala Bridge
• Extensibility
• Embedded Apps
https://developer.ibm.com/clouddataservices/2016/10/11/pixiedust-magic-for-python-notebook/
@DTAIEB55
Install Spark packages or plain jars in your Notebook Python kernel without the need to modify a configuration file
Uses the GraphFrame Python APIs
Install GraphFrames Spark Package
One simple API: display()
Call the Options dialog
Panning/Zooming options
Performance statistics
Easily export your data to csv, json, html, etc. locally on your laptop or into a cloud-based service like Cloudant or Object Storage
Scala Bridge
Define a Python variable
Use the Python var in Scala
Define a Scala variable
Use the Scala var in Python
Easily extend PixieDust to create your own visualizations using HTML/CSS/JavaScript
Customized Visualization for GraphFrame Graphs
Encapsulate your analytics into compelling User Interfaces better suited for Line of Business Users
• Mission: To explore, understand and explain the origin and nature of life in the universe
• Origins: Started in 1959 by two physicists at Cornell
• NASA became interested in 1970, started working with SETI in 1988, funding cut in 1993
SETI@IBMCloud
http://www.seti.org/node/861
• The Allen Telescope Array
  • 198 million radio events detected in the last decade
  • 400,000 candidate signals identified
  • 5 TB of data generated in 10 hours
• No modern analysis or machine learning has been performed on this data
• 5 TB of special observations on IBM Object Store
SETI@IBMCloud - the Data
https://github.com/ibm-cds-labs/ibmseti/
Public Spark@SETI
4 TB of SETI Data stored in Object Storage
Web API provides Bluemix users access to download SETI data
Object Storage
Web API
Spark
Object Storage
Public Spark@SETI Bluemix Account
My Bluemix Account
Spark using Jupyter Notebook and IBM SETI Python Library
Goal: Amateur scientists/data scientists download and analyze SETI data
IBM Watson Data Platform
• Data Science Experience
• Watson Data Platform
• Machine Learning
• Sign up for beta: http://datascience.ibm.com/features#machinelearning
Data Science in the Cloud
• Flexible and quick to iterate, play and explore data
• APIs
  • Streaming data
  • Cloud databases
  • Watson
• Scaling up - add storage or Spark kernels
• Easy collaboration and presentation
  • Store Data
  • Share your analyses in notebooks
• Some useful packages: pandas, pyspark, requests, matplotlib, cloudant
• Notebooks can be extended! PixieDust
https://developer.ibm.com/clouddataservices/author/mgroenen/
Thanks!
Slides will be here: http://www.slideshare.net/MargrietGroenendijk