how to build tools for data scientists that don't suck
TRANSCRIPT
Abstract, Automate, Enable
How To Build Tools For Data Scientists That Don’t Suck
Let’s say you want to expose some data. This here is you - enthusiastic and full of life.
What might that look like if you didn’t have many tools at your disposal?
Let’s take a look...
It might look like this PITA:
get dat data out!
AW YISSS!
how?
write a bunch of infrastructure code to get data out
decide where to keep data - make sure delimiters and metadata are correctly configured
okay now can i build the dashboard
nope
okay that took a while...what now
now you gotta build the tools to read the data out of where you decided to keep it
Oh btw, did you set up an ETL for this whole thing? Go do it.
mkay...
okay now am i done
now you build your dashboard - this is what you’re good at.
alright now here’s the hardest part - deploy it to production
how...
[data science happens here]
idk, make it fault tolerant, autoscaling, load balancing, ya get it
And this is the result...
this is you now - the life has been sucked out of you; you are tired and cranky
AND YOU HAVE TO DO THAT WHOLE THING AGAIN FOR EVERY PROJECT THAT YOU BUILD.
MEH.
This is what it can look like:
extract data
push button to deploy
[data science happens here]
First iteration: give them some tools!
What are the lowest hanging fruit of data scientists’ needs?
Then quickly build a rudimentary tool to solve the problem.
Do not build tools for problems that do not exist.
What does it take to build a great tool?
Abstract [out the complexity]
Data scientists don’t need to know the internal complexities of your tool.
They want to be able to read and write data as they please.
If your tool requires a lot of internal knowledge, it’s a sign that it is not well abstracted.
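To make that concrete, here is a minimal sketch of the kind of interface this implies - the helper names and bucket are made up, and it assumes pandas with pyarrow and s3fs available - where the data scientist reads and writes by logical name and the tool owns everything else:

import pandas as pd

# A sketch of a "well abstracted" data API: hypothetical helpers that
# hide the bucket, path, format, and delimiter choices entirely.
# (Assumes pandas with pyarrow and s3fs installed; bucket is made up.)

def write_data(name: str, df: pd.DataFrame) -> None:
    """Persist a DataFrame under a logical name; the tool picks the
    storage location, format, and metadata so the user never has to."""
    df.to_parquet(f"s3://data-warehouse/{name}.parquet")

def read_data(name: str) -> pd.DataFrame:
    """Fetch by logical name; no knowledge of the storage layer needed."""
    return pd.read_parquet(f"s3://data-warehouse/{name}.parquet")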
What does it take to build great tools?
Automate [the process]
No one wants to sit there and manually do stuff that could be easily automated.
Automated tools save people time and make them more productive. And they will love you for it, too.
Automation further abstracts out the complexity.
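For instance, the "delimiters and metadata" chore from the intro is exactly the kind of thing a tool can do for people. A tiny sketch of that idea - the helper name and the .meta.json sidecar convention are invented; csv.Sniffer does the actual detection:

import csv
import json
import pathlib

# Sketch: automate the "make sure delimiters and metadata are correctly
# configured" chore. Helper name and sidecar file are hypothetical.

def publish_csv(path: str) -> dict:
    """Sniff the delimiter, record column metadata, and save it -
    steps nobody should repeat by hand for every dataset."""
    header_line = pathlib.Path(path).read_text().splitlines()[0]
    dialect = csv.Sniffer().sniff(header_line)
    meta = {
        "delimiter": dialect.delimiter,
        "columns": header_line.split(dialect.delimiter),
    }
    pathlib.Path(path + ".meta.json").write_text(json.dumps(meta))
    return meta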
What does it take to build great tools?
Enable [data scientists to self-serve]
Your tool needs to enable the data scientists to do their job, and leave you out of it.
None of that “email me if this breaks” or “send request to platform for this”.
The tool needs to be easy to use and self-service, even when things go wrong.
Case Study: Flotilla
Removes configuration and installation pain
The story: the need for scalable Jupyter notebooks
Adapt to the data scientists’ workflow, instead of forcing your own workflow on them:
data scientists want shareable IPython/RStudio notebooks
the first solution was to stand up a JupyterHub server on a huge EC2 instance
you need to install/configure the libraries yourself
That might be okay for a few people...but Stitch Fix has 70 data scientists...and growing!
THE PROBLEMS:
● one host
● fragile data storage
● not scalable as team grows
The solution: Flotilla
We decided to make a version of JupyterHub that is distributed across many machines, with each notebook running in a Docker container.
personal networked home directory backed by NFS
personal Jupyter notebooks - Python, PySpark, and Scala+Spark - plus personal RStudio
Customizable memory!
[Architecture diagram: a load balancer routing to Flotilla notebook Docker containers running on ECS]
And this is what it looks like for users
Steps:
1. click on New Container button
2. name container, choose memory
3. use the notebook
Voila, notebook comes with all platform tools preinstalled, as well as other libraries you’d want to use.
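Under the hood, that New Container button presumably boils down to something like the following - a hedged sketch, not Flotilla’s actual code; the cluster, task definition, and container names are made up - launching one notebook task on ECS with the memory the user picked:

import boto3

# A sketch of launching a per-user notebook container on ECS.
# Cluster, task definition, and container names are all hypothetical.
ecs = boto3.client("ecs")

def launch_notebook(user: str, memory_mb: int):
    """Start one notebook container for `user` with the requested memory."""
    return ecs.run_task(
        cluster="flotilla",             # assumed cluster name
        taskDefinition="notebook",      # assumed task definition
        count=1,
        overrides={
            "containerOverrides": [{
                "name": "notebook",     # container within the task def
                "memory": memory_mb,    # the user-chosen memory limit
                "environment": [{"name": "NB_USER", "value": user}],
            }]
        },
    )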
Case Study: Book of Magic
Simplifies and removes the complexity of launching visualization tools
The story, part i: dashboards
Many data scientists started writing data dashboards for their business partners.
But online stuff is hard!
real-time data needs more infrastructure
code needs to be packaged and deployed...how??
deployment needs to be resilient to failures
They used what tools they had, mostly barebones AWS.
IT WAS CLEAR A TOOL WAS NEEDED!
The solution: Book of Magic
It was clear that we needed a deployment tool for these growing dashboards, so we built one!
The road is paved with Python:
package into an RPM with the push of a button
bake an AMI with the push of a button
deploy to AWS with the push of a button
Best practices are built into the process
push to package code
push to bake
[data science happens here]
push to deploy
Complexity under the hood:
[Pipeline diagram: your code repo (Flask code) plus an rpm spec file feed a Jenkins job that builds the RPM and pushes it to nexus (the rpm repo); a second Jenkins job bakes an AMI (using the fables repo); a third Jenkins job deploys that AMI into an autoscaling group (asg) of EC2 instances behind a load balancer (elb) serving internet traffic]
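Each push-button step plausibly maps onto triggering a parameterized Jenkins job. A minimal sketch of that idea - the Jenkins URL, credentials, job names, and parameters are all made up; only the standard buildWithParameters REST endpoint and the requests library are real:

import requests

# Sketch: the "push of a button" steps as parameterized Jenkins jobs.
# URL, credentials, job names, and parameters are all hypothetical.
JENKINS = "https://jenkins.example.com"
AUTH = ("builder", "api-token")  # placeholder credentials

def trigger(job: str, **params) -> None:
    """Kick off a parameterized Jenkins job via its REST endpoint."""
    resp = requests.post(
        f"{JENKINS}/job/{job}/buildWithParameters",
        params=params,
        auth=AUTH,
    )
    resp.raise_for_status()

# push to package -> build the RPM from your code repo
trigger("build-rpm", repo="my-dashboard", branch="master")
# push to bake -> bake an AMI containing that RPM
trigger("bake-ami", rpm="my-dashboard")
# push to deploy -> roll the AMI into the autoscaling group
trigger("deploy-asg", ami="my-dashboard-ami")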
The story, part ii: web services
Now that deployment was easy, data scientists started writing web services to integrate with engineering - literally injecting data science into our core applications:
style recommendations
smart fix assignments
inventory allocation
warehouse allocation
Whoa. Data scientists writing web services - that’s pretty cool.
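And since the road is paved with Python and Flask, such a service can be as small as this - a toy sketch; the route, payload, and stubbed scoring are invented for illustration:

from flask import Flask, jsonify, request

# A toy Flask web service in the shape described above: a data
# scientist's model behind an HTTP endpoint. Route and payload are
# hypothetical; only Flask itself comes from the talk.
app = Flask(__name__)

@app.route("/recommendations/<int:client_id>")
def recommendations(client_id: int):
    """Return style recommendations for a client (stubbed scoring)."""
    n = request.args.get("n", default=5, type=int)
    # In real life this would call the trained model; here we stub it.
    items = [{"style_id": i, "score": 1.0 / (i + 1)} for i in range(n)]
    return jsonify(client_id=client_id, items=items)

if __name__ == "__main__":
    app.run(port=8080)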
Takeaways
Make your data scientists’ lives easier! Build the tools that they need, and build them in a way that:
abstracts out complexity
automates the process
enables self-service
If you have these three parts, you’ll be surprised at how effective and productive your data science team can be while doing awesome stuff.
AWWWW YISS.
All pictures used in this presentation credit to Allie Brosh - hyperboleandahalf.blogspot.com