how to build tools for data scientists that don't suck
TRANSCRIPT
Abstract, Automate, Enable
How To Build Tools For Data Scientists That Don’t Suck
Let’s say you want to expose some data. This here is you - enthusiastic and full of life.
What might that look like if you didn’t have many tools at your disposal?
Let’s take a look...
It might look like this PITA:
get dat data out!
AW YISSS!
how?
write a bunch of infrastructure code to get data out
decide where to keep data - make sure delimiters and metadata are correctly configured
okay now can i build the dashboard
nope
okay that took a while...what now
now you gotta build the tools to read the data out of where you decided to keep it
Oh btw, did you set up an ETL for this whole thing? Go do it.
mkay...
okay now am i done
now you build your dashboard - this is what you’re good at.
alright now here’s the hardest part - deploy it to production
how...
[data science happens here]
idk, make it fault tolerant, autoscaling, load balancing, ya get it
And this is the result...
this is you now - the life has been sucked out of you; you are tired and cranky
AND YOU HAVE TO DO THAT WHOLE THING AGAIN FOR EVERY PROJECT THAT YOU BUILD.
MEH.
This is what it can look like:
extract data
push button to deploy
[data science happens here]
First iteration: give them some tools!
What are the lowest hanging fruit of data scientists’ needs?
Then quickly build a rudimentary tool to solve the problem.
Do not build tools for problems that do not exist.
What does it take to build a great tool?
Abstract [out the complexity]
Data scientists don’t need to know the internal complexities of your tool.
They want to be able to read and write data as they please.
If your tool requires a lot of internal knowledge, it’s a sign that it is not well abstracted.
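To make that concrete, here is a minimal sketch of the kind of interface this implies - the helper names and bucket are made up, and it assumes pandas with pyarrow and s3fs available - where the data scientist reads and writes by logical name and the tool owns everything else:

import pandas as pd

# A sketch of a "well abstracted" data API: hypothetical helpers that
# hide the bucket, path, format, and delimiter choices entirely.
# (Assumes pandas with pyarrow and s3fs installed; bucket is made up.)

def write_data(name: str, df: pd.DataFrame) -> None:
    """Persist a DataFrame under a logical name; the tool picks the
    storage location, format, and metadata so the user never has to."""
    df.to_parquet(f"s3://data-warehouse/{name}.parquet")

def read_data(name: str) -> pd.DataFrame:
    """Fetch by logical name; no knowledge of the storage layer needed."""
    return pd.read_parquet(f"s3://data-warehouse/{name}.parquet")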
What does it take to build great tools?
Automate [the process]
No one wants to sit there and manually do stuff that could be easily automated.
Automated tools save people time and make them more productive. And they will love you for it, too.
Automation further abstracts out the complexity.
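For instance, the "delimiters and metadata" chore from the intro is exactly the kind of thing a tool can do for people. A tiny sketch of that idea - the helper name and the .meta.json sidecar convention are invented; csv.Sniffer does the actual detection:

import csv
import json
import pathlib

# Sketch: automate the "make sure delimiters and metadata are correctly
# configured" chore. Helper name and sidecar file are hypothetical.

def publish_csv(path: str) -> dict:
    """Sniff the delimiter, record column metadata, and save it -
    steps nobody should repeat by hand for every dataset."""
    header_line = pathlib.Path(path).read_text().splitlines()[0]
    dialect = csv.Sniffer().sniff(header_line)
    meta = {
        "delimiter": dialect.delimiter,
        "columns": header_line.split(dialect.delimiter),
    }
    pathlib.Path(path + ".meta.json").write_text(json.dumps(meta))
    return meta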
What does it take to build great tools?
Enable [data scientists to self-serve]
Your tool needs to enable the data scientists to do their job, and leave you out of it.
None of that “email me if this breaks” or “send request to platform for this”.
The tool needs to be easy to use and self-service, even when things go wrong.
Case Study: Flotilla
Removes configuration and installation pain
The story: the need for scalable Jupyter notebooks
Adapt to the data scientists’ workflow, instead of forcing your own workflow on them:
data scientists want shareable IPython/RStudio notebooks
the first solution was to stand up a JupyterHub server on a huge EC2 instance
you need to install/configure the libraries yourself
That might be okay for a few people...but Stitch Fix has 70 data scientists...and growing!
THE PROBLEMS:
● one host
● fragile data storage
● not scalable as team grows
The solution: Flotilla
We decided to make a version of JupyterHub that is distributed across many machines, with each notebook running in a Docker container.
personal networked home directory backed by NFS
personal Jupyter notebooks - Python, PySpark, and Scala+Spark - plus personal RStudio
Customizable memory!
[Architecture diagram: a load balancer routing to Flotilla notebook Docker containers running on ECS]
And this is what it looks like for users
Steps:
1. click on New Container button
2. name container, choose memory
3. use the notebook
Voila, notebook comes with all platform tools preinstalled, as well as other libraries you’d want to use.
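Under the hood, that New Container button presumably boils down to something like the following - a hedged sketch, not Flotilla’s actual code; the cluster, task definition, and container names are made up - launching one notebook task on ECS with the memory the user picked:

import boto3

# A sketch of launching a per-user notebook container on ECS.
# Cluster, task definition, and container names are all hypothetical.
ecs = boto3.client("ecs")

def launch_notebook(user: str, memory_mb: int):
    """Start one notebook container for `user` with the requested memory."""
    return ecs.run_task(
        cluster="flotilla",             # assumed cluster name
        taskDefinition="notebook",      # assumed task definition
        count=1,
        overrides={
            "containerOverrides": [{
                "name": "notebook",     # container within the task def
                "memory": memory_mb,    # the user-chosen memory limit
                "environment": [{"name": "NB_USER", "value": user}],
            }]
        },
    )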
Case Study: Book of Magic
Simplifies and removes the complexity of launching visualization tools
The story, part i: dashboards
Many data scientists started writing data dashboards for their business partners.
But online stuff is hard!
real-time data needs more infrastructure
code needs to be packaged and deployed...how??
deployment needs to be resilient to failures
They used what tools they had, mostly barebones AWS.
IT WAS CLEAR A TOOL WAS NEEDED!
The solution: Book of Magic
It was clear that we needed a deployment tool for these growing dashboards, so we built one!
The road is paved with Python:
package into an RPM with the push of a button
bake an AMI with the push of a button
deploy to AWS with the push of a button
Best practices are built into the process
push to package code
push to bake
[data science happens here]
push to deploy
Complexity under the hood:
[Pipeline diagram: your code repo (Flask code) plus an rpm spec file feed a Jenkins job that builds the RPM and pushes it to nexus (the rpm repo); a second Jenkins job bakes an AMI (using the fables repo); a third Jenkins job deploys that AMI into an autoscaling group (asg) of EC2 instances behind a load balancer (elb) serving internet traffic]
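Each push-button step plausibly maps onto triggering a parameterized Jenkins job. A minimal sketch of that idea - the Jenkins URL, credentials, job names, and parameters are all made up; only the standard buildWithParameters REST endpoint and the requests library are real:

import requests

# Sketch: the "push of a button" steps as parameterized Jenkins jobs.
# URL, credentials, job names, and parameters are all hypothetical.
JENKINS = "https://jenkins.example.com"
AUTH = ("builder", "api-token")  # placeholder credentials

def trigger(job: str, **params) -> None:
    """Kick off a parameterized Jenkins job via its REST endpoint."""
    resp = requests.post(
        f"{JENKINS}/job/{job}/buildWithParameters",
        params=params,
        auth=AUTH,
    )
    resp.raise_for_status()

# push to package -> build the RPM from your code repo
trigger("build-rpm", repo="my-dashboard", branch="master")
# push to bake -> bake an AMI containing that RPM
trigger("bake-ami", rpm="my-dashboard")
# push to deploy -> roll the AMI into the autoscaling group
trigger("deploy-asg", ami="my-dashboard-ami")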
The story, part ii: web services
Now that deployment was easy, data scientists started writing web services to integrate with engineering - literally injecting data science into our core applications:
style recommendations
smart fix assignments
inventory allocation
warehouse allocation
Whoa. Data scientists writing web services - that’s pretty cool.
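And since the road is paved with Python and Flask, such a service can be as small as this - a toy sketch; the route, payload, and stubbed scoring are invented for illustration:

from flask import Flask, jsonify, request

# A toy Flask web service in the shape described above: a data
# scientist's model behind an HTTP endpoint. Route and payload are
# hypothetical; only Flask itself comes from the talk.
app = Flask(__name__)

@app.route("/recommendations/<int:client_id>")
def recommendations(client_id: int):
    """Return style recommendations for a client (stubbed scoring)."""
    n = request.args.get("n", default=5, type=int)
    # In real life this would call the trained model; here we stub it.
    items = [{"style_id": i, "score": 1.0 / (i + 1)} for i in range(n)]
    return jsonify(client_id=client_id, items=items)

if __name__ == "__main__":
    app.run(port=8080)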
Takeaways
Make your data scientists’ lives easier! Build the tools that they need, and build them in a way that:
abstracts out complexity
automates the process
enables self-service
If you have these three parts, you’ll be surprised at how effective and productive your data science team can be while doing awesome stuff.
AWWWW YISS.
All pictures used in this presentation credit to Allie Brosh - hyperboleandahalf.blogspot.com