H2O World - Collaborative, Reproducible Research with H2O - Nick Elprin

Collaborative, Reproducible Research with H2O

November 10, 2015

dominodatalab.com


Who am I?

Enterprise data science platform for analytically sophisticated organizations

Previously built analytical software at a big hedge fund

BA, MS in computer science


My goals for this talk

Convey why reproducibility and collaboration are important

Share insights, tips, principles, technologies to help implement best practices


Motivation

Individual productivity
• Less wasted time tracking, reproducing past work
• Less wasted time on environment setup

Team efficiency
• Work compounds; don’t re-invent the wheel
• More feedback, faster iteration
• Faster onboarding of new team members

More insights
• Shared context and discussion facilitate idea generation

Methodology/regulatory
• Some disciplines / industries have auditing requirements


Challenges in a data science context

• Analytical work is much more than just source code

• Data, results, parameters all important to tracking progress and sharing

• Generating results requires running code — can’t just store files

• Running code requires hardware, and software/packages
  • Setting these up can be a pain
  • Software/packages can differ between people and over time

• Source control (e.g., git) too complex for many data scientists

• Hard to mandate behavior top down — have to incentivize it bottom up

Technical

Organizational


Ten Simple Rules for Reproducible Computational Research

1. For Every Result, Keep Track of How It Was Produced

2. Avoid Manual Data Manipulation Steps

3. Archive the Exact Versions of All External Programs Used

4. Version Control All Custom Scripts

5. Record All Intermediate Results, When Possible in Standardized Formats

6. For Analyses That Include Randomness, Note Underlying Random Seeds

7. Always Store Raw Data behind Plots

8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

9. Connect Textual Statements to Underlying Results

10. Provide Public Access to Scripts, Runs, and Results

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285
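
Rules 1 and 6 above can be illustrated with a small sketch. It is not taken from the talk: the function name `run_with_provenance`, the toy analysis, and the fields of the record are all assumptions chosen for illustration. The idea is simply that every result is returned together with the seed and a hash of the code that produced it, so the run can be repeated later.

```python
import hashlib
import json
import platform
import random
from datetime import datetime, timezone


def run_with_provenance(seed, script_text):
    """Run a toy analysis and return the result with a provenance record.

    `script_text` stands in for the source of the analysis script; in a
    real project it could be read from the script file (an assumption).
    """
    rng = random.Random(seed)  # dedicated, seeded generator (rule 6)
    sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    result = sum(sample) / len(sample)

    # Everything needed to trace how the result was produced (rule 1).
    return {
        "result": result,
        "seed": seed,
        "script_sha256": hashlib.sha256(script_text.encode("utf-8")).hexdigest(),
        "python_version": platform.python_version(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    print(json.dumps(run_with_provenance(42, "toy analysis v1"), indent=2))
```

Running the function twice with the same seed gives the same `result`, which is the property rule 6 is after; only the timestamp differs between the two records.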


Strategy

Give individuals something they want.

Package it in a solution that facilitates best practices.


Solution: a central hub to run, track, and share work

Easy Access to Scalable Compute

• Long-running scripts or interactive work
• Run multiple experiments in parallel
• Elastic resources via cloud infrastructure

Turnkey Deployment & Operationalization

• Package analyses into self-service web UIs
• Execute models through REST APIs
• Schedule automated recurring tasks

Version Control & Reproducibility

• Automatic tracking of code, data, and results
• Supports concurrent development

Collaboration

• Share code, data, and results
• Discuss, comment on results and other work
• Search and security

Incentivize centralization:

Capitalize on centralization:


Containers (Docker)

• Standardized, centrally managed “environments” with software and configuration already set up

• Any language/software, even commercial / proprietary (e.g., Matlab)

• Including interactive tools, e.g., Jupyter, RStudio

• Able to change environments per project, per run, possibly without going through IT

• Containers can be tracked and stored

• Can change environment without affecting others

Flexibility for users

Reproducibility and comparison
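
Where a full container image is not available, the "tracked and comparable environment" idea can be approximated by recording installed package versions alongside each run. The sketch below is an assumption-laden illustration, not the mechanism described in the talk: `snapshot_environment` and `diff_environments` are hypothetical helper names, and only Python packages are captured (a container also pins system libraries and other software).

```python
import platform
from importlib import metadata


def snapshot_environment():
    """Return the Python version and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


def diff_environments(old, new):
    """Map each package whose version differs to (old_version, new_version)."""
    changed = {}
    for name in sorted(set(old["packages"]) | set(new["packages"])):
        before = old["packages"].get(name)
        after = new["packages"].get(name)
        if before != after:
            changed[name] = (before, after)
    return changed
```

Storing one snapshot per run makes it possible to ask, when two runs disagree, whether the software changed between them.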


Examples


Useful tools and tips

1. For Every Result, Keep Track of How It Was Produced

2. Avoid Manual Data Manipulation Steps

3. Archive the Exact Versions of All External Programs Used

4. Version Control All Custom Scripts

5. Record All Intermediate Results, When Possible in Standardized Formats

6. For Analyses That Include Randomness, Note Underlying Random Seeds

7. Always Store Raw Data behind Plots

8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

9. Connect Textual Statements to Underlying Results

10. Provide Public Access to Scripts, Runs, and Results

Automatic

Automatic

Docker

Pickle, Rda/Rds

“stats” json

Discussion / comments

Notebooks / knitr

Pickle, Rda/Rds
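
As a rough illustration of the Pickle and "stats" json suggestions, the sketch below saves the raw values behind a plot with `pickle` and writes a small summary to a JSON file. The function name `save_run_outputs` and the file names `plot_data.pkl` and `stats.json` are assumptions made for this example, not names from the talk or from any particular platform.

```python
import json
import pickle
import statistics
from pathlib import Path


def save_run_outputs(values, out_dir):
    """Persist the raw data behind a plot plus a small summary of it."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Raw data behind the plot, so the figure can be redrawn or
    # re-examined later without re-running the analysis.
    with open(out / "plot_data.pkl", "wb") as f:
        pickle.dump(values, f)

    # Top-level summary: a quick layer of detail that is easy to
    # compare across runs before digging into the raw data.
    stats = {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }
    with open(out / "stats.json", "w") as f:
        json.dump(stats, f, indent=2)
    return stats


if __name__ == "__main__":
    print(save_run_outputs([1.0, 2.0, 3.0, 4.0], "run_outputs"))
```

The JSON file is the human-readable layer and the pickle is the full-detail layer; R users could do the same with Rda/Rds files.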


Questions / Feedback?