h2o world - collaborative, reproducible research with h2o - nick elprin

Collaborative, Reproducible Research with H2O November 10, 2015 dominodatalab.com

Upload: sri-ambati

Post on 24-Jan-2018



Confidential |August 28, 2014Investor Deck |

Who am I?

Enterprise data science platform for analytically sophisticated organizations

Previously built analytical software at a big hedge fund

BA, MS in computer science


My goals for this talk

Convey why reproducibility and collaboration are important

Share insights, tips, principles, technologies to help implement best practices


Motivation

Individual productivity
• Less wasted time tracking, reproducing past work
• Less wasted time on environment setup

Team efficiency
• Work compounds; don’t re-invent the wheel
• More feedback, faster iteration
• Faster onboarding of new team members

More insights
• Shared context and discussion facilitate idea generation

Methodology/regulatory
• Some disciplines / industries have auditing requirements


Challenges in a data science context

• Analytical work is much more than just source code

• Data, results, parameters all important to tracking progress and sharing

• Generating results requires running code — can’t just store files

• Running code requires hardware and software/packages: setting these up can be a pain, and software/packages can differ between people and over time

• Source control (e.g., git) too complex for many data scientists

• Hard to mandate behavior top down — have to incentivize it bottom up

Technical

Organizational


Ten Simple Rules for Reproducible Computational Research

1. For Every Result, Keep Track of How It Was Produced

2. Avoid Manual Data Manipulation Steps

3. Archive the Exact Versions of All External Programs Used

4. Version Control All Custom Scripts

5. Record All Intermediate Results, When Possible in Standardized Formats

6. For Analyses That Include Randomness, Note Underlying Random Seeds

7. Always Store Raw Data behind Plots

8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

9. Connect Textual Statements to Underlying Results

10. Provide Public Access to Scripts, Runs, and Results

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285
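
Rules 1 and 6 can be illustrated with a minimal sketch that stores a result together with the seed and parameters that produced it. The toy analysis, the function names `run_analysis` and `run_and_record`, and the record layout are all hypothetical and not taken from the talk.

```python
import json
import random
import time


def run_analysis(seed, n_samples=1000):
    """Toy stochastic analysis (hypothetical): mean of uniform random draws."""
    # A local generator means the seed alone determines the draws.
    rng = random.Random(seed)
    draws = [rng.random() for _ in range(n_samples)]
    return sum(draws) / n_samples


def run_and_record(seed, n_samples=1000):
    """Return the result together with what is needed to reproduce it."""
    return {
        "result": run_analysis(seed, n_samples),
        "seed": seed,                        # rule 6: note the random seed
        "params": {"n_samples": n_samples},  # rule 1: how the result was produced
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }


record = run_and_record(seed=42)
print(json.dumps(record, indent=2))
```

Re-running with the same seed and parameters should give the same `result`, which is what makes the stored record useful later.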


Strategy

Give individuals something they want.

Package it in a solution that facilitates best practices.


Solution: a central hub to run, track, and share work

Easy Access to Scalable Compute

• Long-running scripts or interactive work
• Run multiple experiments in parallel
• Elastic resources via cloud infrastructure

Turnkey Deployment & Operationalization

• Package analyses into self-service web UIs
• Execute models through REST APIs
• Schedule automated recurring tasks

Version Control & Reproducibility

• Automatic tracking of code, data, and results
• Supports concurrent development

Collaboration

• Share code, data, and results
• Discuss, comment on results and other work
• Search and security

Incentivize centralization:

Capitalize on centralization:


Containers (Docker)

• Standardized, centrally managed “environments” with software and configuration already set up

• Any language/software, even commercial / proprietary (e.g., Matlab)

• Including interactive tools, e.g., Jupyter, RStudio

• Able to change environments per project, per run (possibly without going through IT)

• Containers can be tracked and stored

• Can change environment without affecting others

Flexibility for users

Reproducibility and comparison
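
Containers capture the whole environment. As a lighter-weight complement that the slides do not describe, a run can also log its interpreter and package versions, in the spirit of rule 3. The sketch below uses only the Python standard library; the function name `snapshot_environment` and the output layout are assumptions for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Collect interpreter, OS, and installed package versions (hypothetical helper)."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


env = snapshot_environment()
# Print only the start of the snapshot; the full package list can be long.
print(json.dumps(env, indent=2)[:500])
```

Saving this snapshot next to each run's results makes it possible to compare environments between runs, even when the container image itself is not at hand.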


Examples


Useful tools and tips

1. For Every Result, Keep Track of How It Was Produced

2. Avoid Manual Data Manipulation Steps

3. Archive the Exact Versions of All External Programs Used

4. Version Control All Custom Scripts

5. Record All Intermediate Results, When Possible in Standardized Formats

6. For Analyses That Include Randomness, Note Underlying Random Seeds

7. Always Store Raw Data behind Plots

8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

9. Connect Textual Statements to Underlying Results

10. Provide Public Access to Scripts, Runs, and Results

Automatic

Automatic

Docker

Pickle, Rda/Rds

“stats” json

Discussion / comments

Notebooks / knitr

Pickle, Rda/Rds
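
The slide names Pickle and a “stats” json without showing them in use. A minimal Python sketch of one way to combine them is below: intermediate data (for example, the raw points behind a plot) is pickled, and a few summary numbers go into a JSON file. The file names `intermediate.pkl` and `stats.json`, the helper names, and the keys are hypothetical; the slide does not specify the format of the “stats” json.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_run_outputs(out_dir, intermediate, stats):
    """Write intermediate data as a pickle and summary numbers as JSON."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "intermediate.pkl", "wb") as f:  # hypothetical file name
        pickle.dump(intermediate, f)
    with open(out_dir / "stats.json", "w") as f:  # hypothetical file name
        json.dump(stats, f, indent=2, sort_keys=True)


def load_intermediate(out_dir):
    """Read back the pickled intermediate data."""
    with open(Path(out_dir) / "intermediate.pkl", "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    points = {"x": [1, 2, 3], "y": [2.0, 4.1, 5.9]}  # raw data behind a plot
    save_run_outputs(tmp, points, {"n_points": 3, "max_y": max(points["y"])})
    restored = load_intermediate(tmp)
    stats = json.loads((Path(tmp) / "stats.json").read_text())
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code. JSON is the safer choice for anything shared widely.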


Questions / Feedback?