a step towards reproducibility in r

A Step Towards Reproducibility in R, H2O World, November 18-19, 2014

Upload: revolution-analytics

Post on 02-Jul-2015

DESCRIPTION

Joseph Rickert's presentation at H2O World, November 2014.

TRANSCRIPT

A Step Towards Reproducibility in R

H2O World, November 18-19, 2014

R’s popularity is growing rapidly

IEEE Spectrum Top Programming Languages

#15: R

• IEEE Spectrum, July 2014

• RedMonk Programming Language Rankings, 2013

R is used more than other data science tools

• O’Reilly Strata 2013 Data Science Salary Survey

• KDNuggets Poll: Top Languages for analytics, data mining, data science

R is among the highest-paid IT skills in the US

• Dice Tech Salary Survey, January 2014

• O’Reilly Strata 2013 Data Science Salary Survey

Companies Using R

Google

• Advertising Effectiveness

• Economic forecasting

“The great beauty of R is that you can modify it to do all sorts of things.”
- Hal Varian, Chief Economist, Google

“R is really important to the point that it's hard to overvalue it.”
- Daryl Pregibon, Head of Statistics, Google

Facebook

• Exploratory Data Analysis

• Experimental Analysis

“Generally, we use R to move fast when we get a new data set. With R, we don’t need to develop custom tools or write a bunch of code. Instead, we can just go about cleaning and exploring the data.”
- Solomon Messing, data scientist at Facebook

Twitter

• Data Visualization

• Semantic clustering

“A common pattern for me is that I'll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end”
- Ed Chen, Data Scientist, Twitter

John Deere

Statistical Analysis:

• Short Term Demand Forecasting

• Crop Forecasting

• Long Term Demand Forecasting

• Maintenance and Reliability

• Production Scheduling

• Data Coordination

Monsanto

Statistical Analysis:

• Plant Breeding

• Fertility mapping

• Precision Seeding

• Disease Management

• Yield forecasting

Pharmaceuticals

“R use at the FDA is completely acceptable and has not caused any problems.”
- Dr Jae Brodsky, Office of Biostatistics, Food and Drug Administration

Regulatory Drug Approvals

• Reproducible research

• Accurate, reliable and consistent statistical analysis

• Internal reporting (Section 508 compliance)

Revolution Analytics

Open Source development

– Revolution R Open, RHadoop, ParallelR, DeployR Open, Reproducible R Toolkit

– Project funding

Community Support

– User Group Sponsorship

– Meetups

– Events sponsorship

– Revolutions Blog

Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method. (Wikipedia)

Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them. (Roger Peng)

Reproducibility – why do we care?

Academic / Research

Verify results

Advance Research

Business

Production code

Reliability

Reusability

Collaboration

Regulation

www.nytimes.com/2011/07/08/health/research/08genes.html

http://arxiv.org/pdf/1010.1092.pdf

An R Reproducibility Problem

Adapted from http://xkcd.com/234/ CC BY-NC 2.5

Revolution Analytics’ Reproducibility Environment

A Distribution of R (RRO) that points to a static CRAN mirror

The Checkpoint Server: the static CRAN mirror

– CRAN packages fixed with each Revolution R Open update (currently 10/1/14)

Daily CRAN snapshots

– Storing every package version since September 2014

– Binaries and sources

– At mran.revolutionanalytics.com/snapshot

CRAN package checkpoint

[Diagram: CRAN mirror (http://cran.revolutionanalytics.com/) → checkpoint server (daily snapshots, taken at midnight UTC, at http://mran.revolutionanalytics.com/snapshot/) → checkpoint package: library(checkpoint); checkpoint("2014-09-17")]
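The snapshot address and the date passed to checkpoint() suggest how a session could be pointed at one day's packages by hand. The following is a minimal sketch in base R, not the checkpoint package's own code. It assumes each daily snapshot is served at the snapshot address followed by /yyyy-mm-dd; the helper names snapshot_url and use_snapshot are hypothetical.

```r
# Sketch only: build a dated snapshot URL and make it the session's CRAN repository.
# ASSUMPTION: snapshots live at <base>/<yyyy-mm-dd>; the slides give only the base address.
snapshot_url <- function(date,
                         base = "http://mran.revolutionanalytics.com/snapshot") {
  d <- as.Date(date, format = "%Y-%m-%d")   # NA when the string is not a valid date
  if (is.na(d)) stop("date must be in yyyy-mm-dd form: ", date)
  paste0(base, "/", format(d, "%Y-%m-%d"))
}

# Hypothetical helper: after this call, install.packages() would look for
# packages in the dated snapshot instead of the live CRAN mirror.
use_snapshot <- function(date) {
  url <- snapshot_url(date)
  options(repos = c(CRAN = url))
  invisible(url)
}

use_snapshot("2014-09-17")
getOption("repos")[["CRAN"]]
```

Setting the repository option only changes where packages are fetched from; it does not by itself isolate the installed packages per project.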

Using Revolution Analytics’ Reproducibility Tools

Scenario 1: Set up a consistent, company-wide R environment

– Have users download RRO

– All users will get the base and recommended packages as of 10/1/14

– For each project, R users run checkpoint to download a consistent set of packages that are appropriate for that project

Scenario 2: With or without RRO, share scripts synced to a snapshot

– Have the user with whom you are sharing put your scripts in a separate project and download the checkpoint package

– Have the user run checkpoint("yyyy-mm-dd") with a date appropriate for your project

– Checkpoint will automatically download the correct versions of the packages used in the scripts
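Both scenarios depend on every user passing the same, valid date string. As a hedged illustration, a shared script could check the date before handing it to checkpoint(). The function check_snapshot_date and the constant first_snapshot are hypothetical. The earliest date of 2014-09-17 is an assumption taken from the example date on the slides, which state only that snapshots began in September 2014.

```r
# Sketch only: validate a snapshot date string before using it.
# ASSUMPTION: 2014-09-17 is treated as the earliest available snapshot.
first_snapshot <- as.Date("2014-09-17")

check_snapshot_date <- function(x, today = Sys.Date()) {
  # Require exactly one string shaped like yyyy-mm-dd
  if (!is.character(x) || length(x) != 1L ||
      !grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)) {
    stop("snapshot date must be a single string in yyyy-mm-dd form")
  }
  d <- as.Date(x, format = "%Y-%m-%d")      # NA for impossible dates such as Feb 30
  if (is.na(d)) stop("not a valid calendar date: ", x)
  if (d < first_snapshot) stop("no snapshot before ", format(first_snapshot))
  if (d > today) stop("snapshot date is in the future: ", x)
  x                                          # return the string unchanged
}

snapshot_date <- check_snapshot_date("2014-11-14")
snapshot_date
```

The validated string could then be passed on as checkpoint(snapshot_date) at the top of each shared script.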

Using checkpoint

Easy to use: add 2 lines to the top of each script

library(checkpoint)

checkpoint("2014-09-17")

For the script author:

– Use package versions available on the chosen date

– Installs packages local to this project

• Allows different package versions to be used simultaneously

For a script collaborator:

– Automatically installs required packages

• Detects required packages (no need to manually install!)

– Uses same package versions as script author to ensure reproducibility
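To make "detects required packages" concrete, here is a simplified sketch of how a project's scripts might be scanned for library() and require() calls. It is not checkpoint's actual implementation, which the slides do not describe. The function name find_used_packages is hypothetical, and the regular-expression approach is an assumption chosen for brevity.

```r
# Sketch only: list packages named in library()/require() calls under a project folder.
# ASSUMPTION: a regex scan is enough for illustration; it misses pkg::fun() usage
# and would be fooled by a '#' character inside a string.
find_used_packages <- function(project_dir = ".") {
  files <- list.files(project_dir, pattern = "\\.[rR]$",
                      recursive = TRUE, full.names = TRUE)
  pattern <- "\\b(library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)"
  pkgs <- character(0)
  for (f in files) {
    lines <- readLines(f, warn = FALSE)
    lines <- sub("#.*$", "", lines)                        # drop comments (naive)
    calls <- unlist(regmatches(lines, gregexpr(pattern, lines, perl = TRUE)))
    pkgs <- c(pkgs, sub(pattern, "\\2", calls, perl = TRUE))  # keep the package name only
  }
  sort(unique(pkgs))
}

# Small demonstration on a throwaway project folder
demo_dir <- tempfile("project_")
dir.create(demo_dir)
writeLines(c('library(checkpoint)',
             'checkpoint("2014-11-14")',
             'library(h2o)  # modeling',
             '# library(notused)',
             'require("miniCRAN")'),
           file.path(demo_dir, "analysis.R"))
find_used_packages(demo_dir)
```

A scan like this explains why a collaborator does not need to install packages by hand: the package names are already in the scripts.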

# Create a local checkpoint library

library(checkpoint)

checkpoint("2014-11-14")

> library(checkpoint)

checkpoint: Part of the Reproducible R Toolkit from Revolution Analytics

http://projects.revolutionanalytics.com/rrt/

Warning message:

package ‘checkpoint’ was built under R version 3.1.2

> checkpoint("2014-11-14")

Scanning for loaded pkgs

Scanning for packages used in this project

Installing packages used in this project

Warning: dependencies ‘stats’, ‘tools’, ‘utils’, ‘methods’, ‘graphics’, ‘splines’, ‘grid’, ‘grDevices’ are not available

also installing the dependencies ‘bitops’, ‘stringr’, ‘digest’, ‘jsonlite’, ‘lattice’, ‘RCurl’, ‘rjson’, ‘statmod’,

‘survival’, ‘XML’, ‘httr’, ‘Matrix’

package ‘bitops’ successfully unpacked and MD5 sums checked

package ‘stringr’ successfully unpacked and MD5 sums checked

package ‘digest’ successfully unpacked and MD5 sums checked

package ‘jsonlite’ successfully unpacked and MD5 sums checked

package ‘lattice’ successfully unpacked and MD5 sums checked

package ‘RCurl’ successfully unpacked and MD5 sums checked

package ‘rjson’ successfully unpacked and MD5 sums checked

package ‘statmod’ successfully unpacked and MD5 sums checked

package ‘survival’ successfully unpacked and MD5 sums checked

package ‘XML’ successfully unpacked and MD5 sums checked

package ‘httr’ successfully unpacked and MD5 sums checked

package ‘Matrix’ successfully unpacked and MD5 sums checked

package ‘h2o’ successfully unpacked and MD5 sums checked

package ‘miniCRAN’ successfully unpacked and MD5 sums checked

package ‘igraph’ successfully unpacked and MD5 sums checked
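The output above shows packages being installed for this project rather than into the shared R library. The idea of a project-local library can be illustrated with base R's .libPaths(). This is a sketch under stated assumptions: the function use_project_library and the directory layout <project>/.rlib/<date> are hypothetical and are not the layout checkpoint itself uses.

```r
# Sketch only: put a project-specific folder at the front of R's library search path.
# ASSUMPTION: the <project>/.rlib/<date> layout is invented for this example.
use_project_library <- function(project_dir, snapshot_date) {
  lib <- file.path(project_dir, ".rlib", snapshot_date)
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(lib, .libPaths()))   # new folder first, existing libraries after it
  invisible(normalizePath(lib))
}

project <- file.path(tempdir(), "demo_project")
lib <- use_project_library(project, "2014-11-14")
.libPaths()[1]   # install.packages() now installs into this folder by default
```

Because each project and date gets its own folder, two projects can hold different versions of the same package without interfering with each other.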

MRAN: The Managed R Archive Network

Download RRO

Learn about R and RRO

Daily CRAN snapshots

Explore Packages

– and dependencies

Explore Task Views

Thank You

Joseph Rickert

blog.revolutionanalytics.com

[email protected], @revojoe