A Step Towards Reproducibility in R
Joseph Rickert presentation to H2O World, November 2014.
2
R’s popularity is growing rapidly
IEEE Spectrum Top Programming Languages
#15: R
• IEEE Spectrum, July 2014
• RedMonk Programming Language Rankings, 2013
3
R is used more than other data science tools
• O’Reilly Strata 2013 Data Science Salary Survey
• KDNuggets Poll: Top Languages for analytics, data mining, data science
4
R is among the highest-paid IT skills in the US
• Dice Tech Salary Survey, January 2014
• O’Reilly Strata 2013 Data Science Salary Survey
6
“The great beauty of R is that you can modify it to do all sorts of things.” - Hal Varian, Chief Economist, Google
“R is really important to the point that it's hard to overvalue it.” - Daryl Pregibon, Head of Statistics, Google
• Advertising Effectiveness
• Economic forecasting
• Exploratory Data Analysis
• Experimental Analysis
“Generally, we use R to move fast when we get a new data set. With R, we don’t need to develop custom tools or write a bunch of code. Instead, we can just go about cleaning and exploring the data.” - Solomon Messing, data scientist at Facebook
8
• Data Visualization
• Semantic clustering
“A common pattern for me is that I'll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end.” - Ed Chen, Data Scientist, Twitter
9
Insurance
• Marketing Analytics
• Risk Analysis
• Catastrophe Modeling
10
Finance and Banking
• Credit Risk Analysis
• Financial Networks
11
John Deere
Statistical Analysis:
• Short Term Demand Forecasting
• Crop Forecasting
• Long Term Demand Forecasting
• Maintenance and Reliability
• Production Scheduling
• Data Coordination
12
Monsanto
Statistical Analysis:
• Plant Breeding
• Fertility mapping
• Precision Seeding
• Disease Management
• Yield forecasting
13
Public Affairs
• Casualty estimation in war zones
• Political Analysis
14
Pharmaceuticals
“R use at the FDA is completely acceptable and has not caused any problems.” - Dr Jae Brodsky, Office of Biostatistics, Food and Drug Administration
Regulatory Drug Approvals
• Reproducible research
• Accurate, reliable and consistent statistical analysis
• Internal reporting (Section 508 compliance)
15
Weather and Climate
• Flood Warnings
• Climate change forecasts
16
Revolution Analytics
Open Source development
– Revolution R Open, RHadoop, ParallelR, DeployR Open, Reproducible R Toolkit
– Project funding
Community Support
– User Group Sponsorship
– Meetups
– Events sponsorship
– Revolutions Blog
Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method … (Wikipedia)
Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them. (Roger Peng)
Reproducibility – why do we care?
Academic / Research
Verify results
Advance Research
Business
Production code
Reliability
Reusability
Collaboration
Regulation
18
www.nytimes.com/2011/07/08/health/research/08genes.html
http://arxiv.org/pdf/1010.1092.pdf
20
Revolution Analytics’ Reproducibility Environment
A distribution of R (RRO) that points to a static CRAN mirror
The Checkpoint Server: the static CRAN mirror
– CRAN packages fixed with each Revolution R Open update (currently 10/1/14)
Daily CRAN snapshots
– Storing every package version since September 2014
– Binaries and sources
– At mran.revolutionanalytics.com/snapshot
CRAN package checkpoint
[Diagram: CRAN; CRAN mirror (http://cran.revolutionanalytics.com/); checkpoint server taking daily snapshots at midnight UTC (http://mran.revolutionanalytics.com/snapshot/); checkpoint package]
library(checkpoint)
checkpoint("2014-09-17")
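The snapshot idea above can also be sketched without the checkpoint package, by pointing an R session's repository option at one dated snapshot. This is a minimal sketch under assumptions: `snapshot_url` is a hypothetical helper, not part of any package, and the layout `<base>/<yyyy-mm-dd>` is assumed rather than stated on the slide.

```r
# Hypothetical helper (not part of checkpoint): build the URL of a dated snapshot.
# Assumption: snapshots are served at <base>/<yyyy-mm-dd>.
snapshot_url <- function(date,
                         base = "http://mran.revolutionanalytics.com/snapshot") {
  parsed <- as.Date(date, format = "%Y-%m-%d")
  if (is.na(parsed)) {
    stop("date must be in yyyy-mm-dd form: ", date)
  }
  paste0(base, "/", format(parsed, "%Y-%m-%d"))
}

repo <- snapshot_url("2014-09-17")

# Use the snapshot as the CRAN repository for this session only;
# later install.packages() calls would resolve packages against it.
options(repos = c(CRAN = repo))
```

Setting the option changes nothing on disk; it only affects where later installs in the same session look for packages.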
21
Using Revolution Analytics’ Reproducibility Tools
Scenario 1: Set up a consistent, company-wide R environment
– Have users download RRO
– All users will get the base and recommended packages as of 10/1/14
– For each project, R users run checkpoint to download a consistent set of packages that are appropriate for that project
Scenario 2: With or without RRO, share scripts synced to a snapshot
– Have the user with whom you are sharing put your scripts in a separate project and download the checkpoint package
– Have the user run checkpoint("yyyy-mm-dd") with a date appropriate for your project
– checkpoint will automatically download the correct versions of the packages used in the scripts
22
Using checkpoint
Easy to use: add 2 lines to the top of each script
library(checkpoint)
checkpoint("2014-09-17")
For the package author:
– Uses package versions available on the chosen date
– Installs packages local to this project
• Allows different package versions to be used simultaneously
For a script collaborator:
– Automatically installs required packages
• Detects required packages (no need to manually install!)
– Uses same package versions as script author to ensure reproducibility
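The "detects required packages" step can be pictured with a simplified sketch. This is an illustration only, not checkpoint's actual implementation: `find_packages` is a hypothetical function that looks for the first `library()` or `require()` call on each line of a script.

```r
# Simplified illustration only; NOT how the checkpoint package is implemented.
# Hypothetical helper: list package names from library()/require() calls.
find_packages <- function(script_lines) {
  pattern <- "(library|require)\\([[:space:]]*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?"
  hits <- regmatches(script_lines, regexec(pattern, script_lines))
  # Each hit is c(full match, function name, package name), or empty if no match.
  pkgs <- vapply(
    hits,
    function(m) if (length(m) == 3) m[3] else NA_character_,
    character(1)
  )
  sort(unique(pkgs[!is.na(pkgs)]))
}

script <- c(
  "library(checkpoint)",
  "checkpoint(\"2014-09-17\")",
  "library(h2o)",
  "require(\"igraph\")",
  "x <- 1 + 1"
)

find_packages(script)
# "checkpoint" "h2o" "igraph"
```

A real scanner has to cope with cases this sketch ignores, such as `pkg::fun` references and several calls on one line.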
23
# Create a local checkpoint library
library(checkpoint)
checkpoint("2014-11-14")
> library(checkpoint)
checkpoint: Part of the Reproducible R Toolkit from Revolution Analytics
http://projects.revolutionanalytics.com/rrt/
Warning message:
package ‘checkpoint’ was built under R version 3.1.2
> checkpoint("2014-11-14")
Scanning for loaded pkgs
Scanning for packages used in this project
Installing packages used in this project
Warning: dependencies ‘stats’, ‘tools’, ‘utils’, ‘methods’, ‘graphics’, ‘splines’, ‘grid’, ‘grDevices’ are not available
also installing the dependencies ‘bitops’, ‘stringr’, ‘digest’, ‘jsonlite’, ‘lattice’, ‘RCurl’, ‘rjson’, ‘statmod’,
‘survival’, ‘XML’, ‘httr’, ‘Matrix’
package ‘bitops’ successfully unpacked and MD5 sums checked
package ‘stringr’ successfully unpacked and MD5 sums checked
package ‘digest’ successfully unpacked and MD5 sums checked
package ‘jsonlite’ successfully unpacked and MD5 sums checked
package ‘lattice’ successfully unpacked and MD5 sums checked
package ‘RCurl’ successfully unpacked and MD5 sums checked
package ‘rjson’ successfully unpacked and MD5 sums checked
package ‘statmod’ successfully unpacked and MD5 sums checked
package ‘survival’ successfully unpacked and MD5 sums checked
package ‘XML’ successfully unpacked and MD5 sums checked
package ‘httr’ successfully unpacked and MD5 sums checked
package ‘Matrix’ successfully unpacked and MD5 sums checked
package ‘h2o’ successfully unpacked and MD5 sums checked
package ‘miniCRAN’ successfully unpacked and MD5 sums checked
package ‘igraph’ successfully unpacked and MD5 sums checked
24
MRAN: The Managed R Archive Network
Download RRO
Learn about R and RRO
Daily CRAN snapshots
Explore Packages
– and dependencies
Explore Task Views