speeding up r with rcpp - biostatistics - departmentsjbai/compclub/rcpp_presentation.pdf ·...

29
Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University March 1, 2014

Upload: ledang

Post on 13-Apr-2018

234 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Speeding up R with Rcpp

Stephen CristianoDepartment of BiostatisticsJohns Hopkins University

March 1, 2014

Page 2: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

What is Rcpp?

I Rcpp: Seamless integration between R and C++.

I Extremely simple to connect C++ with R.

I Maintained by Dirk Eddelbuetter and Romain Francois

Page 3: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Simple examples

library('Rcpp')

cppFunction('int square(int x) { return x*x; }')square(7L)

## [1] 49

cppFunction('

int add(int x, int y, int z) {int sum = x + y + z;

return sum;

}')

add(1, 2, 3)

## [1] 6

Page 4: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Everything revolves around .Call

C++ Level:

SEXP foo(SEXP a, SEXP b, SEXP C, ...);

R Level:

res <- .Call("foo", a, b, C, ..., package="mypkg")

## Error: object ’a’ not found

Page 5: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Why C++?

I One of the most frequently used programming languages.Easy to find help.

I Speed.

I Good chance what you want is already implemented in C++.

I From wikipedia: ’C++ is a statically typed, free-form,multi-paradigm, compiled, general-purpose, powerfulprogramming language.’

Page 6: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Why not C++?

I More difficult to debug.

I more difficult to modify.

I The population of potentials users who understand both Rand C++ is smaller.

Page 7: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Why Rcpp

I Easy to use (honest).

I Clean and approachable API that enable for high performancecode.

I R style vectorized code at C++ level.

I Programmer time vs computer time: much more efficient codethat does not take much longer to write.

I Enables access to advanced data structures and algorithmsimplented in C++ but not provided by R.

I Handles garbage collection and the Rcpp programmer shouldnever have to worry about memory allocation anddeallocation.

Page 8: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

C++ in 2 minutes

cppFunction('

double sumC(NumericVector x) {int n = x.size();

double total = 0;

for(int i = 0; i < n; ++i) {total += x[i];

if(total > 100)

break;

}return total;

}')

Page 9: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

I Need to initialize your variables with data type.

I for loops of structure for(initialization; condition; increment).

I conditionals are the same as R.

I End every statement with a semicolon.

I Vectors and arrays are 0-indexed.

I size() is a member function on the vector class - x.size()returns the size of x.

I While C++ can be a very complex language, just knowingthese will enable you to write faster R functions.

Page 10: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Typical bottlenecks in R

I Loops that depend on previous iterations, eg MCMC methods.

I Function calls in R slow, but very little overhead in C++.Recursive functions are very inefficient in R.

I Not having access to advanced data structures algorithms inR but available in C++.

Page 11: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

When to use Rcpp

I Before writing C++ code, you should first ask if it’s necessary.

I Take advantage of vectorization when possible.

I Most base R functions already call C functions. Make surethere isn’t already an efficient implementation of what you aretrying to do.

Page 12: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Data Structures

I All R objects are internally represented by a SEXP: a pointerto an S expression object.

I Any R object can be passed down to C++ code: vectors,matrices lists. Even functions and environments.

I A large number of user-visible classes for R objects, whichcontain pointers the the SEXP object.

I IntegerVectorI NumericVectorI LogicalVectorI CharacterVectorI NumericMatrixI and many more

Page 13: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

library(inline, quietly=TRUE)

##

## Attaching package: ’inline’

##

## The following object is masked from ’package:Rcpp’:

##

## registerPlugin

src <- '

Rcpp::NumericVector invec(vx);

Rcpp::NumericVector outvec(vx);

for(int i=0; i<invec.size(); i++) {outvec[i] = log(invec[i]);

}return outvec;

'

fun <- cxxfunction(signature(vx="numeric"), src, plugin="Rcpp")

x <- seq(1.0, 3.0, by=1)

cbind(x, fun(x))

## x

## [1,] 0.0000 0.0000

## [2,] 0.6931 0.6931

## [3,] 1.0986 1.0986

Note: outvec and invec point to the same underlying R object.

Page 14: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Use clone to not modify original vector.

src <- '

Rcpp::NumericVector invec(vx);

Rcpp::NumericVector outvec = Rcpp::clone(vx);

for(int i=0; i<invec.size(); i++) {outvec[i] = log(invec[i]);

}return outvec;

'

fun <- cxxfunction(signature(vx="numeric"), src, plugin="Rcpp")

x <- seq(1.0, 3.0, by=1)

cbind(x, fun(x))

## x

## [1,] 1 0.0000

## [2,] 2 0.6931

## [3,] 3 1.0986

Page 15: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Creating R packages

Inspection of R source code for any R package will reveal thedirectories::

I R: for R functions

I vignettes: LATEXpapers weaving R code and indicating theintended workflow of an analysis.

I man: documentation for exported R functions.

I src: compiled code

The file DESCRIPTION provides a brief description of the project, aversion number, and any packages for which your package depends.

Page 16: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Creating R packages

I All compiled code coes in package/src directory.

I Code in src/ will be automatically compiled and sharedlibraries created when building the package.

I Instantiate an Rcpp package: Rcpp.package.skeleton

Page 17: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Anatomy of a .Call function

Example: discrete approximation to a joint posterior distribution.A nested for loop is unavoidable..h file (src/grid.h):

#ifndef _RCPP_GBBS_H

#define _RCPP_GBBS_H

#include <Rcpp.h>

RcppExport SEXP rcpp_discrete(SEXP grid, SEXP meangrid,

SEXP precgrid, SEXP dtheta, SEXP dprec, SEXP y);

#endif

Page 18: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Anatomy of a .Call function

.cpp file (src/rcpp discrete.h):

#include "grid.h"

#include "Rmath.h"

RcppExport SEXP rcpp_discrete(SEXP grid, SEXP meangrid, SEXP precgrid,

SEXP dtheta, SEXP dprec, SEXP y) {

//Rcpp::RNGScope scope;

using namespace Rcpp;

RNGScope scope;

// convert SEXP inputs to C++ vectors

NumericMatrix xgrid(grid);

NumericVector xmeangrid(meangrid);

NumericVector xprecgrid(precgrid);

NumericVector xdtheta(dtheta);

NumericVector xdprec(dprec);

NumericVector xy(y);

Page 19: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

int G = xmeangrid.size();

int H = xprecgrid.size();

int N = xy.size();

NumericVector tmp(N);

double lik; // likelihood

//dprec = dgamma(xprecgrid, xnu0[0]/2, xnu0[0]/2*xsigma20[0]);

//dtheta = dnorm(xmeangrid, xmu0, sqrt(xtau20[0]));

for(int g = 0; g < G; g++) {

for(int h = 0; h < H; h++) {

tmp = dnorm(xy, xmeangrid[g], 1/(sqrt(xprecgrid[h])));

lik = 1;

for(int n = 0; n < N; n++) {

lik *= tmp[n];

}

xgrid(g, h) = xdtheta[g] * xdprec[h] * lik;

}

}

return xgrid;

}

Page 20: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

In R:

y <- c(1.64, 1.7, 1.72, 1.74, 1.82, 1.82, 1.82, 1.9, 2.08)

mu0 <- 1.9

tau20 <- 0.95^2

nu0 <- 1

sigma20 <- 1e-02

set.seed(1)

G <- H <- 500

mean.grid <- seq(1.5, 2, length = G)

prec.grid <- seq(1.75, 175, length = H)

post.grid <- matrix(NA, G, H)

dtheta <- dnorm(mean.grid, mu0, sqrt(tau20))

dprec <- dgamma(prec.grid, nu0/2, nu0/2 * sigma20)

for (g in seq_len(G)) {for (h in seq_len(H)) {

post.grid[g, h] <- dtheta[g] *

dprec[h] * prod(dnorm(y,mean.grid[g], 1/sqrt(prec.grid[h])))

}}

Page 21: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

System time for R

system.time(for (g in seq_len(G)) {for (h in seq_len(G)) {

post.grid[g, h] <- dtheta[g] * dprec[h] * prod(dnorm(y,

mean.grid[g], 1/sqrt(prec.grid[h])))

}})

## user system elapsed

## 2.028 0.006 2.033

Page 22: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

System time for .Call

library(CNPBayes, quietly=TRUE, verbose=FALSE)

## Loading required package: grid

## Loading required package: lattice

## Loading required package: survival

## Loading required package: splines

## Loading required package: Formula

##

## Attaching package: ’Hmisc’

##

## The following objects are masked from ’package:base’:

##

## format.pval, round.POSIXt, trunc.POSIXt, units

post.grid2 <- matrix(0, G, H)

system.time(grid2 <- .Call("rcpp_discrete", post.grid2,

mean.grid, prec.grid, dtheta, dprec, y))

## user system elapsed

## 0.054 0.000 0.055

This is a speed improvement by a factor of 35x.

Page 23: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Rcpp Sugar

I Rcpp sugar brings a higher level of abstraction to C++ codewritten in Rcpp.

I Avoid C++ loops with code that strongly resembles R.

I Takes advantage of operator overloading.

I Despite the similar syntax, peformance is much faster in C++,though not quite as fast as manually optimized C++ code.

Page 24: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Example

pdistR <- function(x, ys) {(x - ys)^2

}

cppFunction('NumericVector pdistC2(double x, NumericVector ys) {return pow((x-ys), 2);

}')

pdistR(5.0, c(4.1,-9.3,0, 13.7))

## [1] 0.81 204.49 25.00 75.69

pdistC2(5.0, c(4.1,-9.3,0, 13.7))

## [1] 0.81 204.49 25.00 75.69

Page 25: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Logical Operators

// two integer vectors of the same size

NumericVector x;

NumericVector y;

// expressions involving two vectors

LogicalVector res = x < y;

LogicalVector res = x != y;

// one vector, one single value

LogicalVector res = x < 2;

// two expressions

LogicalVector res = (x + y) == (x*x);

// functions producing single boolean result

all(x*x < 3);

any(x*x < 3);

Page 26: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Logical OperatorsThere are many functions similar to what exists inside R

is_na(x);

seq_along(x);

sapply( seq_len(10), square<int>() );

ifelse( x < y, x, (x+y)*y );

pmin( x, x*x);

diff( xx );

intersect( xx, yy); //returns interserct of two vectors

unique( xx ); // subset of unique values in input vector

// math functions

abs(x); exp(x); log(x); ceil(x);

sqrt(x); sin(x); gamma(x);

range(x);

mean(x); sd(x); var(x);

which_min(x); which_max(x);

// A bunch more

Page 27: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Density and random number generation functions

Rcpp has access to the same density, distribution, and RNGfunctions used by R itself. For example, you can draw from agamma distribution with scale and shape parameters equal to 1with:

cppFunction('NumericVector getRGamma() {RNGScope scope;

NumericVector x = rgamma( 10, 1, 1 );

return x;

}')

getRGamma()

## [1] 0.08369 0.83618 1.22254 1.15836 0.99002 0.30737 0.09462 0.15720

## [9] 0.31081 0.46873

Page 28: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

RcppArmadillo

I Armadillo is a high level and easy to use C++ linear algebralibrary with syntax similar to Matlab.

I RcppArmadillo is an Rcpp interface allowing access to theArmadillo library.

Page 29: Speeding up R with Rcpp - Biostatistics - Departmentsjbai/compclub/rcpp_presentation.pdf · Speeding up R with Rcpp Stephen Cristiano Department of Biostatistics Johns Hopkins University

Resources

I ’Seamless R and C++ integration with Rcpp’ by DirkEddelbuettel. Excellent book for learning Rcpp. Available forfree through JHU library.

I Hadley Wickham’s Rcpp tutorial:http://adv-r.had.co.nz/Rcpp.html

I A huge number of examples at http://gallery.rcpp.org

I Stack exchange.

I Rumen