R on BioHPC: RStudio, Parallel R and BioconductoR
Updated for 2017-10-18
Today we’ll be looking at…
2
Why R?
• The dominant statistics environment in academia
• Large number of packages to do a lot of different analyses
• Excellent uptake in Bioinformatics – specialist packages
• (Relatively) easy to accomplish complex stats work
• Very active development right now: R Foundation, R Consortium, Revolution Analytics, RStudio, Microsoft…
Why not R?
• Quirky language – painful for e.g. Python programmers
• Generally thought to be quite slow – except for optimized linear algebra
• Complex ‘old-fashioned’ documentation
• Parallelization packages can be complex / outdated
… but it’s getting better quickly….
Exciting Recent Developments in R
RStudio – An IDE for R, on the web
http://rstudio.biohpc.swmed.edu
BioHPC optimized R, access to cluster storage, persistent sessions
When to use RStudio
• Development work with small datasets
• Creating R Markdown documents
• Working with Shiny for dataset visualizations
• Any small, short-running data analysis tasks
Large datasets, very long running jobs, parallel code?
Must use R on the cluster…
Using R on the cluster / clients
Default is R/3.3.2-gccmkl – also used by rstudio.biohpc.swmed.edu
R/3.2.1-intel (older) or R/3.4.1-gccmkl also recommended
Use ‘R’ for command line R, or run scripts with ‘Rscript’
RStudio in a GUI Session
Start a webGUI Session
$ module load R/3.3.2-gccmkl
$ module load rstudio-desktop
$ rstudio
Standard 20 hr limit
Whole node to yourself
You can choose which version of R
Start & connect to dedicated Python, R, and DIGITS environments
Directly from the BioHPC Portal
Portal DIGITS, RStudio & Jupyter – Coming 2018
Installing Packages
We have a set of common packages pre-installed in the R module.
You can install your own into your home directory (~/R)
install.packages(c("microbenchmark", "data.table"))
Some packages need additional libraries and won’t compile successfully.
- Ask us to install them for you ([email protected])
- gccmkl R is more compatible than intel R
You need to install at least one package manually before you can use install.packages via Rscript
This is for packages from CRAN – BioconductoR packages install differently. See later!
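The Rscript note above can be handled inside a script. This is a minimal sketch, assuming the failure comes from a personal library directory that does not exist yet and from no CRAN mirror being selected (both are assumptions, not confirmed by the slides). `install_missing` is a hypothetical helper and the mirror URL is only an example.

```r
# Hypothetical helper: install only the packages that are not yet available.
# Assumes the personal library is the path in R_LIBS_USER (normally under ~/R).
install_missing <- function(pkgs,
                            lib = Sys.getenv("R_LIBS_USER"),
                            repos = "https://cloud.r-project.org") {
  lib <- path.expand(lib)
  # Rscript cannot prompt to create the personal library, so create it here.
  if (!dir.exists(lib)) dir.create(lib, recursive = TRUE)
  .libPaths(c(lib, .libPaths()))

  # Keep only the packages whose namespace cannot be loaded
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  needed <- pkgs[!available]
  if (length(needed) > 0) {
    # An explicit mirror is needed when R is not interactive.
    install.packages(needed, lib = lib, repos = repos)
  }
  invisible(needed)
}

# Example call (downloads from CRAN, so it is left commented out):
# install_missing(c("microbenchmark", "data.table"))
```

The helper returns the names it had to install, so a job script can log what changed.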
Our R is faster than standard downloads
Compiled using Intel compiler and Intel Math Kernel Library
Task                     Standard R   BioHPC R   Speedup
Matrix Multiplication    139.15       1.80       77x
Cholesky Decomposition   19.53        0.32       61x
SVD                      45.66        1.95       23x
PCA                      201.30       6.25       32x
LDA                      135.37       17.60      7x
This is on a cluster node – speedup is less on clients with fewer CPU cores
For your own Mac or PC see http://www.revolutionanalytics.com/revolution-r-open
mkl_test.R
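To get a feel for numbers like those in the table, a single task can be timed with `system.time`. This is a minimal sketch, not the `mkl_test.R` benchmark itself; the matrix size `n` is an arbitrary choice and results will differ by machine and R build.

```r
# Time a dense matrix multiplication, the first task in the table above.
# n = 500 is an arbitrary, small size chosen so the sketch runs quickly.
set.seed(1)
n <- 500
a <- matrix(rnorm(n * n), nrow = n)
b <- matrix(rnorm(n * n), nrow = n)

timing <- system.time(prod_ab <- a %*% b)
print(timing["elapsed"])

# Other tasks from the table can be timed the same way, for example:
# system.time(chol(crossprod(a)))   # Cholesky decomposition
# system.time(svd(a))               # SVD
```

Running the same script under a standard R download and under the BioHPC module gives a like-for-like comparison.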
Benchmarking functions in R (and compiling them)
Compiling a function that is called often can increase speed
The microbenchmark package allows you to benchmark functions
library(compiler)
f <- function(n, x) for (i in 1:n) x = (1 + sin(x))^(cos(x))
g <- cmpfun(f)
library(microbenchmark)
compare <- microbenchmark(f(1000, 1), g(1000, 1), times = 1000)
library(ggplot2)
autoplot(compare)
functions.R
For speed – always vectorize!
54x speedup!
Function compilation improved the median time somewhat (< 2x)
Using vector form was much faster
distnorm <- function(){
  x <- seq(-5, 5, 0.01)
  y <- rep(NA,length(x))
  for(i in 1:length(x)) {
    y[i] <- stdnorm(x[i])
  }
  return(list(x=x,y=y))
}
vdistnorm <- function(){
  x <- seq(-5, 5, 0.01)
  y <- stdnorm(x)
  return(list(x=x, y=y))
}
functions.R
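The two functions above call `stdnorm`, which is not shown on the slide. The sketch below assumes it is the standard normal density (an assumption) and checks that the loop form and the vector form produce the same values, so that only the speed differs.

```r
# stdnorm is not shown on the slide; this definition is an assumption:
# the standard normal density, written with vectorized arithmetic.
stdnorm <- function(x) exp(-x^2 / 2) / sqrt(2 * pi)

x <- seq(-5, 5, 0.01)

# Loop form: one call per element
y_loop <- rep(NA_real_, length(x))
for (i in seq_along(x)) y_loop[i] <- stdnorm(x[i])

# Vector form: one call for the whole vector
y_vec <- stdnorm(x)

# Both forms give the same values
stopifnot(isTRUE(all.equal(y_loop, y_vec)))
```

The vector form works because `exp`, `^` and `/` all operate element-wise on a whole vector.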
Our Example Application
# Define a function that performs a random walk with a
# specified bias that decays
rw2d <- function(n, mu, sigma){
  steps=matrix(, nrow=n, ncol=2)
  for (i in 1:n){
    steps[i,1] <- rnorm(1, mean=mu, sd=sigma )
    steps[i,2] <- rnorm(1, mean=mu, sd=sigma )
    mu <- mu/2
  }
  return( apply(steps, 2, cumsum) )
}
mc_parallel.R
A bigger task…
library(foreach) # provides foreach and %do%
# Generate random walks of lengths between 1000 and 5000
# foreach loop
system.time(
  results <- foreach(l=1000:5000) %do% rw2d(l, 3, 1)
)
# user system elapsed
# 85.872 0.145 86.242
# Apply
system.time(
  results <- lapply( 1000:5000, rw2d, 3, 1)
)
# user system elapsed
# 81.175 0.114 81.511
mc_parallel.R
Start a cluster (of R slave workers on a single machine)
Single node, multiple cores running multiple R slaves
# Parallel Single node
library(parallel)
library(doParallel)
# Create a cluster of workers using all cores
cl <- makeCluster( detectCores() )
# Tell foreach with %dopar% to use this cluster
registerDoParallel(cl)
…
stopCluster(cl)
mc_parallel.R
Explicit Parallelization in R
Our optimized R automatically parallelizes linear algebra on a single machine – enough in a lot of cases!
Always prefer using vector/matrix form over for loops and apply functions to get the most out of these optimizations.
If you need more options you can control the parallelization:
library(parallel)   # Single-node and cluster parallelization,
                    # apply functions and explicit execution
library(doParallel) # Simple parallel foreach loops
Can run parallel code on a single node (multicore) or across nodes (MPI)
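The advice above to prefer vector form also applies to the random walk example. The sketch below is a hypothetical vectorized variant, `rw2d_vec` (not part of the original scripts), which draws all steps in one `rnorm` call. It samples from the same distributions as the loop version, but it consumes the random numbers in a different order, so the same seed will not reproduce the same walk.

```r
# rw2d_vec is a hypothetical vectorized variant of rw2d from the earlier slide.
# It draws all steps with one rnorm() call instead of looping over i.
rw2d_vec <- function(n, mu, sigma) {
  # Bias for step i is mu / 2^(i - 1), matching the repeated mu <- mu/2 in the loop
  mus <- mu / 2^(0:(n - 1))
  # matrix() fills column by column, so each column receives the full bias sequence
  steps <- matrix(rnorm(2 * n, mean = rep(mus, times = 2), sd = sigma),
                  nrow = n, ncol = 2)
  apply(steps, 2, cumsum)
}

# With sigma = 0 every step equals its bias, which makes the result easy to check:
# steps of 4, 2, 1 accumulate to 4, 6, 7 in both columns.
walk <- rw2d_vec(3, mu = 4, sigma = 0)
stopifnot(all(walk[, 1] == c(4, 6, 7)), all(walk[, 2] == c(4, 6, 7)))
```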
R parallel vs MKL conflict
Intel MKL tries to use all cores for every linear algebra operation
R is running multiple iterations of a loop in parallel using all cores
If used together too many threads/processes are launched – far more than cores!
export OMP_NUM_THREADS=1 # on terminal before running R
Sys.setenv(OMP_NUM_THREADS="1") # within R
~ 5% improvement by disabling MKL multi-threading
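When the variable is set from within R, the order of the calls matters: worker processes copy their environment when they are started. The sketch below sets the variable first and then asks each worker what it sees. It assumes local workers created by `makeCluster` from the `parallel` package; two workers are used only to keep the example small.

```r
library(parallel)

# Limit MKL/OpenMP to one thread per process *before* starting the workers,
# so that each worker process inherits the setting.
Sys.setenv(OMP_NUM_THREADS = "1")

# Two workers keep the sketch small; the slides use detectCores().
cl <- makeCluster(2)

# Ask each worker which value it sees
worker_threads <- parSapply(cl, 1:2, function(i) Sys.getenv("OMP_NUM_THREADS"))
print(worker_threads)

stopCluster(cl)
```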
This time in parallel!
# Set before creating the cluster so the worker processes inherit it
Sys.setenv(OMP_NUM_THREADS="1")
cl <- makeCluster( detectCores() )
registerDoParallel(cl)
# Generate random walks of lengths between 1000 and 5000
# Parallel foreach loop
system.time(
  results <- foreach(l=1000:5000) %dopar% rw2d(l, 3, 1)
)
# user system elapsed
# 2.928 0.441 17.374
# 5x speedup over the sequential foreach loop
# Parallel apply
system.time(
  results <- parLapply( cl, 1000:5000, rw2d, 3, 1)
)
# user system elapsed
# 0.339 0.171 8.460
# 9x speedup over the sequential lapply
stopCluster(cl)
mc_parallel.sh
MPI parallelization – for really big jobs
MPI is available on R/3.3.2-gccmkl only – contact us if you need others
Must ‘module add R/3.3.2-gccmkl openmpi/gcc/64/1.6.5-mlnx-ofed’
We will continue to use the simple parallel and doParallel packages
Lots online about ‘snow’ – this is now behind the scenes in new versions of R
Please join us for coffee to discuss MPI projects using R
Work in progress optimizations with your help
MPI parallelization – easy!
cl <- makeCluster( 128, type="MPI" )
Number of MPI tasks
cores per node * nodes (or less if RAM limited)
56 cores per node for 256GBv1/GPUv1 partition
48 cores per node for 256GB partition
32 cores per node for other partitions
mpi_parallel.R
mpi.exit()
Add to bottom of your R code to ensure tidy exit
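The task count rule above (cores per node * nodes, or less if RAM limited) can be written as a small calculation. `mpi_tasks` is a hypothetical helper, and the memory figures in the second call are made-up illustration values, not BioHPC node specifications.

```r
# Hypothetical helper: MPI task count = cores per node * nodes,
# reduced if the memory needed per task would not fit on a node.
mpi_tasks <- function(nodes, cores_per_node,
                      ram_per_node_gb = NULL, ram_per_task_gb = NULL) {
  per_node <- cores_per_node
  if (!is.null(ram_per_node_gb) && !is.null(ram_per_task_gb)) {
    per_node <- min(per_node, floor(ram_per_node_gb / ram_per_task_gb))
  }
  nodes * per_node
}

# 4 nodes with 32 cores each: 128 tasks, as in makeCluster( 128, type="MPI" )
mpi_tasks(nodes = 4, cores_per_node = 32)

# RAM limited (illustrative numbers): 128 GB per node, 8 GB per task
# allows 16 tasks per node, so 64 tasks in total
mpi_tasks(nodes = 4, cores_per_node = 32, ram_per_node_gb = 128, ram_per_task_gb = 8)
```

The result is the first argument to `makeCluster`, and the node count and per-node value should match the `-N` and `--ntasks-per-node` settings of the job script.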
MPI parallelization – submitting the job
#!/bin/bash
#SBATCH --job-name R_MPI_TEST
# Number of nodes required to run this job
#SBATCH -N 4
# Distribute n tasks per node
#SBATCH --ntasks-per-node=32
#SBATCH -t 0-2:0:0
#SBATCH -o job_%j.out
#SBATCH -e job_%j.err
#SBATCH --mail-type ALL
#SBATCH --mail-user [email protected]
module load R/3.3.2-gccmkl
module load openmpi/gcc/64/1.6.5-mlnx-ofed
ulimit -l unlimited
R --vanilla < mpi_parallel.R
# END OF SCRIPT
No mpirun!
mpi_parallel.sh
MPI Performance
# Sequential (with MKL multi-threading)
system.time(
  results <- lapply( 1000:10000, rw2d, 3, 1)
)
# user system elapsed
# 329.173 0.610 330.607
# Parallel apply, 4 nodes, 128 MPI tasks
system.time(
  results <- parLapply( cl, 1000:10000, rw2d, 3, 1)
)
# user system elapsed
# 18.815 0.951 19.848
16x Speedup
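The speedup figure follows directly from the two elapsed times. As a worked example, the sketch below also derives the parallel efficiency (speedup divided by the number of MPI tasks), which the slides do not quote; it is included only as a way to judge how well the 128 tasks are used.

```r
# Elapsed times (seconds) copied from the two system.time() results above
elapsed_sequential <- 330.607
elapsed_mpi        <- 19.848
n_tasks            <- 128

speedup    <- elapsed_sequential / elapsed_mpi
efficiency <- speedup / n_tasks

round(speedup, 1)      # 16.7
round(efficiency, 2)   # 0.13
```

Note that the sequential baseline already used MKL multi-threading, so the comparison is not against a strictly single-core run.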
Rmarkdown / Knitr
Write R code inside markdown documents
Create attractive HTML, PDF, Word output that includes the code and output
BioconductoR
A comprehensive set of Bioinformatics related packages for R
Software and datasets
Bioconductor
Base packages installed, plus some commonly used extras
Install additional packages to home directory:
source("http://bioconductor.org/biocLite.R")
biocLite('limma')
Ask [email protected] for packages that fail to compile
BioconductoR
Bioconductor workflows are fantastic tutorials
http://www.bioconductor.org/help/workflows/
BioconductoR Example
DEMO
RNA-Seq Analysis
Bioconductor, Rmarkdown/Knitr
See bioconductor.Rmd
Dallas R Users Group
http://www.meetup.com/Dallas-R-Users-Group/
University of Dallas, Irving, Saturdays