integrating r with azure for high-throughput analysisintegrating r with azure for high-throughput...

22
Integrating R with Azure for High- throughput analysis Hugh Shanahan Integrating R with Azure for High-throughput analysis Hugh Shanahan Department of Computer Science Royal Holloway, University of London [email protected] @HughShanahan Hugh Shanahan Integrating R with Azure for High-throughput analysis

Upload: others

Post on 28-May-2020

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Integrating R with Azure for High-throughputanalysis

Hugh Shanahan

Department of Computer ScienceRoyal Holloway, University of London

[email protected]@HughShanahan

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 2: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Applicability to other domains

This project started out doing something very specificfor the domain I work in (Computational Biology).I promise that there will be no Biology in this talk !!Realised can be extended to running high-throughputjobs in R.Contrast with MapReduce / R formalisms(HadoopStreaming, Rhipe, Revolution Analytics, ... )- parallelisation happens outside of individual R script.

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 3: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Applicability to other domains

This project started out doing something very specificfor the domain I work in (Computational Biology).I promise that there will be no Biology in this talk !!Realised can be extended to running high-throughputjobs in R.Contrast with MapReduce / R formalisms(HadoopStreaming, Rhipe, Revolution Analytics, ... )- parallelisation happens outside of individual R script.

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 4: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Applicability to other domains

This project started out doing something very specificfor the domain I work in (Computational Biology).I promise that there will be no Biology in this talk !!Realised can be extended to running high-throughputjobs in R.Contrast with MapReduce / R formalisms(HadoopStreaming, Rhipe, Revolution Analytics, ... )- parallelisation happens outside of individual R script.

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 5: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Applicability to other domains

This project started out doing something very specificfor the domain I work in (Computational Biology).I promise that there will be no Biology in this talk !!Realised can be extended to running high-throughputjobs in R.Contrast with MapReduce / R formalisms(HadoopStreaming, Rhipe, Revolution Analytics, ... )- parallelisation happens outside of individual R script.

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 6: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

IaaS clouds

We all now know what clouds are !Infrastructure as a Service (IaaS)Access Virtual Machine via the command lineAmazon, Rackspace, OpenStack ...

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 7: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

IaaS clouds

Platform as a Service (PaaS)Access Virtual Machine programatically.Explicitly allows for batch control, more complicatedworkflows etc.

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 8: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Microsoft Azure and Generic Worker Libraries

Azure offers both IaaS and PaaS.IaaS VM’s can run a variety of different flavours of Linuxand Windows OS’sPaaS (they refer to this as a Cloud Service) only runsWindows Server.Mass Storage (not storage associated with VM).Programatic access is via ASP.NET and C#Access mass storage via a variety of languages.Set of libraries which allow control of jobs running onVM’s.Generic Worker (GW)

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 9: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Scaling up

Needed to scale up a problem based on six data sets tonearly six hundred (100 Mbyte → 1 Tbyte).Calculations based on an R script.Each data set can be analysed one at a time (batchmode).Individual data sets can vary by two orders ofmagnitude.

Storage

.

.

.

.

R Script

on VM

R Script

on VM

Log

Data

Data

Raw

Mass

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 10: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Implementation

Made use of Azure PaaS with GW libraries.Written using a combination of C# and Java.R executables + library uploaded to mass storage.Data to be analysed placed in separate container ofmass storage.R script uploaded at run time.

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 11: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Operation

R executable +App

Container

Data

Mass Storage

Local Windows PC

libraries

.) R script

.) List of Id’s

.) Additonal data

1

2

Container

(pre−loaded)

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 12: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Launching

Id k

App

Container

Data

Mass Storage

Container

VM 1

.

.

.

.

.

.

.

.

VM n

Cloud Worker Roles 1 + 2

1 + 2

Id i

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 13: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Running

Data set k

App

Container

Data

Mass Storage

Container

VM 1

.

.

.

.

.

.

.

.

VM n

Cloud Worker Roles

Data set i

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 14: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Logging it all

Log file k

App

Container

Data

Mass Storage

Container

VM 1

.

.

.

.

.

.

.

.

VM n

Cloud Worker Roles

Log file i

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 15: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

In reality .....this is less than 100 lines of C#

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 16: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Extending to any R script

This can be extended to any case whereyou have data sets to be analysed by an R script,the data is analysed individually.Set of complex financial instrumentsParameter sweeps

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 17: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Extending to any R script

This can be extended to any case whereyou have data sets to be analysed by an R script,the data is analysed individually.Set of complex financial instrumentsParameter sweeps

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 18: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Key issues to fix this SummerGetting set up (configuration files and keys).Adding GUI.https://github.com/hughshanahan/GWydiRhttps://github.com/hughshanahan/RAzureEssentialsWill port over to a more suitable github address forgroup development this Summer.

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 19: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Conclusions

C# and ASP.NET can be a learning curve for Linuxusers.Nonetheless PaaS explicitly allows control of VM’s.Batch mode implementation for a specific problem.Allows analysis on Tbyte-sized data setModified to run any R script in batch mode - much moregeneral.

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 20: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Shameless Plug

M.Sc. in Data Science and AnalyticsM.Sc. in Machine LearningM.Sc. in Computational FinanceAll starting this year at Royal Holloway.Please go tohttp://bit.ly/1418DOSfor further details.

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 21: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Shameless Plug

M.Sc. in Data Science and AnalyticsM.Sc. in Machine LearningM.Sc. in Computational FinanceAll starting this year at Royal Holloway.Please go tohttp://bit.ly/1418DOSfor further details.

Hugh Shanahan Integrating R with Azure for High-throughput analysis

Page 22: Integrating R with Azure for High-throughput analysisIntegrating R with Azure for High-throughput analysis Hugh Shanahan Microsoft Azure and Generic Worker Libraries Azure offers both

Integrating Rwith Azure for

High-throughput

analysis

HughShanahan

Acknowledgments

Andrew (Harry) Harrison

Anne Owen

Funded by Venus-C EU NetworkContact [email protected]

@hughshanahanThank you for your time !

Hugh Shanahan Integrating R with Azure for High-throughput analysis