building applications for interactive data exploration in systems … · for each project. our...

13
Building applications for interactive data exploration in systems biology Bjørn Fjukstad 1 , Vanessa Dumeaux 2 , Karina Standahl Olsen 3 , Michael Hallet 2 , Eiliv Lund 3 , and Lars Ailo Bongo 1 1 Department of Computer Science, UiT The Arctic University of Norway 2 Department of Biology, Concordia University 3 Department of Community Medicine, UiT The Arctic University of Norway Abstract As the systems biology community generates and col- lects data at an unprecedented rate, there is a grow- ing need for interactive data exploration tools to ex- plore the datasets. These tools need to combine ad- vanced statistical analyses, relevant knowledge from biological databases, and interactive visualizations in an application with clear user interfaces. To answer specific research questions tools must provide special- ized user interfaces and visualizations. While these are application-specific, the underlying components of a data analysis tool can be shared and reused later. Application developers can therefore compose appli- cations of reusable services rather than implementing a single monolithic application from the ground up for each project. Our approach for developing data exploration ap- plications in systems biology builds on the microser- vice architecture. Microservice architectures sepa- rates an application into smaller components that communicate using language-agnostic protocols. We show that this design is suitable in bioinformatics ap- plications where applications often use different tools, written in different languages, by different research groups. Packaging each service in a software con- tainer enables re-use and sharing of key components between applications, reducing development, deploy- ment, and maintenance time. We demonstrate the viability of our approach through a web application, MIxT blood-tumor, for exploring and comparing transcriptional profiles from blood and tumor samples in breast cancer pa- tients. The application integrates advanced statis- tical software, up-to-date information from biological databases, and modern data visualization libraries. The web application for exploring transcriptional profiles, MIxT, is online at mixt-blood-tumor. bci.mcgill.ca and open-sourced at github.com/ fjukstad/mixt. Packages to build the supporting microservices are open-sourced as a part of Kvik at github.com/fjukstad/kvik. Introduction In recent years the biological community has gener- ated an unprecedented ammount of data. While the cost of data collection has drastically decreased, data analysis continue to be a larger fraction of the total experiment cost.[1] An important part of data anal- ysis includes the time spent by human experts inter- preting the results. This calls for novel methods for building data analysis tools for data exploration and interpretation. In the field of systems biology, data exploration ap- plications need to link results to relevant prior knowl- edge. In the later years there has been a tremendous effort to curate databases with relevant information on genes and processes. Databases such as the Molec- 1 . CC-BY-NC-ND 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/141630 doi: bioRxiv preprint first posted online May. 24, 2017;

Upload: others

Post on 17-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

Building applications for interactive data exploration in systems

biology

Bjørn Fjukstad1, Vanessa Dumeaux2, Karina Standahl Olsen3, Michael Hallet2,Eiliv Lund3, and Lars Ailo Bongo1

1Department of Computer Science, UiT The Arctic University of Norway2Department of Biology, Concordia University

3Department of Community Medicine, UiT The Arctic University of Norway

Abstract

As the systems biology community generates and col-lects data at an unprecedented rate, there is a grow-ing need for interactive data exploration tools to ex-plore the datasets. These tools need to combine ad-vanced statistical analyses, relevant knowledge frombiological databases, and interactive visualizations inan application with clear user interfaces. To answerspecific research questions tools must provide special-ized user interfaces and visualizations. While theseare application-specific, the underlying componentsof a data analysis tool can be shared and reused later.Application developers can therefore compose appli-cations of reusable services rather than implementinga single monolithic application from the ground upfor each project.

Our approach for developing data exploration ap-plications in systems biology builds on the microser-vice architecture. Microservice architectures sepa-rates an application into smaller components thatcommunicate using language-agnostic protocols. Weshow that this design is suitable in bioinformatics ap-plications where applications often use different tools,written in different languages, by different researchgroups. Packaging each service in a software con-tainer enables re-use and sharing of key componentsbetween applications, reducing development, deploy-ment, and maintenance time.

We demonstrate the viability of our approach

through a web application, MIxT blood-tumor, forexploring and comparing transcriptional profiles fromblood and tumor samples in breast cancer pa-tients. The application integrates advanced statis-tical software, up-to-date information from biologicaldatabases, and modern data visualization libraries.

The web application for exploring transcriptionalprofiles, MIxT, is online at mixt-blood-tumor.

bci.mcgill.ca and open-sourced at github.com/

fjukstad/mixt. Packages to build the supportingmicroservices are open-sourced as a part of Kvik atgithub.com/fjukstad/kvik.

Introduction

In recent years the biological community has gener-ated an unprecedented ammount of data. While thecost of data collection has drastically decreased, dataanalysis continue to be a larger fraction of the totalexperiment cost.[1] An important part of data anal-ysis includes the time spent by human experts inter-preting the results. This calls for novel methods forbuilding data analysis tools for data exploration andinterpretation.

In the field of systems biology, data exploration ap-plications need to link results to relevant prior knowl-edge. In the later years there has been a tremendouseffort to curate databases with relevant informationon genes and processes. Databases such as the Molec-

1

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;

Page 2: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

ular Signatures Database (MSigDB)1 and the KyotoEncyclopedia of Genes and Genomes (KEGG)2 bothprovide interfaces to retrieve data that can be usedto better the understanding of data analysis results.

Several tools for biological data analysis are nowavailable in various programming languages. Theseinclude a wide variety of bioinformatics methodolo-gies and graphical analysis tools. In the R statisti-cal programming language developers share softwarethrough repositories such as CRAN3 or Bioconduc-tor4. In other languages, libraries for biological bom-putation are often availalbe like BioPython[2] andbiogo[3] for Python or Go, respectively. Projectssuch as Galaxy5 and Common Workflow Language(CWL)6 enable resarchers to build and run biologicaldata analysis pipelines consisting of a wide range oftools. Although these framework are tremendouslyhelpful, we need novel approaches to build applica-tions that integrate high-performance bioinformaticstools with specialized user-interfaces and interactivevisualizations.

Different programming languages solve differenttasks. For example, new biological data analysis tech-niques are quickly realeased in R and its packagerepositories; high-performance computer vision tasksare performed in C++ and OpenCV; and portableuser interfaces more easily built in HTML, CSS andJavaScript. Therefore applications that integratenovel statistical analysis tools, interactive visualiza-tions, and biological databases likely need to includeseveral components written in different languages.

A microservice architecture structures an applica-tion into small reusable, loosely coupled parts. Thesecommunicate via lightweight programming language-agnostic protocols such as HTTP, making it possi-ble to write single applications in multiple program-ming languages. This way the most suitable pro-gramming language is used for each specific part. Tobuild a microservice application, developers bundleeach service in a container. Containers are built from

1software.broadinstitute.org/gsea/msigdb2kegg.jp3cran.r-project.org4bioconductor.org5galaxyproject.org6commonwl.org

configuration files which describe the operating sys-tem, software packages and versions of these. Thismakes reproducing the analyses, database lookups,library versions in an application a trivial task. Themost popular implementation of a software containeris Docker7, but others such as Rkt8 exist. Initia-tives such as BioContainers9 now provide contain-ers pre-installed with different bioinformatics tools.While the enabling technology is available, the mi-croservices approach is not yet widely adopted inbioinformatics.[4]

From our experience we have identified a set ofcomponents and features we needed to build data ex-ploration applications.

1. A low-latency language-independent approachfor integrating, or embedding, statistical soft-ware, such as R, directly in a data explorationapplication.

2. Low latency language-independent interface toonline reference databases in biology that userscan query to better understand resulst from sta-tistical analyses.

3. A simple method for deploying and sharing thecomponents of an application between projects.

In this paper, we describe a novel approach forbuilding data exploration applications in systems bi-ology. We show that by building applications as aset of services we can reuse and share its compo-nents between applications. The key services of abiological data exploration application are i) a com-pute service for executing statistical analyses in lan-guages such as R, ii) a database query service forretrieving information from biological databases, andiii) the user-facing visualizations and user-interfaces.In addition, by packaging the services using containertechnology they are easy to deploy, simple to repro-duce, and easy to share between projects. We haveused our approach to build a number of applications,both command-line and web-based. In this paper wedescribe how we used our approach to develop MIxT,

7docker.com8coreos.com/rkt9biocontainers.pro

2

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;

Page 3: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

a web application for exploring and comparing tran-scriptional profiles from blood and tumor samples.

Methods

In this section we first motivate our microservice ap-proach based on our experiences developing the MIxTweb application. We describe the process from initialdata analysis to the final application, highlighting theimportance of language-agnostic services to facilitatethe use of different tools in different parts of the ap-plication. We show that these services are easy todeploy and provide performance to use them to builddata exploration applications. We then generalize theideas to a set of principles and services that can bereused and shared between applications, and showtheir design and implementation.

Motivating Example

The aim of the Matched Interactions Across Tissues(MIxT) study was to identify genes and pathwaysin the primary breast tumor that are tightly linkedto genes and pathways in the patient blood cells.[5]We generated and analyzed expression profiles fromblood and matched tumor cells in 173 breast cancerpatients included in the Norwegian Women and Can-cer (NOWAC) study. The MIxT analysis starts byidentifying sets of genes tightly co-expressed acrossall patients in each tissue. Each group of genes ormodules were annotated based on known a priori bi-ological knowledge about gene functionality. Focuswas placed on the relationships between tissues byasking if specific biologies in one tissue are linkedwith (possibly distinct) biologies in the second tis-sue, and this within different subgroup of patients(i.e. subtypes of breast cancer).

We built an R package10 with the statistical meth-ods and static visualizations for identifying associ-ations between modules across tissues. The explo-ration of the results encompass the examination of∼ 20 modules and their functional enrichments. Thatis ∼ 23× 19 = 437 associations computed for each ofthe 22 patient subgroups.

10github.com/vdumeaux/mixtR

To explore this large amount of data and resultswe needed an interactive point-and-click applicationthat could interface with the statistical methods andlink the results to online databases. This would makeit possible for users to explore the results without anycoding background. We needed an application thatcould interface with MSigDB to fetch gene set meta-data and the Entrez Programming Utilities (E-utils)to retrieve gene meta-data. To make the interfacesto the statistical analyses and databases re-usable byother applications that we aim to implement later itwas necessary for these to communicate using openprotocols.

In addition, we required that the application wascontainerized. This allows us to deploy the appli-cation on a wide range of hardware, from local in-stallations to deployments to cloud providers such asAmazon Web Services (AWS)11.

Design

Our experience can be generalized into the followingdesign principles for building applications in bioinfor-matics:

Principle 1: Build applications as collections oflanguage-agnostic microservices. This enables re-useof components and does not enforce any specific pro-gramming language on the user-facing logic or theunderlying components of the application. Withinbioinformatics, researchers and application develop-ers use a wide range of programming languages tobuild tools. By composing an application of servicescommunicating over standard protocols it is possibleto re-use existing tools in new applications.

Principle 2: Use software containers to packageeach service. This has a number of benefits; it sim-plifies deployment, ensures that dependencies and li-braries are installed, and it simplifies sharing of ser-vices between developers.

Using these design principles we built the MIxTweb application using the microservices in Kvikto interface with statistical analyses and biologicaldatabases. Kvik provides a compute service for exe-cuting statistical analyses and a database service for

11aws.amazon.com

3

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;

Page 4: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

retrieving relevant information on genes and biolog-ical processes. Using these it is possible to developspecialized data exploration application in any mod-ern programming language. In the rest of the sectionwe discuss the design choices we made to build themicroservices that power the MIxT web application.

Compute Service

The main cornerstone of every data exploration ap-plication in systems biology is a dataset. Datasets gothrough a series of transformations before they can beinterpreted by experts. Because of the complexity ofthese datasets, package repositories such as BioCon-ductor provide software for reading and analyzing thedata. Because of the complex datasets, there are anumber of requirements for systems that can explorethem i) datasets require specialized software to beread, ii) datasets can be too big to fit on a desktopcomputer, iii) statistical methods for analyzing thedatasets are too computationally intensive to run ona desktop computer, iv) users may want to modifystatistical analyses directly from an application, andv) users need to know exactly what transformationshas been done to a dataset to reproduce the analyses.

From these requirements we have designed thecompute service in Kvik. The compute service in-terfaces directly with the R programming language,making it possible to call functions from any R pack-age. Application developers can use the compute ser-vice to execute analyses and return results. By inter-facing directly with R developers can leverage this toproduce dynamic applications. For example, if an ap-plication uses a clustering method to color nodes in agraph, end-users can tweak parameters that interac-tively within the application that changes the nodecoloring in graph visualization. Also, by interfacingwith R directly from an application, the applicationcan store provenance data on the statistical methodsand input parameters being used. By placing datastorage and analysis into a service, application devel-opers can deploy it on a powerful server while keep-ing the user-facing application logic on the desktop.Packaging the service into a software container alsoprovides the necessary functionality to ensure thatthe statistical analyses can be reproduced later.

The compute service offers three main operationsto interface with R. i) to call a function from an Rpackage, ii) to get the results from a previous func-tion call, and iii) a catch-all term that both calls afunction and returns the results. We use the same ter-minology as OpenCPU[6] and have named the threeoperations Call, Get, and RPC respectively. Thesethree operations provide the necessary interface forapplications to include data in the applications.

Database Service

To understand analysis results experts querydatabases and scientific literature. There are a wealthof online databases, some of which provide open APIsin addition to web user interfaces that applicationdevelopers can make use of. While the databasescan provide helpful information, there are some chal-lenges including them in interactive data explorationapplications:

i) the APIs aren’t fast enough to use in interac-tive applications where the application has to per-form multiple database calls, ii) some databases putrestrictions on the number of database calls, and iii)there is no uniform way for storing database lookupprovenance to reproduce the database lookups.

To solve these problems we built a database ser-vice for application developers to include. The ser-vice includes a simple caching mechanism that solvesthe two challenges. If we cache queries to thedatabase we can speed up subsequent calls and re-duce the load on the respective databases. Both thequery from the application and the response from thedatabases are stored for later use. The database ser-vice provides an open HTTP interface to biologicaldatabases for retrieving meta-data on genes and pro-cesses. We have currently packages for interfacingwith E-utilities12, MSigDB13, Hugo Gene Nomencla-ture Committe (HGNC)14, and Kyoto Encyclopediaof Genes and Genomes (KEGG)15.

12eutils.ncbi.nlm.nih.gov13software.broadinstitute.org/gsea/msigdb14genenames.org15kegg.jp

4

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;

Page 5: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

Applications

Kvik provides services to perform database lookupand execute statistical analyses. Since we provideboth services as software containers, application de-velopers simply run these and write applications thatinterface with them. Using our services applicationdevelopers have a platform for running statisticalanalyses, and a fast clean interface to get metadataabout genes and processes.

Figure 1 shows how the the MIxT application isbuilt using Kvik microservices. In MIxT we built aspecialized web application that interfaces with Kvikto get data from biological databases and to run sta-tistical analyses from the mixt R package16.

Figure 1: An overview of the relationship betweenthe MIxT application and Kvik. MIxT contains aweb application (online at mixt-blood-tumor.bci.

mcgill.ca) and the R Package that provides analysesand data to the web application. Kvik provides theservices for running the statistical analyses from theR package, and the database lookups found in theweb application.

16github.com/vdumeaux/mixtR

Implementation

In ths section we describe the implementation detailsof the microservices we provide in Kvik.

Kvik is implemented as a collection of Go packagesrequired to build services that can integrate statisti-cal software in a data exploration and provide an in-terface to up-to-date biological databases. We chosethe Go programming language because of its perfor-mance, ease of development, and simple deployment.To integrate R we provide two packages gopencpu andr, that interface with OpenCPU and Kvik R serversrespectively. To interface with biological databaseswe provide the packages eutils, gsea, genenames, andkegg that interface with E-utils, MsigDB, HGNC andKEGG respectively. In addition to these packageswe provide Docker images that implement the tworequired microservices.

Both the compute and the databases service inKvik builds on the standard http package in Go. Thedatabase service use the gocache17 package to cacheany query to an online database. In addition we de-ploy each service as Docker containers.18

Compute Service

The compute service is an HTTP server that com-municates with a pre-set number of R processes toexecute statistical analyses. On start of the computeservice launches a user-defined number of R workersessions for executing analyses, default is 5. The com-pute service uses a round-robin scheduling scheme todistribute incoming requests to the workers. We pro-vide a simple FIFO queue for queuing of requests.The compute service also provides the opportunityfor users to cache analysis results to speed up subse-quent calls.

The compute service in Kvik is built using a hy-brid state pattern. A hybrid state pattern originsfrom functional programming, where output from amethod only depends on its inputs and not the pro-gram state.[6]. In practice this means that the com-pute services stores the output results from function

17github.com/fjukstad/gocache18Available at hub.docker.com/r/fjukstad/kvik-r and

hub.docker.com/r/fjukstad/db

5

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;

Page 6: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

calls, but can compute them if the service has torestart. This makes it possible for us to scale the com-pute service horizontally to handle more requests. Ifan R worker session for some reason crashes the com-pute service simply starts up a replacement.

Matched interactions across tis-sues

We show the viability of the microservices approachin Kvik by describing the MIxT web application forexploring and comparing transcriptional profiles fromblood and tumor samples. We also evaluate the per-formance of the microservices in Kvik to show itsusefulness in building MIxT.

Analysis Tasks

The web application provides functionality to per-form six data analysis tasks (A1-A6):A1: Explore co-expression gene sets in tumor and

blood tissue. We simplify the process of exploringthe computed co-expression gene sets, or modules,through the web-application. The application visu-alize gene expression patterns together with clinico-pathological variables for each module. In additionwe enable users to study the underlying biologicalfunctions of each module by including gene set anal-yses between the module genes and known gene sets.A2: Explore co-expression relationships between

genes. Users can explore the co-expression relation-ship as a graph visualization. The network visualizeseach gene as a node and a significant co-expressionrelationship as an edge.A3: Explore relationships between modules from

each tissue. Users can explore the relationship be-tween modules from different tissues. We providetwo different metrics to compare modules, and theweb application enables users to interactively browsethese relationships. In addition to providing visual-izations the compare modules from each tissue, userscan explore the relationships, but for different breastcancer patient groups.A4: Explore relationships between clinical vari-

ables and modules. In addition to comparing the

association between modules from both tissues, usersalso have the possibility to explore the associationwith a module and a specific clinical variable. It isalso possible to explore the associations stratifyingon breast cancer patient group.

A5: Explore association between user-submittedgene lists and computed modules. We want to enableusers to explore their own gene lists to explore themin context of the co-expression gene sets. The web ap-plication must handle uploads of gene lists and com-pute association between the genelist and the MIxTmodules on demand.

A6: Search for genes or gene lists of interest. Tofacilitate faster lookup of genes and biological pro-cesses, the web application provides a search func-tionality that lets users locate genes or gene lists andshow association to the co-expression gene sets.

Design and Implementation

From these six analysis tasks we designed and im-plemented MIxT as a web application that inte-grates statistical analyses and information from bi-ological databases together with interactive visual-izations. The MIxT web application consists of threeservices: i) the web application itself containing theuser-interface and visualizations; ii) the compute ser-vice performing the MIxT analyses delivering datato the web application; and iii) the database ser-vice providing up-to-date information from biologicaldatabases. Each of these services run within Dockercontainers making the process of deploying the appli-cation simple.

We structured the MIxT application with a sepa-rate view for each analysis task. To explore the co-expression gene sets (A1) we built a view that com-bines both static visualizations from R together withinteractive tables with gene overlap analyses. Figure3 shows the web page presented to users when theyaccess the co-expression gene set ’darkturquoise’ fromblood. Using the Kvik compute service we can gen-erate plots on demand and provide users with high-resolution PDFs or PNG files. To explore the co-expression relationship between genes (A2) we use an

6

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;

Page 7: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

interactive graph visualization build with Sigmajs19.We have built visualization for both tissues, withgraph sizes of 2705 nodes and 90 348 edges for theblood network, and 2066 nodes and 50 563 edgesfor the biopsy network. The sigmajs visualizationlibrary has functionality for generating a layout forlarge networks, but we generate this layout server-side to reduce the computational load on the client.To generate this layout we use the GGally package20.By generating the network layout using the computeservice we relieve the clients.

To visualize relationships between modules fromdifferent tissues (A3), or their relationship to clini-cal variables (A4) we built a heatmap visualizationusing the d321 library. Figure 2 shows an example ofthis heatmap visualization, showing the associationbetween the clinical variables and the modules frombiopsy for all samples.

Figure 2: Heatmap visualization of the associa-tion between clinical variables and the modulesin biosy. The visualization is built using thed3 JavaScript library. The visualization can beviewed online at mixt-blood-tumor.bci.mcgill.

ca/clinical-comparison

Since we interface directly with R we can run anal-yses on demand. We built a simple upload pagewhere users can upload their gene sets or type themin manually (A5). The file is uploaded to the web

19sigmajs.org20cran.r-project.org/web/packages/GGally21d3js.org

application which redirects it to the compute servicethat runs the analyses. Similarly we can take userinput to search for genes and processes (A5).

Evaluation

To investigate if it is feasible to implement parts ofan application as separate services, we evaluate theresponse times for a set of queries. We have alsoinvestigated the time the MIxT web application useto produce the visualizations for the analysis tasks.

To evaluate the database service we measure thequery time for retrieving information about a spe-cific gene with and without caching. This illustrateshow we can improve performance in an applicationby using a database service rather than accessing thedatabase directly. We evaluate the query time for1, 2, 5, 10, and 15 concurrent requests. Since thedatabase service is just a lightweight HTTP serverwe use a AWS EC2 t2.micro22 instance to host it.

From the results in Table 1 we see a significant im-provement in response time when the database ser-vice caches the results from the database lookups. Inaddition by serving the results out of cache we reducethe number of queries to the online database down toone.

1 2 5 10 15No cache 956ms 1123ms 1499ms 2147ms 2958msCache 64ms 64ms 130ms 137ms 154ms

Table 1: Time to retrieve a gene summary for a sin-gle gene, comparing different number of concurrentrequests.

We evaluate the compute service by running asmall microbenchmark. The benchmark consists oftwo operations: first generate a set of numbers, thenplot them and return the resulting visualization. Weshow that the latency is low enough to use it in aninteractive application, and compare our solution tothe OpenCPU system. To show that it performs wellunder heavy load we perform the same operations us-ing 1, 2, 5, 10, and 15 concurrent requests. We use

22See aws.amazon.com/ec2/instance-types for more infor-mation about AWS EC2 instance types.

7

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;

Page 8: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

Figure 3: MIxT module overview page. The screenshot show the user interface for exploring a singlemodule. It consists of three panels. The top left panel contains the gene expression heatmap. The top rightpanel contains a table of the genes found in the module. The bottom panel contains the results of geneoverlap analyses from the module genes and known gene sets from MSigDB. The visualization is online atmixt-blood-tumor.bci.mcgill.ca/modules/blood/darkturquoise/cohort/all

two c4.large instances on AWS EC2 running the Kvikcompute service and OpenCPU base docker contain-ers. The servers have caching disabled.

With single requests the the mean execution timein Kvik is 274ms while OpenCPU uses 500ms. Wethen investigate how each service handles concurrentrequests. Table 2 shows the time to complete thebenchmark for different number of concurrent con-nections. We see that the compute service in Kvikperforms better than the OpenCPU alternative.

We also investigate the performance of the MIxTweb application to discover potential areas of im-provement. We measured the time from an user clickson a link to open a specific view, until the user caninteract with the results. Table 3 show the results

1 2 5 10 15Kvik 274ms 278ms 352ms 374ms 390msOpenCPU 500ms 635ms 984ms 1876ms 2700ms

Table 2: Time to complete the R microbenchmarkwith different number of concurrent connections.

from our evaluation, with anlysis tasks A1 and A2being the most time consuming. A1 generates theview in Figure 3 which contains large HTML tableswith results from the gene set tests that take the ma-jor fraction of the time to completion. The time tocompletion in analysis task A2 comes from retrievingand rendering the two large graphs.

8

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;

Page 9: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

Analysis task Number of calls Time to completionA1 10 10 secondsA2 5 30 secondsA3 1 2 secondsA4 1 2 secondsA5 2 3 secondsA6 NA NA

Table 3: An overview of the number of calls to thecompute service and completion time before the re-sults for the different analysis tasks are ready to beexplored. The number of calls and completion timefor analysis task 6 depends on the search query.

Related Work

In this section we discuss methods for integrating sta-tistical analyses in data exploration applications, dif-ferent visualization frameworks, interfaces to biolog-ical reference databases, and methods for container-izing applications.

Integrate Statistical Analyses

Shiny is a web application framework for R23 It al-lows developers to build web applications in R with-out having to have any knowledge about HTML, CSSor Javascript. Its widget library to provides more ad-vanced Javascript visualizations such as Leaflet24 formaps or three.js25 for WebGL-accellerated graphics.Developers can choose to host their own web serverwith the user-built Shiny Apps, or host them on pub-lic servers. Shiny forces users to implement data ex-ploration applications in R, limiting the functionalityto the widgets and libraries in Shiny.

OpenCPU is a system for embedded scientific com-puting and reproducible research.[6] Similar to thecompute service in Kvik, it offers an HTTP API tothe R programming language to provide an interfacewith statistical methods. It allows users to makefunction calls to any R package and retrieve the re-sults in a wide variety of formats such as JSON orPDF. Users can chose to host their own R server or

23shiny.rstudio.com24leafletjs.com25threejs.org

use public servers, and OpenCPU works in a single-user setting within an R session, or a multi-usersetting facilitating multiple parallel requests. Thismakes OpenCPU suitable for building a service thatcan run statistical analyses. OpenCPU provides aJavascript library for interfacing with R, as well asDocker containers for easy installation. OpenCPUhas been used to build multiple applications.26. Thecompute service in Kvik follows many of the designpatterns in OpenCPU. Both systems interface withR packages using a hybrid state pattern over HTTP.Both systems provide the same interface to executeanalyses and retrieve results. While OpenCPU isimplemented on top of R and Apache, Kvik is im-plemented from the ground up in Go. Because ofthe similarities in the interface to R in Kvik we pro-vide packages for interfacing with our own R serveror OpenCPU R servers

Renjin is a JVM-based interpreter for the R pro-gramming language.[7] It enables integrating the Rinterpreter in web applications. Since it is built ontop of the JVM it allows developers to write data ex-ploration applications in Java that interact directlywith R code, both running on top of the JVM. Al-though Renjin provides as a service pre-built versionsof packages from CRAN and BioConductor, not allpackages can be built for use in Renjin. This makes itnecessary to re-implement the packages that cannotbe built for use in Renjin.

Parallel and Distributed Execution

Biogo is a bioinformatics library written in Go.It provides functionality to analyze genomic andmetagenomic datasets in the Go programminglanguage.[3] Using the Go programming language thedevelopers are able to provide high-performance par-allel processing in a clean and simple programminglanguage. Go provides the ease of programming aspopular scripting languages such as R, but the per-formance of low-level langauges such as C++.

MapReduce is a popular framework and program-ming model to analyze large-scale datasets using dis-tributed and parallel computing.[8] While MapRe-

26opencpu.org/apps.html

9

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;

Page 10: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

duce has shown its usefulness in certain applica-tions, it does not fit the different statistical meth-ods in bioinformatics. An alternative to MapReduceis Spark, which supports high-performance parallelexecution of iterative machine learning methods bypartitioning data across a set of machines and opti-mizing the data access.[9] Spark is growing in pop-ularity, especially with the Spakrlyr27 and SparkR28

R packages that allow R users to run analyses on topof Spark without having to port the analysis code toanother language.

ADAM and VariantSpark are systems that areimplemented on top of Spark to analyse genomicdata. ADAM provides a set of formats, APIs andprocessing stage implementations [10] while Vari-antSpark provides an method for clustering genomicvariants[11].

Pachyderm29 is a system for running containerizeddata analysis pipelines. Each step in an analysispipeline is run within a software container and theoutput data is version controlled. Pachyderm auto-matically partitions and distributes the input data toenable parallel processing. Pachyderm runs on top ofKubernetes30, a system for managing and deployingcontainerized applications.

Visualization tools

Cytoscape is an open source software platform forvisualizing complex networks and integrating thesewith any type of attribute data[12]. It allows for anal-ysis and visualization in the same platform. Userscan add additional features, such as databases con-nections or new layouts, through Apps. One such appis cyREST which allows external network creationand analysis through a REST API[13]. To bring thevisualization and analysis capabilities to the web thecreators of Cytoscape have developed Cytoscape.js31,a Javascript library to create interactive graph visu-alizations.

Caleydo is a framework for building applications

27spark.rstudio.com28spark.apache.org/docs/latest/sparkr.html29pachyderm.io30kubernetes.io31js.cytoscapejs.org

for visualizing and exploring biomolecular data[?].Until 2014 it was a standalone tool that needed to bedownloaded, but the Caleydo team are now makingthe tools web-based. There have been several appli-cations built using Caleydo: StratomeX for explor-ing stratified heterogeneous datasets for disease sub-type analysis[14]; Pathfinder for exploring paths inlarge multivariate graphs[15]; UpSet to visualize andanalyse sets, their intersections and aggregates[16];Entourage and enRoute to explore and visualize bi-ological pathways [17][18]; LineUp to explore rank-ings of items based on a set of attributes[19]; andDomino for exploring subsets across multiple tabulardatasets[20].

BioJS is an open-source JavaScript frameworkfor biological data visualization.[21] It provides acommunity-driven online repository with a widerange components for visualizing biological data con-tributed by the bioinformatics community. BioJSbuilds on node.js32 providing both server-side andclient-side libraries.

Containerized analysis

In the later years software containers have beenwidely adopted by both the software industry aswell as research communities. Containers provide anisolated execution environment that can be used topackage and run an application with all its depen-dencies, library versions and configuration files. Inresearchers containers have become popular becausethey provides a reproducible environment that canbe shared between research projects.

In bioinformatics researchers are starting tobundle their software using software containers,such as Docker. There are repositories such asBioContainers[22] and BioBoxes[23] that provide con-tainers preinstalled with software for doing analy-ses and running different applications. Systems suchas Galaxy now allow researchers to build analysispipelines where each step is executed within a soft-ware container.

32nodejs.org

10

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;

Page 11: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

Kvik and Kvik Pathwys

We have previously built a system for interactivelyexploring gene expression data in context of biologi-cal pathways.[24] Kvik Pathways is a web applicationthat integrates gene expression data from the Norwe-gian Women and Cancer (NOWAC) cohort togetherwith pathway images from the Kyoto Encyclopediaof Genes and Genomes (KEGG). We used the experi-ence building Kvik Pathways to completely re-designand re-implement the R interface in Kvik. From hav-ing an R server that can run a set of functions from anR script, it now has a clean interface to call any func-tion from any R package, not just retrieving data as atext string but in a wide range of formats. We also re-built the database interface, which is now a seperateservice. This makes it possible to leverage its cachingcapabilities to improve latency. This transformed theapplication from being a single monolithic applica-tion into a system that consists of a web applicationfor visualizing biological pathways, a database serviceto retrieve pathway images and other metadata, anda compute service for interfacing with the gene ex-pression data in the NOWAC cohort. We could thenre-use the database and the compute service in theMIxT application.

Discussion

We argue that developing data exploration applica-tions using a microservice architecture is a viable al-ternative to the traditional monolithic approach. Bypackaging each component of an application in a soft-ware container application developers can reuse andshare parts of an application across research teamsand projects.

Although the partition of components can helpbreak up an application into manageable parts, thereis more overhead with deploying and monitoringthese than a single application. Leslie Lamport’s fa-mous quote You know you have a distributed systemwhen the crash of a computer you’ve never heard ofstops you from getting any work done. perfectly de-scribes the possible challenges application develop-ers face when moving into a microservice architec-

ture. Monitoring the health of the different servicesand keeping the services running is a challenge, butsystems such as Kubernetes33 provide the necessaryfunctionality to manage containerized applications.

Future work

Although we have a first working prototype of themicroservices and the MIxT web application, thereare a few points we aim to address in future work.

The first issue is to improve the user experiencein the MIxT web application. Since it is performingmany of the analyses on demand, the user interfacemay seem unresponsive. We are working on mecha-nisms that gives the user feedback when the compu-tations are taking a long time.

The database service provides a sufficient inter-face for the MIxT web application. While we havedeveloped the software packages for interfacing withmore databases, these haven’t been included in thedatabase service yet. In future versions we aim tomake the database service be a interface for all ourapplications. We also aim to improve how we cap-ture data provenance. We aim to provide databaseversions and meta-data about when a specific itemwas retrieved from the database.

One large concern that we haven’t addressed in thispaper is security. In particular one security concernthat we plan to address in Kvik is the restrictions onthe execution of code in the compute service. We planto address this in the next version of the computeservice, using methods such as AppArmor34 that canrestrict a program’s resource access.

We also aim to explore different avenues for scalingup the compute service. Since we already interfacewith R we can use the Sparklyr or SparkR packagesto run analyses on top of Spark. Using Spark as anexecution engine for data anlyses will enable applica-tions to explore even larger datasets.

33kubernetes.io34wiki.ubuntu.com/AppArmor

11

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;

Page 12: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

Conclusions

We have designed an approach for building data ex-ploration applications in systems biology that buildson a microservice architecture. Using this approachwe have built a web application that leverages this ar-chitecture to integrate statistical analyses, interactivevisualizations, and data from biological databases.While we have used our approach to build an ap-plication in systems biology, we believe that the mi-croservice architecture can be used to build data ex-ploration systems in other disciplines as well.

Competing interests

The authors declare that they have no competing in-terests.

References

[1] A. Sboner, X. J. Mu, D. Greenbaum, R. K. Auer-bach, and M. B. Gerstein, “The real cost of se-quencing: higher than you think!,” Genome bi-ology, vol. 12, no. 8, p. 125, 2011.

[2] P. J. Cock, T. Antao, J. T. Chang, B. A.Chapman, C. J. Cox, A. Dalke, I. Fried-berg, T. Hamelryck, F. Kauff, B. Wilczyn-ski, et al., “Biopython: freely available pythontools for computational molecular biology andbioinformatics,” Bioinformatics, vol. 25, no. 11,pp. 1422–1423, 2009.

[3] R. D. Kortschak and D. L. Adelson, “bıogo: asimple high-performance bioinformatics toolkitfor the go language,” bioRxiv, 2014.

[4] C. L. Williams, J. C. Sica, R. T. Killen, andU. G. Balis, “The growing need for microservicesin bioinformatics,” Journal of Pathology Infor-matics, vol. 7, 2016.

[5] V. Dumeaux, B. Fjukstad, H. Fjosne E, J.-O. Frantzen, M. Muri Holmen, E. Rodegerdts,E. Schlichting, A.-L. Børresen-Dale, L. A.

Bongo, E. Lund, and M. T. Hallett, “Interac-tions between the tumor and the blood systemicresponse of breast cancer patients,” Under re-view, 2017.

[6] J. Ooms, “The opencpu system: Towardsa universal interface for scientific computingthrough separation of concerns,” arXiv preprintarXiv:1406.4806, 2014.

[7] A. Bertram, “Renjin: The new r interpreterbuilt on the jvm,” in The R User Confer-ence, useR! 2013 July 10-12 2013 Universityof Castilla-La Mancha, Albacete, Spain, vol. 10,p. 105, 2013.

[8] J. Dean and S. Ghemawat, “Mapreduce: simpli-fied data processing on large clusters,” Commu-nications of the ACM, vol. 51, no. 1, pp. 107–113,2008.

[9] M. Zaharia, M. Chowdhury, M. J. Franklin,S. Shenker, and I. Stoica, “Spark: Cluster com-puting with working sets.,” HotCloud, vol. 10,no. 10-10, p. 95, 2010.

[10] M. Massie, F. Nothaft, C. Hartl, C. Kozanitis,A. Schumacher, A. D. Joseph, and D. A. Pat-terson, “Adam: Genomics formats and process-ing patterns for cloud scale computing,” Uni-versity of California, Berkeley Technical Report,No. UCB/EECS-2013, vol. 207, 2013.

[11] A. R. O’Brien, N. F. Saunders, Y. Guo, F. A.Buske, R. J. Scott, and D. C. Bauer, “Vari-antspark: population scale clustering of geno-type information,” BMC genomics, vol. 16,no. 1, p. 1052, 2015.

[12] P. Shannon, A. Markiel, O. Ozier, N. S.Baliga, J. T. Wang, D. Ramage, N. Amin,B. Schwikowski, and T. Ideker, “Cytoscape: asoftware environment for integrated models ofbiomolecular interaction networks,” Genome re-search, vol. 13, no. 11, pp. 2498–2504, 2003.

[13] K. Ono, T. Muetze, G. Kolishovski, P. Shan-non, and B. Demchak, “Cyrest: Turbocharging

12

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;

Page 13: Building Applications For Interactive Data Exploration In Systems … · for each project. Our approach for developing data exploration ap-plications in systems biology builds on

cytoscape access for external tools via a restfulapi,” F1000Research, vol. 4, 2015.

[14] A. Lex, M. Streit, H.-J. Schulz, C. Partl,D. Schmalstieg, P. J. Park, and N. Gehlenborg,“Stratomex: Visual analysis of large-scale het-erogeneous genomics data for cancer subtypecharacterization,” in Computer graphics forum,vol. 31, pp. 1175–1184, Wiley Online Library,2012.

[15] C. Partl, S. Gratzl, M. Streit, A. M. Wasser-mann, H. Pfister, D. Schmalstieg, and A. Lex,“Pathfinder: Visual analysis of paths in graphs,”in Computer Graphics Forum, vol. 35, pp. 71–80,Wiley Online Library, 2016.

[16] A. Lex, N. Gehlenborg, H. Strobelt, R. Vuille-mot, and H. Pfister, “Upset: visualization ofintersecting sets,” IEEE transactions on visual-ization and computer graphics, vol. 20, no. 12,pp. 1983–1992, 2014.

[17] A. Lex, C. Partl, D. Kalkofen, M. Streit,S. Gratzl, A. M. Wassermann, D. Schmalstieg,and H. Pfister, “Entourage: Visualizing rela-tionships between biological pathways using con-textual subsets,” IEEE transactions on visual-ization and computer graphics, vol. 19, no. 12,pp. 2536–2545, 2013.

[18] C. Partl, A. Lex, M. Streit, D. Kalkofen,K. Kashofer, and D. Schmalstieg, “enroute: Dy-namic path extraction from biological pathwaymaps for in-depth experimental data analysis,”in Biological Data Visualization (BioVis), 2012IEEE Symposium on, pp. 107–114, IEEE, 2012.

[19] S. Gratzl, A. Lex, N. Gehlenborg, H. Pfister,and M. Streit, “Lineup: Visual analysis of multi-attribute rankings,” IEEE transactions on visu-alization and computer graphics, vol. 19, no. 12,pp. 2277–2286, 2013.

[20] S. Gratzl, N. Gehlenborg, A. Lex, H. Pfister,and M. Streit, “Domino: Extracting, compar-ing, and manipulating subsets across multiple

tabular datasets,” IEEE transactions on visu-alization and computer graphics, vol. 20, no. 12,pp. 2023–2032, 2014.

[21] J. Gomez, L. J. Garcıa, G. A. Salazar, J. Villave-ces, S. Gore, A. Garcıa, M. J. Martın, G. Launay,R. Alcantara, N. D. T. Ayllon, et al., “Biojs: anopen source javascript framework for biologicaldata visualization,” Bioinformatics, p. btt100,2013.

[22] F. da Veiga Leprevost, B. A. Gruning, S. A.Aflitos, H. L. Rost, J. Uszkoreit, H. Barsnes,M. Vaudel11, P. Moreno, L. Gatto13, J. We-ber, et al., “Biocontainers: An open-source andcommunity-driven framework for software stan-dardization,”

[23] P. Belmann, J. Droge, A. Bremges, A. C.McHardy, A. Sczyrba, and M. D. Barton,“Bioboxes: standardised containers for in-terchangeable bioinformatics software,” Giga-Science, vol. 4, no. 1, p. 47, 2015.

[24] B. Fjukstad, K. S. Olsen, M. Jareid, E. Lund,and L. A. Bongo, “Kvik: three-tier data explo-ration tools for flexible analysis of genomic datain epidemiological studies,” F1000Research,vol. 4, 2015.

13

.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under aThe copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/141630doi: bioRxiv preprint first posted online May. 24, 2017;