
TranSMART ‘Glowing Bear’ 2.0

Architecture Roadmap

Authors:

Sjoerd van Hagen, The Hyve
Florian Guitton, Imperial College
Michael McDuffie, Harvard Medical School
Gustavo Lopes, The Hyve
Kees van Bochove, The Hyve
Peter Rice, Imperial College
Ruslan Forostianov, The Hyve
Terry Weymouth, University of Michigan

Version: 1.0
Date: February 17, 2015

Table of Contents

Introduction
  Executive Summary
  TranSMART 1.2
  TranSMART 2.0
General architecture
  Components
    TranSMART 2.0 Core
    TranSMART 2.0 Standard distribution
    transmartApp
    transmart-batch
    Architecture diagram
  Extensions
  Repositories
  General requirements
Core
  Plugins
  i2b2 Integration
  Data model
    General
    Clinical data
    High-dimensional data
  API
    Internal API updates
    Internal API extensions
    RESTful API
  transmart-server
User interface
  Technical requirements
  Framework
    AngularJS
    Why Angular?
      New developers
      MVC
      Security
  Architecture
  User Experience
    General improvements
    Use Cases
      Summary Statistics
        Current workflow
        Proposal(s)
          Proposal 1
          Proposal 2
          Proposal 3
        Discussion
      (Cross-study) cohort comparison
        Current workflow
        Proposal(s)
          Proposal 1
          Proposal 2
        Discussion
Plugins
  Architecture
    Front End Plugin
    Backend Plugin
  Proposed plugins
    transmart-activity
    transmart-results
    transmart-analytics
Deferred
  ETL
    Data Repository interfaces
    Data updates
Timeline
  Initial plans
    UI
    Backend plugin architecture
    Miscellaneous/Maintenance
  Next steps

Introduction

Executive Summary

TranSMART is an open source platform for translational research with a large backing community of pharmaceutical companies, university hospitals, patient organizations and technology vendors. TranSMART 1.2 is the current stable version of the platform, but it has a number of problems, detailed below. This whitepaper is the result of an in-person meeting of the core developers of tranSMART held in Utrecht in January 2015, and proposes an architectural roadmap for tranSMART 2.0, the next major version of the platform. It also describes a framework to formalize the development workflow of tranSMART, in order to make it easier to maintain and to encourage contributions.

To achieve this we will isolate, into a core, those elements that are truly foundational, and specify well-defined boundaries between that core and the other elements (data backend, business logic, and user interface) with Application Program Interfaces (APIs). In this paper, after discussing some details on the background of this project, we explore the architecture of the core, the details of the APIs, and the implications for plugins that implement the parts of tranSMART that are beyond the core, for example the plugins for advanced analysis and the associated user interface elements.

We conclude with a roadmap for design, implementation, and testing of v2.0: design and implement a framework for three types of plugins (Backend, UI, and Analysis); implement and vet a prototype of a skeleton core with clinical data, cohort selection, and summary statistics; then extend the APIs to include high-dimensional data and other data types; and create a set of standard plugins for advanced analysis. The skeleton prototype is planned for the end of 2015 Q2, and during its development we will plan the details of a more complete implementation roadmap. The aim is to present the consensus roadmap at the tranSMART Annual Meeting in fall 2015.

TranSMART 1.2

TranSMART was made open source by Johnson & Johnson in February 2012. In early 2013 the tranSMART Foundation was founded, and it has since worked with the community, over the course of 2013 and 2014, to produce two new versions of the platform: v1.1, the first wholly open source version of the codebase, and v1.2, a full, feature-rich and tested codebase with a code governance program. During this action-packed period of extending and developing tranSMART, some corners were cut and temporary fixes were made to achieve results in the short time available. From an architectural and code quality perspective, the current codebase contains a lot of technical debt, making it increasingly difficult to maintain and debug the code, or to add features.

Our current developers may lose interest in contributing to tranSMART as their productivity decreases and their time gets taken up by fixing bugs. We would very much like to keep these valuable individuals involved in the development of tranSMART, and the need for rethinking the architecture was repeatedly voiced last year. Another observation is that growth of the open source developer community is slow, and this is probably also caused by problems in the current codebase. New people who would like to join and contribute are faced with a very steep learning curve and could easily be scared away by the complexity of installing, programming and debugging tranSMART, let alone the complexity of the architecture and code.

When it comes to the user interface, there are several problems with tranSMART 1.2. The development practices for designing and implementing web applications have changed a lot during the last couple of years. People have come to expect a better user experience from web applications than tranSMART currently delivers. There are some issues across the user interface that will probably take longer to fix than it would take to replace the user interface altogether. With tranSMART 1.2 being tested and brought into production all over the world, now is an excellent time to address these issues in earnest, and lay the foundations for future development of tranSMART, establishing it as the de facto standard for translational research data warehousing.

TranSMART 2.0

For the development of tranSMART 2.0 we have the following main goals. The list is not exhaustive, but it covers the most important ones.

Make it easier for new developers, and less tedious for our current developers, to add features and contribute to tranSMART

Improve the maintainability of the code base

Make tranSMART more suitable as a component in larger IT landscapes

Standardize the way of working in the open source community

Replace the current user interface with a modern web application to enhance the user experience

Improve testability and robustness

In this document we will present the roadmap to achieve these goals. We will try to simplify and modularize the architecture. This will make it easier to isolate problems, improve maintainability and testability, and make it easier for new developers to get started contributing to tranSMART. We will do this in two ways: create a plugin architecture to make adding features easier, and introduce a strict division between client and server code. We will also remove obsolete code and clean up code that is to be reused. Using plugins will also make it easier to integrate tranSMART 2.0 into a larger IT landscape, and it will improve testability and robustness.

A large part of this whitepaper is dedicated to the user interface, as there is a lot of technical debt to be found there and we decided that it would be easier to replace than to reuse. Apart from using a modern approach to designing a web application with a modern framework, we made the decision to make a major architectural change: a strict division between server and client code. In addition to the already mentioned advantages of modularization, the split between client-side and server-side technology enables someone who is familiar with only one of them to contribute to the project. We also propose a separate plugin architecture to make it a lot easier to create a plugin that only extends the client side.

Before we dive into the more detailed discussions about architecture and web application frameworks, we will give a short description of what tranSMART 2.0 is going to be. The core of tranSMART 2.0 is an enterprise-level data warehouse for translational research, which is able to store clinical and biomarker data for various platforms. On top of this core data warehouse, the standard distribution of tranSMART 2.0 also contains a web client user interface which allows users to browse clinical (patient) data and high-dimensional (biomarker) data observations in the familiar tree-like fashion, and to quickly generate summary statistics and a couple of more advanced analyses out of the box. This user interface will have a modular architecture, and the standard distribution of tranSMART 2.0 will have modules for cohort creation and summary statistics, as well as plugins for advanced analyses and browsing analysis results.

General architecture

In this section we will discuss the high-level architecture of tranSMART 2.0 and the division into components, some of which we will discuss in more detail later on. We will describe how these components are going to be distributed and the setup of the repositories. We will also give some requirements that hold for each of the components.

Components

The following is a list of components that make up the tranSMART 2.0 core and standard distribution. Each component has its own responsibilities. Most of these components already exist in tranSMART 1.2, and some of them will be extended. We will first list the components with a short description. For some of the components we give additional information in this section; others, such as the UI, have a larger section of their own elsewhere in this document. Components that are plugins are handled in the plugin section.

TranSMART 2.0 Core

The tranSMART core should be a minimal package that provides enough functionality for developers to create and deploy plugins. It is to be used if a user just needs a bare installation on which to put some custom plugins. It will contain the following modules:

transmart-data (various): utilities for management of the database schema and installation

transmart-core-db (Grails plugin): base implementation of the API on top of PostgreSQL or Oracle

transmart-core-api (Java/Groovy API library): internal backend API for data retrieval

transmart-rest-api (Grails plugin): RESTful endpoint exposing clinical and high-dimensional data

transmart-server (Grails app): application container for the RESTful endpoint

TranSMART 2.0 Standard distribution

The tranSMART 2.0 standard distribution is a package containing the core and a set of plugins that provide the most common features that are used in tranSMART today. The goal of this package is to provide a proper data warehouse to the user containing all of the functionality you would expect. It contains:

TranSMART 2.0 Core

transmart-base-ui (client web app): basic user interface for tranSMART, with modules for cohort creation and summary statistics

transmart-analytics: a plugin or set of plugins to perform advanced analytics (e.g. clustering), including user interface modules

transmart-activity: a plugin that will interface at both the backend level and the UI level, allowing developers to create notifications for their plugin's behaviour. It would optionally offer to send email and to export in a flexible RSS or Atom format.

transmart-results: a plugin or set of plugins to store analysis results (e.g. microarray heatmaps, GWAS etc.)

transmart-batch (Groovy): utilities for ETL

transmartApp

The most dramatic change will be the removal of transmartApp. All the view-related code and images will go into the new UI component. Server-side, the logic will be moved to the RESTful API, and all communication between the UI and the server will happen through RESTful calls returning JSON, XML or protobuf response entities.

transmart-batch

One of the wishes that has been expressed multiple times, and that has already received a lot of attention, is to simplify data loading and to make it less database-dependent and more robust. For tranSMART 2.0 this means that the current stored procedures combined with Kettle will be abandoned in favor of transmart-batch. This is a Groovy application based on Spring Batch that already supports loading clinical data and mRNA data without relying on database-dependent stored procedures or Kettle.

Architecture diagram

The architecture diagram lays out these components:

Extensions

By creating a full REST interface for all tranSMART functionality it becomes very easy to create new clients, e.g. a desktop or Android/iOS app. We could facilitate this further by writing client APIs for various programming languages, for example R. Due to the decoupling, developers could also build their own web client, but we would prefer that they write a plugin for the web client supplied in the core distribution. In addition to extending the client interface it may also be necessary to extend the back end functionality. In principle it is possible to create any kind of plugin with any kind of interface and plug it into the core; however, we should encourage people to simply extend the REST interface when they need more server-side functionality.

Repositories

We will create a new repository for the UI, as there is not much we can reuse from the current UI. The current core repositories that are still included in tranSMART 2.0 will get a new branch named 2.0master as soon as this is necessitated by divergence from 1.2. This way some of the new functionality will make it into tranSMART 1.2. If the complete REST interface makes it into tranSMART 1.2, the new web client could also be put in front of a tranSMART 1.2 installation, so backporting the REST interface would be desirable if it is possible. We have also discussed the workflow for committing to these repositories, the result of which can be found on the wiki.

General requirements

Here we will describe the requirements that each of the components must meet. These are needed to ensure that developers can very quickly get a high-level picture of a component.

All involved codebases should be identified (called out on the wiki), have an appropriate license (GPL or LGPL), have a corresponding repository on the tranSMART GitHub, and have a 2.0dev branch initiated

Component architecture should be drawn to explain structure of the tranSMART 2.0 codebase to prospective developers

The programming language, and if applicable, framework choices for all components should be called out and documented

Components should be decoupled as much as possible, any dependencies between the components should be documented

It should be documented how authentication and authorization are addressed in the various components

Core

In this section we will discuss the changes to be made to the core components of tranSMART. We will first give an overview of everything we move from the typical tranSMART 1.2 installation into plugins. We will also look at changes we need or may want to make to the data model.

Plugins

We will describe the functionality that we would like to move into plugins if possible. These are not necessarily steps we would like to take in the short term, as we have enough work already, but eventually the core should become as small as possible.

The browse and search functionality can probably be moved outside of the core as it is not vital to tranSMART.

Extensions of the data model that might be needed to hold UI state or user preferences in support of the new decoupled UI should go into a plugin that extends the REST API.

Wiki page on the contribution workflow: https://wiki.transmartfoundation.org/display/TSMTGPL/How+to+Contribute+and+who+can+contribute

i2b2 Integration

The i2b2 platform has a proven track record for storing and querying complex phenotypic data, acting as the base on which tranSMART was built. After divergence in recent versions, a December 2014 Hackathon brought tranSMART’s i2b2 integration back to working condition. This was accomplished by providing an implementation of the tranSMART API that communicates with the i2b2 web services, a first step in aligning our two communities again. This API approach allows new functionality developed within i2b2 to be accessible through tranSMART after updates to tranSMART’s API. The social architecture of code governance, and how the two communities interact, is still to be determined as the leaders from each meet and discuss the future.

Data model

General

Implicit assumptions in the data model should be documented; this will be done, in part, through the API.

Delete unused tables and columns, except ones that are needed to conform to the i2b2 v1.7 database specification.

The data model needs to be specified in a DB-independent way, for example through domain models and Hibernate.

Support for an upgrade path: there should be a defined way to make changes to the database schema, and for every release the changes should be convertible to a script to upgrade from a previous version (e.g. Liquibase).

The data model might have to be extended to hold the state of computational workflows / jobs in line with the workflow API.

The core libraries need to track the changes in the data model, so work has to be done there as well.
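The upgrade-path requirement can be illustrated with a minimal sketch. This is neither Liquibase nor tranSMART code: the `MIGRATIONS` list, the `schema_version` table and the example tables are hypothetical, and SQLite is used only to keep the example self-contained. The idea it shows is that schema changes are recorded as an ordered list and that an installation applies only the ones it has not seen yet.

```python
import sqlite3

# Hypothetical, ordered list of schema changes: (version, SQL statement).
MIGRATIONS = [
    (1, "CREATE TABLE patient (patient_id INTEGER PRIMARY KEY)"),
    (2, "ALTER TABLE patient ADD COLUMN sex TEXT"),
    (3, "CREATE TABLE job_state (job_id INTEGER PRIMARY KEY, status TEXT)"),
]


def current_version(conn):
    """Return the highest migration version already applied (0 if none)."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def upgrade(conn):
    """Apply every migration newer than the recorded version, in order."""
    applied = []
    start = current_version(conn)
    for version, statement in MIGRATIONS:
        if version > start:
            conn.execute(statement)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
            applied.append(version)
    conn.commit()
    return applied


conn = sqlite3.connect(":memory:")
first_run = upgrade(conn)   # applies all pending changes
second_run = upgrade(conn)  # nothing left to apply
```

Running `upgrade` a second time is a no-op, which is the property that makes an upgrade script safe to ship with every release.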

Clinical data

The Clinical Data Model will be identical to i2b2 1.7, which involves updates to the tranSMART 1.2 schema. The i2b2 schemas will remain unmodified by tranSMART development to ensure future compatibility with i2b2 updates and any application utilizing the i2b2 Clinical Data Model (SHRINE, SMART, i2b2 plugins, etc.). The cross-trials implementation will need to be altered so that it is compatible with i2b2.

The data model for dictionaries will be reconsidered so that it is both simpler (no separation between search_keywords and bio_markers) and more powerful (do not rely on a global set of case-insensitive search keywords; instead allow searching for biomarkers relevant to the species/platform in context).

The database schema is still to be managed with transmart-data. The database schema should be defined using ORM as much as possible. Parts of the schema that cannot be mapped using ORM should be defined in a SQL script that is as portable as possible and well documented, as it may need to be adapted for other database management systems.

High-dimensional data

Although we are aware that there are some problems with the way tranSMART currently stores high-dimensional data, we do not propose any big changes to the high-dimensional model in tranSMART 2.0. In the bioinformatics community people are moving away from databases when it comes to storing genomics and proteomics data, in favor of indexed file formats (e.g. VCF) and big data systems leveraging column stores (e.g. ADAM or Cassandra) for the numerical data and document stores (e.g. MongoDB) for the metadata. Instead of implementing our own API we could leverage HTSJDK or ADAM, depending on the amount of data involved. Moreover, in our current proposal, using the new plugin architecture, it makes a lot more sense to have a plugin implement new ways of handling high-dimensional data, especially if it does not use the (relational) database for storage anyway. Changes that we may want to push due to user requests:

The data model specification / assumptions might have to be extended to make sure high-dimensional data is stored sample-centric rather than patient-centric; but if so, then patient/subject links must be made clear.

API

Internal API updates

The clinical query API specification should conform to i2b2 v1.7. Should the i2b2 community update or extend the API, tranSMART should stay in sync and follow those changes.

We need an internal SPI (Service Provider Interface) to make it easier to implement the core-api.

Currently, a query returns a domain object, which can in turn point to other domain objects that the user of the API can access. This way the API is spread across the domain objects. However, we cannot just change the client API, and from the perspective of the user of the API the way it works now is convenient. To get around this we need to add another layer of indirection: a flat SPI that can more easily be implemented. The client API calls are forwarded to this flat SPI, which in turn can be implemented by different providers for different data storage solutions.
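The layering described here can be sketched as follows. This is an illustration only, not the tranSMART core-api: every class and method name (`DataProvider`, `InMemoryProvider`, `Study`, `Patient`, `studies`) is hypothetical. The sketch keeps navigable domain objects for the client while each provider only has to implement a handful of flat methods.

```python
from abc import ABC, abstractmethod


class DataProvider(ABC):
    """Hypothetical flat SPI: a small set of plain methods, no object graph."""

    @abstractmethod
    def study_ids(self): ...

    @abstractmethod
    def patient_ids(self, study_id): ...

    @abstractmethod
    def patient_attributes(self, patient_id): ...


class InMemoryProvider(DataProvider):
    """Toy provider; a real one would talk to PostgreSQL, Oracle or i2b2."""

    def __init__(self, data):
        self._data = data  # {study_id: {patient_id: {attribute: value}}}

    def study_ids(self):
        return sorted(self._data)

    def patient_ids(self, study_id):
        return sorted(self._data[study_id])

    def patient_attributes(self, patient_id):
        for patients in self._data.values():
            if patient_id in patients:
                return dict(patients[patient_id])
        raise KeyError(patient_id)


class Patient:
    """Client-facing domain object; forwards lookups to the SPI."""

    def __init__(self, provider, patient_id):
        self._provider = provider
        self.id = patient_id

    @property
    def attributes(self):
        return self._provider.patient_attributes(self.id)


class Study:
    """Client-facing domain object that can be navigated to its patients."""

    def __init__(self, provider, study_id):
        self._provider = provider
        self.id = study_id

    @property
    def patients(self):
        return [Patient(self._provider, pid)
                for pid in self._provider.patient_ids(self.id)]


def studies(provider):
    """Entry point of the client API: unchanged for callers, SPI underneath."""
    return [Study(provider, sid) for sid in provider.study_ids()]


provider = InMemoryProvider({"STUDY_A": {"p1": {"age": 54}, "p2": {"age": 61}}})
ages = [p.attributes["age"] for s in studies(provider) for p in s.patients]
```

Supporting another storage solution then means writing one more `DataProvider` subclass; the domain objects and their callers stay as they are.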

Internal API extensions

An API for search data needs to be designed, developed and implemented. The scope for search is not clearly defined; in tranSMART 1.2 it is mostly based on the Sanofi RC2 Browse functionality. What functions should browse and search encompass? Gene-based search? Full-text search? In addition, the area of study discovery and filtering needs rethinking. This should probably be taken back to the user community as a request for more feedback.

RESTful API

For the new web UI all necessary functionality must be accessible through the RESTful API. The easiest way to implement this is by letting the implementation of the GUI drive it, by implementing calls as they are needed. We have had some good discussions about the division of work between the client side and server side. There are a number of considerations:

authorization: is the user allowed to see the patient-level data, or only summarized data? By doing more on the server, less data is exposed, and access control and access logs can be more fine-grained.

performance: if the user just wants a sum, it will be faster to compute it server-side and send the sum than to send all the data to the client. When working with large datasets we have to do the computation on the server.

implementation effort: when getting the de-identified patient-level data into the client we could use D3 to generate a wide variety of different plots, but if we have to do the computations on the server we have to do more ourselves, especially implementing a wider variety of REST calls.
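The authorization and performance considerations can be made concrete with a small sketch. It is an assumption-laden illustration, not part of the tranSMART REST API: the `OBSERVATIONS` data and the functions `patient_level` and `summary` are hypothetical stand-ins for two kinds of server endpoint, one returning rows and one returning only aggregates.

```python
from statistics import mean

# Hypothetical observations held on the server: (patient_id, age).
OBSERVATIONS = [("p1", 54), ("p2", 61), ("p3", 47), ("p4", 70)]


def patient_level(user_can_see_patients):
    """Endpoint that returns row-level data; only for authorized users."""
    if not user_can_see_patients:
        raise PermissionError("patient-level data not permitted")
    return [{"patient": pid, "age": age} for pid, age in OBSERVATIONS]


def summary():
    """Endpoint that returns only aggregates; exposes no individual rows."""
    ages = [age for _, age in OBSERVATIONS]
    return {"count": len(ages), "mean_age": mean(ages),
            "min_age": min(ages), "max_age": max(ages)}


restricted_view = summary()
try:
    patient_level(user_can_see_patients=False)
    denied = False
except PermissionError:
    denied = True
```

The summary response has a fixed size regardless of the number of patients, which is also why server-side aggregation scales better for large datasets.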

For the first prototype of the UI we should start by letting D3.js do the heavy lifting. This way we will need little adaptation on the server side; just a few REST calls need to be implemented. This also allows us to experiment with highly responsive cohort selection using Crossfilter.
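To give an idea of what client-side, crossfilter-style cohort selection involves, here is a minimal sketch of the technique in Python. It does not use the Crossfilter library or its API; the `CohortFilter` class, the `PATIENTS` records and the dimension names are all hypothetical. The key behaviour is that each diagram is counted with the filters of all other diagrams applied, but not its own, so the user can still see and change the values they filtered on.

```python
from collections import Counter

# Hypothetical de-identified patient-level records already sent to the client.
PATIENTS = [
    {"sex": "F", "age_group": "40-59", "diagnosis": "asthma"},
    {"sex": "M", "age_group": "40-59", "diagnosis": "copd"},
    {"sex": "F", "age_group": "60-79", "diagnosis": "copd"},
    {"sex": "M", "age_group": "60-79", "diagnosis": "asthma"},
    {"sex": "F", "age_group": "60-79", "diagnosis": "asthma"},
]


class CohortFilter:
    """Keeps one set of allowed values per dimension, crossfilter style."""

    def __init__(self, records):
        self.records = records
        self.filters = {}  # dimension -> set of allowed values

    def select(self, dimension, values):
        self.filters[dimension] = set(values)

    def clear(self, dimension):
        self.filters.pop(dimension, None)

    def _matches(self, record, skip=None):
        return all(record[dim] in allowed
                   for dim, allowed in self.filters.items() if dim != skip)

    def cohort(self):
        """Records that satisfy every active filter."""
        return [r for r in self.records if self._matches(r)]

    def counts(self, dimension):
        """Histogram for one diagram, ignoring that diagram's own filter."""
        return Counter(r[dimension] for r in self.records
                       if self._matches(r, skip=dimension))


f = CohortFilter(PATIENTS)
f.select("sex", ["F"])
diagnosis_counts = f.counts("diagnosis")  # reflects the sex filter
sex_counts = f.counts("sex")              # own filter ignored, both bars stay
```

Because all records are already in the browser, every selection can update all diagrams without a round trip to the server.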

transmart-server

All of these components have to be hosted in an application container, and that container needs to be configured somewhere. This will be done in the transmart-server repository, which is new in tranSMART 2.0.

User interface

A long-cherished wish for tranSMART over the last couple of years has been redesigning and reimplementing the UI. The current UI is severely lacking in security, responsiveness and ease of use. Also, it does not behave as a user would expect from a web interface: the back button and refresh do not work as expected and it is not possible to create a bookmark. We would also like to simplify development by untying the back end and the front end. In this section we will describe our plans for the 2.0 UI, starting with the requirements. We will then go into the framework that we plan to use. In the last part we will describe some use cases to give an idea of how we intend to make the UI more user friendly.

Technical requirements

The UI should be decoupled from the server code. This makes it easier to get into when not familiar with the back end or back end technologies in general.

It should run on each platform with a modern browser.

All communication with the back end must be done via the RESTful interface. This way all access passes through a single interface, making it easier to secure and to handle things like logging.

A framework should be chosen that imposes structure and good habits on the programmers; preferably we use a single framework that is easy to get into.

The first version should, functionality-wise, follow current tranSMART 1.2 functionality as much as possible. We can improve and iterate in later phases, and do A/B testing etc. This does not mean we need to repeat past mistakes or recreate anything we plan to throw away later!

There are a few points we will need to rethink, as they are fundamental limitations that the current UI has:

Support for longitudinal data should be implemented. A single patient can have multiple observations over time for a certain concept, and the UI should support this.

Support for cross-trial analysis should be implemented. (This is already partly supported in 1.2, but should work across all parts of the application.)

Where it is possible to enhance the user experience with minimal improvements we should consider this (e.g. more logical choice of widgets, icons, drag-and-drop behaviour etc.).

While redesigning the UI is not a task for the core development team alone and not fully within the scope of the workshop, the resulting restructuring of the architecture and the clear definition of the design boundaries (i.e. through APIs) will facilitate a clear definition for both the user interface and the business logic.

Framework

For the UI architecture a lot depends on the choice of framework. The number of options here is mind-boggling and a good indication that opinions differ a lot when it comes to creating the ‘best’ framework. Some frameworks try to present a fully integrated solution, others try to do a small subset very well, some take a model-view-controller (MVC) approach, others are component based, and the list goes on. After some fierce discussions the dust cleared and we decided to go with AngularJS, for various reasons that we will present after giving a short description of what AngularJS is.

AngularJS

AngularJS, or Angular, is an open-source web application framework for developing single-page applications, maintained by Google and a community of developers. Its goals are to simplify development and testing by providing a client-side MVC framework. It is easy to learn and encourages programmers to adhere to the MVC design pattern.

Why Angular?

New developers

We have stated various reasons for redesigning tranSMART at the start of this document. One of the most important ones is to try to lower the bar for new developers joining the team. We already decided to make a clear cut between the back end and front end, so developers who know only HTML, CSS and JavaScript would still be able to contribute, as well as developers who just know how to program the back end. We decided early on not to go wild on frameworks, so we wanted a single framework that is easy to learn and not too exotic. With Angular we have made a rather conservative choice compared to other frameworks and libraries we have considered, such as Polymer. Angular has been around for some time, is fairly well known, is easy to learn, and adheres to the MVC pattern, which every programmer should be familiar with.

MVC

The current implementation of tranSMART is written in Groovy, using Grails as the application framework, according to the MVC pattern. By picking an MVC framework for the client we can use the same structure for the parts that were designed correctly. Had we picked a framework with a different design pattern, it could have required us to rethink the whole structure.

Security

One of the areas where tranSMART is severely lacking is security. As each request can be forged outside of the client, the majority of security measures are implemented on the back end. On the client there is just the issue of displaying user-provided content. If this content contains characters that are ‘special’ in HTML, this can destroy the layout of the page. A malicious user would even be able to serve malware by injecting script tags. Angular facilitates serving content that has been sanitized for inclusion in an HTML page.

Another approach to enforcing security relates to the implementation of bank-grade-like encryption of the critical data, independently of the transport layer being used. TranSMART will be provided with documentation, most likely as part of its “Administrator Manual”, containing the best practices for installing an instance on premise with regard to compatibility versus security. This will be a very short document.
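The problem with displaying user-provided content can be shown in a few lines. The sketch below illustrates the general escaping idea using Python's standard library; it is not Angular's mechanism, and the `render_comment` function and the markup around it are hypothetical.

```python
from html import escape


def render_comment(user_text):
    """Wrap user-provided text in markup after escaping HTML metacharacters."""
    return '<p class="comment">' + escape(user_text, quote=True) + "</p>"


safe = render_comment('<script>alert("x")</script> 5 < 7 & 8 > 3')
```

After escaping, the injected script tag is rendered as literal text instead of being executed, and characters such as `<` and `&` no longer break the page layout.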

Architecture

The architecture for the UI is pretty simple. There is a client-side web interface that talks to the back end through a RESTful API by means of AJAX calls. This way we will have a strict separation between server and client, allowing for additional clients, easier testing and debugging, and making it easier for developers to join the effort and contribute to tranSMART.

More precisely, we propose to build, on top of jQuery and Angular, a pluggable in-browser application (transmart-base-ui) that would expose its own set of client-side APIs to custom plugins for manipulating views and states. This application would provide a skeleton layout with which it would be possible to interact (e.g. menus, tabs, panels, …) using the aforementioned APIs. This would ensure a controlled growth of the application code complexity.

Besides providing a bootstrap interface, transmart-base-ui would rely on a router component that would handle the RESTful logic and security on the network layer. This router should be able to comprehensively interpret URL schemes presented to tranSMART while structuring the relevant views, in order to drastically enhance the usability of tranSMART 2.0 (i.e. back/forward buttons). Also, the application should implement some sort of message bus allowing cross-plugin in-browser communication and isolation of data in memory, whether through the use of an already existing Angular-compatible component or a custom-developed plugin.
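The message bus idea can be sketched independently of any framework. The following is an illustration in Python rather than a design for the actual UI: the `MessageBus` class, the topic name `cohort.changed` and the two example plugins are hypothetical. Plugins never hold references to each other, only to the bus, and each subscriber receives its own copy of the payload so that one plugin cannot alter another plugin's in-memory data.

```python
from collections import defaultdict
from copy import deepcopy


class MessageBus:
    """Hypothetical in-browser style bus: plugins talk via topics only."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        # Each subscriber gets its own copy, so one plugin cannot mutate
        # the data another plugin holds in memory.
        for handler in self._subscribers[topic]:
            handler(deepcopy(payload))


bus = MessageBus()
received_by_stats = []
received_by_export = []


def stats_plugin(message):
    message["patients"].append("tampered")  # local change only
    received_by_stats.append(message)


bus.subscribe("cohort.changed", stats_plugin)
bus.subscribe("cohort.changed", received_by_export.append)

original = {"patients": ["p1", "p2"]}
bus.publish("cohort.changed", original)
```

Copying on every delivery is the simplest way to get isolation; an implementation concerned with large payloads might use immutable data structures instead.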

User Experience

After opening this section with some improvements across the entire UI, we would like to describe some use cases to give an indication of what we envisage for the new UI and how this will help the user to be more productive and also have a more enjoyable experience. We will start each use case with a goal, followed by a description of the current workflow, and present one or more workflows which we think will improve user experience and productivity. These proposed workflows are an illustration of what we can do using modern web application development methods. Naturally, the process of recommending changes needs user input. So, herein, we are hoping to encourage reactions, comments, and contributions by illustrating some possibilities.

General improvements

Create more compact user interface elements. The current diagrams, for example, are quite big because they include labels and are drawn at a large size. Now that we will be using SVG we can use mouseovers instead of having many labels. We can also adjust the size based on the data, or give the user the possibility of enlarging diagrams as required instead of drawing everything the same size.

Have a proper back button and bookmarking. While these seem like very dissimilar concepts, they both require the implementor of a web application to think about state. This state is needed to reproduce the page both when using the back button and when using a bookmark.

Make the UI more helpful. Using visual cues we can show the user what he/she can do, instead of just showing an error after the user performs an action that is not allowed.

Have a more dynamic UI. By giving more direct feedback to the user, fewer actions are required.

Make use of a single, homogeneous design language to ensure good composition of the interface and easy-to-understand user logic. The Material Design guidelines provided by Google were mentioned several times during our original workshop (http://www.google.com/design).
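The point about state behind the back button and bookmarks can be illustrated with a short sketch. It assumes, purely for illustration, that the view state is a small dictionary and that it is stored in the URL query string; the path `/cohorts` and the keys `view` and `concept` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def state_to_url(base, state):
    """Serialize view state into the URL so it can be bookmarked."""
    query = urlencode(sorted(state.items()), doseq=True)
    return base + "?" + query


def url_to_state(url):
    """Rebuild the view state from a bookmarked or 'back' URL."""
    return parse_qs(urlsplit(url).query)


state = {"view": ["summary"], "concept": ["age", "sex"]}
url = state_to_url("/cohorts", state)
restored = url_to_state(url)
```

If every user action that changes the page also changes the URL in this way, the browser history and bookmarks work without further effort, because any URL is enough to reproduce the page.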

Use Cases

Summary Statistics

This section examines the parts of the tranSMART user interface that deal with the concept tree, cohort selection, and viewing the summary statistics.

Current workflow

1. Find the concept in the tree by expanding nodes.

2. Add to cohort selection: click, drag and drop a node from the concept tree into a subset selection “box”. This will select the concept and all its subconcepts, if any. Additional constraints can be set for numerical values e.g. drag the age item to select the age concept and add a constraint to get all subjects

3. Switch to the Summary Statistics view.

4. Additional concepts can be added by dragging them from the concept tree to the summary statistics view.

Proposal(s)

We have several proposals for this use case. The basic idea for improvement is the same: using interactive diagrams to do cohort selection. The difference is in how we use this idea in combination with the concept tree. We will list the proposals followed by some comments.


Proposal 1

1. Find the concept in the tree and just click it. This will select the concept. Immediately diagrams will be shown to the right containing the summary statistics for the subjects in this concept.
2. (Optionally) Interact with the diagrams to apply additional constraints or remove them again. The diagrams will be live updated to show the statistics of the selection.
3. (Optionally) Save the cohort for later use.

Proposal 2

1. Find the concept in the tree and just click it. This will select the concept. Immediately diagrams will be shown to the right containing the summary statistics for the subjects in the parent cohort but with the concept filters already applied.
2. (Optionally) Interact with the diagrams to apply additional constraints or remove them again. The diagrams will be live updated to show the statistics of the selection.
3. (Optionally) Save the cohort for later use.

Proposal 3

1. Find the cohort in the tree and click it. This will select the cohort. Immediately diagrams will be shown to the right containing the summary statistics for the subjects in this cohort.
2. (Optionally) Interact with the diagrams to apply additional constraints or remove them again. The diagrams will be live updated to show the statistics of the selection.
3. (Optionally) Save the cohort for later use.

Discussion

In proposal 1 the diagrams show just the subjects in the selected concept. In proposal 2 they show the subjects in the cohort, with the concept filters already enabled in the diagrams. In proposal 3 we actually remove the concepts from the concept tree, because they are no longer needed, and only the cohorts are shown.

Proposal 1 has the drawback that the filters applied by choosing the concept are not visible in the diagrams and thus cannot be adjusted or disabled. Proposal 3 provides a strict separation between selecting the cohort and the concepts, making it easier to implement and understand. Proposal 2 sits between proposals 1 and 3 but also has a major drawback: when an adjustment is made on the diagrams, it should also be reflected in the selection in the concept tree to be consistent, but this is not always possible, because not every constraint imposed by adjusting the diagrams can be expressed in the concept tree. This can be confusing.

We recommend proposal 3, especially because we would like saved cohorts to appear in the cohort tree as well; if all the concepts were also in there, it could become more difficult to find what you are looking for. You may wonder whether this will put too many diagrams on the screen. We can handle this by collapsing them according to the tree structure. We would also like to give users the option of rearranging the diagrams so they can create a view that fits their purposes.

(Cross-study) cohort comparison

In this section we look at cross-study comparison and how we can streamline the workflow there. The first few steps of the workflow are basically the same as the steps for the summary statistics, so we will want to reuse the mechanism here.

Current workflow

1. Find the concept in the tree by expanding nodes.
2. Add to cohort selection: click-drag-drop a node from the concept tree into a subset selection “box”. This will select the concept and all its subconcepts, if any. Additional constraints can be set for numerical values, e.g. drag the age item to select the age concept and add a constraint to get all subjects
3. (Optional) Switch to the Summary Statistics view if you would like to examine the selection in more detail.
4. (Optional) Go to 1 if you would like to add more concepts until you are happy with the selection.
5. Go to the summary statistics to see the comparison.

Proposal(s)

For this use case we examined two approaches. The first is an integrated approach, which is powerful but slightly more complicated both to implement and to use. The second is an easier approach, but it is less fluid and requires either more planning or more actions by the user. We come back to this in the discussion; first we look at the proposals.

Proposal 1

1. Find the cohort in the tree and click it. This will select the cohort. Immediately diagrams will be shown to the right containing the summary statistics for the subjects in this cohort. Additional cohorts can be added by clicking a plus sign next to the cohort, and removed by clicking a minus sign. This way a cohort can also be selected multiple times, e.g. to compare males and females in the same study. The diagrams will show the data for both cohorts.
2. (Optionally) Interact with the diagrams to apply additional constraints or remove them again. The diagrams will be live updated to show the filtered statistics of the selected cohorts.
3. (Optionally) Save the cohorts for later use.

Proposal 2

1. Go through the process of selecting a cohort as described in the summary statistics use case and save it.
2. Go to the (cross-study) comparison tab and select the saved cohorts to display the diagrams.
3. (Optionally) Save the cohorts for later use.

Discussion

Proposal 1 is attractive because it allows the user to do everything in a single screen. There does not even need to be a distinction between summary statistics and comparison: when more than one cohort is selected we have a comparison, otherwise we have a summary. A problem may be the amount of data that has to be displayed in a single screen. Giving the user the option of rearranging or collapsing diagrams will mitigate this problem. The nice thing about proposal 2 is that it is very simple and easy to understand, and it encourages the user to save cohort selections. However, when the user wants to adjust the selection we can either send them back to cohort selection, which is tedious, or allow them to work on the diagrams directly, which has the same drawbacks proposal 1 may have. For this reason we recommend proposal 1.
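The single-screen idea can be sketched in a few lines: constraints set on the diagrams filter the subjects, and the same rendering step yields a summary for one cohort or a comparison for several. The subject fields (`sex`, `age`) and the function names are assumptions for illustration, not an existing tranSMART API.

```python
from statistics import mean


def apply_constraints(subjects, constraints):
    """Keep the subjects that satisfy every (field, predicate) constraint."""
    return [s for s in subjects if all(pred(s[f]) for f, pred in constraints)]


def render_statistics(selected):
    """Build one panel per selected cohort.

    `selected` is a list of (label, subjects) pairs. One cohort gives a
    summary, more than one gives a comparison; the code path is the same.
    """
    panels = [
        {
            "label": label,
            "n": len(subjects),
            "mean_age": mean(s["age"] for s in subjects) if subjects else None,
        }
        for label, subjects in selected
    ]
    mode = "comparison" if len(panels) > 1 else "summary"
    return {"mode": mode, "panels": panels}
```

Selecting the same study twice with different constraints, for example males and females, is then just two entries in `selected`.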

Plugins

TranSMART 2.0 needs to be an extremely flexible and maintainable piece of software. With a growing community that has great potential for contribution, the core team should propose elegant structures to ease the extension of the application at the UI level. It should also promote the use of common structures for writing these extensions and enforce specific design patterns that we can rely on to avoid compatibility breakage within the lifetime of the program.

To do this we must present a defined layout for the application and decide which elements within this UI can be modified, extended, or complemented (e.g. menus, content items, panels, …). All of these actionable elements should be documented in a developer guide and exposed via comprehensive APIs. These APIs must be implemented in such a way that their underlying logic can be used by the plugins themselves to declare their own extension areas and components. Ultimately some testing mechanism should be in place for plugins to demonstrate their good behaviour. This could be done through version checks, method existence tests, signature matching, … The mechanism must remain simple yet comprehensive. Ideally, if a problem is detected with a plugin, it should not be activated, to ensure the interface remains responsive and usable. Plugins must also have access to a range of information about the system from within the UI, and a degree of control over it should be possible. From a security standpoint, we might want to limit access to a given part of the application (seen as a plugin) to a given user. Plugins could possibly offer encryption of the data flow up to the server.
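The checks mentioned above (version check, method existence test, signature matching) could look roughly like the following sketch. The hook names `activate` and `deactivate`, the `requires_api` attribute, and the version rule are all hypothetical; the actual plugin contract is still to be designed.

```python
import inspect

# Hypothetical core API version as (major, minor).
CORE_API_VERSION = (2, 0)

# Hypothetical plugin contract: required method name -> expected parameter names.
REQUIRED_HOOKS = {"activate": ["context"], "deactivate": []}


def check_plugin(plugin) -> list:
    """Return a list of problems; an empty list means the plugin may be activated."""
    problems = []

    # Version check: same major version, minor not newer than the core.
    required = getattr(plugin, "requires_api", None)
    if not isinstance(required, tuple) or len(required) != 2:
        problems.append("missing or malformed requires_api")
    elif required[0] != CORE_API_VERSION[0] or required[1] > CORE_API_VERSION[1]:
        problems.append(f"incompatible API version {required}")

    for name, expected in REQUIRED_HOOKS.items():
        hook = getattr(plugin, name, None)
        # Method existence test.
        if not callable(hook):
            problems.append(f"missing method {name}")
            continue
        # Signature matching (bound methods do not list `self`).
        actual = list(inspect.signature(hook).parameters)
        if actual != expected:
            problems.append(f"{name} takes {actual}, expected {expected}")
    return problems


def activate_all(plugins, context) -> list:
    """Activate only the plugins that pass every check; skip the rest."""
    active = []
    for plugin in plugins:
        if not check_plugin(plugin):
            plugin.activate(context)
            active.append(plugin)
    return active
```

A plugin that fails any check is simply never activated, so a faulty plugin cannot make the rest of the interface unusable.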

Architecture We should provide a structure for packaging and dependency management. We may need to provide some additional services such as job management or a notification system. These could be implemented as plugins. Everything we do in this area must be well documented and we should make it as easy as possible to get started.

Front End Plugin

We will need a way for plugins to extend the UI.

We should define a mechanism through which plugins can share data, e.g. a cohort selection.

The core UI needs a mechanism to allow a plugin to use a section of the screen.

On interactive screens the core UI needs to provide metadata about the objects that are shared across the core-plugin boundary. For example, if a cohort is selected in the concept tree for advanced analysis in a plugin, the plugin must know about this.

Common functions should be provided in a library so plugin creators do not have to write these themselves. Examples could be cohort creation, getting data from the back end, components for things like value selection, and so on.
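One possible shape for the data-sharing mechanism is a small publish/subscribe service owned by the core UI. This is a sketch under assumptions: the class name `SharedContext` and the key `cohort.selection` are invented for illustration.

```python
from typing import Callable, Dict, List


class SharedContext:
    """Hypothetical core-UI service through which plugins exchange data."""

    def __init__(self) -> None:
        self._values: Dict[str, object] = {}
        self._listeners: Dict[str, List[Callable[[object], None]]] = {}

    def subscribe(self, key: str, listener: Callable[[object], None]) -> None:
        """Register a listener; replay the current value so late plugins catch up."""
        self._listeners.setdefault(key, []).append(listener)
        if key in self._values:
            listener(self._values[key])

    def publish(self, key: str, value: object) -> None:
        """Store the value and notify every listener registered for the key."""
        self._values[key] = value
        for listener in list(self._listeners.get(key, [])):
            listener(value)
```

With this in place, the concept tree would publish a cohort selection once, and any analysis plugin that subscribed to the same key would be told about it, including plugins that are loaded afterwards.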

Backend Plugin

Backend plugin developers should be able to extend the database schema, e.g. to save things like state.

If extension of core classes appears to be needed, we need to decide whether the new behavior is common enough to put in the actual core, or whether it should be handled in the plugin.

By adding REST endpoints a backend developer can provide additional server-side functionality to a client plugin.

Proposed plugins

transmart-activity

As part of the Standard Distribution we plan to release “transmart-activity” and “transmart-activity-ui”, a pair of plugins for the backend and frontend respectively, providing an API endpoint that allows developers to send operational information and notifications to an activity feed. This activity feed can be viewed by users, who can also set alerts in order to receive emails for certain notifications. It is not meant to replace the audit capability of the platform, which will remain in the core.
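To make the intended behaviour concrete, here is a minimal in-memory sketch of such a feed. The class and field names are hypothetical, and emails are only queued in an `outbox` list rather than sent.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Activity:
    """Hypothetical feed entry posted by a plugin."""
    source: str   # plugin that sent the notification
    kind: str     # notification type, e.g. "job.finished"
    message: str


class ActivityFeed:
    """In-memory stand-in for the activity feed; no real email is sent."""

    def __init__(self) -> None:
        self.entries: List[Activity] = []
        self._alerts: List[Tuple[str, str]] = []          # (kind, email)
        self.outbox: List[Tuple[str, Activity]] = []      # queued (email, activity)

    def set_alert(self, email: str, kind: str) -> None:
        """A user asks to be emailed about notifications of one kind."""
        self._alerts.append((kind, email))

    def post(self, activity: Activity) -> None:
        """Append to the feed and queue an email for each matching alert."""
        self.entries.append(activity)
        for kind, email in self._alerts:
            if kind == activity.kind:
                self.outbox.append((email, activity))
```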

transmart-results One of the things people would like to be able to do is to store results of their computations. Although there is something there in tranSMART 1.2, it is not complete and many users are not aware of this. We would like to propose a plugin to handle storing these results, which should be usable by other plugin makers to store results as well. We did not, however, discuss this in very much detail.

transmart-analytics We would like to move the advanced analyses into plugins but this requires quite a bit of planning. For example, it would be nice if we would have one central location where all long running jobs can be displayed but this may not be possible when using frameworks (e.g. Shiny) that have both a server and a client side component. We did define some steps we could take to get to a more detailed planning.

Identify the current components involved in creating analysis results (assumptions during ETL, Rmodules (R scripts and jobs), transmart-core-db / projections, configuration JSON in the DB, and controllers).

Propose a more streamlined way of creating an analysis UI environment, with better analysis transparency and reproducibility (see also UI below).

Technologies for expressing and encapsulating computations, such as Shiny (R-based) or IPython Notebook (Python-based), need to be evaluated in the context of an Advanced Analyses plugin. That is, what should the architecture of tranSMART be to enable developers to quickly stand up an advanced analysis interface in tools such as Shiny? → Importance of the (RESTful) API as the basis of the application. We could test this by identifying how the RESTful APIs allow building specific deep-dive analysis interfaces in e.g. Shiny, IPython and Spotfire.

Deferred

Of course we did not manage to discuss every possible improvement to include in tranSMART 2.0 in just one week. That does not mean that these points are not important, or that they will be tackled only after we have done everything else described in this document. We have also discussed things whose implementation will be deferred, either because the discussion did not lead to a viable conclusion, or because we concluded that it would be too much work for little gain. We mainly focussed the discussion on the plugin architecture and the UI, because the UI is in dire need of repair and the plugin architecture allows for easier division of work and hence higher productivity. This way we hope that we will get started on some of these deferred points as soon as possible, or maybe someone else will pick them up by writing a plugin!

ETL

The wish was expressed to extend the current API, which is essentially a set of read operations on the data model, with operations to load data. Creating such an API would unify the ETL process with the query API and make it easier for third party developers to create data loading tools. We do not plan to grant this wish, either through a RESTful API via HTTP or through an internal Java API. We are already building transmart-batch to simplify the ETL.

Integration with the RESTful API is a particularly poor fit because the entities that the API would need to expose (concepts, observations, patients, …) should be validated simultaneously. Though it would be possible to reuse these resources for data upload in conjunction with some mechanism to group the new entities together and perform a final validation step, the development cost would be prohibitive when weighed against the benefits. In fact, the current workflow is already quite simple, and together with transmart-data, downloading the dataset and pushing it into tranSMART can be done with one command.

A simpler alternative would be creating a separate resource, /job, with a subresource for each job (here meaning “job type”). A multipart request or tarball containing the input that the current command line tools need could then be POSTed to this subresource (transmart-batch could also be run as a daemon application). If transmart-batch is used as the backend for loading the data, all the job instance and step information could be provided in the responses, and the job could be resumed, paused or cancelled at any point. This is a more feasible solution, but it still provides few benefits when compared to an off-the-shelf solution like the Spring Batch Admin interface (though this would not cover job instance creation).
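The /job alternative can be illustrated with a small in-memory model of the resource and its pause, resume and cancel operations. This is only a sketch of the idea: the class, the state names and the job type used in the usage note are assumptions, and nothing here reflects the actual transmart-batch or Spring Batch interfaces.

```python
import itertools
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    CANCELLED = "cancelled"


# Allowed (action, current state) -> next state, for the operations named in the text.
TRANSITIONS = {
    ("pause", JobState.RUNNING): JobState.PAUSED,
    ("resume", JobState.PAUSED): JobState.RUNNING,
    ("cancel", JobState.RUNNING): JobState.CANCELLED,
    ("cancel", JobState.PAUSED): JobState.CANCELLED,
}


class JobResource:
    """In-memory stand-in for a hypothetical /job resource."""

    def __init__(self, job_types):
        self._job_types = set(job_types)   # one subresource per job type
        self._ids = itertools.count(1)
        self._jobs = {}

    def post(self, job_type: str, payload: bytes) -> dict:
        """Mimics POSTing an input archive to /job/<job_type>; returns the new job instance."""
        if job_type not in self._job_types:
            raise KeyError(f"unknown job type: {job_type}")
        job_id = next(self._ids)
        self._jobs[job_id] = {
            "id": job_id,
            "type": job_type,
            "state": JobState.RUNNING,
            "input_bytes": len(payload),
        }
        return dict(self._jobs[job_id])

    def control(self, job_id: int, action: str) -> dict:
        """Apply pause, resume or cancel; reject transitions that make no sense."""
        job = self._jobs[job_id]
        key = (action, job["state"])
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} a job that is {job['state'].value}")
        job["state"] = TRANSITIONS[key]
        return dict(job)
```

For example, `JobResource({"clinical"}).post("clinical", archive_bytes)` would create job 1 in the running state, where `"clinical"` is a made-up job type name.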

Data Repository interfaces

Under the version 2 architecture, data repository interfaces with tranSMART are handled by creating an interface to the data model as an implementation of the APIs. As such, access to data repositories (e.g. the eTRIKS Data Repository) becomes a matter of backend plugins, but this is not the responsibility of the core development team.

The Sanofi Search & Browse data, business logic, and user interface were not covered in this workshop. While these features would be implemented as a set of plugins, we still need to do some design work to support the creation, inclusion, and activation of a bundle of plugins such as that implied by this feature.

Data updates

The need for incremental updates of the data is recognized as a problem. There is a lot involved in facilitating this, e.g. versioning and audit trails, and it is a cross-cutting concern from ETL through the backend to the UI. This requires a design effort that is outside the scope of the workshop and the current push to v2.0. It is noted here “for the record.”

Also, calculations, such as the Z-score calculation, should probably be moved from the database to the appropriate analysis plugins. This would not be fixed in ETL.

Timeline

Because the design, implementation, testing, and deployment of demos of tranSMART 2.0 is a large undertaking, planning and execution will be one of the most challenging parts of this project. We have given an extensive overview of the things discussed during the tranSMART architectural workshop and afterwards, and obviously a lot of work is involved. Given the goals we set for this workshop and document, a lot of emphasis was put on the plugin architecture and the separation of server side and client side, because these will bring the biggest improvements and make it easier to tackle other issues later on. We would like to start work on these improvements as soon as possible. They can be started in parallel, as there is no strong dependency between them. Note that this document is by no means exhaustive with respect to the changes that are needed. The process of fixing bugs and adding features is also driven by demand from the customers and institutions that are financially backing the core developers.

Initial plans

UI

The UI is fairly separate from the rest of the work, so it should be quite easy to get people working on this. On the server side the RESTful API will need to be extended to support all necessary operations. This will be driven by demand from the UI implementation. Parts of the code of transmartApp can be reused, but it may be quite a bit of work to do so. A lot of course depends on the number of people working on this and their level of cooperation. Ideally we would have a small team working on it in the same geographical location, along with some users of the system for testing, but this will be hard to achieve.

Given the extent of tranSMART and the number of graphs, building a complete UI could easily take an entire year. However, coming up with an initial prototype should be doable in just a couple of months. This should at least contain the concept tree browser, cohort selection, summary statistics and perhaps the grid view. It should be possible for the user to experience the look and feel of the UI so we can gather additional user input. We also need to set up the plugin architecture and supply the necessary resources to get started writing plugins. Because the UI is completely new, we can experiment a bit without any major consequences.

After the client plugin architecture is finished and we have a UI prototype, we will need to write the necessary documentation for developers who would like to extend it with a plugin. When this is in place it will be a lot easier to distribute the implementation of the rest of the UI.

Backend plugin architecture Estimating the time and requirements needed for modularizing the back end is a lot more difficult. During the workshop we have had some rather abstract discussions about it, as reflected by this document as well; the specifics still need to be fleshed out. This can only be done on close examination of the system and the way the pieces interact. We may even find that some pieces are too entangled to separate in a reasonable period of time. Another problem is resources: this will not be something most users will be interested in, or will even understand, so procuring the necessary work force may prove to be difficult. Given the importance of this goal and the reliance of the other components on it, we do want to get started as soon as possible. During the process of building the GUI prototype we should flesh out the details for the plugin architecture.

Miscellaneous/Maintenance

In several places in this document smaller changes are proposed, some of them more urgent than others. We are not going to incorporate them in a long-term plan: it would not be worth the effort and it would be hard to prioritize them. Some changes are fixes for current issues and need to be applied when users request them; others are needed to facilitate the transition to tranSMART 2.0 and will be driven by the other developments.

Next steps

After completing the first prototype and fleshing out the details of the plugin architecture, the obvious next steps are to continue working on the UI and to implement the backend plugin architecture. This will involve adapting the core but also the various modules that are being turned into plugins. The good news here is that the majority of the proposed backend components are already developed and present in tranSMART 1.2, but several substantial changes still have to be made. When the plugin architecture is in place, it will be a great time to continue adding features to tranSMART by writing new plugins. This would include plugins to extend the API to include high dimensional data, search and filtering plugins, and a set of standard plugins for advanced analysis. Further details and a concrete timeline for these next steps will be determined when we get there.
