[ieee 2014 ieee eighth international conference on research challenges in information science (rcis)...

An integration architecture framework for e-genomics services

David Roldan Martinez

Oscar Pastor Lopez

Mercedes Rossana Fernandez Alcala

Universidad Politecnica de Valencia

Abstract-Next Generation Sequencing (NGS) technologies are becoming a commodity and, thus, genomic services delivered online (e-genomics) are a growing market. This has led to the development of a plethora of tools that traditionally covered only a small set of features of the process, although more complete tools have arisen recently. In such a scenario, integration between tools of different provenance becomes key to market evolution and consolidation. This paper proposes a service-oriented framework that factors out the core services required to support NGS applications by identifying these services. This forms a sound basis for providing a toolkit and a reference implementation, whose details are introduced in this work.

Keywords- e-genomics, NGS, sequencing, DNA, integration, services, interoperability

I. INTRODUCTION

Recent advances in genome sequencing technologies provide unprecedented opportunities to characterize individual genomic landscapes and identify mutations relevant for diagnosis and therapy. Specifically, whole exome sequencing using next-generation sequencing (NGS) technologies is gaining popularity in the human genetics community due to the moderate costs, manageable data amounts and straightforward interpretation of analysis results [3].

While whole-exome and, in the near future, whole-genome sequencing are becoming commodities, data analysis still poses significant challenges and has led to the development of a plethora of tools that either support specific parts of the analysis workflow or provide a complete solution [1].

This paper is organized as follows. The next section presents the state of the art, identifying the problem and its current solutions; its intention is to show why a different solution strategy is required in this genomic domain. Section 3 describes the need for an open framework, explaining why we have chosen a service-oriented paradigm, highlighting the expected benefits of this approach, and showing how the framework is derived from current NGS platforms and applications. Section 4 briefly overviews the current state of the framework, identifies the services included, and shows how to apply the framework to a sample case consisting of the generation of a genetic test report. In Section 5 we explore some bindings that serve as implementation notes and, finally, in Section 6 we present conclusions and future work.

II. STATE OF THE ART

Although we focus our attention on NGS, it is worth considering that NGS services should be integrated into larger e-Health IT solutions whose objectives will be, among others, to provide better patient care and to reduce costs while minimizing overhead and facilitating synergies between e-Health stakeholders [18].

In recent years a number of review articles have been published to facilitate the choice of the most suitable tool for a particular application [3]. However, most of them review only selected components of NGS data analysis ([4]-[8]), or cover multiple steps of the workflow [3] but not data analysis and interpretation.

To the best of our knowledge, a comprehensive review that studies the problem from a holistic Information Systems Engineering perspective based on conceptual modeling concepts [13,14], and that includes a well-designed service-oriented view to provide an efficient integration framework, has not been reported yet. Such a framework would be tremendously helpful for researchers planning an NGS project, but also for biosoftware developers [15,17]. In addition, by focusing on services and processes, the framework addresses issues such as data handling and tool compatibility, which are neglected when only individual components are reviewed. We therefore initiated this study to design an integration framework to provide genomic services. The intention is to facilitate the integration of commercial, home-grown, and open source components and applications by agreeing on common service definitions, data models [16], and protocols.

Different tools have been developed depending on the data format, because the data format is the common source of trouble in the extraction procedure. The GFF format (Generic Feature Format Version 3) [20] specifies genomic/genetic features (genes, exons, CDS, etc.) and their properties (name, sequence, dbxref, etc.) using some predefined fields and rules and also ontology terms. It is widely used by the community to annotate variations with structural data. Alternatively, GVF (Genome Variation Format) is a specialization of the GFF format to describe variations relative to a reference genome. It is widely used by the community to annotate variations with database data [21].

978-1-4799-2393-9/14/$31.00 ©2014 IEEE
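As an illustration of the "predefined fields and rules" mentioned above, the following minimal sketch parses a single GFF3 feature line into its nine tab-separated columns. The class name `Gff3Feature`, the function `parse_gff3_line` and the sample line are hypothetical and chosen only for this example; the sketch ignores directives, comments and multi-valued attributes that a full parser would need to handle.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import unquote


@dataclass
class Gff3Feature:
    """One feature line of a GFF3 file (hypothetical helper type)."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Gff3Feature:
    """Parse one tab-separated GFF3 feature line; '.' marks an empty field."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols

    # Column 9 holds semicolon-separated key=value pairs (URL-escaped).
    attributes: Dict[str, str] = {}
    if attrs != ".":
        for pair in attrs.split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[unquote(key)] = unquote(value)

    return Gff3Feature(
        seqid=seqid,
        source=source,
        type=ftype,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        phase=None if phase == "." else int(phase),
        attributes=attributes,
    )


# Invented sample line, for illustration only.
example = "chr17\texample\texon\t1000\t1200\t.\t+\t.\tID=exon1;Name=demo_exon"
feature = parse_gff3_line(example)
```

Because GVF is described as a specialization of GFF, a parser of this shape could in principle be reused for GVF lines, with the variation-specific attributes appearing in the last column.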

Current bioinformatics environments use tools such as "Functional annotation of genetic variants from high-throughput sequencing data" (ANNOVAR) [22], the genetic variant annotation and effect prediction toolbox (SNPEff) [23] and the Variant Effect Predictor (VEP) [24]. Table I summarizes the data managed by these bioinformatics tools:

TABLE I. SUMMARY OF SOME BIOINFORMATICS TOOLS

Attributes compared (rows): Structural (genes, exons); Structural (Ensembl); RS; HGVS; Population allele frequency; Chr & Gene; Variation type; Polymorphism type; Position range; Allele frequency; Coverage; Transcript.

[Per-tool support marks not recoverable]

The current state of integration of different genomic tools and platforms commonly relies on a data loading approach: platforms are integrated through a data connection in which one platform consumes the data that the other platform produces. The main drawback of this scheme is that a middleware between both platforms is needed to make it work. Additionally, that middleware is hardly reusable. Fig. 1 depicts this situation.

[Figure: Platform 1 (DDBB 1) and Platform 2 (DDBB 2) connected through a MIDDLEWARE]

Fig. 1. Integration of genomic platforms (traditional approach).

However, there have been efforts to reuse this kind of middleware. One of the most remarkable is BioDAS [19]. BioDAS (Bioinformatic Distributed Annotation Systems) defines a communication protocol for exchanging annotations on genomic or protein sequences, which allows this information to be published in a distributed fashion by building a DAS Server. On the other side, a DAS Client can be implemented that is in charge of visualization. Fig. 2 depicts the architecture of a DAS network. The advantages of this system are that control over the data is retained by data providers, data is freed from the constraints of specific organizations, and the usual issues of release cycles, API updates and data duplication are avoided. From a technology point of view, the DAS communication protocol between DAS client and DAS server is based on XML/REST web services.

[Figure: one Reference Server and several Annotation Servers]

Fig. 2. DAS network architecture.
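To make the XML/REST exchange between a DAS client and a DAS server more concrete, the sketch below builds the URL a client might request to obtain the annotations for a region of a reference sequence. The path layout (`/das/<source>/features?segment=<ref>:<start>,<stop>`) reflects our understanding of the DAS 1.x `features` command and should be checked against the BioDAS specification [19]; the server name, the data source name and the coordinates are invented for the example, and no network request is made.

```python
from urllib.parse import urlencode


def das_features_url(server: str, source: str, segment: str,
                     start: int, stop: int) -> str:
    """Build a DAS-style 'features' request URL (assumed DAS 1.x layout)."""
    # Keep ':' and ',' unescaped so the segment reads ref:start,stop.
    query = urlencode({"segment": f"{segment}:{start},{stop}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


# Hypothetical annotation server and data source.
url = das_features_url("http://das.example.org", "demo_source",
                       "17", 1000, 2000)
```

A client would issue an HTTP GET on such a URL and receive an XML document listing the features in the requested segment, which it could then merge with the responses of other annotation servers.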

Although DAS is heavily used in the genome bioinformatics community, some aspects, such as metadata management or format conversion, are out of the scope of the specification. Thus, there is still a need for an approach that gives a global solution, as the e-Genomics Framework does.

Additionally, if we look at most genomic working environments today, whether implemented or planned, they tend to look something like Fig. 3. In such a scenario we have three main lab tools and a portal that links their functions together in order to deliver some kind of service to the end user. Think, for example, of an e-commerce site selling genetic tests over the Internet.

Fig. 3. Common genetics services delivery scenario

Usually, there would be some amount of communication between the components, which becomes more obvious when we open up the boxes to see what these components actually contain as shown in Fig. 4:

Fig. 4. Architecture of a genetics services delivery scenario, expanded to show the components they contain

The communication between the components is often difficult and complex because there is considerable overlap of functions and data within the components. As can be seen, each system tries to manage authentication on its own (making single sign-on more difficult). These overlapping functions mean a lot of data replication and process overhead.

However, if shared functions are moved out of the tools and packaged in a common framework in such a way that they are available to any application needing them, we evolve to the situation shown in Fig. 5, which depicts the goal of this work and its intended contribution.

Fig. 5. Architecture of a genetics services delivery scenario with common services moved out of the applications

Thus, with the use of a framework several things change:

• Shared components have to be very well defined so that they can be used by most current and future applications.

• Data and process replication is avoided, as all the applications use the same shared resources.

• Individual tools are simpler, easier to maintain and easier to code.

The e-Genomics Framework is a service-oriented factoring of the core services required to support NGS applications, portals and other user agents. Each service defined by the Framework is envisaged as being provided as a networked service within an organization, typically using either Web Services or a REST-style HTTP protocol (as BioDAS does [19]).

When we embark on this kind of analysis, we focus on what kinds of services are needed in the overall architecture to provide certain kinds of behavior from NGS tools. The framework supports the development by companies of their own architectures, using a flexible service-oriented approach.

The framework does not aim to build a generic solution; in fact one of the primary goals of the framework is to encourage "coherent diversity", by providing common toolkits and service definitions which can then be used to meet the diverse goals of the NGS and e-genomics market.

Although the framework provides support for organizations developing service-oriented architectures, it does not presume that all organizations will want to do so, or that those who do adopt this approach will want to do so across the whole of the organization.

The ultimate aim of the Framework is, for each identified service, to be able to reference an open specification or standard that can be used to implement the service. Further work will consider the ability to provide open-source implementation toolkits such as Java and C# code libraries to assist developers.

Consequently, the goal of this paper is to introduce a technical framework designed to support e-genomics. It is not intended to be prescriptive, nor will it restrict the choice of systems that organizations may purchase (whether commercial, freeware or open source). Instead, what we hope to present in this work is a set of patterns that can be used to implement a variety of e-genomics tools.

III. THE CASE FOR A TECHNICAL FRAMEWORK TO SUPPORT E-GENOMICS

If a framework is worthwhile, it must lead to benefits for scientists, for researchers and for companies [9]. At this point, it is key to distinguish between these three groups of stakeholders: scientists are users who generate raw genomic data; researchers are users who consume this information to generate processed genomic data; and, finally, companies are users whose use of genomic data is focused on a business model. These three groups are quite general and in some situations may overlap.

It is our belief that the framework will provide support in these situations and we describe very briefly the benefits it offers.

Although we set out a service framework for e-genomics, we make no assumptions about how many services are deployed in a particular instance. Our starting point has been the "very high level Use Case" of a laboratory that wants to complete a genetic test. We are aware that this is a simplification, and part of the evolution of this framework depends upon teasing out the complexities of much finer-grained processes and looking at NGS applications from a variety of viewpoints. We are also very aware that this is only one perspective, and that there are other areas, such as logistics, HR and finance, which may also benefit from the approach taken. Although the services defined for this framework may be usable for purposes other than e-genomics, this is the domain of interest for this work.

A. The inclusion of "Service Oriented Architecture into a Genomic Environment" (EGF)

A service-oriented architecture is an approach to joining up systems within enterprises [9]. It is a relatively new approach, but it is rapidly gaining popularity because of its lower integration costs coupled with flexibility and simpler configuration. Service-oriented architecture builds upon the experience of using Web Services for integration.

In a service-oriented architecture, the application logic contained in the various systems across the organization is exposed as services, which can then be utilized (consumed) by other applications. This approach is somewhat different to two other common ways of integrating systems, which are to integrate at the user interface level using web portals, or at the data level by creating large combined datasets.

A service-oriented approach does not preclude also using portals or data warehouses, and is in fact agnostic about how the rest of the enterprise is configured, which is why it makes a good approach for a framework.

However, because integration occurs in this fashion, it becomes a simpler task to replace the systems that provide services within the architecture. Because service consumers are configured to access a service without any knowledge of the system that provides the service, we can replace the underlying system without affecting systems dependent on its capabilities.
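The replaceability argument above can be sketched in a few lines: the consumer depends only on a service interface, so the provider behind it can be swapped without touching the consumer. All names below (`AlignmentService`, `LocalAligner`, `RemoteAlignerStub`, `count_differences`) are hypothetical, and the position-by-position comparison is only a toy stand-in for a real alignment algorithm.

```python
from typing import List, Protocol


class AlignmentService(Protocol):
    """Interface the consumer relies on; it says nothing about the provider."""

    def align(self, sample: str, reference: str) -> List[int]:
        """Return the positions at which sample and reference differ."""
        ...


class LocalAligner:
    """Provider 1: compares the sequences in-process."""

    def align(self, sample: str, reference: str) -> List[int]:
        return [i for i, (a, b) in enumerate(zip(sample, reference)) if a != b]


class RemoteAlignerStub:
    """Provider 2: stands in for a system reached over the network."""

    def align(self, sample: str, reference: str) -> List[int]:
        differences = []
        for pos in range(min(len(sample), len(reference))):
            if sample[pos] != reference[pos]:
                differences.append(pos)
        return differences


def count_differences(service: AlignmentService,
                      sample: str, reference: str) -> int:
    """Consumer code: unaware of which provider implements the service."""
    return len(service.align(sample, reference))


# The same consumer call works with either provider.
with_local = count_differences(LocalAligner(), "ACGTTGCA", "ACGTAGCA")
with_remote = count_differences(RemoteAlignerStub(), "ACGTTGCA", "ACGTAGCA")
```

In a deployed architecture the interface would be a Web Service or REST contract rather than a Python protocol, but the dependency structure is the same.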

In summary, service-oriented architectures have a number of features that make them attractive for NGS applications over the Internet [10][11]:

• They are agnostic with regard to platform choices and types of existing systems

• They are less expensive to implement

• Services can be used without knowledge of the internal workings of the system providing the service, allowing systems to be replaced without causing widespread disruption

• Services enable non-replaceable legacy systems to interact with new applications

• By providing access to functionality rather than user interfaces or data, they enable organizations to develop applications that relate better to the tasks they want to perform without duplicating the functionality of existing systems, instead leveraging existing investment in software.

• Service-based architectures can be reconfigured to meet changing operational requirements or reflect organizational change.

B. EGF: the service oriented solution

The framework is intended to support development of flexible, service-oriented architectures in a number of ways:

• Providing a reference set of service definitions

• Providing toolkits to assist developers

• Coordinating related efforts such as standards and shared services

By providing a common set of service definitions, we enable communities to have a shared vocabulary for discussing their scientific, research or commercial activities: the framework enables different organizations in different sectors to communicate with one another. In the future, by providing toolkits - not complete solutions - we both enable organizations to build solutions, and also provide assistance for both the commercial sector and the open source community to provide solutions that operate within organizational architectures.

Applications developed using the framework for guidance can, because they have a common specification, be reused far more easily by other organizations, facilitating collaboration between organizations, and inter-organizational integration.

TABLE II. E-GENOMIC FRAMEWORK MAIN ADVANTAGES

Advantages for R&D:

• Supporting diversity: it becomes possible to support a very diverse set of working models, as it becomes feasible to configure the low-level elements of the architecture to fit a variety of working and institutional business models.

• Enabling user-driven implementations: by exposing modular processes as separate services, which can be configured in multiple ways, the construction of technology solutions can become driven by user imperatives.

Commercial advantages:

• Providing better returns on technology investment: applications can be developed or acquired as needed, which means that only those parts of the system that really need to be changed are replaced while the rest of the systems are retained, so reducing both purchasing and implementation costs, particularly in terms of staff development and training.

• Enabling faster deployment of technology: as components are independent, it will often be easier to deploy new components so long as the needs of the new components are compatible with the existing interfaces. Even where this is not the case, it may still be simpler to alter or replace other components to supply the requirements of new systems.

• Providing a modular and flexible technology base: the rationale for the framework is specifically to enable the development of modular and flexible systems, where the individual components can be added or replaced more easily than in traditional models.

• Making collaboration between companies easier: through a common framework, and thus a common service-oriented architecture, it becomes easier to define the application interfaces which are needed and thus to share information between companies. It may also make sharing of applications easier, as it will be simpler to define small applications which are needed in common and can be developed to meet the needs of each company.

IV. SERVICES INCLUDED IN EGF FRAMEWORK

Fig. 6 shows the set of services defined within the framework. We do not consider this to be a definitive list of the services that can be provided, and we hope that additional services are identified, or identified services refined, in the light of future requirements analysis.

The upper sets of boxes identify services specifically within the domain of e-genomics; the lower set identifies services that may be common across multiple domains. The services are clustered into logical groups to aid readability; however there are no dependencies or explicit associations between service definitions. In practice, if several services with similar capabilities are exposed in an environment, the service interfaces may be realized using a shared implementation.

[Figure: layered diagram. NGS Biosoftware Tools: Tool 1, Tool 2, ..., Tool N. Genomic Application Services: Catalogue, Annotation, Filtering, Validation (remaining labels illegible). Common Services: Authentication, Authorization, Resolver, Mapping, Scheduling, Person, Group, Archiving, Search, Federated Search, Workflow, Service Registry, Harvesting, Logging, Context, Identifier (remaining labels illegible).]

Fig. 6. EGF framework services

As can be seen in Fig. 6, we have identified four layers of the framework:

• NGS Biosoftware Tools: interact with users directly, such as portals or specific tools. NGS Biosoftware Tools based on this framework can be either very small and focused or span many processes to provide a coherent workflow.

• Genomic Application Services: provide functionality required by user agents, such as alignment or variation annotation, or storing content in a repository. Application Services may be implemented so that they have some sort of user interface, but the key requirement for an application service is that it exposes its functionality for reuse by any number of user agents or other application services, and that it implements a standard interface to support this reuse.

• Common Services: provide lower-level functionality which is not genomic-specific, such as authentication and authorization services, but upon which application services and user agents depend.

• Infrastructure: the underlying network, storage, and processing capability provided for an implementation. This is assumed by the framework, but not defined.

A. Service specifications

In order to implement a service, a blueprint or document is needed that specifies the service's operations, its input data and formats, and its expected output data and formats. How input data is converted into output data, that is, the service implementation, is hidden behind the service specification. In our case, we follow the recommendations given in [9], which state that a service specification should clearly include:

• A narrative description of the component and its role in the framework

• A set of Interface definitions, or references to relevant specifications

• A set of Data Type definitions, or references to relevant specifications

• A binding to the implementation technology
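The four parts listed above can be captured in a small data structure, which also makes it easy to check whether a specification is complete. This is only a sketch: the class `ServiceSpecification`, its field names and the example values are hypothetical and are not part of the framework definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceSpecification:
    """The four parts of a service specification (hypothetical field names)."""
    name: str
    narrative: str                                          # role in the framework
    interfaces: List[str] = field(default_factory=list)     # definitions or references
    data_types: List[str] = field(default_factory=list)     # definitions or references
    binding: str = ""                                       # implementation technology

    def missing_parts(self) -> List[str]:
        """Return the names of the parts that are still empty."""
        missing = []
        if not self.narrative.strip():
            missing.append("narrative")
        if not self.interfaces:
            missing.append("interfaces")
        if not self.data_types:
            missing.append("data_types")
        if not self.binding.strip():
            missing.append("binding")
        return missing


# Illustrative, deliberately incomplete draft for the Service Registry.
draft = ServiceSpecification(
    name="Service Registry",
    narrative="Provides information about the services available on the network.",
    interfaces=["search(keyword, type, protocol)  # hypothetical operation"],
    binding="REST",
)
still_missing = draft.missing_parts()
```

A check of this kind could be run before a service is accepted into a shared catalogue, so that every published service carries the same minimum documentation.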

As an example, we define here the Service Registry using the following template:

TABLE III. SERVICE DESCRIPTION TEMPLATE

Objective: This service provides information about the services that are available on the network, including details of what content is available, what functionality is supported by the service and what protocols and standards are used to access the service (i.e. details of the API).

Key functions:

• provide machine-readable descriptions of services

• allow searching for services by 'keyword', type, protocol, etc.

• provide human-readable view of available services

Standards and specifications with applicability to this area:

• Universal Description, Discovery, and Integration (UDDI)

• REST

• Simple Object Access Protocol (SOAP)

• Web Services Description Language (WSDL)

Related services:
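The three key functions in the template (machine-readable descriptions, search by keyword, type or protocol, and a human-readable view) can be illustrated with a minimal in-memory registry. This is a sketch under our own assumptions: the classes `ServiceEntry` and `ServiceRegistry`, their fields and the registered example services are hypothetical, and a real registry would be exposed through one of the standards listed above rather than as a Python object.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ServiceEntry:
    """Description of one registered service (hypothetical fields)."""
    name: str
    service_type: str   # e.g. "genomic" or "common"
    protocol: str       # e.g. "REST" or "SOAP"
    description: str
    endpoint: str


class ServiceRegistry:
    def __init__(self) -> None:
        self._entries: List[ServiceEntry] = []

    def register(self, entry: ServiceEntry) -> None:
        self._entries.append(entry)

    def search(self, keyword: Optional[str] = None,
               service_type: Optional[str] = None,
               protocol: Optional[str] = None) -> List[ServiceEntry]:
        """Return entries matching every criterion that was supplied."""
        results = []
        for entry in self._entries:
            text = f"{entry.name} {entry.description}".lower()
            if keyword is not None and keyword.lower() not in text:
                continue
            if service_type is not None and entry.service_type != service_type:
                continue
            if protocol is not None and entry.protocol != protocol:
                continue
            results.append(entry)
        return results

    def to_records(self) -> List[Dict[str, str]]:
        """Machine-readable view: one plain dictionary per service."""
        return [asdict(entry) for entry in self._entries]

    def describe(self) -> str:
        """Human-readable view: one line per service."""
        return "\n".join(
            f"{e.name} [{e.service_type}, {e.protocol}]: {e.description}"
            for e in self._entries
        )


registry = ServiceRegistry()
registry.register(ServiceEntry("Alignment", "genomic", "REST",
                               "Aligns a sample against a reference",
                               "http://egf.example.org/alignment"))
registry.register(ServiceEntry("Authentication", "common", "SOAP",
                               "Verifies user identity",
                               "http://egf.example.org/auth"))
rest_services = registry.search(protocol="REST")
```

The dictionaries returned by `to_records` could be serialized to XML or JSON to obtain the machine-readable descriptions the template asks for.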

B. EGF applied

To demonstrate the applicability of the proposed framework, we take a look at the genetic testing process. When a DNA sequence arrives at a laboratory together with a diagnostic report, scientists read the report (t1) and select the gene to be analyzed (t2). Depending on it, a reference sequence is obtained from remote or local sources (t3) and the patient's DNA sequence (t4) is aligned (t5) to find differences (t6) and thus generate a more reliable diagnostic report (t7). Fig. 7 illustrates this process:

Fig. 7. Genetic test report generation process

The whole process can be delivered by framework services as depicted in Fig. 8. The scientist uses Tool X to handle the diagnostic report. Once the scientist has logged on to the system (Authentication Service), he or she is presented with the contents he or she has access privileges to (Authorization Service and Content Management) and starts the process (Workflow Service). During this process, the sequence is read (Read Sequence Service), compared with a reference (Reference Genome Search Service) and analyzed (Alignment Service, Validation Service and Filtering Service) to generate the final diagnostic report (Visualization Service).
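The chaining of services described above can be sketched as a small workflow runner that calls each service in turn and passes a shared context along. Everything in the block is illustrative: the function names, the toy sequences, the gene label `GENE_X` and the position-by-position comparison are assumptions made for the example, and the authentication, authorization, validation and filtering steps are omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Step = Tuple[str, Callable[[Context], Context]]


def read_sequence(ctx: Context) -> Context:
    # Stand-in for the Read Sequence Service (toy patient sequence).
    return {**ctx, "sample": "ACGTTGCA"}


def reference_search(ctx: Context) -> Context:
    # Stand-in for the Reference Genome Search Service (toy reference).
    return {**ctx, "reference": "ACGTAGCA"}


def alignment(ctx: Context) -> Context:
    # Toy comparison, not a real alignment algorithm.
    sample, reference = str(ctx["sample"]), str(ctx["reference"])
    diffs = [i for i, (a, b) in enumerate(zip(sample, reference)) if a != b]
    return {**ctx, "differences": diffs}


def visualization(ctx: Context) -> Context:
    # Stand-in for the Visualization Service: renders the final report.
    diffs = ctx["differences"]
    report = f"Gene {ctx['gene']}: {len(diffs)} difference(s) at positions {diffs}"
    return {**ctx, "report": report}


def run_workflow(steps: List[Step], ctx: Context) -> Tuple[Context, List[str]]:
    """Workflow Service stand-in: run the steps in order, keep an audit trail."""
    trail = []
    for name, step in steps:
        ctx = step(ctx)
        trail.append(name)
    return ctx, trail


steps: List[Step] = [
    ("Read Sequence Service", read_sequence),
    ("Reference Genome Search Service", reference_search),
    ("Alignment Service", alignment),
    ("Visualization Service", visualization),
]
result, trail = run_workflow(steps, {"gene": "GENE_X"})
```

Because each step only reads from and writes to the context, any step could be replaced by a call to a remote service with the same inputs and outputs without changing the runner.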

Fig. 8. EGF applied to genetic test report generation

V. CONCLUSIONS AND FURTHER WORK

Correct management of Genome Information Systems (GIS) requires applying, in the Bioinformatics domain, the well-known methods and techniques provided by Information Systems Engineering (ISE). Our experience in this context in recent years has led us to conclude that much work remains to be done to take full advantage of advanced ISE know-how. To provide concrete results in this direction, we introduce in this paper EGF, an architecture framework for e-genomics services, intended to provide a unified, holistic perspective of the different components that an effective and efficient GIS needs.

It is important to highlight that the work presented here is intended as a starting point for a long, ambitious journey. The service definitions will need refinement and expansion, and many of the details have yet to be worked out. Additional services will almost certainly be identified, and some of the existing ones may be merged or dropped. It is also likely that the standards and specifications listed will not be completely correct, with new ones emerging from time to time.

We plan to test the framework in a particular scenario oriented to the use of genome information in a Personalized Medicine context. A long-term project with the genetics unit of a local hospital specialized in the treatment of breast cancer is under way.

In fact, once the services have been defined, the next step will be to implement them. To do that, a Service requires a Service Specification: a set of documents that provides a 'blueprint' for building the service. To obtain the results explained in this paper, it is important to build solid know-how while completing the following for each service:

• A narrative description of the component and its role in the framework

• A set of Interface definitions, or references to relevant specifications

• A set of Data Type definitions, or references to relevant specifications

• A binding to the implementation technology

REFERENCES

[1] Gonzaga-Jauregui C, Lupski JR, Gibbs RA. "Human genome sequencing in health and disease". Annu Rev Med 2012;63:35-61.

[2] Schadt EE, Linderman MD, Sorenson J, et al. "Computational solutions to large-scale data management and analysis". Nat Rev Genet 2010;11:647-57.

[3] Pabinger et al. "A survey of tools for variant analysis of next-generation genome sequencing data". Briefings in Bioinformatics, 2013:1

[4] Bao S, Jiang R, Kwan W, et al. "Evaluation of next generation sequencing software in mapping and assembly". J Hum Genet 2011;56:406-14.

[5] Nielsen R, Paul JS, Albrechtsen A, et al. "Genotype and SNP calling from next-generation sequencing data". Nat Rev Genet 2011;12:443-51.

[6] Li H, Homer N. "A survey of sequence alignment algorithms for next-generation sequencing". Brief Bioinformatics 2010;11:473-83.

[7] Koboldt DC, Larson DE, Chen K, et al. "Massively parallel sequencing approaches for characterization of structural variation". Methods Mol Biol 2012;838:369-84.

[8] Datta S, Kim S, et al. "Statistical analyses of next generation sequence data: a partial overview". J Proteomics Bioinform 2010;3:183-90.

[9] JISC, "Technical Framework to support e-learning". Available at http://www.jisc.ac.uk/uploaded_documents/Technical%20Framework%20feb04.doc (last accessed 10/20/2013).

[10] Stevens, M. "The Benefits of a Service-Oriented Architecture". Developer.com, http://www.developer.com/services/article.php/1041191 (last accessed 10/20/2013).

[11] Stevens, M. "Service-Oriented Architecture Introduction (2 parts)". Developer.com, http://www.developer.com/services/article.php/1010451 (last accessed 10/20/2013).

[12] "The Java EE Tutorial". Available at http://docs.oracle.com/javaee/7/tutorial/doc/javaeetutorial7.pdf (last accessed 11/24/2013).

[13] Pastor M.A., Burriel V., Pastor O. "Conceptual Modeling of Human Genome Mutations: A Dichotomy Between What we Have and What we Should Have". BIOSTEC Bioinformatics 2010, pp: 160-166. ISBN: 978-989-674-019-1

[14] Pastor, O. 2008. "Conceptual Modeling Meets the Human Genome". In Conceptual Modeling-ER 2008, Li Q., Spaccapietra S., Yu E. and Olive A. (eds.). LNCS, vol 5231, pp 1-11. Springer, Berlin Heidelberg.

[15] Pastor O, Levin AM, Celma M, Casamayor JC, Eraso LE, Villanueva MJ and Perez-Alonso M.: "Enforcing Conceptual Modeling to Improve the Understanding of Human Genome". Procs of the IVth Int. Conference on Research Challenges in Information Science, RCIS 2010, Nice, France, IEEE Press, Print Version ISBN #978-1-4244-4840-1 (2010).

[16] Pastor O., Levin AM., Casamayor J.C., Celma M., Virrueta A, Eraso L.E., Perez-Alonso M.: "Model driven-based engineering applied to the interpretation of the Human Genome". In: The Evolution of Conceptual Modeling, R. Kaschek, L. Delcambre. Springer-Verlag, editor: H. Mayo (2010).

[17] Martinez AM, Martin A, Villanueva MJ, Valverde F, Levin AM and Pastor O.: "Facing the Challenges of Genome Information Systems: a Variation Analysis Prototype". CAiSE Forum 2010.

[18] AbuKhousa E., Mohamed, N., Al-Jaroodi, J. (2012). "E-Health Cloud: Opportunities and Challenges". Future Internet 2012, 4, 621-645. ISSN 1999-5903.

[19] BioDAS web site. http://www.biodas.org/wiki/Main_Page (last accessed 01/21/2014).

[20] Sequence Ontology web site. http://www.sequenceontology.org/gff3.shtml (last accessed 04/17/2014).

[21] Reese MG, Moore B, Batchelor C, Salas F, Yandell M, Eilbeck K. "A standard variation file format for human genome sequences" (2010). Genome Biology, available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945790/ (last accessed 04/17/2014).

[22] Open Bioinformatics web site. http://www.openbioinformatics.org/annovar/ (last accessed 04/17/2014).

[23] Snpeff project web site. http://snpeff.sourceforge.net/ (last accessed 04/17/2014).

[24] Ensembl web site. http://www.ensembl.org/info/docs/tools/vep/index.html (last accessed 04/17/2014).
