applying linked data approaches to pharmacology: architectural

10
Semantic Web 0 (0) 1 1 IOS Press Applying Linked Data Approaches to Pharmacology: Architectural Decisions and Implementation Alasdair J. G. Gray a , Paul Groth b , Antonis Loizou b , Sune Askjaer c , Christian Brenninkmeijer a , Kees Burger d , Christine Chichester d , Chris T. Evelo e , Carole Goble a , Lee Harland f , Steve Pettifer a , Mark Thompson d , Andra Waagmeester e , Antony J. Williams g a University of Manchester b VU University Amsterdam c H. Lundbeck A/S d Netherlands Bioinformatics Center e Maastricht University f Connected Discovery g Royal Society of Chemistry Abstract. The discovery of new medicines requires pharmacologists to interact with a number of information sources ranging from tabular data to scientific papers, and other specialized formats. In this application report, we describe a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications. The functionality offered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project, a collaboration of research institutions and major pharmaceutical companies. We describe the architec- ture of the platform focusing on seven design decisions that drove its development with the aim of informing others developing similar software in this or other domains. The utility of the platform is demonstrated by the variety of drug discovery applications being built to access the integrated data. Keywords: pharmacology, linked data, data integration 1. Introduction The exploration and development of new drugs requires scientists to draw knowledge from multi- ple sources of information. These range from online databases of proteins (e.g. UniProt [24] and Enzyme [3]) and chemicals (e.g. ChEMBL [17], ChemSpi- der 1 [34], and DrugBank [27]), to models of biologi- cal pathways (e.g. Reactome [30], WikiPathways [26], and KEGG [25]) as well as the scientific literature. These information sources are often held in different formats and sourced from a wide variety of organiza- 1 http://www.chemspider.com accessed March 2012. tions. Together they cover a wide area of the scientific space of interest, but at the same time overlap scope, and thus frequently record different (or even inconsis- tent) representations of the same data. In recent years several key datasets for drug discovery have been pub- lished on the Semantic Web including those provided by Chem2Bio2RDF [10] and Linking Open Drug Data (LODD). In the latter case, the data is linked following linked data principles [7,22]. The integration of knowledge from these disparate sources presents a significant problem to scientists, with the intellectual and scientific challenges often being overshadowed by the need to repeatedly per- form error-prone and tedious mechanical tasks. The 1570-0844/0-1900/$27.50 c 0 – IOS Press and the authors. All rights reserved

Upload: trinhxuyen

Post on 03-Jan-2017

219 views

Category:

Documents


0 download

TRANSCRIPT

Semantic Web 0 (0) 1 1IOS Press

Applying Linked Data Approaches toPharmacology: Architectural Decisions andImplementationAlasdair J. G. Gray a, Paul Groth b, Antonis Loizou b, Sune Askjaer c, Christian Brenninkmeijer a,Kees Burger d, Christine Chichester d, Chris T. Evelo e, Carole Goble a, Lee Harland f , Steve Pettifer a,Mark Thompson d, Andra Waagmeester e, Antony J. Williams g

a University of Manchesterb VU University Amsterdamc H. Lundbeck A/Sd Netherlands Bioinformatics Centere Maastricht Universityf Connected Discoveryg Royal Society of Chemistry

Abstract. The discovery of new medicines requires pharmacologists to interact with a number of information sources rangingfrom tabular data to scientific papers, and other specialized formats. In this application report, we describe a linked data platformfor integrating multiple pharmacology datasets that form the basis for several drug discovery applications. The functionalityoffered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of theOpen PHACTS project, a collaboration of research institutions and major pharmaceutical companies. We describe the architec-ture of the platform focusing on seven design decisions that drove its development with the aim of informing others developingsimilar software in this or other domains. The utility of the platform is demonstrated by the variety of drug discovery applicationsbeing built to access the integrated data.

Keywords: pharmacology, linked data, data integration

1. Introduction

The exploration and development of new drugsrequires scientists to draw knowledge from multi-ple sources of information. These range from onlinedatabases of proteins (e.g. UniProt [24] and Enzyme[3]) and chemicals (e.g. ChEMBL [17], ChemSpi-der1 [34], and DrugBank [27]), to models of biologi-cal pathways (e.g. Reactome [30], WikiPathways [26],and KEGG [25]) as well as the scientific literature.

These information sources are often held in differentformats and sourced from a wide variety of organiza-

1http://www.chemspider.com accessed March 2012.

tions. Together they cover a wide area of the scientificspace of interest, but at the same time overlap scope,and thus frequently record different (or even inconsis-tent) representations of the same data. In recent yearsseveral key datasets for drug discovery have been pub-lished on the Semantic Web including those providedby Chem2Bio2RDF [10] and Linking Open Drug Data(LODD). In the latter case, the data is linked followinglinked data principles [7,22].

The integration of knowledge from these disparatesources presents a significant problem to scientists,with the intellectual and scientific challenges oftenbeing overshadowed by the need to repeatedly per-form error-prone and tedious mechanical tasks. The

1570-0844/0-1900/$27.50 c© 0 – IOS Press and the authors. All rights reserved

2 Gray et al. / Applying Linked Data Approached to Pharmacology

first challenge is to identify the entities of interest inthe many data sources, and to relate and map these toone another. This allows complementary informationfrom the data sources to be collated in a single record.For example, ChemSpider contains data about chemi-cal compounds and where they can be sourced, whileChEMBL complements this with data about the bioac-tivity of drug-like molecules and DrugBank providesinformation on the clinical use of drugs which containthe molecules. These data sources can be linked basedon the chemical structure of the compounds.

However, differences in scientific or technical ap-proaches to molecular structure representation meanthat different data sources will not always be in agree-ment. For example, searches for the chemical “simvas-tatin” on ChemSpider2 and DrugBank3 return differ-ent compounds: although their basic chemical struc-ture matches, the compounds differ in their stereo-chemistry.

A further challenge is the lack of semantics asso-ciated with links in traditional database entries. Forexample, the entry in UniProt for the protein “kinaseC alpha type homo sapien”4 contains a link to theEnzyme database record 5, which has complementarydata about the same protein and thus the identifierscan be considered as being equivalent. The UniProtentry also contains a link to the DrugBank compound“Phosphatidylserine”6. Clearly, these concepts are notidentical as one is a protein and the other a chemicalcompound. The link in this case is representative ofsome interaction between the compound and the pro-tein, but this is left to a human to interpret.

Thus, for successful data integration one must de-vise strategies that address such inconsistencies withinthe existing data.

In this application paper, we present the architec-tural decisions made in implementing the Open Phar-macology Space (OPS) platform within the OpenPHACTS project7. The key goal of the project is tosupport a variety of common tasks in drug discovery

2http://www.chemspider.com/Chemical-Structure.49179.html accessed April 2012.

3http://www.drugbank.ca/drugs/DB00641 accessedApril 2012.

4http://www.uniprot.org/uniprot/P17252 ac-cessed March 2012.

5http://enzyme.expasy.org/EC/2.7.11.13 ac-cessed March 2012.

6http://www.drugbank.ca/drugs/DB00144. accessedMarch 2012.

7http://www.openphacts.org accessed March 2012

through a technology platform that will integrate phar-macological and other biomedical research data us-ing open standards such as RDF. The OPS platformis being developed by a public-private partnership toaddress the problem of public domain data integra-tion for both academia and major pharmaceutical com-panies [40]. A key driver of the project is that inte-grated data should address concrete pharmacologicalresearch questions and integrate with the workflowsand applications already used within the drug discov-ery pipeline.

This paper focuses on two key contributions:

– A set of seven architectural decisions for a dataintegration platform driven by pharmacologicalbusiness questions. These decisions should in-form others developing similar or related soft-ware.

– A discussion of the use of this platform by threeseparate drug discovery applications.

The rest of this paper is organized as follows. Re-lated work is discussed in Section 2. A summary of theconcrete pharmacological research questions that thefunctionality provided by the OPS linked data platformseeks to address are presented in Section 3. Section 4discusses the architectural decisions made to providean integration framework that is capable of integratingpublic data sources in order to answer the top priorityresearch questions. Section 5 gives details of the im-plementation of a linked data platform for pharmacol-ogy that enables enriched functionality on three sepa-rate, existing, drug discovery applications. Finally con-clusions are drawn in Section 6.

2. Related Work

The OPS platform deliberately builds upon a largeamount of prior work delivered by the Semantic Webcommunity in two key areas: infrastructure and datasources.

In terms of infrastructure, there have been severalsemantic data integration platforms proposed and de-veloped [4,9,16]. These systems present a single in-tegrated view of the data to the users and build uponthe large body of work on data integration within thedatabase community [20,28]. Central to these data in-tegration systems are the mappings between the datasources (which in many cases are relational databases)and the intended semantics of the integrated result(represented as ontologies in semantic integration sys-

Gray et al. / Applying Linked Data Approached to Pharmacology 3

tems). Such mappings need to be supplied by domainexperts with a good understanding of the data sourcesand the (ontological) view of the integrated data to becreated. Dataspaces [19] is an approach to lower thelevel of expertise required to generate and maintainmappings. The dataspace system starts with a minimalset of mappings which is incrementally enriched basedon user feedback. [5].

Another important area of infrastructure work arescientific workflow systems such as Taverna [31],Galaxy [18], and Kepler [29]. These systems are de-signed to enable users to compose a series of op-erations into a workflow that will extract data fromsources, and execute data manipulation and computa-tional services to analyse it. An important aspect ofthese systems is capturing the provenance of each op-eration performed by the workflow. A key idea behindworkflows is enabling their reuse so that results canbe reproduced. Community portals such as MyExperi-ment [14] support the sharing and reuse of workflows.The OPS infrastructure builds on the Large KnowledgeCollider (LarKC) system that provides a pluggableworkflow environment for developing scalable Seman-tic Web applications that integrate large datasets [12].

The Open PHACTS project is, of course, not thefirst to the realize that the use of Semantic Web stan-dards, along with common ontologies, can ease the in-tegration burden in the pharmacology domain [33,36].The Bio2RDF [6], Neurocommons [35], Linking OpenDrug Data (LODD) [37], Linked Life Data [32] andChem2Bio2RDF [10] projects have all made signifi-cant sets of biology and chemistry data available inRDF. We built upon the comprehensive work of thecommunity in creating RDF-based data sources topower the OPS platform.

3. Motivation: Pharmacology Research Questions

A key driver of the OPS system is that the integrateddata should address concrete ‘real world’ pharmacol-ogy research questions and integrate with the work-flows and applications already used within the drugdiscovery pipeline. A collection of 83 questions hasbeen created and prioritised by the scientific and phar-maceutical partners in the Open PHACTS project. Fulldetails of the research questions, the generation andprioritisation process, and analysis of the results canbe found in [2,23]. We provide a brief summary so asto motivate the functionality provided by the OPS sys-

tem, and the data sets that have been integrated to pro-vide an information space.

The business questions span a range of areas, how-ever, the focus in developing the OPS platform is onthose concerning pharmacology data, specifically theinteractions between compounds and molecular tar-gets. Many of the questions can be answered nowby industry researchers or by academic researcherswith their internal systems and via bio- and chem-informaticians. However, a shared, fully interoperableand routinely updated, scientist–friendly platform forperforming these analyses in a non–manual way doesnot currently exist.

In the simplest case, the questions would be of theform:

“Retrieve all information available about aspirin.”

This requires discovering all of the entries for aspirinin the available data sources and combining the data.Such a task is complex in itself due to the different fo-cus of the data sources and the fact that each uses itsown set of identifiers, not to mention the inconsisten-cies in the overlapping data that must be resolved ordisplayed in an appropriate form to the user.

The more complex research questions involve filter-ing compounds based on their properties and experi-mental results, e.g.

“Give all oxidoreductase inhibitors with IC50 ac-tivity value < 100nM in both human and mouse.”

To answer such questions requires the filtering andcomparison of values from different data sources.

Another group of the questions are driven by genet-ics, pathways, or diseases. An example from this groupwould be:

“For a given disease related molecular pathway,give me all targets and known active compoundsthat modulate them.”

These require a larger set of data sources to be asso-ciated to compounds, targets, and the interactions be-tween them.

4. Architectural Decisions

The OPS platform provides a semantic platform forpharmacology that integrates data from a set of dis-tributed open data sources.

Figure 1 provides an overview of the OPS Platformarchitecture as it is currently implemented. It consistsof seven components:

4 Gray et al. / Applying Linked Data Approached to Pharmacology

User Interface & Applications

Core API

Linked Data Cache

Domain Specific Services

Data Sources

Identity Resolution

Service

Identity Mapping Service

Fig. 1. OPS Platform Architecture Overview

1. Data Sources2. Linked Data Cache (LDC)3. Identifier Resolution Service (IRS)4. Identifier Mapping Service (IMS)5. Domain specific services6. Core API7. User Interfaces/Applications

These components arise out of corresponding archi-tectural decisions made variously for pragmatic, tech-nical and social reasons. We now discuss each of theseseven decisions. Components resulting from these de-cisions are in bold and the section is concluded bywalking through an example query discussing the roleof each component.

4.1. Rely on existing RDF-based datasets

The OPS platform builds upon the comprehensivework already performed by the community in creat-ing RDF-based datasets, which are relevant for theOpen PHACTS project. The current platform usesthe ChEMBL and ChEBI datasets provided by theChem2Bio2RDF project [10], the conversion of Drug-Bank provided by the LODD project [37], and theconversion of the Enzyme database sourced fromUniProt [24]. These Data Sources are the foundationof the OPS platform.

The decision to rely on existing data sources wasboth pragmatic and social. Pragmatically, relying onexisting data allowed the project to quickly developa working system. Socially, it encourages both origi-nating data providers (e.g. UniProt) and third partiesto continue to provide RDF-data for pharmacologysources. Such RDF conversions are crucial to the suc-cess of Open PHACTS. This decision however has acost, as data sources will use different schemes insteadof one common global schema and OPS is reliant on

data providers continually updating RDF versions ofimportant datasets such as ChEMBL.

4.2. Centralize the data

A classic trade-off in the design of data integrationsystems is whether to federate queries or warehousedata. In this case, we chose to warehouse the data forreasons of reliability and performance. In terms of per-formance, interactive query speeds are necessary, thus,there may be significant latency overheads when re-lying on remote services (e.g. SPARQL endpoints).Furthermore, third party services (i.e. external to theconsortium) cannot and should not be relied upon toprovide consistent access. Essentially, it would be im-polite for us to make use of others resources for an-swering SPARQL queries especially when they may belarge or frequent. To address these concerns, the afore-mentioned datasets are centralized into a Linked DataCache (LDC). The term cache is used to emphasizethat the platform does not maintain data on its own andis just a temporary store to enable data to be queriedquickly at interactive speeds.

An issue with caching the data is ensuring that thedatasets reflect the latest versions of the underlyingdatasets. However, it is common for scientific datasetsto follow release cycles and these can be used to reloadthe data in the cache. See for example UniProt [13].

4.3. Separate keyword search and structured queries

A key entry point into the platform is through key-word search. In the pharmacology domain, this is morethan just text matching as keywords can often matchto multiple often very distinct concepts. For example,when typing “menthol” the user could quite reason-ably mean the chemical menthol, or the menthol recep-tor protein. This sort of conceptual keyword search isdifferent from queries over structured databases, thus,a new architectural component was introduced, theIdentifier Resolution Service (IRS). The role of theIRS is to translate user-entered entity names (in freetext form) into known entities within the system (i.e.that have a defined URI). These known entities canthen be used in structured queries. The IRS also sup-ports the disambiguation of concepts through interac-tion with the user interface.

4.4. Equality is context dependent

A novel approach to instance level links betweenthe datasets is used in the OPS platform. Scientists

Gray et al. / Applying Linked Data Approached to Pharmacology 5

care about the types of links between entities: differ-ent scientists will accept concepts being linked in dif-ferent ways and for different tasks they are willingto accept different forms of relationships. For exam-ple, when trying to find the targets that a particularcompound interacts with, some data sources may havecreated mappings to gene rather than protein identi-fiers: in such instances it may be acceptable to usersto treat gene and protein IDs as being in some senseequivalent. However, in other situations this may notbe acceptable and the OPS platform needs to allow forthis dynamic equivalence within a scientific context.As a consequence, rather than hard coding the linksinto the datasets, the OPS platform defers the instancelevel links to be resolved during query execution by theIdentifier Mapping Service (IMS). Thus, by chang-ing the set of dataset links used to execute the query,different interpretations over the data can be provided.

4.5. Leverage domain-specific services

There are a variety of important pharmacologicaloperations that are specific to a domain and have re-liable and performant implementations. Thus, insteadof creating new implementations, the OPS platform re-lies on these existing domain-specific services. Forexample, of critical importance to the Open PHACTSproject is that identical small molecule compoundsfrom across different data sources are integrated accu-rately, based on structure rather than identifier map-ping which is less accurate. Instead of reimplement-ing this feature, we rely on an existing chemistry reg-istration and normalisation service: ChemSpider [34].This takes each compound from all of the data sourcesand maps it using structure-based methods to a uniqueChemSpider identifier. The mappings between eachchemical in each data source and ChemSpider are thenloaded into the system to provide an accurate set ofchemical mappings between different databases.

Going forward, we aim to make use of the SADIFramework for accessing additional domain specificservices [11].

4.6. Provide a simple API

The initial prototype of the OPS platform only pro-vided a SPARQL endpoint through which the inte-grated data could be queried.This required each ofthe drug discovery applications to have an intimateknowledge of the data exposed and the ability to writethe required SPARQL queries to retrieve the data de-

sired. This approach also left the OPS platform ex-posed to poorly formed queries or simply queries thatare too open ended (e.g. a select * where {?s?p ?o}). This both impacted developer productivityand hurt the reliability of the service.

To address this problem, an additional component,the Core API was introduced into the architecture.The Core API provides a set of common methods thatservices can call. Calls to these methods are trans-lated into parameterized SPARQL queries, which areused to join the data across the datasets at a schemalevel, i.e. the specific columns to retrieve from each ofthe datasets is determined by the SPARQL query is-sued. This benefits application developers as they nolonger need to formulate their own SPARQL queries.Additionally, it means that the that the parameterizedSPARQL queries used by the OPS platform can be op-timized for performance. This is particularly importantdue to the potential for large volumes of data and thecomplexity of performing query time mapping of iden-tifiers. Finally, the Core API enables schema mappingsto be encapsulated at the query level enabling the plat-form to be adapted to new data sources.

4.7. Leverage the user interface

A key driver of the OPS platform is to enable user-focused drug discovery applications. Thus, the entiretyof the OPS platform architecture is designed aroundenabling these sorts of applications. This also meansthat we can take advantage of the user interface, forexample, by encapsulating queries in an API, thus en-suring that only well-defined queries are run, or usingthe user interface to ensure that correct URIs are givento the system when performing a query. We also be-lieve that this decision will allow for scalability in thefuture. However, there is a trade-off as this means thatthe OPS platform in its current form is not designedfor other tasks such as large offline analytics.

4.8. The architecture as a whole

To further clarify the platform architecture, considerthe simplest of the three example pharmacology re-search questions provided in Section 3:

“Retrieve all information available about aspirin.”

This can be achieved through the use of two OPScore API methods: compoundLookup(text) andcompoundPharmacology(URI).

6 Gray et al. / Applying Linked Data Approached to Pharmacology

Fig. 2. Sample results of a compoundPharmacology(URI)API call.

The Identifier Resolution Service is invoked by thecore API through compoundLookup(“aspirin”)and provides a number of alternatives, including “As-pirin”, “FIORINAL” (which is a compound that con-tains aspirin), and “Hypyrin” (which is an aluminiumcomplex ion of aspirin). The returned items are eachannotated with additional metadata (e.g. descriptionsand synonyms) to aid the users in their decision.

Once the user selects a particular concept, the userinterface will call thecompoundPharmacology(URI) method. An ex-ample result of this API call is shown in Figure 2.The compound URI is inserted into the query which isthen expanded with equivalent URIs provided by theIMS. The final query is evaluated against the LinkedData Cache. The figure provides example propertiesretrieved from each dataset.

5. Open Pharmacology Space Implementation

This section provides details on the implementationof the OPS prototype platform as shaped by the archi-tectural decisions given in Section 4.

5.1. The Large Knowledge Collider

The platform is implemented as a connection of anumber of separate software systems. These softwaresystems are coordinated using the Large KnowledgeCollider (LarKC), exploiting its pluggable architectureto call external services [12].

Workflows in LarKC are defined in terms of theirinput, output and the connections between the plug-ins that participate in them. In addition, LarKC end-

Fig. 3. LarKC plugins and workflow activation for acompoundPharmacology(URI) API call.

points are a special type of plugin which expose an in-terface for receiving input and providing output, acces-sible through HTTP. A valid workflow must containa single endpoint and a single plugin connected to itsoutput. When the execution of all participating pluginsis completed, the endpoint collects any output from theworkflow and responds to the original HTTP request.

Figure 3 presents the LarKC plugins that partici-pate in the main OPS workflow, illustrated as jigsawpuzzle pieces. The figure gives a trace of the informa-tion flow between the various plugins as a result of acompoundPharmacology(URI) call to the OPSAPI.

5.2. Linked Data Cache

Additionally, LarKC acts as the Linked Data Cache(LDC) component with the OPS platform. Duringthe OPS platform implementation, an alternative datalayer was developed for LarKC, which implements theOpenRDF Sesame SAIL interface [8]. SAIL standsfor “Storage and Inference Layer” and is a connec-tion oriented interface that is implemented by a num-ber of RDF stores to enable their interoperability withthe Sesame platform. This enables the deployment of

Gray et al. / Applying Linked Data Approached to Pharmacology 7

Data property Value

Distinct resources 28 670 859Distinct predicates 6 974Named graphs 4Total quads 70 938 799Size on disk 12 GB

Table 1Characteristic properties of RDF data currently present in the LDC.

LarKC with a wide range of alternatives to the defaultBigOWLIM (which is now deprecated and replaced byOntotext with OWLIM-SE). More importantly it al-lows OPS users and developers to set up local OPS in-stances with an RDF store that is specifically selectedto suit their particular data needs and hardware specifi-cations. For example an offline demonstration instanceto be installed on a mid-range laptop using a smallsample dataset can avoid the installation and configu-ration overhead of a scalable RDF store by using anin-memory alternative.

We have successfully tested the integration of LarKCwith four additional RDF stores: the default Sesamein-memory store [8], bigdata [39], OpenLink Virtu-oso [15], and Garlik’s 4store [21]. Currently we use4store for the deployment of the platform as it hasproven capable at providing quick responses (in the300ms range) for complex queries as it is specificallyoptimized for the type of FILTER clauses used by theIMS to expand SPARQL queries. However, 4store wasdesigned to deal specifically with large (billions oftriples) datasets characterised by only a few hundredpredicates. In contrast the datasets in the OPS platformcontain over 6,000 distinct predicates, see Table 1 foran overview of the data currently loaded in the LDC.The SAIL interface has been contributed to the LarKCopen source codebase.

In addition to the main OPS workflow, the cur-rent prototype makes use of an auxiliary workflow forwhich the IMS, IRS and external services are not used.The purpose of the workflow is to identify the URIsfor which data is available in OPS from a large list ofcandidates. As RDF stores will provide the worst-caseperformance when queried for URIs that do not appearin their indices (as they do not exist), such queries ben-efit from circumventing the IMS.

5.3. Identify Resolution Service

As discussed in Section 4 the mapping between userprovided text and URIs is a non trivial task that has to

take into account the type of entity the user is inter-ested in. This functionality is thus decoupled from theLDC and is provided by the Identity Resolution Ser-vice (IRS) plugin.

This plugin is able to make calls to an API pro-vided by ConceptWiki8 to retrieve a list of resourcesof the required type with canonical labels curated bythe community. The response is then dynamically im-ported in the underlying RDF store. The IRS plugin re-sponds to explicitly defined predicates in the SPARQLquery9 which are also used to relate the resulting RDFto existing data in the LDC. Updates take place imme-diately, making the result of the external call directlyavailable to the SPARQL query that triggered it, butalso to all future queries.

If such predicates are not present in the query theplugin remains inactive, as is the case in the exampleof Figure 3. However, the example call will retrieve in-formation dynamically asserted by the IRS plugin dur-ing a previous compoundLookup call—namely theobject of the skos:prefLabel predicate (arrow #0in the figure).

5.4. Identity Mapping Service

The IMS plugin responds to all queries provided bythe API endpoint and invokes the IMS service for anyinstance URIs that appear in the query. Once an appro-priate list of mappings for each URI has been retrievedfrom the IMS service, the plugin will expand the orig-inal SPARQL query to include equivalent URIs.

The expanded query is then forwarded to the SPARQLEvaluator plugin which will query the RDF store andwrite to the output of the workflow. The need to storeequivalent URIs in the LDC is thus eliminated, and theplatform is able to provide appropriate results to iden-tical queries with different entity equivalence require-ments.

5.5. Domain Specific Services

So far only a single domain–specific service hasbeen integrated with the OPS platform – namely theChemical Structure Search service provided by Chem-Spider. This service enables the retrieval of smallmolecule compound URIs that match, contain, or aresimilar to a user provided chemical structure.

8http://www.conceptwiki.org accessed April 2012.9PREFIX :<http://wiki.openphacts.org/index.

php/ext_function#>

8 Gray et al. / Applying Linked Data Approached to Pharmacology

As such functionality requires the use of highly spe-cialised and computationally expensive methods andalgorithms, we developed a LarKC plugin that invokesthe ChemSpider API, dynamically converts search re-sults to RDF and updates the LDC. The inserted data ispersistent and reused to answer queries with identicalparameters.

Similarly to the IRS plugin, Chemical StructureSearch will scan each SPARQL query it receives fora set of predefined predicates which correspond to thetypes of search supported by ChemSpider.

5.6. Core API

As a key focal point of the OPS platform is to en-able the rapid development of user–focused applica-tions in the drug discovery domain, we have developeda simple REST–based API to facilitate the retrieval ofdata from the LDC. Applications can thus be devel-oped without a requirement for extensive knowledgeof the schemas used in each underlying dataset. More-over, this approach allows for the changes to SPARQLqueries to be made at a single point, facilitating theaddition of new datasets and the evolution of existingschemas.

The core API is implemented as an endpoint inLarKC that will always receive the input to each OPSworkflow, and extract the method name and corre-sponding parameters from the HTTP POST request.Each method name corresponds to a parameterizedSPARQL query which is instantiated using the param-eters provided. The resulting query is provided as inputto all plugins directly connected to the endpoint in theworkflow (i.e. the IMS, IRS and Chemical StructureSearch plugins).

The OPS Core API does more than to simply pro-vide an abstraction layer for the application developer.Such interfaces provide assurances that the data layerwill only be exposed to well designed queries, and al-lows the API developer to make extensive use of op-timizations present in RDF stores. We believe this tobe key to the successful deployment of Semantic Webinfrastructures which consume large scale datasets.

5.7. User interfaces

Within the Open PHACTS project there are severalexemplar user interface applications being developedeach with a different drug discovery task as a focus.The OPS platform provides a common semantic web

Fig. 4. Compound by name look up through the OPS general inter-face

Fig. 5. Article enrichment in Utopia Documents using the OpenPHACTS Platform

platform which can be accessed by these drug discov-ery applications through the core API.

For a general browsing interface over the linkeddata, a version of an internal system used in the phar-maceutical company Lundbeck, named LSP4All, wasextended and connected to the platform. This inter-face allows for searching, retrieving and browsing ofdata tables. As a result some of the typical user in-terfaces experienced by a pharmaceutical company re-searcher have been converted to rely on Semantic Webdata. Figure 4 shows an example of the interface re-turning the integrated information about the compoundaspirin.

In addition to browsing and searching for data, re-searchers utilize scientific literature in their work. Tosupport this process the Utopia Documents [1] soft-ware was connected to the platform to allow users toview chemical compounds associated with a PDF arti-cle, see Figure 5.

Gray et al. / Applying Linked Data Approached to Pharmacology 9

Fig. 6. PathVisio using the OPS Platform to suggest relationshipswith new genes or metabolites (right side window) for a gene se-lected in a pathway (left side window)

Finally, a more specialized interface for curating bi-ological pathways, PathVisio [38], was connected tothe platform. This connection allowed PathVisio toprovide additional pathway information to the user asthey review a pathway. This interface is shown in Fig-ure 6.

Thus, based on the same common platform, wehave created a prototype for an enriched pharmacologyworkspace software platform.

6. Conclusions

We presented the architectural decisions for im-plementing a linked data platform to support drugdiscovery that integrates the pre-competitive openlyavailable data from many semantic web sources, e.g.Chem2Bio2RDF and LODD. The motivation for thefunctionality provided by the OPS linked data platformwas driven by real business questions for drug discov-ery in pharmacology gathered from the Open PHACTSpartners. The platform supports different views overthe data through stand-off mappings which are appliedat query time, i.e. it is possible to have the system re-late genes and proteins in one case but not another.The versatility of the platform has been demonstratedby the wide variety of drug discovery applications thatcan be connected to the integrated data.

Future work is to expand the functionality of the sys-tem to a wider set of the business questions. This willalso require linking a larger number of data sources,and thus increasing the size and complexity of thelinked data cache.

Acknowledgements

The research leading to these results has receivedsupport from the Innovative Medicines Initiative JointUndertaking under grant agreement number 115191,resources of which are composed of financial contri-bution from the European Union’s Seventh FrameworkProgramme (FP7/2007- 2013) and EFPIA companies’in kind contribution.

References

[1] T. K. Attwood, D. B. Kell, P. McDermott, J. Marsh, S. R. Pet-tifer, and D. Thorne. Utopia documents: linking scholarly lit-erature with research data. Bioinformatics (Oxford, England),26(18):i568–i574, September 2010.

[2] K. Azzaoui, E. Jacoby, S. Senger, E. Cuadrado Rodríguez,M. Loza, B. Zdrazil, M. Pinto, A. J. Williams, V. de la Torre,J. Mestres, O. Taboureau, M. Rarey, and G. F. Ecker. Analy-sis of the scientific competency questions followed by the IMIOpen PHACTS consortium for the development of the seman-tic web-based molecular information system OPS. 2012. Tobe published.

[3] A. Bairoch. The ENZYME database in 2000. Nucleic AcidsResearch, 28:304–305, 2000.

[4] J. Barrasa Rodríguez and A. Gómez-Pérez. Upgrading rela-tional legacy data to the semantic web. In Proceedings ofthe 15th international conference on World Wide Web (WWW2006), pages 1069–1070, Edinburgh, UK, May 2006. ACM.

[5] K. Belhajjame, N. W. Paton, A. A. A. Fernandes, C. Hedeler,and S. M. Embury. User feedback as a first class citizen in in-formation integration systems. In CIDR, pages 175–183, 2011.

[6] F. Belleau, M.-A. Nolin, N. Tourigny, P. Rigault, and J. Moris-sette. Bio2RDF: Towards a mashup to build bioinformat-ics knowledge systems. Journal of Biomedical Informatics,41(5):706 – 716, 2008.

[7] T. Berners-Lee. Linked data. Technical report, W3C, 2006.[8] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame:

A generic architecture for storing and querying rdf and rdfschema. In Ian Horrocks and James Hendler, editors, The Se-mantic Web – ISWC 2002, volume 2342 of Lecture Notes inComputer Science, pages 54–68. Springer Berlin / Heidelberg,2002.

[9] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini,A. Poggi, M. Rodriguez-Muro, and R. Rosati. Ontologies anddatabases: The DL-Lite approach. In Proceedings of 5th In-ternational Reasoning Web Summer School (RW 2009), LNCS.Springer, August 2009.

[10] B. Chen, X. Dong, D. Jiao, H. Wang, Q. Zhu, Y. Ding, andD.J. Wild. Chem2Bio2RDF: a semantic framework for linkingand data mining chemogenomic and systems chemical biologydata. BMC Bioinformatics, 11(1):255, 2010.

[11] L. L. Chepelev and M. Dumontier. Semantic web integration ofcheminformatics resources with the sadi framework. Journalof cheminformatics, 3(1):16, 2011.

[12] A. Cheptsov, M. Assel, G. Gallizo, I. Celino, D. DellAglio,L. Bradesko, M. Witbrock, and E. Della Valle. LargeKnowledge Collider. A Service-oriented Platform for Large-

10 Gray et al. / Applying Linked Data Approached to Pharmacology

scale Semantic Reasoning. In Proceedings of the Interna-tional Conference on Web Intelligence, Mining and Semantics(WIMS2011), ACM International Conference ProceedingsSeries, Sogndal, Norway, 5 2011.

[13] The UniProt Consortium. Reorganizing the protein space at theuniversal protein resource (uniprot). Nucleic Acids Research,2011.

[14] D. De Roure, C. Goble, and R. Stevens. The design and real-isation of the myExperiment virtual research environment forsocial sharing of workflows. Future Generation Computer Sys-tems, 25:561–567, 2009. doi:10.1016/j.future.2008.06.010.

[15] O. Erling and I. Mikhailov. Virtuoso: RDF Support in a NativeRDBMS, page 501. 2010.

[16] P. Fox, D. McGuinness, R. Raskin, and K. Sinha. A vol-cano erupts: Semantically mediated integration of heteroge-neous volcanic and atmospheric data. In Proceedings of theACM first workshop on CyberInfrastructure: Information Man-agement in eScience, pages 1–6, Lisbon (Portugal), November2007. ACM.

[17] A. Gaulton, L. Bellis, J. Chambers, M. Davies, A. Hersey,Y. Light, S. McGlinchey, R. Akhtar, F. Atkinson, A.P.Bento, B. Al-Lazikani, D. Michalovich, and J.P. Overington.ChEMBL: A large-scale bioactivity database for chemical bi-ology and drug discovery. Nucl. Acids Res. Database Issue.,2011.

[18] J Goecks, A Nekrutenko, J Taylor, and The Galaxy Team.Galaxy: a comprehensive approach for supporting accessible,reproducible, and transparent computational research in the lifesciences. Genome Biology, 11(8):R86, 8 2010.

[19] A. Y. Halevy, M. J. Franklin, and D. Maier. Principles of datas-pace systems. In Proceedings of the Twenty-Fifth Symposiumon Principles of Database Systems (PODS 2006), pages 1–9,Chicago, (IL, USA), June 2006. ACM.

[20] A. Y. Halevy, A. Rajaraman, and J. J. Ordille. Data integration:The teenage years. In Proceedings of 32nd International Con-ference on Very Large Data Bases (VLDB), pages 9–16, Seoul(Korea), September 2006. ACM.

[21] S. Harris, N. Lamb, and N. Shadbolt. 4store : The design andimplementation of a clustered rdf store. Time, 42(8):81–96,2009.

[22] T. Heath and C. Bizer. Linked Data: Evolving the Web intoa Global Data Space, volume 1 of Synthesis Lectures on theSemantic Web: Theory and Technology. Morgan & Claypool,1st edition, 2011.

[23] E. Jacoby, K. Azzaoui, S. Senger, E. Cuadrado Rodríguez,M. Loza, B. Zdrazil, M. Pinto, A. J. Williams, V. de la Torre,J. Mestres, O. Taboureau, M. Rarey, and G. F. Ecker. Scien-tific requirements for the next generation semantic web-basedchemogenomics and systems chemical biology molecular in-formation system OPS. In Computational Chemogenomics.2012. To appear.

[24] E. Jain, A. Bairoch, S. Duvaud, I. Phan, N. Redaschi, B. Suzek,M. Martin, P. McGarvey, and E. Gasteiger. Infrastructure forthe life sciences: design and implementation of the UniProtwebsite. BMC Bioinformatics, 10(1):136+, 2009.

[25] M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, and M. Tanabe.Kegg for integration and interpretation of large-scale moleculardatasets. Nucleic Acids Research, 40:D109–114, 2012.

[26] T. Kelder, M.P. van Iersel, K. Hanspers, M. Kutmon, B.R. Con-klin, C. Evelo, and A.R. Pico. Wikipathways: building re-

search communities on biological pathways. NAR, 2011. doi:10.1093/nar/gkr1074.

[27] C Knox, V Law, T Jewison, P Liu, S Ly, A Frolkis, A Pon,K Banco, C Mak, V Neveu, Y Djoumbou, R Eisner, AC Guo,and DS Wishart. DrugBank 3.0: a comprehensive resource for’omics’ research on drugs. Nucleic Acids Research, 39:D1035–1041, 1 2011.

[28] M. Lenzerini. Data integration: A theoretical perspective.In Proceedings of 21st ACM Symposium on Principles ofDatabase Systems (PODS 2002), pages 233–246, Madison(WI, USA), June 2002. ACM.

[29] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank, M. Jones, E. Lee, J. Tao, and Y. Zhao. Scientific work-flow management and the kepler system. Special Issue: Work-flow in Grid Systems. Concurrency and Computation: Practice& Experience, 18(10):1039–1065, 2006.

[30] L Matthews, G Gopinath, M Gillespie, M Caudy, D Croft,B de Bono, P Garapati, J Hemish, H Hermjakob, B Jassal,A Kanapin, S Lewis, S Mahajan, B May, E Schmidt, I Vastrik,G Wu, E Birney, L Stein, and P D’Eustachio. Reactome knowl-edgebase of biological pathways and processes. Nucleic AcidsResearch, 37:D619–622, 2008.

[31] P. Missier, S. Soiland-Reyes, S. Owen, W. Tan, A. Nenadic,I. Dunlop, A. Williams, T. Oinn, , and C. Goble. Taverna,reloaded. In SSDBM 2010, Heidelberg, Germany„ 2010.

[32] Vassil Momtchev. Linked life data: Knowledge extractionand semantic data integration in the pharmaceutical domain, 42011. LarKC Pharma workshop at High Performance Comput-ing Center Stuttgart (HLRS), Stuttgart, Germany.

[33] E. K. Neumann, E. Miller, and J. Wilbanks. What the semanticweb could do for the life sciences. Drug Discovery Today:BIOSILICO, 2(6):228 – 236, 2004.

[34] H. E. Pence and A. Williams. ChemSpider: An Online Chem-ical Information Resource. Journal of Chemical Education,87(11):1123–1124, November 2010.

[35] A. Ruttenberg, J. A. Rees, M. Samwald, and M. S. Marshall.Life sciences on the Semantic Web: the Neurocommons andbeyond. Briefings in bioinformatics, 10(2):193–204, March2009.

[36] M. Samwald, A. Coulet, I. Huerga, R. L. Powers, J. S. Luciano,R. R. Freimuth, F. Whipple, E. Pichler, E. Prud’hommeaux,M. Dumontier, and M. S. Marshall. Semantically enablingpharmacogenomic data for the realization of personalizedmedicine. Pharmacogenomics, 13(2):201–212, 2012.

[37] M. Samwald, A. Jentzsch, C. Bouton, C. Kallesoe, E. Willigha-gen, J. Hajagos, M. Marshall, E. Prud’hommeaux, O. Hassan-zadeh, E. Pichler, and S. Stephens. Linked open drug data forpharmaceutical research and development. Journal of Chem-informatics, 3(1):19+, 2011.

[38] M. van Iersel, T. Kelder, A. Pico, K. Hanspers, S. Coort,B. Conklin, and C. Evelo. Presenting and exploring biologi-cal pathways with PathVisio. BMC Bioinformatics, 9(1):399+,2008.

[39] Bigdata Whitepaper. Bigdata : Approaching web scale for thesemantic web . Acta Informatica, 2009.

[40] A. J. Williams, L. Harland, P. Groth, S. Pettifer, C. Chich-ester, E. L. Willighagen, C. T. Evelo, N. Blomberg, G. Ecker,C. Goble, and B. Mons. Open PHACTS: Semantic interoper-ability for drug discovery. 2012. To be published.