1st year PARTNER evaluation report


PARTNER Grant Agreement Number 215840

WP23 – Deliverable 1 - 02/2010

Data Integration for the PARTNER

Hadron Therapy Information Sharing Platform

Daniel Abler

Supervisors: Prof. Manjit Dosanjh (CERN),

Prof. Ken Peach (PTCRi, University of Oxford),

Prof. Jim Davies (Computing Laboratory, University of Oxford)

Date 15.03.2010

PARTNER Work Package 23, deliverable 1

Data Integration for the PARTNER Hadron Therapy Information Sharing Platform

Daniel Abler

February 2010

This is the 1st Deliverable of the PARTNER Work Package 23 within the Marie Curie Initial Training Fellowship of the European Community’s Seventh Framework Programme under contract number (PITN-GA-2008-215840-PARTNER).

Contents

1 Introduction
  1.1 Cancer and cancer treatment
  1.2 PARTNER and the GRID work package
  1.3 Purpose and structure of this report

2 Medical information sharing
  2.1 Health Grids
  2.2 PARTNER Hadron Therapy Information Sharing Platform
  2.3 Challenges
    2.3.1 Grid for distributed user access and sources
    2.3.2 Data integration
    2.3.3 Security framework

3 Data integration
  3.1 Scenarios
    3.1.1 Characteristics of information systems
    3.1.2 Classification of information systems
  3.2 Architectural data integration
  3.3 Semantic data integration
    3.3.1 Mediator-wrapper approach
    3.3.2 Ontology based approaches

4 Enabling technologies
  4.1 Semantic web and semantic grid
  4.2 Overview of semantic web technologies
    4.2.1 Hypertext web technologies
    4.2.2 Standardised semantic web technologies
  4.3 Semantic Annotation
    4.3.1 Ontologies and Terminologies
      4.3.1.1 Example Ontologies and Terminologies
      4.3.1.2 Merging Ontologies
    4.3.2 Metadata Registries & Common Data Elements
      4.3.2.1 Common Data Elements
      4.3.2.2 Metadata Registries
  4.4 Semantic Queries
  4.5 Grid technologies for data access
    4.5.1 Open Grid Service Architecture Data Access and Integration (OGSA-DAI)

5 Examples of implementations
  5.1 ACGT
    5.1.1 The ACGT semantic mediator
    5.1.2 The ACGT Data Access Wrappers (ACGT-DAW)
    5.1.3 The ACGT Master Ontology on Cancer (ACGT-MOC)
  5.2 caBIG™
    5.2.1 caGrid infrastructure
    5.2.2 Interoperability
      5.2.2.1 Syntactic interoperability
      5.2.2.2 Semantic interoperability
    5.2.3 Cancer Common Ontologic Representation Environment (caCORE)
      5.2.3.1 Enterprise Vocabulary Services (EVS)
      5.2.3.2 Cancer Data Standards Repository (caDSR)
      5.2.3.3 Cancer Bioinformatics Infrastructure Objects (caBIO)
    5.2.4 Further caBIG-related projects and technology
      5.2.4.1 Cancer Translational Research Informatics Platform (caTRIP)
    5.2.5 Semantic queries
  5.3 CancerGrid
    5.3.1 METABRIC
  5.4 Further projects
    5.4.1 @neurIST
      5.4.1.1 General architecture
      5.4.1.2 Data integration
    5.4.2 ADMIRE
    5.4.3 OntoGRID
      5.4.3.1 Architecture
      5.4.3.2 Ontology driven data access with S-OGSA-DAI

6 Conclusions for the PARTNER platform
  6.1 Requirements for HISP infrastructure
  6.2 Conclusions
    6.2.1 Semantic annotation
      6.2.1.1 Terminology server
      6.2.1.2 Meta data registry
      6.2.1.3 Model repository
    6.2.2 “Mediation”-query processor
    6.2.3 Data access
  6.3 Next steps

List of Tables

List of Figures

Bibliography

Abstract

Information sharing is indispensable for efficient health care delivery and for improving clinical practice and health care technology. However, many factors complicate cross-institutional, inter-disciplinary and international information sharing.

Within the PARTNER FP7 project, work packages 22 and 23 address these difficulties by creating a prototype information sharing platform for the hadron therapy community: the Hadron Therapy Information Sharing Platform (HISP). This platform will provide users with a common access point to heterogeneous data sources using secure grid services. Work package 24 extends HISP by a rare tumour database (RTDB) which will support modelling of cancer treatment outcome and indications for hadron therapy.

To accomplish these functionalities, mechanisms for data integration and machine interoperability are needed.

This report provides a general introduction to data integration and gives an overview of existing technologies for semantic data integration. An analysis of several current grid-based data integration projects illustrates how these technologies enable interoperability and data sharing among research tools and researchers.

Since hadron therapy is an evolving field, the semantic annotation approaches from caBIG and CancerGrid seem very promising for HISP as they support this process. Semantic queries, on the other hand, are better supported by the approach ACGT has chosen. Attempts to combine both advantages by supporting semantic queries on caBIG are being made but are still at an experimental stage. In order to provide an easy-access portal to the data of HISP, the aspect of semantic queries will have to be explored further by WP23.

Supervisors
Prof. Manjit Dosanjh, CERN
Prof. Ken Peach, PTCRi, University of Oxford
Prof. Jim Davies, Computing Laboratory, University of Oxford

This is the 1st Deliverable of the PARTNER Work Package 23 within the Marie Curie Initial Training Fellowship of the European Community’s Seventh Framework Programme under contract number (PITN-GA-2008-215840-PARTNER).

1 Introduction

1.1 Cancer and cancer treatment

Each year about 11 million people are diagnosed with cancer and more than 7 million people die from this disease worldwide, especially in low- and middle-income countries, where 70% of all cancer-related deaths occur [84]. These numbers are expected to rise to about 20 to 26 million new diagnoses and 13 to 17 million deaths per year by 2030 [25].

In Europe, cancer accounts for 25.5% of all deaths and is the primary cause of death in the age group between 45 and 64 years, followed by circulatory diseases [42]. About 50% of all patients can be cured by surgery, radiation therapy, chemotherapy or a combined therapy. Radiotherapy, alone or as part of a combined therapy, accounts for 40% of this figure.

In radiation therapy, a target volume, the tumour, is exposed to ionising radiation (X-rays in most present-day machines), which induces damage in the genetic material of the cells in this target volume. The main goal of radiation therapy is to deliver a maximally effective dose to the tumour volume while sparing surrounding normal tissue. Since the dose delivered by X-rays penetrating into tissue falls off rapidly, the maximum dose to the target volume is limited by the dose constraints in the entry region of the beam. Also, a significant dose is delivered to sites beyond the target volume.

Despite the success of X-rays for cancer treatment, there remains a fraction of about 18% of patients in which present-day local control treatment modalities fail even for localised tumours [48]. A good fraction of these patients is expected to profit from Heavy Ion Therapy, a type of radiotherapy which uses protons [86] or heavier ions such as carbon in place of X-rays and which differs from conventional radiotherapy in two aspects. First, these particles show a more advantageous depth-dose relationship than photons since they deposit most of their energy at the end of their range in the tissue, the Bragg Peak. Furthermore, heavy ions have a higher relative biological effectiveness (RBE) than photons, producing more harmful DNA damage due to higher ionisation densities in the tissue. The ability to directly induce irreparable DNA damage makes them the treatment of choice for radio-resistant tumours, which account for 10% of cancer cases.


In 2009 more than 25 proton therapy centres were operational worldwide, but only two centres using carbon ions were treating patients in a clinical environment (HIMAC in Chiba and HIBMC in Hyogo, Japan) [71]. The first facility using carbon ions in Europe, HIT [48] in Heidelberg, Germany, was inaugurated in November 2009 [51]. The National Centre for Oncological Hadrontherapy (CNAO) in Pavia, Italy, is expected to start treating patients in 2010. Further facilities will become operational soon in Wiener Neustadt, Austria (MedAustron [46]), and in Lyon, France (ETOILE [29]).

Like any other new treatment modality, carbon therapy has to be established in medicine by proving its superiority and cost effectiveness compared to the current best available treatment. Based on the results from pilot projects, such as the one at GSI, Darmstadt, Germany, and clinical results from Japan, about 15% of all radiotherapy patients are likely to be treated better by particle therapy than by conventional radiotherapy. Thus, the demand for particle therapy centres is estimated at one proton centre per 5 million people and one carbon-ion centre per 35 million people [24]. These numbers clearly show that a transnational collaboration is required for determining best practices and for evaluating the efficacy of the treatment. Efficient exchange of clinical results and information is crucial for improving the treatment of patients.

1.2 PARTNER and the GRID work package

The goal of the Particle Training Network for European Radiotherapy (PARTNER) is to train researchers in the field of hadron therapy. The structure of PARTNER mirrors the interdisciplinary character of hadron therapy and involves research ranging from simulations and radiobiology through treatment planning, accelerator and gantry design to in-situ monitoring techniques. Since the goal of particle therapy is to improve cancer treatment, all results and developments will finally be evaluated in clinical practice. This kind of interdisciplinary and multinational research requires means to exchange data and information in order to be carried out efficiently. Work packages (WP) 22 and 23 in PARTNER reflect the need for a translational research infrastructure and aim to create a prototype of a collaborative infrastructure for information sharing in hadron therapy. Closely linked to this, WP 24 will use this information sharing infrastructure to build a rare tumour database and to develop prediction models.

1.3 Purpose and structure of this report

This report covers the first PARTNER deliverable for WP 23:

Summary of existing shared applications to be used in the infrastructure with emphasis on the mentioned experience of the Austrian radio-oncologists.

Since the information sharing platform is a joint effort between WP 22 and 23, it was agreed to split the main responsibilities into

• Grid core services, security, interfaces (WP 22),

• data semantics and ontology (data integration), query-based user access (WP 23).

Consequently, this report focuses on existing technology for data integration.

Chapter 2 introduces the idea of health grids as collaborative tools for medical information, outlines the purpose of the PARTNER Hadron Therapy Information Sharing Platform and identifies the challenges relevant to this project: distribution of data and users, multidisciplinarity, and legal and ethical aspects. Chapter 3, data integration, provides a methodological overview of how the first two of these challenges can be addressed. Chapter 4 introduces the technologies to implement data integration approaches. Some of the best-known projects and their data integration approaches are described in chapter 5. The last chapter summarises these approaches and evaluates their applicability for the PARTNER project.

2 Medical information sharing

Research into new treatment modalities, pharmaceutical drugs and their efficacy as well as the establishment of best treatment practices requires the comparison and evaluation of huge amounts of data. Making patient records available and transparently accessible would help medical professionals to determine the most suitable therapy for their patients. It also would facilitate patient referral within countries and increase patient mobility across Europe, which could solve problems of long waiting lists in some states [26].

2.1 Health Grids

The SHARE¹ initiative comes to the conclusion that GRID technology is most suitable for meeting the ICT challenges of the healthcare sector since it “...offers rapid computation, large data storage and flexible collaboration by harnessing together the power of large numbers of computers, from end-users’ desktops to powerful workstations and clusters of more powerful machines” [26]. The initiative defines health grids as “grid infrastructures comprising applications, services or middleware components that deal with the specific problems arising in the processing of biomedical data. Resources in healthgrids are databases, computing power, medical expertise and even medical devices.”

Grids are often differentiated into computational, data and collaboration grids and are employed depending on the needs of their community. While particle physics has to cope with huge amounts of data which must be quickly stored but not instantly processed, bio-informatics is less data-expensive but requires more intensive processing. When using grids in e-health², the collaborative aspect becomes most important.

¹ Supporting and structuring Healthgrid Activities and Research in Europe
² A definition (among many others) of e-health and a perspective for its future development is given in [87].

Figure 2.1: Conceptual layout of the PARTNER Hadron Therapy Information Sharing Platform (HISP) as presented at the Physics for Health in Europe workshop in February 2010 [22].

2.2 PARTNER Hadron Therapy Information Sharing Platform

The PARTNER Hadron Therapy Information Sharing Platform (HISP) will provide a prototype collaborative tool for translational research and clinical practice. As such it has to mediate the views of researchers and clinicians with different educational backgrounds and skills and provide them with access to different kinds of data stored in various independent repositories across Europe.

Figure 2.1 shows the conceptual layout of HISP: Users can access a multitude of data sources via a Grid-based software platform (HISP). Data access and data integration services of HISP provide the users with a common access point to the underlying data. HISP also enforces the security policies required for handling medical data.

Three early use scenarios were identified:

Research: The data integration capabilities of HISP will enable researchers from different disciplines to search across a large quantity of disease and treatment related data. Semantic annotation ensures that the original meaning of the data is preserved and the view on the data can be adjusted to the user’s domain.

Patient Referral: The data sharing capabilities of HISP will allow clinicians to exchange and access patient data in a standardised and secure way. This will facilitate cross-border patient referral in a multi-lingual environment with different legal frameworks.

Rare Tumour Database (RTDB): RTDB extends HISP by providing a prediction model for treatment outcomes as a function of the patients’ characteristics (clinical, morphological, genetic, molecular data) and the treatment modalities. The results of RTDB will yield indications for hadron therapy of rare tumours. This may serve as a treatment decision support system in the future.

Since HISP will provide access to its integrated data through standardised interfaces, applications for further use scenarios can easily be implemented as future developments.

2.3 Challenges

The idea of sharing medical information gives rise to ethical concerns, and information sharing in practice suffers from a number of technical difficulties. HISP has to provide solutions to three main challenges.

Besides providing secure, user-friendly and reliable access to interoperable data, the technology and the underlying ethical and legal framework to be employed for meeting these challenges must also support the community (healthcare professionals and researchers, as well as patients and governments) in developing a “culture of data sharing”.

2.3.1 Grid for distributed user access and sources

The way in which medical information is organised poses a challenge to data sharing. Relevant patient information usually resides in a multitude of databases (and occasionally paper reports) in different institutions. The users of HISP as well as the data to be integrated in HISP are highly distributed. HISP must be extensible in the sense that it has to allow new data sources to be easily integrated and new applications to be built onto the existing platform.

GRID technology provides mechanisms for user management and resource and service discovery. The technical structure of grids and examples of health grids have already been discussed in the first PARTNER deliverable of Faustin Roman [73].

2.3.2 Data integration

HISP will facilitate medical data sharing among researchers and medical professionals.

Physical access to distributed data alone is not sufficient for this since most medical information systems have grown and developed independently over years and were not designed with the need for inter-compatibility in mind. Although de-facto standards like HL7 and DICOM are supported by most devices, these standards are not sufficient for large-scale data sharing as needed in epidemiology or for conducting meta-analyses.

Medical research is multidisciplinary in nature and involves scientists with very different academic backgrounds. Consequently, terminologies and methodologies as well as the information needs vary among researchers from different disciplines. In order for interdisciplinary research to be efficient, information must be gathered in a standardised way and shared in a way which conserves the original meaning of data and the context in which it was recorded. “Metadata”, data about data, can be used to record information about the “semantics” of data, its meaning. This semantic annotation is crucial for extracting knowledge from medical data.
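The idea of attaching metadata to a recorded value can be sketched in a few lines of Python. This is only an illustration, not a format used by HISP: the class `AnnotatedValue`, its fields, the concept identifier `example:PrescribedDose` and the numeric value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedValue:
    """A recorded value together with metadata describing its meaning."""
    value: float
    unit: str     # unit of measurement
    concept: str  # hypothetical identifier of the concept the value refers to
    context: str  # circumstances under which the value was recorded


def describe(item: AnnotatedValue) -> str:
    """Render the value together with its semantic annotation."""
    return f"{item.concept}: {item.value} {item.unit} ({item.context})"


# Invented example record; the bare number 60.0 would be ambiguous without its metadata.
dose = AnnotatedValue(value=60.0, unit="Gy",
                      concept="example:PrescribedDose",
                      context="treatment plan")
print(describe(dose))
```

Without the unit, concept and context, a receiving system could not tell whether the number is a dose, an age or a laboratory result, which is the gap that semantic annotation is meant to close.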

The act of combining data from different sources (physically or virtually) and providing the user with a unified view of this data is termed “data integration” [58].

This report will explore how semantic data integration can provide solutions to this problem.

2.3.3 Security framework

Medical data contains sensitive information about individuals who have a right to privacy and confidentiality. Sharing this information poses legal and ethical challenges regarding the ownership of the data, primary and secondary use of healthcare data in commercial and public research, communication of genetic information, and the provenance and quality of information [26].

Depending on their roles, users have to be granted different privileges when handling the data in HISP. These privileges must be compliant with the ethical and legal requirements imposed by a regulatory framework for secure data sharing. Web portals providing easy-to-use access for different user groups and APIs for programmatic access also have to be embedded in this security framework.

More details on these aspects will follow in the second PARTNER deliverable (due by July 2010) together with the ethical and legal requirements from the perspectives of different countries.

3 Data integration

Data integration addresses the problem of combining data residing in different sources in order to provide the user with a unified view of this data [58]. There are two strong reasons for integrating data: first, to facilitate data access by providing a single access point to data in an existing information system; second, to complement data from different information systems in order to gain a more comprehensive set of data [89].

This chapter starts out by providing some background terminology for characterising database systems (3.1.1) and for classifying information systems according to their characteristics (3.1.2).

Projects which involve data integration of various heterogeneous data sources developed by different groups face three informatics problems.

First, the sources hosted by different institutions have to be programmatically accessed from various database management systems, requiring multiple query and retrieval mechanisms.

Second, the sources may have different data representations, data structures and database schemata. These problems can be overcome by the data integration approaches described in section 3.2.

The third problem is that each resource provider may use distinct terminologies and ontologies to annotate its data and data types. Different definitions or vocabularies for the same concepts make it difficult to interpret the information received. Semantic data integration is used to tackle these problems (section 3.3).
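The third problem can be made concrete with a small Python sketch in which two sources use different local terms for the same concepts and are compared through a shared vocabulary. The source names, local codes and shared concept identifiers are invented for illustration and do not correspond to any real terminology.

```python
# Hypothetical local vocabularies of two sources mapped to shared concepts.
# All codes and concept names are invented for illustration.
VOCABULARY_MAPS = {
    "source_a": {"lung-ca": "shared:LungCancer", "breast-ca": "shared:BreastCancer"},
    "source_b": {"T-017": "shared:LungCancer", "T-042": "shared:BreastCancer"},
}


def to_shared_concept(source: str, local_term: str) -> str:
    """Translate a source-specific term into the shared concept identifier."""
    try:
        return VOCABULARY_MAPS[source][local_term]
    except KeyError:
        raise KeyError(f"no mapping for {local_term!r} in {source!r}") from None


def same_concept(term_a: str, term_b: str) -> bool:
    """Check whether a term from source_a and a term from source_b mean the same."""
    return to_shared_concept("source_a", term_a) == to_shared_concept("source_b", term_b)


print(same_concept("lung-ca", "T-017"))
print(same_concept("lung-ca", "T-042"))
```

A plain string comparison of the local terms would never match, even where both sources describe the same disease; only the mapping to a common vocabulary makes the records comparable.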

3.1 Scenarios

3.1.1 Characteristics of information systems

Services consisting of multiple database systems can be characterised along three orthogonal dimensions (illustrated in figure 3.1) [78]:

Distribution of data means that data resides in different database systems (DBS) which may be geographically distributed but are interconnected by a communication system. Data distribution may increase availability of data and improve access times.

Heterogeneity among those sources exists due to differences in the syntax, structure or content of information systems. Different institutions may have selected different information systems (IS) due to different requirements or due to changes in technology. Differences in the data models of these systems (e.g. representation of data and its underlying language) as well as differences on the system level (technical heterogeneity) can both lead to heterogeneity. Semantic heterogeneity occurs when there is no agreement on the meaning, interpretation or intended use of data. Bridging heterogeneities is the main goal of information integration [81].

Autonomy describes the degree to which information systems can operate independently and includes design autonomy, communication autonomy and execution autonomy [66]. Design autonomy is the main cause of heterogeneity because it allows changes in the data model and the data schema of the individual DBSs and therefore leads to semantic heterogeneity. Communication and execution autonomy allow each subsystem to decide on the communication partners, how and what is being communicated and whether certain operations are executed or not.

Figure 3.1: Characteristics of database systems. The axes are distribution, heterogeneity and autonomy. Labelled points: central DB (---); distributed DB (D--); distributed and heterogeneous DB (D-H); distributed and autonomous DB (DA-); distributed, autonomous and partly heterogeneous DB (DAh); our situation (DAH).

Often, distribution of ISs leads to autonomy, which then results in system and data heterogeneity. By imposing standards, autonomy becomes limited and thus heterogeneity can be reduced.

3.1.2 Classification of information systems

Information systems can be classified according to their distribution, autonomy and heterogeneity (DAH) as discussed previously [81]. They can also be locally controlled and independent (-):

Homogeneous (- - -) ISs are centrally controlled and maintained.

Distributed (D - -) ISs are used to gain performance through parallelisation of processes or to replicate data in order to prevent data loss. The distribution is predefined; the individual database nodes obey the same database schema and are not autonomous.

Distributed and heterogeneous (D - H) ISs which are not autonomous will preserve their level of heterogeneity and will not become more heterogeneous.

Distributed and autonomous (D A -) databases can keep their data homogeneous by (voluntarily) following rules (conventions, standards, contractual rules...) in database design and data definitions. They are therefore not fully autonomous.

Multidatabases (D A h) may have different database schemas but use the same data models and query languages. Despite being autonomous, they are designed to allow access from other information systems.

This report (D A H) deals with data sources that are distributed and heterogeneous as they have been designed for different purposes, storing different types of data. Most of the sources represent “closed” systems (e.g. the information system of a whole hospital) and must work fully autonomously and independently of other sources.
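The classification above can be expressed as a small lookup, sketched here in Python. The type and function names are hypothetical, and the labels are written without spaces (e.g. `DAH`), as in figure 3.1; only the six classes listed above are covered.

```python
from dataclasses import dataclass
from enum import Enum


class Heterogeneity(Enum):
    NONE = "-"     # homogeneous
    PARTIAL = "h"  # different schemas, same data model and query language
    FULL = "H"     # fully heterogeneous


@dataclass(frozen=True)
class InformationSystem:
    distributed: bool
    autonomous: bool
    heterogeneity: Heterogeneity


def dah_label(system: InformationSystem) -> str:
    """Build the three-character DAH label used in the classification."""
    d = "D" if system.distributed else "-"
    a = "A" if system.autonomous else "-"
    return f"{d}{a}{system.heterogeneity.value}"


# Names of the classes described in the list above.
CLASS_NAMES = {
    "---": "homogeneous",
    "D--": "distributed",
    "D-H": "distributed and heterogeneous",
    "DA-": "distributed and autonomous",
    "DAh": "multidatabase",
    "DAH": "distributed, autonomous and heterogeneous",
}


def classify(system: InformationSystem) -> str:
    """Return the class name, or a note if the combination is not listed."""
    return CLASS_NAMES.get(dah_label(system), "not covered by the list above")


hospital_sources = InformationSystem(True, True, Heterogeneity.FULL)
print(dah_label(hospital_sources), classify(hospital_sources))
```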

3.2 Architectural data integration

The integration problem can be approached in different ways. Figure 3.2 illustrates the approaches from an architectural perspective according to the information system layer in which integration takes place: Users interact through various interfaces with different applications which access data through a middleware via a data access layer. Database and database management systems (DBMSs) together form the data access and storage layer.

The integration can be done manually by the user, which requires users’ expertise in dealing with different interfaces and query languages, as well as a detailed knowledge of the location of the data, its representation and semantics.

A common user interface, e.g. a web browser, may provide a common interface, but the homogenisation and integration are still left to the users. This is clearly the easiest to implement, but it is the hardest for the user, who has to learn many different systems. Since it relies upon a lot of manual manipulation of the data, it is prone to error.

Designated applications could be used to access various data sources and return the integrated results to the user. Scalability of such a system is likely to be problematic, since the integration applications have to be updated and grow in size when new systems are added. This is still not very convenient for the user, who has to update the application.

Figure 3.2: Approaches towards the integration problem from an architectural perspective. [89]

The integration can be achieved by middleware (different middleware tools), which relieves applications from implementing common integration functionality.

Uniform data access at the access level provides applications with a unified view of physically distributed data. Homogenisation and integration of data have to be done at run-time, which affects access times.

Data can be physically integrated into a common data storage, a “data warehouse”. This method provides fast data access but requires the common data storage to be refreshed regularly.

Data warehousing is the most representative approach based on data translation (DT): The goal of DT based methods is to create a centralised repository (“physical integration”) containing all the data from the different sources to be integrated, equipped with a unified and normalised schema. Data coming from different sources have to be translated to match the normalised schema. The approach also requires the data to be transferred to a different medium or database system, leading to a potential loss of control by the data owner over access to the data.
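The translation step of a DT based method can be sketched as follows. This is a minimal illustration under stated assumptions: the two sources, their field names, the units and the normalised schema (`patient_id`, `dose_gy`) are invented for this example and are not taken from any system described in this report.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def translate_source_a(row: Record) -> Record:
    # Assumption: source A stores the dose in gray under the key "dose".
    return {"patient_id": str(row["pid"]), "dose_gy": float(row["dose"])}


def translate_source_b(row: Record) -> Record:
    # Assumption: source B stores the dose in centigray under "dose_cgy".
    return {"patient_id": str(row["patient"]), "dose_gy": float(row["dose_cgy"]) / 100.0}


def load_warehouse(
    sources: Dict[str, List[Record]],
    translators: Dict[str, Callable[[Record], Record]],
) -> List[Record]:
    """Translate every source row into the normalised schema and store a copy."""
    warehouse: List[Record] = []
    for name, rows in sources.items():
        translate = translators[name]
        for row in rows:
            record = translate(row)
            record["source"] = name  # keep track of where the copy came from
            warehouse.append(record)
    return warehouse


sources = {
    "A": [{"pid": 1, "dose": 60.0}],
    "B": [{"patient": "p7", "dose_cgy": 5400}],
}
translators = {"A": translate_source_a, "B": translate_source_b}
warehouse = load_warehouse(sources, translators)
```

Note that the warehouse holds copies of the source data: the load has to be repeated to pick up changes, and the copies are outside the control of the original data owner.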

Figure 3.3 compares common integration methods.

Query translation (QT) based approaches are often termed “virtual” integration since the data is not stored in a single repository but integrated on demand by querying the sources. Queries are launched in terms of a “global schema” (often also called mediated schema) against a central system (Mediator or DBMS). This system translates the queries into sub-queries in the native languages of the underlying sources and presents the results to the user. QT based approaches include federated database systems, mediated systems, and other hybrid approaches.

Figure 3.3: Integration systems: Data-translation-based physical integration, in which data is translated into the global schema of a single DBS, and query-translation-based virtual integration, where the sources are queried in their respective native language “on demand”; different views are possible. (adapted from [81])
[Diagram content: physical integration (bulk-loadable data sources): data warehouse (global schema: yes); virtual integration (query-able data sources): tightly coupled federated DBMS (global schema: yes), loosely coupled federated DBMS (global schema: no), mediator-wrapper (global schema: yes), peer to peer (global schema: no).]

Modelling the relation between sources and the global schema, the mapping, is crucial, and two basic approaches have been proposed for this purpose. Both of them are equipped with a global schema against which user queries are launched.

Global-as-view (GAV) requires that the global schema is expressed in terms of the underlying data sources, whereas in the local-as-view (LAV) approach the global schema is specified independently of the sources.

The GAV approach facilitates query processing since the mapping specifies which source queries correspond to the elements of the global schema; this allows a simple unfolding strategy for query answering.

In the LAV approach, the content of each source is characterised in terms of a view over the global schema. These views present only partial information about the data, and it is not immediately obvious how to use the sources in order to answer queries expressed over the global schema.

While the GAV approach requires the mediated schema to be extended whenever a new source is integrated or an existing source changes, the LAV approach allows new sources to be added relatively easily¹. The problem of reformulating user queries, posed over the global schema, into queries referring to the source schemas can be generalised to the problem of answering (rewriting) queries using views [49].
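The unfolding strategy of the GAV approach can be sketched as follows. This is a simplified illustration, not an implementation of any system cited here: the global relation `Patient`, the two source relations and all field names are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]

# Two hypothetical sources with different local schemas.
SOURCE_HOSPITAL_A: List[Row] = [{"id": "a1", "diagnosis": "chordoma"}]
SOURCE_HOSPITAL_B: List[Row] = [
    {"patient": "b9", "dx": "chordoma"},
    {"patient": "b3", "dx": "glioma"},
]


def patient_view() -> Iterable[Row]:
    # GAV mapping: the global relation "Patient" is defined as a view
    # (here a union of renamed rows) over the underlying sources.
    rows_a = [{"patient_id": r["id"], "diagnosis": r["diagnosis"]} for r in SOURCE_HOSPITAL_A]
    rows_b = [{"patient_id": r["patient"], "diagnosis": r["dx"]} for r in SOURCE_HOSPITAL_B]
    return rows_a + rows_b


GAV_MAPPING: Dict[str, Callable[[], Iterable[Row]]] = {"Patient": patient_view}


def answer(global_relation: str, predicate: Callable[[Row], bool]) -> List[Row]:
    # Unfolding: replace the global relation by its view definition,
    # then evaluate the selection on the resulting rows.
    rows = GAV_MAPPING[global_relation]()
    return [row for row in rows if predicate(row)]


chordoma = answer("Patient", lambda row: row["diagnosis"] == "chordoma")
```

Adding a third source in this sketch means editing `patient_view`, which mirrors the drawback of GAV described above: the definition of the global schema has to change whenever the set of sources changes.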

The materialised approach to data integration in data warehouses adds problems related to dynamic aspects, like view updates and view maintenance, and tends to incur high costs and administrative effort. FDBSs provide a non-centralised alternative but require all sources to be DBMSs. The mediator-wrapper approach, discussed in section 3.3.1, offers more flexibility. Figure 3.4 compares FDBS systems to wrapper-mediator based systems for data integration.

¹ More information on GAV and LAV can be found e.g. in [58].

Figure 3.4: 5-layer architecture for federated database systems versus mediator-wrapper approach for data integration (according to [81]).
[Diagram content: federated database system with external schemas, federated schema, export schemas, component schemas and local schemas; mediator-wrapper approach with applications 1 to n, mediator, wrappers 1 to n and sources 1 to n.]

3.3 Semantic data integration

Semantic data integration tries to solve semantic conflicts between heterogeneous data sources. Different data sites may capture their information with different tools and methodologies and the data may be described in different languages. This heterogeneity causes difficulties when exchanging information, since the meaning of that information cannot automatically be understood and interpreted by all the members of the federation.

Semantic data integration combines data access tools with tools for describing the meaning of things and interconverting assumptions so that the use of data can unambiguously be understood in the context of applications. Ontologies are widely accepted for representing knowledge and are used for resolving semantic conflicts in the Semantic Web. Semantic integration has been refined over many years, starting from early multidatabase systems, through simple mediator systems (section 3.3.1), to ontology based approaches (section 3.3.2). More recent approaches use Semantic Web technologies or are based on the Semantic Grid, described in chapter 4.

3.3.1 Mediator-wrapper approach

“A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications.” Wiederhold [85] proposes a 3-layer approach to data integration (figure 3.4) in which the mediation functionality is separated from the user-oriented processing and from database access.

Mediators require domain knowledge to overcome structural and semantic heterogeneity.

Wrappers are software components connecting mediators to the data sources. They translate the global query language into the source-specific query language and data model in order to resolve (technical) heterogeneities in the interfaces.

Figure 3.5: Garlic architecture [37], a mediation system based on an object oriented data model.

The mediator-wrapper approach offers up-to-date data due to virtual integration. The sources can stay autonomous and the system can easily be extended by adding mediators and new wrappers for virtually any kind of source (not only DBMSs). On the other hand, wrappers have to be maintained, query results and performance may be limited by the sources, and ensuring data quality can be difficult.

Figure 3.5 shows the architecture of an early mediator-wrapper system from 1995. The goal of Garlic [37] was to develop a system and associated tools for the management of large quantities of heterogeneous multimedia information. Data of various formats is seen through a unified schema expressed in an object-oriented data model, and is queried and manipulated by an object-oriented dialect of SQL. A “middleware” query processor and data wrappers ensure extensibility of the system. Queries are formulated in a Garlic schema, which is maintained in the Garlic metadata repository, together with translation-related information for the various data repositories.
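The division of labour between mediator and wrappers can be sketched as follows. This is a minimal illustration using only the Python standard library; the class names, the table `cases`, the CSV layout and the single query operation `find_by_diagnosis` are assumptions for this sketch and do not reproduce the interface of Garlic or of any other mediator system.

```python
import csv
import io
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, List

Row = Dict[str, str]


class Wrapper(ABC):
    """Translates a query on the global schema into a source-specific access."""

    @abstractmethod
    def find_by_diagnosis(self, diagnosis: str) -> List[Row]:
        ...


class SqlWrapper(Wrapper):
    """Wrapper for a relational source (hypothetical table cases(pid, dx))."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self.connection = connection

    def find_by_diagnosis(self, diagnosis: str) -> List[Row]:
        cursor = self.connection.execute(
            "SELECT pid, dx FROM cases WHERE dx = ?", (diagnosis,)
        )
        return [{"patient_id": str(pid), "diagnosis": dx} for pid, dx in cursor]


class CsvWrapper(Wrapper):
    """Wrapper for a non-DBMS source: a CSV export with its own column names."""

    def __init__(self, text: str) -> None:
        self.text = text

    def find_by_diagnosis(self, diagnosis: str) -> List[Row]:
        reader = csv.DictReader(io.StringIO(self.text))
        return [
            {"patient_id": row["patient"], "diagnosis": row["diagnosis"]}
            for row in reader
            if row["diagnosis"] == diagnosis
        ]


class Mediator:
    """Forwards a global query to every wrapper and merges the results."""

    def __init__(self, wrappers: List[Wrapper]) -> None:
        self.wrappers = wrappers

    def find_by_diagnosis(self, diagnosis: str) -> List[Row]:
        results: List[Row] = []
        for wrapper in self.wrappers:
            results.extend(wrapper.find_by_diagnosis(diagnosis))
        return results


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE cases (pid INTEGER, dx TEXT)")
connection.executemany("INSERT INTO cases VALUES (?, ?)", [(1, "chordoma"), (2, "glioma")])
csv_text = "patient,diagnosis\nx7,chordoma\nx8,sarcoma\n"

mediator = Mediator([SqlWrapper(connection), CsvWrapper(csv_text)])
results = mediator.find_by_diagnosis("chordoma")
```

In this sketch a new source is integrated by writing one more `Wrapper` subclass; the sources themselves are queried at request time and are not copied.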

3.3.2 Ontology based approaches

The use of ontologies (see 4.3.1) for the explanation of implicit and hidden knowledge is a possible approach to overcome the problem of semantic heterogeneity. Wache et al. [83] compared 25 early-stage approaches to intelligent information integration and characterised them according to their use of ontologies. Three different directions could be identified (Fig. 3.6): single ontology approaches, multiple ontology approaches and hybrid approaches.

Figure 3.6: Common usage of ontologies: single (a), multiple (b) and hybrid ontology (c) approaches. [83]

Single Ontology approaches use one global ontology providing a shared vocabulary for the specification of the semantics. This approach can be applied to integration problems where all information sources to be integrated provide nearly the same view on a domain.

In Multiple Ontology approaches each information source is described by its own ontology. For example, OBSERVER [63] uses multiple ontologies to address the problem of heterogeneous vocabularies used in different domains to describe similar information. User queries are rewritten by using inter-ontology relationships to obtain semantics-preserving translations across the ontologies. The architecture is pictured in Fig. 3.7.

Although this approach enables independent development of the source ontologies, the lack of a common vocabulary makes comparing different source ontologies difficult. Inter-ontology mappings are needed to define the semantic relation between the terms used by the respective ontologies.

Figure 3.7: OBSERVER [63] architecture, supporting multiple ontologies. The Query Processor translates terms in user queries into component ontologies while preserving their semantics. The Ontology Server addresses the structural heterogeneity by mapping the terms in ontologies to the structures in data repositories. The Inter-ontology Relationships Manager (IRM) addresses the vocabulary problem by mapping synonym relationships across the terms in the ontologies.

Similarly to multiple ontology approaches, in hybrid approaches the semantics of each source is described by its own ontology. In order to make the ontologies comparable to each other, they are built upon one global shared vocabulary which contains basic terms of a domain.

The SIRUP (Semantic Integration Reflecting User-specific semantic Perspectives) approach [88] aims to enable users to view data in their individual real-world context by first defining their specific domain model and semantics.

More information about the usage of ontologies for knowledge management can be found in section 4.3.1.


4 Enabling technologies

Semantic data integration in the context of HISP requires tools for data access, for annotating data with semantics and for data querying.

Most of these tools use technologies developed in the wake of the semantic web. This chapter introduces semantic web concepts and discusses how these technologies can be used for accessing, annotating and querying data in a grid environment.

4.1 Semantic web and semantic grid

The Semantic Web is a vision of information that is understandable by computers. Information is given well-defined meaning so that machines can find, share and combine information available on the web. Tim Berners-Lee expressed his vision of the semantic web as follows [30]:

“I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize.”

For the W3C the “Semantic Web is a web of data” that “provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries” [16]. The challenge for the semantic web is to provide a language that expresses both data and rules for reasoning about the data [31]. An outline of technologies providing these capabilities is given in section 4.2.

The ideas of the Semantic Web also changed the notion of Grids. The Semantic Grid is expected to combine semantic interoperability with Grid technology to provide a generically usable e-Research infrastructure for flexible collaborations and computations on a global scale. The key to this is an infrastructure where all resources, including services, are adequately described in a form that is machine-processable; the goal is semantic interoperability. Goble and Roure [44] further differentiate between a Knowledge Grid, which is a Grid of semantics based on knowledge generated by applications using and mining the Grid, and a Semantic Grid, which uses semantics for the Grid to manage and execute its architectural components. Fig. 4.1b shows the Knowledge Grid stack, built of various layers like the wisdom hierarchy (fig. 4.1a).

Figure 4.1: The wisdom hierarchy and the Semantic Grid: Grid fabric layers provide administrative and data access capabilities, semantic annotation enabled Grid middleware ensures that the meaning of data is preserved, further higher-level services provide information processing capabilities which may generate new knowledge.
(a) The wisdom hierarchy mapping to types of information systems (adopted from [75]): transaction processing systems (data), management information systems (information), decision support systems (knowledge), expert systems (wisdom).
(b) The Semantic/Knowledge Grid Stack. (from [44])

The Semantic Grid is a joint effort of the semantic web and Grid communities, “glued” together by web services (Fig. 4.2)¹.

4.2 Overview of semantic web technologies

The Semantic Web, being a Web of Data, requires technologies to link data and to record their relationships, to draw inferences from this data using vocabularies and to query the data.

Once data has been linked (RDF²) and published or embedded in documents (RDFa, GRDDL), an RDF-specific language is needed to query this data and to retrieve results. This is provided by the SPARQL query language and the accompanying protocols.

Vocabularies on the Semantic Web define the concepts and relationships used to describe an area of concern and are used to classify the terms which can be used in a particular application. There is no clear division between what is referred to as vocabularies and ontologies (section 4.3.1). Usually, ontology is used for a more formal collection of terms. Vocabularies help data integration, e.g. when ambiguities exist in the terms used in different datasets, and they are the basic building blocks for inference techniques on the Semantic Web.

Figure 4.2: Stakeholders to be bridged for a Semantic Grid. (from [44])

¹ Further information on the Semantic Grid: [11, 44, 74], ONTOGrid, Semantic Grid RG (SEM-RG) group of the OpenGridForum.

² All these languages are briefly described in the following subsections 4.2.1 and 4.2.2.

Inference on the Semantic Web can broadly be characterised as the discovery of new relationships. Data on the Semantic Web is modelled as a set of relationships between resources; “inference” means that automatic procedures can generate new relationships based on the data and on additional information about the data. This extra information can be defined via vocabularies/ontologies (RDF, RDFS, OWL, SKOS) or rule sets (RIF) [17].

The semantic web stack (figure 4.3) illustrates the hierarchy of the languages which empower linked data. The Semantic Web Technologies (section 4.2.2), as an extension of the web, are based on hypertext web technologies (section 4.2.1). Most of the technologies are standardised by the World Wide Web Consortium (W3C).
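How automatic procedures generate new relationships can be sketched with two rules in the spirit of the RDFS entailment rules: subclass relations are transitive, and an instance of a class is also an instance of its superclasses. The sketch below uses plain Python tuples instead of an RDF library; the `ex:` terms are hypothetical and the fixpoint loop is a naive illustration, not an efficient reasoner.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def infer(triples: Set[Triple]) -> Set[Triple]:
    """Apply the two rules repeatedly until no new triple is produced."""
    inferred = set(triples)
    while True:
        new: Set[Triple] = set()
        for (a, p, b) in inferred:
            if p != SUBCLASS:
                continue
            for (c, q, d) in inferred:
                # Transitivity: a subClassOf b, b subClassOf d => a subClassOf d
                if q == SUBCLASS and c == b:
                    new.add((a, SUBCLASS, d))
                # Type propagation: c type a, a subClassOf b => c type b
                if q == TYPE and d == a:
                    new.add((c, TYPE, b))
        if new <= inferred:
            return inferred
        inferred |= new


# Hypothetical data: one class hierarchy and one annotated case.
data: Set[Triple] = {
    ("ex:Chordoma", SUBCLASS, "ex:BoneTumour"),
    ("ex:BoneTumour", SUBCLASS, "ex:Neoplasm"),
    ("ex:case42", TYPE, "ex:Chordoma"),
}
closure = infer(data)
```

The triple stating that `ex:case42` is a neoplasm is not in the input; it is one of the relationships derived from the data together with the vocabulary.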

4.2.1 Hypertext web technologies

Hypertext Web technologies form the bottom layer of the semantic technology stack and provide the basis for the semantic web.

Internationalised Resource Identifier (IRI) is a generalisation of URI and provides means for uniquely identifying semantic web resources.

XML is a markup language that enables creation of documents composed of structured data. The structure of XML documents can be defined by a Document Type Definition (DTD) or an XML Schema. Both standards enable independent groups of people to specify how their XML documents are formatted so that they can interchange data.

XSLT (a part of the Extensible Stylesheet Language (XSL)) is a language for transforming XML documents into other XML documents. It uses XPath to define parts of the source document that should match one or more predefined templates. When a match is found, XSLT will transform the matching part of the source document into the result document.

XQuery is a query and functional programming language that is designed to query collections of XML data. [18]

Figure 4.3: Hierarchy of languages in the semantic web. (from [16])

XML Namespaces are used for providing uniquely named elements and attributes in an XML document. When an XML instance contains element or attribute names from more than one XML vocabulary, ambiguities between identically named elements or attributes can be resolved if each of these vocabularies is given a particular namespace. [19]
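The role of namespaces can be illustrated with the `xml.etree.ElementTree` module of the Python standard library, which supports a limited subset of XPath. The document, the two namespace URIs and the element names below are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies that both define an element called "name".
DOCUMENT = """
<record xmlns:clin="http://example.org/clinical"
        xmlns:img="http://example.org/imaging">
  <clin:name>Jane Doe</clin:name>
  <img:name>CT series 3</img:name>
</record>
"""

# Prefix-to-URI mapping used when evaluating the path expressions below.
NAMESPACES = {
    "clin": "http://example.org/clinical",
    "img": "http://example.org/imaging",
}

root = ET.fromstring(DOCUMENT.strip())

# The namespace prefix selects which of the two "name" elements is meant.
patient_name = root.find("clin:name", NAMESPACES).text
image_name = root.find("img:name", NAMESPACES).text

# Without a namespace, neither element matches the plain name.
unqualified = root.find("name")
```

Both elements share the local name `name`, but since each belongs to a different namespace the two can be addressed without ambiguity.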

4.2.2 Standardised semantic web technologies

The middle layer contains modelling technologies which enable semantic web applications. These highly structured languages provide the basis for building easy-to-use and user-friendly tools for information exchange:

Resource Description Framework (RDF) is a general method for conceptual description or modelling of information that is implemented in web resources, using syntax formats based on XML or URI. The RDF data structure consists of a subject, predicate, and object. These three components constitute a statement known as a triple. Triples are linked together to form a graph structure [14]. Subjects or objects which have the same URI are merged into a single node. This feature is called RDF Merge and provides the basis for data integration on the Semantic Web.

RDF Schema (RDFS) provides basic vocabulary for describing the meaning of RDF data. Meaning in the semantic web is defined by the kind of inferences that can be made. RDFS defines rules (expressed in formal logic) that, when used in a particular pattern, allow certain inferences to be made. This enables the use of reasoning software to draw inferences from the data. The inferred data is expressed as new triples in the RDF graph.

Web Ontology Language (OWL) extends RDF and RDFS by adding more advanced constructs to describe the semantics of RDF statements. OWL can be used to explicitly represent the meaning of terms in vocabularies and the relationships between those terms. It allows stating additional constraints, such as cardinality, restrictions of values, or characteristics of properties such as transitivity, and brings reasoning capability to the semantic web. OWL Full ensures compatibility with RDFS [12]. OWL 2 is an extension and revision of the OWL language that was published by the Web Ontology Working Group in 2004 [13].

Simple Knowledge Organisation System (SKOS) is an RDF vocabulary designed to represent terminologies, thesauri and classification schemes; its relational semantics are considerably simpler than those of OWL. Using SKOS, a knowledge organisation system can be expressed as machine-readable data. The SKOS data model is formally defined as an OWL Full ontology [15].

Semantic Web Rule Language (SWRL) adds support for rules, which allow the expression of concepts that cannot be expressed with OWL axioms alone.

SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. The query results are returned in the SPARQL Query Results XML format. [9]
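The triple structure, the merging of graphs and the idea of querying by pattern can be sketched without any RDF library. In the sketch below a graph is simply a set of (subject, predicate, object) tuples; the URIs and `ex:` terms are hypothetical, and the `match` helper only imitates a single triple pattern with `None` as a wildcard, so it is far less expressive than SPARQL.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Two hypothetical sources describing the same patient URI.
graph_a: Set[Triple] = {
    ("http://example.org/patient/42", "ex:hasDiagnosis", "ex:Chordoma"),
}
graph_b: Set[Triple] = {
    ("http://example.org/patient/42", "ex:treatedAt", "ex:CentreX"),
    ("http://example.org/patient/7", "ex:treatedAt", "ex:CentreY"),
}

# Merge: nodes with the same URI coincide, so the union of the triple sets
# yields one graph in which both statements hang off the same patient node.
merged = graph_a | graph_b


def match(
    graph: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return all triples matching the pattern; None acts as a variable."""
    return sorted(
        t
        for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


about_42 = match(merged, s="http://example.org/patient/42")
treated = match(merged, p="ex:treatedAt")
```

After the merge, a single pattern on the patient URI returns statements that originated in two different sources, which is the basis of data integration mentioned in the RDF entry above.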

4.3 Semantic Annotation

The previous section introduced technologies of the semantic web. These technologies can be used to record and maintain the correct meaning of data by adding metadata: semantic annotation.

Figure 4.4: Extract from the ACGT Master Ontology for the class “neoplasm”.

Metadata is defined by the ISO/IEC 11179 [53] “to be data that defines and describes other data. This means that metadata are data, and data become metadata when they are used in this way. This happens under particular circumstances, for particular purposes, and with certain perspectives, as no data are always metadata.”

Semantic annotation requires a way to express the meaning of data by establishing links between data and description of data.

Two annotation mechanisms will be introduced here: Linking data to a concept in a domain ontology fully describes its meaning (section 4.3.1). Another approach is to describe the meaning of data in metadata registries (section 4.3.2).

4.3.1 Ontologies and Terminologies

In information science, an ontology is a formal representation of a set of concepts within a domain. “A conceptualisation is an abstract, simplified view of the world that we wish to represent for some purpose. [...] An ontology is an explicit specification of a conceptualisation.” [47]

Ontologies provide vocabularies which can be used to model a domain. Most ontologies describe individuals (instances), classes (concepts), attributes and relations.

Domain ontologies model specific domains and represent the particular meanings of terms as they are used in that domain. Fig. 4.4 shows an extract from the ACGT Master Ontology on Cancer.

Foundation ontologies (or upper ontologies) provide models of common objects which are generally applicable across a wide range of domains.

Ontology languages are formal languages used to encode ontologies, OWL being the most widely used one.


4.3.1.1 Example Ontologies and Terminologies

Since ontologies and terminology sources provide domain models and controlled vocabularies, they are tools to standardise information for the purposes of capturing, storing, exchanging, searching, and analysing data. They contain the terms generally used in literature of the respective subject domain. Using such well-defined terms is necessary to achieve semantic interoperability among systems.

There are several well-established generic ontologies and terminologies for the biomedical domain, like the

• Systematised Nomenclature of Medicine Clinical Terms (SNOMED CT),

• Unified Medical Language System (UMLS),

• Generalized Architecture for Languages, Encyclopedias and Nomenclatures in medicine (OPEN GALEN),

as well as more specific ones, e.g. the

• Foundational Model of Anatomy (FMA),

• NCI Thesaurus, or

• Logical Observation Identifiers Names and Codes (LOINC).

A more detailed overview of these resources can be found in [59].

Unfortunately, many of these resources involve incompatible formats, are based on different modelling languages, and lack appropriate tooling and programming interfaces.

The Lexical Grid (LexGrid) model accommodates multiple vocabulary and ontology distribution formats and supports multiple data stores for federated vocabulary distribution. The model provides a foundation for building consistent and standardised APIs to access multiple vocabularies that support features like lexical search queries and hierarchy navigation. Support for reasoning with ontologies is planned for future versions of LexGrid [68].

LexGrid is implemented as the LexBig API for the caBIG community. It is being developed to accommodate multiple vocabulary and ontology distribution formats (Open Biomedical Ontologies (OBO), Web Ontology Language (OWL), Unified Medical Language System (UMLS) Rich Release Format (RRF)) and to support multiple data stores (e.g. relational database, LDAP, XML) for federated vocabulary distribution. [4]

4.3.1.2 Merging Ontologies

Domain ontologies of different domains often are incompatible since they represent the concepts of their domain in a very specific way. Trying to expand a domain ontology into more general representations is challenging. If domain ontologies use the same foundation ontology to provide a set of basic elements which are used to specify the meanings of the domain ontology, the elements can be merged automatically.


4.3.2 Metadata Registries & Common Data Elements

The ISO/IEC 11179 is a standard for the implementation of metadata registries and defines how to use common data elements for describing concepts.

4.3.2.1 Common Data Elements

A Data Element (often called Common Data Element (CDE)) for the purpose of ISO/IEC 11179 [53] is composed of two parts:

• A Data Element Concept (DEC) is a formal description of the thing about which a data value is recorded, in the example in figure 4.5, the name of an agent.

• A Representation consists of the properties which constitute a valid response for the thing that is recorded. Primarily this is a Value Domain (VD); data type and units of measure are other representations. Value domains can be enumerated or constrained around permitted values.

Data Element Concepts themselves are composed of two sub-components:

• An Object Class is the entity that is being described by the data element.

• A Property is a specific attribute of the entity whose value is being recorded.

In the example in fig. 4.5, the entity being described is an Agent, and the property (the characteristic being used to distinguish instances of one Agent from another) is its name.

The ISO 11179 on its own does not provide an unambiguous definition of a data element because it uses words or phrases to name and represent the meaning of most of the components of a data element. For example, the exact meaning of “Agent Name” in figure 4.5 cannot be determined, either by a computer or by a human being. A solution to this problem is to bind the words or phrases to concepts in a controlled terminology, e.g. the one supplied by NCI EVS in the case of caBIG in section 5.2.
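The composition of a data element described above can be sketched with Python dataclasses. The class and field names follow the terms used in the text, but the structure is a simplification made for this sketch and not the ISO/IEC 11179 metamodel itself; the enumerated agent values are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class DataElementConcept:
    """What is being described: an object class and one of its properties."""
    object_class: str
    property_name: str


@dataclass(frozen=True)
class ValueDomain:
    """How a value is represented and which values are valid."""
    datatype: str
    permitted_values: Optional[FrozenSet[str]] = None  # None: not enumerated
    unit_of_measure: Optional[str] = None

    def accepts(self, value: str) -> bool:
        return self.permitted_values is None or value in self.permitted_values


@dataclass(frozen=True)
class DataElement:
    """A data element combines a concept with its representation."""
    concept: DataElementConcept
    value_domain: ValueDomain

    @property
    def name(self) -> str:
        return f"{self.concept.object_class} {self.concept.property_name}"


# The "Agent Name" example, with a hypothetical enumerated value domain.
agent_name = DataElement(
    concept=DataElementConcept(object_class="Agent", property_name="Name"),
    value_domain=ValueDomain(
        datatype="string",
        permitted_values=frozenset({"cisplatin", "carboplatin"}),
    ),
)
```

The sketch names each component with a plain string, which is exactly the ambiguity discussed in the text: nothing in the structure itself says what "Agent" means until the string is bound to a concept in a controlled terminology.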

4.3.2.2 Metadata Registries

Since metadata is also data, metadata may be stored in a database. A database of metadata that supports the functionality of registration is called a metadata registry (MDR) [53].

A metadata registry typically has the following characteristics:

• It is a protected area where only approved individuals may make changes

• It stores data elements that include both semantics and representations

• The semantic areas of a metadata registry contain the meaning of a data element with precise definitions

• The representational areas of a metadata registry define how the data is represented in a specific format such as within a database or a structured file format such as XML

Figure 4.5: Simplified view of a Common Data Element (CDE) in the caDSR implementation (caBIG, see section 5.2.3.2) of the ISO 11179 metamodel. This hypothetical example is for a CDE that describes the name of an agent (i.e., a drug compound) that is constrained to an enumerated list of values provided by the Cancer Therapeutics Evaluation Program (CTEP) of the NCI. (from [56])

The MDR stores the identifier of the CDE and more details such as definition, value domain, unit of measure, and the property and object class to which the CDE belongs. MDRs often also provide tools for registering, updating and browsing CDEs, concepts and properties.

Thus, registering CDEs in a metadata registry provides a metadata definition by an informal explanation of the CDE’s meaning and usage, a list of alternative names and definitions, units of measurements and the types of values to be recorded. Fig. 4.6 shows a registered CDE in the prototype ULICE MDR.

In contrast to the ontological approach, registered CDEs do not derive their meaning from their position in a taxonomy or graph. There is therefore no need to define the position of a CDE within an existing domain ontology before recording the semantics of a data definition. However, classifications or ontologies can be added to support navigation and inference.
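The characteristics of a metadata registry listed above (protected registration, lookup by identifier, alternative names) can be sketched as a small in-memory class. This is an illustrative assumption, not the interface of cgMDR, caDSR or any other registry product; the user name `curator` is hypothetical, while the identifier and names are those of the registered CDE mentioned in the text.

```python
from typing import Dict, List, Set


class MetadataRegistry:
    """Minimal in-memory registry of common data elements (CDEs)."""

    def __init__(self, approved_users: Set[str]) -> None:
        self._approved_users = approved_users
        self._records: Dict[str, dict] = {}

    def register(self, user: str, identifier: str, record: dict) -> None:
        # Only approved individuals may change the registry.
        if user not in self._approved_users:
            raise PermissionError(f"{user} may not change the registry")
        self._records[identifier] = record

    def lookup(self, identifier: str) -> dict:
        return self._records[identifier]

    def find_by_name(self, name: str) -> List[str]:
        # Match the preferred name and all alternative names, ignoring case.
        wanted = name.lower()
        return sorted(
            identifier
            for identifier, record in self._records.items()
            if wanted in (n.lower() for n in record["names"])
        )


registry = MetadataRegistry(approved_users={"curator"})
registry.register(
    "curator",
    "GB-OUCL-CB4CCCE12-1",
    {
        "names": ["Allocated Treatment", "Participant Treatment Allocation"],
        "datatype": "xs:string",
        "object_class": "Participant",
    },
)
hits = registry.find_by_name("participant treatment allocation")
```

Note that the lookup works purely on identifiers and names: the CDE is found without being placed anywhere in a taxonomy, in line with the contrast to the ontological approach drawn above.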


[Screenshot of the cgMDR web interface (eXist: 1.4rc, cgMDR: 1.1 beta, revision: 592), retrieved 22.01.2010 from http://ptcri-ulice.oerc.ox.ac.uk:8080/exist/mdr/web/show-admin-item.xq...]

Data Element: Allocated Treatment

Preferred Name: Allocated Treatment
Administered Item Identifier: GB-OUCL-CB4CCCE12-1
Registration Status: Standard
Definition: The treatment allocated to the participant in a clinical study in the code system specified by that study. The data element should be accompanied by metadata describing the codeset when communicated to third parties
Registered By: Dr. Steve Harris (Researcher, IDH), Oxford Comlab

Value Domain Attributes
Value domain record: Allocated Treatment
Conceptual Domain: Treatment
Datatype: xs:string (XMLSchema)
Unit of Measure: (not applicable)
Format: not specified
Maximum Character Quantity: not specified

Representation Class / Model Reference
Expresses data element concept (full record): Allocated Treatment
Object class: Participant urn:lsid:ncicb.nci.nih.gov:nci-thesaurus:C29867
Property: default property
Example: not-specified
Typed by Representation Class: not-specified
Representation Class Qualifier: unqualified
Related data elements: Allocated Treatment supersededBy this element

Reference Documents
id: 515023787, language: eng, title: EBCTCG Fifth Cycle Data Format, description: A PDF of the EBCTCG Fifth Cycle Data Format accessed on the 20/01/2010

Naming
context: ULICE, language: GB-eng, name: Allocated Treatment, preferred: true
context: ULICE, language: GB-eng, name: Participant Treatment Allocation, preferred: false

Administration
Administrative Status: noPendingChanges
Administered By: Steve Harris (Researcher, IDH)
Created On: 2010-01-21Z
Effective From: 2010-01-21Z
Last Changed On: 2010-01-21Z
Effective until: 2020-01-01
Submitted By: Steve Harris (Researcher, IDH)
Explanatory Comments, Administrative Note, Change Description, Unresolved Issue, Origin: not-specified

Copyright (C) 2010 The CancerGrid Consortium (http://www.cancergrid.org)

Figure 4.6: A registered Data Element in the ULICE MDR based on cgMDR from cancerGrid.


4.4 Semantic Queries

Semantic queries help users to manipulate data without knowing the details of the underlying syntactic data structure. While syntactic query languages only support retrieval of data based on explicit syntactic information, semantic queries enable the retrieval of both explicit and implicitly derived information, based on syntactic and semantic information.

These “intelligent” queries require a data model and reasoning capabilities in order to derive relations between data. Ontologies or merged RDF graphs provide the basis for this reasoning and can be queried with SPARQL.
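The difference between a syntactic and a semantic query can be illustrated with a toy triple store. The sketch below is not SPARQL and uses no RDF library; the triples, the class names and the single inference rule (transitivity of `subClassOf`) are assumptions made purely for illustration.

```python
# Triples are (subject, predicate, object); all names are invented examples.
TRIPLES = {
    ("ProtonTherapy", "subClassOf", "HadronTherapy"),
    ("CarbonIonTherapy", "subClassOf", "HadronTherapy"),
    ("HadronTherapy", "subClassOf", "Radiotherapy"),
    ("patient1", "receives", "ProtonTherapy"),
    ("patient2", "receives", "CarbonIonTherapy"),
    ("patient3", "receives", "Chemotherapy"),
}


def syntactic_query(triples, predicate, obj):
    """Return subjects with an explicitly stated (predicate, obj) pair."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}


def subclasses_of(triples, cls):
    """Return cls plus every class that is transitively a subclass of it."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for (s, p, o) in triples:
            if p == "subClassOf" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def semantic_query(triples, predicate, obj):
    """Also match subjects related to any (derived) subclass of obj."""
    classes = subclasses_of(triples, obj)
    return {s for (s, p, o) in triples if p == predicate and o in classes}


explicit = syntactic_query(TRIPLES, "receives", "Radiotherapy")
inferred = semantic_query(TRIPLES, "receives", "Radiotherapy")
```

No triple states that a patient receives "Radiotherapy", so the syntactic query returns nothing, whereas the semantic query finds patient1 and patient2 through the class hierarchy.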

4.5 Grid technologies for data access

The Open Grid Services Architecture (OGSA) defines a set of core capabilities and behaviours important in Grid systems. OGSA is based on Web service technologies, namely the Web Services Description Language (WSDL) and the Simple Object Access Protocol (SOAP). OGSA proposes the use of Web services³ as the method for virtualising Grid resources.

A more recent effort, the Web Services Resource Framework (WSRF) standardisation, has paved the way for closer interoperability and unification between stateful Grid services and Web services.

4.5.1 Open Grid Services Architecture Data Access and Integration (OGSA-DAI)

The aim of the OGSA-DAI project is to develop middleware to assist with access to and integration of data from separate sources via the Grid⁴. It can act as a database connection layer for clients and supports workflows.

OGSA-DAI is based upon the WS-DAI⁵ specifications. OGSA-DAI components are either data access components or data integration components.

Workflows in OGSA-DAI can be used to integrate data from multiple data sources. Fig. 4.7a illustrates an example in which two queries are executed on two separate databases. The results of the first query are transformed and the transformed result is then joined with the results of the second query.

³ A Web service is a software system designed to support interoperable machine- or application-oriented interaction over a network. A Web service has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages.

⁴ The OGSA-DAI Tutorial gives a good overview of the capabilities and applications of OGSA-DAI. More information on OGSA-DAI for database integration can be found in [27].

⁵ Web Services Data Access and Integration (WS-DAI) is a specification defining the core operations required to access structured data resources in a Grid environment using Web services.

(a) Data integration using workflows and OGSA-DAI servers (from [6])

(b) OGSA-DAI DQP (from [5])

Figure 4.7: OGSA-DAI for data integration and distributed queries.

OGSA-DAI Distributed Query Processing (OGSA-DAI DQP) is an extension to OGSA-DAI which supports queries to a federation of databases. The client in the example in Fig. 4.7b submits a workflow which interacts with a “virtual database”. The virtual database is in fact a federation of a number of actual physical databases exposed via OGSA-DAI servers. OGSA-DAI DQP parses the client’s queries, forwards them to the individual databases and joins their results into a single unified result.

OGSA-DAI-RDF has extended OGSA-DAI access to RDF database systems.
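The workflow pattern of Fig. 4.7 (query two separate databases, transform one result, join both into a unified result) can be sketched without any Grid middleware. The example below does not use the OGSA-DAI API; two in-memory SQLite databases stand in for the remote data resources, and the table names, columns and data are invented for illustration.

```python
import sqlite3


def make_db(schema, insert, rows):
    """Create an in-memory database standing in for one remote data resource."""
    con = sqlite3.connect(":memory:")
    con.execute(schema)
    con.executemany(insert, rows)
    return con


# Two independent "physical" databases with differing identifier conventions.
clinic = make_db(
    "CREATE TABLE patient (id TEXT, diagnosis TEXT)",
    "INSERT INTO patient VALUES (?, ?)",
    [("P1", "chordoma"), ("P2", "glioma")],
)
treatment = make_db(
    "CREATE TABLE plan (patient_id TEXT, dose_gy REAL)",
    "INSERT INTO plan VALUES (?, ?)",
    [("p1", 70.0), ("p2", 54.0), ("p3", 60.0)],
)


def federated_query(diagnosis):
    """Query both databases, transform one result, then join the two."""
    # Step 1: one query per database.
    patients = clinic.execute(
        "SELECT id FROM patient WHERE diagnosis = ?", (diagnosis,)
    ).fetchall()
    plans = treatment.execute("SELECT patient_id, dose_gy FROM plan").fetchall()
    # Step 2: transform the first result (here: normalise identifier case).
    wanted = {pid.lower() for (pid,) in patients}
    # Step 3: join on the normalised identifier into one unified result.
    return sorted((pid, dose) for (pid, dose) in plans if pid in wanted)


result = federated_query("chordoma")
```

From the client's point of view `federated_query` behaves like a query against a single virtual database, which is the role the text assigns to OGSA-DAI DQP.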


5 Examples of implementations

Many research projects are developing grid-based platforms for data integration. This chapter outlines some of these projects, focusing on their implementations of syntactic and semantic data integration.

ACGT (5.1) and caBIG (5.2) follow two distinct approaches to data integration and are explained in detail. The technological approach taken by caBIG is very similar to the cancerGrid (5.3) approach. All three projects and initiatives aim for integrated data access and interoperability in the biomedical and cancer domain.

Section 5.4 gives a brief outline of further data integration and semantic Grid projects from several domains.

@neurIST (5.4.1) aims for collaborative tools for patient management.

As seen in the previous chapter, the tools for semantic annotation are generic and only the ontologies and terminologies encode the specific domain knowledge. ADMIRE (5.4.2) uses these technologies for data integration and data mining in different domains.

OntoGrid (5.4.3) has developed generic extensions to Grid middleware towards the semantic Grid paradigm.

5.1 ACGT

Advancing Clinico-Genomic Clinical Trials on Cancer (ACGT) is an EU FP6 project (2006-2010) aiming at developing open-source, semantic and grid-based technologies in support of post-genomic clinical trials in cancer research [1]. The main goal of ACGT is to provide seamless access to heterogeneous sources of information by developing a semantically rich grid infrastructure in support of multicentric, post-genomic clinical trials. Two core tools were developed for this purpose and included in the ACGT architecture: the ACGT Semantic Mediator (ACGT-SM, section 5.1.1), which addresses semantic heterogeneities (semantics and schema integration), and the Data Access Wrappers (ACGT-DAWs, section 5.1.2), which cope with syntactic heterogeneities.

The ACGT architecture follows a Local-As-View (LAV) approach based on query translation.


(a) ACGT layered initial architecture (from [82])

(b) Architecture of the data access tools in ACGT [60].

Figure 5.1: Data Access Tools in ACGT (5.1b) and their position in the ACGT architecture (5.1a).

Fig. 5.1b shows the different components involved in the ACGT data access architecture. The different types of Client Tools that need to access data communicate with the Semantic Mediation and Database Integration Layer using the RDF Data Query Language (RDQL) [8]. ACGT-SM, the main component of this layer, uses the ACGT Master Ontology on Cancer (ACGT-MOC) (section 5.1.3), a global model describing the cancer domain, to give virtual views of the integrated databases. The ACGT-DAW wrappers in the Data Access Layer provide transparent access to the different types of sources using their native query languages.

Fig. 5.1 visualises how the Data Access Tools and Master Ontology (fig. 5.1b) are integrated in the ACGT architecture (fig. 5.1a).

5.1.1 The ACGT semantic mediator

The purpose of the mediator in the ACGT environment is to provide users with a tool for integrated access and retrieval of data from distributed and heterogeneous database systems. The mediator acts as a service for knowledge discovery tools, and as a client for database wrappers [82].

ACGT adopted an LAV-based approach to database integration. Since LAV requires (in the worst case) exploration of all data sources, query translation normally performs poorly. This issue was dealt with by constraining the mappings to only the most frequent kinds of queries: a restriction of the domain driven by user requirements.

The global schema required for the LAV approach is a subset of the master ontology ACGT-MOC, which was developed in the Description Logic variant of the Web Ontology Language (OWL-DL) in order to be prepared for the need to implement more advanced mappings in the future.

Figure 5.2: Knowledge and Discovery tools, from [7]

An RDQL query launched by the client tools through the ACGT-SM is automatically translated into SPARQL (SPARQL Protocol and RDF Query Language) [9] and divided into dedicated queries for the different wrappers.

The results of a query are returned by the ACGT-DAWs in the SPARQL Query Results XML Format [10]. The ACGT-SM integrates the result sets from the different data sources into a set of instances of classes belonging to the ACGT-MOC. The relation between master ontology and mediator is illustrated in figure 5.2.

Since the LAV approach was chosen, a new source (or a change in the schema of an existing source) can be included in the system by creating or changing the view describing this single source.
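The maintenance property of the LAV approach can be made concrete with a strongly simplified sketch. It is not the ACGT-SM implementation and omits real query rewriting over views; each source is merely described by the global concepts it covers, and the mediator splits a query over global concepts into one sub-query per source. All source names, concept names and local field names are hypothetical.

```python
# Each source is described as a view over the global schema: the global
# concepts it can answer, mapped to its local field names.
SOURCE_VIEWS = {
    "clinical_db": {"Patient.age": "pat_age", "Patient.diagnosis": "dx_code"},
    "image_archive": {"Image.modality": "Modality"},
}


def plan_query(requested, views):
    """Split a query over global concepts into one sub-query per source."""
    plan = {}
    unanswered = set(requested)
    for source, view in views.items():
        local_fields = [view[c] for c in requested if c in view]
        if local_fields:
            plan[source] = local_fields
            unanswered.difference_update(view)
    if unanswered:
        raise LookupError(f"no source covers: {sorted(unanswered)}")
    return plan


plan_before = plan_query(["Patient.age", "Image.modality"], SOURCE_VIEWS)

# Adding a source only requires describing that one source; the existing
# views and the global schema stay untouched.
views_after = dict(SOURCE_VIEWS, genomic_db={"Gene.symbol": "sym"})
plan_after = plan_query(["Gene.symbol", "Patient.age"], views_after)
```

The planner has to inspect every registered view for each query, which mirrors the worst-case cost of LAV query translation mentioned above.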

5.1.2 The ACGT Data Access Wrappers (ACGT-DAW)

The ACGT-DAWs are responsible for resolving syntactic heterogeneities. Thus, they need to provide a uniform data access interface, export the data model of data sources and audit access to data sources (especially important for meeting the legal and ethical requirements for clinical data).

The data access services are implemented as OGSA-DAI services.

D2RQMap [32], a declarative language to describe mappings between application-specific relational database schemata and RDF-S/OWL ontologies, is used for handling the query transformations for querying relational databases.
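The idea of a declarative mapping between a relational schema and ontology terms can be sketched as follows. This is not the mapping language cited in [32]: the mapping dictionary, the `ex:` prefix and the table are invented, and for simplicity the rows are materialised as triples, whereas a wrapper would typically rewrite incoming queries instead.

```python
import sqlite3

# Hypothetical mapping: table -> ontology class, column -> ontology property.
MAPPING = {
    "table": "patient",
    "key": "id",
    "class": "ex:Patient",
    "columns": {"age": "ex:hasAge", "dx": "ex:hasDiagnosis"},
}


def rows_to_triples(con, mapping):
    """Expose the rows of one table as (subject, predicate, object) triples."""
    cols = [mapping["key"], *mapping["columns"]]
    sql = f"SELECT {', '.join(cols)} FROM {mapping['table']}"
    triples = []
    for key, *values in con.execute(sql):
        subject = f"ex:{mapping['table']}/{key}"
        triples.append((subject, "rdf:type", mapping["class"]))
        for column, value in zip(mapping["columns"], values):
            triples.append((subject, mapping["columns"][column], value))
    return triples


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE patient (id INTEGER, age INTEGER, dx TEXT)")
con.execute("INSERT INTO patient VALUES (1, 54, 'C71.9')")
triples = rows_to_triples(con, MAPPING)
```

Because the mapping is data rather than code, supporting another table only requires another mapping entry.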


OGSA-DAI does not support querying DICOM-formatted data. Therefore queries directed to the DICOM image databases had to be expressed in SPARQL, which seems well-suited to express any valid DICOM query [82].

OGSA-WebDB is an extension to OGSA-DAI which makes existing Web database resources available as OGSA resources. Thus, grid clients can not only access data stored in relational and XML database management systems, but can also integrate and access data available in Web-accessible databases.

5.1.3 The ACGT Master Ontology on Cancer (ACGT-MOC)

In the context of the selected LAV data integration strategy, the Master Ontology plays the role of a global schema to which all local schemata are mapped, so that all their mapped equivalents are subsumed by the global schema. This requires that the global schema (i.e. the ontology) be sufficiently generic, covering not only the terminology but also the meaning of all local schema constructs.

To integrate a data source into the mediation architecture, a mapping of the local schema (e.g. a DB schema) to the global schema (i.e. the master ontology) is required. ACGT provides graphical tools for this.

ACGT-MOC is a domain ontology, i.e. it represents the entire reality of the domain covered by the ACGT project. It has been designed as a “heavy weight ontology”, meaning that it contains a very rich internal structure, not only a controlled vocabulary.

IFOMIS [3] had been active in developing ontologies for the cancer domain before the start of the ACGT project, resulting in a reference ontology for anatomy, physiology and pathology, the Ontology of Biomedical Reality (OBR). Since ACGT requires not only biological entities (as in OBR) but also administrative and legal entities, the Basic Formal Ontology (BFO) [2] has been adopted as the top level for ACGT-MOC. Clinical report forms (CRFs) were integrated into the system to account for clinical practice on the ontology level.

The National Cancer Institute Thesaurus (NCIT) was considered as a terminology resource but did not fulfil the requirements of ACGT [35].

ACGT-MOC aims to become a member of the Open Biomedical Ontologies (OBO) Foundry [80], a library of interoperable reference ontologies for the biomedical sector. All ontologies in the foundry are open source and adhere to the same quality standards. According to [59], ACGT-MOC will be among the most extensive ontologies in the biomedical domain.

5.2 caBIG™

Following the need for a software infrastructure for sharing data and providing analysis tools, the National Cancer Institute (NCI) initiated the cancer Biomedical Informatics Grid (caBIG) program in 2004. The aim of caBIG™ is to create a network of cancer centres and research laboratories to better combine their strengths and experiences in cancer research. caBIG™ focuses on the creation of a virtual community that shares resources and tackles the key issues of cyber-infrastructure. It is developing standards, policies, guidelines and common applications, open-source tools and middleware infrastructure to enable more efficient data sharing among researchers.

The caBIG Compatibility Guidelines [36] lay out requirements for achieving different degrees of syntactic and semantic interoperability, labelled “Bronze”, “Silver” and “Gold”. The “Gold” level calls for a common framework (caGrid, section 5.2.1) for representation, advertisement, discovery and invocation of distributed data and analytic resources across the caBIG™ federation. Following a review of the state of existing technology frameworks, tools and middleware systems (caGrid Whitepaper, 2004, [77]), Grid Services technology was chosen as the underlying framework for caGrid.

5.2.1 caGrid infrastructure

The caGrid framework¹ [76, 65] leverages Grid Services technologies, their tools and NCI data modelling infrastructure: the Globus Toolkit (GT), OGSA Data Access and Integration (OGSA-DAI) toolkit, which has evolved into the Web Services Resource Framework (WSRF), Mobius and NCI caCORE (see section 5.2.3). GT is used as the core Grid middleware system in caGrid. Web services solve the programming language interoperability problem by specifying language-independent access to distributed resources. WSRF solves additional interoperability issues by defining standardised web service interfaces. The Mobius infrastructure provides support for distributed data and metadata management and is employed to support Grid-wide management of XML schemas (GME) representing the structure of common data types in the caBIG™ domain.

caGrid is designed as a service-oriented architecture in which resources are exposed to the environment as Grid services with well-defined interfaces. The caGrid infrastructure (figure 5.3) consists of coordination services (services for metadata management, advertisement, discovery, query and security) and community-provided services (data and analytical services). All these services are wrapped in or exposed by special interfaces and made accessible to the users through web portals or client applications. Since caGrid adopts a model-driven, service-oriented architecture approach, services in caGrid are required to describe themselves using caGrid standard service metadata.

5.2.2 Interoperability

caBIG follows a four-layer approach to interoperability. One layer is concerned with the syntactic component of interoperability; the remaining three layers are concerned with semantic interoperability.

¹ Latest release as of January 2010: caGrid 1.3


Figure 5.3: caGrid infrastructure. (from [21])

5.2.2.1 Syntactic interoperability

The data access services² provide an object-oriented view of data kept in native formats (mostly relational). The system has been implemented by extending OGSA-DAI with a new query activity that accepts semantic search requests in the caGrid Query Language (CQL), as shown in Figure 5.4. CQL is an XML-based, object-oriented query language that allows queries to be expressed in terms of objects, related objects and attributes with desired values [23]. Distributed CQL (DCQL) provides an extension to CQL for distributed querying.
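To give an impression of an XML-based, object-oriented query of this kind, the sketch below builds and evaluates a query that names a target object type and one attribute condition. The element and attribute names (`Query`, `Target`, `Attribute`, `predicate`) are illustrative assumptions and do not claim to follow the CQL schema; plain dictionaries stand in for domain objects.

```python
import xml.etree.ElementTree as ET


def build_query(target_class, attribute, value):
    """Build a query: a target object type plus one attribute filter."""
    query = ET.Element("Query")
    target = ET.SubElement(query, "Target", name=target_class)
    ET.SubElement(target, "Attribute", name=attribute, value=value,
                  predicate="EQUAL_TO")
    return query


def evaluate(query, objects):
    """Evaluate the query against dictionaries standing in for objects."""
    target = query.find("Target")
    condition = target.find("Attribute")
    return [o for o in objects
            if o["class"] == target.get("name")
            and o.get(condition.get("name")) == condition.get("value")]


objects = [
    {"class": "Gene", "symbol": "TP53"},
    {"class": "Gene", "symbol": "BRCA1"},
    {"class": "Protein", "symbol": "TP53"},
]
query = build_query("Gene", "symbol", "TP53")
xml_text = ET.tostring(query, encoding="unicode")
matches = evaluate(query, objects)
```

The query refers only to object types and attributes of the domain model, not to tables or columns, which is what allows the same query to be answered by differently structured back ends.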

5.2.2.2 Semantic interoperability

Semantic interoperability is ensured by information models, semantic metadata (recorded in CDEs), controlled vocabularies (from the EVS) and ontologies, as illustrated in fig. 5.5.

“Gold level” compatible systems, as described in [36], are based upon information models, terminologies, ontologies and common data elements which are accepted and harmonised within the caBIG™ community.

The Cancer Common Ontologic Representation Environment (caCORE, section 5.2.3) provides the software and services needed to achieve this level of interoperability. caGrid uses caCORE tools, like the NCI Enterprise Vocabulary Service (EVS, section 5.2.3.1) and the Cancer Data Standards Repository (caDSR, section 5.2.3.2), and the Mobius Global Model Exchange (GME) service for ontology, metadata and schema management, respectively. Figure 5.6 illustrates how the components work together [76].

² The 0.5 release of caGrid was based on OGSA standards; the more recent releases use WSRF.

Figure 5.4: caGrid Data Access Services Architecture (from [23])

For any data to be shared on the Grid, a UML class model is created. These domain models are converted into common data elements in the form of ISO/IEC 11179 (see section 4.3.2) administered components and registered in caDSR. The data elements are annotated with terms and concepts drawn from vocabularies registered in EVS. The concepts of the data elements and the relationships among the data elements are thus semantically described.

Communication between services and clients in the Grid environment is based on XML messages. When an object is transferred between clients and services, it is serialised into an XML document that adheres to a registered XML schema. The requirement to use registered data models and XML schemas ensures syntactic and semantic interoperability between two end-points exchanging information. With a published model and schema, the receiving end-point can parse the data structure and interpret the information correctly. XML schemas corresponding to common data elements and object classes are registered in the GME service.

The caDSR and EVS define the properties and semantics of caBIG™ data types, and the GME defines the syntax of their XML materialisation.
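The exchange described above, in which a sender serialises an object into XML for a registered type and a receiver parses it against the published structure, can be sketched as follows. This is not the GME API and performs no XML Schema validation; the `REGISTERED` dictionary is a stand-in for registered schemas, and the `Specimen` type and its fields are invented.

```python
import xml.etree.ElementTree as ET

# Stand-in for registered schemas: for each data type, the expected child
# elements (in order) and a parser for their values.
REGISTERED = {"Specimen": {"id": str, "volume_ml": float}}


def serialise(type_name, obj):
    """Turn a dict into an XML document for a registered type."""
    root = ET.Element(type_name)
    for field in REGISTERED[type_name]:
        ET.SubElement(root, field).text = str(obj[field])
    return ET.tostring(root, encoding="unicode")


def deserialise(document):
    """Parse a document and check it against the registered structure."""
    root = ET.fromstring(document)
    fields = REGISTERED.get(root.tag)
    if fields is None:
        raise ValueError(f"unregistered type: {root.tag}")
    if [child.tag for child in root] != list(fields):
        raise ValueError(f"{root.tag} does not match its registered structure")
    return {child.tag: fields[child.tag](child.text) for child in root}


message = serialise("Specimen", {"id": "S-17", "volume_ml": 2.5})
received = deserialise(message)
```

Both end-points rely only on the shared registry entry, so the receiver can reject documents of unknown or mismatching types before interpreting them.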

5.2.3 Cancer Common Ontologic Representation Environment (caCORE)

The Cancer Common Ontologic Representation Environment (caCORE) is a framework for creating syntactically and semantically interoperable biomedical information services.

Figure 5.5: Relationships of semantic model layers. The three layers of the semantic model include: information model (domain model), semantic metadata (CDEs) and controlled terminologies (EVS). The figure illustrates two information systems using different class names, but annotating these UML entities with the same concept. The resulting CDE will be shared between systems. (from [79])

Figure 5.6: The use of caDSR, EVS and GME for creation and management of common data types and exchange of objects conforming to these types. (from [76])

Version 3 of caCORE³ consists of three major parts [40, 56], shown in figure 5.7:

1. A primary technology stack with three components (in the centre of figure 5.7): At the top of the stack are the cancer Bioinformatics Infrastructure Objects (caBIO), the interoperable data system, and at the bottom of the stack are the Enterprise Vocabulary Services (EVS), supplying the controlled terminology that is leveraged to provide semantics for caBIO. Between these two components is the cancer Data Standards Repository (caDSR), a system for storing semantic metadata which binds together the object-oriented data system and the controlled terminology.

2. Two major enabling technology components: The caCORE Software Development Kit (caCORE SDK), which is used to generate “caCORE-like” systems, and the Semantic Integration Workbench, an end-user application with a graphical user interface (GUI) that assists in creating the semantic metadata that is stored in the caDSR⁴.

3. One supporting technology: The Common Security Module (CSM), which is designed to be readily integrated into systems designed along caCORE lines. The CSM contains a user provisioning tool for managing the rights given to users within the system.⁵

Semantically annotated systems like caCORE 3, built using the caCORE SDK, are typically compatible at the “Silver” level. When additional harmonisation requirements have been met after connection to caGrid, these systems would achieve full “Gold” interoperability within the limits of Grid technology [56].

Figure 5.7: The major components of caCORE version 3. The primary technology stack contains a model-driven, object-oriented data system (caBIO in this example) and the metadata and controlled terminology services required to achieve semantic interoperability. Supporting this stack is a set of enabling technologies that simplifies the process of creating a “caCORE-like” system and a supporting technology stack that includes a Common Security Module (CSM) that can be readily implemented through the caCORE SDK. (from [56])

³ As of February 2010, caCORE is in version 4 and further sub-versions have been released. Please refer to the release notes for details.

5.2.3.1 Enterprise Vocabulary Services (EVS)

The Enterprise Vocabulary Services (EVS) form the semantic basis of caCORE. They produce the NCI Thesaurus and the NCI Metathesaurus, which cover the same domain (basic and clinical research areas, publishing and administrative functions) using different data models. Both are available for unrestricted use by any organisation [40, 56].

NCI Metathesaurus. The NCI Metathesaurus (NCIm) is based on the National Library of Medicine (NLM) Unified Medical Language System (UMLS). While UMLS is focused on medical terminologies and categorisations, the cancer research community requires broader terminologies, including biology and a wide range of translational and applied research fields. The NCI Metathesaurus was created by dropping sources with low relevance from the UMLS Metathesaurus and adding sources of high relevance. It now contains 1,400,000 concepts mapped to 3,600,000 terms with 17,000,000 relationships [64]. The NCI Metathesaurus is available for interactive use through a web-based interface and for programmatic access via the caBIO application interfaces (section 5.2.3.3) [40].

⁴ Neither component is described further here. Some information on the caCORE SDK can be found in [56], more details in [70].

⁵ Security aspects and user rights management will be described in the second deliverable. [56] gives a rough outline of CSM in the caCORE context.


NCI Thesaurus. The NCI Thesaurus (NCIt) is the main vocabulary product published by the NCI, containing definitions, synonyms and other information on nearly 10,000 cancers and related diseases, 8,000 single agents and combination therapies as well as other topics [64]. It was designed to bring together classification schemes, informal vocabulary and naming conventions used in the cancer research community. The NCI Thesaurus is available for browsing and for download as XML.

NCI BioPortal. The NCI BioPortal provides access to NCIt and NCIm as well as other biomedical terminologies hosted at the NCI.

5.2.3.2 Cancer Data Standards Repository (caDSR)

The Cancer Data Standards Registry and Repository (caDSR) is a database⁶ implemented according to the ISO/IEC 11179 Standard for Metadata Registries (see 4.3.2), together with a set of APIs and tools to create, edit, control, deploy and find common data elements (CDEs). caDSR provides a number of tools for managing and deploying CDEs and for reviewing forms and checking their compliance with the CDEs approved within caDSR.

Standard vocabularies and ontologies are essential for creating CDEs: CDE names, definitions and Permissible Values are derived from terminology found in EVS. A fully specified CDE forms a unit of metadata which is adequate for use in interoperable systems [40].
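How permissible values drawn from a controlled terminology constrain recorded data, and how a form can be checked for compliance with its CDEs, can be sketched as follows. Neither the caDSR API nor real EVS content is used: the concept codes, terms and the CDE are invented, and the function names are hypothetical.

```python
# Invented excerpt of a controlled terminology: concept code -> preferred term.
TERMINOLOGY = {"EX:1": "Proton", "EX:2": "Carbon ion", "EX:3": "Photon"}

# A CDE whose permissible values are concept codes from that terminology.
BEAM_PARTICLE_CDE = {
    "name": "Beam Particle Type",
    "permissible_values": ("EX:1", "EX:2"),
}


def check_value(cde, code):
    """Return the preferred term for a permitted code, else raise ValueError."""
    if code not in cde["permissible_values"]:
        raise ValueError(f"{code!r} is not a permissible value of {cde['name']}")
    return TERMINOLOGY[code]


def check_form(form, cdes):
    """Return the form fields whose values violate their CDE."""
    problems = []
    for field_name, code in form.items():
        try:
            check_value(cdes[field_name], code)
        except ValueError:
            problems.append(field_name)
    return problems


cdes = {"beam": BEAM_PARTICLE_CDE}
ok_form = {"beam": "EX:2"}
bad_form = {"beam": "EX:3"}
```

Here "EX:3" is a valid term of the terminology but not a permissible value of this particular CDE, so only `bad_form` is reported as non-compliant.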

5.2.3.3 Cancer Bioinformatics Infrastructure Objects (caBIO)

Cancer Bioinformatics Infrastructure Objects (caBIO) [20] is a resource for accessing biomedical annotations from curated data sources (in the genomics and proteomics domain for the human and mouse genomes) in an integrated view.

Information in caBIO is modelled close to the corresponding biological entities. Data access is provided through various interfaces, including a remote Java API, web services (SOAP and REST API), grid data services available on caGrid and graphical user interfaces like the caBIO Portlet.

5.2.4 Further caBIG-related projects and technology

5.2.4.1 Cancer Translational Research Informatics Platform (caTRIP)

The Cancer Translational Research Informatics Platform (caTRIP) aims to solve the translational research problem of outcomes analysis. When a patient enters the clinic, the oncologist should be able to look across a cohort of patients with similar characteristics to help inform treatment.

⁶ Links to the caDSR Technology Stack and NCI Technology Stack.

Figure 5.8: Metadata and interfaces in the caTRIP architecture. (from [61])

caTRIP uses caGrid to perform distributed queries across several caBIG applications:

• The cancer Text Information Extraction System (caTIES), a locator of tissue resources that works via the extraction of clinical information from free-text surgical pathology reports while using controlled terminologies.

• caTissue CORE, a tissue bank repository tool for biospecimen inventory, tracking and basic annotation.

• Cancer Annotation Engine (CAE), a system for storing and searching pathology annotations.

• caIntegrator, a tool for storing, querying and analysing translational data.

caTRIP uses a mediator-based, federated query engine and an extension to the caGrid query language called Distributed CQL (DCQL) to present a single interface where these services can be discovered and subsequently queried in a metadata-driven manner. Fig. 5.8 illustrates the architecture of caTRIP. Queries are submitted via a graphical user interface through a distributed query engine to the underlying databases. The caCORE services are employed for semantic metadata discovery.

5.2.5 Semantic queries

The main goal of caGrid was to achieve semantic interoperability. As more data sets become available on caGrid, effective ways of accessing and integrating this information are needed. Although caGrid is based on a rich metadata infrastructure, it does not support semantic queries. Phillips [69] has evaluated the use of Semantic Web technologies in caBIG and argues that OWL would be more suitable for expressing caBIG data models than UML. However, running systems should not be affected by the transition to Semantic Web technologies.

Projects like Corvus [62], a Semantic Web-based data warehouse for creating relationships among caGrid models, or semCDI [79] (Semantic caBIG Data Integration), try to overcome caBIG’s limitations regarding semantic queries. Both approaches transform the annotated caGrid UML model into OWL and use the resulting representation to draw inferences about the relation between data.

Gonzalez Beltran et al. [45] demonstrated the automatic generation of OWL ontologies from semantic annotation of resources (especially caGrid) and the rewriting and translation of concept-based queries into the caGrid query language for the ONcology Information eXchange (ONIX) platform of the UK National Cancer Research Institute. Their approach is general, in the sense that non-caGrid data resources can be supported, as long as they provide appropriate metadata, i.e. annotated UML models. Also, query languages other than CQL can be supported by specifying the translation rules accordingly.

5.3 CancerGrid

The CancerGrid consortium pursues the goal of integrating clinical and research informatics infrastructures, taking advantage of developments in Semantic Web technology, Grids, Service-Oriented Architectures and Computer Supported Collaborative Working. It specifically addresses the need to pool resources for clinical trials in order to achieve statistically significant results.

A clinical trial protocol, together with specific standard operating procedures and case report form definitions, provides a complete specification for a clinical trials information management system: it is a model of the clinical trial itself and could therefore be used to generate elements of the information system needed to support the trial. Where a number of individual models have common features, one may model these features to produce a model of models, a metamodel, which can be used to author and validate models. Similarly, a relational database schema is a model of the information structure in a relational database and may be used to support data representations, indexing and query evaluation. The cancerGrid approach allows collaborators to maintain their own collections of metadata elements, value sets, and experimental designs (see figure 5.9) [34, 33].

The following section illustrates the working principle with the METABRIC implementation.

5.3.1 METABRIC

The METABRIC [67] study (Molecular Taxonomy of BReast cancer International Consortium) aims at associating experimental results with clinical datasets of patients to understand the heterogeneity of the cancer disease.

Figure 5.9: cancerGrid overview. Laboratories/centres can maintain their local MDRs (cgMDR). The data from multiple local notes can then be collated in a central repository (e.g. caDSR). [54]

Figure 5.10: cancerGrid: The tool for standardisation, query, inferring and validation SQIV. [38]

METABRIC is based on the CancerGrid approach, which gives researchers more flexibility to construct a scheme that reflects their local purposes rather than a global classification (as in caBIG). It uses the cancerGrid metadata registry cgMDR to record the semantics of data: separate CDEs for each field definition from each data source and also a number of CDEs which define the minimum dataset required by METABRIC.

A UML model of the relational databases is used to generate XML schemas for each of the component databases. These schemas are annotated with the appropriate source-CDE identifiers by means of SAWSDL [57] references. In order to compare and query the data, it is transformed to an agreed dataset definition by using the functions for Data Standardisation, Inference and Querying available in SQIV (fig. 5.10).

Standardisation transforms XML data, formatted according to an XML schema annotated with CDE identifiers, into equivalent RDF. The RDF can then be queried using SPARQL. Once the data is standardised, the inference tool of SQIV can apply any required transformation.

Source-CDE annotated data is mapped to agreed METABRIC CDEs using the Jena Semantic Web Framework, a Java API. The result is another RDF file with METABRIC CDE annotations which is converted into XML and stored in an eXist database.

This database supports standard XML querying tools like XQuery and XPath. In order to query across differently defined data fields, SQIV is used (figure 5.11).
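The standardisation, inference and query steps described above can be illustrated with a small Python sketch. It uses only the standard library and is not the SQIV or Jena implementation: the element names, the CDE identifiers (`src:...`, `metabric:...`), the `cde` attribute standing in for a SAWSDL reference, and the conversion rules are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical source record; the "cde" attribute stands in for the
# SAWSDL annotation that links a field to its source-CDE identifier.
SOURCE_XML = """
<patient id="p1">
  <tumour_size_mm cde="src:TumourSizeMm">23</tumour_size_mm>
  <er_status cde="src:ERStatus">pos</er_status>
</patient>
"""

# Hypothetical mapping rules: source CDE -> (agreed CDE, value conversion).
RULES = {
    "src:TumourSizeMm": ("metabric:TumourSizeCm", lambda v: str(float(v) / 10)),
    "src:ERStatus": ("metabric:ERStatus",
                     lambda v: {"pos": "positive", "neg": "negative"}[v]),
}


def standardise(xml_text):
    """Turn CDE-annotated XML into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    subject = root.get("id")
    return [(subject, el.get("cde"), el.text) for el in root if el.get("cde")]


def infer(triples, rules):
    """Re-express source-CDE triples in terms of the agreed CDEs."""
    out = []
    for s, p, o in triples:
        if p in rules:
            target, convert = rules[p]
            out.append((s, target, convert(o)))
    return out


def query(triples, predicate):
    """Return {subject: value} for one CDE, a stand-in for a SPARQL SELECT."""
    return {s: o for s, p, o in triples if p == predicate}


source_triples = standardise(SOURCE_XML)
agreed_triples = infer(source_triples, RULES)
print(query(agreed_triples, "metabric:TumourSizeCm"))
```

The point of the sketch is the separation of concerns: the source keeps its own field definitions, and only the rule table has to change when the agreed dataset definition evolves.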

5.4 Further projects

This section outlines some examples of further data integration and semantic Grid projects.

@neurIST (5.4.1) aims for collaborative tools for patient management. As seen in chapter 4, the tools for semantic annotation are generic and only the ontologies and terminologies encode the specific domain knowledge. ADMIRE (5.4.2) uses these technologies for data integration and data mining in different domains. OntoGrid (5.4.3) has developed generic extensions to Grid middleware towards the semantic Grid paradigm.

5.4.1 @neurIST

@neurIST is an FP6 project (2006-2010) for supporting the research and treatment of cerebral aneurysms. The project aims at building a distributed IT infrastructure that consolidates complex data from multiple sources, and enables personalised patient management (i.e. data capture, referral, decision support, treatment planning), as well as clinical research in cerebral aneurysms. The consortium brings together 30 multi-sectorial partners representing hospitals, universities, research institutes and industry across Europe. [72]


Figure 5.11: METABRIC: Process of query translation and execution. ([67], online supplement)


Figure 5.12: @neurIST Reference Architecture - layered view (from [72])

5.4.1.1 General architecture

The @neurIST architecture (Fig. 5.12) is layered into resource, middleware and application layers. Within it, @neuInfo offers a generic framework for the provision and support of data services which virtualise heterogeneous scientific databases and information sources as Web services based on OGSA-DAI. This enables transparent access to and integration of relational databases, XML databases and flat files.

5.4.1.2 Data integration

A virtual data integration approach based on data mediation techniques has been adopted and implemented on top of standard Grid and Web Services. [55]

• The distributed data mediation service is based on OGSA-DAI for exposing and querying a virtual data source, and on a data mediation and distributed query engine (Fig. 5.13). During the deployment process a mediation schema, defining the global schema of the virtual data source and the integration of local data sources, has to be provided and the distributed query processor has to be configured with the evaluation services that should be used for the query execution.

• The data mediation engine is in charge of translating a query against the virtual data source into an executable query against the local data sources. The data mediation engine follows the Global-as-View (GAV) approach, where the global schema is described in terms of the local schemas.

• The Distributed Query Processing Engine uses OGSA-DQP for executing distributed queries against local data sources.
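The Global-as-View idea from the list above can be sketched as follows. This is a minimal in-memory illustration, not the @neurIST mediation engine or OGSA-DQP: the two local sources, their field names and the global relation `Aneurysm` are hypothetical, and the "distributed" execution is reduced to a loop over Python lists.

```python
# Two hypothetical local sources with different schemas and units.
HOSPITAL_A = [{"pid": 1, "aneurysm_mm": 7.5}, {"pid": 2, "aneurysm_mm": 3.0}]
HOSPITAL_B = [{"patient": "b-9", "size_cm": 1.2}]

# Global-as-View: each global relation is defined as a set of views over
# the local sources, here one generator-producing function per source.
MEDIATION_SCHEMA = {
    "Aneurysm": [
        lambda: ({"patient_id": f"A:{r['pid']}", "size_mm": r["aneurysm_mm"]}
                 for r in HOSPITAL_A),
        lambda: ({"patient_id": f"B:{r['patient']}", "size_mm": r["size_cm"] * 10}
                 for r in HOSPITAL_B),
    ]
}


def query_global(relation, predicate):
    """Unfold the global relation into its local views and filter the union."""
    results = []
    for view in MEDIATION_SCHEMA[relation]:
        results.extend(row for row in view() if predicate(row))
    return results


large = query_global("Aneurysm", lambda row: row["size_mm"] >= 5)
print(large)
```

Because the global relation is defined in terms of the sources, answering a query only requires unfolding the view definitions; the price is that adding a new source means editing the mediation schema.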


Figure 5.13: Architecture of a Distributed Data Mediation Service in @neurIST. (from [55])


Figure 5.14: A high level view of the ADMIRE infrastructure. (from [52])

5.4.2 ADMIRE

ADMIRE (Advanced Data Mining and Integration Research for Europe) is an FP7 project (2008-2011) which aims to deliver a consistent and easy-to-use technology for extracting information and knowledge. The project is motivated by the difficulty of extracting meaningful information by data mining from multiple heterogeneous and distributed resources and will provide an abstract view of data mining and integration (DMI).

A high-level overview of the ADMIRE architecture is shown in Figure 5.14.

DMI gateways are connected via the Internet and Grid. The gateways communicate with one another using standard internet communication technologies such as WSRF-compliant SOAP messages. Each gateway provides a core set of DMI services for accessing data sources and custom services. A gateway hides internal complexity and manages acceptance of requests to conform with resource limitations, load levels, security enforcement, etc. A gateway also has its own registry which describes all of the resources, services and components it is able to work with.

Interaction with DMI systems can happen via workbenches and portals.

DMI workbenches support a coherent set of tools designed to support a particular category of DMI-process developers.

DMI portals permit application-domain experts to conveniently re-use DMI processes that have been developed and packaged at the above workbenches.

ADMIRE envisages a future where many communities are developing processes and using DMI services. To share definitions, the architecture uses a registry for each community and a common repository to store the established and shareable DMI workflow designs for re-use by community members.


Figure 5.15: S-OGSA connects Grid computing and Knowledge management in OntoGrid (from [43])

ADMIRE is based on a service-oriented architecture and develops its own language (DMIL) for administration of requests and requesting information about services, data resources, data collections and defined components. Workflows and data access are implemented using OGSA-DAI. [52, 28]

5.4.3 OntoGrid

OntoGrid is an FP6 programme (2004-2007) for supporting the vision of the semantic Grid.

OntoGrid aims to develop grid systems that optimise cross-process, cross-company and cross-industry collaboration. A principle of OntoGrid is to adopt and influence standards in Semantic Grid and Grid Computing, in particular the Open Grid Service Architecture. [43]

5.4.3.1 Architecture

OntoGrid has developed S-OGSA (see [39] for further reference), a reference architecture for the semantic Grid, to foster the use of knowledge and metadata both in and on the Grid (Fig. 5.15).

The S-OGSA architecture is supported by a set of middleware services related to the storage, creation, use and management of semantic bindings and ontologies in a distributed setting, among others (full list in [43]):

• Semantic Binding Service Suite, for supporting the management of semantic bindings.

• WS-DAIOnt-RDF(S), a specification for the provision of a homogeneous access mechanism for using heterogeneous and distributed RDF(S) ontologies in Grid applications.

• S-OGSA-DAI, a semantic data access service.


Figure 5.16: S-OGSA-DAI: A Semantic extension of the OGSA-DAI architecture. (from [23])

5.4.3.2 Ontology driven data access with S-OGSA-DAI

Semantic-OGSA-DAI (Fig. 5.16) is an extension of OGSA-DAI for ontology-driven access to relational data. It supports distributed query processing, semantic integration of distributed databases, and dynamic discovery of data sources depending on their contents and capabilities. In S-OGSA-DAI, metadata describes the relationships between relational data sources and RDFS ontologies, using the D2R mapping language. Metadata is stored within S-OGSA-DAI and can be retrieved if needed. S-OGSA-DAI does not impose a specific knowledge model to describe data sources, and it assumes that the data sources and the ontologies have been developed separately. The functionality is developed as an add-on to the existing OGSA-DAI functionality. For the implementation of this new activity, RDQL queries are translated to SQL.
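How mapping metadata can drive the translation of an ontology-level query into SQL is sketched below. This is an assumption-laden illustration and not the S-OGSA-DAI activity: it neither parses RDQL nor uses real D2R syntax. The ontology names (`onto:Patient`, `onto:hasDiagnosis`, `onto:hasAge`), the `patients` table and the `to_sql` helper are hypothetical; SQLite from the Python standard library stands in for the relational source.

```python
import sqlite3

# Hypothetical mapping metadata in the spirit of a D2R mapping:
# ontology class -> (table, key column), ontology property -> column.
MAPPING = {
    "classes": {"onto:Patient": ("patients", "id")},
    "properties": {"onto:hasDiagnosis": "diagnosis", "onto:hasAge": "age"},
}


def to_sql(rdf_class, prop, value=None):
    """Translate 'instances of rdf_class with their prop value' into SQL.

    Table and column names are taken only from MAPPING; the optional
    value is passed as a bound parameter.
    """
    table, key = MAPPING["classes"][rdf_class]
    column = MAPPING["properties"][prop]
    sql = f"SELECT {key}, {column} FROM {table}"
    params = ()
    if value is not None:
        sql += f" WHERE {column} = ?"
        params = (value,)
    return sql, params


# A small in-memory relational source for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, diagnosis TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, "chordoma", 54), (2, "glioma", 61), (3, "chordoma", 47)],
)

sql, params = to_sql("onto:Patient", "onto:hasDiagnosis", "chordoma")
rows = conn.execute(sql + " ORDER BY id", params).fetchall()
print(sql)
print(rows)
```

Keeping the mapping as data, separate from both the ontology and the database, mirrors the assumption stated above that data sources and ontologies are developed independently.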


6 Conclusions for the PARTNER platform

This report gave an overview of the challenges involved in medical data and information sharing. It described fundamental approaches and current technologies to overcome these challenges and illustrated their implementation in selected projects.

This chapter discusses the requirements for the Hadron Therapy Information Sharing Platform (HISP) to be built within PARTNER. The information presented in this report is summarised in recommendations for the methodological and technical approach to information integration with HISP.

6.1 Requirements for HISP infrastructure

HISP will be a non-centralised system, with different nodes joining and dissociating during run-time. Extensibility in the long term implies that different services can join the infrastructure using standard protocols and can be discovered by users:

• HISP requires mechanisms for non-centralised data access, service discovery, user administration and security enforcement, see section 2.3.1.

HISP will be employed in a multidisciplinary and multilingual environment and will be used by differently skilled and trained professionals. This degree of heterogeneity implies the use of different terms and concepts to code information due to different domain views and languages. The presentation of data has to reflect the view of the individual user to ensure its correct interpretation.

• HISP requires mechanisms to record the meaning of data in the domain context, to integrate data from different contexts and to generate views of this data which reflect the professional background of different user groups or individual users, see section 2.3.2.

HISP should allow users and programmers to make easy use of its integrated data while ensuring the data’s confidentiality.

• HISP requires standardised interfaces for programmatic access and user interfaces with support for semantic queries, see section 2.3.2.


• The entire platform must be embedded in a strong security framework, see section 2.3.3.

6.2 Conclusions

Chapters 3 and 4 outlined approaches and existing technologies for data integration. A federated approach to data integration is required to guarantee the autonomy of the data sources. Syntactic heterogeneities can be overcome by a mediator-wrapper based approach. In these systems, the data sources are exposed via “wrappers” which translate queries into the native language of the data sources. Mediators provide a mapping from a system’s view on the data (e.g. a global schema, a global ontology) to the schema used by the source to be queried and serve as distributed query processors.

Semantic annotation is required in order to integrate data from different sources and to provide users with a meaningful view of the data.

A query processor evaluates the (semantic) user query in terms of the global data model and distributes it to the data sources.

Data access wrappers use the mappings between source schemata and global data models to translate the queries from the global query language (on an ontology or data model) to the native languages of the sources.

Figure 6.1 illustrates how these three components work together.

The following sections discuss the advantages and disadvantages of the approaches chosen by CancerGrid, caBIG and ACGT.
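The interplay of mediator and wrappers summarised above can be made concrete with a short Python sketch. It is a toy under stated assumptions rather than any of the surveyed systems: the global fields (`patient_id`, `tumour_site`), the local table, the XML attributes and the sample records are hypothetical, and a global "query" is reduced to a single field/value equality test.

```python
import sqlite3
import xml.etree.ElementTree as ET


class SqlWrapper:
    """Exposes a relational source; translates global queries into SQL."""

    # Hypothetical mapping: global field -> local column.
    COLUMNS = {"patient_id": "pid", "tumour_site": "site"}

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE cases (pid TEXT, site TEXT)")
        self.conn.executemany("INSERT INTO cases VALUES (?, ?)",
                              [("r1", "skull base"), ("r2", "prostate")])

    def query(self, field, value):
        column = self.COLUMNS[field]
        sql = f"SELECT pid, site FROM cases WHERE {column} = ?"
        return [{"patient_id": p, "tumour_site": s}
                for p, s in self.conn.execute(sql, (value,))]


class XmlWrapper:
    """Exposes an XML source; translates global queries into element lookups."""

    # Hypothetical mapping: global field -> local attribute.
    ATTRS = {"patient_id": "id", "tumour_site": "location"}

    def __init__(self):
        self.root = ET.fromstring(
            '<records><rec id="x7" location="skull base"/>'
            '<rec id="x8" location="lung"/></records>')

    def query(self, field, value):
        attr = self.ATTRS[field]
        return [{"patient_id": r.get("id"), "tumour_site": r.get("location")}
                for r in self.root.findall("rec") if r.get(attr) == value]


class Mediator:
    """Distributes a global query to all wrappers and merges the results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, field, value):
        merged = []
        for wrapper in self.wrappers:
            merged.extend(wrapper.query(field, value))
        return merged


mediator = Mediator([SqlWrapper(), XmlWrapper()])
hits = mediator.query("tumour_site", "skull base")
print(hits)
```

Each wrapper owns the mapping to its source, so a source can change its internal schema without affecting the mediator or the other sources, which is the autonomy argument made for the federated approach.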

6.2.1 Semantic annotation

ACGT and caBIG follow two different approaches to annotating data; see table 6.1 for a comparison.

ACGT follows a top-down approach. It tries to model the domain of interest in a comprehensive ontology and maps data to the concepts in this ontology (illustrated in the ontology-column in fig. 6.1). This method was chosen to provide the best reasoning capabilities, a prerequisite for semantic queries.

In caBIG and cancerGrid, data is mapped to Common Data Elements which are registered in a common ISO 11179 compliant database (illustrated in the MDR-column in fig. 6.1). The registration process describes data items by linking them to a terminology resource. This approach is more flexible and easier to maintain but limits the reasoning capabilities of the system.

While the caBIG implementation, caDSR, is intended to serve the entire research community by providing a centralised repository with many features, the cancerGrid implementation, cgMDR, is a distributed solution aimed at workgroups. In this approach, each of the sources can be described by a local MDR which can be maintained by the data owners themselves. The mapping between the local CDEs and the agreed global CDEs can be inferred from rules defined for each of the sources.


Figure 6.1: Options for semantic integration: semantic annotation, query processing and data access.


The domain of hadron therapy is still evolving. New concepts and relations between concepts may be discarded or reclassified as research continues. Some aspects of radiation therapy in general and hadron therapy in particular seem not to be modelled, since previous projects were centred around trials for chemical drug development and genetic data. The lack of a consistent domain model favours the cancerGrid/caBIG approach to interoperability since it allows registering metadata items even without the need for an underlying domain model. These data elements can be linked to an ontology at any later point.

Certain features of data integration systems, like machine reasoning on the data for semantic queries or future efforts towards data mining, require the information to be strongly linked. Some services using HISP may therefore need their own ontology on top of the integrated data to support strong reasoning. The rare tumour database, for example, will probably be based on its own data model.

In order to build a semantic framework from the MDR based approach, the following constructs are needed [41]:

1. a terminology service which provides access to defined terms

2. metadata registries to build collections of structured “metadata elements” which may be related to one or more terms in the underlying terminology.

3. model repositories to store re-usable models for e.g. database schemas, service descriptions, forms, queries, mappings, ...
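The relation between the three constructs listed above can be sketched as a small data model. This is a deliberately simplified illustration, not an ISO 11179 compliant registry and not the cgMDR or caDSR API: all class names, term codes (`T001`, `T002`), element identifiers (`DE-1`, `DE-2`) and the example form are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A defined term served by the terminology service."""
    code: str
    label: str


@dataclass
class DataElement:
    """A registered metadata element, optionally linked to terms."""
    identifier: str
    name: str
    value_domain: str  # e.g. a unit or an enumerated value set
    term_codes: list = field(default_factory=list)


class TerminologyService:
    def __init__(self, terms):
        self._terms = {t.code: t for t in terms}

    def lookup(self, code):
        return self._terms[code]


class MetadataRegistry:
    def __init__(self, terminology):
        self._terminology = terminology
        self._elements = {}

    def register(self, element):
        # Raises KeyError if an element links to an unknown term.
        for code in element.term_codes:
            self._terminology.lookup(code)
        self._elements[element.identifier] = element

    def describe(self, identifier):
        element = self._elements[identifier]
        labels = [self._terminology.lookup(c).label for c in element.term_codes]
        return f"{element.name} [{element.value_domain}]: {', '.join(labels)}"


class ModelRepository:
    """Stores re-usable models (here: a form) as lists of element identifiers."""

    def __init__(self):
        self._models = {}

    def store(self, name, element_ids):
        self._models[name] = list(element_ids)

    def elements_of(self, name):
        return self._models[name]


terms = TerminologyService([Term("T001", "Absorbed dose"), Term("T002", "Carbon ion")])
mdr = MetadataRegistry(terms)
mdr.register(DataElement("DE-1", "Prescribed dose", "Gy (RBE)", ["T001"]))
mdr.register(DataElement("DE-2", "Particle species", "enumeration", ["T002"]))
models = ModelRepository()
models.store("treatment-summary-form", ["DE-1", "DE-2"])
summary = [mdr.describe(i) for i in models.elements_of("treatment-summary-form")]
print(summary)
```

The layering is the relevant design choice: models refer only to element identifiers, and elements refer only to term codes, so each layer can be maintained by a different group.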

6.2.1.1 Terminology server

Section 4.3.1 introduced some of the publicly available terminologies and ontologies of interest for HISP. In particular, EVS (section 5.2.3.1) and the LexBIG project seem to provide comprehensive vocabulary for the biomedical domain and are accessible in different formats and via APIs. A “metadata connector” was developed by cancerGrid to link CDEs in metadata registries (cgMDR, caDSR) to terminologies in LexBIG and EVS.

Some aspects of particle therapy, e.g. accelerator related vocabulary, are not sufficiently covered by existing terminologies. These fields have to be modelled and integrated into the existing terminologies.

6.2.1.2 Meta data registry

ISO 11179 lays out requirements for metadata registries. Compliant implementations, e.g. caDSR (caBIG, 5.2.3.2) and cgMDR (cancerGrid, 5.3), exist, are functional and provide tools for the provision and maintenance of data elements.

For implementation, a database has to be set up as a grid service.

The cancerGrid metadata registry cgMDR is based upon the open source XML database eXist and implements its logic in W3C standards. Access to the database is through REST with optional web services and would have to be ported to Grids.


| | source representation | mapping | tool | global representation |
|---|---|---|---|---|
| ACGT | database schema | mapped to relevant subset of MO (a “virtual schema”) | D2RQ | Master Ontology (OWL-DL) |
| caBIG | UML | mapped to CDEs in caDSR | ? | caDSR + EVS |
| cancerGrid | UML | mapped to CDEs in local MDR, relation between local MDR and global MDR inferred from rules | SQIV, Jena Framework | any MDR + EVS |

Table 6.1: Comparison between frameworks for semantic annotation.

The caBIG metadata registry uses Introduce [50] for developing strongly typed, WSRF-compliant Grid services on the Globus Toolkit.

Other metadata registries may provide similar capabilities and may be preferable, depending on the grid environment chosen. We need to explore whether AMGA would be a viable alternative in the gLite environment.

6.2.1.3 Model repository

The support for strongly typed services requires the availability of data type repository services, which maintain the definition and structure of Grid supplied data types. Having Grid-wide accessible data type repository services can promote the usage of common data types across the Grid. Model repositories like the GME of the Mobius framework serve as data type repositories and provide tools and services for coordinated management of XML schemas. An XML schema describes the structure of a simple or complex data element (data object) that is consumed or produced by a service and is exchanged between two endpoints in the environment.

6.2.2 “Mediation”-query processor

Since data in ACGT is annotated with a comprehensive ontology, semantic queries can be asked against the mediator in RDQL, an RDF query language. These queries are translated into SPARQL queries to the virtual schemata of the individual sources.

caBIG is based on an object data model which can be queried by the caBIG query languages CQL and DCQL. In order to run semantic queries over the caBIG data, the UML model has to be transformed into an ontology (see footnote 1). Projects like Corvus [62], semCDI [79] or [45] apply this technique using different UML-OWL mapping approaches. These approaches will be discussed in more detail in a future report.

Table 6.2 compares the query strategies of ACGT, caBIG and cancerGrid.

| | User Query | Query Translation | Query Result |
|---|---|---|---|
| ACGT | RDQL (semantic) | RDQL -> SPARQL on virtual schemata, divided into queries to different wrappers | SPARQL Query Results XML, Mediator joins results into a set of instances of classes belonging to the ACGT-MOC |
| caBIG | CQL, DCQL (object) | | |
| caBIG | semCDI, Corvus (semantic) | caDSR associations represented as OWL ontology, querying with SPARQL, translation into source specific CQL queries | |
| cancerGrid | | mapped to CDEs in local MDR, relation between local MDR and global MDR inferred from rules | |

Table 6.2: Comparison between Query Processors.

6.2.3 Data access

Grid technology seems best suited to address the requirements for a distributed infrastructure. Existing databases can be wrapped and exposed as grid services by OGSI or WSRF compliant implementations. This approach is chosen by most of the data integration projects (table 6.3) as well as by the approaches outlined in section 5.4.

OGSA-DAI can wrap several database formats and expose them as services. It supports workflows and provides an engine for querying distributed sources. OGSA-WebDB provides access to web databases. ACGT used SPARQL for querying DICOM via OGSA-DAI. Semantic-OGSA-DAI supports ontology driven access to distributed relational database sources.

1 A good description of how semantic web technologies can be used to enable caBIG for data integration can be found in [69].


| | Query | source databases | wrapper for sources |
|---|---|---|---|
| ACGT | SPARQL | relational, web databases, DICOM | OGSA-DAI, OGSA-WebDB |
| caBIG | CQL, DCQL, semCDI | relational, XML | OGSA-DAI, WSRF Hibernate |
| cancerGrid | SQIV | relational, XML | |

Table 6.3: Comparison between data access services.

6.3 Next steps

Semantic data integration is one of the core functionalities of HISP. We have to decide on a practical implementation which allows semantic annotation of data and data querying. The components for this implementation have to be selected.

As far as semantic annotation is concerned, the open source solutions of caBIG or cancerGrid, as a light-weight alternative, should fulfil most of our requirements and could be adapted for the Hadron Therapy community. Since caBIG is based on the Globus Toolkit as Grid middleware, some compatibility issues may arise when installing it on a gLite based platform. This would have to be explored further if gLite is chosen as the underlying middleware.

Attempts have been made to support semantic queries across various caBIG annotated data sources [45, 62, 79]. Since HISP has to provide an easy-to-use access portal to the integrated data, this topic will require further attention by WP23.

The outcome of the second deliverable, ethical and legal implications of medical data sharing, will allow us to refine the requirements for data access from sources by the platform, metadata storage and handling, and user access to the platform and the underlying data. In this context WP23 will also investigate how queries on metadata may reveal information about the underlying data. This will have implications for the handling of metadata describing sensitive patient information.


List of Tables

6.1 Comparison between frameworks for semantic annotation.
6.2 Comparison between Query Processors.
6.3 Comparison between data access services.


List of Figures

2.1 Conceptual layout of PARTNER Hadron Therapy Information Sharing Platform.

3.1 Characteristics of database systems.
3.2 Architectural data integration approaches.
3.3 Integration systems.
3.4 FDBS and mediator-wrapper approach.
3.5 Mediator system example: Garlic architecture.
3.6 Single, multi, hybrid ontology approaches.
3.7 OBSERVER architecture.

4.1 Wisdom hierarchy and Semantic Grid.
4.2 Stakeholders for Semantic Grid.
4.3 Semantic web stack.
4.4 Extract from ACGT Master Ontology.
4.5 Common Data Element.
4.6 Registered data element in ULICE MDR.
4.7 OGSA-DAI for data integration and distributed queries.

5.1 Architecture and data access tools in ACGT.
5.2 Knowledge and Discovery tools.
5.3 caGrid infrastructure.
5.4 caGrid Data Access Services Architecture.
5.5 Semantic model layers in caCORE.
5.6 caCORE tools.
5.7 Major components of caCORE 3.
5.8 caTRIP architecture.
5.9 cancerGrid overview.
5.10 cancerGrid: SQIV.
5.11 METABRIC: query translation and execution.
5.12 @neurIST Reference Architecture.
5.13 Distributed Data Mediation Service in @neurIST.


5.14 ADMIRE infrastructure.
5.15 S-OGSA in OntoGrid.
5.16 S-OGSA-DAI.

6.1 Options for semantic data integration.


Bibliography

[1] Advancing Clinico Genomic Trials on Cancer (ACGT). Website. URL http://www.eu-acgt.org/.

[2] BFO - Basic Formal Ontology. Website (last accessed December 29, 2009). URL http://www.ifomis.org/bfo.

[3] IFOMIS - Institute for Formal Ontology and Medical Information Science, Saarland University. Website (last accessed December 29, 2009). URL http://www.ifomis.org/.

[4] LexBIG. Online (last accessed January 7, 2010). URL https://cabig-kc.nci.nih.gov/Vocab/KC/index.php/LexBIG.

[5] OGSA-DAI. Online (last accessed January 14, 2010). URL http://www.ogsadai.org.uk/.

[6] OGSA-DAI 3.2.2 user guides. Online (last accessed January 13, 2010). URL http://sourceforge.net/apps/trac/ogsa-dai/wiki/UserDocumentation/ogsadai3.2.2.

[7] ACGT - Toward a European e-infrastructure for clinico genomic research on cancer. Online poster (last accessed December 29, 2009). URL http://eu-acgt.org/fileadmin/dissemination_materials/ACGT_poster_obtima_light.pdf.

[8] RDQL - A query language for RDF. Online (last accessed December 29, 2009). URL http://www.w3.org/Submission/RDQL/.

[9] W3C SPARQL query language for RDF. Online (last accessed December 29, 2009). URL http://www.w3.org/TR/rdf-sparql-query/.

[10] SPARQL query results XML format. Online (last accessed December 29, 2009). URL http://www.w3.org/TR/rdf-sparql-XMLres/.

[11] Semantic Grid vision on www.semanticgrid.org. Online (last accessed January 13, 2010). URL http://www.semanticgrid.org/vision.html.


[12] Owl web ontology language reference. online (last accessed January 13, 2010). URL http://www.w3.org/TR/2004/REC-owl-features-20040210/.

[13] Version 2 owl web ontology language document overview. online (last accessed January 13, 2010). URL http://www.w3.org/TR/2009/REC-owl2-overview-20091027/.

[14] W3c resource description framework (rdf): Concepts and abstract syntax. online (last accessed January 13, 2010). URL http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/.

[15] W3c skos simple knowledge organization system reference. online (last accessed January 13, 2010). URL http://www.w3.org/TR/2009/REC-skos-reference-20090818/.

[16] W3c semantic web activity. online (last accessed January 13, 2010). URL http://www.w3.org/2001/sw/.

[17] W3c semantic web linked data, vocabularies, query, inference, vertical applications. online (last accessed January 13, 2010). URL http://www.w3.org/standards/semanticweb/ and links to Linked Data, Vocabularies, Query, Inference, Vertical Applications.

[18] W3c xml technology. online (last accessed January 13, 2010). URL http://www.w3.org/standards/xml/ and links.

[19] W3c: Namespaces in xml 1.0 (third edition). online (last accessed January 13, 2010). URL http://www.w3.org/TR/2009/REC-xml-names-20091208/ and links.

[20] Cancer bioinformatics infrastructure objects (cabio). online (last accessed January 11, 2010). URL https://cabig.nci.nih.gov/tools/cabio.

[21] The cagrid knowledge center: cagrid 1.3 technical overview. online (last accessed January 12, 2010). URL http://wiki.cagrid.org/display/knowledgebase/caGrid+1.3+Technical+Overview#caGrid1.3TechnicalOverview-Toc206325222.

[22] Daniel Abler, Vassiliki Kanellopoulos, and Faustin Laurentiu Roman. Future information sharing in hadron therapy. Poster at ’Physics for Health in Europe’ workshop, 2-4 Feb. 2010, CERN, Geneva, February 2010. URL https://espace.cern.ch/partnersite/workspace/abler/Shared%20Documents/ConferencePoster/PARTNER_Grid_PosterAndAbstract_PhysicsForHealthCERN2010.pdf.

[23] Pinar Alper, Carole Goble, and Oscar Corcho. Understanding semantic aware grid middleware for e-science. Computing and Informatics, 27:93–118, 2008.


[24] Ugo Amaldi and Gerhard Kraft. Particle accelerators take up the fight against cancer, December 2006. URL http://cerncourier.com/cws/article/cern/29777.

[25] American Cancer Society. Cancer projected to become leading cause of death worldwide in 2010. online article, December 9, 2008. URL http://www.sciencedaily.com/releases/2008/12/081209111516.htm.

[26] I. Andoulsi, I. Blanquer, V. Breton, A. Dobrev, C. Van Doosselaere, V. Hernandez, J. Herveg, N. Jacq, Y. Legré, M. Olive, H. Rahmouni, T. Solomonides, K. Stroetmann, V. Stroetmann, and P. Wilson. Share the journey - a european healthgrid roadmap, October 2008.

[27] Mario Antonioletti, Malcolm P. Atkinson, Robert M. Baxter, Andrew Borley, Neil P. Chue Hong, Brian Collins, Neil Hardman, Alastair C. Hume, Alan Knox, Mike Jackson, Amrey Krause, Simon Laws, James Magowan, Norman W. Paton, Dave Pearson, Tom Sugden, Paul Watson, and Martin Westhead. The design and implementation of grid database services in ogsa-dai. Concurrency - Practice and Experience, 17(2-4):357–376, 2005. doi: 10.1002/cpe.939.

[28] Malcolm P. Atkinson, Jano I. van Hemert, Liangxiu Han, Ally Hume, and Chee Sun Liew. A distributed architecture for data mining and integration. In DADC ’09: Proceedings of the second international workshop on Data-aware distributed computing, pages 11–20, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-589-5. doi: http://doi.acm.org/10.1145/1552280.1552282.

[29] Marcel Bajard, Jean-Marie De Conto, and Joseph Remillieux. Status of the etoile project for a french hadrontherapy centre. Radiotherapy and Oncology, 73(Supplement 2):S211–S215, 2004. ISSN 0167-8140. doi: 10.1016/S0167-8140(04)80050-1. Carbon-Ion Therapy.

[30] Tim Berners-Lee. Weaving the Web: the past, present and future of the World Wide Web by its inventor. Texere, London, 2000.

[31] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web: A new form of web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American Magazine, May 2001. URL http://www.scientificamerican.com/article.cfm?id=the-semantic-web. online (last accessed January 13, 2010).

[32] Christian Bizer and Andy Seaborne. D2rq - treating non-rdf databases as virtual rdf graphs. In ISWC2004 (posters), November 2004. URL http://www4.wiwiss.fu-berlin.de/bizer/pub/Bizer-D2RQ-ISWC2004.pdf.

[33] James Brenton, Carlos Caldas, Jim Davies, Steve Harris, and Peter Maccallum. Cancergrid: developing open standards for clinical cancer informatics. In Proceedings of the UK e-Science All Hands Meeting 2005, pages 678–681.


[34] James Brenton, Jim Davies, Jeremy Gibbons, and Steve Harris. Accelerating cancer research using semantics-driven technology.

[35] Mathias Brochhausen, Gabriele Weiler, Cristian Cocos, Holger Stenzhorn, Norbert Graf, Martin Doerr, and Manolis Tsiknakis. The acgt master ontology on cancer - a new terminology source for oncological practice. Computer-Based Medical Systems, IEEE Symposium on, 0:324–329, 2008. ISSN 1063-7125. doi: http://doi.ieeecomputersociety.org/10.1109/CBMS.2008.17.

[36] caBIG. cabig compatibility guidelines. Technical report, NCI, 2008. URL https://gforge.nci.nih.gov/frs/download.php/3948/caBIG_Compatibility_Guidelines_v3.0_FINAL.pdf.

[37] M.J. Carey, L.M. Haas, P.M. Schwarz, M. Arya, W.F. Cody, R. Fagin, M. Flickner, A.W. Luniewski, W. Niblack, D. Petkovic, J. Thomas, J.H. Williams, and E.L. Wimmers. Towards heterogeneous multimedia information systems: the garlic approach. In Research Issues in Data Engineering, 1995: Distributed Object Management, Proceedings. RIDE-DOM ’95. Fifth International Workshop on, pages 124–131, Mar 1995. doi: 10.1109/RIDE.1995.378736.

[38] Charles Crichton, CancerGrid Team. Processing meta-data identifiers with sharepoint and sqiv. presentation. URL http://www.cancergrid.org/index.php?option=com_remository&Itemid=26&func=download&id=149&chk=4bdcb0a03c26cd8c9b7efc3a1a56be0b&no_html=1.

[39] Oscar Corcho, Pinar Alper, Ioannis Kotsiopoulos, Paolo Missier, Sean Bechhofer, and Carole Goble. An overview of s-ogsa: A reference semantic grid architecture. Web Semantics: Science, Services and Agents on the World Wide Web, 4(2):102–115, 2006. ISSN 1570-8268. doi: 10.1016/j.websem.2006.03.001. URL http://www.sciencedirect.com/science/article/B758F-4JXRX6P-1/2/32f53011cef29fc00efd1de7d4c601e6. Semantic Grid - The Convergence of Technologies.

[40] Peter A. Covitz, Frank Hartel, Carl Schaefer, Sherri De Coronado, Gilberto Fragoso, Himanso Sahni, Scott Gustafson, and Kenneth H. Buetow. cacore: A common infrastructure for cancer informatics. 2003. URL http://bioinformatics.oxfordjournals.org/cgi/content/short/19/18/2404.

[41] Jim Davies, Jeremy Gibbons, Steve Harris, and Denise Warzel. Evolving health informatics: Semantic frameworks and metadata-driven architectures. Mathematical biosciences and engineering : MBE, 5. ISSN 1547-1063.

[42] EuroStat. eurostat news release: Causes of death in the eu25, 2006. URL http://epp.eurostat.ec.europa.eu/cache/ITY_PUBLIC/3-18072006-AP/EN/3-18072006-AP-EN.PDF.


[43] Asunción Gómez-Pérez, Carole Goble, and Oscar Corcho. Ontogrid fp6-511513 final report systematic metadata management for applications that use the grid. Public deliverables of "ontogrid", November 2007. URL http://www.ontogrid.net/ontogrid/download/OntoGrid-FinalPublishableReport.pdf.

[44] Carole A. Goble and David De Roure. The semantic grid: Myth busting and bridge building. In ECAI, pages 1129–1135, 2004.

[45] A. Gonzalez Beltran, A. Finkelstein, J. M. Wilkinson, and J. Kramer. Domain concept-based queries for cancer research data sources. In 22nd IEEE International Symposium on Computer-Based Medical Systems (CBMS 2009), Albuquerque, New Mexico, 2009.

[46] E. Griesmayer, T. Schreiner, and M. Pavlovic. The medaustron project. Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms, 258(1):134–138, 2007. ISSN 0168-583X. doi: 10.1016/j.nimb.2006.12.082. URL http://www.sciencedirect.com/science/article/B6TJN-4MMPNDK-J/2/3c6ceaaa013f16bfe9ccdd2489c10011. Inelastic Ion-Surface Collisions - Proceedings of the 16th International Workshop on Inelastic Ion-Surface Collisions.

[47] Thomas R. Gruber. A translation approach to portable ontology specifications. Knowl. Acquis., 5(2):199–220, 1993. ISSN 1042-8143. doi: http://dx.doi.org/10.1006/knac.1993.1008.

[48] Th. Haberer, J. Debus, H. Eickhoff, O. Jäkel, D. Schulz-Ertner, and U. Weber. The heidelberg ion therapy center. Radiotherapy and Oncology, 73(Supplement 2):S186–S190, 2004. ISSN 0167-8140. doi: 10.1016/S0167-8140(04)80046-X. URL http://www.sciencedirect.com/science/article/B6TBY-4H0RR0H-1M/2/e556a09592d9c0c6f2852ecd8f353624. Carbon-Ion Therapy.

[49] Alon Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, December 2001. ISSN 10668888. doi: 10.1007/s007780100054. URL http://dx.doi.org/10.1007/s007780100054.

[50] Shannon Hastings, Scott Oster, Stephen Langella, David Ervin, Tahsin Kurc, and Joel Saltz. Introduce: An open source toolkit for rapid development of strongly typed grid services. Journal of Grid Computing, 5(4):407–427, December 2007. doi: 10.1007/s10723-007-9074-8. URL http://dx.doi.org/10.1007/s10723-007-9074-8.


[51] HIT. Hit. webpage (retrieved December 7, 2009). URL http://www.klinikum.uni-heidelberg.de/Startseite-HIT.113005.0.html.

[52] Ally Hume, Liangxiu Han, Jano van Hemert, and Malcolm Atkinson. Admire d2.1-public - admire architecture. Public deliverables of "advanced data mining and integration research for europe", University of Edinburgh and others within the ADMIRE Project, February 2009. URL http://www.admire-project.eu/docs/ADMIRE-architecture.pdf.

[53] ISO/IEC. International standard iso/iec 11179-1 information technology - metadata registries (mdr) - part 1: Framework. Technical report, ISO / IEC, 2004. URL http://standards.iso.org/ittf/PubliclyAvailableStandards/c035343_ISO_IEC_11179-1_2004(E).zip.

[54] Jim Davies, Jeremy Gibbons, Steve Harris, and Denise Warzel. Evolving health informatics: Semantic frameworks and metadata-driven architectures. presentation, 2008. URL http://cancergrid.org.

[55] Martin Koehler and Siegfried Benkner. A service oriented approach for distributed data mediation on the grid. Grid and Cooperative Computing, International Conference on, 0:401–408, 2009. doi: http://doi.ieeecomputersociety.org/10.1109/GCC.2009.35.

[56] G. Komatsoulis, D. Warzel, F. Hartel, K. Shanbhag, R. Chilukuri, G. Fragoso, S. Coronado, D. Reeves, J. Hadfield, and C. Ludet. cacore version 3: Implementation of a model driven, service-oriented architecture for semantic interoperability. Journal of Biomedical Informatics, 41(1):106–123, February 2008. ISSN 15320464. doi: 10.1016/j.jbi.2007.03.009. URL http://dx.doi.org/10.1016/j.jbi.2007.03.009.

[57] Jacek Kopecky, Tomas Vitvar, Carine Bournez, and Joel Farrell. Sawsdl: Semantic annotations for wsdl and xml schema. IEEE Internet Computing, 11:60–67, 2007. ISSN 1089-7801. doi: http://doi.ieeecomputersociety.org/10.1109/MIC.2007.134.

[58] Maurizio Lenzerini. Data integration: A theoretical perspective. In PODS, pages 233–246, 2002.

[59] Luis Martín. Consolidated requirements on ontological approaches for integration of multi-level biomedical information. Public deliverables of "advancing clinico genomic trials on cancer" (acgt): D7.1, UPM, School of Computer Science, UPM, Madrid, Spain, January 2006. URL http://eu-acgt.org/documents/public-deliverables.html.

[60] Luis Martín, Erwin Bonsma, Alberto Anguita, Jeroen Vrijnsen, Miguel García-Remesal, José Crespo, Manolis Tsiknakis, and Víctor Maojo. Data access and management in acgt: Tools to solve syntactic and semantic heterogeneities between clinical and image databases. In Jean-Luc Hainaut, Elke A. Rundensteiner, Markus Kirchberg, Michela Bertolotto, Mathias Brochhausen, Yi-Ping P. Chen, Samira S. Cherfi, Martin Doerr, Hyoil Han, Sven Hartmann, Jeffrey Parsons, Geert Poels, Colette Rolland, Juan Trujillo, Eric Yu, and Esteban Zimányi, editors, Advances in Conceptual Modeling - Foundations and Applications, volume 4802 of Lecture Notes in Computer Science, chapter 4, pages 24–33. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007. ISBN 978-3-540-76291-1. doi: 10.1007/978-3-540-76292-8_4. URL http://dx.doi.org/10.1007/978-3-540-76292-8_4.

[61] Patrick McConnell. Cancer translational research informatics platform - a translational tool in action. presentation, September 2007. URL http://gforge.nci.nih.gov/docman/view.php/131/6486/caTRIP_icr_2007_05_09.ppt.

[62] J. P. McCusker, J. A. Phillips, A. G. Beltran, A. Finkelstein, and M. Krauthammer. Semantic web data warehousing for cagrid. BMC bioinformatics, 10 Suppl 10:S2, Oct 1 2009. doi: 10.1186/1471-2105-10-S10-S2.

[63] E. Mena, V. Kashyap, A. Sheth, and A. Illarramendi. Observer: An approach for query processing in global information systems based on interoperation across pre-existing ontologies. Cooperative Information Systems, IFCIS International Conference on, 0:14, 1996. doi: http://doi.ieeecomputersociety.org/10.1109/COOPIS.1996.554955.

[64] NCI. Nci enterprise vocabulary services (evs). online (last accessed December 30, 2009). URL http://www.cancer.gov/cancertopics/terminologyresources/.

[65] Scott Oster, Shannon L. Hastings, Stephen Langella, David W. Ervin, Ravi Madduri, Tahsin M. Kurc, Frank Siebenlist, Ian Foster, Krishnakant Shanbhag, Peter A. Covitz, and Joel H. Saltz. cagrid 1.0: A grid enterprise architecture for cancer research, Dec 2007.

[66] Tamer Özsu and Patrick Valduriez. Principles of Distributed Database Systems. Prentice Hall, January 1999. ISBN: 9780136597070.

[67] Irene Papatheodorou, Charles Crichton, Lorna Morris, Peter Maccallum, Molecular Taxonomy of Breast Cancer International Consortium METABRIC Group, Jim Davies, James Brenton, and Carlos Caldas. A metadata approach for clinical data management in translational genomics studies in breast cancer. BMC Medical Genomics, 2(1):66+, 2009. ISSN 1755-8794. doi: 10.1186/1755-8794-2-66.


[68] Jyotishman Pathak, Harold R. Solbrig, James D. Buntrock, Thomas M. Johnson, and Christopher G. Chute. Lexgrid: a framework for representing, storing, and querying biomedical terminologies from simple to sublime. Journal of the American Medical Informatics Association : JAMIA, 16(3):305–315, March 2009. ISSN 1067-5027. doi: 10.1197/jamia.M3006. URL http://dx.doi.org/10.1197/jamia.M3006.

[69] Joshua Phillips. Evaluation of semantic web technologies and cabig. Technical report, NCI, 2009. URL https://wiki.nci.nih.gov/download/attachments/18947630/EvaluationOfSWTechAndCaBIG.doc.

[70] Joshua Phillips, Ram Chilukuri, Gilberto Fragoso, Denise Warzel, and Peter Covitz. The cacore software development kit: Streamlining construction of interoperable biomedical information services. BMC Medical Informatics and Decision Making, 6(1):2+, 2006. ISSN 1472-6947. doi: 10.1186/1472-6947-6-2. URL http://dx.doi.org/10.1186/1472-6947-6-2.

[71] PTCOG. Particle therapy facilities in operation. webpage (retrieved December 7, 2009). URL http://ptcog.web.psi.ch/ptcentres.html.

[72] Hariharan Rajasekaran, Luigi Lo Iacono, Peer Hasselmeyer, Jochen Fingberg, Paul Summers, Siegfried Benkner, Gerhard Engelbrecht, Antonio Arbona, Alessandro Chiarini, Christoph M. Friedrich, Martin Hofmann-Apitius, Kai Kumpf, Bob Moore, Philippe Bijlenga, Jimison Iavindrasana, Henning Mueller, Rod D. Hose, Robert Dunlop, and Alejandro Frangi. @neurist - towards a system architecture for advanced disease management through integration of heterogeneous data, computing, and complex processing services. In CBMS ’08: Proceedings of the 2008 21st IEEE International Symposium on Computer-Based Medical Systems, pages 361–366, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3165-6. doi: http://dx.doi.org/10.1109/CBMS.2008.42.

[73] Faustin Roman. Health grids: overview and added-value (partner work package 22, deliverable 1; prototype grid hadron therapy testbed). Technical report, CERN, November 2009. URL https://espace.cern.ch/partnersite/workspace/faust/Shared%20Documents/FaustinRoman.WP22.D1.pdf.

[74] David De Roure, N. R. Jennings, and Nigel R Shadbolt. The semantic grid: Past, present and future. Proceedings of the IEEE, 93(3):669–681, 2005. URL http://eprints.ecs.soton.ac.uk/9976/.

[75] Jennifer Rowley. The wisdom hierarchy: representations of the dikw hierarchy. Journal of Information Science, 33(2):163–180, 2007. doi: 10.1177/0165551506070706. URL http://jis.sagepub.com/cgi/content/abstract/33/2/163.

[76] Joel H. Saltz, Scott Oster, Shannon L. Hastings, Stephen Langella, William Sanchez, Manav Kher, Peter A. Covitz, Tahsin M. Kurc, and Krishnakant Shanbhag. cagrid: design and implementation of the core architecture of the cancer biomedical informatics grid, Dec 2006.

[77] W. Sanchez, B. Gilman, M. Kher, S. Lagou, and S. Covitz. cagrid white paper. Technical report, National Cancer Institute, 2004. URL http://cabig.nci.nih.gov/guidelines_documentation/caGRIDWhitepaper.pdf.

[78] Amit P. Sheth and James A. Larson. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput. Surv., 22(3):183–236, September 1990. ISSN 0360-0300. doi: 10.1145/96602.96604. URL http://dx.doi.org/10.1145/96602.96604.

[79] E. Patrick Shironoshita, Yves R. Jean-Mary, Ray M. Bradley, and Mansur R. Kabuka. semcdi: A query formulation for semantic data integration in cabig. Journal of the American Medical Informatics Association, 15(4):559–568, 2008. ISSN 1067-5027. doi: 10.1197/jamia.M2732. URL http://www.sciencedirect.com/science/article/B7CPS-4SY6VS1-T/2/ebedef812f17828340347f72d67c76b1.

[80] Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J. Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J. Mungall, OBI Consortium, Neocles Leontis, Philippe Rocca-Serra, Alan Ruttenberg, Susanna-Assunta A. Sansone, Richard H. Scheuermann, Nigam Shah, Patricia L. Whetzel, and Suzanna Lewis. The obo foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology, 25(11):1251–1255, November 2007. ISSN 1087-0156. doi: 10.1038/nbt1346.

[81] Dr. Andreas Thor. Datenintegration (lecture notes "data integration" 2008). online lecture notes. URL http://dbs.uni-leipzig.de/stud/ss2008/datenintegration.

[82] M. Tsiknakis, M. Brochhausen, J. Nabrzyski, J. Pucacki, S. G. Sfakianakis, G. Potamias, C. Desmedt, and D. Kafetzopoulos. A semantic grid infrastructure enabling integrated access and analysis of multilevel biomedical data in support of postgenomic clinical trials on cancer. IEEE transactions on information technology in biomedicine : a publication of the IEEE Engineering in Medicine and Biology Society, 12(2):205–217, March 2008. ISSN 1089-7771. doi: 10.1109/TITB.2007.903519. URL http://dx.doi.org/10.1109/TITB.2007.903519.

[83] H. Wache, T. Vögele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann, and S. Hübner. Ontology-based integration of information - a survey of existing approaches. pages 108–117, 2001. doi: 10.1.1.12.8073. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.8073.

[84] WHO. Fact sheet 297: Cancer, February 2009. URL http://www.who.int/mediacentre/factsheets/fs297/en/index.html.


[85] G. Wiederhold. Mediators in the architecture of future information systems. Computer, 25(3):38–49, Mar 1992. ISSN 0018-9162. doi: 10.1109/2.121508.

[86] R Wilson. Radiological use of fast protons. Radiology, 47:487, 1946.

[87] Jeremy C Wyatt and Frank Sullivan. ehealth and the future: promise or peril? BMJ, 331:1391–1393, 2005. doi: 10.1136/bmj.331.7529.1391.

[88] Patrick Ziegler and Klaus R. Dittrich. User-specific semantic integration of heterogeneous data: The sirup approach. In Semantics of a Networked World, pages 44–64. 2004. URL http://www.springerlink.com/content/yl6a9r3xfmfh10m0.

[89] Patrick Ziegler and Klaus R. Dittrich. Three decades of data integration - all problems solved? In 18th IFIP World Computer Congress (WCC 2004), Volume 12, Building the Information Society, pages 3–12, 2004.
