Cancer Biomedical Informatics Grid (caBIG) – An Approach towards Data Access and Integration Avinash Shanbhag Director, Core Infrastructure Engineering National Cancer Institute Center for Bioinformatics


Cancer Biomedical Informatics Grid (caBIG) – An Approach towards Data Access and Integration

Avinash Shanbhag

Director, Core Infrastructure Engineering

National Cancer Institute Center for Bioinformatics


National Cancer Institute 2015 Goal

Relieve suffering and death due to cancer by the year 2015


Origins of caBIG

Need: Enable investigators and research teams nationwide to combine and leverage their findings and expertise in order to meet the NCI 2015 Goal.

Strategy: Create a scalable, actively managed organization that will connect members of the NCI-supported cancer enterprise by building a biomedical informatics network over which data can be seamlessly shared.


caBIG Challenges

Handle diversity of data types

Precise “Meaning” of data

Provide local hosting of data

Local access control

Provide tools to “publish” and “access” data easily

High-performance computing will be needed in the future


Interoperability

The ability of a system to access and use the parts or equipment of another system

Syntactic interoperability

Semantic interoperability


How to Achieve Interoperability for Data Systems?

Well Documented public API access to data

Based on object oriented abstraction of underlying data
– No particular technology or tool specified

Abstraction layer must be derived using widely accepted “standards”
– Model Driven Architecture

Information Model is the “Metadata” of the data and needs to be persisted and accessible via API

Need to be able to “unambiguously” and programmatically determine the meaning of data


OMG Model Driven Architecture (MDA) Approach

Analyze the problem space and develop the artifacts for each scenario
– Use Cases

Use Unified Modeling Language (UML) to standardize model representations and artifacts. Design the system by developing artifacts based on the use cases
– Class Diagram: Information Model
– Sequence Diagram: Temporal Behavior

Use meta-model tools to generate the code


Limitations of MDA

Limited expressivity for semantics

No facility for runtime semantic metadata management


caCORE: Syntactic and Semantic Integration

MDA Plus a whole lot more!


caCORE

Bioinformatics Objects

Enterprise Vocabulary

Common Data Elements

SECURITY


Use Cases

Description

Actors

Basic Course

Alternative Course


Bioinformatics Objects


What do all those data classes and attributes actually mean, anyway?

Data descriptors or “semantic metadata” required

Computable, commonly structured, reusable units of metadata are “Common Data Elements” or CDEs.

NCI uses the ISO/IEC 11179 standard for metadata structure and registration

Semantics all drawn from Enterprise Vocabulary Service resources

Common Data Elements


Preferred Name

Synonyms

Definition

Relationships

Concept Code

Enterprise Vocabulary Description Logic


Semantic metadata example: Agent

<Agent>
  <name>Taxol</name>
  <nSCNumber>007</nSCNumber>
</Agent>


Why do you need metadata?

The same class, attributes, and example object data read under two different sets of metadata:

Class/Attribute: Agent
  CIA Metadata: A sworn intelligence agent; a spy
  NCI Metadata: Chemical compound administered to a human being to treat a disease or condition, or prevent the onset of a disease or condition

Class/Attribute: Agent nSCNumber
  Example Object Data: 007
  CIA Metadata: Identifier given to an intelligence agent by the National Security Council
  NCI Metadata: Identifier given to chemical compound by the US Food and Drug Administration Nomenclature Standards Committee

Class/Attribute: Agent name
  Example Object Data: Taxol
  CIA Metadata: CIA code name given to intelligence agents
  NCI Metadata: Common name of chemical compound used as an agent


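Computable Interoperability

[Figure: two UML classes side by side. "My model": class Agent with attributes name, nSCNumber, FDAIndID, CTEPName, IUPACName. "Your model": class Drug with attributes id, NDCCode, approver, approvalDate, fdaCode. The concept code labels C1708 and C1708:C41243 each appear once per model.]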
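The idea behind the two models above can be sketched in a few lines of Java: if both models annotate their elements with concept codes from a shared vocabulary, a program can pair elements by code rather than by name. This is only an illustration. The slide shows the codes C1708 and C1708:C41243 but not which attribute carries which code, so the assignment below (`Agent.nSCNumber` and `Drug.fdaCode`) is a guess, and the class `ConceptMatchSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: pair elements of two models that carry the same concept code. */
final class ConceptMatchSketch {

    /** Returns "mine <-> yours" for every pair of elements sharing a code. */
    static List<String> match(Map<String, String> mine, Map<String, String> yours) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, String> m : mine.entrySet()) {
            for (Map.Entry<String, String> y : yours.entrySet()) {
                if (m.getValue().equals(y.getValue())) {
                    pairs.add(m.getKey() + " <-> " + y.getKey());
                }
            }
        }
        return pairs;
    }

    /** "My model": element -> concept code. The attribute choice is assumed. */
    static Map<String, String> myModel() {
        Map<String, String> model = new LinkedHashMap<>();
        model.put("Agent", "C1708");
        model.put("Agent.nSCNumber", "C1708:C41243");
        return model;
    }

    /** "Your model": element -> concept code. The attribute choice is assumed. */
    static Map<String, String> yourModel() {
        Map<String, String> model = new LinkedHashMap<>();
        model.put("Drug", "C1708");
        model.put("Drug.fdaCode", "C1708:C41243");
        return model;
    }

    public static void main(String[] args) {
        for (String pair : match(myModel(), yourModel())) {
            System.out.println(pair);
        }
    }
}
```

Under these assumed annotations, `Agent` pairs with `Drug` even though the two class names have nothing in common, which is the point of binding models to a controlled vocabulary.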


Cancer Data Standards Repository

ISO/IEC 11179 Registry for Common Data Elements – units of semantic metadata

Client for Enterprise Vocabulary: metadata constructed from controlled terminology and annotated with concept codes

Precise specification of Classes, Attributes, Data Types, Permissible Values: Strong typing of data objects.
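As a rough illustration of what "permissible values" and strong typing buy a consumer, the Java sketch below models one data element as a class/attribute pair with a concept code and a closed value set. It is not the caDSR API or the ISO/IEC 11179 data model; the class `DataElementSketch`, the `Specimen.preservationMethod` element, its values, and the `C-PLACEHOLDER` code are all invented for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Sketch of a data element: one attribute of one class, its concept code,
 *  and the closed set of values the attribute may take. */
final class DataElementSketch {
    final String className;
    final String attributeName;
    final String conceptCode;            // placeholder, not a real EVS code
    final Set<String> permissibleValues;

    DataElementSketch(String className, String attributeName,
                      String conceptCode, String... permissibleValues) {
        this.className = className;
        this.attributeName = attributeName;
        this.conceptCode = conceptCode;
        this.permissibleValues = Collections.unmodifiableSet(
                new LinkedHashSet<>(Arrays.asList(permissibleValues)));
    }

    /** True only if the value is exactly one of the permissible values. */
    boolean accepts(String value) {
        return permissibleValues.contains(value);
    }

    /** Hypothetical element used for illustration only. */
    static DataElementSketch example() {
        return new DataElementSketch("Specimen", "preservationMethod",
                "C-PLACEHOLDER", "Frozen", "Formalin fixed", "Fresh");
    }

    public static void main(String[] args) {
        DataElementSketch element = example();
        System.out.println(element.accepts("Frozen"));  // true
        System.out.println(element.accepts("frozen"));  // false: exact match only
    }
}
```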


caCORE Tools

UML Loader: automatically register UML models as metadata components

CDE Curation: Fine tune metadata and constrain permissible values with data standards

Form Builder: Create standards-based data collection forms

CDE Browser: search and export metadata components

Common Security Module: Provides role based security


caCORE Software Development Kit

UML Modeling Tool (any with XMI export)

Semantic Connector (concept binding utility)

UML Loader (model registration in caDSR)

Codegen (middleware code generator)

Security Adaptor (Common Security Module)

caCORE SDK generates syntactically and semantically interoperable data service system


caGrid

caCORE meets grid technology!


Use cases not satisfied by caCORE alone

Advertisement
– Service Provider composes service metadata describing the service and publishes it to the grid.

Discovery
– Researcher (or application developer) specifies search criteria describing a service of interest.
– The researcher submits the discovery request to a discovery service, which identifies a list of services matching the criteria and returns the list.

Invocation
– Researcher (or application developer) instantiates the grid service and accesses its resources.
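The three use cases above can be sketched as a small in-memory registry in Java. This is a toy stand-in, not caGrid or Globus code: there is no network, the "service" is a local function, and every name (`GridRegistrySketch`, `Service`, `centerA-agents`, and so on) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** In-memory stand-in for the advertise / discover / invoke cycle. */
final class GridRegistrySketch {

    /** Service metadata plus a callable endpoint (all names hypothetical). */
    static final class Service {
        final String name;
        final String exposedClass;                // what the service serves
        final Function<String, String> endpoint;  // stand-in for a remote call

        Service(String name, String exposedClass, Function<String, String> endpoint) {
            this.name = name;
            this.exposedClass = exposedClass;
            this.endpoint = endpoint;
        }
    }

    private final List<Service> advertised = new ArrayList<>();

    /** Advertisement: the provider publishes its service metadata. */
    void advertise(Service service) {
        advertised.add(service);
    }

    /** Discovery: return every advertised service exposing the given class. */
    List<Service> discover(String exposedClass) {
        List<Service> matches = new ArrayList<>();
        for (Service s : advertised) {
            if (s.exposedClass.equals(exposedClass)) {
                matches.add(s);
            }
        }
        return matches;
    }

    /** Two made-up providers, used by main and by the usage note below. */
    static GridRegistrySketch demoRegistry() {
        GridRegistrySketch registry = new GridRegistrySketch();
        registry.advertise(new Service("centerA-agents", "Agent", q -> "centerA:" + q));
        registry.advertise(new Service("centerB-genes", "Gene", q -> "centerB:" + q));
        return registry;
    }

    public static void main(String[] args) {
        // Invocation: call each service that discovery returned.
        for (Service s : demoRegistry().discover("Agent")) {
            System.out.println(s.name + " -> " + s.endpoint.apply("name=Taxol"));
        }
    }
}
```

Here discovery filters on a single field. A real discovery service would presumably match on richer metadata, such as the registered data model or concept codes.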


[Figure: grid diagram connecting NCI, several Cancer Centers, other caBIG service providers, and other toolkits; nodes are labeled "Gold" or "Silver".]


caGrid Components

Leverage existing technologies:
– caDSR, EVS, Mobius GME: Common data elements, controlled vocabularies, schema management
– Globus Toolkit (currently version 4.0.1)
  • Core grid services infrastructure
  • Service deployment, service registry, invocation, base security infrastructure

Additional Core Infrastructure
– Higher-level security services (Dorian)
– Grid service access to metadata components (caDSR, GME, etc)
– Workflow, Identifier services

Service Provider Tooling (Introduce)
– Graphical service development and configuration environment
– Abstractions from service infrastructure for Data and Analytical services
– Deployment wizards

Client Tooling
– High-level APIs for interacting with core components and services
– Graphical Tools


caGrid 0.5 Architecture (May be updated for 1.0)

[Figure: layered architecture diagram. Layer labels: Transport, Grid Communication Protocol, Service Description, Service, Business Process. Cross-cutting functions: Service Registry, Security, Semantic service, Resource Management, Quality of Service, ID Resolution. Component labels: GLOBUS Toolkit, GT3, GSI, GUMS, CAMS, OGSA-DAI, Analytical, caDSR, EVS, Index, GME, UI.]


Data Object Semantics, Metadata, and Schemas

Object oriented, APIs, well-defined data types

Classes defined in UML and converted into ISO/IEC 11179, registered in the caDSR

Definitions drawn from Enterprise Vocabulary Services (EVS), relationships semantically described

XML serialization of objects adheres to XML schemas registered in the Global Model Exchange (GME)

[Figure: a Grid Service (Service Definition in WSDL, Data Type Definitions in XSD, Service API) and a Grid Client (Client API) exchange objects that serialize to XML. Object definitions are registered in the Cancer Data Standards Repository and semantically described in the Enterprise Vocabulary Services; the XML validates against schemas registered in the Global Model Exchange (GME). The client uses these core services.]


Introduce Toolkit

A framework which enables fast and easy creation of caGrid compatible services whether they are data, analytical, custom, or core services.

Provide easy to use graphical service authoring tools.

Hide all “grid-ness” from developers so that they can concentrate on the domain-specific implementation.

Utilize best practice layered grid service architectures.

Handle all service architecture requirements of the caGrid.
– Strong service interface data typing
– Metadata and service registration
– Grid security integration


Data Service Access on caGrid

Specialization of caGrid grid services to expose data through a common query interface

Present an object view of data sources

Exposed objects are registered in caDSR and their XML representation in GME

Queries made with caBIG Query Language (CQL) Query objects

Results returned as objects (or identifiers) nested in a CQL Query Result Set


Data Service Query Language



Data Service Interface

public CQLQueryResultsType processQuery(CQLQueryType query)

Data Provider’s only responsibility is to implement CQL over their local data resource
– A default implementation will be provided for caCORE SDK created systems

caGrid provides grid service implementation to invoke provider’s CQL implementation

Service provides all features necessary for compliance, such as advertisement of data service metadata, and security integration
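To make the provider's side concrete, here is a minimal Java sketch of `processQuery` over an in-memory data resource. Only the method signature comes from the slide. The `CQLQueryType` and `CQLQueryResultsType` classes below are simplified stand-ins written for this example (the real caGrid types are not shown in the deck), the single attribute-equals filter is an assumption about what a query might contain, and the second data row is invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a data provider implementing processQuery over local data.
 *  The two *Type classes are simplified stand-ins, not the caGrid classes. */
final class DataServiceSketch {

    /** Stand-in query: a target class plus one attribute-equals filter. */
    static final class CQLQueryType {
        final String targetClass;
        final String attribute;
        final String value;

        CQLQueryType(String targetClass, String attribute, String value) {
            this.targetClass = targetClass;
            this.attribute = attribute;
            this.value = value;
        }
    }

    /** Stand-in result set: matching objects as attribute maps. */
    static final class CQLQueryResultsType {
        final List<Map<String, String>> objects = new ArrayList<>();
    }

    /** Local data resource: class name -> rows (illustrative content). */
    private final Map<String, List<Map<String, String>>> localData = new HashMap<>();

    DataServiceSketch() {
        localData.put("Agent", Arrays.asList(
                row("Taxol", "007"),
                row("ExampleAgent", "008")));
    }

    private static Map<String, String> row(String name, String nscNumber) {
        Map<String, String> r = new HashMap<>();
        r.put("name", name);
        r.put("nSCNumber", nscNumber);
        return r;
    }

    /** Signature follows the slide; the body is an assumed implementation. */
    public CQLQueryResultsType processQuery(CQLQueryType query) {
        CQLQueryResultsType results = new CQLQueryResultsType();
        List<Map<String, String>> rows = localData.get(query.targetClass);
        if (rows == null) {
            return results;  // unknown target class: empty result set
        }
        for (Map<String, String> r : rows) {
            if (query.value.equals(r.get(query.attribute))) {
                results.objects.add(r);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        DataServiceSketch service = new DataServiceSketch();
        CQLQueryResultsType out =
                service.processQuery(new CQLQueryType("Agent", "name", "Taxol"));
        System.out.println(out.objects);
    }
}
```

Keeping the provider's obligation to this one method is what lets the surrounding grid service add advertisement and security without touching provider code.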


Data Service Query Scenario

1. Client builds a CQL Query

2. CQL Query is serialized and submitted to the Grid Data Service

3. Grid Data Service deserializes the CQL Query Object and processes it

4. Data Source is queried by the Grid Data Service

5. Grid Data Service builds a CQL Result Set

6. Result Set is serialized and returned to the client

7. Client deserializes result set

8. Result set is iterated with client tools to retrieve objects


Federated and Aggregated Queries

A componentized library is being developed to support limited federated and aggregated queries

An extension language used to describe distributed queries

Library creates and executes a Query Plan for the distributed query, using multiple CQL queries to targeted data services
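A rough Java sketch of the query-plan idea: build one step per target data service, run the steps, and aggregate the results. The deck does not describe the extension language or the library's API, so everything here is assumed. `FederatedQuerySketch`, `DataService`, and `Step` are hypothetical names, the query is a placeholder string rather than real CQL, and results are reduced to made-up identifiers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: send one query to several data services and merge the results. */
final class FederatedQuerySketch {

    /** Stand-in for a remote data service that answers a query string. */
    interface DataService {
        List<String> processQuery(String query);
    }

    /** One step of the plan: which service receives which query. */
    static final class Step {
        final DataService target;
        final String query;

        Step(DataService target, String query) {
            this.target = target;
            this.query = query;
        }
    }

    /** Build a plan that sends the same query to every target service. */
    static List<Step> plan(String query, List<DataService> targets) {
        List<Step> steps = new ArrayList<>();
        for (DataService target : targets) {
            steps.add(new Step(target, query));
        }
        return steps;
    }

    /** Run the steps in order and aggregate, dropping duplicate identifiers. */
    static List<String> execute(List<Step> steps) {
        Set<String> merged = new LinkedHashSet<>();
        for (Step step : steps) {
            merged.addAll(step.target.processQuery(step.query));
        }
        return new ArrayList<>(merged);
    }

    /** Two made-up services whose results overlap on one identifier. */
    static List<DataService> demoServices() {
        DataService centerA = query -> Arrays.asList("agent-1", "agent-2");
        DataService centerB = query -> Arrays.asList("agent-2", "agent-3");
        return Arrays.asList(centerA, centerB);
    }

    public static void main(String[] args) {
        // The query text is a placeholder, not CQL syntax.
        System.out.println(execute(plan("Agent", demoServices())));
    }
}
```

This covers only aggregation of like results. A federated join, where one service's output feeds the next query, would need steps that depend on each other.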


Data Service Client Tooling

APIs provided to discover available data services on the grid based on client-defined criteria (such as exposed data models and concepts)

Object-Oriented API for building queries, querying a given data service, and processing the results

Client tools available to iterate query result sets
– Object iterator deserializes XML into registered objects
– XML iterator simply returns XML documents


Acknowledgements (caGrid Team)

Ohio State University - Department of BioMedical Informatics: Dave Ervin, Shannon Hastings, Tahsin Kurc, Stephen Langella, Scott Oster, Joel Saltz

Argonne National Lab / University of Chicago: William Allcock, Jarek Gawor, Ravi Madduri, Frank Siebenlist, Michael Wilde

Duke University: A. Jamie Cuticchia, Patrick McConnell

Georgetown University: Colin Freas, Paul A. Kennedy, Chad La Joie

SAIC (http://www.saic.com): Manav Kher

ScenPro/Semantic Bits: Vinay Kumar, David Wellborn, Valerie Bragg

Booz | Allen | Hamilton (http://www.bah.com): Arumani Manisundaram, Michael Keller, Reechik Chatterjee


Acknowledgements

NCI: Andrew von Eschenbach, Anna Barker, Wendy Patterson, OC, DCTD, DCB, DCP, DCEG, DCCPS, CCR

Industry Partners: SAIC, BAH, Oracle, ScenPro, Ekagra, Apelon, Terrapin Systems, Panther Informatics

NCICB: Ken Buetow, Peter Covitz, George Komatsoulis, Denise Warzel, Frank Hartel, Sherri De Coronado, Dianne Reeves, Gilberto Fragoso, Jill Hadfield, Leslie Derr