Grids and Biology. Professor Carole Goble, University of Manchester, UK. BBSRC Bioinformatics and eScience Grant Holders Workshop, Warwick, UK, 28th October 2002


Page 1:

Grids and Biology

Professor Carole Goble

University of Manchester, UK

BBSRC Bioinformatics and eScience Grant Holders Workshop, Warwick, UK

28th October 2002

Page 2:

Grids and Biology

• A take on the Grid
• Issues in Bioinformatics for Grid
• Various BioGrids
• Applicability of Grid to Biology
• Reality check

Page 3:

What is the Grid?

“Grid computing [is] distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation ... we review the "Grid problem", which we define as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources - what we refer to as virtual organizations.”

From "The Anatomy of the Grid: Enabling Scalable Virtual Organizations" by Foster, Kesselman and Tuecke

Page 4:

What is the Grid?

• Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations
• On-demand, ubiquitous access to computing, data, and services
• New capabilities constructed dynamically and transparently from distributed services
• No central location, no central control, no existing trust relationships, little predetermination
• Uniformity for pooling resources
• Virtual pools of resources: databases, clusters ...

Page 5:

Biology as a Grid Application

• Informational Science
• Large Scale
• Distributed
• No one organisation owns it all

Page 6:

Motivation

[Figure: chart over a time axis marked 1990, 2000, 2010. Labels: ESTs, Combinatorial Chemistry, Human Genome, Pharmacogenomics, Metabolic Pathways, Computational Load, Genome Data, Moore's Law.]

Page 7:

BioMedical Computation [Rick Stevens, Argonne Labs]

[Figure: diagram relating Experiments and Computation. Labels:]

• Large-scale Genome Sequencing
• Genome Sequences
• Assembled Genomes
• Genes and Gene Structures
• Genes, Proteins, RNAs, and other Biomolecules
• Biochemical Pathways & Processes
• Cellular & Developmental Processes
• Tissue and Organismal Physiology
• Ecological Processes and Populations
• Sequence Variation of Populations
• Reconstructing Phylogeny, Homology, and Comparative Approaches
• Predicting Protein Sequence
• Predicting Functions
• Predicting Three-Dimensional Structures of Proteins and RNAs
• Predicting Catalysis, Molecular Dynamics
• Structures of Multi-molecular Complexes
• Predicting Effects of Variation
• Simulation of Metabolic and Signal Transduction Pathways
• Simulating and Understanding Gene Expression Networks
• Morphogenesis and Development

Page 8:

Biomedical Data: High Complexity and Large Scale

[Figure: categories of biomedical data, each annotated with a scale. Categories:]

• DNA sequences, alignments (...atcgaattccaggcgtcacattctcaattcca...)
• Proteins: sequence, 2º structure, 3º structure (MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT...)
• Protein-protein interactions: metabolism, pathways, receptor-ligand, 4º structure
• Polymorphism and variants: genetic variants, individual patients, epidemiology
• Physiology: cellular biology, biochemistry, neurobiology, endocrinology, etc.
• ESTs: expression patterns, large-scale screens
• Genetics and maps: linkage, cytogenetic, clone-based

Scale annotations shown in the figure: billions, hundred thousands, millions, billions, millions, millions.

[Rick Stevens, Argonne Labs]

Page 9:

BioGrid Projects

• EUROGRID BioGRID
• Asia Pacific BioGRID
• North Carolina BioGrid
• Bioinformatics Research Network
• Osaka University BioGrid
• Indiana University BioArchive BioGrid
• myGrid
• BioSim
• e-Protein
• ObiGrid

Page 10:

Today’s Grid

A Single System Image

• Transparent wide-area access to large data banks
• Transparent wide-area access to applications on heterogeneous platforms
• Transparent wide-area access to processing resources

Security, certification, single sign-on authentication, AAA: Grid Security Infrastructure

Data access, transfer & replication: GridFTP, Giggle

Computational resource discovery, allocation and process creation: GRAM, Unicore, Condor-G

Page 11:

Immediate benefits

• Uniform file views of directories, regardless of platform.
• Grid-based data transfer libraries for faster access to large files, reducing the need for mirror-site servers. Replication to support mirroring.
• Grid APIs provide a job manager with metadata about services for the user. Service providers can be evaluated on factors that may include more than just server performance and availability.
• Grid-aware applications: split sequence reference libraries among several servers, where BLAST comparisons can be conducted in parallel.
• Shielding users from a variety of low-level computing problems that they would otherwise have to address themselves.
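The idea of splitting a sequence reference library across several servers and comparing in parallel can be sketched as below. This is only an illustration under stated assumptions: the scoring function is a toy shared k-mer count standing in for BLAST, the "servers" are local threads, and every name (`split_library`, `search_shard`, `parallel_search`) is hypothetical rather than part of any Grid toolkit.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def split_library(library, n_shards):
    """Deal the reference sequences round-robin into n_shards shards."""
    shards = [dict() for _ in range(n_shards)]
    for i, (name, seq) in enumerate(sorted(library.items())):
        shards[i % n_shards][name] = seq
    return shards


def search_shard(query, shard, k=3):
    """Toy stand-in for one server's BLAST run: score = number of shared k-mers."""
    q = kmers(query, k)
    return [(name, len(q & kmers(seq, k))) for name, seq in shard.items()]


def parallel_search(query, library, n_shards=2, top=3):
    """Search every shard concurrently, then merge the partial hit lists."""
    shards = split_library(library, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partial = list(pool.map(lambda s: search_shard(query, s), shards))
    hits = [hit for part in partial for hit in part]
    # Sort by descending score, then by name, so the merged ranking is deterministic.
    return sorted(hits, key=lambda h: (-h[1], h[0]))[:top]
```

Because each shard is scored independently and the results are merged afterwards, the ranking does not depend on how many shards are used; only the elapsed time should change.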

Page 12:

Grid Landscape

Computationally Intensive

Data Intensive

Collaborative

Visualisation

Knowledge Intensive

Page 13:

Grid Landscape

Computationally Intensive

Data Intensive

Collaborative

Visualisation

Knowledge Intensive

Page 14:

Classical Grids

• Classical Grids emphasise sharing of physical resources.
• Existing Grid middleware (e.g. Globus, Condor, Unicore) allows resource discovery, resource allocation, data movement, certification ...

Page 15:

High Performance Bioinformatics Software

[Jack da Silva, NCSC, Paracel]

Page 16:

European DataGrid

Page 17:

Managed access to specialist remote resources

Page 18:

• Access portal for biomolecular modeling resources.
• Interfaces that enable chemists and biologists to submit work to HPC facilities.
• Visualization of the electrostatic field generated by a molecule.

Dr Krzysztof Nowinski (ICM)

Page 19:

Biogrid system

[Figure: system configuration. Labels:]

• Data Grid Disk: Express5800/140Ra-4 x 3
• Grid system 1: Express5800/ISS for PC-Cluster, Xeon 2.2G x 8 + Management node 1
• Grid system 2: NEC Blade Server, 78 nodes (156 CPUs)
• SCORE Management Station (shown twice)
• Interconnects: 1000Base-T x 12, 1000Base-SX, Myrinet-2000, Flat Neighborhood networks
• Connected to Grid system 3

Page 20:

Remote control of instruments

Sharing of the UHVEM (Ultra High Voltage Electron Microscopy) facility at Osaka University with NCMIR (National Center for Microscopy and Imaging Research)

3 million electron volts: the most powerful microscope

[Figure: network route between UHVEM (Osaka, Japan) and NCMIR (San Diego). Labels: Osaka University, JGN, Tokyo XP, APAN, TransPAC, STAR TAP (Chicago), vBNS, SDSC (UC San Diego).]

Page 21:

Home Computers Evaluate AIDS Drugs

Community =

• 1000s of home computer users
• Philanthropic computing vendor (Entropia)
• Research group (Scripps)

Common goal = advance AIDS research

From Steve Tuecke, 12 Oct. 01

Page 22:

Matlab

Matlab and toolboxes for mathematical computation, analysis, visualization, and algorithm development:

CROSS PLATFORM/ OS

“MATLAB is an intuitive language and a technical computing environment. It provides core mathematics and advanced graphical tools for data analysis, visualization, and algorithm and application development. With more than 600 mathematical, statistical, and engineering functions, engineers and scientists rely on the MATLAB environment for their technical computing needs.”

(www.mathworks.com)

Geodise release in November [email protected]

Page 23:

BioSim -- Molecular simulations as a tool for protein structure analysis

Overall vision – simulation as an integral component of structural genomics

Needs both capacity (many systems) and capability (large systems - HPCx)

Molecular Dynamics database (distributed)

synchrotron

MD database

novel biology…

compute GRID

[Sansom]

Page 24:

Grid Landscape

Computationally Intensive

Data Intensive

Collaborative

Visualisation

Knowledge Intensive

Page 25:

Visualization + Bioinformatics [Rick Stevens, Argonne Labs]

[Figure: diagram linking a visualization environment, bioinformatic analysis tools and laboratory verification. Labels:]

• Bioinformatic Analysis Tools: Whole Genome Analysis, Function Assignment, Metabolic Reconstruction, Stoichiometric Representation & Flux Analysis, Dynamic Simulation
• Visualization Environment: Genome Visualization Tools, Network Visualization Tools, Interactive Stoichiometric Graphical Tools, Whole Cell Visualizations, Image/Spectra Augmentations
• Laboratory Verification: Microbiology & Biochemistry, Enzymatic Constants, Metabolic ***, Proteomics

Page 26:

X-ray microtomography

• Scientific discovery can be enhanced by closely coupling computation and experiment: simulation, visualization and data gathering coupled.
• X-ray microtomography produces 3D X-ray attenuation maps of specimens at a microscopic level.
• Expensive synchrotron beam time resources optimally used to obtain sufficient resolution for simulation.

Page 27:

Interactive Steering

Enables controlled simulation using the knowledge and skills of a trained scientist.

• User steers calculation from laptop

• Controlled steering on supercomputers

• Visualization and computation use large-scale machines accessed via the Grid.

Page 28:

Scalable molecular dynamics

• Structure of a protein in a fluid medium

• Calculation takes into account forces between protein and ambient medium (in this case water molecules)

• Run on the world's largest academic computer, LeMieux at PSC (6 Tflops theoretical peak)

Page 29:

Grid Landscape

Computationally Intensive

Data Intensive

Collaborative

Visualisation

Knowledge Intensive

Page 30:

UCSF

UIUC

From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign

Page 31:

http://www.ks.uiuc.edu/Research/biocore/

Page 32:

Grid Landscape: DATA!!

Computationally Intensive

Data Intensive

Collaborative

Visualisation

Knowledge Intensive

Page 33:

Information Weaving and Question Answering

• Large amounts of different kinds of data & many applications.
• Highly heterogeneous: different types, algorithms, forms, implementations, communities, service providers.
• High autonomy.
• Highly complex and inter-related, & volatile.

Page 34:

Annotation Pipeline

sequences: SCOP, CATH, PDB

NRPROT

proteome sequences

PDB hit no PDB hit

TM, CC, LC, SIG & MOTIFS

PSI-BLAST & HMMs

structure-based function prediction

structural and functional annotation

INTERPRO

3D modelling x 2 fold recognition x 2

[Mike Sternberg]

Page 35:

myGrid

Personalised extensible environments for data-intensive in silico experiments in biology

RASMOL

Straightforward discovery, interoperation, deployment & sharing of services

Service-oriented architecture

Integration and Information Workflow & Databases

Experimentation Provenance, propagating change, personalisation

For bioinformaticians who are building tools and using or providing services

Page 36:

DiscoveryNet

http://www.discovery-on-the.net/

[Figure: layered architecture. Side labels: "Based on Kensington Discovery Platform" and "Based on Globus & ORB Infrastructure". Layers:]

High Throughput Sensing (HTS) Applications

• Large-scale Dynamic Real-time Decision Support
• Large-scale Dynamic System Knowledge Discovery

High Throughput Computing Services

• Distributed Data Engineering: Data Registration, Data Normalisation, Data Quality
• Information Structuring: Information Integration & Composition, Semantics & Domain-based Ontologies, Sharing
• Grid-based Knowledge Discovery: Grid-based Data Mining, Collaborative Visualisation

Grid Basic Infrastructure: Globus/Condor/SRB

Utilising Grid Infrastructure for HT Computing

Bio Chip Applications

• Protein-folding chips: SNP chips, Diff. Gene chips using LFII
• Protein-based fluorescent micro arrays
• Scale labels: 1-1000, 10-1000, >10000
• Data Quality, Visualisation, Structuring, Clustering, Distributed Dynamic Knowledge Management

Page 37:

Grid Evolution

1st Generation Grid

• Computationally intensive, file access/transfer
• Bag of various heterogeneous protocols & toolkits
• Recognises internet, ignores Web
• Academic teams

2nd Generation Grid (We are here!)

• Data intensive -> knowledge intensive
• Services-based architecture
• Recognises Web and Web services
• Global Grid Forum
• Industry participation

Page 38:

A Grid vs The Grid

[Figure: two tiers labelled Physical and Logical.]

• Physical: nodes connected by a Gigabit IP Network, distributed geographically (e.g. UKGrid), with Grid Middleware
• Logical: MouseGrid, NovartisGrid, BioSimGrid

A Grid of resources, not just compute resources but databases, digital libraries, instruments, workflows, documents ...

These configurations are dynamic

Resources discovered, combined, used and disbanded as and when needed or available.

Page 39:

A configuration of resources

Not just compute services but databases, digital libraries, instruments, workflows, documents …

services

Web Services

Grid Technology

Grid Services

Open Grid Services Architecture (OGSA)

Page 40:

Bio Services

Domain Oriented Services

Basic BioGrid Services

Grid Resource Services

Common Services

Base Services

Fabric Services

• Drug Discovery
• Microbial Engineering
• Molecular Ecology
• Oncology Research

• Integrated Databases
• Sequence Analysis
• Protein Interactions
• Cell Simulation

• Compute Services
• Pipeline Services
• Data Archive Service
• Database Hosting
• Workflow Enactment
• Event Notification

Page 41:

What We Need to Create

Grid Bio applications enablement software layer

• Provide applications with access to Grid services
• Provide OS-independent services

Grid-enabled versions of bioinformatics data management tools (e.g. DL, SRS, etc.)

• Need to support virtual databases via Grid services
• Grid support for commercial databases

Bioinformatics applications “plug-in” modules

• End user tools for a variety of domains
• Support major existing Bio IT platforms

Page 42:

Requirements for the BioGrid

Open and extendable architecture

• Enable tie-in to the service stack at appropriate points
• Not just access via Portals

Leverage scripting tools in wide use for Bioinformatics

• Create BioGrid services bindings for Perl and Python

Address data federation and integration

• Leverage work of IBM, Lion BioSciences, DAS, BioMOBY, etc.

Match the biology workflow and tool chain

• Create high-level BioGrid services to address critical stages in existing workflow
• Support composability of new BioGrid tools with existing tool chain elements
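As a rough illustration of what composability from a scripting language might look like, the sketch below chains independent stages into one workflow. It is an assumption-laden toy, not a BioGrid binding: the stages (`clean`, `drop_short`, `gc_content`) and the `compose` helper are hypothetical and run locally, whereas real stages would presumably wrap remote service calls.

```python
from functools import reduce


def compose(*stages):
    """Chain stages left to right: each stage's output is the next stage's input."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


# Hypothetical stages; in a real BioGrid each could wrap a remote service call.
def clean(seqs):
    """Strip whitespace, upper-case, and discard empty entries."""
    return [s.strip().upper() for s in seqs if s.strip()]


def drop_short(min_len):
    """Build a stage that keeps only sequences of at least min_len residues."""
    def stage(seqs):
        return [s for s in seqs if len(s) >= min_len]
    return stage


def gc_content(seqs):
    """Map each sequence to its G+C fraction, rounded to two decimals."""
    return {s: round((s.count("G") + s.count("C")) / len(s), 2) for s in seqs}


# A new tool (gc_content) is appended to existing tool-chain elements.
workflow = compose(clean, drop_short(4), gc_content)
```

Calling `workflow([" atcg ", "gg", "", "GGCCAT"])` runs the three stages in order. Because every stage takes and returns plain data, a stage can be swapped or inserted without touching the others, which is the property the requirement above asks for.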

Page 43:

Some BioGrid Challenges

Scalable human bioinformatics expertise

• Best people working on the important problems
• Exploit collaboration technology to create world class teams

Robust local bioinformatics computing environment

• Best systems administrators and high-end technologies
• Embed local resources into the Grid via portal technologies

Access to leading edge bioinformatics software and databases customized to user needs

• Core content from top scientists and developers
• Integrated access to biological databases

Worldwide access to robust computing and database infrastructure

• Leverage Grid technology to provide worldwide access
• Integrate purpose built systems and service providers

Page 44:

Reality Checks!!

The Technology is Ready

• Not true: it's emerging
• Building middleware, Advancing Standards, Developing, Dependability
• Building demonstrators.
• The computational grid is in advance of the data intensive middleware
• Integration and curation are probably the obstacles
• But!! It doesn’t have to be all there to be useful.

We know how we will use grid services

• No: disruptive technology
• Lower the barriers of entry.

Page 45:

Reality Checks!!

It’s the only game

• Not true: I3C, BioMOBY, bioDAS, OMG LSR
• Grid and Web service merge makes integration likely.

One Size Fits All

• Not true
• Addressed by a minimum set of composable virtual services
• But starting with Globus

It’s only for “big” science

• No: “small” science collaborates too!

Biology is not unique!

• AstroGrid

Page 46:

Not a silver bullet! It's just middleware, not magic

• Data quality
• Content management of databases (controlled vocabularies)
• Provenance and versioning policies
• Appropriate use of tools
• Computational inaccessibility of free text annotation
• Database accessibility through means other than point-and-click web interfaces

Independent of the Grid!

Page 47:

Life Sciences Grid (LSG)

http://people.cs.uchicago.edu/~dangulo/LSG/