Bioinformatics Applications in the Spanish Network for e-Science

Upload: clovis

Post on 18-Jan-2016

TRANSCRIPT

Page 1: Bioinformatics Applications in the Spanish Network for e-Science

EGEE-III INFSO-RI-222667

Enabling Grids for E-sciencE

www.eu-egee.org

EGEE and gLite are registered trademarks

Ignacio Blanquer

Vicente Hernández

Bioinformatics Applications in the Spanish Network for e-Science

Page 2: Bioinformatics Applications in the Spanish Network for e-Science

Outline

• The Spanish Network for e-Science
– Structure and link with the Spanish NGI.

• Bioinformatics applications in the Spanish Network for e-Science.

• Challenges for Bioinformatics on the Grid.

Bioinformatics Session - EGEE’09 - Barcelona 2

Page 3: Bioinformatics Applications in the Spanish Network for e-Science

The Creation of the Spanish Network for e-Science

• As a consequence of the interest raised by the research centres and groups participating in national and international projects on Grids and Supercomputing, the white paper on e-Science was produced (http://www.fecyt.es/e-ciencia/libroblanco.htm).

• Given the need for global coordination and for the development of common tools to ease access to resources, the Ministry of Science and Innovation created the Spanish Network for e-Science (CAC-2007-52).
– Officially approved in December 2007 and coordinated by Vicente Hernández García (Universidad Politécnica de Valencia).

• One of the mandates of the Network was to set up the Spanish NGI, which was officially created in July 2009.
– The ministry nominated Isabel Campos (IFCA) as the coordinator of the Spanish NGI.

Page 4: Bioinformatics Applications in the Spanish Network for e-Science

Participant Groups

• More than 50 different institutions and 97 research groups.
• More than 1000 researchers.
• Dynamic structure
– 28 groups have joined since the activity started.
• Structured in four activity areas
– EGEE Booth Number 6.

Page 5: Bioinformatics Applications in the Spanish Network for e-Science

Infrastructure

CESGA: 339 cores, 1 TB
UPV: 36 cores, 1 TB
UNIZAR: 54 cores, 0.8 TB
CIEMAT: 220 cores, 2.7 TB
PIC: 1296 cores, 10 TB
IFCA: 867 cores, 1 TB

• gLite-based
• Own BDII (EGEE-compatible)
• Supporting IBERGRID (ES+PT)
• 3 different WMs (Xbroker, gLite-WMS, GridWay)

Page 6: Bioinformatics Applications in the Spanish Network for e-Science

Applications

• 3 roles are identified
– Mature applications aiming at a challenging experiment.
– Pilots that require intensive porting and a feasibility study.
– Support groups with experience in porting applications.

• Pilots, Applications and Support Groups are certified by an expert board.

• An internal call for projects was set up.

[Figure: project workflow. Pilots: pilot selection, expert panel, analysis and selection, resource allocation, pilot migration (support groups), deployment and test, report. Applications: applications proposal, expert panel, autonomous or assisted migration (support groups), production. Both run on the NGI infrastructure.]

Page 7: Bioinformatics Applications in the Spanish Network for e-Science

Overview of the Bioinformatics Applications

• Consolidated Use
– Work on current databases to analyse quality, improve annotation or increase usability.
  CD-HIT. GBLAST. BiG - Metagenomics.

• Emerging Use
– Port new applications to the Grid to provide new services.
  GFrodock. G-MIRA. Filogen.

Page 8: Bioinformatics Applications in the Spanish Network for e-Science

http://www.e-ciencia.es/wiki/index.php/CD-HIT

CD-HIT

• Identification of Representative Sequences of Protein Families using CD-HIT
– Proposed by the National Centre of Oncological Research (CNIO).
– It proposes using the resources available through the Spanish Network for e-Science and the CD-HIT algorithm to create non-redundant versions of the available databases more regularly.
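
The idea of building a non-redundant database can be illustrated with a toy greedy clustering. This is only a sketch under stated assumptions: it follows the general longest-first, greedy scheme that CD-HIT is usually described with, but the `similarity` function is a stand-in based on Python's `difflib`, not CD-HIT's identity measure, and the function names, threshold and sequences are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; NOT the identity measure used by CD-HIT."""
    return SequenceMatcher(None, a, b).ratio()


def greedy_representatives(sequences, threshold=0.9):
    """Keep a sequence only if no existing representative is similar enough to it."""
    representatives = []
    # Longest sequences are visited first, so they become the representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        if not any(similarity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


# Hypothetical toy input, not taken from any real database.
toy = ["MKTAYIAKQR", "GGGGGGGG", "MKTAYIAKQ"]
reps = greedy_representatives(toy)
print(reps)
```

Each candidate is compared against every representative found so far, so the cost grows with the size of the database, which is one reason why regenerating non-redundant releases regularly calls for substantial computing resources.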

Page 9: Bioinformatics Applications in the Spanish Network for e-Science

http://www.e-ciencia.es/wiki/index.php/BLAST

GBLAST

• Analysis of the horizontal transfer of genes through a BLAST Processing Service
– Proposed by the “Instituto de Biología Molecular y Celular de Plantas” and the GRyCAP, from the Universidad Politécnica de Valencia.
– This experiment aims at identifying the horizontal transfer of genes between prokaryotes and plants, using the UniProt database, by comparing all known prokaryotic sequences (~4M) against all the known sequences of plants (~0.5M), animals (~1.5M) and fungi (~0.4M).

[Table not reproduced: output size, using the columns as input and the rows as reference database.]
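
To give a feeling for the scale of this experiment, the sketch below counts the query/reference pairs implied by the approximate figures on the slide and shows one simple way of cutting the query set into independent jobs. The chunk size and all names are hypothetical, and BLAST itself uses heuristics rather than literally aligning every pair, so the pair count is only a rough indication of the search space.

```python
# Approximate sequence counts quoted on the slide.
N_PROKARYOTES = 4_000_000
REFERENCE_SETS = {"plants": 500_000, "animals": 1_500_000, "fungi": 400_000}

# Hypothetical chunk size: number of query sequences handled by one Grid job.
CHUNK_SIZE = 10_000


def chunk_ranges(n_items, chunk_size):
    """Return half-open (start, end) index ranges that cover n_items."""
    return [(start, min(start + chunk_size, n_items))
            for start in range(0, n_items, chunk_size)]


# Rough size of the search space: every query against every reference sequence.
total_pairs = sum(N_PROKARYOTES * n for n in REFERENCE_SETS.values())

# One job per (reference database, query chunk) combination.
chunks = chunk_ranges(N_PROKARYOTES, CHUNK_SIZE)
jobs = [(ref, rng) for ref in REFERENCE_SETS for rng in chunks]

print(f"{total_pairs:.2e} query/reference pairs, {len(jobs)} independent jobs")
```

Splitting by query chunk keeps the jobs independent, but every job still needs access to a full reference database, so data staging matters as much as CPU time.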

Page 10: Bioinformatics Applications in the Spanish Network for e-Science

http://www.e-ciencia.es/wiki/index.php/GFrodock

GFrodock

• Grid-Fast ROtational DOCKing
– Proposed by the Centro de Investigaciones Biológicas (CSIC).
– The objective is to determine the interaction between two proteins by analysing their atomic structure.
– Aims at solving one of the CAPRI (Critical Assessment of Predicted Interactions) scientific challenges.

Page 11: Bioinformatics Applications in the Spanish Network for e-Science

Metagenomic Analysis on the Grid (BiG)

• Quality of the phylogenetic annotation of bacteria
– Comparative phylogenetic experiment on a soil sample with respect to different releases of the NR GenBank database.
– Many of the associations of sample fragments to biological families have changed, even recently.
– The rate of change does not decrease over time; in many cases it increases.
– This reveals that the complete diversity of such communities is not sufficiently well described in current databases.
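
The "rate of change" between two database releases can be made concrete with a small sketch. It assumes that each release yields a mapping from sample fragment to assigned family; the function name, fragment identifiers and family assignments below are hypothetical and are not results from the experiment.

```python
def change_rate(old_assignment, new_assignment):
    """Fraction of fragments annotated in both releases whose family differs."""
    shared = old_assignment.keys() & new_assignment.keys()
    if not shared:
        return 0.0
    changed = sum(1 for frag in shared
                  if old_assignment[frag] != new_assignment[frag])
    return changed / len(shared)


# Hypothetical annotations of the same four fragments against two releases.
release_a = {
    "frag1": "Bacillaceae",
    "frag2": "Rhizobiaceae",
    "frag3": "Pseudomonadaceae",
    "frag4": "Bacillaceae",
}
release_b = {
    "frag1": "Bacillaceae",
    "frag2": "Bradyrhizobiaceae",
    "frag3": "Pseudomonadaceae",
    "frag4": "Paenibacillaceae",
}

rate = change_rate(release_a, release_b)
print(f"{rate:.0%} of the shared fragments changed family")
```

Computing this value for each pair of consecutive releases gives a series; a series that does not fall over time is the kind of observation the slide reports.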

Page 12: Bioinformatics Applications in the Spanish Network for e-Science

http://www.e-ciencia.es/wiki/index.php/MIRA

GMIRA

• Assembly of Pyrosequences
– Proposed by the “Instituto de Biología Molecular y Celular de Plantas” and the Grid and High Performance Computing Research Group of the Universidad Politécnica de Valencia.
– The new high-throughput sequencing techniques are producing millions of reads of between 80 and 500 nucleotides each, requiring intensive post-processing for their assembly.
– This pilot focuses on porting to the Grid one well-known code for this kind of sequences, which requires vast computing and memory resources.

Page 13: Bioinformatics Applications in the Spanish Network for e-Science

http://www.e-ciencia.es/wiki/index.php/Filogen

Filogen

• Construction of Phylogenetic Trees
– Proposed by the Institute of Research on Engineering in Aragon (I3A).
– Phylogenetics aims at reconstructing the evolutionary relations among species and living beings using the information from their genomes.
– This pilot focuses on porting a suite of general-purpose codes for this purpose, in order to reduce the long response times required for challenging executions.

Page 14: Bioinformatics Applications in the Spanish Network for e-Science

Current Status

• 4 projects already have a VO created (vo.odthpiv.es-ngi.eu, vo.blast.es-ngi.eu, vo.filogen.es-ngi.eu and vo.frodock.es-ngi.eu).

• 3 projects (GBLAST, FILOGEN and g-MIRA) have been granted resources for porting through an internal project call.

• 33% of the resources have been consumed by the biomed applications.

[Chart: Resource Usage. Others 66.9%, Biomed 33.1%.]

Page 15: Bioinformatics Applications in the Spanish Network for e-Science

Challenges 1/2

• From the point of view of the resources
– Improved scheduling of jobs
  Highly dynamic nature of the behaviour of resources (multiple entry points, information system refresh delays, wide geographic distribution, …).
  Need for Quality of Service and job run-length prediction.
  Need for much more scalable algorithms and models
  • Go beyond the simple high-throughput approach based on splitting the input.
  Minimisation of I/O bandwidth consumption
  • Improvement of locality of reference for large databases.
– Specialised resources
  Main memory constraints.
  Availability of pre-existing tuned configurations of widely used software.

Page 16: Bioinformatics Applications in the Spanish Network for e-Science

Challenges 2/2

• From the point of view of the community
– Trade-off in public databases between extensive coverage of the available information and its quality.
  Many results of using the Grid in bioinformatics have focused on this issue.
  Since databases are growing exponentially in size, this issue is likely to remain relevant in the medium term.
– Popularisation of community access
  Availability of simpler interfaces and configurable workflows.
  But Grids are not adequate for every kind of problem.
  • Do not create excessive expectations.
  • Many research groups already have medium-size computing resources which can tackle most of the daily work.
  • Create user confidence.
