bioinformatique sur cloud

23
Christophe Blanchet Institute of Biology and Chemistry of Proteins Head of Service ‘’Infrastructure for Biology - IDB’’ CNRS-IBCP FR3302 - LYON - FRANCE - http://idee-b.ibcp.fr IDB acknowledges co-funding by the European Community's Seventh Framework Programme (INFSO- RI-261552 ), the French National Research Agency's Arpege Programme (ANR-10-SEGI-001 ) and by the French Institute for Bioinformatics (IFB-RENABI) Bioinformatique sur Cloud Cas d’usage avec le portail Galaxy

Upload: dokhanh

Post on 05-Jan-2017

217 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Bioinformatique sur Cloud

Christophe BlanchetInstitute of Biology and Chemistry of Proteins

Head of Service ‘’Infrastructure for Biology - IDB’’CNRS-IBCP FR3302 - LYON - FRANCE - http://idee-b.ibcp.fr

IDB acknowledges co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552), the French National Research Agency's Arpege Programme (ANR-10-SEGI-001) and by the French

Institute for Bioinformatics (IFB-RENABI)

Bioinformatiquesur Cloud

Cas d’usage avec le portail Galaxy

Page 2: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Bioinformatics Today• Biological data are big data

• 1512 online databases (NAR Database Issue 2013)

• Institut Sanger, UK, 5 PB

• Beijing Genome Institute, China, 5 sites, 12.6 PB➡ Big data in many places

• Analysing such data became difficult• Scale-up of the analyses : gene/protein to complete genome/

proteome, ...

• Lot of different daily-used tools

• That need to be combined in workflows

• Usual interfaces: portals, Web services, federation,...➡ Datacenters with ease of access/use

• Distributed resources• Experimental platforms: NGS, imaging, ...

• Bioinformatics platforms➡ Federation of datacenters

ADN

ADN

BI

M

M

ADN

ADN

BI

ADN

ADN

BI CC

BI

M

ADN

ADN

ADN

Page 3: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

IDB Cloud and Bioinformatics Appliances

• Cloud workbench for Biology• Running since Sept. 2011

CNRS-IBCP FR3302, Lyon, France

• opened to Biology community

• 14 bioinformatics appliances: Galaxy portal,standard compute nodes, proteomics, virtual desktop, structural biology, ...

• +40 users from all IFB regional centersPRABI 15, APLIBIO 14, RENABI-NE 8, -SO 2, -GS 1, -GO 1

• VMs up to 32cores-768GB RAM

• Infrastructure• Compute +900cores +4TB ram

• Standard nodes (32c-128GB)

• Bigmen nodes (64c 768GB)

• Powered by StratusLab

• Storage +250TB

• Virtual disks, object storage (S3)

toolsBLAST

TopHat

FastASSearch

R

ClustalW2

samtoolsBWA

Linuxsystem

Createnew cloudservices

Bioinformatics Marketplace

Structures ...Sequences ProteomicsGalaxy

+Virtual Machines

OMSSA

PeptideShaker

HMMer

Muscle

X!tandemARIA

fastQCClustalOmega

Galaxy

tools VM:BLAST,ClustalW2,etc.

BI

data

UNIPROT

PDB

EMBL

PROSITE

Genomes

Z

B

A

public datauser data

Movecloudvirtual

machines

IDB Cloud

Page 4: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Cloud extended services

• Bioinformatics Marketplace• find appropriate appliances more easily.

• reduce “noise” in the central Marketplace

• respect visibility contraints for the bioinformatic appliances, such as confidentiality

• Bioinformatics metadata ‘’bio:tool’’• additional elements related to bioinformatics tools

• to annotate appliances

• help users to search for the tools themselves or the type of analysis

• select suitable bioinformatics appliances containing the required tools

• Integrated Web interface• VM & virtual disks management

• browse bionformatics appliances with ‘‘bio:tool’’ MDz

Native cloud services• Authentication

• Virtual machine management

• Persistent disk service• Client CLI

• etc.

IDB

Page 5: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Driven throught a simple web interface

Page 6: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Run your Bioinformatics Cloud InstancesBioinformatics Marketplace

NGSStructure Galaxy ARIA (…)Sequence

IBCP's CloudResources

BLAST,Clustal,

etc.

PaaS

WorkersVM CNS

Shar

ed F

S

launch jobssshIaaS

Master & StorageVM ARIA

Portal

Laun

chIn

stan

ces

Page 7: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

UNIPROT

PDB

EMBLPROSITE

Genomes

Public

Data sources

BioinformaticsCloud

BLAST,Clustal,

etc.

PaaS

WorkersVM CNS

Shar

ed F

S

launch jobssshIaaS

Master & StorageVM ARIA

Portal

shared(NFS)

User

Persistent data

pdisk(iSCSI)

Biological Data in CloudUpload your data

Get your results

sftp/http/S3

sftp/http/S3

Page 8: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Examples of CloudBionformatics Appliances

Page 9: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Standard Bioinformatics node

• ‘Biocompute’ appliance

• Use your own instance(s)

• With pre-installed standard bioinformatics tools• BLAST, FastA, SSearch,HMM,...

• ClustalW2, Clustal-Omega, Muscle,..

• Bowtie(2), BWA, samtools, ...

• MEME, R, etc.

• Connected to public reference data• Uniprot, EMBL, genomes, PDB, etc.

• Automaticaly shared to the VMs

Page 10: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Structural Biology• TOwards StruCtural AssignmeNt Improvement

• To improve the determination of protein structures based on Nuclear Magnetic Resonance (NMR) information with ARIA software

• Large computational needs.

• A NMR laboratory will not specially invest in building a cluster of about 100 nodes to be able to run such NMR structure calculations.

• Flexibility of the cloud to deploy the different required bioinformatics tools can accelerate such a procedure.

• Commercial interest in providing such tools to structural biologists on a “pay as you go” basis.

• Endorsers:Institut Pasteur Parisand CNRS IBCP

Page 11: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Proteomics desktop• Motivation

• Collaboration with a mass spectroscopy platform

• Running out of space on their local resources

• Protein identification• Mass experimental data

• Reference databases : nr, Swiss-Prot

• Reference screening tools:OMSSA, X!Tandem

• User interface• Remote display

• NX

• Reference GUIs

• SearchGUI

• PeptidShaker

source: PeptideShaker site

Page 12: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

MapReduce Biology• Provide turnkey virtual machine with pre-

configured mapreduce framework• Accelerate bigadata analysis with the two steps map &

reduce paradigm

• Hadoop MapReduce 1.0.4

• Appliances (2)• standard hadoop mapreduce

• bioinformaytics software integrated in hadoop

• Sequences similarity with mapreduce paradigm• FastA & SSearch

• deploy database of sequences in HDFS

• compare each structure to others

Developed in the context of the French project MapReduce, ANR ARPEGE

Mappers

Databank

FastA #01

Reducers

subset#01

subset#02

...

FastA #02 ...

User'sSequences

Resultsscore sequencescore sequence

...

FastAMR

Each mapper

send the

score and

sequences to

reducers

Reducers copy the

best scores of the

whole experiment

in the DFS

Each mapper

runs a FastA

program on a

part of the

databank

FastAMR splits the

databank into subsets

and puts them in the

DFS along with the

sequences file

Users run the FastAMR

script with its sequences

and the databank

Page 13: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Cas d’usage avec Galaxy

Page 14: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Compte Cloud IDB

• Connectez-vous

• Remplissez les différents champs• adresse mail

institutionnelle

• Créer la demande implique l’acceptation des conditions d’utilisations !

https://idee-b.ibcp.fr/cloud.html

Page 15: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Appliances disponibles

• Liste des appliances existantes

• Documentation spécifique aux appliances

• Création directe• bouton ‘Power’

Page 16: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Créer mon portail Galaxy

• Appelée aussi ‘Instance’

• Compléter les différents paramètres• lui assigner un nom

• nombre de CPUs

• taille mémoire

• attacher un disque virtuel

• Cluster de VM• remplir le nombre de VMs

• choix du nom unique

Page 17: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Connexion sur mon instance Galaxy

Page 18: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Les disques durs virtuels

• Un disque virtuel permet de conserver ses données indépendamment de l’exécution des VMs• retrouver ses données d’une

VM à la suivante.

• Actions• créer un vdisk

• gérer ses vdisks

• Utiliser un vdisk• à la création de la VM

• montage à chaud

Page 19: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Echanger les données avec mon portail Galaxy

• sftp / scp

• client graphique: Cyberduck, Transmit, Filezilla, ...

• Web: Galaxy - Get Data - Lien pour download

Page 20: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

Conclusion• Added value of cloud, e.g. NGS with Galaxy

• for scientific analyses: user-specific resources, isolated, different instances together

• for training: Oct 2012 Bordeaux, Mai 2013 Galaxy Lille, (next) 2014 Galaxy Jouy

• for tools integration: semantic annotation, solve software dependencies

• for development & operations (DevOps): different versions at the same time

• Provide turnkey bioinformatics appliances• Standard tools and pipelines

• New developments• Ready to run on clouds

• Public bioinformatics cloud (e.g. IDB)• Tightly connected to existing bioinformatics

resources

• Linked to public biological databases

• In collaboration with the French Institute of Bioinformatics

toolsBLAST

TopHat

FastASSearch

R

ClustalW2

samtoolsBWA

Linuxsystem

Createnew cloudservices

Bioinformatics Marketplace

Structures ...Sequences ProteomicsGalaxy

+Virtual Machines

OMSSA

PeptideShaker

HMMer

Muscle

X!tandemARIA

fastQCClustalOmega

Galaxy

tools VM:BLAST,ClustalW2,etc.

BI

data

UNIPROT

PDB

EMBL

PROSITE

Genomes

Z

B

A

public datauser data

Movecloudvirtual

machines

IDB Cloud

Page 21: Bioinformatique sur Cloud

Institut Français de Bioinformatique

IFB - French Institute of Bioinformatics

Mission : to make available core bioinformatics resources to the national/international life science research community.

• To provide support for biology programs

• supporting projects

• training users

• To provide an IT infrastructure devoted to management and analysis of biological data

• material resources : CPUs, disks, etc.

• availability of biology data collections

• deployment of bioinformatics tools

• To act as a middleman between the life science community and the bioinformatics/computer science research community

Page 22: Bioinformatique sur Cloud

Institut Français de Bioinformatique

IFB - Infrastructure

• IFB-Core resources• Academic cloud for life cience

• Will be hosted at CNRS IDRIS supercomputing center (PARIS)

• A pilot infrastructure (2014-Q1)

• Production infrastructure +5,000cores 1PB (2014-S2)

• + Regional resources• 6 regional bioinformatics centers

• +6,000 cores ~1PB

• 2 existing clouds: PRABI-IBCP IDB cloud (Lyon) & Genouest genocloud (Rennes)

• Deploy a clouds federation

- RENABI IFB -Bioinformatics French Institute

RENABI-GO

APLIBIO

PRABI

RENABI-SO

RENABI-NE

RENABI-GS

FIB-core ITCNRS-IDRIS, Paris

Page 23: Bioinformatique sur Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013

• Acknowledgment

• Clément Gauthey (IDB)

• StratusLab members

• co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and by the French National Research Agency's Arpege Programme (ANR-10-SEGI-001).

Questions ?

http://idee-b.ibcp.fr