Christophe Blanchet
Institute of Biology and Chemistry of Proteins
Head of Service "Infrastructure for Biology - IDB"
CNRS-IBCP FR3302 - LYON - FRANCE - http://idee-b.ibcp.fr
IDB acknowledges co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552), the French National Research Agency's Arpege Programme (ANR-10-SEGI-001) and the French Institute for Bioinformatics (IFB-RENABI).
Bioinformatics on the Cloud
Use case with the Galaxy portal
Ecole Bioinformatique Aviesan, 18 October 2013
Bioinformatics Today
• Biological data are big data
  • 1512 online databases (NAR Database Issue 2013)
  • Sanger Institute, UK, 5 PB
  • Beijing Genomics Institute, China, 5 sites, 12.6 PB
  ➡ Big data in many places
• Analysing such data has become difficult
  • Scale-up of the analyses: from gene/protein to complete genome/proteome, ...
  • Many different tools in daily use
  • These tools need to be combined in workflows
  • Usual interfaces: portals, Web services, federation, ...
  ➡ Datacenters with ease of access/use
• Distributed resources
  • Experimental platforms: NGS, imaging, ...
  • Bioinformatics platforms
  ➡ Federation of datacenters
IDB Cloud and Bioinformatics Appliances
• Cloud workbench for Biology
  • Running since Sept. 2011 at CNRS-IBCP FR3302, Lyon, France
  • Open to the Biology community
  • 14 bioinformatics appliances: Galaxy portal, standard compute nodes, proteomics, virtual desktop, structural biology, ...
  • 40+ users from all IFB regional centers: PRABI 15, APLIBIO 14, RENABI-NE 8, RENABI-SO 2, RENABI-GS 1, RENABI-GO 1
  • VMs up to 32 cores and 768 GB RAM
• Infrastructure
  • Compute: 900+ cores, 4+ TB RAM
    • Standard nodes (32 cores, 128 GB)
    • Bigmem nodes (64 cores, 768 GB)
    • Powered by StratusLab
  • Storage: 250+ TB
    • Virtual disks, object storage (S3)
[Figure: IDB Cloud overview. Labels recoverable from the diagram: a Bioinformatics Marketplace of appliances (Sequences, Structures, Proteomics, Galaxy, ...) built from a Linux system plus tools (BLAST, TopHat, FastA, SSearch, R, ClustalW2, samtools, BWA, OMSSA, PeptideShaker, HMMer, Muscle, X!Tandem, ARIA, FastQC, Clustal Omega); virtual machines on the IDB Cloud, including a Galaxy tools VM (BLAST, ClustalW2, etc.); public data (UniProt, PDB, EMBL, PROSITE, genomes) and user data; creation of new cloud services.]
Cloud extended services
• Bioinformatics Marketplace
  • Find appropriate appliances more easily
  • Reduce "noise" in the central Marketplace
  • Respect visibility constraints for bioinformatics appliances, such as confidentiality
• Bioinformatics metadata "bio:tool"
  • Additional elements describing bioinformatics tools
  • Used to annotate appliances
  • Help users search by tool or by type of analysis
  • Select suitable bioinformatics appliances containing the required tools
• Integrated Web interface
  • VM and virtual disk management
  • Browse bioinformatics appliances by their "bio:tool" metadata
Native cloud services
• Authentication
• Virtual machine management
• Persistent disk service
• Client CLI
• etc.
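The idea of selecting appliances by the tools they contain can be illustrated with a minimal Python sketch. It is not the Marketplace's actual data model or API: the `Appliance` class, the `find_appliances` function and the appliance names are hypothetical, and the tool sets only echo tools named elsewhere in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Appliance:
    """Hypothetical, simplified view of a Marketplace entry."""
    name: str
    # Stand-in for the 'bio:tool' annotations attached to the appliance.
    bio_tools: set = field(default_factory=set)


def find_appliances(catalog, required_tools):
    """Return the appliances whose annotations include every required tool."""
    wanted = {tool.lower() for tool in required_tools}
    return [
        app for app in catalog
        if wanted <= {tool.lower() for tool in app.bio_tools}
    ]


# Illustrative catalog (assumed names, tools taken from the slides).
catalog = [
    Appliance("biocompute", {"BLAST", "ClustalW2", "BWA", "samtools"}),
    Appliance("proteomics-desktop", {"OMSSA", "X!Tandem", "PeptideShaker"}),
    Appliance("galaxy", {"BLAST", "ClustalW2"}),
]

matches = find_appliances(catalog, ["blast", "clustalw2"])
print([app.name for app in matches])  # ['biocompute', 'galaxy']
```

Matching is case-insensitive and requires all requested tools, so a search for a tool that no appliance declares simply returns an empty list.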
Driven through a simple web interface
Run your Bioinformatics Cloud Instances
[Figure: instances are launched from the Bioinformatics Marketplace (NGS, Structure, Galaxy, ARIA, Sequence, ...) onto IBCP's cloud resources. Labels recoverable from the diagram: IaaS, PaaS, master & storage VM, worker VMs, shared FS, launch jobs via ssh, ARIA portal, CNS, BLAST, Clustal, etc.]
Biological Data in Cloud
[Figure: the bioinformatics cloud (master & storage VM, worker VMs, shared FS, ARIA portal; IaaS and PaaS) with its data. Labels recoverable from the diagram: public data sources (UniProt, PDB, EMBL, PROSITE, genomes), shared (NFS); user persistent data, pdisk (iSCSI); upload your data and get your results via sftp/http/S3.]
Examples of Cloud Bioinformatics Appliances
Standard Bioinformatics node
• "Biocompute" appliance
• Use your own instance(s)
• With pre-installed standard bioinformatics tools
  • BLAST, FastA, SSearch, HMM, ...
  • ClustalW2, Clustal-Omega, Muscle, ...
  • Bowtie(2), BWA, samtools, ...
  • MEME, R, etc.
• Connected to public reference data
  • UniProt, EMBL, genomes, PDB, etc.
  • Automatically shared with the VMs
Structural Biology
• TOwards StruCtural AssignmeNt Improvement
  • Aims to improve the determination of protein structures from Nuclear Magnetic Resonance (NMR) data with the ARIA software
  • Large computational needs
  • An NMR laboratory is unlikely to invest in building a cluster of about 100 nodes just to run such NMR structure calculations
  • The flexibility of the cloud in deploying the required bioinformatics tools can accelerate such a procedure
  • Commercial interest in providing such tools to structural biologists on a "pay as you go" basis
  • Endorsers: Institut Pasteur Paris and CNRS IBCP
Proteomics desktop
• Motivation
  • Collaboration with a mass spectrometry platform
  • Running out of space on their local resources
• Protein identification
  • Experimental mass spectrometry data
  • Reference databases: nr, Swiss-Prot
  • Reference screening tools: OMSSA, X!Tandem
• User interface
  • Remote display
    • NX
  • Reference GUIs
    • SearchGUI
    • PeptideShaker
Source: PeptideShaker site
MapReduce Biology
• Provide turnkey virtual machines with a pre-configured MapReduce framework
  • Accelerate big data analysis with the two-step map & reduce paradigm
  • Hadoop MapReduce 1.0.4
• Appliances (2)
  • Standard Hadoop MapReduce
  • Bioinformatics software integrated in Hadoop
• Sequence similarity with the MapReduce paradigm
  • FastA & SSearch
  • Deploy a database of sequences in HDFS
  • Compare each sequence to the others
Developed in the context of the French project MapReduce, ANR ARPEGE
FastAMR
[Figure: the user's sequences and the databank go in; the databank is split into subsets (subset #01, subset #02, ...), each handled by a mapper (FastA #01, FastA #02, ...); reducers produce the results as (score, sequence) pairs.]
Steps shown in the figure, in order:
• The user runs the FastAMR script with their sequences and the databank
• FastAMR splits the databank into subsets and puts them in the DFS along with the sequences file
• Each mapper runs a FastA program on one part of the databank
• Each mapper sends the scores and sequences to the reducers
• Reducers copy the best scores of the whole experiment into the DFS
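The FastAMR flow described above (split the databank, run one comparison per subset in the mappers, keep the best scores in the reducers) can be sketched in plain Python. This is only an in-memory illustration of the map & reduce pattern, not the FastAMR script and not Hadoop code: all function names are hypothetical, and `toy_score` is a deliberately naive stand-in (shared 3-mers) for the FastA program.

```python
def split_databank(databank, n_subsets):
    """Split the databank, a list of (id, sequence) pairs, into n subsets."""
    return [databank[i::n_subsets] for i in range(n_subsets)]


def kmers(sequence, k=3):
    """Set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def toy_score(query, subject):
    """Placeholder similarity: number of shared 3-mers. NOT the FastA algorithm."""
    return len(kmers(query) & kmers(subject))


def mapper(queries, subset):
    """Compare every query with every sequence of one databank subset."""
    for q_id, q_seq in queries:
        for s_id, s_seq in subset:
            yield q_id, (toy_score(q_seq, s_seq), s_id)


def reducer(q_id, hits, top):
    """Keep the best-scoring hits for one query."""
    return q_id, sorted(hits, reverse=True)[:top]


def run_similarity_search(queries, databank, n_mappers=2, top=2):
    """Run the map phase per subset, group by query, then reduce."""
    grouped = {}
    for subset in split_databank(databank, n_mappers):
        for q_id, hit in mapper(queries, subset):
            grouped.setdefault(q_id, []).append(hit)
    return dict(reducer(q_id, hits, top) for q_id, hits in grouped.items())


# Made-up sequences, for illustration only.
queries = [("q1", "MKVLAAGIV")]
databank = [
    ("s1", "MKVLAAGIV"),
    ("s2", "MKVLTTGIV"),
    ("s3", "PPPPPPPPP"),
    ("s4", "AAGIVMKVL"),
]

print(run_similarity_search(queries, databank))
# {'q1': [(7, 's1'), (5, 's4')]}
```

Because each subset is scored independently, the result does not depend on how many mappers the databank is split across; in a real Hadoop deployment that independence is what lets the subsets be processed in parallel.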
Use case with Galaxy
IDB Cloud account
• Log in
• Fill in the fields
  • institutional e-mail address
• Submitting the request implies acceptance of the terms of use!
https://idee-b.ibcp.fr/cloud.html
Available appliances
• List of existing appliances
• Appliance-specific documentation
• Direct creation
  • "Power" button
Creating my Galaxy portal
• Also called an "instance"
• Fill in the parameters
  • give it a name
  • number of CPUs
  • memory size
  • attach a virtual disk
• VM cluster
  • enter the number of VMs
  • choose a unique name
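The parameters entered when creating an instance can be pictured as a small record with a few sanity checks. The sketch below is an assumption-laden illustration, not the IDB web form or the StratusLab API: the `InstanceRequest` class and its field names are hypothetical, and the only limits used are the "up to 32 cores, 768 GB RAM" figures quoted earlier in these slides.

```python
from dataclasses import dataclass
from typing import List, Optional

# Upper bounds taken from the slides ("VMs up to 32 cores, 768 GB RAM").
MAX_CPUS = 32
MAX_RAM_GB = 768


@dataclass
class InstanceRequest:
    """Hypothetical record of the values entered in the creation form."""
    name: str
    cpus: int
    ram_gb: int
    vdisk_id: Optional[str] = None  # virtual disk to attach, if any
    vm_count: int = 1               # more than 1 requests a cluster of VMs

    def problems(self) -> List[str]:
        """Return the problems found; an empty list means the request looks valid."""
        issues = []
        if not self.name.strip():
            issues.append("a name is required")
        if not 1 <= self.cpus <= MAX_CPUS:
            issues.append(f"cpus must be between 1 and {MAX_CPUS}")
        if not 1 <= self.ram_gb <= MAX_RAM_GB:
            issues.append(f"ram_gb must be between 1 and {MAX_RAM_GB}")
        if self.vm_count < 1:
            issues.append("vm_count must be at least 1")
        return issues


ok = InstanceRequest(name="galaxy-training", cpus=8, ram_gb=32, vdisk_id="my-vdisk")
too_big = InstanceRequest(name="galaxy-big", cpus=64, ram_gb=1024)
print(ok.problems())       # []
print(too_big.problems())  # ['cpus must be between 1 and 32', 'ram_gb must be between 1 and 768']
```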
Connecting to my Galaxy instance
Virtual hard disks
• A virtual disk keeps your data independently of the lifetime of the VMs
  • your data carries over from one VM to the next
• Actions
  • create a vdisk
  • manage your vdisks
• Using a vdisk
  • at VM creation
  • hot-mounted on a running VM
Exchanging data with my Galaxy portal
• sftp / scp
• graphical client: Cyberduck, Transmit, FileZilla, ...
• Web: Galaxy - Get Data - download link
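For the scp route, the command lines involved can be assembled from Python as in the sketch below. The user name, host name and remote paths are placeholders invented for the example (the real values depend on your instance), and the helper functions are hypothetical; only the standard `scp source target` syntax is relied upon. The code builds the commands without running them.

```python
import shlex


def scp_upload(local_path, user, host, remote_dir):
    """Build the scp command that copies a local file to the instance."""
    target = f"{user}@{host}:{remote_dir.rstrip('/')}/"
    return ["scp", local_path, target]


def scp_download(remote_path, user, host, local_dir="."):
    """Build the scp command that fetches a result file from the instance."""
    return ["scp", f"{user}@{host}:{remote_path}", local_dir]


# Placeholder account and host: replace with the values of your own instance.
up = scp_upload("reads.fastq", "galaxy", "my-instance.example.org", "/data/upload")
down = scp_download("/data/results/report.html", "galaxy", "my-instance.example.org")

print(shlex.join(up))
print(shlex.join(down))
```

Each command is a plain argument list, so it can be passed to `subprocess.run(cmd, check=True)` once the placeholders have been replaced.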
Conclusion
• Added value of the cloud, e.g. NGS with Galaxy
  • for scientific analyses: user-specific, isolated resources; different instances side by side
  • for training: Oct 2012 Bordeaux, May 2013 Galaxy Lille, (next) 2014 Galaxy Jouy
  • for tool integration: semantic annotation, resolving software dependencies
  • for development & operations (DevOps): different versions at the same time
• Provide turnkey bioinformatics appliances
  • Standard tools and pipelines
  • New developments
    • Ready to run on clouds
• Public bioinformatics cloud (e.g. IDB)
  • Tightly connected to existing bioinformatics resources
  • Linked to public biological databases
  • In collaboration with the French Institute of Bioinformatics
Institut Français de Bioinformatique
IFB - French Institute of Bioinformatics
Mission: to make core bioinformatics resources available to the national and international life science research community.
• Provide support for biology programs
  • supporting projects
  • training users
• Provide an IT infrastructure devoted to the management and analysis of biological data
  • material resources: CPUs, disks, etc.
  • availability of biological data collections
  • deployment of bioinformatics tools
• Act as an intermediary between the life science community and the bioinformatics/computer science research community
IFB - Infrastructure
• IFB-Core resources
  • Academic cloud for life science
  • Will be hosted at the CNRS IDRIS supercomputing center (Paris)
  • A pilot infrastructure (2014-Q1)
  • Production infrastructure: 5,000+ cores, 1 PB (2014-S2)
• Plus regional resources
  • 6 regional bioinformatics centers
  • 6,000+ cores, ~1 PB
  • 2 existing clouds: PRABI-IBCP IDB cloud (Lyon) and GenOuest genocloud (Rennes)
• Deploy a federation of clouds
[Figure: RENABI / IFB (French Institute of Bioinformatics) regional centers: RENABI-GO, APLIBIO, PRABI, RENABI-SO, RENABI-NE, RENABI-GS; IFB-core IT at CNRS-IDRIS, Paris.]
Acknowledgments
• Clément Gauthey (IDB)
• StratusLab members
• Co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and by the French National Research Agency's Arpege Programme (ANR-10-SEGI-001)
Questions?
http://idee-b.ibcp.fr