biologie computationnelle

Biologie ComputationnellePour quoi faire ?

Protein family Gene sequence Structure

Non-coding regions

Gene networks

Gene order

Bioinformatique TranslationnellePLOS Computational BiologyA Peer-Reviewed, Open Access JournalTranslational Bioinformatics : a collection of Education Articles, 2012http://www.ploscollections.org/article/browseIssue.action?issue=info:doi/10.1371/issue.pcol.v03.i11 Terrain d'illustration de l'impact de la Biologie Computationnelle :

- intégrer des quantités considérables de données moléculaires et cliniques pour une meilleure compréhension des bases moléculaires de maladies qui à son tour modifiera les pratiques cliniques et comme horizon améliorer la santé humaine.

Informatics : the study of how to represent, store, search, retrieve and analyze information

Medical Informatics : concerns medical information

BioInformatics : concerns basic biological information

Clinical Informatics : focuses on the clinical delivery part of medical informatics

Biomedical Informatics : merges bioinformatics and medical informatics

Imaging Informatics : focuses on images

http://www.ploscollections.org/article/browseIssue.action?issue=info:doi/10.1371/issue.pcol.v03.i11

Translationnelle :- Comment améliorer efficacement le diagnostic, le pronostic et le traitement des

patients ?● Création d'appareils médicaux● Diagnostic moléculaire● Thérapeutique via de petites molécules● Vaccins

Etc.● Maîtrise de l'explosion des connaissances en biologie moléculaire, génétique et génomique.

Y a-t-il eu transfert de la connaissance de la structure en double hélice de l'ADN vers une amélioration pratique avec un impact sur la santé humaine d'un point de vue de l'innovation technologique ?

Ce qui est certain, on est capable aujourd'hui de mesurer rapidement :- les séquences ADN (à l'échelle d'un génome entier !)- les séquences ARN et leur expression - les séquences protéique, leur structure, expression et modification- la structure, la présence et quantité de petits métabolites moléculaires

Translational Bioinformatics : the development and application of informatics methods that connect molecular entities to clininal entities :

Molecular Informatics : concerns molecular/cellular information

Clinical Informatics : focuses on the clinical delivery part of medical informatics

PubMED : http://www.ncbi.nlm.nih.gov/pubmedUMLS : Unified Medical Language Systemhttp://www.nlm.nih.gov/research/umls/

Genes ; DNA ;RNA messengers ;MicroRNAs ; ProteinsSignaling molecules and cascades ;Metabolites ;Cellular communication processes ;Cellular organization

Genbank http://www.ncbi.nlm.nih.gov/genbank/Gene Expression Omnibushttp://www.ncbi.nlm.nih.gov/geo Protein Data Bankhttp://www.wwpdb.org/ KEGGhttp://www.genome.jp/kegg/ MetaCychttp://metacyc.org Reactomehttp://www.reactome.org

Impact : Médecine personnalisée ?Association de génomes, fouille de marqueurs génétiquesAnalyse de données génomiques personnelles, fouille de données issues d'enregistrements électroniques etc.

http://www.ncbi.nlm.nih.gov/pubmed

http://www.nlm.nih.gov/research/umls/

http://www.ncbi.nlm.nih.gov/genbank/

http://www.ncbi.nlm.nih.gov/geo

http://www.wwpdb.org/

http://www.genome.jp/kegg/

http://metacyc.org/

http://www.reactome.org/

4 Grands Chapitres pour cette session :

- Imagerie Quantitative

- Machine Learning / Data Mining

- Représentation par graphes et réseaux

- Représentation des connaissances : données, banque de données, ontologies

Des technologies/ environnements de développement :- Java : ImageJ, Weka

- Python : Biopython, Numpy, Scipy, Matplotlib, Enthought Python Distribution et Canopy

- scripts : Perl, Gawk

- Inkscape, ImageMagik / Sphinx / XML, SBML, BioPax, GPML, JSON, SQL, noSQL, Hadoop

Des softwares :

- Clustal → T-coffee, PathwayAPI, BioGRID, PatternHunter ….

→ Des choix, un co-design du cours ...

Des travaux pratiques sur :

- Intégration de Connaissance (Ontologies, données)

- Construction de Réseaux d'interaction de gènes : R, statistiques

- Visualisation de l'interactome : PPI , maladies, Cytoscape software, analyse topologique de graphes (network analyzer, arbres de Steiner trees, MST)

- Assemblage de connaissance et interprétation : GSEA, GO, KEGG, BioLattice, pathology and ontology based analysis

- Priorisation de gènes pathogènes : MLP, Weka

- Analyse d'Images : ImageJ, cellProfiler, DcellIQ, NeuronStudio, ITK

- Visualisation -omiques : R, python

- Algorithmes de Graphes, Complexité

Une perspective biologique permanente :

- Gene feature recognition :→ TIS (Translation Initiation Site) → TSS (Transcriptional Start Sites)→ Feature Generation → Feature Selection → Feature integration → Gene finding

- Gene expression analysis : → Affymetrix Gene Chip Data→ Gene expression Profile Classification→ Gene expression Profile Clustering→ Gene Regulatory Circuits Reconstruction : Differentially Expressed

Genes, Gene Interaction Prediction

- Sequence Alignment / Comparaison / Homology :→ Multiple Sequence Alignment (Dynamic Programming)→ Function assignment to protein sequence (Guilt by Association)→ Discovery of Active Site or Domain of a function→ PPI / Proteomic Profile Analysis→ key mutation site identification

- Phylogenetic tree : → Construction→ Comparison

-Biological Networks / Graph of Interactions : → Natural pathways→ PPI Networks→ Protein Complex Prediction

TIS : Translation Initiation Site Recognition/Prediction

299 HSU27655.1 CAT U27655 Homo sapiensCGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT............................................................ 80................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

Un échantillon de cDNA

Pourquoi le second ATG est-il un TIS?

Gene prediction

T-Cell Epitopes Prediction By Artificial Neural Network

• Honeyman et al., Nature Biotechnology 16:966-969, 1998

PAS Prediction

Feature generati on

I ncomi ngsequences

Feature sel ecti on

Feature i ntegrati on

END

BEGI N

SVM in Weka

Histone Promoter Recognition Programs

9 Motifs Discovered by MEME algo in Histone Promoter 5’ Region [-250,-1] among 127 histone promoters

Diagnosis of Childhood Acute Lymphoblastic Leukemia (ALL) and Optimization of Risk-Benefit Ratio of Therapy

Immunophenotyping

TEL-AML1BCR-ABLHyperdiploid >50E2A-PBX1

MLL T-ALL Novel

Affymetrix GeneChip Micro Array Analysis

Proteomics Data : Guilt-by-Association

Compare T with seqs of known function in a db

Assign to T same function as homologs

Confirm with suitable wet experiments

Discard this functionas a candidate

Topology of Protein Interaction Networks: Hubs, Cores, Bipartites

Maslov & Sneppen, Science, v296, 2002

318 edges329 nodes

In nucleus ofS. cerevisiae

Proteomics Data: Subgraphs in Protein Interaction

Yeast SH3 domain-domainInteraction network: 394 edges, 206 nodesTong et al. Science, v295. 2002

8 proteins containing SH35 binding at least 6 of them

a bConfiguration a is less likely than b in protein interaction networks → Graph/Network Mining

Decision Tree Based In-Silico Cancer Diagnosis

A B A

B A

x1

x2

x4

x3

> a1

> a2

Leaf nodes

Internal nodes

Root node

B

AA

B

B A

A

Prognosis based on Gene Expression Profiling

• Yeoh et al., Cancer Cell 1:133-143, 2002; Differentiating MLL subtype from other subtypes of childhood leukemia

● Training data (14 MLL vs 201 others), Test data (6 MLL vs 106 others), Number of features: 12558

Given a test sample, at most 3 of the 4 genes’ expression values are needed to make a decision!

Discovery of Diagnostic Biomarkers for Ovarian Cancer

● Motivation: cure rate ~ 95% if correct diagnosis at early stage

● Proteomic profiling data obtained from patients’ serum samples

● The first data set by Petricoin et al was published in Lancet, 2002

● Data set of June-2002.● 253 samples: 91 controls

and 162 patients suffering from the disease; 15154 features (proteins, peptides, precisely, mass/charge identities) SVM: 0 errors; Naïve Bayes: 19

errors; k-NN: 15 errors.

Mining Errors from Bio Databases

RECORD

SINGLE SOURCE DATABASE

Invalid values

Ambiguity

Incompatible schema

ATTRIBUTE

Uninformative sequences

Undersized sequences

Annotation error

Dubious sequences

Sequence redundancy

Data provenance flaws

Cross-annotation error

Sequence structure violation

Vector contaminated sequence

Erroneous data transformation

MULTIPLESOURCE DATABASE

• Among the 5,146,255 protein records queried using Entrez to the major protein or translated nucleotide databases , 3,327 protein sequences are shorter than four residues (as of Sep, 2004).

• In Nov 2004, the total number of undersized protein sequences increases to 3,350.

• Among 43,026,887 nucleotide records queried using Entrez to major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep, 2004).

• In Nov 2004, the total number of undersized nucleotide sequences increases to 1,711.

Undersized protein sequences in major databases

218 171

42

528

116 151

1015

364 383

12351

1253 2 120 0 23

0

200

400

600

800

1000

1200

1 2 3

Sequence Length

Num

ber

of r

ecor

ds

DDBJ

EMBL

GenBank

PDB

SwissProt

PIR

Undersized nucleotide sequences in major databases

2 3 924

5573 69

115

81

233228

108 108

5167

6

40 45

77104

0

50

100

150

200

250

1 2 3 4 5

Sequence LengthN

um

ber

of

reco

rds

DDBJ

EMBL

GenBank

PDB

Example Meaningless Seqs

Duplicates detected by association rules

6

36.3

0.3

49.4

5.2

32.7

0.11.8 2.4 3.8

5.77.5 7.9 9.4

0

10

20

30

40

50

60

Rule 1

Rule 2

Rule 3

Rule 4

Rule 5

Rule 6

Rule 7

Association rules

FP

% a

nd

FN

%

FP% FN% x 1000

S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%)Rule 7

S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%)Rule 6

S(Seq)=1 ^ M(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%)Rule 5

S(Seq)=1^ M(PDB)=0 ^ M(DB)=0 (93.1%)Rule 4

S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)Rule 3

S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)Rule 2

S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)Rule 1

Rule 1. Identical sequences with the same sequence length and not originated from PDB are 99.7% likely to be duplicates.

Rule 2. Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates.

Rule 3. Identical sequences with the same sequence length, of the same species and not originated from PDB are 96.8% likely to be duplicates.

Time since split

Australian

Papuan

Polynesian

Indonesian

Cherokee

Navajo

Japanese

Tibetan

English

Italian

Ethiopian

Mbuti PygmyAfrica

Europe

Asia

America

Oceania

Austalasia

Root

● Estimate order in which “populations” evolved

● Based on assimilated freq of many different genes

● But …– is human evolution a

succession of population fissions?

– Is there such thing as a proto-Anglo-Italian population which split, never to meet again, and became inhabitants of England and Italy?

Phylogenetic tree construction

Predicting interactions using phylogenetic profile

Pellegrini et al. PNAS 96, 4285-4288 (1999)

Comparative genomics

GWAS Genome Wide Association Studies

biologie computationnelle

Documents