biologie computationnelle

29
Biologie Computationnelle Pour quoi faire ? Protein family Gene sequence Structure Non-coding regions Gene networks Gene order

Upload: phamnhu

Post on 31-Dec-2016

231 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Biologie Computationnelle

Biologie ComputationnellePour quoi faire ?

Protein family Gene sequence Structure

Non-coding regions

Gene networks

Gene order

Page 2: Biologie Computationnelle

Bioinformatique TranslationnellePLOS Computational BiologyA Peer-Reviewed, Open Access JournalTranslational Bioinformatics : a collection of Education Articles, 2012http://www.ploscollections.org/article/browseIssue.action?issue=info:doi/10.1371/issue.pcol.v03.i11 Terrain d'illustration de l'impact de la Biologie Computationnelle :

- intégrer des quantités considérables de données moléculaires et cliniques pour une meilleure compréhension des bases moléculaires de maladies qui à son tour modifiera les pratiques cliniques et comme horizon améliorer la santé humaine.

Informatics : the study of how to represent, store, search, retrieve and analyze information

Medical Informatics : concerns medical information

BioInformatics : concerns basic biological information

Clinical Informatics : focuses on the clinical delivery part of medical informatics

Biomedical Informatics : merges bioinformatics and medical informatics

Imaging Informatics : focuses on images

Page 3: Biologie Computationnelle

Translationnelle :- Comment améliorer efficacement le diagnostic, le pronostic et le traitement des

patients ?● Création d'appareils médicaux● Diagnostic moléculaire● Thérapeutique via de petites molécules● Vaccins

Etc.● Maîtrise de l'explosion des connaissances en biologie moléculaire, génétique et génomique.

Y a-t-il eu transfert de la connaissance de la structure en double hélice de l'ADN vers une amélioration pratique avec un impact sur la santé humaine d'un point de vue de l'innovation technologique ?

Ce qui est certain, on est capable aujourd'hui de mesurer rapidement :- les séquences ADN (à l'échelle d'un génome entier !)- les séquences ARN et leur expression - les séquences protéique, leur structure, expression et modification- la structure, la présence et quantité de petits métabolites moléculaires

Page 4: Biologie Computationnelle

Translational Bioinformatics : the development and application of informatics methods that connect molecular entities to clininal entities :

Molecular Informatics : concerns molecular/cellular information

Clinical Informatics : focuses on the clinical delivery part of medical informatics

PubMED : http://www.ncbi.nlm.nih.gov/pubmedUMLS : Unified Medical Language Systemhttp://www.nlm.nih.gov/research/umls/

Genes ; DNA ;RNA messengers ;MicroRNAs ; ProteinsSignaling molecules and cascades ;Metabolites ;Cellular communication processes ;Cellular organization

Genbank http://www.ncbi.nlm.nih.gov/genbank/Gene Expression Omnibushttp://www.ncbi.nlm.nih.gov/geo Protein Data Bankhttp://www.wwpdb.org/ KEGGhttp://www.genome.jp/kegg/ MetaCychttp://metacyc.org Reactomehttp://www.reactome.org

Impact : Médecine personnalisée ?Association de génomes, fouille de marqueurs génétiquesAnalyse de données génomiques personnelles, fouille de données issues d'enregistrements électroniques etc.

Page 5: Biologie Computationnelle

4 Grands Chapitres pour cette session :

- Imagerie Quantitative

- Machine Learning / Data Mining

- Représentation par graphes et réseaux

- Représentation des connaissances : données, banque de données, ontologies

Des technologies/ environnements de développement :- Java : ImageJ, Weka

- Python : Biopython, Numpy, Scipy, Matplotlib, Enthought Python Distribution et Canopy

- scripts : Perl, Gawk

- Inkscape, ImageMagik / Sphinx / XML, SBML, BioPax, GPML, JSON, SQL, noSQL, Hadoop

Des softwares :

- Clustal → T-coffee, PathwayAPI, BioGRID, PatternHunter ….

→ Des choix, un co-design du cours ...

Page 6: Biologie Computationnelle

Des travaux pratiques sur :

- Intégration de Connaissance (Ontologies, données)

- Construction de Réseaux d'interaction de gènes : R, statistiques

- Visualisation de l'interactome : PPI , maladies, Cytoscape software, analyse topologique de graphes (network analyzer, arbres de Steiner trees, MST)

- Assemblage de connaissance et interprétation : GSEA, GO, KEGG, BioLattice, pathology and ontology based analysis

- Priorisation de gènes pathogènes : MLP, Weka

- Analyse d'Images : ImageJ, cellProfiler, DcellIQ, NeuronStudio, ITK

- Visualisation -omiques : R, python

- Algorithmes de Graphes, Complexité

Page 7: Biologie Computationnelle

Une perspective biologique permanente :

- Gene feature recognition :→ TIS (Translation Initiation Site) → TSS (Transcriptional Start Sites)→ Feature Generation → Feature Selection → Feature integration → Gene finding

- Gene expression analysis : → Affymetrix Gene Chip Data→ Gene expression Profile Classification→ Gene expression Profile Clustering→ Gene Regulatory Circuits Reconstruction : Differentially Expressed

Genes, Gene Interaction Prediction

- Sequence Alignment / Comparaison / Homology :→ Multiple Sequence Alignment (Dynamic Programming)→ Function assignment to protein sequence (Guilt by Association)→ Discovery of Active Site or Domain of a function→ PPI / Proteomic Profile Analysis→ key mutation site identification

- Phylogenetic tree : → Construction→ Comparison

-Biological Networks / Graph of Interactions : → Natural pathways→ PPI Networks→ Protein Complex Prediction

Page 8: Biologie Computationnelle

TIS : Translation Initiation Site Recognition/Prediction

299 HSU27655.1 CAT U27655 Homo sapiensCGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT............................................................ 80................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

Un échantillon de cDNA

Pourquoi le second ATG est-il un TIS?

Page 9: Biologie Computationnelle

Gene prediction

Page 10: Biologie Computationnelle

T-Cell Epitopes Prediction By Artificial Neural Network

• Honeyman et al., Nature Biotechnology 16:966-969, 1998

Page 11: Biologie Computationnelle

PAS Prediction

Feature generati on

I ncomi ngsequences

Feature sel ecti on

Feature i ntegrati on

END

BEGI N

SVM in Weka

Page 12: Biologie Computationnelle

Histone Promoter Recognition Programs

Page 13: Biologie Computationnelle

9 Motifs Discovered by MEME algo in Histone Promoter 5’ Region [-250,-1] among 127 histone promoters

Page 14: Biologie Computationnelle

Diagnosis of Childhood Acute Lymphoblastic Leukemia (ALL) and Optimization of Risk-Benefit Ratio of Therapy

Immunophenotyping

Page 15: Biologie Computationnelle

TEL-AML1BCR-ABLHyperdiploid >50E2A-PBX1

MLL T-ALL Novel

Affymetrix GeneChip Micro Array Analysis

Page 16: Biologie Computationnelle

Proteomics Data : Guilt-by-Association

Compare T with seqs of known function in a db

Assign to T same function as homologs

Confirm with suitable wet experiments

Discard this functionas a candidate

Page 17: Biologie Computationnelle

Topology of Protein Interaction Networks: Hubs, Cores, Bipartites

Maslov & Sneppen, Science, v296, 2002

318 edges329 nodes

In nucleus ofS. cerevisiae

Proteomics Data: Subgraphs in Protein Interaction

Page 18: Biologie Computationnelle

Yeast SH3 domain-domainInteraction network: 394 edges, 206 nodesTong et al. Science, v295. 2002

8 proteins containing SH35 binding at least 6 of them

Page 19: Biologie Computationnelle

a bConfiguration a is less likely than b in protein interaction networks → Graph/Network Mining

Page 20: Biologie Computationnelle

Decision Tree Based In-Silico Cancer Diagnosis

A B A

B A

x1

x2

x4

x3

> a1

> a2

Leaf nodes

Internal nodes

Root node

B

AA

B

B A

A

Prognosis based on Gene Expression Profiling

Page 21: Biologie Computationnelle

• Yeoh et al., Cancer Cell 1:133-143, 2002; Differentiating MLL subtype from other subtypes of childhood leukemia

● Training data (14 MLL vs 201 others), Test data (6 MLL vs 106 others), Number of features: 12558

Given a test sample, at most 3 of the 4 genes’ expression values are needed to make a decision!

Page 22: Biologie Computationnelle

Discovery of Diagnostic Biomarkers for Ovarian Cancer

● Motivation: cure rate ~ 95% if correct diagnosis at early stage

● Proteomic profiling data obtained from patients’ serum samples

● The first data set by Petricoin et al was published in Lancet, 2002

● Data set of June-2002.● 253 samples: 91 controls

and 162 patients suffering from the disease; 15154 features (proteins, peptides, precisely, mass/charge identities) SVM: 0 errors; Naïve Bayes: 19

errors; k-NN: 15 errors.

Page 23: Biologie Computationnelle

Mining Errors from Bio Databases

RECORD

SINGLE SOURCE DATABASE

Invalid values

Ambiguity

Incompatible schema

ATTRIBUTE

Uninformative sequences

Undersized sequences

Annotation error

Dubious sequences

Sequence redundancy

Data provenance flaws

Cross-annotation error

Sequence structure violation

Vector contaminated sequence

Erroneous data transformation

MULTIPLESOURCE DATABASE

• Among the 5,146,255 protein records queried using Entrez to the major protein or translated nucleotide databases , 3,327 protein sequences are shorter than four residues (as of Sep, 2004).

• In Nov 2004, the total number of undersized protein sequences increases to 3,350.

• Among 43,026,887 nucleotide records queried using Entrez to major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep, 2004).

• In Nov 2004, the total number of undersized nucleotide sequences increases to 1,711.

Undersized protein sequences in major databases

218 171

42

528

116 151

1015

364 383

12351

1253 2 120 0 23

0

200

400

600

800

1000

1200

1 2 3

Sequence Length

Num

ber

of r

ecor

ds

DDBJ

EMBL

GenBank

PDB

SwissProt

PIR

Undersized nucleotide sequences in major databases

2 3 924

5573 69

115

81

233228

108 108

5167

6

40 45

77104

0

50

100

150

200

250

1 2 3 4 5

Sequence LengthN

um

ber

of

reco

rds

DDBJ

EMBL

GenBank

PDB

Example Meaningless Seqs

Page 24: Biologie Computationnelle

Duplicates detected by association rules

6

36.3

0.3

49.4

5.2

32.7

0.11.8 2.4 3.8

5.77.5 7.9 9.4

0

10

20

30

40

50

60

Rule 1

Rule 2

Rule 3

Rule 4

Rule 5

Rule 6

Rule 7

Association rules

FP

% a

nd

FN

%

FP% FN% x 1000

S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%)Rule 7

S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%)Rule 6

S(Seq)=1 ^ M(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%)Rule 5

S(Seq)=1^ M(PDB)=0 ^ M(DB)=0 (93.1%)Rule 4

S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)Rule 3

S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)Rule 2

S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)Rule 1

Rule 1. Identical sequences with the same sequence length and not originated from PDB are 99.7% likely to be duplicates.

Rule 2. Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates.

Rule 3. Identical sequences with the same sequence length, of the same species and not originated from PDB are 96.8% likely to be duplicates.

Page 25: Biologie Computationnelle

Time since split

Australian

Papuan

Polynesian

Indonesian

Cherokee

Navajo

Japanese

Tibetan

English

Italian

Ethiopian

Mbuti PygmyAfrica

Europe

Asia

America

Oceania

Austalasia

Root

● Estimate order in which “populations” evolved

● Based on assimilated freq of many different genes

● But …– is human evolution a

succession of population fissions?

– Is there such thing as a proto-Anglo-Italian population which split, never to meet again, and became inhabitants of England and Italy?

Phylogenetic tree construction

Page 26: Biologie Computationnelle
Page 27: Biologie Computationnelle

Predicting interactions using phylogenetic profile

Pellegrini et al. PNAS 96, 4285-4288 (1999)

Page 28: Biologie Computationnelle

Comparative genomics

Page 29: Biologie Computationnelle

GWAS Genome Wide Association Studies