bioinformatics & machine learningbioinformatics bioinformatics machine learning microarrays...

32
Bioinformatics & Machine Learning Daniel Glez-Peña IP Leiria, June 3 2009

Upload: others

Post on 07-Jun-2020

19 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

Bioinformatics &Machine Learning

Daniel Glez-Peña

IP Leiria, June 3 2009

Page 2: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

Agenda

1.1. BioinformaticsBioinformaticsDefinition, major research areas, databases

2.2. Machine Learning for bioinformaticsMachine Learning for bioinformaticsAlgorithm types, examples in bioinformatics

3.3. DNA MicroarraysDNA MicroarraysTechnological overview

4.4. ApplicationsApplicationsGeneCBR and WhichGenes?

Page 3: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

Bioinformatics

Page 4: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

Bioinformatics

BioinformaticsBioinformatics Machine Learning Microarrays ApplicationsMachine Learning Microarrays Applications

““Application of the Information Application of the Information Technologies to the field of Technologies to the field of molecular biologymolecular biology””

Creation and enhancement of:Creation and enhancement of:− Databases with biological information

− Algorithms

− Statistical techniques

……to solve formal and practical problems to solve formal and practical problems arising from the management and analysis of arising from the management and analysis of biological databiological data

Page 5: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

Major research areas

GENOMICSGENOMICS− Sequence analysis

− Genome annotation

− Analysis of mutations in cancer

PROTEOMICSPROTEOMICS− Protein-protein docking

− Analysis of protein expression

− Prediction of protein structure

MICROARRAYSMICROARRAYS− Analysis of gene expression

− Genetic network induction

TEXT MININGTEXT MINING− Gene annotation

− Protein annotation

− Relation extraction

EVOLUTIONEVOLUTION− Phylogenetic reconstruction

− Comparative genomics

SYSTEMS BIOLOGYSYSTEMS BIOLOGY− Modelling biological systems

OTHEROTHER− Image Analysis

BioinformaticsBioinformatics Machine Learning Microarrays ApplicationsMachine Learning Microarrays Applications

Page 6: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

Major research areasLarrañaga et al (2005), Briefings in Bioinformatics 7(1):86-112

2009

2007

2005

2008

2009

Page 7: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

Molecular biology dogma

ACTTGTCATGGCGACTGTCCTTTGTGC…

RNA

Trascription

& splicing Protein Sequence

Translation

MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLS…

Nucleotide sequence

Transcripts Aminoacid sequence

FoldingProtein

3Dstructure

GENOMICS

Interacts

Set of Reactions

DNA involved Pathway

PROTEOMICS METABOLOMICS

Interactomics

Sequence analysisGenome

annotationAnalysis of

mutations

EvolutionPhylogenetic

reconstructionComparative

genomics

Protein expression analysis

[mass spectometry]

Protein structure prediction [folding]

Gene expression analysis

[DNA microarray]

Protein interaction prediction

[3D docking]

Modelling biological systems

Functional analylis

BioinformaticsBioinformatics Machine Learning Microarrays ApplicationsMachine Learning Microarrays Applications

Page 8: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

Databases

Genomics ProteomicsSequences

Genomes

Cene-centric

Experimental data

Ontologies

Prot-Prot interactions

Pathways

Bibliome

Proteins

Structure

Domains

Interactomics & Metabolomics

BioinformaticsBioinformatics Machine Learning Microarrays ApplicationsMachine Learning Microarrays Applications

Page 9: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

Machine Learning for Bioinformatics

Page 10: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

Machine Learning & Bioinformatics

CLASSIFICATION (SUPERVISED LEARNING)CLASSIFICATION (SUPERVISED LEARNING)

CLUSTERING (UNSUPVERVISED LEARNING)CLUSTERING (UNSUPVERVISED LEARNING)

GRAPHICAL PROBABILISTIC MODELSGRAPHICAL PROBABILISTIC MODELS

OPTIMIZATIONOPTIMIZATION

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 11: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

ML & Bioinformatics: Classification

Classification (supervised learning)Classification (supervised learning)

− Given a set of “instances”, each one with a set of measured “attributtes” and a “outcome” value we want to train a model that predicts the outcome in further problem instances

If the “outcome” is discrete (typical 2 o more different values) we are talking about classification (if not: regression)

Training data

Test data

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 12: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

ML & Bioinformatics: Classification

ClassificationClassification

− Feature subset selection.Are all input attributes useful?

Advantages: reduced cost in data adquisition, improved uniderstability of the model, faster training, and better accuracy

It is a search space problem (2n-1), in general:− 1. Generate a subset

[brute force, deterministic/not deterministic heuristic search]− 2. Evaluate subset

Statistical estimation: Information Gain, X2, t-test, DFP, CFSWrapper (use classifier accuracy in training set)

− 3. if (!halt_condition) GOTO 1

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 13: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

ML & Bioinformatics: Classification

ClassificationClassification− Popular techniques

Logistic regressionLinear discriminant analysis (LDA)Bayesian classifiers: Naive Bayes, semi-NB, Tree augmented NB, k dependence Bayesian…Classification trees: CART, C4.5, RandomForest, J48…K-Nearest NeighboursSupport Vector MachinesMeta: Bagging, Boosting

Classification Tree kNN classifier Support Vector Machine

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 14: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

ML & Bioinformatics: Classification

Examples of classification in Bioinformatics (I)Examples of classification in Bioinformatics (I)

− GenomicsGene finding (if a sequence is a coding region)

Splice site prediction (if a sequence is a splice site)

Predict disease genes (from i.e. its sequence length?)

Prediction of mutation (SNP) effect

Cancer prediction from gene expression (microarrays)

− ProteomicsPrediction of secondary structure (alpha-helix, beta-sheet,etc.)

Prediction of sub-cellular location of the protein

Cancer prediction from protein expression (mass spectra)

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 15: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

ML & Bioinformatics: Classification

Examples of classification in Bioinformatics (and II)Examples of classification in Bioinformatics (and II)

− Systems biologyPredict the cell migration speed (high, low) from the phosphorilation levels of signalling proteins

Predict a gene regulatory level (up-regulated or down-regulated given the ‘related’ genes expression)

− Text miningProtein/gene recognition in biomedical literature (is this word a gene/protein given some word features: ortographic, part-of-speech, suffix, trigger words, etc…??)

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 16: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

ML & Bioinformatics: Clustering

ClusteringClustering

− Partition a set of “instances” in several groups (clusters) given the differences between them

Their are based on “distances” between instances that is a problem-dependant issue

− Typical: Euclidean, Pearson, Sperman

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 17: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

ML & Bioinformatics: Clustering

ClusteringClustering

− Popular techniques

Partition clustering: k-means, SOM, GCS, PAM

Hierarchical clustering with single-linkage, complete linkage, centroid linkage and wards-criterion

− They produce the popular “dendograms”

Model-based clusteringPartition clustering Hierarchical clustering

(dendogram)

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 18: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

ML & Bioinformatics: Clustering

Clustering in BioinformaticsClustering in Bioinformatics

− Mainly applied to analyze gene expression dataCo-Expression detection (group genes with similar expression)

Subclass discovery (group samples given the expression of its genes)

Expression data visualization/summarization with dendograms

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 19: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

ML & Bioinformatics: Probabilistic graphical models

DAGs where nodes are DAGs where nodes are random variables and links random variables and links are probabilities from any are probabilities from any kind of conditional kind of conditional dependencedependence

− ExamplesHidden Markov Models

Bayesian Networks

Bayesian Network

Hidden Markov Model

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 20: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

ML & Bioinformatics:Probabilistic graphical models

Probabilistic Graph Models in BioinformaticsProbabilistic Graph Models in Bioinformatics

− GenomicsHMM to gene finding (does a gene sequence come from a coding or a non coding DNA region?)

Bayesian networks to detect splice sites (does a gene sequence come from a splice-site)

− Systems BiologyInference of regulatory genetic networks. Bayesian networks to expression pattern recognition (which genes cause other genes to express?)

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 21: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

ML & Bioinformatics: Optimization

OptimizationOptimization− Search of the best solution in a huge (exponential) space.− Popular techniques

Exact optimization− Brute force

Deterministic− Hill climbing, local optimization

Stochastic− Monte Carlo− Simulated Annealing− Tabu search− Evolutionary

Genetic algorithmsGenetic ProgrammingEstimation of probability

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 22: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

ML & Bioinformatics: Optimization

Optimization techniques in BioinformaticsOptimization techniques in Bioinformatics− Genomics

Multiple sequence alignment (used almost all optimization algorithms)Splice site prediction with estimation of distribution algorithmsDNA sequencingCluster microarray data

− ProteomicsProtein folding (predict 3D structure)Protein side-chain prediction (determine the optimal set of ‘angles’ in the 3D structure that minimize the energy)

− Systems BiologyInference of gene networks and estimate the parameters of bioprocesses

− EvolutionInference of phylogenetic treesHaplotype reconstruction

Bioinformatics Bioinformatics Machine Learning Machine Learning Microarrays ApplicationsMicroarrays Applications

Page 23: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

DNA Microarrays

Page 24: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

DNA Microarrays

DNA microarray. Objetive: Measure gene expressionDNA microarray. Objetive: Measure gene expression

DescriptionDescription

− Matrix with measures the expression of thousands of genes simultaneously

− Gives a “global” vision of gene activity, and allows comparison

Between different individuals

Same individual at different times

Different tissues

24/25Bioinformatics Bioinformatics Machine Learning Machine Learning MicroarraysMicroarrays ApplicationsApplications

Page 25: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

DNA Microarrays

• How it works– DNA fragments are spotted or

printed in probes on the array surface

• Each probe is a gene – Hibridation is performed with a

sample putted onto the array– A scanner measures the intensity in

each probe

microarray

DNA fragments sample

hibridation

Image measurement

scanner

Microarray data

adquisition

Dataset

25/25Bioinformatics Bioinformatics Machine Learning Machine Learning MicroarraysMicroarrays ApplicationsApplications

Page 26: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

DNA Microarrays

Human Genome U133Human Genome U133

− HG U133A, HG U133B

− 22.000 probes aprox. ( ≅1 probe x gen)

Human Genome U133 plusHuman Genome U133 plus

− 44.000 probes (≅2 probes x gen)

Exon arrayExon array

− 1.4 millions of probes (≅16 probes x gen)

26/25Bioinformatics Bioinformatics Machine Learning Machine Learning MicroarraysMicroarrays ApplicationsApplications

Page 27: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

DNA microarrays

Typical analyses & ML TechniquesTypical analyses & ML Techniques

− Gene-based analysis

Co-expression detection with clustering techniques (unsupervised)

− Differential gene expression analysis

Detect which genes has a significant expression variation among samples of two or more conditions (feature selection)

− Sample-based analysis

Class predicion with classification techniques (supervised)

Class discovery with clustering techniques (unsupervised)

− Problems:

Huge number of features (thousands of genes) y low number of samples (dozens) V.S. Machine Learning

High false positive rate

Bioinformatics Bioinformatics Machine Learning Machine Learning MicroarraysMicroarrays ApplicationsApplications

Page 28: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

DNA microarrays

Functional interpratation after data analysisFunctional interpratation after data analysis

− Typically we have a list of genes of interest (ie. differentially expressed)

− Question: who are those genes?

− Solution: Use the available gene annotations (Gene Ontology, Pathways, etc) and see if there is a correlation with a functional module.

They answer to the question: Are my genes significantly chosen from a given gene function? If so, which function?

On-line tools− List-based: FatiGO, DAVID, Pathjam− Gene-set based: GSEA, FatiScan

Bioinformatics Bioinformatics Machine Learning Machine Learning MicroarraysMicroarrays ApplicationsApplications

Page 29: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

Sample applications

Page 30: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

geneCBR

Translational tool for DNA microarrayTranslational tool for DNA microarray--based based diagnosticsdiagnostics

− www.genecbr.org

− Glez-Peña et al. BMC Bioinformatics 10:37 2007

− Classification guided by a clustering algorithm GCS

Bioinformatics Bioinformatics Machine Learning MicroarraysMachine Learning Microarrays ApplicationsApplications

Page 31: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

WhichGenes?

OnOn--line geneset building toolline geneset building tool

− Create your own genesets from multiple datasources and use them in your favourite geneset-based analysis tools like GSEA

− www.whichgenes.org

− Glez-Peña et al. Nucleic Acids Res (web server issue) 2009

Bioinformatics Bioinformatics Machine Learning MicroarraysMachine Learning Microarrays ApplicationsApplications

Page 32: Bioinformatics & Machine LearningBioinformatics Bioinformatics Machine Learning Microarrays Applications z“Application of the Information Technologies to the field of molecular biology”

Questions?