machine learning methods for rna-seq-based transcriptome...

106
Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction Gunnar R¨ atsch Friedrich Miescher Laboratory Max Planck Society, T¨ ubingen, Germany NGS Bioinformatics Meeting, Paris (March 24, 2010) c Gunnar R¨ atsch (FML, T¨ ubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 1 / 36

Upload: others

Post on 19-Aug-2020

29 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Machine Learning Methods for RNA-seq-based

Transcriptome Reconstruction

Gunnar Ratsch

Friedrich Miescher Laboratory

Max Planck Society, Tubingen, Germany

NGS Bioinformatics Meeting, Paris (March 24, 2010)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 1 / 36

Page 2: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Motivation

Discovery of the Nuclein(Friedrich Miescher, 1869) fml

Discovery of Nuclein:from lymphocyte & salmon

“multi-basic acid” (≥ 4)

Tubingen, around 1869

“If one . . . wants to assume that a single substance . . . is the specificcause of fertilization, then one should undoubtedly first and foremostconsider nuclein” (Miescher, 1874)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 2 / 36

Page 3: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Motivation

Discovery of the Nuclein(Friedrich Miescher, 1869) fml

Discovery of Nuclein:from lymphocyte & salmon

“multi-basic acid” (≥ 4)

Tubingen, around 1869

“If one . . . wants to assume that a single substance . . . is the specificcause of fertilization, then one should undoubtedly first and foremostconsider nuclein” (Miescher, 1874)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 2 / 36

Page 4: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Motivation

Learning about the Transcriptome What is encoded on the genome and how is it processed? fml

Computational Point of View

How to learn to predict what these processes accomplish?

How well can we predict it from the available information?

Biological View

What can we not predict yet? What is missing?

Can we derive a deeper understanding of these processes?c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 36

Page 5: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Motivation

Learning about the Transcriptome What is encoded on the genome and how is it processed? fml

Computational Point of View

How to learn to predict what these processes accomplish?

How well can we predict it from the available information?

Biological View

What can we not predict yet? What is missing?

Can we derive a deeper understanding of these processes?c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 36

Page 6: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Motivation

Learning about the Transcriptome What is encoded on the genome and how is it processed? fml

Computational Point of View

How to learn to predict what these processes accomplish?

How well can we predict it from the available information?

Biological View

What can we not predict yet? What is missing?

Can we derive a deeper understanding of these processes?c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 36

Page 7: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Motivation

Learning about the Transcriptome What is encoded on the genome and how is it processed? fml

Computational Point of View

How to learn to predict what these processes accomplish?

How well can we predict it from the available information?

Biological View

What can we not predict yet? What is missing?

Can we derive a deeper understanding of these processes?c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 36

Page 8: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Motivation

Machine LearningLearning from empirical observations fmlGiven: Observations of some complex phenomenonGoal: Learn from data & build predictive models

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 36

Page 9: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Motivation

Machine LearningLearning from empirical observations fmlGiven: Observations of some complex phenomenonGoal: Learn from data & build predictive models

Example:

Two different classes of observations

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 36

Page 10: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Motivation

Machine LearningLearning from empirical observations fmlGiven: Observations of some complex phenomenonGoal: Learn from data & build predictive models

Example:

Inferred classification rule

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 36

Page 11: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Motivation

Machine LearningLearning from empirical observations fmlGiven: Observations of some complex phenomenonGoal: Learn from data & build predictive models

1 Large scale sequence classification

2 Analysis and explanation of learning results

3 Sequence segmentation & structure prediction

Position

k−m

er L

engt

h

−30 −20 −10 0 10 20 30

87654321

0

5

10

Log-

inte

nsity

transcript

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 36

Page 12: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

RNA-Seq Pipeline

Deep RNA Sequencing (RNA-Seq)fml

RNA-Seq allows . . .

High-throughputtranscriptomemeasurements

Qualitative studies

New transcriptsImproved gene models

Quantitative studies athigh resolution

Differential expressionin tissues, conditions,genotypes, etc.

pre-mRNAintron

exon

mRNAshort reads

junction reads

reference genomeFigure adapted from Wikipedia

Goal: Obtain complete transcriptome for further analyses

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 5 / 36

Page 13: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

RNA-Seq Pipeline

Deep RNA Sequencing (RNA-Seq)fml

RNA-Seq allows . . .

High-throughputtranscriptomemeasurements

Qualitative studies

New transcriptsImproved gene models

Quantitative studies athigh resolution

Differential expressionin tissues, conditions,genotypes, etc.

pre-mRNAintron

exon

mRNAshort reads

junction reads

reference genomeFigure adapted from Wikipedia

Goal: Obtain complete transcriptome for further analyses

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 5 / 36

Page 14: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

RNA-Seq Pipeline

Deep RNA Sequencing (RNA-Seq)fml

RNA-Seq allows . . .

High-throughputtranscriptomemeasurements

Qualitative studies

New transcriptsImproved gene models

Quantitative studies athigh resolution

Differential expressionin tissues, conditions,genotypes, etc.

pre-mRNAintron

exon

mRNAshort reads

junction reads

reference genomeFigure adapted from Wikipedia

Goal: Obtain complete transcriptome for further analyses

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 5 / 36

Page 15: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

RNA-Seq Pipeline

Deep RNA Sequencing (RNA-Seq)fml

RNA-Seq allows . . .

High-throughputtranscriptomemeasurements

Qualitative studies

New transcriptsImproved gene models

Quantitative studies athigh resolution

Differential expressionin tissues, conditions,genotypes, etc.

pre-mRNAintron

exon

mRNAshort reads

junction reads

reference genomeFigure adapted from Wikipedia

Goal: Obtain complete transcriptome for further analyses

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 5 / 36

Page 16: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

RNA-Seq Pipeline

Common RNA-Seq Analysis Stepsfml

RNA-SeqReads

Read Alignment

Expression level~ #reads

Significance Testing

Differentially Processed Transcripts/

Segments

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 6 / 36

Page 17: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

RNA-Seq Pipeline

Common RNA-Seq Analysis Stepsfml

RNA-SeqReads

Read Alignment

Expression level~ #reads

Significance Testing

Differentially Processed Transcripts/

Segments

De novo Assembly

Transcript Quantitation

TARs and Transcripts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 6 / 36

Page 18: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

RNA-Seq Pipeline

Common RNA-Seq Analysis Stepsfml

RNA-SeqReads

Read Alignment

Expression level~ #reads

Significance Testing

Differentially Processed Transcripts/

Segments

De novo Assembly

Transcript Quantitation

TARs and Transcripts

Transcript Reconstruction

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 6 / 36

Page 19: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

RNA-Seq Pipeline

RNA-Seq Pipeline Overviewfml

PALMapper

mGene.ngs

mTim Segmentation

rQuant

Short Reads Transcripts w/ QuantitationAlignm

ents

Transcripts

Read alignment Transcript finding Quantitation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 7 / 36

Page 20: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

RNA-Seq Pipeline

RNA-Seq Pipeline Overviewfml

PALMapper

mGene.ngs

mTim Segmentation

rQuant

Short Reads Transcripts w/ QuantitationAlignm

ents

Transcripts

Read alignment Transcript finding Quantitation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 7 / 36

Page 21: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Overview

Step 1: PALMapper Read Alignment(PALMapper = QPALMA + GenomeMapper) fmlGenomeMapper for (unspliced) read mapping:

Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project [Schneeberger et al., 2009]

k-mer based index, well suited for smaller genomes

More info: http://fml.mpg.de/raetsch/suppl/palmapper

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 36

Page 22: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Overview

Step 1: PALMapper Read Alignment(PALMapper = QPALMA + GenomeMapper) fmlGenomeMapper for (unspliced) read mapping:

Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project [Schneeberger et al., 2009]

k-mer based index, well suited for smaller genomes

ACCGTCGCGCGCGT...TCGGCG...AGAACGCT

TCGCGCGCAACGTCGC CGCG GCGC CGCG GCGC CGCA GCAA CAAC AACG

...matchingk-mers

DNA

Read

k-mers

...

More info: http://fml.mpg.de/raetsch/suppl/palmapper

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 36

Page 23: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Overview

Step 1: PALMapper Read Alignment(PALMapper = QPALMA + GenomeMapper) fmlGenomeMapper for (unspliced) read mapping:

Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project [Schneeberger et al., 2009]

k-mer based index, well suited for smaller genomes

ACCGTCGCGCGCGT...TCGGCG...AGAACGCT

TCGCGCGCAACGTCGC CGCG GCGC CGCG GCGC CGCA GCAA CAAC AACG

...matchingk-mers

DNA

Read

k-mers

...

More info: http://fml.mpg.de/raetsch/suppl/palmapper

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 36

Page 24: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Overview

Step 1: PALMapper Read Alignment(PALMapper = QPALMA + GenomeMapper) fmlGenomeMapper for (unspliced) read mapping:

Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project [Schneeberger et al., 2009]

k-mer based index, well suited for smaller genomes

QPALMA for spliced read alignments:

GenomeMapper identifies seed regions

Spliced alignments by QPALMA [De Bona et al., 2008]

ACCGTCGCGCGCGT...TCGGCG...AGAACGCT

TCGCGCGCAACG

DNA

ReadMore info: http://fml.mpg.de/raetsch/suppl/palmapper

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 36

Page 25: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Accuracy

PALMapper Accuracy EvaluationHow accurately can PALMapper identify introns? fml

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1C. elegans

TopHat

TopHat sensitive

Palmapper

recall precision

PALMapper (3.5h) andTopHat (3.5h/10h) align-ing 24M reads

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 9 / 36

Page 26: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Accuracy

PALMapper Accuracy EvaluationHow accurately can PALMapper identify introns? fml

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1C. elegans

TopHat

TopHat sensitive

Palmapper

recall precision

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1H. sapiens (chromosome 1)

Palmapper

recall precision

PALMapper (3.5h) andTopHat (3.5h/10h) align-ing 24M reads

Comparison of PALMapper with other align-ment programs within the RGASP project(preliminary)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 9 / 36

Page 27: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Accuracy

QPALMA: Extended Smith-Waterman Scoringfml

Classical scoring f : Σ× Σ→ R

Source of information

Sequence matches

Computational splicesite predictions

Intron length model

Read qualityinformation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 10 / 36

Page 28: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Accuracy

QPALMA: Extended Smith-Waterman Scoringfml

Classical scoring f : Σ× Σ→ R

Source of information

Sequence matches

Computational splicesite predictions

Intron length model

Read qualityinformation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 10 / 36

Page 29: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Accuracy

QPALMA: Extended Smith-Waterman Scoringfml

Classical scoring f : Σ× Σ→ R

Source of information

Sequence matches

Computational splicesite predictions

Intron length model

Read qualityinformation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 10 / 36

Page 30: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Accuracy

QPALMA: Extended Smith-Waterman Scoringfml

Quality scoring f : (Σ× R)× Σ→ R [De Bona et al., 2008]

Source of information

Sequence matches

Computational splicesite predictions

Intron length model

Read qualityinformation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 10 / 36

Page 31: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Learning Algorithm

Scoring Parameter Inferencefml

What are optimal parameters?

How do we jointly optimize the 336 parameters?

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 11 / 36

Page 32: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Learning Algorithm

Scoring Parameter Inferencefml

What are optimal parameters?

How do we jointly optimize the 336 parameters?

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 11 / 36

Page 33: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Learning Algorithm

Cartoon: Maximize the Marginfml

truefalse false

Correct alignment is not highest scoring one

Correct alignment is highest scoring one

Can we do better?

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 12 / 36

Page 34: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Learning Algorithm

Cartoon: Maximize the Marginfml

truefalse

Correct alignment is not highest scoring one

Correct alignment is highest scoring one

Can we do better?

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 12 / 36

Page 35: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Learning Algorithm

Cartoon: Maximize the Marginfml

false true

Technique motivated by SVMs (“large-margin”)

Enforce a margin between correct and incorrect examples

One has to solve a big quadratic problem

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 12 / 36

Page 36: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Learning Algorithm

How Can We Generate Data for Training?fml

How do we obtain true alignments for training QPalma?

Simulate realistic transcriptome reads with known origin

Strategy:

1 Estimate relationship between quality score and error probabilityfrom given reads

2 Use annotation of a few genes to simulate spliced reads

3 Introduce errors according to error model using quality stringsfrom given read set

4 Train QPalma on generated read set with known alignments

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 13 / 36

Page 37: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Learning Algorithm

How Can We Generate Data for Training?fml

How do we obtain true alignments for training QPalma?

Simulate realistic transcriptome reads with known origin

Strategy:

1 Estimate relationship between quality score and error probabilityfrom given reads

2 Use annotation of a few genes to simulate spliced reads

3 Introduce errors according to error model using quality stringsfrom given read set

4 Train QPalma on generated read set with known alignments

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 13 / 36

Page 38: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Reasons for High Accuracy

QPALMA RNA-Seq Read Alignmentfml

Generate set of artificially spliced readsGenomic reads with quality informationGenome annotation for artificially splicing the readsUse 10, 000 reads for training and 30, 000 for testing

Alig

nmen

t Er

ror

Rat

e

SmithW14.19%

Intron9.96% 1.94% 1.78%

Intron+Splice

Intron+Splice+Quality [De Bona et al., 2008]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 14 / 36

Page 39: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Reasons for High Accuracy

QPALMA RNA-Seq Read Alignmentfml

Generate set of artificially spliced readsGenomic reads with quality informationGenome annotation for artificially splicing the readsUse 10, 000 reads for training and 30, 000 for testing

Alig

nmen

t Er

ror

Rat

e

SmithW14.19%

Intron9.96% 1.94% 1.78%

Intron+Splice

Intron+Splice+Quality

Error vs. intron position

[De Bona et al., 2008]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 14 / 36

Page 40: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Reasons for High Accuracy

QPALMA RNA-Seq Read Alignmentfml

Generate set of artificially spliced readsGenomic reads with quality informationGenome annotation for artificially splicing the readsUse 10, 000 reads for training and 30, 000 for testing

Alig

nmen

t Er

ror

Rat

e

SmithW14.19%

Intron9.96% 1.94% 1.78%

Intron+Splice

Intron+Splice+Quality

8nt12nt12nt8ntError vs. intron position

[De Bona et al., 2008]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 14 / 36

Page 41: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

PALMapper Read Alignment Reasons for High Accuracy

Step 2: Transcript Predictionfml

PALMapper

mGene.ngs

mTim Segmentation

rQuant

Short Reads Transcripts w/ QuantitationAlignm

ents

Transcripts

Read alignment Transcript finding Quantitation

A. Coverage segmentation algorithm mTiM for general transcripts(no coding bias/assumption)

B. Extension of the mGene gene finding system to use NGS datafor protein coding transcript prediction (mGene.ngs)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 15 / 36

Page 42: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Idea

mTiM: Read Coverage Segmentationfml

Goal: Characterize each base as intergenic, exonic, or intronic

0

10

100

read

cov

erag

e (lo

g-sc

ale)

5 10 15 20 25Kbp

0

Kbpc© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 16 / 36

Page 43: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Idea

mTiM: Read Coverage Segmentationfml

Goal: Characterize each base as intergenic, exonic, or intronic

0

10

100

read

cov

erag

e (lo

g-sc

ale)

5 10 15 20 25Kbp

0

Kbpc© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 16 / 36

Page 44: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Idea

mTiM: Read Coverage Segmentationfml

Goal: Characterize each base as intergenic, exonic, or intronic

0

10

100

annotated gene

read

cov

erag

e (lo

g-sc

ale)

5 10 15 20 25Kbp

0

Kbpc© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 16 / 36

Page 45: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Approach

The mTiM Segmentation Approachfml

intergenic exonic intronic

E IS

Learn to associate a state with each position given its readcoverage and local contextHM-SVM training: Optimize transformations: signal → scoreExtension: Score spliced reads and splice sites

(G. Zeller et al., 2008; G. Zeller et al., in prep., 2009)c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 17 / 36

Page 46: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Approach

The mTiM Segmentation Approachfml

0

0.2

0.4

0.6

0.8

1

5 6 7 8 9 10 11 12 13 14

read coverage

0

0.2

0.4

5 6 7 8 9 10 11 12 13 14

read coverage

0

0.5

1

feat

ure

scor

e

5 6 7 8 9 10 11 12 13 14

read coverage

intergenic exonic intronic

E IS

Learn to associate a state with each position given its readcoverage and local contextHM-SVM training: Optimize transformations: signal → scoreExtension: Score spliced reads and splice sites

(G. Zeller et al., 2008; G. Zeller et al., in prep., 2009)c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 17 / 36

Page 47: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Approach

The mTiM Segmentation Approachfml

intergenic exonic intronic

E ISacc

don

0

0.2

0.4

0.6

0.8

1

5 6 7 8 9 10 11 12 13 14

read coverage

0

0.2

0.4

5 6 7 8 9 10 11 12 13 14

read coverage

0

0.5

1

feat

ure

scor

e

5 6 7 8 9 10 11 12 13 14

read coverage

Learn to associate a state with each position given its readcoverage and local contextHM-SVM training: Optimize transformations: signal → scoreExtension: Score spliced reads and splice sites

(G. Zeller et al., 2008; G. Zeller et al., in prep., 2009)c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 17 / 36

Page 48: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Approach

The mTiM Segmentation Approachfml

Idea: Assume uniform read coverage within exons of same transcript

0

10

100

annotated genes

read

cov

erag

e (lo

g-sc

ale)

5 10 15 20 25Kbp

0

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 18 / 36

Page 49: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Approach

The mTiM Segmentation Approachfml

Discreteexpression level

1

2

Q

. . .

intergenic exonic intronic

S

EQ

E2

E1

IQ

I2

I1

acc

don

acc

don

acc

don

. . .

. . .

. . .

Carry “expression level” information between exons of same transcript

(G. Zeller et al., 2008; G. Zeller et al., in prep., 2010)c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 19 / 36

Page 50: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Approach

Discriminative training of HM-SVMsfml

f : R? → Σ?

given a sequence of hybridization measurements χ ∈ R?

predicts a state sequence (path) σ ∈ Σ?

Discriminant function Fθ : R? × Σ? → R such that fordecoding: f (χ) = argmax

σ∈S?

Fθ(χ, σ).

Training:For each training example (χ(i), σ(i)), enforce a large margin ofseparation

Fθ(χ(i), σ(i))− Fθ(χ

(i), σ) ≥ ρ

between the correct path σ(i) and any other wrong path σ 6= σ(i).

A quadratic programming problem (QP) is solved to optimize θ.[Altun et al., 2003, Ratsch et al., 2007, Zeller et al., 2008b]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 20 / 36

Page 51: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Approach

Discriminative training of HM-SVMsfml

f : R? → Σ?

given a sequence of hybridization measurements χ ∈ R?

predicts a state sequence (path) σ ∈ Σ?

Discriminant function Fθ : R? × Σ? → R such that fordecoding: f (χ) = argmax

σ∈S?

Fθ(χ, σ).

Training:For each training example (χ(i), σ(i)), enforce a large margin ofseparation

Fθ(χ(i), σ(i))− Fθ(χ

(i), σ) ≥ ρ

between the correct path σ(i) and any other wrong path σ 6= σ(i).

A quadratic programming problem (QP) is solved to optimize θ.[Altun et al., 2003, Ratsch et al., 2007, Zeller et al., 2008b]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 20 / 36

Page 52: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Approach

Discriminative training of HM-SVMsfml

f : R? → Σ?

given a sequence of hybridization measurements χ ∈ R?

predicts a state sequence (path) σ ∈ Σ?

Discriminant function Fθ : R? × Σ? → R such that fordecoding: f (χ) = argmax

σ∈S?

Fθ(χ, σ).

Training:For each training example (χ(i), σ(i)), enforce a large margin ofseparation

Fθ(χ(i), σ(i))− Fθ(χ

(i), σ) ≥ ρ

between the correct path σ(i) and any other wrong path σ 6= σ(i).

A quadratic programming problem (QP) is solved to optimize θ.[Altun et al., 2003, Ratsch et al., 2007, Zeller et al., 2008b]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 20 / 36

Page 53: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Results

Preliminary Evaluation (C. elegans)fml

CDS (precision+recall)/2

expression percentiles [%]

mTiM

10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Sensitivity heavily depends on read densityc© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 21 / 36

Page 54: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

mTIM Coverage Segmentation Results

Preliminary Evaluation (C. elegans)fml

CDS (precision+recall)/2

expression percentiles [%]

mTiM

10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Sensitivity heavily depends on read densityc© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 21 / 36

Page 55: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Idea

Computational Gene Finding Labeling the Genome fml

DNA

pre-mRNA

mRNA

Protein

polyAcap

TSS

SpliceDonor

SpliceAcceptor

SpliceDonor

SpliceAcceptor

TIS Stop

polyA/cleavage

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 22 / 36

Page 56: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Idea

Computational Gene Finding Labeling the Genome fml

DNA

pre-mRNA

mRNA

Protein

polyAcap

TSS Donor Acceptor Donor Acceptor

TIS Stop

polyA/cleavage

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 22 / 36

Page 57: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Idea

Computational Gene Finding Labeling the Genome fml

DNA

pre-mRNA

mRNA

Protein

polyAcap

TSS Donor Acceptor Donor Acceptor

TIS Stop

polyA/cleavage

TSS TIS cleaveStop

Don Acc

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 22 / 36

Page 58: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Idea

mGene-based Transcript Predictionfml

acc

don

tss

tis

stop

True gene model 2 3 4 5

STEP 1: SVM Signal Predictions

genomic position

genomic position

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 23 / 36

Page 59: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Idea

mGene-based Transcript Predictionfml

acc

don

tss

tis

stop

True gene model 2 3 4 5

F(x,y)

transform features

STEP 1: SVM Signal Predictions

STEP 2: Integration

genomic position

genomic position

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 23 / 36

Page 60: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Idea

mGene-based Transcript Predictionfml

acc

don

tss

tis

stop

True gene model 2 3 4 5

Wrong gene model

large margin

F(x,y)

transform features

STEP 1: SVM Signal Predictions

STEP 2: Integration

genomic position

genomic position

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 23 / 36

Page 61: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Modeling Uncertainty

Learning to use Expression Measurementsfml

Two approaches:

Heuristic to incorporate ESTs/reads/tiling array measurementsto refine predictions

Directly use evidence during learning to learn to appropriatelyweight its importance

Exon Level Transcript LevelSN SP F SN SP F

ab initio 82.3 82.6 82.5 43.1 49.5 46.1ESTs heuristic 85.3 84.7 85.0 49.5 56.4 52.7ESTs trained 84.8 85.8 85.3 50.5 57.8 53.9

Gene prediction in C. elegans (CDS evaluation)

Behr et al., in pre., 2010

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 24 / 36

Page 62: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Modeling Uncertainty

mGene-based Transcript Predictionfml

acc

don

tss

tis

stop

True gene model 2 3 4 5

Wrong gene model

large margin

F(x,y)

transform features

STEP 1: SVM Signal Predictions

STEP 2: Integration

genomic position

genomic position

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 25 / 36

Page 63: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Modeling Uncertainty

mGene-based Transcript Predictionfml

acc

don

tss

tis

stop

position

True gene model 2 3 4 5

Wrong gene model

large margin

score

transform SVM outputs with PLiF

STEP 1: SVM Signal Predictions

STEP 2: Integration

Expression Evidence

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 25 / 36

Page 64: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Modeling Uncertainty

Learning to use Expression Measurementsfml

Two approaches:

Heuristic to incorporate ESTs/reads/tiling array measurementsto refine predictions

Directly use evidence during learning to learn to appropriatelyweight its importance

Exon Level Transcript LevelSN SP F SN SP F

ab initio 82.3 82.6 82.5 43.1 49.5 46.1ESTs heuristic 85.3 84.7 85.0 49.5 56.4 52.7ESTs trained 84.8 85.8 85.3 50.5 57.8 53.9RNA-Seq trained 84.6 84.9 84.8 49.1 55.2 52.0RNA-Seq/ESTs trained 84.7 86.9 85.8 50.3 60.5 54.9

Gene prediction in C. elegans (CDS evaluation)

Behr et al., in prep., 2010

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 26 / 36

Page 65: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Results

Preliminary Evaluation (C. elegans)fml

CDS (precision+recall)/2

expression percentiles [%]

mGene ab initiomGene.ngs

10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 27 / 36

Page 66: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Results

Preliminary Evaluation (C. elegans)fml

CDS (precision+recall)/2

expression percentiles [%]

mTiMmGene ab initiomGene.ngs

10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 27 / 36

Page 67: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Alternative Transcripts

Digestionfml

mTiM and mGene.ngs predict single transcripts

mTiM exploits “uniformity” of read coverage among exons ofsame transcript

mGene.ngs uses more assumptions on structure of transcripts

Alt. Transcripts: Spliced reads for splicing graph completion:

Paths through splicing graph define alternative transcripts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 36

Page 68: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Alternative Transcripts

Digestionfml

mTiM and mGene.ngs predict single transcripts

mTiM exploits “uniformity” of read coverage among exons ofsame transcript

mGene.ngs uses more assumptions on structure of transcripts

Alt. Transcripts: Spliced reads for splicing graph completion:

Paths through splicing graph define alternative transcripts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 36

Page 69: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Alternative Transcripts

Digestionfml

mTiM and mGene.ngs predict single transcripts

mTiM exploits “uniformity” of read coverage among exons ofsame transcript

mGene.ngs uses more assumptions on structure of transcripts

Alt. Transcripts: Spliced reads for splicing graph completion:

Transcript prediction(result of mGene/mTIM)

Spliced reads (result of QPALMA)

Paths through splicing graph define alternative transcripts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 36

Page 70: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Alternative Transcripts

Digestionfml

mTiM and mGene.ngs predict single transcripts

mTiM exploits “uniformity” of read coverage among exons ofsame transcript

mGene.ngs uses more assumptions on structure of transcripts

Alt. Transcripts: Spliced reads for splicing graph completion:

Transcript prediction(result of mGene/mTIM)

Spliced reads (result of QPALMA)

Infered edges

Paths through splicing graph define alternative transcripts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 36

Page 71: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Next Generation Gene Finding Alternative Transcripts

Digestionfml

mTiM and mGene.ngs predict single transcripts

mTiM exploits “uniformity” of read coverage among exons ofsame transcript

mGene.ngs uses more assumptions on structure of transcripts

Alt. Transcripts: Spliced reads for splicing graph completion:

Transcript prediction(result of mGene/mTIM)

Spliced reads (result of QPALMA)

Infered edges

Paths through splicing graph define alternative transcripts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 36

Page 72: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant

RNA-Seq Pipeline Overviewfml

PALMapper

mGene.ngs

mTim Segmentation

rQuant

Short Reads Transcripts w/ QuantitationAlignm

ents

Transcripts

Read alignment Transcript finding Quantitation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 29 / 36

Page 73: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Biases

RNA-Seq Biases and Quantitationfml

Sequencer short sequence reads

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

mRNA

fragmentation

sequence library

RT

fragmen-tation

RT

exonic reads

genome

transcripts

junction reads

aligned reads

Biases due to . . .

cDNA library construction

Sequencing

Read mapping

relative transcript position 5’ -> 3’

read

cov

erag

e 0

20

40

60

80

(average over annotated

transcripts of length ≈1kb for the

C. elegans SRX001872 dataset)c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 30 / 36

Page 74: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Biases

RNA-Seq Biases and Quantitationfml

Sequencer short sequence reads

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

mRNA

fragmentation

sequence library

RT

fragmen-tation

RT

exonic reads

genome

transcripts

junction reads

aligned reads

Biases due to . . .

cDNA library construction

Sequencing

Read mapping

relative transcript position 5’ -> 3’

read

cov

erag

e 0

20

40

60

80

(average over annotated

transcripts of length ≈1kb for the

C. elegans SRX001872 dataset)c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 30 / 36

Page 75: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Approach

rQuant – Basic Ideafml

A

read

cov

erag

e

Short transcript

relative transcript position 5’ -> 3’

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 36

Page 76: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Approach

rQuant – Basic Ideafml

A

B

read

cov

erag

e

Short transcript

relative transcript position 5’ -> 3’

read

cov

erag

e

Long transcript

relative transcript position 5’ -> 3’

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 36

Page 77: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Approach

rQuant – Basic Ideafml

read

cov

erag

e

genome position 5’ -> 3’

Short transcript

read

cov

erag

e

genome position 5’ -> 3’

Long transcript

A

B

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 36

Page 78: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Approach

rQuant – Basic Ideafml

read

cov

erag

e

genome position 5’ -> 3’

Mixture of transcripts

read

cov

erag

e

genome position 5’ -> 3’

Short transcript

read

cov

erag

e

genome position 5’ -> 3’

Long transcript

MA

B

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 36

Page 79: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Approach

rQuant – Basic Ideafml

read

cov

erag

e

genome position 5’ -> 3’

Mixture of transcripts

expected

observed

read

cov

erag

e

genome position 5’ -> 3’

Short transcript

read

cov

erag

e

genome position 5’ -> 3’

Long transcript

MA

B R

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 36

Page 80: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Approach

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)2 Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3 Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

99,960 100,368 100,776 101,184 101,592

70

60

50

40

30

20

10

0

read

cou

nts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

Page 81: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Approach

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)2 Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3 Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

99,960 100,368 100,776 101,184 101,592

27.20

70

60

50

40

30

20

10

0

read

cou

nts

27.20

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

Page 82: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Approach

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)2 Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3 Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

99,960 100,368 100,776 101,184 101,592

4.80

27.20

4.80

70

60

50

40

30

20

10

0

read

cou

nts

27.20

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

Page 83: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Approach

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)2 Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3 Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

AT1G01240.3 +1,941

99,960 100,368 100,776 101,184 101,592

6.00

4.80

27.20

6.004.80

70

60

50

40

30

20

10

0

read

cou

nts

27.20

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

Page 84: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Approach

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)2 Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3 Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

AT1G01240.3 +1,941

99,960 100,368 100,776 101,184 101,592

6.00

4.80

27.20

6.004.80

70

60

50

40

30

20

10

0

read

cou

nts

27.20

38.00

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

Page 85: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Approach

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)2 Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3 Repeat 1. and 2. until convergence.

distance to transcript start/end

pro!

le w

eigh

t

0 50 200 450 800 1250 1800 2450 "2450 "1800 "1250 "800 "450 "200 "50 0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

AT1G01240.3 +1,941

99,960 100,368 100,776 101,184 101,592

6.00

4.80

27.20

6.004.80

70

60

50

40

30

20

10

0

read

cou

nts

27.20

38.00

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

Page 86: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Approach

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)2 Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3 Repeat 1. and 2. until convergence.

distance to transcript start/end

pro!

le w

eigh

t

0 50 200 450 800 1250 1800 2450 "2450 "1800 "1250 "800 "450 "200 "50 0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

AT1G01240.3 +1,941

99,960 100,368 100,776 101,184 101,592

4.95

8.35

23.99

70

60

50

40

30

20

10

0

read

cou

nts

8.354.95

23.99

37.29

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

Page 87: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Results

rQuant Evaluation Ifml

rQuant: Position-wise with profiles (estimating library and mapping bias)

compared to

Position-wise, without profiles

Segment-wise, without profiles (e.g., Jiang and Wong [2009] )

Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])

Estimate transcript abundances

Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])

Subset of alternatively spliced genes

Evaluation: Spearman correlation between

Simulated RNA expression level and

Predicted transcript weights

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 36

Page 88: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Results

rQuant Evaluation Ifml

rQuant: Position-wise with profiles (estimating library and mapping bias)

compared to

Position-wise, without profiles

Segment-wise, without profiles (e.g., Jiang and Wong [2009] )

Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])

Estimate transcript abundances

Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])

Subset of alternatively spliced genes

Evaluation: Spearman correlation between

Simulated RNA expression level and

Predicted transcript weights

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 36

Page 89: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Results

rQuant Evaluation Ifml

rQuant: Position-wise with profiles (estimating library and mapping bias)

compared to

Position-wise, without profiles

Segment-wise, without profiles (e.g., Jiang and Wong [2009] )

Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])

Estimate transcript abundances

Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])

Subset of alternatively spliced genes

Evaluation: Spearman correlation between

Simulated RNA expression level and

Predicted transcript weights

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 36

Page 90: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Results

rQuant Evaluation Ifml

rQuant: Position-wise with profiles (estimating library and mapping bias)

compared to

Position-wise, without profiles

Segment-wise, without profiles (e.g., Jiang and Wong [2009] )

Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])

Estimate transcript abundances

Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])

Subset of alternatively spliced genes

Evaluation: Spearman correlation between

Simulated RNA expression level and

Predicted transcript weights

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 36

Page 91: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Results

rQuant Evaluation Ifml

rQuant: Position-wise with profiles (estimating library and mapping bias)

compared to

Position-wise, without profiles

Segment-wise, without profiles (e.g., Jiang and Wong [2009] )

Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])

Estimate transcript abundances

Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])

Subset of alternatively spliced genes

Evaluation: Spearman correlation between

Simulated RNA expression level and

Predicted transcript weights

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 36

Page 92: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Results

rQuant Evaluation Ifml

rQuant: Position-wise with profiles (estimating library and mapping bias)

compared to

Position-wise, without profiles

Segment-wise, without profiles (e.g., Jiang and Wong [2009] )

Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])

Estimate transcript abundances

Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])

Subset of alternatively spliced genes

Evaluation: Spearman correlation between

Simulated RNA expression level and

Predicted transcript weights

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 36

Page 93: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Results

rQuant Evaluation IIfml

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Spea

rman

Cor

rela

tion

Across transcripts

Within genes (mean)

Segment-wise, w/o pro!les

(Bohnert et al., submitted, 2010)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 36

Page 94: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Results

rQuant Evaluation IIfml

Position-wise, w/o pro!les

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Spea

rman

Cor

rela

tion

Across transcripts

Within genes (mean)

Segment-wise, w/o pro!les

Segment-wise, w/ pro!les

(Bohnert et al., submitted, 2010)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 36

Page 95: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Results

rQuant Evaluation IIfml

rQuantPosition-wise,

w/ pro!lesPosition-wise, w/o pro!les

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Spea

rman

Cor

rela

tion

Across transcripts

Within genes (mean)

Segment-wise, w/o pro!les

Segment-wise, w/ pro!les

(Bohnert et al., submitted, 2010)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 36

Page 96: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Transcript Quantitation with rQuant Results

Galaxy-based Web Services for NGS Analysesfml

Galaxy-based web service http://galaxy.fml.mpg.de

PALMapper http://fml.mpg.de/raetsch/suppl/palmapper

mGene http://mgene.org/web

mTIM http://fml.mpg.de/raetsch/suppl/mtim (in prep.)

rQuant http://fml.mpg.de/raetsch/suppl/rquant/web

(Ratsch et al., in preparation, 2010)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 36

Page 97: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Summary

Summaryfml

PALMapperSplice site predictions improve alignment performanceOutperforms many other read mappers in intron accuracy

mTiMHigh specificity, sensitivity depends on read coverageBetter for identifying transcripts specific to experimental data

mGeneHigh sensitivity (also for lowly expressed genes)Identifies also non-expressed genes ⇒ good for annotation

rQuantModels library prep., sequencing, alignment biasesAccurately quantifies transcripts

Galaxy instanceEasy use of these tools

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 36

Page 98: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Summary

Summaryfml

PALMapperSplice site predictions improve alignment performanceOutperforms many other read mappers in intron accuracy

mTiMHigh specificity, sensitivity depends on read coverageBetter for identifying transcripts specific to experimental data

mGeneHigh sensitivity (also for lowly expressed genes)Identifies also non-expressed genes ⇒ good for annotation

rQuantModels library prep., sequencing, alignment biasesAccurately quantifies transcripts

Galaxy instanceEasy use of these tools

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 36

Page 99: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Summary

Summaryfml

PALMapperSplice site predictions improve alignment performanceOutperforms many other read mappers in intron accuracy

mTiMHigh specificity, sensitivity depends on read coverageBetter for identifying transcripts specific to experimental data

mGeneHigh sensitivity (also for lowly expressed genes)Identifies also non-expressed genes ⇒ good for annotation

rQuantModels library prep., sequencing, alignment biasesAccurately quantifies transcripts

Galaxy instanceEasy use of these tools

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 36

Page 100: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Summary

Summaryfml

PALMapperSplice site predictions improve alignment performanceOutperforms many other read mappers in intron accuracy

mTiMHigh specificity, sensitivity depends on read coverageBetter for identifying transcripts specific to experimental data

mGeneHigh sensitivity (also for lowly expressed genes)Identifies also non-expressed genes ⇒ good for annotation

rQuantModels library prep., sequencing, alignment biasesAccurately quantifies transcripts

Galaxy instanceEasy use of these tools

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 36

Page 101: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Summary

Summaryfml

PALMapperSplice site predictions improve alignment performanceOutperforms many other read mappers in intron accuracy

mTiMHigh specificity, sensitivity depends on read coverageBetter for identifying transcripts specific to experimental data

mGeneHigh sensitivity (also for lowly expressed genes)Identifies also non-expressed genes ⇒ good for annotation

rQuantModels library prep., sequencing, alignment biasesAccurately quantifies transcripts

Galaxy instanceEasy use of these tools

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 36

Page 102: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Summary

Acknowledgementsfml

Fabio De Bona

Alignments

Jonas Behr

Gene finding

Georg Zeller

Segmentation

Regina Bohnert

Quantitation

RGASP Team

Jonas Behr (FML)

Georg Zeller (FML & MPI)

Regina Bohnert (FML)

Funding by DFG &Max Planck Society.

Thank you for your attention!

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 37 / 36

Page 103: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Summary

Acknowledgementsfml

Fabio De Bona

Alignments

Jonas Behr

Gene finding

Georg Zeller

Segmentation

Regina Bohnert

Quantitation

RGASP Team

Jonas Behr (FML)

Georg Zeller (FML & MPI)

Regina Bohnert (FML)

Funding by DFG &Max Planck Society.

Thank you for your attention!

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 37 / 36

Page 104: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Summary

References I

Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. InProc. 20th Int. Conf. Mach. Learn., pages 3–10, 2003.

J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski,K. Schneeberger, D. Weigel, and G. Ratsch. Rna-seq and tiling arrays for improved genefinding. URL http:

//www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.Oral presentation at the CSHL Genome Informatics Meeting, September 2008.

RM Clark, G Schweikert, C Toomajian, S Ossowski, G Zeller, P Shinn, N Warthmann, TT Hu,G Fu, DA Hinds, H Chen, KA Frazer, DH Huson, B Scholkopf, M Nordborg, G Ratsch,JR Ecker, and D Weigel. Common sequence polymorphisms shaping genetic diversity inarabidopsis thaliana. Science, 317(5836):338–342, 2007. ISSN 1095-9203 (Electronic). doi:10.1126/science.1138632.

F. De Bona, S. Ossowski, K. Schneeberger, and G. Ratsch. Qpalma: Optimal splicedalignments of short sequence reads. Bioinformatics, 24:i174–i180, 2008.

Hui Jiang and Wing Hung Wong. Statistical inferences for isoform expression in RNA-Seq.Bioinformatics, 25(8):1026–1032, April 2009.

G. Ratsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. InK. Tsuda B. Schoelkopf and J.-P. Vert, editors, Kernel Methods in Computational Biology.MIT Press, 2004.

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 38 / 36

Page 105: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Summary

References II

G. Ratsch, S. Sonnenburg, and B. Scholkopf. RASE: recognition of alternatively spliced exonsin C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.

G. Ratsch, S. Sonnenburg, J. Srinivasan, H. Witte, K.-R. Muller, R. Sommer, and B. Schlkopf.Improving the c. elegans genome annotation using machine learning. PLoS ComputationalBiology, 3(2):e20, 2007. URL http://dx.doi.org/10.1371/journal.pcbi.0030020.eor.

M. Sammeth. The Flux Capacitor. Website, 2009a. http://flux.sammeth.net/capacitor.html.

M. Sammeth. The Flux Simulator. Website, 2009b. http://flux.sammeth.net/simulator.html.

Korbinian Schneeberger, Jorg Hagmann, Stephan Ossowski, Norman Warthmann, SandraGesing, Oliver Kohlbacher, and Detlef Weigel. Simultaneous alignment of short readsagainst multiple genomes. Genome Biol, 10(9):R98, Jan 2009. doi:10.1186/gb-2009-10-9-r98. URL http://genomebiology.com/2009/10/9/R98.

Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich,Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Kruger,Soren Sonnenburg, and Gunnar Ratsch. mgene: Accurate svm-based gene finding with anapplication to nematode genomes. Genome Research, 2009. URLhttp://genome.cshlp.org/content/early/2009/06/29/gr.090597.108.full.pdf+html.Advance access June 29, 2009.

S. Sonnenburg, G. Ratsch, A. Jagota, and K.-R. Muller. New methods for splice-siterecognition. In Proc. International Conference on Artificial Neural Networks, 2002.

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 39 / 36

Page 106: Machine Learning Methods for RNA-seq-based Transcriptome ...rivals/SHD-2010/SPEECH/Raetsch-rnaseq-SHD-240310.pdfRNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) fml RNA-Seq allows

Summary

References III

Soren Sonnenburg, Alexander Zien, and Gunnar Ratsch. ARTS: Accurate Recognition ofTranscription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.

G Zeller, RM Clark, K Schneeberger, A Bohlen, D Weigel, and G Ratsch. Detectingpolymorphic regions in arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008a. ISSN 1088-9051 (Print). doi: 10.1101/gr.070169.107.

G. Zeller, S. Henz, S. Laubinger, D. Weigel, and G. Ratsch. Transcript normalization andsegmentation of tiling array data. In Proc. PSB 2008. World Scientific, 2008b.

A. Zien, G. Ratsch, S. Mika, B. Scholkopf, T. Lengauer, and K.-R. Muller. Engineering SupportVector Machine Kernels That Recognize Translation Initiation Sites. BioInformatics, 16(9):799–807, September 2000.

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 40 / 36