Central dogma of biology DNA RNA pre-mRNA mRNA Protein Central dogma

Upload: tracey-curtis

Post on 02-Jan-2016

TRANSCRIPT

Page 1: Central dogma of biology DNA  RNA  pre-mRNA  mRNA  Protein Central dogma

Central dogma of biology

DNA → RNA (pre-mRNA → mRNA) → Protein

Central dogma

Page 2:

DNA: CGAACAAACCTCGAACCTGCT

Transcription

mRNA: GCU UGU UUA CGA

Translation

Polypeptide: Ala Cys Leu Arg

Basic molecular biology
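
The translation step on this slide can be sketched in a few lines. This is a minimal illustration only: the codon table below holds just the four codons shown on the slide, and the function name `translate` is a made-up helper, not part of any library.

```python
# Partial codon table: only the four codons that appear on the slide.
CODON_TABLE = {"GCU": "Ala", "UGU": "Cys", "UUA": "Leu", "CGA": "Arg"}


def translate(mrna):
    """Translate an mRNA string, read in non-overlapping triplets."""
    mrna = mrna.replace(" ", "").upper()
    if len(mrna) % 3 != 0:
        raise ValueError("mRNA length must be a multiple of 3")
    codons = [mrna[i:i + 3] for i in range(0, len(mrna), 3)]
    return [CODON_TABLE[c] for c in codons]


polypeptide = translate("GCU UGU UUA CGA")
print(" ".join(polypeptide))
```

A full translator would need the complete 64-codon table and handling of start and stop codons, which the slide does not cover.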

Page 3:

Transcription

End modification

Splicing

Transport

Translation

Less basic molecular biology

Page 4:

[Figure: microarray workflow. Labels: Cy3, Cy5, Reference, Test Sample, cDNA Clone (LIBRARY), PCR Product, PE, Test Sample, Oligonucleotide Synthesis, Biological Sample, RNA, ARRAY]

Ramaswamy and Golub, JCO

Microarray technology

Page 5:

Lockhart and Winzler 2000

Oligonucleotide cDNA

Microarray technology

Page 6:

Yeast experiment

Microarray experiment

Page 7:

When the science is not well understood, resort to statistics:

Ultimate goal: discover the genetic pathways of cancers

Infer cancer genetics by analyzing microarray data from tumors

Curse of dimensionality: far too few examples for so many dimensions to predict accurately

Immediate goal: models that discriminate tumor types or treatment outcomes, and identification of the genes used in the model

Basic difficulty: few examples (20-100) and high dimensionality (7,000-16,000 genes measured for each sample), an ill-posed problem

Analytic challenge
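
A small experiment may make the ill-posedness concrete: with far more dimensions than samples, a linear rule can fit labels that are pure noise. The sketch below is an illustration under stated assumptions, none of which come from the slides: synthetic Gaussian "expression" values, random labels, and a plain perceptron as a stand-in classifier.

```python
import random


def perceptron(X, y, max_epochs=1000):
    """Plain perceptron without bias term; returns the weight vector."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:
            break
    return w


rng = random.Random(0)
n_samples, n_genes = 20, 200          # far fewer samples than "genes"
X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
y = [rng.choice([-1, 1]) for _ in range(n_samples)]   # labels are pure noise

w = perceptron(X, y)
train_errors = sum(1 for xi, yi in zip(X, y)
                   if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0)
print("training errors on random labels:", train_errors)
```

Because random points in 200 dimensions are linearly separable for any labeling of 20 samples, the perceptron reaches zero training errors even though there is nothing to learn. Training error alone therefore says little in this regime.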

Page 8:

Cancer Diagnosis

Acute Myeloblastic Leukemia vs. Acute Lymphoblastic Leukemia

Page 9:

38 examples of myeloid and lymphoblastic leukemias, Affymetrix human 6800 array (7128 genes including control genes)

34 examples to test classifier

Results: 33/34 correct

d: perpendicular distance from hyperplane

[Figure: d for each test sample]

Cancer Classification
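
The quantity d on this slide is the signed perpendicular distance of a sample from the separating hyperplane. A minimal sketch follows; the weight vector, offset, and test sample are hypothetical numbers chosen for easy arithmetic, not values from the leukemia data.

```python
import math


def signed_distance(w, b, x):
    """Signed perpendicular distance d of x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm


# Hypothetical 2-gene hyperplane and test sample, for illustration only.
w, b = [3.0, 4.0], -5.0
d = signed_distance(w, b, [3.0, 4.0])   # (9 + 16 - 5) / 5
label = 1 if d > 0 else -1
print(d, label)
```

The sign of d gives the predicted class, and its magnitude is what the later slides use as a confidence measure.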

Page 10:

Coregulation: the expression of two genes must be correlated for a protein to be made, so we need to look at pairwise correlations as well as individual expression

Size of feature space: if there are 7,000 genes, feature space is about 24 million features, so the fact that feature space is never computed is important

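$x = (e_{sh}, e_{Trk})$

$\phi(x) = (e_{sh}^2,\ e_{Trk}^2,\ \sqrt{2}\,e_{sh}e_{Trk},\ \sqrt{2}\,e_{sh},\ \sqrt{2}\,e_{Trk},\ 1)$

$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) = (x_i \cdot x_j + 1)^2$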

Two gene example: two genes measuring Sonic Hedgehog and TrkC

Coregulation and kernels
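
The point that the feature space is never computed can be checked numerically. The sketch below compares an explicit degree-2 feature map for two genes with the kernel shortcut, then counts the features for 7,000 genes. The expression values are made up, and the feature map includes the usual square-root-of-2 factors needed for the identity to hold exactly.

```python
import math


def phi(x):
    """Explicit degree-2 feature map for a two-gene sample x = (e_sh, e_trk)."""
    e_sh, e_trk = x
    r2 = math.sqrt(2.0)
    return [e_sh ** 2, e_trk ** 2, r2 * e_sh * e_trk, r2 * e_sh, r2 * e_trk, 1.0]


def poly_kernel(x, z):
    """K(x, z) = (x . z + 1)^2, computed without building phi."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2


xi, xj = (0.5, -1.2), (2.0, 0.3)          # made-up expression values
explicit = sum(a * b for a, b in zip(phi(xi), phi(xj)))
implicit = poly_kernel(xi, xj)
print(explicit, implicit)

# Number of monomials of degree at most 2 in 7,000 genes.
n_features = math.comb(7000 + 2, 2)
print(n_features)
```

The count comes to 24,510,501, in line with the "about 24 million" on the slide, while the kernel evaluation only needs one 7,000-term dot product.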

Page 11:

Nonlinear SVM helps when the most informative genes are removed. Genes are ranked as informative using signal to noise (Golub et al).

Errors by polynomial order:

Genes removed   1st order   2nd order   3rd order
0               1           1           1
10              2           1           1
20              3           2           1
30              3           3           2
40              3           3           2
50              3           2           2
100             3           3           2
200             3           3           3
1500            7           7           8

Gene coregulation

Page 12:

Golub et al classified 29 test points correctly and rejected 5, of which 2 were errors, using 50 genes

Need to introduce the concept of rejects to the SVM

[Figure: samples plotted against genes g1 and g2, with Normal, Cancer, and Reject regions]

Rejecting samples

Page 13:

Rejecting samples

Page 14:

Estimating a CDF

Page 15:

The regularized solution

Page 16:

[Figure: P(c=1 | d) plotted against 1/d, with the .95 level marked]

95% confidence (p = .05) corresponds to d = .107

Rejections for SVMs
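
A rejection rule built on this threshold can be sketched as follows. The threshold default of .107 is the value quoted on the slide; the function name and the sample distances are hypothetical.

```python
def classify_with_reject(d, threshold=0.107):
    """Return +1 or -1, or None (reject) when |d| is below the threshold."""
    if abs(d) < threshold:
        return None
    return 1 if d > 0 else -1


distances = [0.85, -0.40, 0.05, -0.09, 0.107]   # hypothetical values of d
calls = [classify_with_reject(d) for d in distances]
print(calls)
```

Samples that fall too close to the hyperplane are left uncalled rather than risked, which trades coverage for accuracy on the calls that are made.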

Page 17:

Results: 31 correct, 3 rejected of which 1 is an error

[Figure: d for each test sample]

Results with rejections

Page 18:

SVMs as stated use all genes/features

Molecular biologists/oncologists seem to be convinced that only a small subset of genes is responsible for particular biological properties, so they want the genes most important in discriminating

Practical reasons: a clinical device with thousands of genes is not financially practical

Possible performance improvement

Wrapper method for gene/feature selection

Gene selection

Page 19:

AML vs ALL: 40 genes, 34/34 correct, 0 rejects. 5 genes, 31/31 correct, 3 rejects of which 1 is an error.

B vs T cells for ALL: 10 genes, 33/33 correct, 0 rejects.

[Figures: two plots of d for each test sample]

Results with gene selection

Page 20:

 

 

     

Dataset                        Total samples   Class 0        Class 1
Leukemia Morphology (train)    38              27 ALL         11 AML
Leukemia Morphology (test)     34              20 ALL         14 AML
Leukemia Lineage (ALL)         23              15 B-Cell      8 T-Cell
Leukemia Outcome (AML)         15              8 Low risk     7 High risk
Lymphoma Morphology            77              19 FSC         58 DLCL
Lymphoma Outcome               58              22 Low risk    36 High risk
Brain Morphology               41              14 Glioma      27 MD
Brain Outcome                  50              38 Low risk    12 High risk

Hierarchy of difficulty:
1. Histological differences: normal vs. malignant, skin vs. brain
2. Morphologies: different leukemia types, ALL vs. AML
3. Lineage: B-Cell vs. T-Cell, follicular vs. large B-cell lymphoma
4. Outcome: treatment outcome, relapse, or drug sensitivity

Molecular classification of cancer

Page 21:

Dataset                                  Algorithm  Total samples  Total errors  Class 1 errors  Class 0 errors  Number of genes
Leukemia Morphology (test), AML vs ALL   SVM        35             0/35          0/21            0/14            40
                                         WV         35             2/35          1/21            1/14            50
                                         k-NN       35             3/35          1/21            2/14            10
Leukemia Lineage (ALL), B vs T           SVM        23             0/23          0/15            0/8             10
                                         WV         23             0/23          0/15            0/8             9
                                         k-NN       23             0/23          0/15            0/8             10
Lymphoma, FS vs DLCL                     SVM        77             4/77          2/32            2/35            200
                                         WV         77             6/77          1/32            5/35            30
                                         k-NN       77             3/77          1/32            2/35            250
Brain, MD vs Glioma                      SVM        41             1/41          1/27            0/14            100
                                         WV         41             1/41          1/27            0/14            3
                                         k-NN       41             0/41          0/27            0/14            5

Morphology classification

Page 22:

Dataset                          Algorithm  Total samples  Total errors  Class 1 errors  Class 0 errors  Number of genes
Lymphoma LBC treatment outcome   SVM        58             13/58         3/32            10/26           100
                                 WV         58             15/58         5/32            10/26           12
                                 k-NN       58             15/58         8/32            7/26            15
Brain MD treatment outcome       SVM        50             7/50          6/12            1/38            50
                                 WV         50             13/50         6/12            7/38            6
                                 k-NN       50             10/50         6/12            4/38            5

Outcome classification

Page 23:

Error rates ignore temporal information such as when a patient dies. Survival analysis takes temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance.

[Figure: Kaplan-Meier survival plots. Lymphoma: p-val = 0.0015. Medulloblastoma: p-val = 0.00039.]

Outcome classification
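
For readers unfamiliar with the Kaplan-Meier estimate, the sketch below computes the survival curve by the product-limit rule: at each death time, multiply the running survival by one minus deaths over patients still at risk. The follow-up data are invented for illustration and are not taken from the lymphoma or medulloblastoma studies; the p-values on the slide would come from a separate significance test that is not shown here.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up time for each patient
    events -- 1 if the patient died at that time, 0 if censored
    """
    survival, curve = 1.0, []
    event_times = sorted(set(t for t, e in zip(times, events) if e == 1))
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Made-up follow-up data (months), not taken from the slides.
times = [5, 8, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

Censored patients (event 0) still count in the at-risk set up to their last follow-up, which is how the estimate uses the temporal information that plain error rates discard.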

Page 24:

Tumor type     Abbrev   Total   Train   Test
Breast         B        11      8       3
Prostate       P        10      8       2
Lung           L        11      8       3
Colorectal     CR       13      8       5
Lymphoma       Ly       22      16      6
Bladder        Bl       11      8       3
Melanoma       M        10      8       2
Uterus         U        10      8       3
Leukemia       Le       30      24      6
Renal          R        11      8       3
Pancreas       PA       11      8       3
Ovary          Ov       11      8       3
Mesothelioma   MS       11      8       3
Brain          C        20      16      4

Note that most of these tumors came from secondary sources and were not at the tissue of origin.

Multi tumor classification

Page 25:

CNS, Lymphoma, Leukemia tumors separate

Adenocarcinomas do not separate

Clustering is not accurate

Page 26:

Class   G+R   B+R
R       +1    +1
Y       -1    -1
G       +1    -1
B       -1    +1

Combination approaches: all pairs, one versus all (OVA)

Multi tumor classification
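
The four-class coding on this slide can be decoded mechanically from the signs of the two binary classifiers. The sketch below uses the codewords from the slide; the classifier outputs and helper names are hypothetical. It also counts how many binary classifiers the two combination approaches need for the 14 tumor classes listed earlier in the deck.

```python
# Codewords copied from the slide: (G+R classifier, B+R classifier).
CODEWORDS = {"R": (+1, +1), "Y": (-1, -1), "G": (+1, -1), "B": (-1, +1)}


def decode(outputs):
    """Map the signs of the two binary classifier outputs to a class."""
    signs = tuple(1 if o > 0 else -1 for o in outputs)
    for label, code in CODEWORDS.items():
        if code == signs:
            return label
    return None


def n_classifiers(k):
    """Binary classifiers needed for k classes: (all pairs, one versus all)."""
    return k * (k - 1) // 2, k


decoded = decode((0.7, -1.3))       # hypothetical real-valued outputs
counts = n_classifiers(14)          # 14 tumor classes
print(decoded, counts)
```

All pairs needs many more classifiers than one versus all, but each is trained on a smaller and often easier two-class problem.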

Page 27:

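[Figure 2: the gene expression dataset for a test sample is passed to one OVA classifier per tumor type (Breast, Prostate, ..., CNS). Each classifier separates its class (e.g. breast tumors) from all other tumors with a hyperplane and reports a confidence on a scale from -2 to +2. The final multiclass call is the class with the highest OVA prediction strength, e.g. Breast (high confidence).]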

Supervised methodology
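
The final multiclass call described in the figure reduces to picking the most confident OVA classifier. A minimal sketch, with made-up classifier outputs for a single test sample:

```python
def ova_call(scores):
    """Final multiclass call: the class whose OVA classifier is most confident.

    scores -- dict mapping class name to that classifier's signed output
    Returns (predicted class, its prediction strength).
    """
    label = max(scores, key=scores.get)
    return label, scores[label]


# Hypothetical OVA outputs for one test sample.
scores = {"Breast": 1.8, "Prostate": -0.9, "CNS": -1.6}
call, strength = ova_call(scores)
print(call, strength)
```

The returned strength can then be thresholded to split calls into high and low confidence, as in the results tables that follow.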

Page 28:

 

[Figure: for Train/cross-val. and Test 1: accuracy and fraction of calls versus confidence; correct calls and errors at low and high confidence; accuracy of the first, top 2, and top 3 prediction calls.]

Dataset  Sample type          Validation method  Sample number  Total accuracy  High conf. fraction  High conf. accuracy  Low conf. fraction  Low conf. accuracy
Train    Well Differentiated  Cross-val.         144            78%             80%                  90%                  20%                 28%
Test 1   Well Differentiated  Train/Test         54             78%             78%                  83%                  22%                 58%

Well differentiated tumors

Page 29:

Feature selection hurts performance

Page 30:

[Figure: accuracy and fraction of calls versus confidence; correct calls and errors at low and high confidence; accuracy of the first, top 2, and top 3 prediction calls.]

Dataset  Sample type            Validation method  Sample number  Total accuracy  High conf. fraction  High conf. accuracy  Low conf. fraction  Low conf. accuracy
Test     Poorly Differentiated  Train/test         20             30%             50%                  50%                  50%                 10%

Poorly differentiated tumors

Page 31:

Morphing

Page 32:

Morphing

Page 33:

Talking faces

Page 34:

Talking faces

Page 35:

Talking faces

Page 36:

Recursive feature elimination (RFE): based upon perturbation analysis, eliminate genes that perturb the margin the least

Optimize leave-one-out (LOO): based upon optimization of the leave-one-out error of an SVM; the leave-one-out error is almost unbiased

Two feature selection algorithms

Page 37:

 

1. Solve the SVM problem for vector w
2. Rank order elements of vector w by absolute value
3. Discard input features/genes corresponding to those vector elements with small absolute magnitude (for example, the smallest 10%)
4. Retrain SVM on reduced gene set and go to step (2)

Recursive feature elimination
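
The elimination loop can be sketched as below. Two things are assumptions of this sketch rather than anything stated on the slide: the trainer `train_linear_svm` is a simple stand-in (batch subgradient descent on the regularized hinge loss, no bias term) where a real SVM solver would normally be used, and the toy data are constructed so that only genes 0 and 3 carry information.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Stand-in linear SVM (no bias term): batch subgradient descent on
    the regularized hinge loss. Returns the weight vector w."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for t in range(1, epochs + 1):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) < 1:
                for j in range(p):
                    grad[j] -= yi * xi[j] / n
        eta = 1.0 / (lam * t)
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w


def rfe(X, y, n_keep, frac=0.10):
    """Return the indices of the n_keep genes that survive elimination."""
    genes = list(range(len(X[0])))
    while len(genes) > n_keep:
        Xr = [[xi[g] for g in genes] for xi in X]      # reduced gene set
        w = train_linear_svm(Xr, y)                    # solve for w
        order = sorted(range(len(genes)), key=lambda j: abs(w[j]))  # rank
        n_drop = min(max(1, int(frac * len(genes))), len(genes) - n_keep)
        dropped = set(order[:n_drop])                  # discard smallest |w_j|
        genes = [g for j, g in enumerate(genes) if j not in dropped]
    return genes                                       # loop = retrain and repeat


# Toy data: genes 0 and 3 follow the label, genes 1, 2, 4 are constant zero.
y = [1, 1, -1, -1]
X = [[2.0, 0.0, 0.0, 1.0, 0.0],
     [1.5, 0.0, 0.0, 0.5, 0.0],
     [-2.0, 0.0, 0.0, -1.0, 0.0],
     [-1.0, 0.0, 0.0, -1.5, 0.0]]
selected = rfe(X, y, n_keep=2)
print(selected)
```

Retraining after every round matters: the weights of the surviving genes change once others are removed, so a single ranking from the first fit is not equivalent.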

Page 38:

Use leave-one-out (LOO) bounds for SVMs as a criterion to select features by searching over all possible subsets of n features for the ones that minimize the bound.

When such a search is impossible because of combinatorial explosion, scale each feature by a real value variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables.

The rescaling can be done in the input space or in a “Principal Components” space.

 

 

Optimizing the LOO

Page 39:

Rescale features to minimize the LOO bound $R^2/M^2$

[Figure: two panels with axes x1 and x2. Before rescaling: $R^2/M^2 > 1$. After rescaling: $M = R$, so $R^2/M^2 = 1$.]

Pictorial demonstration
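
The effect in the picture can be reproduced on four toy points. The sketch below rests on assumptions that are not in the slides: the separating direction `w` is supplied by hand rather than found by an SVM, there is no bias term, and R is taken as the largest distance from the centroid, which in general only upper-bounds the radius of the smallest enclosing sphere (it is exact for this symmetric example).

```python
import math


def radius_margin_ratio(X, y, w, sigma):
    """R^2 / M^2 after scaling feature f by sigma[f].

    w is a separating direction in the rescaled space (assumed known).
    """
    Xs = [[s * v for s, v in zip(sigma, xi)] for xi in X]
    centroid = [sum(col) / len(Xs) for col in zip(*Xs)]
    R2 = max(sum((v - c) ** 2 for v, c in zip(xi, centroid)) for xi in Xs)
    norm = math.sqrt(sum(wj * wj for wj in w))
    M = min(yi * sum(wj * v for wj, v in zip(w, xi)) / norm
            for xi, yi in zip(Xs, y))
    return R2 / M ** 2


# x1 separates the classes; x2 is large but irrelevant.
X = [[1.0, 5.0], [1.0, -5.0], [-1.0, 5.0], [-1.0, -5.0]]
y = [1, 1, -1, -1]
w = [1.0, 0.0]
before = radius_margin_ratio(X, y, w, sigma=[1.0, 1.0])
after = radius_margin_ratio(X, y, w, sigma=[1.0, 0.0])
print(before, after)
```

Scaling the irrelevant feature to zero shrinks the radius without touching the margin, so the ratio drops from 26 to 1, matching the $M = R$ panel.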

Page 40:

Radius margin bound: simple to compute, continuous, very loose but often tracks LOO well

Jaakkola Haussler bound: somewhat tighter, simple to compute, discontinuous so need to smooth, valid only for SVMs with no b term

Span bound: tight, complicated to compute, discontinuous so need to smooth

Three LOO bounds

Page 41:

 

 

We add a scaling parameter $\sigma$ to the SVM, which scales genes; genes corresponding to small $\sigma_j$ are removed.

The SVM function has the form:

$f(x) = \sum_{i \in SV} \alpha_i y_i K_\sigma(x, x_i) + b$,

where $K_\sigma(x, y) = K(\sigma * x, \sigma * y)$ and $u * v$ denotes element by element multiplication.

Classification function with scaling
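
A scaled kernel is easy to build as a wrapper around any base kernel. In the sketch below the base kernel is a degree-2 polynomial chosen for illustration, and the support vectors, multipliers, and offset are hypothetical numbers, not a trained model.

```python
def scaled_kernel(kernel, sigma):
    """Build K_sigma(x, z) = K(sigma * x, sigma * z) from a base kernel K."""
    def k_sigma(x, z):
        xs = [s * a for s, a in zip(sigma, x)]
        zs = [s * b for s, b in zip(sigma, z)]
        return kernel(xs, zs)
    return k_sigma


def poly2(x, z):
    """Base kernel used for illustration: (x . z + 1)^2."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2


def decision(x, support, alphas, labels, b, kernel):
    """f(x) = sum_i alpha_i y_i K(x, x_i) + b over the support vectors."""
    return sum(a * yi * kernel(x, xi)
               for a, yi, xi in zip(alphas, labels, support)) + b


k = scaled_kernel(poly2, sigma=[1.0, 0.0])       # second gene switched off
same = k([2.0, 9.0], [3.0, -4.0]) == poly2([2.0], [3.0])
# Hypothetical support vectors and multipliers, for illustration only.
f = decision([2.0, 9.0], [[3.0, -4.0], [-1.0, 7.0]], [0.5, 0.5], [1, -1], 0.0, k)
print(same, f)
```

Setting a component of sigma to zero makes the kernel, and hence the classifier, ignore that gene entirely, which is why small scaling values mark genes for removal.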

Page 42:

 

 

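The $\alpha$'s are computed by maximizing the following quadratic form:

$W^2(\alpha, \sigma) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \left( K_\sigma(x_i, x_j) + \frac{1}{C} \delta_{ij} \right)$, subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$.

For computing the radius around the data, maximize:

$R^2(\beta, \sigma) = \sum_i \beta_i K_\sigma(x_i, x_i) - \sum_{i,j} \beta_i \beta_j K_\sigma(x_i, x_j)$, subject to $\sum_i \beta_i = 1$ and $\beta_i \ge 0$.

For the variance around the data:

$V^2(\sigma) = \frac{1}{\ell} \sum_i K_\sigma(x_i, x_i) - \frac{1}{\ell^2} \sum_{i,j} K_\sigma(x_i, x_j)$.

Remember: $T_{DL} = \frac{1}{\ell} W^2(\alpha, \sigma) R^2(\beta, \sigma)$, or with the variance in place of the radius, $\frac{1}{\ell} W^2(\alpha, \sigma) V^2(\sigma)$.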

SVM and other functionals
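
Of the functionals on this slide, the variance needs no optimization and can be computed directly from kernel values. The sketch below evaluates it as the mean of the kernel diagonal minus the mean of all kernel entries, using a linear kernel on rescaled inputs as an illustrative choice; the four toy samples are made up.

```python
def variance_functional(X, kernel):
    """Mean of K(x_i, x_i) minus mean of K(x_i, x_j) over all pairs."""
    n = len(X)
    diag = sum(kernel(x, x) for x in X) / n
    full = sum(kernel(x, z) for x in X for z in X) / n ** 2
    return diag - full


def linear_scaled(sigma):
    """Linear kernel on rescaled inputs: sum_f sigma_f^2 x_f z_f."""
    return lambda x, z: sum(s * s * a * b for s, a, b in zip(sigma, x, z))


X = [[1.0, 5.0], [1.0, -5.0], [-1.0, 5.0], [-1.0, -5.0]]   # toy samples
v_all = variance_functional(X, linear_scaled([1.0, 1.0]))
v_one = variance_functional(X, linear_scaled([1.0, 0.0]))
print(v_all, v_one)
```

For a linear kernel this is the mean squared distance of the samples from their centroid, so switching off the second feature reduces it from 26 to 1.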

Page 43:

The following steps are used to compute $\alpha$ and $\sigma$:

1. Initialize $\sigma = (1, \ldots, 1)$.
2. Solve the standard SVM algorithm: $\alpha(\sigma) = \arg\max_\alpha W^2(\alpha, \sigma)$.
3. Minimize the estimate of error $T$ with respect to $\sigma$ with a gradient step.
4. If a local minimum of $T$ is not reached, go to step 3.
5. Discard dimensions corresponding to small elements in $\sigma$ and return to step 2.

Algorithm

Page 44:

Computing gradients

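The gradient with respect to the margin:

$\frac{\partial W^2}{\partial \sigma_f} = -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \frac{\partial K_\sigma(x_i, x_j)}{\partial \sigma_f}$.

The gradient with respect to the radius:

$\frac{\partial R^2}{\partial \sigma_f} = \sum_i \beta_i \frac{\partial K_\sigma(x_i, x_i)}{\partial \sigma_f} - \sum_{i,j} \beta_i \beta_j \frac{\partial K_\sigma(x_i, x_j)}{\partial \sigma_f}$.

The gradient with respect to the span:

$\frac{\partial S_i^2}{\partial \sigma_f} = S_i^4 \left( \tilde{K}_{SV}^{-1} \frac{\partial \tilde{K}_{SV}}{\partial \sigma_f} \tilde{K}_{SV}^{-1} \right)_{ii}$, where $\tilde{K}_{SV} = \begin{pmatrix} K_{SV} & \mathbf{1} \\ \mathbf{1}^T & 0 \end{pmatrix}$.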
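
All of these gradients reduce to the derivative of the scaled kernel with respect to one scaling factor. As a hedged illustration, the sketch below takes a linear kernel on rescaled inputs (an assumption made for simplicity), writes its analytic derivative, checks it against a central finite difference, and plugs it into the margin gradient with the factor of one half. The multipliers and samples are hypothetical.

```python
def k_sigma(sigma, x, z):
    """Linear kernel on rescaled inputs: sum_f sigma_f^2 x_f z_f."""
    return sum(s * s * a * b for s, a, b in zip(sigma, x, z))


def dk_dsigma(sigma, x, z, f):
    """Analytic partial derivative of k_sigma with respect to sigma[f]."""
    return 2.0 * sigma[f] * x[f] * z[f]


def dW2_dsigma(sigma, X, y, alpha, f):
    """-1/2 sum_ij alpha_i alpha_j y_i y_j dK_sigma(x_i, x_j)/dsigma_f."""
    n = len(X)
    return -0.5 * sum(alpha[i] * alpha[j] * y[i] * y[j]
                      * dk_dsigma(sigma, X[i], X[j], f)
                      for i in range(n) for j in range(n))


# Check the analytic kernel derivative against a central finite difference.
sigma, x, z, f, h = [0.8, 1.5], [1.0, -2.0], [0.5, 3.0], 1, 1e-5
up = [s + h if j == f else s for j, s in enumerate(sigma)]
down = [s - h if j == f else s for j, s in enumerate(sigma)]
numeric = (k_sigma(up, x, z) - k_sigma(down, x, z)) / (2 * h)
analytic = dk_dsigma(sigma, x, z, f)

# Margin gradient for two hypothetical support vectors.
g = dW2_dsigma([1.0, 1.0], [[1.0, 2.0], [-1.0, 2.0]], [1, -1], [0.5, 0.5], 0)
print(numeric, analytic, g)
```

A finite-difference check of this kind is a cheap safeguard before running gradient descent on the scaling factors, since a sign or factor error in the derivative is otherwise hard to spot.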

Page 45:

Linear problem with 6 relevant dimensions of 202

Nonlinear problem with 2 relevant dimensions of 52

[Figure: error rate versus number of samples, one plot for each problem]

Toy data