computational biology algorithmic techniques & medical applications cse 590ya august 15, 2001

Computational Biology

Algorithmic Techniques & Medical Applications

CSE 590YAAugust 15, 2001

2

Outline Overview Biology Technology Algorithms & Applications

Low tech: String algorithms High tech: Class discovery/prediction Treatments & clinical outcomes

Conclusions

3

Overview Human Genome Project

Why is it important? Sequence functionality Prevention & treatment of disease

Where is there computation in it? Lab hardware/software Analysis: assembly, element discovery Could not accomplish w/o computers

4

Bigger Picture Biology of the (not so) past

Isolated Low level (one X at a time) Slow accumulation of knowledge

Biology of the present Global High level (organismal/theoretical) Rapid accumulation of knowledge Rapid generation of open questions

5

Example: S. cerevisiae (yeast) Yeast: before expression arrays

Model organism for experiments Easy to grow, modify, and study Genetics similar to higher organisms

Yeast: after expression arrays Immensely more useful Now know most gene functions New results every month that used to take

five years Results are directly applicable to higher

organisms

6

A good beginning … The genome is not the end

Code to be deciphered Human road map Greater need for computational tools

and power Example: dbSNP

Data exists Need help finding and relating it all

7

Computers – not just for analysis Role reversal

Before: Biologists generate data, computers analyze it

Now: Computers generate experiments, biologists perform them

Cycle New future for CMBists

Biotech has greatest opportunity for real science to be done, and CS is crucial!

8

CB is good for CS Old research revisited and applied

Clustering Expired in the 70s, reborn 3 years ago New papers reacceptance as research

topic Data mining, web statistics, e-commerce

Machine learning Well-studied over the past couple decades New needs in CB new research on tuning

9



Conclusions

10

Biochemistry 101 Cells

Basic building blocks of life * Proteins

Key to functionality Catalyze reactions *

Store and release energy Build cells and cell components

Process-specific, yet resource-efficient

11

The genetics of proteins DNA

Four-base alphabet * Genes are instructions for building

proteins Cell cycle *

Extensive regulatory mechanism Construct proteins at right time and place Break down proteins and reuse components

Incredibly complex series of steps

12

Transcription & translation DNA RNA

Transcription factors * RNA polymerase

RNA protein Translation at ribosome * Amino acid chains

Protein degradation

13



Conclusions

14

Technology DNA microarrays

Consensus RNAs adhered to slide Test and control cDNAs produced *

Fluorescently labeled Hybridized with RNAs on slide

Scan fluorescence with computer Results: how much RNA present! *

What does this signify?

15

Example uses Timepoints in the cell cycle

Which genes are always “on”? Which genes are responsible for certain

events in the cycle? Differential expression in experiment

Which genes are responsible for a particular cell response?

What is the response pattern over time?

16



Conclusions

17

“Low tech” algorithms 90s: DNA is just a bunch of strings Questions became answerable!

Are there gross similarities in the genome? What do they imply?

Are there smaller recurring elements in the genome? What is their function?

I know what Gene A does? Can I use that to figure out what Gene B does?

18

String and sequence matching String matching

Find exact replicas of DNA sequence elsewhere in the genome

Are they statistically unlikely? Sequence matching

Regions of DNA that look similar: allows for evolution

Also applied to proteins In reality, sequences are more important

19

Computer tools Biological questions could be

answered better by a computer than by a biologist

GenBank, FASTA, BLAST, GAP Not trivial developments, even for CS Required novel approaches to NP-hard

problems Web proliferation (ongoing)

www.cs.jhu.edu/~salzberg/appendixa.html

20

High tech: expression arrays Use active gene data to classify a

cell Example: Cancer type prediction

Subtypes appear very similar histologically

Very different clinical courses Diagnoses: biologists’ insight rather

than systematic/unbiased approaches

21

Classifying cancer ALL vs. AML

Two kinds of leukemia (only recently separated)

Must be treated very differently Distinguishable in clinic, but not 100% reliable

Golub (1999) Goal: Determine cancer type by overall gene

expression; build an automated classifier By-product: One of earliest quantitative uses

of DNA microarrays

22

Strategy Get expression data for 6800 genes

from 27 ALL and 11 AML patients Clustering: Find genes with expression

levels that are strongly correlated with the ALL-AML class distinction

Give each such gene a weighted predictive vote for its class

Let important genes vote on test cases

23

Determining correlation w/ class Idealized expression patterns Neighborhood analysis *

Correlation metric Euclidean distance, regression, TNOM

Significance Q: Is gene more highly correlated with IEP than would

be expected by chance? A: Examine correlation w/ random IEP permutations

Results: 1100 genes more highly correlated with ALL-AML class distinction than expected by chance

24

Making a class predictor Subset of informative genes will elect

the class of a new sample Each casts weighted vote for its class: *

Expression level of gene in test sample Original correlation of gene w/ class

distinction Prediction strength (PS)

Margin of victory after all genes vote If less than threshold, then uncertain

25

Validation of the model (a) Initial data set: cross-validation

For each patient sample: Build a classifier without it (i.e. w/ 37 others) Predict class of left-out sample

Calculate cumulative error rate Results

Used top 50 genes 36/38 samples classified correctly, 2

uncertain

26

Validation of the model (b) Independent data set: test validation

34 samples from diverse tissues 29/34 “strong” predictions; 100% accuracy

PS values quite high for both .77 in cross-validation; .73 in independent Mean PS lower for samples from one

particular laboratory: importance of standardization in clinical setting

27

Further results of clinical importance 10 200 voting gene set had same accuracy Voter gene function: not just lineage markers

Surface receptors, anti-apoptotic agents, cell cycle regulators, DNA manipulators, known oncogenes

These genes provide insight into cancer causes New biological knowledge as a result of

computational methods! Other applications of CP & feature selection

Response to chemotherapy Eventual outcome of disease

28

Other array-based classifiers (a) k-means clustering

Select “high-scoring” features like before

Pick k points as initial cluster centroids Add each new data point to nearest cluster Move that cluster centroid to new mean

Use these centroids to classify test cases

29

Other array-based classifiers (b) Support Vector Machines

Goal: find a plane that separates data points If not separable

Boost the data points into a higher dimensional space using some well-behaved kernel function

Try to find a separating hyperplane there Key benefits of SVM version

Kernel avoids explicit representation of higher-dim space

Finding the maximum margin separating classes avoids overfitting

30

Class discovery What if we don’t know how many

clusters we want? The discovery of finer-grained subtypes

of cancer has been arduous and slow How can microarrays help here?

Golub (1999) again … Automatic class discovery based solely

on gene expression

31

Self-organizing maps (SOMs) Very much like k-means clustering

However, we don’t know the discriminating features in advance

Cluster based on all gene expression levels

Results for 27 ALL/11 AML data set Class A: 24/25 samples were ALL Class B: 10/13 samples were AML

Quite effective, but not perfect

32

SOMs (cont’d) How can we evaluate the “learned” clusters

w/o knowing the true classes? Test by class prediction – accuracy should

be high if classes reflect true structure Results

Predictors w/ variety of genes did well in cross-validation

Exception: the one AML in class A was often predicted to be in class B

This suggests an iterative method for class discovery: discover, predict, refine

33

Independent model validation Cannot assess “accuracy” on test data Instead, assess prediction strength

High PS indicates that structure in initial data is also present in test data

Results Median PS=.61, 74% of samples above

threshold Compared w/ random clusters, PS’s were

highly statistically significant We have discovered ALL-AML distinction! Even lower-level distinctions also discovered

34

Other CS w/ expression arrays Regulatory element detection

Correlate expression data with frequency of DNA motifs

Taxing even for fastest processors today Discovery of regulatory pathways

Treat expression arrays over time as a graph Establish a Bayesian network model for

regulatory pathways over the array graph structure

Infer network parameters pathway structure

35

Problems with DNA arrays Different companies, different types Even within one company

Different products over time Different binding efficiencies

Much time spent on normalization Even then, different groups’ results are hard

to compare Biggest worry: RNA levels in cells do not

accurately reflect current protein content Perhaps limits our discovery potential

36

Proteonomics If protein is most important, why not

study it directly? Much work is done on proteins already But difficult to purify, prepare, quantify Results are very coarse

Emerging technologies More efficient protein purification and

protein arrays are being developed! Lots of discoveries to come

37



Conclusions

38

Looking to the future Biology is becoming a more theoretical,

unified science The problem w/ biology has always been that there

are too many layers Work has always been somewhere in the middle Now research is beginning to focus on processes

and pathways and networks in general This is the proper path to developing theories

Along the way … Lots of hard computational problems to be solved!

computational biology algorithmic techniques & medical applications cse 590ya august 15, 2001

Documents

tuning slide

slide scan fluorescence

resourceefficient slide

string algorithms high

cb new research

genetics of proteins

proteins key

rna present