Computational Biology
Algorithmic Techniques & Medical Applications
CSE 590YA
August 15, 2001
2
Outline
  Overview
  Biology
  Technology
  Algorithms & Applications
    Low tech: String algorithms
    High tech: Class discovery/prediction
    Treatments & clinical outcomes
  Conclusions
3
Overview
  Human Genome Project
    Why is it important?
      Sequence functionality
      Prevention & treatment of disease
    Where is there computation in it?
      Lab hardware/software
      Analysis: assembly, element discovery
      Could not accomplish w/o computers
4
Bigger Picture
  Biology of the (not so) past
    Isolated
    Low level (one X at a time)
    Slow accumulation of knowledge
  Biology of the present
    Global
    High level (organismal/theoretical)
    Rapid accumulation of knowledge
    Rapid generation of open questions
5
Example: S. cerevisiae (yeast)
  Yeast: before expression arrays
    Model organism for experiments
    Easy to grow, modify, and study
    Genetics similar to higher organisms
  Yeast: after expression arrays
    Immensely more useful
    Now know most gene functions
    New results every month that used to take five years
    Results are directly applicable to higher organisms
6
A good beginning …
  The genome is not the end
    Code to be deciphered
    Human road map
    Greater need for computational tools and power
  Example: dbSNP
    Data exists
    Need help finding and relating it all
7
Computers – not just for analysis
  Role reversal
    Before: Biologists generate data, computers analyze it
    Now: Computers generate experiments, biologists perform them
    Cycle
  New future for CMBists
    Biotech has greatest opportunity for real science to be done, and CS is crucial!
8
CB is good for CS
  Old research revisited and applied
    Clustering
      Expired in the 70s, reborn 3 years ago
      New papers → reacceptance as research topic
      Data mining, web statistics, e-commerce
    Machine learning
      Well-studied over the past couple decades
      New needs in CB → new research on tuning
9
Outline
  Overview
  Biology
  Technology
  Algorithms & Applications
    Low tech: String algorithms
    High tech: Class discovery/prediction
    Treatments & clinical outcomes
  Conclusions
10
Biochemistry 101
  Cells
    Basic building blocks of life
  Proteins
    Key to functionality
      Catalyze reactions
      Store and release energy
      Build cells and cell components
    Process-specific, yet resource-efficient
11
The genetics of proteins
  DNA
    Four-base alphabet
    Genes are instructions for building proteins
  Cell cycle
    Extensive regulatory mechanism
    Construct proteins at right time and place
    Break down proteins and reuse components
    Incredibly complex series of steps
12
Transcription & translation
  DNA → RNA
    Transcription factors
    RNA polymerase
  RNA → protein
    Translation at ribosome
    Amino acid chains
  Protein degradation
13
Outline
  Overview
  Biology
  Technology
  Algorithms & Applications
    Low tech: String algorithms
    High tech: Class discovery/prediction
    Treatments & clinical outcomes
  Conclusions
14
Technology
  DNA microarrays
    Consensus RNAs adhered to slide
    Test and control cDNAs produced
      Fluorescently labeled
      Hybridized with RNAs on slide
    Scan fluorescence with computer
    Results: how much RNA present!
  What does this signify?
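A minimal sketch of one common way to read the scanned intensities, under assumptions the slide does not state: each spot is summarized as the log2 ratio of test to control fluorescence, so 0 means no change and +1 means twice as much RNA in the test sample. The function name, the pseudocount, and the intensity values are illustrative only.

```python
import math

def log2_ratios(test, control, pseudocount=1.0):
    """Per-spot log2(test / control); the pseudocount (an assumption) avoids log(0)."""
    return [math.log2((t + pseudocount) / (c + pseudocount))
            for t, c in zip(test, control)]

# Made-up background-corrected intensities for three spots.
test_signal = [799.0, 99.0, 199.0]
control_signal = [199.0, 99.0, 799.0]
print(log2_ratios(test_signal, control_signal))
```

Positive values suggest higher expression in the test sample, negative values higher expression in the control.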
15
Example uses
  Timepoints in the cell cycle
    Which genes are always “on”?
    Which genes are responsible for certain events in the cycle?
  Differential expression in experiment
    Which genes are responsible for a particular cell response?
    What is the response pattern over time?
16
Outline
  Overview
  Biology
  Technology
  Algorithms & Applications
    Low tech: String algorithms
    High tech: Class discovery/prediction
    Treatments & clinical outcomes
  Conclusions
17
“Low tech” algorithms
  90s: DNA is just a bunch of strings
  Questions became answerable!
    Are there gross similarities in the genome? What do they imply?
    Are there smaller recurring elements in the genome? What is their function?
    I know what Gene A does. Can I use that to figure out what Gene B does?
18
String and sequence matching
  String matching
    Find exact replicas of DNA sequence elsewhere in the genome
    Are they statistically unlikely?
  Sequence matching
    Regions of DNA that look similar: allows for evolution
    Also applied to proteins
    In reality, sequences are more important
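The two ideas on this slide can be sketched side by side. The first function finds exact repeats by indexing every length-k substring; the second scores the best local alignment (Smith-Waterman), which tolerates the mismatches and gaps that evolution introduces. The function names and the match/mismatch/gap scores are assumptions chosen for illustration, not values from the lecture.

```python
from collections import defaultdict

def repeated_kmers(genome, k):
    """Map each length-k substring occurring more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}

def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score, keeping only two rows in memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(repeated_kmers("ACGTTACGTA", 4))                 # {'ACGT': [0, 5]}
print(local_alignment_score("ACGTTACG", "ACGATACG"))   # one mismatch tolerated
```

Exact matching runs in time linear in the genome length here, while the alignment score costs time proportional to the product of the two sequence lengths, which is part of why heuristic tools were developed for database-scale searches.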
19
Computer tools
  Biological questions could be answered better by a computer than by a biologist
  GenBank, FASTA, BLAST, GAP
    Not trivial developments, even for CS
    Required novel approaches to NP-hard problems
  Web proliferation (ongoing)
    www.cs.jhu.edu/~salzberg/appendixa.html
20
High tech: expression arrays
  Use active gene data to classify a cell
  Example: Cancer type prediction
    Subtypes appear very similar histologically
    Very different clinical courses
    Diagnoses: biologists’ insight rather than systematic/unbiased approaches
21
Classifying cancer
  ALL vs. AML
    Two kinds of leukemia (only recently separated)
    Must be treated very differently
    Distinguishable in clinic, but not 100% reliable
  Golub (1999)
    Goal: Determine cancer type by overall gene expression; build an automated classifier
    By-product: One of earliest quantitative uses of DNA microarrays
22
Strategy
  Get expression data for 6800 genes from 27 ALL and 11 AML patients
  Clustering: Find genes with expression levels that are strongly correlated with the ALL-AML class distinction
  Give each such gene a weighted predictive vote for its class
  Let important genes vote on test cases
23
Determining correlation w/ class
  Idealized expression patterns (IEPs)
  Neighborhood analysis
    Correlation metric
      Euclidean distance, regression, TNOM
    Significance
      Q: Is gene more highly correlated with IEP than would be expected by chance?
      A: Examine correlation w/ random IEP permutations
  Results: 1100 genes more highly correlated with ALL-AML class distinction than expected by chance
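The significance question above can be sketched as a permutation test. This is a simplified, per-gene illustration rather than the full neighborhood analysis: it scores one gene against the class labels and compares that score with scores obtained after shuffling the labels. The metric used here is the signal-to-noise statistic, (mean_ALL - mean_AML) / (sd_ALL + sd_AML), reported in Golub et al. (1999); the slide lists other candidate metrics and does not say which one the lecture used, so treat the choice, the function names, and the expression values as assumptions.

```python
import random
import statistics

def signal_to_noise(values, labels):
    """(mean_ALL - mean_AML) / (sd_ALL + sd_AML) for one gene across samples."""
    all_vals = [v for v, c in zip(values, labels) if c == "ALL"]
    aml_vals = [v for v, c in zip(values, labels) if c == "AML"]
    spread = statistics.stdev(all_vals) + statistics.stdev(aml_vals)
    return (statistics.mean(all_vals) - statistics.mean(aml_vals)) / spread

def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """Share of shuffled labelings whose score is at least as extreme as the real one."""
    rng = random.Random(seed)
    observed = abs(signal_to_noise(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # a random "idealized pattern" with the same class sizes
        if abs(signal_to_noise(values, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

labels = ["ALL"] * 5 + ["AML"] * 5
gene = [9.1, 8.7, 9.5, 8.9, 9.3, 2.2, 2.8, 1.9, 2.5, 2.1]  # made-up expression levels
print(signal_to_noise(gene, labels), permutation_p_value(gene, labels))
```

A gene whose real score is rarely matched by shuffled labelings is a candidate "informative" gene. With thousands of genes tested at once, some correction for multiple comparisons would also be needed.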
24
Making a class predictor
  Subset of informative genes will elect the class of a new sample
  Each casts weighted vote for its class:
    Expression level of gene in test sample
    Original correlation of gene w/ class distinction
  Prediction strength (PS)
    Margin of victory after all genes vote
    If less than threshold, then uncertain
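A minimal sketch of the voting scheme, assuming each informative gene has a weight (its correlation with the class distinction, positive when it favors ALL) and a decision boundary (for example, the midpoint of the two class means). Prediction strength is taken as the winner's margin divided by the total votes. The tuple layout, the function name, the toy numbers, and the default threshold of 0.3 are assumptions for illustration.

```python
def predict(sample, genes, threshold=0.3):
    """Weighted vote over informative genes.

    genes: list of (index, weight, boundary); a positive vote favors ALL.
    Returns (label, prediction_strength); label is "uncertain" below threshold.
    """
    votes = {"ALL": 0.0, "AML": 0.0}
    for index, weight, boundary in genes:
        vote = weight * (sample[index] - boundary)
        votes["ALL" if vote > 0 else "AML"] += abs(vote)
    winner = max(votes, key=votes.get)
    loser = "AML" if winner == "ALL" else "ALL"
    total = votes[winner] + votes[loser]
    ps = (votes[winner] - votes[loser]) / total if total else 0.0
    return (winner if ps >= threshold else "uncertain"), ps

# Hypothetical predictor with three genes.
genes = [(0, 1.5, 5.0), (1, -2.0, 4.0), (2, 0.5, 6.0)]
print(predict([8.0, 2.0, 5.0], genes))   # clear margin
print(predict([5.5, 4.0, 5.0], genes))   # narrow margin, reported as uncertain
```

Reporting "uncertain" instead of forcing a call is what lets the classifier trade coverage for accuracy.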
25
Validation of the model (a)
  Initial data set: cross-validation
    For each patient sample:
      Build a classifier without it (i.e. w/ 37 others)
      Predict class of left-out sample
    Calculate cumulative error rate
  Results
    Used top 50 genes
    36/38 samples classified correctly, 2 uncertain
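The leave-one-out loop can be sketched independently of the classifier. Here `train` and `predict` are passed in as functions; the nearest-class-mean classifier below is only a stand-in so the example runs, not the weighted-voting predictor from the lecture, and the one-dimensional samples are made up.

```python
def leave_one_out(samples, labels, train, predict):
    """Return the predicted label for each sample when it is held out of training."""
    predictions = []
    for i in range(len(samples)):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        model = train(rest_x, rest_y)  # gene selection would also be redone here
        predictions.append(predict(model, samples[i]))
    return predictions

def train_centroids(xs, ys):
    """Stand-in trainer: the mean value of each class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def predict_centroid(model, x):
    """Stand-in predictor: the class whose mean is closest."""
    return min(model, key=lambda label: abs(model[label] - x))

samples = [1.0, 1.2, 0.8, 3.0, 3.3, 2.9]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
predicted = leave_one_out(samples, labels, train_centroids, predict_centroid)
errors = sum(p != y for p, y in zip(predicted, labels))
print(predicted, errors)
```

Everything that learns from the data, including the choice of informative genes, belongs inside the loop; otherwise the held-out sample leaks into the model and the error rate looks better than it is.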
26
Validation of the model (b)
  Independent data set: test validation
    34 samples from diverse tissues
    29/34 “strong” predictions; 100% accuracy
  PS values quite high for both
    .77 in cross-validation; .73 in independent
    Mean PS lower for samples from one particular laboratory: importance of standardization in clinical setting
27
Further results of clinical importance
  Voting gene sets of 10 to 200 genes had the same accuracy
  Voter gene function: not just lineage markers
    Surface receptors, anti-apoptotic agents, cell cycle regulators, DNA manipulators, known oncogenes
    These genes provide insight into cancer causes
    New biological knowledge as a result of computational methods!
  Other applications of CP & feature selection
    Response to chemotherapy
    Eventual outcome of disease
28
Other array-based classifiers (a)
  k-means clustering
    Select “high-scoring” features like before
    Pick k points as initial cluster centroids
    Add each new data point to nearest cluster
    Move that cluster centroid to new mean
    Use these centroids to classify test cases
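The steps above describe the sequential (one point at a time) form of k-means, which can be sketched as follows. Seeding with the first k points and the toy two-dimensional data are assumptions; in practice the points would be vectors of expression levels for the selected genes.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, point):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(centroids[i], point))

def sequential_kmeans(points, k):
    """Seed with the first k points, then fold in the rest one at a time."""
    centroids = [list(p) for p in points[:k]]
    counts = [1] * k
    for point in points[k:]:
        i = nearest(centroids, point)
        counts[i] += 1
        # Incremental update keeps the centroid equal to the mean of its members.
        centroids[i] = [c + (x - c) / counts[i] for c, x in zip(centroids[i], point)]
    return centroids

points = [(0.0, 0.0), (10.0, 10.0), (1.0, 1.0), (9.0, 9.0), (0.0, 2.0), (10.0, 8.0)]
centroids = sequential_kmeans(points, k=2)
print(centroids)
print(nearest(centroids, (2.0, 1.0)))  # a test case is assigned to its nearest centroid
```

To use the clusters as a classifier, each centroid is given the majority label of its training members and a test case takes the label of its nearest centroid. The result depends on the seeds and on the order of the points, so several runs are usually compared.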
29
Other array-based classifiers (b)
  Support Vector Machines
    Goal: find a plane that separates data points
    If not separable
      Map the data points into a higher dimensional space using some well-behaved kernel function
      Try to find a separating hyperplane there
  Key benefits of SVM version
    Kernel avoids explicit representation of higher-dim space
    Finding the maximum margin separating classes helps limit overfitting
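The claim that a kernel avoids explicitly representing the higher-dimensional space can be checked on a small case. For the degree-2 polynomial kernel on two-dimensional inputs, k(x, y) = (x · y)^2 equals the ordinary dot product of the explicit three-dimensional embeddings (x1^2, sqrt(2)·x1·x2, x2^2). This sketch shows only that identity, not SVM training; the function names and input vectors are illustrative.

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit 3-D embedding whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

x, y = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, y))                     # computed without leaving 2-D
print(dot(feature_map(x), feature_map(y)))   # same value via the 3-D embedding
```

An SVM only ever needs dot products between samples, so swapping in the kernel gives the effect of the embedding at the cost of the original dimensionality. With thousands of genes per sample, the explicit embedding would be impractically large.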
30
Class discovery
  What if we don’t know how many clusters we want?
    The discovery of finer-grained subtypes of cancer has been arduous and slow
    How can microarrays help here?
  Golub (1999) again …
    Automatic class discovery based solely on gene expression
31
Self-organizing maps (SOMs)
  Very much like k-means clustering
    However, we don’t know the discriminating features in advance
    Cluster based on all gene expression levels
  Results for 27 ALL/11 AML data set
    Class A: 24/25 samples were ALL
    Class B: 10/13 samples were AML
    Quite effective, but not perfect
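A toy sketch of a one-dimensional SOM, to show how it differs from k-means: when a sample arrives, the winning node and its neighbors on the map all move toward it, and both the step size and the neighborhood shrink over time. Seeding nodes with the first samples, the linear decay schedule, and the data are assumptions made for illustration; this is not the configuration used in the study.

```python
def best_node(nodes, sample):
    """Index of the node whose reference vector is closest to sample."""
    return min(range(len(nodes)),
               key=lambda j: sum((w - x) ** 2 for w, x in zip(nodes[j], sample)))

def train_som(samples, n_nodes=2, epochs=50, rate=0.5):
    """Train a 1-D self-organizing map; returns one reference vector per node."""
    nodes = [list(s) for s in samples[:n_nodes]]  # assumption: seed with first samples
    for epoch in range(epochs):
        decay = 1 - epoch / epochs
        alpha = rate * decay              # learning rate shrinks toward 0
        radius = (n_nodes / 2) * decay    # neighborhood shrinks to the winner alone
        for sample in samples:
            winner = best_node(nodes, sample)
            for j in range(n_nodes):
                if abs(j - winner) <= radius:
                    nodes[j] = [w + alpha * (x - w) for w, x in zip(nodes[j], sample)]
    return nodes

# Made-up samples forming two groups; real inputs would be full expression profiles.
samples = [[0.0, 0.0], [10.0, 10.0], [1.0, 1.0], [9.0, 9.0], [0.5, 0.5], [9.5, 9.5]]
nodes = train_som(samples)
print([best_node(nodes, s) for s in samples])
```

Each sample's cluster is the index of its best-matching node. With only two nodes the map behaves much like 2-means after the first pass; the neighborhood matters more on larger grids, where it keeps adjacent nodes similar.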
32
SOMs (cont’d)
  How can we evaluate the “learned” clusters w/o knowing the true classes?
    Test by class prediction – accuracy should be high if classes reflect true structure
  Results
    Predictors w/ variety of genes did well in cross-validation
    Exception: the one AML in class A was often predicted to be in class B
    This suggests an iterative method for class discovery: discover, predict, refine
33
Independent model validation
  Cannot assess “accuracy” on test data
  Instead, assess prediction strength
    High PS indicates that structure in initial data is also present in test data
  Results
    Median PS=.61, 74% of samples above threshold
    Compared w/ random clusters, PS’s were highly statistically significant
    We have discovered ALL-AML distinction!
    Even lower-level distinctions also discovered
34
Other CS w/ expression arrays
  Regulatory element detection
    Correlate expression data with frequency of DNA motifs
    Taxing even for fastest processors today
  Discovery of regulatory pathways
    Treat expression arrays over time as a graph
    Establish a Bayesian network model for regulatory pathways over the array graph structure
    Infer network parameters → pathway structure
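The first idea, correlating expression with motif frequency, can be sketched for a single candidate motif. The code counts occurrences of the motif in each gene's upstream sequence and computes the Pearson correlation between those counts and the genes' expression levels. The sequences, expression values, motif, and function names are all made up for illustration.

```python
def count_motif(sequence, motif):
    """Count occurrences of motif in sequence, including overlapping ones."""
    return sum(sequence.startswith(motif, i)
               for i in range(len(sequence) - len(motif) + 1))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical upstream regions and expression levels for three genes.
upstream = ["TTACGCGTAACGCGTT", "GGACGCGTTTTTTTTT", "TTTTTTTTTTTTTTTT"]
expression = [4.0, 2.1, 0.2]
counts = [count_motif(seq, "ACGCG") for seq in upstream]
print(counts, pearson(counts, expression))
```

The cost noted on the slide comes from repeating this for every candidate motif: there are 4^k possible motifs of length k, each scanned across thousands of upstream regions.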
35
Problems with DNA arrays
  Different companies, different types
  Even within one company
    Different products over time
    Different binding efficiencies
  Much time spent on normalization
    Even then, different groups’ results are hard to compare
  Biggest worry: RNA levels in cells do not accurately reflect current protein content
    Perhaps limits our discovery potential
36
Proteomics
  If protein is most important, why not study it directly?
    Much work is done on proteins already
    But difficult to purify, prepare, quantify
    Results are very coarse
  Emerging technologies
    More efficient protein purification and protein arrays are being developed!
    Lots of discoveries to come
37
Outline
  Overview
  Biology
  Technology
  Algorithms & Applications
    Low tech: String algorithms
    High tech: Class discovery/prediction
    Treatments & clinical outcomes
  Conclusions
38
Looking to the future
  Biology is becoming a more theoretical, unified science
    The problem w/ biology has always been that there are too many layers
    Work has always been somewhere in the middle
    Now research is beginning to focus on processes and pathways and networks in general
    This is the proper path to developing theories
  Along the way …
    Lots of hard computational problems to be solved!