clustering and classification · 2017-10-25 · 6 statistical methods in microarray analysis...
TRANSCRIPT
![Page 1: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/1.jpg)
1
Statistical Methods in Microarray Analysis Tutorial
Clustering and Classification
Jean Yee Hwa YangUniversity of California, San Francisco
http://www.biostat.ucsf.edu/jean/
Institute for Mathematical Sciences, National University of Singapore
January 2 - 6, 2004
Statistical Methods in Microarray Analysis Tutorial
Classification
• Task: assign objects to classes (groups) on the basis of measurements made on the objects
• Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
• Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations
Statistical Methods in Microarray Analysis Tutorial
Clustering
Statistical Methods in Microarray Analysis Tutorial
Basic principles of clustering
Aim: to group observations that are “similar”based on predefined criteria.
Issues: Which genes / arrays to use?Which similarity or dissimilarity measure?Which clustering algorithm?
It is advisable to reduce the number of genes from the full set to some more manageable number, before clustering. The basis for this reduction is usually quite context specific.
![Page 2: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/2.jpg)
2
Statistical Methods in Microarray Analysis Tutorial
Expression Data
For each gene, calculate a summary statistics and/or
adjusted p-values
Clustering
Clustering
Set of candidate DE genes.Biological verification
Descriptive interpretation
Similarity metrics
Clustering algorithm
Statistical Methods in Microarray Analysis Tutorial
Which similarity or dissimilarity measure?
• A metric is a measure of the similarity or dissimilarity between two data objects
• Used to form data points into clusters
• Two main classes of distance:- Correlation coefficients
- Compares shape of expression curves- Two types of correlation:
- Centered.- Un-centered.
- Distance metrics- City Block (Manhattan) distance- Euclidean distance
Statistical Methods in Microarray Analysis Tutorial
• Pearson Correlation CoefficientSx = Standard deviation of x
Sy = Standard deviation of y
• Others include Spearman’s ρ and Kendall’s τ
∑=
−
−
−n
i y
i
x
in S
yyS
xx
11
1
Correlation (a measure between -1 and 1)
Positive correlation Negative correlation
You can use absolute correlation to capture both positive and negative correlation
Statistical Methods in Microarray Analysis Tutorial
Potential pitfalls
Correlation = 1
![Page 3: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/3.jpg)
3
Statistical Methods in Microarray Analysis Tutorial
Distance metrics
• City Block (Manhattan) distance:- Sum of differences across
dimensions- Less sensitive to outliers - Diamond shaped clusters
• Euclidean distance:- Most commonly used
distance- Sphere shaped cluster- Corresponds to the
geometric distance into the multidimensional space
∑ −=i
ii yxYXd ),( ∑ −=i
ii yxYXd 2)(),(
where gene X = (x1,… ,xn) and gene Y=(y1,… ,yn)
X
Y
Condition 1
Con
ditio
n 2
Condition 1
X
Y
Con
ditio
n 2
Statistical Methods in Microarray Analysis Tutorial
Euclidean vs Correlation (I)
• Euclidean distance• Correlation
Statistical Methods in Microarray Analysis Tutorial
xx
Complete (minimum)
Distance between centroids
Distance between clustersBetween-cluster dissimilarity measures
Average (Mean) linkage
xx
Single (maximum)
Statistical Methods in Microarray Analysis Tutorial
Clustering algorithms
• Clustering algorithm comes in 2 basic flavors
Partitioning Hierarchical
![Page 4: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/4.jpg)
4
Statistical Methods in Microarray Analysis Tutorial
Partitioning methods
• Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.
• Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within cluster sums of squares.
• Examples:- k-means, self-organizing maps (SOM), PAM, etc.;- Fuzzy: needs stochastic model, e.g. Gaussian
mixtures.
Statistical Methods in Microarray Analysis Tutorial
K = 2
Partitioning methods
Statistical Methods in Microarray Analysis Tutorial
K = 4
Partitioning methods
Statistical Methods in Microarray Analysis Tutorial
Hierarchical methods
• Hierarchical clustering methods produce a tree or dendrogram.
• They avoid specifying how many clusters are appropriate by providing a partition for each k obtained from cutting the tree at some level.
• The tree can be built in two distinct ways- bottom-up: agglomerative clustering.- top-down: divisive clustering.
![Page 5: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/5.jpg)
5
Statistical Methods in Microarray Analysis Tutorial
1 5 2 3 4
1 5 2 3 4
1,2,5
3,41,5
1,2,3,4,5
Agglomerative
Illustration of points In two dimensional space
15
34
2
Statistical Methods in Microarray Analysis Tutorial
1 5 2 3 4
1 5 2 3 4
1,2,5
3,41,5
1,2,3,4,5
Agglomerative
Tree re-ordering?
15
34
2
1 52 3 4
Statistical Methods in Microarray Analysis Tutorial
Partitioning vs. hierarchical
Partitioning:Advantages• Optimal for certain
criteria.• Genes automatically
assigned to clustersDisadvantages• Need initial k;• Often require long
computation times.• All genes are forced into
a cluster.
HierarchicalAdvantages• Faster computation.• Visual.Disadvantages• Unrelated genes are
eventually joined• Rigid, cannot correct later
for erroneous decisions made earlier.
• Hard to define clusters.
Statistical Methods in Microarray Analysis Tutorial
Clustering microarray data
• Clustering leads to readily interpretable figures and can be helpful for identifying patterns in time or space.
Examples:• We can cluster cell samples (cols),
e.g. 1) for identification (profiles). Here, we might want to estimate the number of different neuron cell types in a set of samples, based on gene expression.
2) the identification of new / unknown tumor classes using gene expression profiles.
• We can cluster genes (rows) , e.g. using large numbers of yeast experiments, to identify groups of co-regulated genes.
• We can cluster genes (rows) to reduce redundancy (cf. variable selection) in predictive models.
![Page 6: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/6.jpg)
6
Statistical Methods in Microarray Analysis Tutorial
Taken from Nature February, 2000Paper by A Alizadeh et alDistinct types of diffuse large B-cell lymphoma identified by Gene expression profiling,
Clustering both cell samplesand genes
Statistical Methods in Microarray Analysis Tutorial
Clustering cell samplesDiscovering sub-groups
Taken fromAlizadeh et al(Nature, 2000)
Statistical Methods in Microarray Analysis Tutorial
Yeast Cell Cycle(Cho et al, 1998)6 × 5 SOM with 828 genes
Taken from Tamayo et al, (PNAS, 1999)
Clustering genesFinding different patterns in the data
Statistical Methods in Microarray Analysis Tutorial
Some other issues on clustering
• Two-way clustering.
• Comparing trees.
• Estimating the number of cluster.- Silhouette width (used in PAM).- Gap statistics
Ref: Tibshirani, Walther and Hastie “Estimating the number of clusters in a dataset via the gap statistic”.
- Clest: Ref: Dudoit & Fridlyand “A prediction-based resampling method for estimating the number of clusters in a dataset.”
• Note: select a subsets of genes using their class association before clustering.
![Page 7: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/7.jpg)
7
Statistical Methods in Microarray Analysis Tutorial
Summary
Which clustering method should I use?- What is the biological question?- Do I have a preconceived notion of how many clusters there
should be?- Can a gene be in multiple clusters?- Hard or soft boundaries between clusters
Keep in mind:- Clustering cannot NOT work. That is, every clustering methods
will return clusters.- Clustering helps to group / order information and is a
visualization tool for learning about the data. However, clustering results do not provide biological “proof”.
Statistical Methods in Microarray Analysis Tutorial
Comparison of clustering and single gene approaches for microarray data analysis
Cluster analysis:1) Usually outside the normal framework of statistical
inference.2) Less appropriate when only a few genes are likely to
change.3) Needs lots of experiments.
Single gene approaches1) May be too noisy in general to show much.2) May not reveal coordinated effects of positively correlated
genes.3) Harder to relate to pathways.
Statistical Methods in Microarray Analysis Tutorial
Discrimination
Classification procedureFeature selection
Performance assessmentComparison study
Statistical Methods in Microarray Analysis Tutorial
Predefined Class
{1,2,… K}
1 2 K
Objects
Basic principles of discrimination•Each object associated with a class label (or response) Y ∈ {1, 2, … , K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, … , XG)
Aim: predict Y from X.
X = {red, square} Y = ?
Y = Class Label = 2
X = Feature vector{colour, shape}
Classification rule ?
![Page 8: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/8.jpg)
8
Statistical Methods in Microarray Analysis Tutorial
Classification
Learning SetData with
known classes
ClassificationTechnique
Classificationrule
Data with unknown classes
ClassAssignment
Discrimination
Prediction
Statistical Methods in Microarray Analysis Tutorial
?Bad prognosisrecurrence < 5yrs
Good Prognosisrecurrence > 5yrs
ReferenceL van’t Veer et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan..
ObjectsArray
Feature vectorsGene
expression
Predefine classesClinical
outcome
new array
Learning set
Classificationrule
Good PrognosisMatesis > 5
Statistical Methods in Microarray Analysis Tutorial
B-ALL T-ALL AML
ReferenceGolub et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
ObjectsArray
Feature vectorsGene
expression
Predefine classes
Tumor type
?
new array
Learning set
ClassificationRule
T-ALL
Statistical Methods in Microarray Analysis Tutorial
Classification Rule
-Classification procedure,-Feature selection,
-Parameters [pre-determine, estimable],
Distance measure,Aggregation methods
Performance Assessmente.g. Cross validation
• One can think of the classification rule as a black box, some methods provides more insight into the box.
• Performance assessment needs to be looked at for all classification rule.
![Page 9: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/9.jpg)
9
Statistical Methods in Microarray Analysis Tutorial
Classification rule Maximum likelihood discriminant rule
• A maximum likelihood estimator (MLE) chooses the parameter value that makes the chance of the observations the highest
• For known class conditional densities pk(X), the maximum likelihood (ML) discriminant rule predicts the class of an observation X by
C(X) = argmaxk pk(X)
Statistical Methods in Microarray Analysis Tutorial
Gaussian ML discriminant rules
• For multivariate Gaussian (normal) class densities X|Y= k ~ N(µk, Σk), the ML classifier is
C(X) = argmink {(X - µk) Σk-1 (X - µk)’+ log| Σk |}
• In general, this is a quadratic rule (Quadratic discriminant analysis, or QDA)
• In practice, population mean vectors µk and covariance matrices Σk are estimated by corresponding sample quantities
Statistical Methods in Microarray Analysis Tutorial
ML discriminant rules - special cases
[DLDA] Diagonal linear discriminant analysisclass densities have the same diagonal covariance matrix ∇= diag(s1
2, … , sp2)
[DQDA] Diagonal quadratic discriminant analysis)class densities have different diagonal covariance matrix ∇k= diag(s1k
2, … , spk2)
Note. Weighted gene voting of Golub et al. (1999) is a minor variant of DLDA fortwo classes (wrong variance calculation).
Statistical Methods in Microarray Analysis Tutorial
Nearest neighbor classification
• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).
• k-nearest neighbor rule (Fix and Hodges (1951)) classifies an observation X as follows:- find the k observations in the learning set closest to X- predict the class of X by majority vote, i.e., choose
the class that is most common among those k observations.
• The number of neighbors k can be chosen by cross-validation (more on this later).
![Page 10: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/10.jpg)
10
Statistical Methods in Microarray Analysis Tutorial
Nearest neighbor rule
Statistical Methods in Microarray Analysis Tutorial
Classification tree
• Partition the feature space into a set of rectangles, then fit a simple model in each one
• Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets (starting with X itself)
• Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier
Statistical Methods in Microarray Analysis Tutorial
Classification tree
Gene 1Mi1 < -0.67
Gene 2Mi2 > 0.18
0
2
1
yes
yes
no
no 0
1
2
Gene 1
Gene 2
-0.67
0.18
Statistical Methods in Microarray Analysis Tutorial
Three aspects of tree construction
• Split selection rule: - Example, at each node, choose split maximizing decrease in
impurity (e.g. Gini index, entropy, misclassification error).
• Split-stopping: - Example, grow large tree, prune to obtain a sequence of
subtrees, then use cross-validation to identify the subtree with lowest misclassification rate.
• Class assignment: - Example, for each terminal node, choose the class minimizing
the resubstitution estimate of misclassification probability, given that a case falls into this node.
![Page 11: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/11.jpg)
11
Statistical Methods in Microarray Analysis Tutorial
Classification with SVMs
Statistical Methods in Microarray Analysis Tutorial
Other classifiers include…
• Neural networks
• Logistic regression
• Projection pursuit
• Bayesian belief networks
Statistical Methods in Microarray Analysis Tutorial
Why select features?
Correlation plotData: Leukemia, 3 class
No feature selection
Top 100feature selection
Selection based on variance
-1 +1
Statistical Methods in Microarray Analysis Tutorial
Explicit feature selection
• One-gene-at-a-time approaches. Genes are ranked based on the value of a univariate test statistic such as: t- or F-statistic or their non-parametric variants (Wilcoxon/Kruskal-Wallis); p-value.
• Possible meta-parameters include the number of genes G or a p-value cut-off. A formal choice of these parameters may be achieved by cross-validation or bootstrap procedures.
• Multivariate approaches. More refined feature selection procedures consider the joint distribution of the expression measures, in order to detect genes with weak main effects but possibly strong interaction.
• Bo & Jonassen (2002): Subset selection procedures for screening gene pairs to be used in classification.
• Breiman (1999): ranks genes according to importance statistic define in terms of prediction accuracy. Note that tree building itself does not involve explicit feature selection.
![Page 12: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/12.jpg)
12
Statistical Methods in Microarray Analysis Tutorial
Implicit feature selection
• Feature selection may also be performed implicitly by the classification rule itself.
• In classification trees, features are selected at each step based on reduction in impurity and The number of features used (or size of the tree) is determined by pruning the tree using cross-validation. Thus, feature selection is inherent part of tree-building and pruning deals with over-fitting.
• Shrinkage methods and adaptive distance function. May be used for LDA and kNN.
Statistical Methods in Microarray Analysis Tutorial
Performance assessment
• Any classification rule needs to be evaluated for its performance on the future samples. It is almost never the case in microarray studies that a large independent population-based collection of samples is available at the time of initial classifier-building phase.
• One needs to estimate future performance based on what is available: often the same set that is used to build the classifier.
• Assessing performance of the classifier based on- Cross-validation.- Test set- Independent testing on future dataset
Statistical Methods in Microarray Analysis Tutorial
Diagram of performance assessment
Training set
Performance assessment
TrainingSet
Independenttest set
Classifier
Classifier
Resubstitutionestimation
Test set estimation
Statistical Methods in Microarray Analysis Tutorial
Performance assessment (I)
• Resubstitution estimation: error rate on the learning set.- Problem: downward bias
• Test set estimation:
1) divide learning set into two sub-sets, L and T; Build the classifier on L and compute the error rate on T.
2) Build the classifier on the training set (L) and compute the error rate on an independent test set (T). - L and T must be independent and identically distributed (i.i.d).- Problem: reduced effective sample size
![Page 13: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/13.jpg)
13
Statistical Methods in Microarray Analysis Tutorial
Diagram of performance assessment
Training set
Performance assessment
TrainingSet
Independenttest set
(CV) Learningset
(CV) Test set
Classifier
Classifier
Classifier
Resubstitutionestimation
Test set estimation
Cross Validation
Statistical Methods in Microarray Analysis Tutorial
Performance assessment (II)
• V-fold cross-validation (CV) estimation: Cases in learning set randomly divided into V subsets of (nearly) equal size. Build classifiers by leaving one set out; compute test set error rates on the left out set and averaged. - Bias-variance tradeoff: smaller V can give larger bias but smaller
variance- Computationally intensive.
• Leave-one-out cross validation (LOOCV).
(Special case for V=n). Works well for stable classifiers (k-NN, LOOCV)
Statistical Methods in Microarray Analysis Tutorial
Performance assessment (III)
• Common practice to do feature selection using the learning , then CV only for model building and classification.
• However, usually features are unknown and the intended inference includes feature selection. Then, CV estimates as above tend to be downward biased.
• Features (variables) should be selected only from the learning set used to build the model (and not the entire set)
Statistical Methods in Microarray Analysis Tutorial
Another component in classification rule:aggregating classifiers
Training Set
X1, X2, … X100
Classifier 1Resample 1
Classifier 2Resample 2
Classifier 499Resample 499
Classifier 500Resample 500
Examples:BaggingBoosting
Random Forest
Aggregateclassifier
![Page 14: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/14.jpg)
14
Statistical Methods in Microarray Analysis Tutorial
Aggregating classifiers:Bagging
Training Set (arrays)X1, X2, … X100
Tree 1Resample 1X*1, X*2, … X*100
Lets the treevote
Tree 2Resample 2X*1, X*2, … X*100
Tree 499Resample 499X*1, X*2, … X*100
Tree 500Resample 500X*1, X*2, … X*100
Testsample
Class 1
Class 2
Class 1
Class 1
90% Class 110% Class 2
Statistical Methods in Microarray Analysis Tutorial
Comparison study
• Leukemia data – Golub et al. (1999)- n = 72 samples, - G = 3,571 genes,- 3 classes (B-cell ALL, T-cell ALL, AML).
• Reference:S. Dudoit, J. Fridlyand, and T. P. Speed (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, Vol. 97, No. 457, p. 77-87
Statistical Methods in Microarray Analysis Tutorial
Leukemia data, 3 classes: Test set error rates;150 LS/TS runs
Statistical Methods in Microarray Analysis Tutorial
Results
• In the main comparison, NN and DLDA had the smallest error rates.
• Aggregation improved the performance of CART classifiers.
• For the leukemia datasets, increasing the number of genes to G=200 didn't greatly affectthe performance of the various classifiers.
![Page 15: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/15.jpg)
15
Statistical Methods in Microarray Analysis Tutorial
Comparison study – discussion (I)
• “Diagonal”LDA: ignoring correlation between genes helped here. Unlike classification trees and nearest neighbors, DLDA is unable to take into account gene interactions.
• Classification trees are capable of handling and revealing interactions between variables. In addition, they have useful by-product of aggregated classifiers: prediction votes, variable importance statistics.
• Although nearest neighbors are simple and intuitiveclassifiers, their main limitation is that they give very little insight into mechanisms underlying the class distinctions.
Statistical Methods in Microarray Analysis Tutorial
Summary (I)
• Bias-variance trade-off. Simple classifiers do well on small datasets. As the number of samples increases, we expect to see that classifiers capable of considering higher-order interactions (and aggregated classifiers) will have an edge.
• Cross-validation . It is of utmost importance to cross-validate for every parameter that has been chosen based on the data, including meta-parameters - what and how many features- how many neighbors- pooled or unpooled variance- classifier itself.
If this is not done, it is possible to wrongly declare having discrimination power when there is none.
Statistical Methods in Microarray Analysis Tutorial
Summary (II)
• Generalization error rate estimation. It is necessary to keep sampling scheme in mind.
• Thousands and thousands of independent samples from variety of sources are needed to be able to address the true performance of the classifier.
• We are not at that point yet with microarrays studies. Van Veer et al (2002) study is probably the only study to date with ~300 test samples.
Statistical Methods in Microarray Analysis Tutorial
Reference 1Retrospective studyL van’t Veer et al Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan 2002..
Learning set
Bad Good
ClassificationRule
Reference 2Retrospective studyM Van de Vijver et al. A gene expression signature as a predictor of survival in breast cancer. The New England Jouranl of Medicine, Dec 2002.
Reference 3Prospective trials.Aug 2003Clinical trialshttp://www.agendia.com/
Feature selection.Correlation with class
labels, very similar to t-test.
Using cross validation toselect 70 genes
295 samples selected from Netherland Cancer Institute
tissue bank (1984 – 1995).
Results” Gene expression profile is a morepowerful predictor then standard systems
based on clinical and histologic criteria
Agendia (formed by reseachers from the Netherlands Cancer Institute)Plan to start in Oct, 2003
1) 3000 subjects [Health Council of the Netherlands]2) 5000 subjects New York based Avon Foundation.
Custorm arrays are made by Aglient including70 genes + 1000 controls
Case studies
![Page 16: Clustering and Classification · 2017-10-25 · 6 Statistical Methods in Microarray Analysis Tutorial Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of](https://reader033.vdocuments.mx/reader033/viewer/2022042113/5e8f728c0fb8542a1a4da6ed/html5/thumbnails/16.jpg)
16
Statistical Methods in Microarray Analysis Tutorial
Acknowledgements
Clustering
• Agnes Paquet
• David Erle
• Andrea Barczac,
• UCSF SandlerGenomics Core Facility.
Discrimination
• Jane Fridlyand
• Mark Segal
• Terry Speed
• Sandrine Dudoit