SMD Data Analysis Tutorial, April 7, 2009. Catherine Ball (ball@genome.stanford.edu), Janos Demeter...
![Page 2: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/2.jpg)
SMD: Getting Help
• Click on the “Help” menu
– Tool-specific links will be listed at the top.
• Use the SMD help index to look for specific subjects
• Send e-mail to: [email protected]
You will learn…
• How to use SMD Data Analysis Pipeline
– Gene Selection and Annotation
– Data Filtering
– Data Retrieval
– Gene Filtering
– Clustering and Image Generation
• How to use SMD’s data repository
• How to use SMD’s implementation of the GenePattern data analysis suite
Data Retrieval and Analysis
• Experiment names will be listed with feature extraction software indicated.
Gene Selection and Annotation
• Specify genes or clones
• Collapse data by SUID or LUID
• Determine UID column
• Choose biological annotation
• Label result set
Gene Selection: All genes
• Ten arrays
• All genes
• 8690 Biosequence IDs used in cluster
• Using all genes results in a very long cluster!
Gene Selection: Specify Genes or Clones
• Use all genes or clones on an array
• Select a Genelist from your loader.stanford.edu account
• Enter a list of genes to select. The names should be separated by two colons
• Optionally include controls and empty spots.
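As a small illustration of the two-colon convention, the sketch below joins and splits a gene list in Python. The gene names are made-up examples and the helper names are hypothetical; they are not part of SMD.

```python
def join_gene_names(names):
    """Join gene names with the two-colon separator."""
    return "::".join(name.strip() for name in names)


def split_gene_names(text):
    """Split a two-colon-separated string into names, dropping empty pieces."""
    return [part.strip() for part in text.split("::") if part.strip()]


# Example gene names, chosen only for illustration.
entry = join_gene_names(["ACT1", "TUB1", "CDC28"])
print(entry)
print(split_gene_names(entry))
```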
Gene Selection: Genelists
• Ten arrays
• 500-gene genelist
• 380 Biosequence IDs used for cluster
• Using a genelist limits the data analyzed to a subset of genes
Gene Selection: Retrieving and Collapsing Data
• Collapse or averaging occurs within each individual array. Multiple instances of the same entity will be combined as specified.
• Duplicated entities can be defined in three ways:
– Biosequence ID is the identifier for the molecule in SMD.
– Laboratory Unique ID is the identifier for the source of the sample in the lab.
– SPOT is an individual feature on a print.
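The collapse step can be pictured with a short Python sketch. It assumes that duplicates are combined by a plain average and that the data for one array arrive as (identifier, value) pairs; both are illustrative assumptions, and the function name is hypothetical rather than anything SMD exposes.

```python
from collections import defaultdict


def collapse_by_id(spots):
    """Average the values of spots that share an identifier on one array.

    `spots` is a list of (identifier, value) pairs for a single array.
    A value of None stands for a missing measurement and is skipped.
    """
    grouped = defaultdict(list)
    for identifier, value in spots:
        if value is not None:
            grouped[identifier].append(value)
    return {identifier: sum(values) / len(values)
            for identifier, values in grouped.items()}


# Identifier "S1" is spotted twice on this hypothetical array.
array_spots = [("S1", 1.0), ("S2", -0.5), ("S1", 2.0), ("S3", None)]
collapsed = collapse_by_id(array_spots)
print(collapsed)
```

Because the averaging runs separately for each array, the same function would simply be called once per array.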
Gene Selection: Collapse by SUID
• Ten arrays
• 500-gene genelist
• Data retrieved by Biosequence ID
• 380 Biosequence IDs used for cluster
• Duplicated spots will be averaged
Gene Annotation: Biological Annotation
• The list includes all information stored within SMD for any gene from the organism in question. Not all genes will have all annotations.
• Annotations from a genelist (if one was selected) can be used to describe the genes.
Array Annotation: Name Choices
• Arrays (hybridizations) are identified in SMD by slide name (e.g., serial number) and experiment name, both unique.
• Agilent and Affymetrix data sets are further identified by a “result set” name – possibly more than one per hybridization, and not guaranteed to be unique.
Data Filtering
• Choose data column to retrieve
• Elect to invert reverse dye replicates
• Elect to filter by spot flag
• Select spot criteria for filtering
• Define image presentation options
• Retrieve data in background (not shown) - goes to repository
Data Filtering: Choose Data to Retrieve
• You can retrieve and cluster any numerical measurement from your data.
• Clustering doesn’t necessarily make sense for all fields.
• Default (and most appropriate) fields for clustering are log ratio (two-channel data) and signal or intensity (single-channel data).
Data Filtering: Selecting Filtering Criteria
• Each spot will be individually assessed as specified, prior to any averaging or collapse.
• Each filter can be made active and customized as desired.
• Filters can be combined using logical operators (filter string), defaulting to a logical AND.
• Filters available will be appropriate to the feature extraction software used.
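To make the default AND combination concrete, here is a minimal Python sketch in which a spot is kept only if every active filter accepts it. The field names and thresholds are hypothetical example values, not SMD's actual column names or defaults.

```python
def passes_all(spot, filters):
    """Return True only if the spot passes every active filter (logical AND)."""
    return all(check(spot) for check in filters)


# Hypothetical field names; thresholds are example values only.
active_filters = [
    lambda s: s["flag"] == 0,
    lambda s: s["regression_correlation"] > 0.6,
    lambda s: s["ch1_net"] >= 350 and s["ch2_net"] >= 350,
]

spots = [
    {"id": "a", "flag": 0, "regression_correlation": 0.9, "ch1_net": 800, "ch2_net": 500},
    {"id": "b", "flag": 0, "regression_correlation": 0.4, "ch1_net": 900, "ch2_net": 700},
    {"id": "c", "flag": 1, "regression_correlation": 0.8, "ch1_net": 400, "ch2_net": 400},
    {"id": "d", "flag": 0, "regression_correlation": 0.7, "ch1_net": 200, "ch2_net": 900},
]

# Each spot is assessed on its own, before any averaging or collapse.
kept = [s["id"] for s in spots if passes_all(s, active_filters)]
print(kept)
```

Replacing `all` with `any` would give a logical OR, which is one way to think about what a different filter string would do.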
Data Filtering: Default Spot Filters
• Regression correlation measures pixel-by-pixel agreement between the two channels.
• Foreground/Background intensities are a simple measure of signal to noise.
• Absolute intensity cutoffs impose a minimum net signal.
• “Failed” and “Is Contaminated” refer to the quality of the spot material.
• Equivalent defaults are presented for Agilent data.
• Affymetrix data can be filtered on detection, detection p-value, etc.
• Any data, including biological annotations, can be used for customized filters.
Spots with low regression correlation
Data Filtering: Regression Correlation
• Ten arrays
• 500-gene genelist
• Spot flag = 0
• Regression correlation > 0.6
• 380 Biosequence IDs used for filtering
• Filtering away spots with low regression correlation removes many spots
Data Filtering: Combinations of Filters
• Ten arrays
• 500-gene genelist
• Regression correlation > 0.6
• Net intensity in each channel >= 350
• 371 Biosequence IDs selected for clustering
• This data set was formed by selecting spots that are good quality (via the regression correlation) and good intensity in both channels
Data Filtering: Image Presentation Options
• Retrieve spot coordinates will allow you to see an assembled image of each array after clustering.
• Show all spots allows you to view the spots you filtered out (in addition to the ones that passed filtering) after clustering. This might slow down data retrieval.
Data Filtering: Retrieve Data in Background
• Long-running data retrieval jobs can be submitted to run in the background; you will be e-mailed a progress report.
• Data sets will be saved to your data repository.
Data Retrieval
• General results and progress
• PreClustering (.pcl) file
• Data retrieval summary report
• Option to deposit data in repository
…….
Data Retrieval Summary
Gene Filtering
• Transform single-channel data
• Filter genes based on data distribution
• Data centering
• Filter genes based on data values
• Filter genes and arrays based on spot filter criteria
• Zero-transform data
Gene Filtering: Transformation
• Single-channel (e.g., Affymetrix) data only.
• Adjust arrays for simple cross-array comparison.
• Log-transform data for clustering.
– May add a constant for variance stabilization
– May replace non-positive values with very small values
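The log transformation and its two options might look like the following Python sketch. The base-2 logarithm, the parameter names `offset` and `floor`, and the default floor value are assumptions made for illustration; the slides do not state which base or constants SMD uses.

```python
import math


def log2_transform(values, offset=0.0, floor=1e-6):
    """Log2-transform single-channel intensities.

    offset: constant added before taking the log (variance stabilization).
    floor:  small positive value substituted for non-positive inputs.
    """
    result = []
    for v in values:
        shifted = v + offset
        if shifted <= 0:
            shifted = floor  # avoid taking the log of zero or a negative number
        result.append(math.log2(shifted))
    return result


signals = [1024.0, 16.0, 0.0, -5.0]
print(log2_transform(signals))               # non-positive values use the floor
print(log2_transform(signals, offset=16.0))  # constant added before the log
```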
Gene Filtering: Data Distribution
• Rank will select genes whose retrieved value is in the top Nth percentile.
• Deviations selects those genes whose retrieved value is significantly above or below the mean.
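Both distribution filters can be sketched in a few lines of Python. The sketch assumes that ranks and means are computed within each array across genes, that a gene is kept if it qualifies in at least one array, and that "percentile rank" means the share of values less than or equal to the gene's value; SMD's exact definitions may differ, and the function names are hypothetical.

```python
import statistics


def columns(data):
    """Yield one list of values per array, in gene order."""
    genes = list(data)
    n_arrays = len(data[genes[0]])
    for j in range(n_arrays):
        yield [data[g][j] for g in genes]


def select_by_rank(data, min_percentile):
    """Keep genes ranked above min_percentile in at least one array."""
    genes = list(data)
    keep = set()
    for column in columns(data):
        n = len(column)
        for gene, v in zip(genes, column):
            rank = 100.0 * sum(1 for other in column if other <= v) / n
            if rank > min_percentile:
                keep.add(gene)
    return [g for g in genes if g in keep]


def select_by_deviation(data, n_sd=1.0):
    """Keep genes more than n_sd standard deviations from an array's mean."""
    genes = list(data)
    keep = set()
    for column in columns(data):
        mean = statistics.mean(column)
        sd = statistics.stdev(column)
        for gene, v in zip(genes, column):
            if abs(v - mean) > n_sd * sd:
                keep.add(gene)
    return [g for g in genes if g in keep]


# Hypothetical log ratios for five genes on two arrays.
data = {
    "g1": [0.1, 0.0],
    "g2": [0.2, 0.1],
    "g3": [3.0, 0.2],
    "g4": [-0.1, -2.5],
    "g5": [0.0, 0.1],
}
print(select_by_rank(data, min_percentile=80))
print(select_by_deviation(data, n_sd=1.0))
```

Note that the rank filter only finds high values, while the deviation filter also picks up strongly negative ones.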
Gene Filtering: Percentile Rank
• Ten arrays
• 500-gene genelist
• Regression correlation > 0.6
• Net intensity in either channel >= 350
• Rank > 95% in at least one array
• Many data are removed, since only those that were very intense in the yellow (red) channel are included.
Gene Filtering: Deviation from Mean Value
• Ten arrays, 500-gene genelist
• Regression correlation > 0.6
• Net intensity in either channel >= 350
• Genes whose Log(Normalized Red/Green) is more than one standard deviation from mean in at least one array
• This filter removes data that do not show significant variance from the mean – a good way to identify genes with potentially interesting behavior.
Gene Filtering: Centering Data
• Data can be centered at this stage. This transforms the data so that the mean value is equal to zero. Images and downloaded files will reflect this transformation.
• During clustering, data can be treated as if they were centered, but the values of the data are not affected.
• Gene centering is useful for common references.
• Array centering amounts to renormalizing each array, using the spots that pass the spot filter criteria.
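The difference between gene centering and array centering can be shown on a tiny matrix. This Python sketch assumes rows are genes, columns are arrays, and there are no missing values; the function names are hypothetical.

```python
def center_genes(matrix):
    """Subtract each row's mean so every gene averages to zero across arrays."""
    return [[v - sum(row) / len(row) for v in row] for row in matrix]


def center_arrays(matrix):
    """Subtract each column's mean so every array averages to zero across genes."""
    n_rows = len(matrix)
    col_means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, col_means)] for row in matrix]


# Two genes (rows) measured on three arrays (columns).
log_ratios = [
    [1.0, 2.0, 3.0],
    [4.0, 4.0, 7.0],
]
print(center_genes(log_ratios))   # each row now has mean zero
print(center_arrays(log_ratios))  # each column now has mean zero
```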
Data Centering
• Centering sets the average value of a vector to zero.
• This results in a loss of information, but may reveal important patterns.
Figure: log(ratio) versus array (arrays 1 to 16).
Data Centering
• Gene centering is useful when the actual value of the ratio is not important or is not meaningful (e.g., common reference).
• Centering is generally not appropriate when using a biologically meaningful control sample, such as a matched, untreated sample, or a zero timepoint.
Figure: log(ratio) versus array (arrays 1 to 16).
Data Transformation: Centering
• To illustrate how centering affects data, a small sample of data was duplicated, and a constant was added to the second copy of each row.
Uncentered Data, No Centering Metric During Clustering
Uncentered Data, Centering Metric During Clustering
Centered Data, No Centering Metric During Clustering
Centered Data, Centering Metric During Clustering
Gene Filtering: Center Genes
• Ten arrays, 500-gene genelist
• Regression correlation > 0.6
• Net intensity in either channel >= 350
• Genes centered
• No effect on number of biosequence IDs clustered, but data values are changed (centered data is displayed on left)
Centered Uncentered
Gene Filtering: Data Values
• Cutoff requires data to exceed a user-defined value in at least A arrays. Think hard before using this filter. Especially when data are centered, you could be losing important information.
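A minimal Python sketch of the cutoff filter follows. It assumes the comparison is on the raw value (not its absolute value) and that missing values never count toward the required number of arrays; both are assumptions, and the names are hypothetical.

```python
def passes_cutoff(values, cutoff, min_arrays):
    """True if at least min_arrays values exceed cutoff (missing values skipped)."""
    hits = sum(1 for v in values if v is not None and v > cutoff)
    return hits >= min_arrays


# Hypothetical log ratios; None marks a missing data point.
data = {
    "g1": [2.5, 0.1, 2.2, None],
    "g2": [2.1, 0.3, 0.2, 0.0],
    "g3": [0.4, 0.5, 0.1, 0.2],
}
selected = [g for g, vals in data.items()
            if passes_cutoff(vals, cutoff=2.0, min_arrays=2)]
print(selected)
```

With centered data every value is relative to the gene's own mean, so a fixed cutoff like this can discard genes that are consistently high, which is one reason to use it with care.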
Gene Filtering: Spot Filter Criteria
• Genes can be screened out if they do not meet the spot criteria a given percentage of the time, as specified by the user.
• Arrays can be similarly filtered out if they do not meet the spot filter criteria.
Spot Filtering vs. Gene Filtering
Spot filters remove individual data points. That means there will be more missing (gray) data.
Gene filters remove the genes that do not meet the filter criteria often enough. This reduces the number of genes.
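The contrast can be sketched in Python: spot filtering has already turned individual values into missing data (None below), and the gene filter then drops whole rows that have too little data left. The percentage threshold and function name are illustrative assumptions.

```python
def filter_genes_by_good_data(data, min_percent_present):
    """Drop genes whose share of non-missing values falls below the threshold."""
    kept = {}
    for gene, values in data.items():
        present = sum(1 for v in values if v is not None)
        if 100.0 * present / len(values) >= min_percent_present:
            kept[gene] = values
    return kept


# None marks a data point removed earlier by a spot filter.
data = {
    "g1": [0.5, None, 1.2, 0.8, 0.9],     # 80% present
    "g2": [None, None, 0.3, None, 0.1],   # 40% present
    "g3": [0.2, 0.4, 0.1, 0.0, -0.3],     # 100% present
}
kept = filter_genes_by_good_data(data, min_percent_present=80)
print(sorted(kept))
```

The same idea applied to columns instead of rows would filter out arrays with too many failed spots.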
Gene Filtering: Zero Time Point Transformation
• Data can be transformed by subtracting one state of a series from all other data
Gene Filtering: Zero Time Point Transformation
Subtract the values at the first time point from all the other time points.
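The subtraction can be sketched in Python as below. Keeping the reference time point in the output (where it becomes zero) is an assumption of this sketch, as is the hypothetical `reference_index` parameter for choosing a different state of the series.

```python
def zero_transform(values, reference_index=0):
    """Subtract the reference time point from every value in one gene's series."""
    reference = values[reference_index]
    return [v - reference for v in values]


# Hypothetical log ratios for one gene over four time points.
time_course = [0.5, 1.5, 2.0, 0.25]
print(zero_transform(time_course))
```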
Gene Filtering: Results
• Download PreClustering files (.pcl)
• Go to GenePattern
• Summary report
• Deposit to repository
• Another round of filtering
• Proceed to clustering
Gene Filtering: Data Retrieval Summary Report
Clustering and Image Generation
• Partitioning options
• Clustering metric selections
• Correlated genes
• Image generation options
Clustering Algorithms
In microarray studies, we often use clustering algorithms to help us identify patterns in complex data.
For example, we can randomize the data used to represent this painting and see if clustering will help us visualize the pattern.
Clustering algorithms
The painting is “sliced” into rows which are then randomized.
Clustering algorithms
Rows ordered by hierarchical clustering with nodes flipped to optimize ordering
How do we compare expression profiles?
• Treat expression data for a gene as a multidimensional vector.
• Decide on a distance metric to compare the vectors.
– Plenty to choose from…
• Pearson correlation, Euclidean distance, Manhattan distance, etc.
Similar expression
• Crucial concept for understanding clustering
• Each gene is represented by a vector where coordinates are its values (log(ratio)) in each experiment
• x = log(ratio) in experiment 1
• y = log(ratio) in experiment 2
• z = log(ratio) in experiment 3
• etc.
Expression Vectors
Figure: an expression vector plotted on x, y, and z axes.
Clustering: Metric Selections
• Genes and arrays can be clustered.
• Pearson correlation treats vectors as if they were the same (unit) length.
• Euclidean distance will be affected by both the direction and the amplitude of the vectors.
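A short Python sketch shows how the two metrics can disagree about which genes are most alike. The three profiles are made-up values, and the correlation here is the standard (centered) Pearson coefficient; SMD may offer other variants.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Standard Pearson correlation coefficient of two vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


gene_a = [0.1, 0.2, 0.3, 0.2]
gene_b = [1.0, 2.0, 3.0, 2.0]   # same shape as A, ten times the amplitude
gene_c = [0.3, 0.2, 0.1, 0.2]   # close to A in value, opposite shape

print("Euclidean A-B:", euclidean_distance(gene_a, gene_b))
print("Euclidean A-C:", euclidean_distance(gene_a, gene_c))
print("Pearson   A-B:", pearson_correlation(gene_a, gene_b))
print("Pearson   A-C:", pearson_correlation(gene_a, gene_c))
```

By Euclidean distance A is far closer to C than to B, while by Pearson correlation A and B are essentially perfectly correlated and A and C are anti-correlated.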
Distance Metrics
• Distances are measured “between” expression vectors
• Distance metrics define the way we measure distances
• Many different ways to measure distance:
– Euclidean distance
– Pearson correlation coefficient(s)
– Manhattan distance
– Mutual information
– Kendall’s Tau
– etc.
• Each has different properties and can reveal different features of the data
Euclidean Distance
• The Euclidean distance metric detects similar vectors by identifying those that are closest in space. In this example, A and C are closest to one another.
Figure: Gene A, Gene B, and Gene C plotted against Array 1 and Array 2 (both axes 0 to 2.5).
Pearson Correlation
• The Pearson correlation disregards the magnitude of the vectors and instead compares their directions. In this example, Gene A and Gene B have the same slope, so they would be most similar to each other.
Figure: Gene A, Gene B, and Gene C plotted against Array 1 and Array 2 (both axes 0 to 2.5).
Distance Metric: Pearson vs. Euclidean
• By Euclidean distance, A and C are most similar.
• By Pearson correlation, A and B are most similar.
Figure: log ratio versus array (arrays 1 to 8) for genes A, B, and C.
Clustering: Tree Displays
• Clustered gene arrays are displayed adjacent to most similar arrays.
• The nodes of the trees indicate the members of an array and the degree of similarity to its neighbor.
Hierarchical Clustering
1. Calculate the distance between all genes. Find the smallest distance. If several pairs share the same similarity, use a predetermined rule to decide between alternatives.
2. Fuse the two selected clusters to produce a new cluster that now contains at least two objects. Calculate the distance between the new cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single cluster remains.
4. Draw a tree representing the results.
Figure: genes G1 to G6 and the resulting tree, with leaves in the order G1, G6, G3, G5, G4, G2.
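The numbered steps can be sketched as a naive agglomerative loop in Python. The slide does not say how the distance between clusters is computed, so this sketch assumes Euclidean distance with average linkage and breaks ties by taking the first pair found; the gene coordinates are made up, and the list of merges stands in for the drawn tree.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_cluster(vectors):
    """Return the merges, in order, as (members_a, members_b, distance)."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        # Step 1: find the closest pair (ties go to the first pair found).
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], vectors))
        merges.append((a, b, average_linkage(a, b, vectors)))
        # Step 2: fuse the pair into one new cluster.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges  # Step 3 is the loop; the merge list describes the tree.


genes = {
    "G1": [0.0, 0.0],
    "G2": [0.0, 1.0],
    "G3": [5.0, 5.0],
    "G4": [5.0, 6.5],
}
merges = hierarchical_cluster(genes)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 3))
```

This version recomputes every pairwise distance at each pass, which is fine for a handful of genes but far too slow for a full array; real tools cache the distance matrix.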
Clustering: Array Clustering
No Array Clustering
With Array Clustering
Clustering: Self Organizing Maps
• A map of n partitions is modeled on the expression data, and each partition in the map has an associated vector
• Genes are assigned to partitions of most similar genes
• Neighboring partitions are more similar to each other than they are to distant partitions
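For readers who want to see the idea in code, here is a minimal one-dimensional self-organizing map in Python. Everything about it (the linear layout of partitions, the Gaussian neighborhood, the learning-rate schedule, the iteration count, the seed) is an assumption chosen to keep the sketch short; it is not a description of SMD's implementation.

```python
import math
import random


def nearest_partition(x, nodes):
    """Index of the partition whose vector is closest to x (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, node)) for node in nodes]
    return dists.index(min(dists))


def train_som(vectors, n_partitions, n_iterations=500, seed=0):
    """Train a 1-D map and return one vector per partition."""
    rng = random.Random(seed)
    # Start each partition's vector at a randomly chosen data point (copied).
    nodes = [list(rng.choice(vectors)) for _ in range(n_partitions)]
    for t in range(n_iterations):
        frac = 1.0 - t / n_iterations              # shrinks from 1 toward 0
        learning_rate = 0.5 * frac
        radius = max(1.0, n_partitions / 2) * frac
        x = rng.choice(vectors)
        best = nearest_partition(x, nodes)
        for k, node in enumerate(nodes):
            # Partitions near the best match on the map are pulled along too,
            # which is what makes neighboring partitions end up similar.
            influence = math.exp(-((k - best) ** 2) / (2 * radius ** 2))
            for d in range(len(node)):
                node[d] += learning_rate * influence * (x[d] - node[d])
    return nodes


# Hypothetical expression profiles for five genes over three arrays.
profiles = {
    "g1": [1.0, 1.1, 0.9],
    "g2": [0.9, 1.0, 1.2],
    "g3": [-1.0, -1.2, -0.8],
    "g4": [-0.9, -1.1, -1.0],
    "g5": [0.0, 0.1, -0.1],
}
partition_vectors = train_som(list(profiles.values()), n_partitions=3)
assignments = {gene: nearest_partition(vec, partition_vectors)
               for gene, vec in profiles.items()}
print(assignments)
```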
Clustering: Correlated Genes
• SMD can produce a file listing the best-correlated genes, for each gene retrieved.
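The idea behind such a listing can be sketched in Python: for each gene, score every other gene by Pearson correlation and keep the top few. The data, the `top_n` parameter, and the output layout are illustrative assumptions, not the format of the file SMD produces.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation coefficient of two vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def best_correlated(data, top_n=2):
    """For each gene, list the top_n other genes by Pearson correlation."""
    result = {}
    for gene, profile in data.items():
        scored = [(pearson(profile, other_profile), other)
                  for other, other_profile in data.items() if other != gene]
        scored.sort(key=lambda item: item[0], reverse=True)
        result[gene] = [(other, round(r, 3)) for r, other in scored[:top_n]]
    return result


# Hypothetical log ratios for four genes over four arrays.
data = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 4.0],
}
for gene, partners in best_correlated(data).items():
    print(gene, partners)
```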
Clustering: Visualization
• Click on the image to get a dynamic display.
• Click on the TreeView button for another dynamic option.
• Click on one of the other options to see static displays with or without the spot images.
• Download files (.cdt, .atr, .gtr, report) for use with other tools (e.g., TreeView).
• Add cluster or pre-clustering file to your repository
Clustering Display: Adjacent Cluster and Clustered Spot Images
Clustering Display: Hierarchical Cluster View
• Interactive view of cluster
• Link to GO term analysis (green nodes) to evaluate sub-clusters.
SMD Data Analysis
• Using SMD Data Analysis Pipeline
• Repository Tools
– SVD
– Synthetic Gene Tool
– kNNimpute
• GenePattern tools
SMD Help: File Formats
File Formats: Pre-clustering (PCL) File
UID is the Unique Identifier for the Spot/Reporter
NAME is the sequence label for the Spot/Reporter
GWEIGHT indicates the weight the Spot/Reporter is given in clustering
Names and orders of arrays (if arrays are not clustered)
EWEIGHT indicates the weight the Array/Experiment is given in clustering
Values are for each spot/reporter on each array (usually log ratios)
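A small Python reader makes the layout easier to picture. The sketch assumes a tab-delimited file whose first row holds UID, NAME, GWEIGHT, and the array names, whose second row is the EWEIGHT row, and in which an empty cell means a missing value; the sample content and identifiers are invented for illustration.

```python
import csv
import io

# Invented sample content in the assumed PCL layout.
SAMPLE_PCL = (
    "UID\tNAME\tGWEIGHT\tarray_1\tarray_2\tarray_3\n"
    "EWEIGHT\t\t\t1\t1\t1\n"
    "ID001\tgene one\t1\t0.52\t-1.10\t\n"
    "ID002\tgene two\t1\t-0.30\t0.75\t1.40\n"
)


def read_pcl(handle):
    """Parse a PCL file into array names, array weights, and per-gene rows."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    array_names = header[3:]                 # columns after UID, NAME, GWEIGHT
    eweight_row = next(reader)
    eweights = [float(w) for w in eweight_row[3:]]
    genes = []
    for row in reader:
        if not row:
            continue                         # skip blank lines
        values = [float(v) if v != "" else None for v in row[3:]]
        genes.append({"uid": row[0], "name": row[1],
                      "gweight": float(row[2]), "values": values})
    return array_names, eweights, genes


array_names, eweights, genes = read_pcl(io.StringIO(SAMPLE_PCL))
print(array_names)
print(eweights)
for gene in genes:
    print(gene["uid"], gene["name"], gene["values"])
```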
File Formats: Clustered Data Table (CDT) File
![Page 69: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/69.jpg)
File Formats: Gene Cluster Text (GCT) File
![Page 70: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/70.jpg)
File Formats: Class (CLS) File
![Page 71: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/71.jpg)
Using Your Repository: PCL Deposits
View information about your repository entry
Download data
Delete the repository entry
Edit the entry
Cluster data
Filter data
Apply SVD to data
Apply “Synthetic Genes” to data
Estimate missing data with KNN impute
Use GenePattern tools
![Page 72: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/72.jpg)
Using the Repository: CDT File Options
CDT files have a few other options
GeneXplorer
TreeView
Clustering with Proxy images
Clustering with Spot images
Clustering with Proxy and Spot images
![Page 73: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/73.jpg)
Viewing Repository Entries
• Name
• Organism
• Number of genes
• Number of arrays
• Size of file
• Date uploaded
• Description
• Data retrieval summary
![Page 74: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/74.jpg)
Editing Entries -- How to Share!
• Change repository entry name
• Change description
• Add access to a repository entry to a GROUP
• Add access to a repository entry to an SMD USER
![Page 75: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/75.jpg)
SMD Data Analysis
• Data Analysis Background
– Clustering algorithms
– Data centering
• Using SMD Data Analysis Pipeline
– Gene Selection and Annotation
– Data Filtering
– Data Retrieval
– Gene Filtering
– Clustering and Image Generation
• Repository Tools
– SVD
– Synthetic Gene Tool
– kNNimpute
• GenePattern tools
![Page 76: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/76.jpg)
SVD: Singular Value Decomposition
• The goal of SVD is to find a set of patterns that describe the greatest amount of variance in a dataset
• SVD determines unique orthogonal (or uncorrelated) gene and corresponding array expression patterns (i.e. "eigengenes" and "eigenarrays," respectively) in the data
• Patterns might be correlated with biological processes OR might be correlated with technical artifacts
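To make the idea of "patterns that describe the greatest amount of variance" concrete, here is a minimal sketch that extracts only the first (largest) singular triplet of a genes × arrays matrix by power iteration. It is not how SMD computes SVD; the function names are invented for this example, and the labeling of the right singular vector as an "eigengene" and the left one as an "eigenarray" is an assumption about the convention the slide uses.

```python
import math

def top_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its two vectors.

    matrix: list of rows (genes), each a list of floats (arrays).
    Returns (sigma, u, v): u spans genes, v spans arrays.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # arbitrary starting pattern
    for _ in range(iterations):
        # u = A v, then w = A^T u: one power-iteration step on A^T A.
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("starting vector is orthogonal to the data")
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

def variance_fraction(matrix, sigma):
    """Share of the total squared signal captured by one component."""
    total = sum(x * x for row in matrix for x in row)
    return sigma * sigma / total

# Toy data in which every gene follows the same pattern across arrays.
data = [[3.0, 4.0], [6.0, 8.0]]
sigma, eigenarray, eigengene = top_singular_triplet(data)
print(sigma, eigengene, variance_fraction(data, sigma))
```

In this toy matrix a single pattern explains all of the signal, so the fraction is 1. With real data the fractions of the leading components indicate how many patterns are worth inspecting, and whether a dominant one looks biological or technical.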
![Page 77: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/77.jpg)
SVD: Method
![Page 78: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/78.jpg)
SVD Display in SMD
![Page 79: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/79.jpg)
SMD Data Analysis
• Data Analysis Background
– Clustering algorithms
– Data centering
• Using SMD Data Analysis Pipeline
– Gene Selection and Annotation
– Data Filtering
– Data Retrieval
– Gene Filtering
– Clustering and Image Generation
• Repository Tools
– SVD
– kNNimpute
– Synthetic Gene Tool
• GenePattern tools
![Page 80: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/80.jpg)
KNNImpute: The Missing Values Problem
• Microarrays can have systematic or random missing values
• Some algorithms aren’t robust to missing values
• Large literature on parameter estimation exists
• What’s best to do for microarrays?
![Page 81: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/81.jpg)
KNNimpute Algorithm
• Idea: use genes with similar expression profiles to estimate missing values

Before imputation (rows are labeled Gene X, Gene B, Gene C on the slide; the missing value is in column j):
2 | 4 | 5 | 7 | 3 | 2
2 |   | 5 | 7 | 3 | 1
3 | 5 | 6 | 7 | 3 | 2

After imputation (estimated value 4.3 in column j):
2 | 4 | 5 | 7 | 3 | 2
2 |4.3| 5 | 7 | 3 | 1
3 | 5 | 6 | 7 | 3 | 2
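The three-gene example above can be reproduced approximately with a small sketch. It assumes Euclidean distance over the columns both genes have observed, and a weighted average of the neighbors' values with weights equal to the reciprocal of the distance. The `knn_impute` name and the choice of k are illustrative; this is not the code behind SMD's KNNimpute tool.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries using the k most similar rows (genes)."""
    filled = [row[:] for row in matrix]
    for g, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for h, other in enumerate(matrix):
                # A neighbor must be a different gene with column j observed.
                if h == g or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Root-mean-square difference over the shared columns.
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # nothing to borrow from; leave the gap
            # Closer genes get larger weights (small constant avoids 1/0).
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            filled[g][j] = (sum(w * v for w, (_, v) in zip(weights, nearest))
                            / sum(weights))
    return filled

expression = [
    [2, 4, 5, 7, 3, 2],
    [2, None, 5, 7, 3, 1],
    [3, 5, 6, 7, 3, 2],
]
print(round(knn_impute(expression)[1][1], 2))
```

With these assumptions the estimate lands between the two neighboring values (4 and 5) and closer to the more similar gene, in line with the 4.3 shown on the slide; the exact figure depends on the distance measure and weighting used.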
![Page 82: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/82.jpg)
SMD Data Analysis
• Data Analysis Background
– Clustering algorithms
– Data centering
• Using SMD Data Analysis Pipeline
– Gene Selection and Annotation
– Data Filtering
– Data Retrieval
– Gene Filtering
– Clustering and Image Generation
• Repository Tools
– SVD
– kNNimpute
– Synthetic Gene Tool
• GenePattern tools
![Page 83: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/83.jpg)
Synthetic Genes
• Purpose: average data based on arbitrary groupings of genes/probes
- for biological reasons
- for technical reasons
• Can average data using:
- common genelists
- your own genelists
- annotations in pcl file
• After averaging:
- a new row for the synthetic gene data
- original data can be removed/included
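The averaging step described above might look like the following sketch. It assumes each row is a `(uid, values)` pair with `None` for missing measurements, and that a synthetic gene is the per-array mean of its member rows. The function name, the `keep_originals` flag and the sample identifiers are hypothetical.

```python
def add_synthetic_gene(rows, name, member_uids, keep_originals=True):
    """Append one averaged row built from the rows listed in member_uids."""
    members = [values for uid, values in rows if uid in member_uids]
    if not members:
        raise ValueError("no rows match the gene list")
    averaged = []
    for j in range(len(members[0])):
        # Average only the values that were actually measured.
        observed = [m[j] for m in members if m[j] is not None]
        averaged.append(sum(observed) / len(observed) if observed else None)
    if keep_originals:
        kept = list(rows)
    else:
        kept = [(uid, v) for uid, v in rows if uid not in member_uids]
    return kept + [(name, averaged)]

rows = [
    ("clone1", [1.0, 2.0]),
    ("clone2", [3.0, None]),
    ("clone3", [0.5, 0.5]),
]
# Collapse two clones that report on the same (made-up) gene identifier.
print(add_synthetic_gene(rows, "Hs.1", {"clone1", "clone2"},
                         keep_originals=False))
```

Setting `keep_originals=True` corresponds to including the original data next to the new row; `False` corresponds to removing it.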
![Page 84: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/84.jpg)
Synthetic Genes
• Common lists available (only mouse and human data):
– Unigene (all clones/oligos that report on a given Unigene id will be averaged and shown as the Unigene id)
– Entrez Geneid (same as above, but for Entrez Geneid)
These lists are useful to collapse data by gene, rather than by biosequenceid/luid.
They allow comparison of experiments between different platforms (oligo print to cDNA print, or spotted arrays to Agilent arrays) where the arrays don’t share common reporters. They can also be used to compare cDNA prints with HEEBO/MEEBO arrays.
These synthetic gene lists are updated on a regular basis.
![Page 85: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/85.jpg)
Synthetic Genes
• Other common synthetic gene lists:
– chromosome arms
– cytobands
– 5 Mb tiles based on GoldenPath mappings
– tissue types
– tumor types
– processes
– for additional lists, see: http://smd.stanford.edu/help/synthGenes.shtml
![Page 86: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/86.jpg)
SMD Data Analysis
• Using SMD Data Analysis Pipeline
• Repository Tools
– SVD
– Synthetic Gene Tool
– kNNimpute
• GenePattern tools
![Page 87: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/87.jpg)
What is GenePattern?
Software package developed at the Broad Institute (Jill P. Mesirov’s group):
http://www.broad.mit.edu/cancer/software/genepattern/
Reasons to choose this package:
• Large number of microarray analysis tools (>90)
• Ability to create pipelines (reproducible research)
• Ease of adding new modules to existing ones
![Page 88: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/88.jpg)
How to find GP in SMD?
From Data retrieval
From repository
![Page 89: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/89.jpg)
Terms in GenePattern
• Module (Analysis/Visualization/Utility): program that does analysis, displays or executes some other transformation of a file
• Pipelines: chained modules - output from one -> input to next
• Suites: groupings of modules/pipelines
• Jobs:
– execution of module/pipeline
– persistent
– results are deleted after one week
• Go to web-site for navigation
![Page 90: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/90.jpg)
GenePattern comments
• SMD uses pcl
• GenePattern uses gct (among others)
• Converters gct -> pcl; pcl -> gct
• Most tools in GenePattern need full dataset - Use ImputeMissingValuesKNN first
• Most default values are designed for Affymetrix data - evaluate each option carefully (GeneCruiser)
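For orientation, the pcl -> gct direction can be sketched as below. This is not the converter shipped with SMD or GenePattern. It assumes the GCT 1.2 layout (a `#1.2` line, a line with row and column counts, a `Name`/`Description` header, then one row per gene) and a complete dataset, so rows with missing values are rejected rather than written. The function name and the input structure are hypothetical.

```python
def pcl_rows_to_gct(array_names, rows):
    """Build GCT text from (uid, name, values) rows with no missing values.

    GWEIGHT and EWEIGHT have no counterpart here and are dropped.
    """
    lines = [
        "#1.2",
        "%d\t%d" % (len(rows), len(array_names)),
        "\t".join(["Name", "Description"] + list(array_names)),
    ]
    for uid, name, values in rows:
        if len(values) != len(array_names):
            raise ValueError("row %s has the wrong number of values" % uid)
        if any(v is None for v in values):
            raise ValueError("row %s has missing values; impute first" % uid)
        lines.append("\t".join([uid, name] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"

print(pcl_rows_to_gct(["a1", "a2"], [("g1", "Gene 1", [0.5, -1.0])]))
```

Refusing incomplete rows mirrors the advice above: estimate missing values first, then hand the full dataset to the downstream tools.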
![Page 91: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/91.jpg)
Input/output files in GenePattern
• Called through specific pcl file
• Files in your repository
• Upload data from desktop
• Any file that has a url
• Module to get data directly from GEO
![Page 92: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/92.jpg)
GenePattern modules by category: Clustering
• Clustering:
– Hierarchical clustering/HierarchicalClusteringPCL
– Self-organizing maps (SOM)
– K-means clustering
– Non-negative matrix factorization (NMF) (Brunet et al., 2004) is an alternative method for class discovery. Rather than clustering genes, NMF detects context-dependent patterns of gene expression. Requires all positive values.
– Consensus clustering (Monti et al., 2003) runs a selected clustering algorithm against perturbations of the original data set. The result is a consensus matrix that assesses the stability of discovered clusters. Supported clustering methods: hierarchical clustering, K-means clustering, self-organizing maps (SOM), and non-negative matrix factorization (NMF). Clusters genes or samples, not both.
– SubMap (Hoshida Y, et al. PLoS ONE 2(11): e1195, 2007) is an unsupervised method that estimates the significance of an association between subclasses observed in two independent data sets. The subclass labels are predetermined, either as manually assigned phenotypes or by clustering prior to the application of the SubMap algorithm.
• Corresponding visualizers:
– HierarchicalClusteringViewer
– SOMClusterViewer
– HeatMapViewer
– Etc…
• Clustering result example
![Page 93: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/93.jpg)
GenePattern modules by category: Marker Selection
• ComparativeMarkerSelection (similar to SAM)
– ComparativeMarkerSelectionViewer: visualize and explore data produced by the method
– ExtractComparativeMarkerResults: extract data based on the analysis, create genelist
• Gene Set Enrichment Analysis (GSEA)
– GSEALeadingEdgeViewer
![Page 94: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/94.jpg)
GenePattern modules : Marker Selection
Goal: Given phenotypically distinct classes, find markers with distinct expression patterns (in different classes)
[Figure: flowchart. Dataset and Phenotype → Score → Measure of confidence/significance → Reject/Accept criterion? (Yes/No) → Marker candidates]
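The score / significance / accept-or-reject flow on this slide can be sketched for a single gene as follows. The score used here is a signal-to-noise ratio (difference of class means divided by the sum of class standard deviations), and significance is estimated by shuffling the class labels. Both are assumptions chosen for illustration; the function names are made up and the defaults are not those of ComparativeMarkerSelection.

```python
import random
import statistics

def signal_to_noise(values, labels):
    """(mean of class 0 - mean of class 1) / (sd of class 0 + sd of class 1).

    Assumes two classes labeled 0 and 1, each with at least two samples
    and some spread (otherwise the denominator is zero).
    """
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    return ((statistics.mean(a) - statistics.mean(b))
            / (statistics.stdev(a) + statistics.stdev(b)))

def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """Fraction of label shufflings scoring at least as extreme as observed."""
    rng = random.Random(seed)
    observed = abs(signal_to_noise(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(signal_to_noise(values, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

expression = [5.1, 4.9, 5.3, 1.0, 1.2, 0.8]   # one gene, six samples
phenotype = [0, 0, 0, 1, 1, 1]                # class label per sample
score = signal_to_noise(expression, phenotype)
p_value = permutation_p_value(expression, phenotype)
print(score, p_value, "marker candidate" if p_value < 0.2 else "rejected")
```

With only six samples there are few distinct label arrangements, so the p-value cannot become very small; the 0.2 cut-off in the last line is arbitrary and only demonstrates the accept/reject step.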
![Page 95: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/95.jpg)
Visualize result using ComparativeMarkerSelectionViewer
GenePattern modules: Marker Selection
![Page 96: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/96.jpg)
GenePattern modules: GSEA
[Figure: flowchart. Dataset and Phenotype → Score → Measure of confidence/significance → Reject/Accept criterion? (Yes/No) → Marker candidates]
• A method to determine whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states
• Very similar to comparative marker selection: sets rather than genes
![Page 97: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/97.jpg)
GenePattern modules: GSEA
Molecular Signatures DB http://www.broad.mit.edu/gsea/msigdb/index.jsp
Gene sets: groups of gene symbols
Gene sets are versioned
![Page 98: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/98.jpg)
GenePattern modules: GSEA
• Requirements:
– Expression dataset
– Class file (figure callouts: number of slides, number of classes)
– Chip file
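A class file can be written by hand or generated. The sketch below assumes the categorical CLS layout used by GenePattern: a first line with the number of samples (slides), the number of classes and the constant 1; a second line starting with `#` that names the classes; and a third line with one zero-based class index per sample. The `make_cls` helper is hypothetical, and class names are assumed to contain no spaces.

```python
def make_cls(sample_classes):
    """Return CLS text for a list of class names, one per slide, in order."""
    class_names = []
    for name in sample_classes:
        if name not in class_names:
            class_names.append(name)  # classes numbered by first appearance
    indices = [str(class_names.index(name)) for name in sample_classes]
    lines = [
        "%d %d 1" % (len(sample_classes), len(class_names)),
        "# " + " ".join(class_names),
        " ".join(indices),
    ]
    return "\n".join(lines) + "\n"

print(make_cls(["tumor", "tumor", "normal"]))
```

The order of the labels has to match the order of the arrays in the expression dataset, since the file carries no sample names.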
![Page 99: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/99.jpg)
GenePattern modules: GSEA
How it works:
• Sorts rows based on how well a metric correlates with the class assignment (similar to marker selection tool)
• Scores gene sets (using a scoring method) by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not.
• The enrichment score is the maximum (or minimum) value of the running sum.
• Permutes labels to come up with FDR
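The running-sum walk described above can be sketched in a few lines. This is a simplified, unweighted version in which every hit adds the same step and every miss subtracts the same step; the actual GSEA statistic can weight hits by their correlation with the phenotype, and the label permutations used for the FDR are left out. The function name and gene identifiers are made up.

```python
def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from zero of an unweighted running sum."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up for a gene in the set, down for a gene outside it.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]   # already sorted by metric
print(enrichment_score(ranked, {"g1", "g2"}))    # set at the top of the list
print(enrichment_score(ranked, {"g5", "g6"}))    # set at the bottom
```

A set concentrated at the top of the ranking yields a positive score and one concentrated at the bottom a negative score, which is why the slide speaks of a maximum or a minimum.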
![Page 100: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/100.jpg)
• Output results can be viewed in web-browser
• Further analysis: GSEALeadingEdgeViewer
GenePattern modules: GSEA
![Page 101: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/101.jpg)
Create pipeline/suite
• Pipeline can be created from a path
• by concatenating individual modules
• From zip file
• Pipelines can be exported
• Pipelines can be private or public
![Page 102: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/102.jpg)
Creating new modules
• SMD curators/programmers can create/upload new modules
• If you have any programs you would like to share or use in SMD, please let us know
![Page 103: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/103.jpg)
GenePattern modules : Class Prediction
• Goal: Given phenotypically distinct classes, find a gene expression signature that accurately predicts class membership.
• Computational methodology: divide data into training and test sets
• Goal:
– achieve high predictive power
– avoid over-fitting
[Figure: flowchart. Expression Data and Known Classes → Assess Gene-Class Correlation / Feature Selection → Build Classifier → Test Classifier by Cross-Validation → Evaluate Classifier on Independent Test Set]
Classifiers listed on the slide:
– Naïve Bayes/Large Bayes
– Weighted Voting
– k-Nearest Neighbors
– Support Vector Machines
– Logistic Regression
– Classification Trees
– LDA, QDA, …
![Page 104: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/104.jpg)
GenePattern modules : Class Prediction
• Simple example: kNN classifier, k=5, 2 genes, 2 classes
[Figure: samples projected in gene space (axes: gene 1, gene 2); classes: orange, black]
![Page 105: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/105.jpg)
GenePattern modules : Class Prediction
• Simple example: kNN classifier, k=5, 2 genes, 2 classes
[Figure: unknown sample (?) projected into the same gene space (axes: gene 1, gene 2); classes: orange, black]
![Page 106: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/106.jpg)
GenePattern modules : Class Prediction
• Simple example: kNN classifier, k=5, 2 genes, 2 classes
• "Consult" the 5 closest neighbors: 3 black, 2 orange
• Distance measures:
– Euclidean distance
– 1-Pearson correlation
– KL divergence
– …
[Figure: unknown sample (?) and its 5 closest neighbors in gene space (axes: gene 1, gene 2); classes: orange, black]
![Page 107: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/107.jpg)
GenePattern modules : Class Prediction
• Evaluation on independent test set
– Build the classifier on the train set.
– Assess prediction performance on the test set.
• Maximize generalization/Avoid overfitting.
• Performance measure:
error rate = (# of cases incorrectly classified) / (total # of cases)
![Page 108: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/108.jpg)
• K-nearest-neighbors (KNN) classifies an unknown sample by assigning it the phenotype label most frequently represented among the k nearest known samples (Golub and Slonim et al., 1999). In GenePattern, the user selects a weighting factor for the 'votes' of the nearest neighbors (unweighted: all votes are equal; weighted by the reciprocal of the rank of the neighbor's distance: the closest neighbor is given weight 1/1, next closest neighbor is given weight 1/2, and so on; or weighted by the reciprocal of the distance).
• Weighted Voting (Slonim et al., 2000) classifies an unknown sample using a simple weighted voting scheme. Each gene in the classifier 'votes' for the phenotype class of the unknown sample. A gene's vote is weighted by how closely its expression correlates with the differentiation between phenotype classes in the training data set.
• Support Vector Machines (SVM) is designed for multiple class classification (Rifkin et al., 2003). The algorithm creates a binary SVM classifier for each class by computing a maximal margin hyperplane that separates the given class from all other classes; that is, the hyperplane with maximal distance to the nearest data point. The binary classifiers are then combined into a multiclass classifier. For an unknown sample, the assigned class is the one with the largest margin.
• CART (Breiman et al., 1984) builds Classification And Regression Trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). It works by recursively splitting the feature space into a set of non-overlapping regions and then predicting the most likely value of the dependent variable within each region. A classification tree represents a set of nested if-then conditions that allows for the prediction of the value of the categorical dependent variable based on the observed values of the feature variables. A regression tree is similar but allows for the prediction of the value of a continuous dependent variable instead.
GenePattern modules : Class Prediction
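The k-nearest-neighbors description above, including its three voting schemes, can be illustrated with a short sketch. It assumes Euclidean distance in a two-gene space and made-up coordinates in which 3 of the 5 closest neighbors are black and 2 are orange; the `knn_classify` function and its `weighting` argument are illustrative and do not reproduce the GenePattern module's interface.

```python
import math
from collections import defaultdict

def knn_classify(train, query, k=5, weighting="none"):
    """Predict a label for query from (point, label) training pairs.

    weighting: "none" (equal votes), "rank" (1/rank of the neighbor),
    or "distance" (1/distance to the neighbor).
    """
    neighbors = sorted(
        ((math.dist(point, query), label) for point, label in train),
        key=lambda pair: pair[0])[:k]
    votes = defaultdict(float)
    for rank, (dist, label) in enumerate(neighbors, start=1):
        if weighting == "none":
            weight = 1.0
        elif weighting == "rank":
            weight = 1.0 / rank
        elif weighting == "distance":
            weight = 1.0 / (dist + 1e-9)  # small constant avoids 1/0
        else:
            raise ValueError("unknown weighting: %s" % weighting)
        votes[label] += weight
    return max(votes, key=votes.get)

# (gene 1, gene 2) expression for samples of known class.
train = [
    ((0.1, 0.0), "orange"), ((0.0, 0.2), "orange"),
    ((0.3, 0.0), "black"), ((0.0, 0.4), "black"), ((0.5, 0.0), "black"),
    ((5.0, 5.0), "orange"), ((6.0, 6.0), "black"),
]
unknown = (0.0, 0.0)
for scheme in ("none", "rank", "distance"):
    print(scheme, knn_classify(train, unknown, k=5, weighting=scheme))
```

In this toy data the two orange samples are the closest neighbors, so the unweighted vote (3 black against 2 orange) and the weighted votes disagree. That is the practical reason to evaluate the weighting option rather than accept a default.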
![Page 109: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/109.jpg)
• ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) (Margolin, A., et al., BMC Bioinformatics, 2006. 7(Suppl 1): p. S7.) is an algorithm that reverse engineers a gene regulatory network from microarray gene expression data. It attempts to predict targets of selected transcription factors from a microarray dataset.
• MINDY (Modulator Inference by Network Dynamics) computationally infers genes that modulate the activity of a transcription factor at post-transcriptional levels (Wang et al., 2006). The algorithm uses mutual information (MI) to measure the mutual dependence of the transcription factor (TF) and its target gene to predict modulators of TF activity.
GenePattern modules: Pathway Analysis
![Page 110: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/110.jpg)
• SurvivalCurve: Draws survival curve based on cls file
• SurvivalDifference: tests if there is a difference between two or more survival curves based on sample classes defined by genomic data. The log-rank test (Mantel-Haenszel test) and the generalized Wilcoxon test can be used.
GenePattern modules: Survival Analysis
![Page 111: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/111.jpg)
GenePattern modules
• Many other modules:
– Projection methods: PCA, NMF (Non-negative Matrix Factorization)
– Tools for SNP analysis
– Tools for proteomics data
– Etc…
![Page 112: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/112.jpg)
SMD: Office Hours
• Grant S201
• Mondays 3 - 5 pm
• Wednesdays 2 - 4 pm
![Page 113: SMD Data Analysis Tutorial April 7, 2009 Catherine Ball (ball@genome.stanford.edu) Janos Demeter (jdemeter@genome.stanford.edu)](https://reader036.vdocuments.mx/reader036/viewer/2022081603/56649dc65503460f94abaa7a/html5/thumbnails/113.jpg)
SMD Staff
• Gavin Sherlock, Co-Investigator
• Catherine Ball, Director
• Janos Demeter, Computational Biologist
• Tatiparthy Reddy, Scientific Curator
• Heng Jin, Scientific Programmer
• Patrick Brown, Co-Investigator
• Farrell Wymore, Lead Programmer
• Michael Nitzberg, Database Administrator
• Zac Zachariah, Systems Administrator
• Maria Mao, Software Engineer
• Jeremy Hubble, Scientific Programmer