Statistical Methods for Microarrays
Christina Kendziorski
Landon Sego
Department of Biostatistics and Medical Informatics
University of Wisconsin-Madison
BASIC BIOLOGY
Introduction to Basic Biology and Microarray Experiments
• What is a DNA microarray measuring?
Gene expression.
• The novelty of a microarray is that it quantifies the abundance of thousands of genes simultaneously—which gives biologists a global perspective.
• The biological processes that give rise to microarray data can be viewed as information transfer processes.
• Data collection for microarray experiments is not a trivial task and requires imaging technology and image processing tools.
Nguyen, et al. 2002
Review of DNA molecule
Nguyen, et al. 2002
The Central Dogma of Molecular Biology
Terminology
• Amino acid The basic building block of proteins (or polypeptides)
• mRNA Messenger RNA is an RNA strand complementary to a DNA template
• Transcription The process where the DNA template is copied/transcribed to mRNA
• Gene expression A gene is expressed if its DNA has been transcribed to RNA—gene
expression is the level of transcription of the DNA of the gene
• RT Reverse transcription is an experimental procedure to synthesize a DNA
strand (cDNA) which is complementary to a mRNA template
Nguyen, et al. 2002
Terminology
• cDNA/cRNA Complementary DNA is synthesized from mRNA during RT and, similarly, in the
context of oligo arrays, complementary RNA is RNA synthesized during in vitro transcription
• dNTP Deoxyribonucleoside triphosphate; denotes any of dATP, dCTP, dGTP, or dTTP (or dUTP, which is used for labeling);
molecular building blocks for making DNAs in RT, PCR, or in vitro replication; free dNTPs in solution (which are not yet incorporated into the nucleic acid strand) have three phosphates which provide the necessary energy for cDNA synthesis
Nguyen, et al. 2002
Terminology
• Primer A short, single strand of RNA or DNA that can initiate chain growth from a template
• Oligo(dT) Primer with sequence TTTT… used to initiate cDNA synthesis during RT
• Reverse transcriptase An enzyme that catalyzes the synthesis of cDNA during RT
• Poly(A) tail A sequence of A (AAA …) at the 3' end of mRNA; oligo(dT) is used in
RT to recognize mRNA by its poly(A) tail
• Target cDNAs Mixture of cDNAs obtained from the experimental and reference mRNAs
Nguyen, et al. 2002
Terminology
• Probe cDNAs Immobilized cDNA printed on the array
• Hybridization Process of bringing into contact the target and probe for binding in microarrays—also refers to the binding of two DNA strands generally
• PCR Polymerase chain reaction is a procedure to amplify a segment of DNA—mass-
replication of a segment of DNA
• Oligonucleotide A short fragment of DNA (usually in single-stranded form) which is often chemically synthesized. It can be used as a probe or primer. 'Oligo' is Greek for 'few'.
Nguyen, et al. 2002
The Central Dogma of Molecular Biology
Basic model for gene expression
• Two different levels of gene expression:
Transcription level—where RNA is made from DNA.
Translation level—where protein is made from mRNA.
• Microarrays measure gene expression at the transcription level.
Nguyen, et al. 2002
[Figure: DNA → (transcription) → mRNA → (translation) → amino acid → protein → cell phenotype → organism phenotype]
Gene expression and abundance
• Quantification of mRNA abundances
Quantification of amount of gene expression
• Gene is expressed if its DNA has been transcribed to RNA
• A “high level of expression” would imply transcription has occurred many times and there are many copies of mRNA in the tissue.
• “low level of expression” implies fewer copies of mRNA
Nguyen, et al. 2002
DNA transcription
• DNA transcription is the information transfer process directly relevant to DNA microarray experiments because quantification of the type and amount of this copied information is the goal of the microarray experiment.
• Transcription occurs in 3 stages: initiation, elongation, and termination.
• After transcription, the mRNA is further processed by removing non-coding segments, called introns.
Nguyen, et al. 2002
DNA transcription - Initiation
http://www.brooklyn.cuny.edu/bc/ahp/BioInfo/graphics/Transcription.02.GIF
• Promoter regions on the DNA chain provide the signal for the initiation of transcription. Promoter regions recruit an enzyme (protein) called RNA polymerase II to the transcription initiation site.
DNA transcription - Elongation
http://www.brooklyn.cuny.edu/bc/ahp/BioInfo/graphics/Transcription.02.GIF
• During elongation, the RNA polymerase moves along the DNA and extends the RNA chain by adding free nucleotides with base A, G, C, or U to match the T, C, G, or A nucleotides of the DNA template strand, respectively.
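The base-pairing rule in the elongation step can be illustrated with a short sketch. The function name `transcribe` and the example template are illustrative assumptions, and the sketch ignores initiation, termination, and processing.

```python
# Complementary RNA base added for each DNA template base
# (template T -> A, C -> G, G -> C, A -> U), as described for elongation.
TEMPLATE_TO_RNA = {"T": "A", "C": "G", "G": "C", "A": "U"}


def transcribe(template: str) -> str:
    """Return the RNA strand built base by base along a DNA template strand.

    The template is read in the order the polymerase traverses it, so the
    returned RNA is in the order in which it is synthesized.
    """
    return "".join(TEMPLATE_TO_RNA[base] for base in template.upper())


print(transcribe("TACGGATTC"))  # AUGCCUAAG
```

Any character other than A, C, G, or T raises a `KeyError`, which is a deliberate choice in this sketch so that malformed input is not silently transcribed.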
DNA transcription – Termination and processing
• When the RNA polymerase reaches the template strand signal for termination, the newly synthesized RNA is released from the DNA template.
• Before the message is transported to the cytoplasm, some important posttranscriptional processing occurs.
• For example, a sequence of A’s is added to the RNA strand at the 3' end. This sequence of A’s is called the poly(A) tail.
• Non-coding regions of the mRNA (called introns) are removed in a process called splicing.
Nguyen, et al. 2002
DNA transcription and RNA processing
Nguyen, et al. 2002
One gene ≠ One protein
• Relationship between protein and mRNA is not one to one—so the simplified model shown on the previous slide is only an approximation.
• Exon: Coding region of DNA
• Intron: Non-coding region
[Figure: template DNA strand showing Gene 1 and Gene 2]
1 gaattccacattgtttgctgcacgttggattttgaaatgctagggaactttgggagactc
61 atatttctgggctagaggatctgtggaccacaagatctttttatgatgacagtagcaatg
...
421 gagctacaagggcctggtgcatccagggtgatctagtaattgcagaacagcaagtgctag
481 ctctccctccccttccacagctctgggtgtgggagggggttgtccagcctccagcagcat
541 ggggagggccttggtcagcctctgggtgccagcagggcaggggcggagtcctggggaatg
601 aaggttttatagggctcctgggggaggctccccagccccaagcttaccacctgcacccgg
661 agagctgtgtcaccatgtgggtcccggttgtcttcctcaccctgtccgtgacgtggattg
721 gtgagaggggccatggttggggggatgcaggagagggagccagccctgactgtcaagctg
781 aggctctttcccccccaacccagcaccccagcccagacagggagctgggctcttttctgt
...
6301 cctagagaaggctgtgagccaaggagggagggtcttcctttggcatgggatggggatgaa
6361 gtaaggagagggactggaccccctggaagctgattcactatggggggaggtgtattgaag
6421 tcctccagacaaccctcagatttgatgatttcctagtagaactcacagaaataaagagct
6481 cttatactgt
Success Story: KLK3 (PSA Gene)
Success Story
Dhanasekaran et al., Nature, 2001
GETTING EXPRESSION MEASUREMENTS:
cDNA ARRAYS and AFFY CHIPS
cDNA Microarray Experimental Procedure: Overview
• Given a biological sample of cells, there are two possibilities: a set of genes is either expressed in the cells or it is not.
• cDNA arrays are designed to measure gene expression in an experimental sample relative to a reference (control) sample.
• In current practice, cDNAs from the experimental and reference samples are labeled with different fluorescent dyes, mixed, and hybridized onto the array.
• The measured fluorescence intensity for each sample is assumed to be proportional to transcript abundance (of course, this is conditional on factors such as spot characteristics, hybridization efficiency, level of dye incorporation, etc.).
Nguyen, et al. 2002
cDNA Array Data
[Figure: workflow from Tissue Sample to mRNA to Labelled cDNA to Microarray to Data, shown for nominal levels 1 and 2]
cDNA Microarray Experimental Procedure
1. Fabrication of array: preparing glass slide, selecting probe DNA sequences, and depositing (printing) the probe cDNA onto the slide.
2. Sample preparation: Isolating total RNA (mRNA and other RNAs) from experimental and reference samples of interest.
3. cDNA synthesis and labeling: making cDNAs from the experimental and reference samples and labeling each sample with a fluorescent dye.
4. Hybridization: applying experimental and reference cDNA mixture to the array, letting the target and probe cDNA bind, then washing off the excess.
5. Data collection: measurement of fluorescent intensities using a confocal microscope.
Nguyen, et al. 2002
Fabrication of the array
• What cDNA sequences (probes) should be printed on the array?
• Ideally, all genes would be printed. But in most cases we don’t know the sequences of all genes.
• cDNA libraries (GenBank, UniGene etc.) can be used to select the cDNA sequences.
• Sometimes cDNAs may be spotted (printed) onto the array without knowing what gene the cDNA corresponds to.
• Sometimes a piece of an expressed gene, known as an expressed sequence tag (EST), can be spotted onto the array.
Nguyen, et al. 2002
Fabrication of the array
The MGuide, Version 2.0: The Brown Lab's complete guide to microarraying for the molecular biologist.
http://cmgm.stanford.edu/pbrown/mguide
Fabrication of the array
• The array itself is a pre-treated glass slide to which the cDNA probes will be attached.
• The selected cDNA sequences are amplified (mass-replicated) using PCR.
• After amplification, the solution containing the amplified cDNA probes is deposited on the array using a set of microspotting pins.
• Ideally, the amount of solution deposited by the pins should be uniform, but this is not completely achieved in practice:
– Glass slides and/or treatment are not uniform.
– There are pin effects.
– Spots are not uniform.
Nguyen, et al. 2002
Fabrication of the array
• The drops of solution containing the cDNA probes form the spots on the array—each spot corresponding to a gene, EST, other.
• As a product of PCR, the cDNA probes that are spotted onto the array are double stranded.
• When the target cDNA is applied to the array, the double stranded probes are denatured (separated) in a heating process to allow the target cDNA to bond to the probe strands.
Nguyen, et al. 2002
Sample preparation
• mRNA is extracted from tissue samples. For example:
– one sample from tumor tissue (experimental sample)
– one sample from normal tissue (reference sample).
• During the sample preparation process, operator differences and heterogeneous tissue can significantly contribute to the variability.
• However, these factors are not normally considered in microarray studies.
Nguyen, et al. 2002
Synthesis, labeling, and hybridization
http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html
Simplified summary of cDNA synthesis and labeling
• RNA is isolated from experimental and reference cell pools.
• Free nucleotides (dNTPs), oligo(dT), and reverse transcriptase are added to the solution of total RNA to initiate cDNA synthesis.
• Fluorescent dye molecules (labels) are incorporated into the cDNA.
• Typically, the cDNA from the experimental sample is dyed red (Cy5) and cDNA from the reference sample is dyed green (Cy3).
Nguyen, et al. 2002
Dye incorporation
• There are several methods for adding the labels to the cDNA:
– direct incorporation labeling method
– amino-modified (amino-allyl) nucleotide method
– primer tagging method
• The gene expression measurement is affected by the labeling method used.
Nguyen, et al. 2002
Direct Incorporation Method
• RNA is isolated from experimental and reference cell pools.
• The following ingredients are added to the solution of total RNA to initiate cDNA synthesis:
1. Oligo(dT)
2. Reverse transcriptase
3. Free nucleotides: dATP, dCTP, dGTP, and dTTP
4. Labeled uracil nucleotides: dUTPs with a dye molecule attached. Cy5-dUTP (red) is added to the experimental solution and Cy3-dUTP (green) is added to the reference solution.
• Experimental and reference solutions are mixed and hybridized onto the array.
Nguyen, et al. 2002
Amino Modified Nucleotide Method
• RNA is isolated from experimental and reference cell pools.
• The following ingredients are added to the solution of total RNA to initiate cDNA synthesis:
1. Oligo(dT)
2. Reverse transcriptase
3. Free nucleotides: dATP, dCTP, dGTP, and dTTP
4. Modified amino-allyl dUTPs are added to both experimental and reference solutions.
• After cDNA synthesis, Cy5 and Cy3 are added to the experimental and reference solutions, respectively. The dye couples with the amino-modified dUTP nucleotides.
• Experimental and reference solutions are mixed and hybridized onto the array.
Nguyen, et al. 2002
Primer Tagging Method
• RNA is isolated from experimental and reference cell pools.
• The following ingredients are added to the solution of total RNA to initiate cDNA synthesis:
1. Reverse transcriptase
2. Free (unlabeled and unmodified) nucleotides: dATP, dCTP, dGTP and dTTP
3. Oligo(dT) primer with capture sequence TTTT---- for experimental sample and capture sequence TTTT+++ for reference sample.
• After cDNA synthesis, experimental and reference solutions are mixed and hybridized onto the array.
• After washing, the array is incubated with Cy5- and Cy3-labeled molecules called dendrimers. The dendrimers attach to the corresponding capture sequences.
Nguyen, et al. 2002
• In all three methods, the red and green dyes may incorporate unequally. Likewise, spectral overlap can occur between the fluorescence of the two dyes.
• In the direct incorporation method, the Cy3-dUTP and Cy5-dUTP molecules exhibit some steric hindrance, which contributes to inefficient and nonuniform incorporation of the dye into the cDNA.
• In the amino-modified method, the amino-allyl is a smaller molecule with less steric hindrance, so the amino-modified dUTPs are incorporated into the cDNAs more uniformly and with higher frequency than in the direct incorporation method.
Comparing the dye incorporation methods
Nguyen, et al. 2002
• In both the direct incorporation and amino-modified methods, the abundance of labeled uracil nucleotides is influenced by the composition as well as the length of the cDNA strand.
• The resulting fluorescent intensity depends on the abundance of uracil nucleotides that were incorporated into the cDNA strand.
• The primer tagging method attempts to correct this problem by attaching one dendrimer to each cDNA strand. Each dendrimer contains approximately 250 fluorescent Cy5 or Cy3 molecules. Hence there is approximately one intensity signal per cDNA molecule.
Comparing the dye incorporation methods
Nguyen, et al. 2002
cDNA Microarray Experimental Procedure: Hybridization
• The solutions containing the experimental and reference labeled cDNAs are mixed and applied to the array, which contains the probe cDNAs in each spot.
• Target and probe sequences bind by base pairing (hybridization). Note that binding can occur between sequences that are similar but not identical (cross-hybridization).
• After sufficient time is allowed for hybridization, the array goes through a series of washes to eliminate all unbound target cDNAs and solution.
• The washing procedure must be stringent enough to remove all extraneous material but at the same time not remove the bound cDNAs—the signals of interest.
Nguyen, et al. 2002
cDNA Microarray Experimental Procedure: Concepts
• Consider one spot on the array. This spot contains cDNA probes for a gene of interest, say gene A.
• If there are target cDNAs in the mixed solution complementary to the probe cDNAs of gene A, they should bind together by base pairing (hybridization).
• If, for example, gene A is expressed in both the experimental and reference samples, we expect cDNA from both samples to bind with the probe cDNA, so this spot will show both red and green fluorescence.
Nguyen, et al. 2002
cDNA Microarray Experimental Procedure: Data Collection
• After the slides are prepared and the hybridization step is complete, the expression level of each gene is measured.
• The expression levels of a gene in the experimental or reference cells are measured by the spot intensities of the fluorescent dyes. We assume that a spot of high fluorescence indicates high expression of the corresponding gene.
• The array is scanned using a confocal laser microscope. Images of each spot on the array are produced, processed, and analyzed to measure the expression of each gene.
Nguyen, et al. 2002
cDNA Microarray Experimental Procedure
Nguyen, et al. 2002
cDNA Microarray Experimental Procedure – Image quality
• There are a number of factors that influence the image quality, such as noise fluorescence (fluorescence from non-dye sources), pollution of the fluorescent signal, photo-bleaching, etc.
• Raw data consists of two images, one image obtained from the red channel and one from the green channel.
• Which pixels in the target area represent signal, and which represent background? Pixels must be categorized as one or the other.
• A measurement is made of the intensity of the fluorescence for the spot and the intensity of the noise fluorescence from the background.
Nguyen, et al. 2002
cDNA Microarray – Quantifying gene expression
Cy5 (red) signal intensities: $I^R_{nm} = [x^R_{ij}]$
Cy5 (red) background intensities: $B^R_{nm} = [b^R_{ij}]$
for i = 1,…,n samples, j = 1,…,m genes (columns indexed by sample 1, 2, …, n; rows indexed by gene 1, 2, …, m)
Notation for Cy3 (green) intensities is the same with the exception of a G superscript.
cDNA Microarray – Quantifying gene expression
• Many analyses are based on background corrected intensities using the measurements
$r_{ij} = x^R_{ij} - b^R_{ij}$ and $g_{ij} = x^G_{ij} - b^G_{ij}$
• Intensity ratios are also used:
$r_{ij} / g_{ij}$
(Represents the abundance of gene expression in experimental sample relative to reference sample)
• Negative intensities (where the background is stronger than the signal) can occur; these issues will be discussed later.
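As a rough illustration of the quantities above, the sketch below computes background-corrected red and green intensities and their ratio with NumPy. The function name and the choice to return NaN wherever a corrected intensity is not positive are assumptions made for this example, not a method prescribed by the slides.

```python
import numpy as np


def background_corrected_ratio(x_r, b_r, x_g, b_g):
    """Return (r, g, ratio) with r = x_r - b_r, g = x_g - b_g, ratio = r / g.

    The ratio is set to NaN where r or g is not positive (background at
    least as strong as signal); this handling is an illustrative assumption.
    """
    x_r, b_r, x_g, b_g = (np.asarray(a, dtype=float) for a in (x_r, b_r, x_g, b_g))
    r = x_r - b_r
    g = x_g - b_g
    valid = (r > 0) & (g > 0)
    ratio = np.full(r.shape, np.nan)
    ratio[valid] = r[valid] / g[valid]
    return r, g, ratio


r, g, ratio = background_corrected_ratio(
    x_r=[1200, 300], b_r=[200, 350], x_g=[600, 400], b_g=[100, 100]
)
print(r, g, ratio)
```

In the example the second spot has red background above red signal, so its corrected intensity is negative and its ratio is reported as missing rather than as a negative number.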
cDNA Microarray Experimental Procedure
Multiple sources of variability: biological and technical
• The multiple experimental and biological processes in the cDNA microarray procedure each contribute to the overall variability.
• For example, a significant amount of variability may be introduced during the selection of the tissue samples.
• A significant amount of variability also arises during measurement and image processing.
• It is not easy to identify which portions of the microarray procedure are contributing most to the overall variability—and often times the variability at a given step of the procedure is poorly understood.
Recap: cDNA Array Data
[Figure: workflow from Tissue Sample to mRNA to Labelled cDNA to Microarray to Data, shown for nominal levels 1 and 2]
Affymetrix Chip Experimental Procedure
• Affymetrix is another type of commonly used microarray technology.
• Rather than attach the pre-synthesized cDNA probes to the chip, oligonucleotides are chemically synthesized (grown) directly on the chip.
• An oligonucleotide is a short fragment of DNA (usually in single-stranded form) which is often chemically synthesized. It can be used as a probe or primer. 'Oligo' is Greek for 'few'.
• The oligonucleotides are synthesized to match known gene sequences. The process of synthesizing the oligonucleotides conceptually resembles semi-conductor fabrication by using masks, light exposure, and deposition of nucleotides.
Affymetrix Chip Experimental Procedure
Affymetrix Chip Experimental Procedure
• Each gene or EST (expressed sequence tag) is represented on the array by 11-20 features. Each feature consists of an oligonucleotide that is a perfect match (PM) to a segment of a gene.
• For each PM, there is a corresponding oligo that is identical to the PM except for a single mismatch (MM) at the central base of the oligonucleotide.
Nguyen, et al. 2002
Affymetrix – Identifying gene expression
• Only one tissue sample is applied to each Affymetrix chip.
• The Affymetrix chip currently has 11 PM features for each gene. These 11 PM features serve as unique sequence detectors and the corresponding 11 MM features serve as controls.
• Under relatively ideal conditions, when the gene is expressed in the cell sample, high intensity is expected for the PM feature and low intensity for the MM feature.
• It is assumed that differences observed between the PM and MM feature intensities are due to hybridization kinetics of the different feature sequences and nonspecific background RNA hybridizations.
Nguyen, et al. 2002
Affymetrix – Sample labeling
• Affymetrix uses a one-color detection scheme (one sample on one array).
• Targets are biotin labeled cRNA, rather than cDNA.
• Double stranded cDNA is synthesized using RT. Then the targets, biotinylated cRNA, are synthesized using in vitro transcription.
• Biotinylated cRNA are cRNA that have biotin molecules attached to them.
• The biotinylated cRNA is fragmented (to reduce segment length) and hybridized to the array.
• The slide is washed and fluorescent dye is applied. The dye couples with the biotin on the cRNA.
Nguyen, et al. 2002
Recap: Affymetrix Chip Experimental Procedure
Affymetrix – Advantages over cDNA Microarrays
• With the one-dye system, unequal incorporation between dyes is not an issue—nor is spectral overlap between dyes.
• The reference sample is no longer needed. This reduces the required biological materials needed for experiments.
• The possibility of genomic DNA being labeled during RT is avoided because the modified biotinylated nucleotides are incorporated during IVT.
• Fragmentation of the target cRNAs ensures that most target lengths are within a reasonable range, thus avoiding target folding.
Nguyen, et al. 2002
Affymetrix – Image processing
• Affymetrix software uses a gridding procedure to locate the features on the array.
• Each feature consists of about 64 pixels.
• The features are scanned and the intensity value for a feature is computed as the 75th percentile of the intensities for the pixels in that feature (excluding the boundary pixels).
• Signal intensities are corrected for background noise intensities.
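The percentile rule for a feature's intensity can be sketched as follows. The 8-by-8 feature size (about 64 pixels) and the interpretation of "boundary pixels" as the outermost ring of the feature are assumptions for illustration; the function name is hypothetical and this is not Affymetrix's implementation.

```python
import numpy as np


def feature_intensity(feature_pixels, q=75):
    """75th percentile of a feature's pixels after dropping the outer ring.

    `feature_pixels` is a 2-D array of pixel intensities for one feature.
    """
    pixels = np.asarray(feature_pixels, dtype=float)
    interior = pixels[1:-1, 1:-1]  # exclude boundary pixels
    return float(np.percentile(interior, q))


# An 8 x 8 feature whose bright border would distort a plain percentile.
feature = np.full((8, 8), 1000.0)
feature[1:-1, 1:-1] = 50.0
print(feature_intensity(feature))
```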
Nguyen, et al. 2002
Affymetrix – Quantifying gene expression levels
• Recall that each gene is represented by k = 1,…,K features, each feature having a pair of perfect match and mismatch intensity measurements ( PMk , MMk ).
• Average Difference (AvDiff) method, for each gene:
1. Calculate the difference for each feature: dk = PMk - MMk
2. Remove all dk that lie more than 3 SD from the trimmed mean (trimmed mean calculated by excluding the largest and smallest dk)
3. Take the average of the remaining dk
• Most reported results from Affymetrix arrays are based on analyses that use AvDiff but with various ways to filter the outliers.
Nguyen, et al. 2002
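The three AvDiff steps can be sketched as below. The slides do not say which values the SD is computed from; this example assumes it is the sample SD of the trimmed differences (largest and smallest removed), and the function name `avdiff` is illustrative.

```python
import numpy as np


def avdiff(pm, mm, n_sd=3.0):
    """Average difference for one gene from its K (PM, MM) feature pairs.

    Assumes K >= 4 so that the trimmed set has at least two values.
    """
    d = np.asarray(pm, dtype=float) - np.asarray(mm, dtype=float)
    trimmed = np.sort(d)[1:-1]        # drop the smallest and largest d_k
    center = trimmed.mean()           # trimmed mean
    spread = trimmed.std(ddof=1)      # SD of trimmed values (assumption)
    keep = np.abs(d - center) <= n_sd * spread
    return float(d[keep].mean())


pm = [110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 5000]
mm = [10] * 11
print(avdiff(pm, mm))  # 145.0
```

In the example the feature with PM = 5000 is far outside three SDs of the trimmed mean, so it is excluded and the remaining ten differences (100 to 190) average to 145.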
Affymetrix – Quantifying gene expression levels
• As another way to measure gene expression, Efron et al. (2001) investigated
avg{dk = log(PMk) – c log(MMk), k = 1,…,K}
for various scale factors c.
• Yet another approach to measure gene expression:
$\mathrm{Signal} = \exp\left\{ \frac{1}{K} \sum_{k=1}^{K} w_k \log(PM_k - CT_k) \right\}$
where CTk is the change threshold and wk is a weight.
• Naef et al. note that the information content of MM features is not clear; they proposed expression indexes using only the PM features.
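Both expression measures on this slide are simple averages on the log scale, which the sketch below makes concrete. The function names, the default c = 0.5, and the assumption that the weights w_k and thresholds CT_k are supplied by the caller are all illustrative; the slides do not specify how w_k and CT_k are obtained.

```python
import numpy as np


def efron_log_measure(pm, mm, c=0.5):
    """avg{log(PM_k) - c * log(MM_k)} over the K features of one gene."""
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    return float(np.mean(np.log(pm) - c * np.log(mm)))


def weighted_log_signal(pm, ct, w):
    """exp{(1/K) * sum_k w_k * log(PM_k - CT_k)}; requires PM_k > CT_k."""
    pm = np.asarray(pm, dtype=float)
    ct = np.asarray(ct, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.exp(np.mean(w * np.log(pm - ct))))


print(efron_log_measure([800.0, 900.0, 1000.0], [200.0, 250.0, 300.0], c=0.5))
print(weighted_log_signal([800.0, 900.0, 1000.0], [200.0, 250.0, 300.0], [1.0, 1.0, 1.0]))
```

With all weights equal to 1, the second function reduces to the geometric mean of the differences PM_k - CT_k.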
Affymetrix – Quantifying gene expression levels
From Statistical Algorithms Reference Guide, Affymetrix, 2002:
“When the mismatch intensity is lower than the perfect match intensity, then the mismatch is informative and provides an estimate of stray signal. Rules are employed to ensure that negative signal values are not calculated. Negative values do not make physiological sense, and make further data processing, such as log transformations, difficult.”
PROCESSING IMAGE FILES TO GIVE ROBUST ESTIMATES OF
INTENSITY
cDNA Microarray – Quantifying gene expression
• Many analyses are based on background corrected intensities using the measurements
$r_{ij} = x^R_{ij} - b^R_{ij}$ and $g_{ij} = x^G_{ij} - b^G_{ij}$
• Intensity ratio is of interest:
$r_{ij} / g_{ij}$
• Represents the abundance of gene expression in experimental sample relative to reference sample
cDNA Microarray
http://www.beatson.gla.ac.uk/infrastructure.htm
Microarray technology. Figure showing part of a 10,000 element cDNA microarray hybridised with a cancer cell line RNA labelled with the fluorescent marker, Cy3 (green) and a control, labelled with Cy5 (red). This identifies genes with differential expression. (Dr N.I. Barr)
Considerations when obtaining data
One array: Assigning coordinates to each spot (and measuring distances between spots)
One spot A: Which pixels to use as signal and background?
One spot B: Given pixel information, how to assign a summary foreground and background value? How do you combine the two to give an intensity estimate?
Considerations when obtaining data
Addressing or gridding: Assign coordinates to each spot. Obtain foreground gridlines and background grid lines.
Segmentation: Classify pixels as foreground or background.
Intensity Extraction: Given pixel information, calculate summary measures for foreground and background.
Background Correction: Correct foreground for background to give estimate of intensity.
Addressing or gridding
[Figure: one grid, labelled 21 and 19]
Format of array is known:
• Distance between rows and columns of grids
• Translation of grid
• Distance between rows and columns of spots within each grid
• Translation of spots
• Overall position of array in the image
• Rotation of array
Addressing or gridding
• Addressing records all of this information.
• Important information for later use is the foreground grid lines (fgl) and background grid lines (bgl).
• fgl show spot locations
• bgl separate spots
Segmentation
Fixed Circle Segmentation: Fix diameter for every spot and draw circles of this diameter.
Adaptive Circle Segmentation: Allow diameters to change from spot to spot. Takes a long time.
Adaptive Shape Segmentation
Histogram Segmentation
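Fixed circle segmentation is the simplest of these to write down. The sketch below builds a circular mask of a given diameter around a spot center and averages the pixels inside it; the function names and the convention that a pixel belongs to the spot when its center lies within the circle are assumptions for illustration, not the rule used by any particular package.

```python
import numpy as np


def fixed_circle_mask(shape, center, diameter):
    """Boolean mask of pixels whose centers lie within the circle."""
    rows, cols = np.ogrid[: shape[0], : shape[1]]
    radius = diameter / 2.0
    return (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2


def spot_foreground_mean(image, center, diameter):
    """Mean intensity of the pixels inside the fixed circle, plus the mask."""
    image = np.asarray(image, dtype=float)
    mask = fixed_circle_mask(image.shape, center, diameter)
    return float(image[mask].mean()), mask


image = np.full((11, 11), 10.0)
image[fixed_circle_mask(image.shape, (5, 5), 5)] = 100.0
mean_fg, mask = spot_foreground_mean(image, center=(5, 5), diameter=5)
print(mean_fg, int(mask.sum()))
```

Adaptive circle segmentation would use the same mask function but estimate the diameter separately for every spot.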
Adaptive Shape Segmentation (SRG)
SRG = seeded region growing
1. Start with collection of background and foreground pixels (seeds).
2. Identify neighboring pixels for each collection and calculate the mean intensity for background neighbors and foreground neighbors.
3. Identify pixel in foreground neighbors that is closest to foreground mean and include it in collection. Do the same for background.
4. Continue.
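A minimal sketch of seeded region growing is given below. It follows the common formulation in which, at each step, the single unassigned pixel closest to the mean of a neighboring region is added to that region, which differs slightly from adding one pixel to each region per step as outlined above. The function name, the 4-neighbor connectivity, and the brute-force search (suitable only for small patches) are assumptions for illustration and not the implementation used in Spot.

```python
import numpy as np


def seeded_region_growing(image, fg_seeds, bg_seeds):
    """Label every pixel 1 (foreground) or 0 (background).

    `fg_seeds` and `bg_seeds` are non-empty lists of (row, col) seed pixels.
    """
    image = np.asarray(image, dtype=float)
    labels = np.full(image.shape, -1, dtype=int)  # -1 means unassigned
    for r, c in fg_seeds:
        labels[r, c] = 1
    for r, c in bg_seeds:
        labels[r, c] = 0
    n_rows, n_cols = image.shape
    while (labels == -1).any():
        # Current mean intensity of each region.
        means = {lab: image[labels == lab].mean() for lab in (0, 1)}
        best = None  # (distance to region mean, row, col, label)
        for r, c in zip(*np.nonzero(labels == -1)):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols and labels[rr, cc] != -1:
                    lab = labels[rr, cc]
                    dist = abs(image[r, c] - means[lab])
                    if best is None or dist < best[0]:
                        best = (dist, r, c, lab)
        _, r, c, lab = best
        labels[r, c] = lab  # grow the region whose mean is closest
    return labels


patch = np.full((5, 5), 10.0)
patch[1:4, 1:4] = 100.0  # a bright 3 x 3 spot on a dim background
print(seeded_region_growing(patch, fg_seeds=[(2, 2)], bg_seeds=[(0, 0), (4, 4)]))
```

Production implementations keep the candidate pixels in a priority queue instead of rescanning the whole image on every step.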
Selection of foreground and background seeds
http://stat-www.berkeley.edu/users/terry/zarray/Talks/image/image102000.ppt
• Foreground seeds are chosen by finding the maximum of the combined intensity surface over a small region centered within the square (single point within the square).
• Background seeds are constructed as crosses based on the fitted background grid.
Histogram Segmentation
• Target mask chosen larger than spot.
• Chen, et al.
• 8 random samples from patch.
• Lowest 8 from mask.
• Wilcoxon rank-sum (WRS) test compares the two sets of 8 values.
• Reject: signal defined to be the 8 values from mask and all pixels in mask with intensities ≥ smallest of the 8.
• Do not reject: repeat with some number of the 8 mask values replaced with pixels of higher intensity.
[Figure labels: target site, target mask, target patch]
Histogram Segmentation
• Plot histogram of pixels in mask:
– Average of pixels between the 5th and 20th percentiles used as background
– Average of pixels between the 80th and 95th percentiles used as foreground
• Advantage: Simple
• Disadvantage: Large mask might include other spots
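The histogram rule can be sketched with percentile bands, as below. The band limits (5th to 20th percentile for background, 80th to 95th for foreground) are taken from the slide, while the function name and the inclusive treatment of band edges are assumptions made for this example.

```python
import numpy as np


def histogram_segmentation(mask_pixels, bg_range=(5, 20), fg_range=(80, 95)):
    """Return (foreground, background) as means of two percentile bands."""
    pixels = np.asarray(mask_pixels, dtype=float).ravel()

    def band_mean(lo, hi):
        lo_value, hi_value = np.percentile(pixels, [lo, hi])
        in_band = (pixels >= lo_value) & (pixels <= hi_value)
        return float(pixels[in_band].mean())

    return band_mean(*fg_range), band_mean(*bg_range)


# A mask with a dim surround and a bright spot in the middle.
mask_pixels = np.full((11, 11), 20.0)
mask_pixels[3:8, 3:8] = 200.0
foreground, background = histogram_segmentation(mask_pixels)
print(foreground, background)
```

Because the summary depends only on the sorted pixel values, a mask large enough to include a neighboring spot will pull the foreground band toward that neighbor, which is the disadvantage noted above.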
Segmentation
Methods           Software
Fixed Circle      ScanAlyze, GenePix, QuantArray
Adaptive Circle   GenePix
Adaptive Shape    Spot, SRG & one other method
Histogram         QuantArray
Histogram method and Chen, et al. method give you summary measures of pixel intensities. Other methods simply divide the pixels into foreground and background.
Different background adjustment methods
http://stat-www.berkeley.edu/users/terry/zarray/Talks/image/image102000.ppt
• Region inside red circle represents the spot mask.
• Local background calculation by different methods:
Green: used in QuantArray;
Blue: used in ScanAlyze;
Pink: used in Spot.
Different background adjustment methods
• Histogram based techniques give foreground and background measures directly.
• Other methods simply divide information into foreground pixels and background pixels and you need to perform the calculations.
• Some give summary measures of foreground and background intensity.
Estimating background
ScanAlyze: Median of all pixels outside spot mask (circle) that are within square centered at spot center.
QuantArray: Median of all pixels in concentric circles outside of (and some distance from) spot mask.
GenePix: Median of all pixels in “valleys” surrounding spot mask.
Spot: Option implemented in GenePix and other options that utilize information (e.g. morphological opening) from the entire array to give spot specific adjustment.
Morphological opening:
• Image with spot intensities removed is estimated. This provides estimate of background for the entire slide.
• Background is value of this image at spot center.
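Morphological opening is an erosion (local minimum filter) followed by a dilation (local maximum filter) with the same structuring element. The sketch below uses a square window and plain NumPy (`sliding_window_view` requires NumPy 1.20 or later); the function names, the edge padding, and the restriction to odd window sizes are assumptions for illustration and not the implementation in Spot.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _window_filter(image, size, reducer):
    """Apply `reducer` (np.min or np.max) over size x size windows."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return reducer(windows, axis=(-2, -1))


def morphological_opening(image, size):
    """Erosion then dilation with a square window of odd side `size`."""
    image = np.asarray(image, dtype=float)
    eroded = _window_filter(image, size, np.min)
    return _window_filter(eroded, size, np.max)


# A 3 x 3 spot on a flat background is removed by a 5 x 5 window.
slide = np.full((15, 15), 10.0)
slide[6:9, 6:9] = 100.0
background_image = morphological_opening(slide, size=5)
print(background_image[7, 7])  # background estimate at the spot center
```

The window has to be larger than any spot for the spot to be removed, which is why the structuring element is chosen as a multiple of the spot-to-spot separation.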
Comparison of methods considers different background adjustments
• Methods considered are implemented in the packages. Broadly classified into 4 categories:
1. Local adjustment: use median intensity of pixels in region just outside spot mask
2. Morphological opening
3. Constant background
4. No adjustment
• Unless otherwise specified, spot foreground intensities are calculated by taking the mean intensity of the pixels within the spot mask.
Description of image analysis methods
QA.fix.nbg
Software: QuantArray.
Segmentation: Spot intensity is the mean of pixel values between the 45th and 85th percentiles within a fixed circle of 9 pixels in diameter.
Background: None.
QA.hist.nbg
Software: QuantArray.
Segmentation: Spot intensity is the mean of pixel values between the 80th and 95th percentiles of an 11-by-11 pixel square.
Background: None.
GP
Software: GenePix.
Segmentation: Proprietary algorithm that results in adaptively sized circles.
Background: Median from “valley of spot”.
http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html
Description of image analysis methods
SA
Software: ScanAlyze.
Segmentation: Fixed circles, 10 pixels in diameter.
Background: Median value in local square region.
S.morph
Software: Spot.
Segmentation: Seeded region growing.
Background: Based on morphological opening. The structuring element is a square region with sides of length 2.5 times the approximate spot to spot separation.
http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html
Comparing image analysis methods
Foreground intensities
Background intensities
http://stat-www.berkeley.edu/users/terry/zarray/Talks/image/image102000.ppt
Comparing image analysis methods
Some observations:
• Higher intensities show tighter correlation.
• Background estimates: low correlation implies very little useful information.
• SA local median has smallest variability followed by S.morph and GP. QA highly variable. Very high QA values come from concentric circle method and probably mean other spots are being included.
• S.morph lowers background.
Comparing image analysis methods
[Figure: panels (A) and (B), each plotting Foreground – Background against Background]
(A) Morphological background adjustment method of Spot (S.morph)
(B) QA.fix (Often gives increasing BG estimates)
Only values from the lower half of the foreground intensity distribution are displayed.
http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html
Data
8 AI knockouts (Cy5)
1 Reference (Cy3)
8 Normals (Cy5)
1 Reference (Cy3)
Ref: pooled cDNA from 8 normals
6384 probes
257 (~4%) known to be related to lipid metabolism
From Callow, et al. 2000, Dudoit, et al. 2002
Comparison of t-denominators of different methods
http://stat-www.berkeley.edu/users/terry/zarray/Talks/image/image102000.ppt
Comparison of the t-denominators (estimating between slide variability) for different image analysis methods in the apo AI experiment.
Comparison of t-values of different methods
http://stat-www.berkeley.edu/users/terry/zarray/Talks/image/image102000.ppt
Gap between p-values for 8 known DE genes largest for SA, S.morph, and S.const
S.nbg and S.valley
Comparing image analysis methods
• Morphological opening provides lower estimates of background than other methods.
─ M.O. estimates are less variable than other approaches.
─ Accuracy (assessed by finding known DE genes) was not compromised.
• In terms of finding DE genes,
Spot ScanAlyze
GenePix QuantArray
Comparing image analysis methods
• Choice of intensity estimation method has larger impact on log intensity ratios than segmentation method.
• Means or medians over large neighborhoods can be noisy.
• No background adjustment results in decreased ability to find DE genes.
• Recommend morphological opening method.*
* No comments on false positives or false negatives
Considerations following cDNA array intensity estimation
• Dye bias
• Print tip effects
• Spatial effects
• Array effects
Normalization attempts to minimize the effect of these systematic variations, making substantive differences easier to find.
Simple Problem
One array (A1) is brighter than a second array (A2) and you would like to compare the two.
[Figure: distributions of $\log_2(R_1/G_1)$ and $\log_2(R_2/G_2)$ for the two arrays]
Simple solution
• Scale intensities to have the same mean or median.
• Problems with this?
• Assume “shift effect” is constant across array.
• Doesn’t account for spatial effects.
Yang, et al., 2002
$\log_2 (R/G)^{*} = \log_2 (R/G) - c$
Methods?
$\log_2(R/G) \rightarrow \log_2(R/G) - c = \log_2\{R / (kG)\}$
Standard Practice (in most software)
c is a constant such that normalized log-ratios have zero mean or median.
Our Preference:
c is a function of overall spot intensity and print-tip-group.
What genes to use?
• All genes on the array
• Constantly expressed genes (house keeping)
• Controls
– Spiked controls (e.g. plant genes)
– Genomic DNA titration series
• Other set of genes
Within-slide normalization
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
M vs. A
$M = \log_2 (R/G)$
$A = \log_2 \sqrt{RG}$
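A minimal sketch of the M vs. A transformation, assuming background-corrected, strictly positive red and green intensities. The function name `ma_values` is illustrative only.

```python
import math

def ma_values(R, G):
    """Return (M, A) for paired red/green spot intensities."""
    # M = log2(R/G): the log-ratio for each spot
    M = [math.log2(r / g) for r, g in zip(R, G)]
    # A = log2(sqrt(R*G)): the average log-intensity for each spot
    A = [0.5 * math.log2(r * g) for r, g in zip(R, G)]
    return M, A
```

For example, R = 8 and G = 2 gives M = 2 and A = 2.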
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
• Assumption: Changes roughly symmetric
• First panel: smooth density of log2G and log2R.
• Second panel: M vs. A plot with median set to zero
Normalization - Median
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
Normalization – lowess
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
• Global lowess
• Assumption: changes roughly symmetric at all intensities.
Build the smooth function S(x) pointwise:
1. Take a point, x0. Find K nearest neighbors of x0 (N(x0)). The number of neighbors K is determined by user—specifies some percentage of the total number of points (they use 40%).
2. Calculate $\Delta(x_0) = \max_{x \in N(x_0)} |x_0 - x|$
3. Assign weights to N(x0) points.
4. Calculate weighted least squares fit of y on N(x0). Take $S(x_0) = \hat{y}_0$
5. Repeat . . .
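The pointwise construction of the smooth can be sketched as follows. This is a simplified illustration, not the lowess implementation used by the authors: tricube weights and a local straight-line fit are assumed (the slides only say "assign weights"), there are no robustness iterations, and the names `lowess_point` and `lowess_normalize` are hypothetical.

```python
def lowess_point(x, y, x0, frac=0.4):
    """Locally weighted straight-line fit evaluated at x0."""
    n = len(x)
    k = max(2, min(n, round(frac * n)))      # K = chosen fraction of the points
    # Step 1: the K nearest neighbours N(x0)
    nbrs = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
    # Step 2: largest distance from x0 within N(x0)
    delta = max(abs(x[i] - x0) for i in nbrs)
    # Step 3: tricube weights (an assumed, conventional choice)
    w = [(1 - (abs(x[i] - x0) / delta) ** 3) ** 3 if delta > 0 else 1.0
         for i in nbrs]
    sw = sum(w)
    if sw == 0:                              # degenerate neighbourhood
        return sum(y[i] for i in nbrs) / len(nbrs)
    # Step 4: weighted least squares line, evaluated at x0
    xb = sum(wi * x[i] for wi, i in zip(w, nbrs)) / sw
    yb = sum(wi * y[i] for wi, i in zip(w, nbrs)) / sw
    sxx = sum(wi * (x[i] - xb) ** 2 for wi, i in zip(w, nbrs))
    sxy = sum(wi * (x[i] - xb) * (y[i] - yb) for wi, i in zip(w, nbrs))
    slope = sxy / sxx if sxx > 0 else 0.0
    return yb + slope * (x0 - xb)

def lowess_normalize(A, M, frac=0.4):
    """Step 5: repeat at every point and subtract the fitted curve from M."""
    return [m - lowess_point(A, M, a, frac) for a, m in zip(A, M)]
```

With `frac=0.4` the neighbourhood matches the 40% span mentioned in step 1.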
Instead of
$\log_2 (R/G)^{*} = \log_2 (R/G) - c$
Use
$\log_2 (R/G)^{*} = \log_2 (R/G) - c(A)$
where c(·) is the smooth curve through the M-A plot (lowess fit to the M-A plot)
Recall:
$M = \log_2 (R/G)$
$A = \log_2 \sqrt{RG}$
Yang, et al.
Assumption: For every print group, changes roughly symmetric at all intensities.
Normalization – print-tip-group
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
M vs. A – after print-tip-group normalization
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
Within print-tip-group normalization is reasonable when:
1. Only a relatively small proportion of the genes will vary significantly in expression between the 2 mRNA samples
or
2. There is symmetry in the expression levels of the up/down regulated genes.
3. There is no correlation between groups of DE genes and print tips
• Consider location normalized intensities for print tip group i: $\log_2 (R/G)^{*} = \log_2 (R/G) - c_i(A)$
• Suppose $X \sim N(0, a_i^2 \sigma^2)$, where $\sigma^2$ is the variance of the true log-ratios and $a_i$ is the effect of the ith print-tip-group.
• Can get estimates of ai’s and adjust.
Assumptions:
• All print-tip-groups have the same spread.
• True ratio is $\mu_{ij}$, where i represents different print-tip-groups and j represents different spots.
• Observed is $M_{ij}$, where $M_{ij} = a_i \mu_{ij}$ and $\sum_{i=1}^{I} \log a_i^2 = 0$
• Robust estimate of $a_i$ is $\hat{a}_i = \mathrm{MAD}_i / \left( \prod_{i=1}^{I} \mathrm{MAD}_i \right)^{1/I}$
where $\mathrm{MAD}_i = \mathrm{median}_j \{ |M_{ij} - \mathrm{median}_j(M_{ij})| \}$
Taking scale into account
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
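A minimal sketch of the robust scale adjustment, assuming the M values have already been location normalized within each print-tip-group and that every group has a non-zero MAD. Dividing each group's MAD by the geometric mean of all MADs enforces the constraint that the log a_i² sum to zero. Function names are illustrative.

```python
import math
from statistics import median

def mad(values):
    """Median absolute deviation from the median."""
    med = median(values)
    return median(abs(v - med) for v in values)

def scale_factors(groups):
    """groups: one list of location-normalized M values per print-tip-group."""
    mads = [mad(g) for g in groups]
    # Geometric mean of the MADs, so that sum_i log(a_i^2) = 0
    geo = math.exp(sum(math.log(m) for m in mads) / len(mads))
    return [m / geo for m in mads]

def scale_normalize(groups):
    """Divide each group's M values by its estimated scale factor a_i."""
    a = scale_factors(groups)
    return [[m / ai for m in g] for g, ai in zip(groups, a)]
```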
Within print-tip-group box plots for print-tip-group normalized M
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
Effect of location + scale normalization
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
Problem
• If differences in scale were due largely to DE genes, adjusting for scale might mask your ability to find those genes.
• Again, if few genes are expected to be DE, this might not be an issue.
Alternative method
• $\log_2 (R/G)^{*} = \log_2 (R/G) - c_i(A)$
• where ci(·) is determined by both genes in the ith print-tip-group and other genes.
• “Composite” normalization uses MSP (titration series) genes.
• Could also use other housekeeping genes.
Comparing different normalization methods
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
Summary
• Print tip normalization works well under two assumptions:
1. MSP genes have minimal sample specific bias and can cover wide intensity range. Composite normalization necessary with divergent samples.
2. Adjusting for scale might compromise ability to find DE genes. Could have opposite effect (false positives).
• Kerr, et al. and Wolfinger, et al. perform only global normalization.
* maanova has extra normalization options
Recall Affy Measures of expression
• GeneChip® older software uses Avg.diff
$\mathrm{Avg.diff} = \frac{1}{|A|} \sum_{i \in A} (PM_i - MM_i)$
with A a set of suitable pairs chosen by software.
• Log PMi / MMi was also used.
http://stat-www.berkeley.edu/users/terry/Classes/s246.2002/Week16/week16.ppt
Affy Measures of expression
http://stat-www.berkeley.edu/users/terry/Classes/s246.2002/Week16/week16.ppt
GeneChip® newest version (MAS 5.0) uses something else, namely
$\mathrm{Signal} = \mathrm{TukeyBiweight}\{\log(PM_i - CT_i)\}$
with CT a version of MM that is never bigger than PM. Here TukeyBiweight can be regarded as a kind of robust/resistant mean.
Affy Measures of expression
Rules to determine CT (change threshold) for each probe pair:
1. MM < PM ⇒ CT = MM
2. MM ≥ PM :
A. If MM < PM for most probe pairs, an adjusted MM value is used based on bi-weight mean of the PM/MM ratio
B. If MM ≥ PM for most probe pairs, MM is replaced with a value that is “slightly smaller” than PM
$\log(\mathrm{Signal}) = \mathrm{TB}\{\log(PM_i - CT_i)\}$
$\mathrm{Signal} = \exp(\mathrm{TB}\{\log(PM_i - CT_i)\})$
Affy Measures of Expression
Determine weights:
• Calculate median of log(PM-CT) values across the probe set.
• Probe pair weights are determined by distance to median. Closer pairs get higher weights.
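The weighting described above can be sketched as a one-step Tukey biweight. This is an illustration rather than the Affymetrix code: the tuning constant `c = 5` and the small `eps` that guards against a zero spread are assumed defaults, and the input is taken to be the log(PM − CT) values of one probe set.

```python
from statistics import median

def tukey_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that downweights outliers."""
    m = median(x)                              # centre of the probe set
    s = median(abs(v - m) for v in x)          # spread (MAD about the median)
    u = [(v - m) / (c * s + eps) for v in x]   # scaled distance to the median
    # Closer pairs get higher weights; distant pairs get weight zero
    w = [(1 - ui * ui) ** 2 if abs(ui) < 1 else 0.0 for ui in u]
    return sum(wi * v for wi, v in zip(w, x)) / sum(w)
```

A single wild value, such as 1000 among values near 10, receives weight zero and does not move the estimate.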
Affymetrix Normalization and Scaling (pre MAS 5.0)
Global Normalization: Baseline array is chosen out of a set of arrays. Average Intensity* for this array is calculated. Intensities on any other (non-baseline) array A1 in the set are multiplied by normalization factor (NF) to make Average Intensity* of A1 equal to Average Intensity* of baseline array.
Global Scaling: Target intensity is chosen and each array in a set of interest is scaled by some factor (SF1, SF2, ..., SFN) to give Average Intensity* equal to target intensity.
Average Intensity*: Average of Average Difference values for every probe set except highest and lowest 2%.
Affymetrix – Quantifying gene expression levels
Li and Wong brought attention to the fact that AvDiff, as a measure of expression, has not been studied extensively. They proposed:
Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection
Cheng Li and Wing Hung Wong* PNAS, January 2001, 98:1, p. 31-36.
* Ph.D. student of Grace Wahba, University of Wisconsin-Madison, graduated in 1980
* Recipient of COPSS Prize
— Li and Wong, text, 2003
Model-based analysis of oligonucleotide arrays
Consider I array samples and one gene:
Goal: Estimate the abundance level of the gene in the I samples.
Data: There are 2×I×20 measurements used to obtain estimates (I×20 PMs and I×20 MMs).
θi : Denotes “expression index” for gene in the ith sample.
Assume: Measured intensity is proportional to θi and proportionality constant depends on probe (indexed by j).
What is PMij = βjθi ?
[Figure: 20 features (probe pairs) per array, for arrays i = 1, ..., I]
Li and Wong, 2001
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
For MM, denote the proportionality constant by αj
For PM, denote the proportionality constant by βj
$MM_{ij} = \nu_j + \theta_i \alpha_j + \epsilon_{ij}$
$PM_{ij} = \nu_j + \theta_i \beta_j + \epsilon_{ij}$
νj: baseline response for jth probe pair due to nonspecific hybridization
αj: rate of increase of MM response for jth probe
ϕj : additional rate of increase in PM response
$\beta_j = \alpha_j + \phi_j$: PM intensity increases at a higher rate than MM intensity (β > α)
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Fig. 1. Black curves are the PM and MM data of gene A in the first six arrays. Light curves are the fitted values to model 1. Probe pairs are labeled 1 to 20 on the horizontal axis.
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
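Recall the model:
$MM_{ij} = \nu_j + \theta_i \alpha_j + \epsilon_{ij}$
$PM_{ij} = \nu_j + \theta_i \beta_j + \epsilon_{ij}$
$\beta_j = \alpha_j + \phi_j$
Currently, there is a “strong preference” to base all computations on y = PM – MM for each probe pair. Subtracting the deterministic portions of the equations above gives:
$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$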
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
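Consider
$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$
with $\epsilon_{ij} \sim N(0, \sigma^2)$
Assume this identifiability constraint: $\sum_j \phi_j^2 = J$
Fix $\phi$ and fit for $\theta$ using least squares.
Fix $\theta$ at $\tilde{\theta}$ and fit for $\phi$ using least squares.
Iterate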
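The alternating fit can be sketched as below. This is a bare-bones illustration under stated assumptions: a fixed number of iterations instead of a convergence check, a starting value of all ones for ϕ, and no outlier handling. `fit_li_wong` is a hypothetical name, not a dChip function.

```python
import math

def fit_li_wong(y, n_iter=50):
    """y[i][j] = PM - MM for array i and probe j. Returns (theta, phi)."""
    I, J = len(y), len(y[0])
    phi = [1.0] * J                          # start satisfies sum(phi^2) = J

    def theta_given(phi):
        # Least squares for each array with phi held fixed
        sp = sum(p * p for p in phi)
        return [sum(y[i][j] * phi[j] for j in range(J)) / sp for i in range(I)]

    for _ in range(n_iter):
        theta = theta_given(phi)
        # Least squares for each probe with theta held fixed
        st = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(I)) / st for j in range(J)]
        # Rescale to restore the identifiability constraint
        scale = math.sqrt(J / sum(p * p for p in phi))
        phi = [p * scale for p in phi]
    return theta_given(phi), phi
```

On noise-free data of the form θi ϕj the iteration recovers θ and ϕ exactly once the constraint is imposed.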
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Fig 2. Black curves are the PM-MM difference data of gene A in the first six arrays. Light curves are the fitted values to model 2.
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Fig 3. Plots of residuals (y axis) versus fitted value (x axis) for additive model (A) and multiplicative model (B).
(A) $y_{ij} = \theta_i + \phi_j + \epsilon_{ij}$
(B) $y_{ij} = \theta_i \phi_j + \epsilon_{ij}$
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
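Consider model for one array:
$y_j = \theta \phi_j + \epsilon_j$
Suppose ϕ’s are obtained from many arrays. Treat them as known.
Given ϕ’s, the LS estimate for θ is
$\hat{\theta} = \frac{\sum_j y_j \phi_j}{\sum_j \phi_j^2} = \frac{\sum_j y_j \phi_j}{J}$
$E[\hat{\theta}] = \frac{1}{J} \sum_j \phi_j E[y_j] = \frac{1}{J} \sum_j \phi_j^2 \theta = \theta$
$\mathrm{Var}[\hat{\theta}] = \frac{1}{J^2} \sum_j \phi_j^2 \mathrm{Var}[y_j] = \frac{\sigma^2}{J}$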
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
• Regarding θ’s as fixed, one can proceed similarly to get estimates and standard errors for ϕ.
• Note: A conditional analysis is done here. This assumes certain effects are known.
• In practice, the effects are estimated. The uncertainty in this estimation is not considered when computing standard errors.
• What is $\hat{\sigma}^2$?
$\mathrm{SE}[\hat{\theta}] = \sqrt{\hat{\sigma}^2 / J}$, with $\hat{\sigma}^2 = \frac{1}{J-1} \sum_j (y_j - \hat{\theta} \phi_j)^2$
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
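Recall: θi denotes “expression index” for gene in the ith sample.
Question: Given $\hat{\theta}$’s and SE[$\hat{\theta}$]’s, how would you use them?
Recall, for one array:
$y_j = \theta \phi_j + \epsilon_j$
After fitting the model you would have:
$\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_I$
$\mathrm{SE}[\hat{\theta}_1], \mathrm{SE}[\hat{\theta}_2], \ldots, \mathrm{SE}[\hat{\theta}_I]$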
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Fig 4. (A) Six arrays of probe set 1,248. (B) Plot of standard error (SE, y axis) vs. θ. The probe pattern (black curve) of array 4 is inconsistent with other arrays, leading to unsatisfactory fitted curve (light) and large standard errors of θ4.
Black curves are PM-MM data. Light curves are fitted model.
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Recall: ϕj denotes the additional rate of increase (in excess of the MM rate) in PM intensity for probe j.
Question: Given $\hat{\phi}$’s and SE[$\hat{\phi}$]’s, how would you use them?
Recall, for one array:
$y_j = \theta \phi_j + \epsilon_j$
After fitting the model you would have:
$\hat{\phi}_1, \hat{\phi}_2, \ldots, \hat{\phi}_{20}$
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Fig 6. (A) Probe 17 of probe set 1,222 is not concordant with other probes (black arrows) and is numerically identified by the outstanding standard error of ϕ17. (B) Plot of standard error (SE, y axis) vs. ϕ.
Black curves are PM-MM data. Light curves are fitted model.
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Fig 7. (A) Probe set 3,562 has a single high-leverage probe 12, and the fitted light curves almost coincide with the black data curve. (B) ϕ12 is large compared with the other ϕ’s, which are close to zero. Note that Affymetrix’s superscoring method works here by consistently excluding this probe.
Black curves are PM-MM data. Light curves are fitted model.
Model-based analysis of oligonucleotide arrays
Li and Wong, 2001
Li and Wong note that
“the MM responses do contain information on the expression index, and that this information can only be recovered by analyzing the PM and MM responses separately.”
Processing Probe Level Data
• A number of expression summary measures are obtained using PM and MM probes intensities.
• Recent results suggest that MM may be detecting signal along with PM.
• If this is the case, using MM could introduce noise and give biased estimates of the nominal expression level.
PM and MM values for 20 probes from 12 spike-in arrays from varying concentration experiment plotted vs. concentration
MM May Be Tracking Signal
• Some researchers suggested to use other MM sequences in order to alleviate this tracking.
• MM could be created by changing more than one base in PM sequence and by placing MM bases in different positions in the MM sequence (Nimblegen chips).
MM May Be Tracking Signal - What to do ?
• Other researchers suggest only using PM (Robust Multiarray Average)
• This approach would allow space currently used for MM to be used for other PM, thus allowing for twice as many sequences of interest to be printed onto an array.
MM May Be Tracking Signal - What to do ?
• Bolstad, et al., Bioinformatics, 2003
• Irizarry, et al., Biostatistics, 2003
• Irizarry, et al., text, 2003
• Irizarry, et al., NAR, 2003
Robust Multi-array Average (RMA)
http://stat-www.berkeley.edu/users/terry/Classes/s246.2002/Week16/week16.ppt
OVERVIEW
Uses only PM (ignores MM)
• Adjust for background on the raw intensity scale
• Take log2 of background adjusted PM
• Carry out quantile normalization of log2(PM-BG), with chips in suitable sets
• Conduct a robust multi-array analysis (RMA) of the quantities
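Quantile normalization, used in the third step above, forces every array to share the same empirical distribution. A minimal sketch, assuming each array is a list of background-adjusted log2 PM values of equal length and that ties are broken by position (a simplification); the function name is illustrative and this is not the Bioconductor implementation.

```python
def quantile_normalize(columns):
    """columns: one list of values per array, all of the same length."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays
    ref = [sum(sc[k] for sc in sorted_cols) / len(columns) for k in range(n)]
    out = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        norm = [0.0] * n
        for rank, i in enumerate(order):
            norm[i] = ref[rank]      # replace each value by the reference quantile
        out.append(norm)
    return out
```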
RMA: Measures of Expression
Robust Multi-array Average (RMA)
RMA
Background correct, normalize, and log2 the PM intensities. Call this transformation T.
ei = log2 expression on ith array
aj = log2 probe effect for probe j
$T(PM_{ij}) = e_i + a_j + \epsilon_{ij}$
$\epsilon_{ij} \sim N(0, \sigma^2)$
Robust Multi-array Average (RMA)
Recall dChip:
$PM_{ij} = \nu_j + \theta_i \beta_j + \epsilon_{ij}$
$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$
RMA
$T(PM_{ij}) = e_i + a_j + \epsilon_{ij}$
NAR 2003:
“[Our model] is quite different from the additive model in PM-MM that was found unsatisfactory in Li and Wong, most likely because of the very strong mean variance dependence that would be present in such an additive model.”
Why we take log2
http://biosun01.biostat.jhsph.edu/~ririzarr/Talks/nci-2002.ppt.gz
Dilution Study (www.genelogic.com)
• Tissues: LIVER and CNS
• cRNA amounts: 1.25, 2.5, 5, 7.5, 10, 20 μg
• 5 reps at each amount, 30 arrays per tissue
• 12,626 genes
Comparing RMA and MAS 5.0
• Precision of expression estimates (estimated by SD of replicate arrays)
• Consistency of fold change estimates
• Specificity and sensitivity (different methods used to assess DE genes)
• Normalization is done within replicate groups. The assumption that most genes do not change across non-replicate groups does not hold here.
(note that two different normalization methods were used: quantile for RMA and affy.scale.value for MAS)
• Expression measures for RMA and MAS 5.0 Signal were estimated using the rma and expresso functions of the Bioconductor package affy
Comparing RMA and MAS 5.0 – Normalization
RMA
MAS 5.0
• Squared correlation coefficient across replicates was calculated over all 120 pairs of replicates (10 per group of replicates)
        RMA      MAS 5.0 Signal
R2      0.9947   0.9917
Strong probe affinity implies R2 ≈ 1.
• The difference was significant (p-value 1.152560e-07)
Methods and Results
• SD across replicate arrays were computed for all genes.
• LOESS curves were fitted to scatter plot of SD versus mean expressed values.
Methods and Results
[Figure: loess curves of SD across replicates (y axis) vs. expression (x axis) for all genes, MAS 5.0 and RMA measures]
• For one gene, fit line to expression estimate vs. concentration on the log-log scale. Then calculate the “Average lines” (average $\hat{\beta}$’s across genes).
• Since every fold increase in concentration should have the same fold increase in expression measure, a line fitted on log-log scale should have slope 1.
Consistency of fold change
$\hat{\beta}_{\mathrm{MAS\ 5.0}} = 0.65$
$\hat{\beta}_{\mathrm{RMA}} = 0.67$
• Consistency of fold change was examined by comparing fold change estimates between arrays with different concentrations of target mRNA.
• Slopes over all genes for two different conditions of average expression versus concentration were calculated and on average were:
                RMA    MAS Signal 5.0
liver tissue    0.53   0.53
CNS samples     0.56   0.59
Consistency of fold change
• Fold change between CNS and Liver tissue were
estimated for all genes using 10 arrays in the lowest and 10 arrays in highest concentration group.
• Number of genes showing inconsistency of fold change estimate by at least 2-fold;
RMA 23
MAS 5.0 81
Consistency of fold change
Irizarry, et al., 2003
RMA: fold change estimate for 20 μg vs. fold change estimate for 1.25 μg
MAS 5.0: fold change estimate for 20 μg vs. fold change estimate for 1.25 μg
Irizarry, et al., 2003
• In general it appears that RMA has better
precision and similar accuracy as MAS Signal 5.0.
• RMA had slightly better consistency of fold change estimate.
Conclusions
• 11 control cRNA’s spiked in at different concentrations on each array. Other genes should be same across arrays.
• Choose 10 pairs of arrays from spike-in experiment.
• Compute FC for each gene under RMA, dChip, MAS 5.0.
• For some cut-off C, compute proportion of non-spiked genes where FC > C (false positives) and proportion of spiked genes where FC > C (true positives).
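One point on the resulting ROC curve can be sketched as follows, assuming a list of fold changes and a matching list of flags marking the spiked-in genes. `roc_point` is an illustrative name; sweeping the cut-off C traces out the curve.

```python
def roc_point(fc, is_spiked, cutoff):
    """Return (false positive proportion, true positive proportion) at one cut-off."""
    spiked = [f for f, s in zip(fc, is_spiked) if s]
    other = [f for f, s in zip(fc, is_spiked) if not s]
    tp = sum(f > cutoff for f in spiked) / len(spiked)   # spiked genes with FC > C
    fp = sum(f > cutoff for f in other) / len(other)     # non-spiked genes with FC > C
    return fp, tp
```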
Specificity and sensitivity – Spike-in data
Irizarry, et al., 2003
Fold change for Affymetrix Spike-in experiment
Irizarry, et al., 2003
Test Statistic for Affymetrix Spike-in experiment
http://www.bioconductor.org/workshops/JAX02/jax-B.pdf
• Overall RMA does better than Li & Wong (dChip),
which in turn does better than MAS 5.0 using FC.
• The simple t = est log FC / SE(est log FC) seems best for use with MAS and RMA.
• MAS looks bad here because we use single chip summaries in our analysis. They need a multi-chip version of their Signal Log Ratio. When done, it will look like the final step in RMA.
• With RMA and Li & Wong, nominal SEs are not as good as observed ones and p-values are better than (log) fold change.
Conclusions from replicate chip ROC curves
Figure 5. Box plots showing the distribution of observed fold changes for non-spiked in genes. The different colors represent the different quantiles. The relationship of color and quantile is demonstrated in the first box from the left.
Log Fold Change of Non-Differentially-Expressed Genes
Irizarry, et al., 2003
Conclusions from single chip comparison ROC curves
http://www.bioconductor.org/workshops/JAX02/jax-B.pdf
• On the basis of the data just presented, and much more:
• With FC, RMA is best, LW (Li & Wong) next. MAS does not do well here.
• With p-values, RMA is as good as, and usually better than MAS, which is next. MAS does best on Affymetrix spike-in data sets. LW (dChip) does not do so well here.
• All judgments are comparative. Everyone does well in absolute terms, but some do better.
• In general it appears that RMA has better precision and similar accuracy as MAS Signal 5.0.
• RMA had slightly better consistency of fold change estimate.
More Conclusions
Comment on MM
Irizarry, et al., 2003
NAR, 2003:
“It is possible that information about non-specific binding is contained in the MM values, but empirical results demonstrate that mathematical subtraction does not translate to biological subtraction. We have found that, until a better solution is proposed, simply ignoring these values is preferable.”
METHODS TO IDENTIFY DE GENES
Mult-t
Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments
by Dudoit, Yang, Callow, and Speed
Statistica Sinica 12 (2002), 111-139.
**Additional Details in Parmigiani et al., 2003.
Mult-t : Outline
• Data: AI Knockout & SRBI transgenic mice. AI, SRBI are two genes involved in HDL metabolism.
• Image: Segmentation and background correction (Yang et al.).
• Normalization: Spatial and intensity dependent effects.
• Gene summary: Construction of t-statistic for each gene. Evaluation of the statistic at a gene uses only data at that gene.
• Hypothesis test at each gene (accounts for multiple tests).
Mult-t : Normalization
Use lowess ( ) to identify curves through points grouped by print tips.
(log2 R/G)’ = (log2 R/G) - cj(A)
cj(A) is lowess ( ) fit to M vs. A for print tip j.
Mult-t : Gene Specific Summary
Compute Welch t-statistic for every gene
$t_j = \frac{\bar{x}_{j2} - \bar{x}_{j1}}{\sqrt{s_{j1}^2 / n_1 + s_{j2}^2 / n_2}}$
• Tj and tj: Random variable and realization of random variable for every gene j.
• Hj0: j th null is true.
• Hj1: j th null is false.
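A minimal sketch of the gene-specific summary, assuming `x1` and `x2` hold the normalized log-ratios of one gene in the two groups and that the statistic is taken as group 2 minus group 1 (the sign convention is an assumption and does not affect two-sided tests).

```python
import math
from statistics import mean, variance

def welch_t(x1, x2):
    """Welch two-sample t-statistic for one gene (unequal variances allowed)."""
    n1, n2 = len(x1), len(x2)
    # variance() is the sample variance with an n - 1 denominator
    se = math.sqrt(variance(x1) / n1 + variance(x2) / n2)
    return (mean(x2) - mean(x1)) / se
```

Applying `welch_t` to each row of the genes-by-samples matrix gives t1, ..., tm.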
Mult-t : Hypothesis tests
Given evaluated test statistics, which are unusually large in magnitude ?
Informal assessment:
QQ plots
MA plots
Other (numerator vs. denominator of t-stat)
More precise assessment:
For Hj, $p_j = P(|T_j| > |t_j| \mid H_{j0})$, and determine how small pj should be so that you reject, given that many (m) tests are done.
Mult-t : Hypothesis tests
[Figure: m × n data matrix, genes (rows 1, 2, ..., m) by samples (columns 1, 2, ..., n), with entry Xji]
j = 1, 2,…, m genes (6384)
i = 1, 2, …, n samples (16); n1 + n2 = n (n1 = n2= 8)
Xji = log2 (Rji/Gji) is relative (transformed, normalized, and background corrected) expression level for jth gene on ith array.
Mult-t : Hypothesis tests
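$H_1: \mu_{11} = \mu_{12}$
$H_2: \mu_{21} = \mu_{22}$
⋮
$H_m: \mu_{m1} = \mu_{m2}$
[Figure: each row (gene) j of the genes × samples data matrix yields a test statistic tj for hypothesis Hj, j = 1, ..., m]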
Determine distribution of test statistics under null
For n reasonably large,
$T \sim t_\nu$
Mult-t : Determine distribution of test statistics under null
Since n is generally not large in microarray experiments, build up distribution of test statistics under the null via permutation.
[Figure: the sample labels (columns 1, ..., n1, n1+1, ..., n) are permuted B times; permutation b gives statistics t1b, t2b, ..., tmb, forming an m × B matrix]
$p_j^{*} = \frac{1}{B} \sum_{b=1}^{B} I(|t_{j,b}| \ge |t_j|)$
Mult-t : Notes on Permutations
Computationally, getting the distribution of test statistic via permutations is reasonable. Getting the distribution of the p-values might not be.
If you have, say, 6 samples total (3 in each group), what’s the smallest p-value you could obtain via permutations ?
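For small designs every relabeling can be enumerated, which makes the granularity of permutation p-values concrete. The sketch below assumes a Welch t-statistic per gene and counts a relabeling as extreme when its |t| is at least the observed |t|; function names are illustrative. With 3 samples per group there are only C(6,3) = 20 relabelings, and a relabeling and its mirror image give the same |t|, so the two-sided p-value cannot fall below 2/20 = 0.1.

```python
import math
from itertools import combinations
from statistics import mean, variance

def welch_t(x1, x2):
    se = math.sqrt(variance(x1) / len(x1) + variance(x2) / len(x2))
    return (mean(x2) - mean(x1)) / se

def permutation_pvalue(x, n1):
    """x: all n values for one gene, the first n1 belonging to group 1."""
    n = len(x)
    t_obs = abs(welch_t(x[:n1], x[n1:]))
    count = total = 0
    # Enumerate every way of assigning n1 of the n samples to group 1
    for idx in combinations(range(n), n1):
        chosen = set(idx)
        g1 = [x[i] for i in idx]
        g2 = [x[i] for i in range(n) if i not in chosen]
        total += 1
        if abs(welch_t(g1, g2)) >= t_obs - 1e-12:   # tolerance for rounding
            count += 1
    return count / total
```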
Mult-t : Adjusting for multiple tests
• Family Wise Error Rate (FWER): probability of at least one type I error for all tests considered.
• Goal: Control FWER
Strong Control: Control for any combination of true and false nulls.
Weak Control: Control for the complete null (all nulls true).
Mult-t : Adjusting for multiple tests
• Procedures to control FWER:
Bonferroni:
Reject Hj if $p_j < \alpha / m$
$p_j^{*} = \min(m p_j, 1)$
Sidak:
$p_j^{*} = 1 - (1 - p_j)^m$
Westfall & Young:
pj*’s obtained via reordering of permutation matrix.
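The two single-step adjustments are one-liners. A minimal sketch, assuming `p` is the list of unadjusted p-values for the m genes:

```python
def bonferroni(p):
    """Bonferroni adjusted p-values: min(m * p_j, 1)."""
    m = len(p)
    return [min(m * pj, 1.0) for pj in p]

def sidak(p):
    """Sidak adjusted p-values: 1 - (1 - p_j)^m."""
    m = len(p)
    return [1.0 - (1.0 - pj) ** m for pj in p]
```

Rejecting Hj when the adjusted value is below α is equivalent to comparing the raw pj with the per-test threshold.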
Mult-t : Westfall and Young’s Procedure
Order observed t-statistics: $|t_{r_m}| \le |t_{r_{m-1}}| \le \dots \le |t_{r_2}| \le |t_{r_1}|$
[Figure: the m × B matrix of permutation statistics t_{j,b} is reordered into an m × B matrix of successive maxima u_{j,b}]
$u_{m,b} = |t_{r_m,b}|$
$u_{m-1,b} = \max(u_{m,b}, |t_{r_{m-1},b}|)$
⋮
$u_{1,b} = \max(u_{2,b}, |t_{r_1,b}|)$
Mult-t : Westfall and Young’s Step Down Max T Procedure
Order observed t-statistics: $|t_{r_m}| \le |t_{r_{m-1}}| \le \dots \le |t_{r_2}| \le |t_{r_1}|$
[Figure: the m × B matrix of permutation statistics t_{j,b} is reordered into an m × B matrix of successive maxima u_{j,b}]
$p_{r_j}^{*} = \frac{1}{B} \sum_{b=1}^{B} I(u_{j,b} \ge |t_{r_j}|)$
...and enforce monotonicity
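The step-down max T adjustment can be sketched as below. This is an illustration, not the implementation used by Dudoit et al.: it assumes the permutation statistics are already available as an m × B table, counts a permutation when the successive maximum is at least the observed |t|, and uses a hypothetical function name.

```python
def maxt_adjusted(t_obs, t_perm):
    """t_obs[j]: observed statistic for gene j; t_perm[j][b]: gene j, permutation b."""
    m, B = len(t_obs), len(t_perm[0])
    # r_1, ..., r_m: genes ordered from largest to smallest observed |t|
    order = sorted(range(m), key=lambda j: -abs(t_obs[j]))
    counts = [0] * m
    for b in range(B):
        u = 0.0
        # Successive maxima, starting from the gene with the smallest |t|
        for k in range(m - 1, -1, -1):
            u = max(u, abs(t_perm[order[k]][b]))
            if u >= abs(t_obs[order[k]]):
                counts[k] += 1
    adj = [c / B for c in counts]
    # Enforce monotonicity down the ordered list
    for k in range(1, m):
        adj[k] = max(adj[k], adj[k - 1])
    out = [0.0] * m
    for k, j in enumerate(order):
        out[j] = adj[k]
    return out
```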
Mult-t : Westfall and Young’s Step Down Max T Procedure
• Less conservative than Bonferroni, Sidak, Holm’s
• Provides Strong Control of FWER
• Max T = Min P when the t-statistics are identically distributed. Generally, this is not the case; and, again, the minP algorithm is more computationally intensive.
Mult-t : Data
Experiment 1
8 AI Knock outs (Cy 5)
1 Reference (Cy 3)
8 Normals (Cy 5)
1 Reference (Cy3)
Experiment 2
8 SRBI Transgenics (Cy 5)
1 Reference (Cy 3)
8 Normals (Cy 5)
1 Reference (Cy 3)
6384 probes
257 (~4%) related to lipid metabolism
Reference: Pooled cDNA from 8 normals
Q: What does this mean for permutation tests ?
Mult-t : Histogram and QQ plot of t-statistics
Mult-t : Max T adjusted and unadjusted p-values
Comments on Mult-t
• Welch’s t-statistic is used. Welch proposed solution to Behrens-Fisher problem. Implicit assumptions guide choice of the test statistic even though “no assumptions are made regarding distribution of the test statistics”.
• Permutations are advantageous for a number of reasons, but do not provide useful results when sample sizes are small.
• Permutation test not valid for this experimental design.
• Method is compared with single slide methods !
• Page 132, Newton et al. ... false positives... Consider definition of a false positive here !
Mult-t : Comparison of methods
SRBI data. Newton et al (orange); Chen et al. (purple)
Analysis of Variance for Microarrays (ANOVA)
Analysis of Variance for Gene Expression Microarray Data
by Kerr, Martin, and Churchill
Journal of Computational Biology 7: 819-837, 2000.
Bootstrapping Cluster Analysis: Assessing the Reliability of Conclusions from Microarray Experiments
by Kerr and Churchill
PNAS 98 (16): 8961 - 8965, 2001.
**Additional Details in Parmigiani et al., 2003.
ANOVA : Outline
• Data: Human liver and human muscle tissue hybridized to two cDNA arrays. Final data set had 1286 spots.
• Normalization via terms in ANOVA model (“global analysis”)
• Gene summary: Construction of statistic for each gene. Evaluation of the statistic uses data from that gene (“local analysis”).
• Hypothesis test at each gene (uses bootstrap; does not account for multiple tests).
ANOVA: Model Development
[Figure: 2 × 2 layout of arrays (1, 2) by dyes (1, 2), with Liver and Muscle cDNA each appearing once on every array]
A; i indexes array ( i=1,2 )
D; j indexes dye ( j=1,2 )
V; k indexes variety ( k = 1,2 )
G; g indexes gene (g = 1, 2, ..., N = 1286)
ANOVA : Model
log(yijkg) = μ + Ai + Dj + Vk + Gg + (AG)ig + (VG)kg + eijkg
μ - overall average signal (*)
A - array (*)
D - dye (*)
V - variety (i.e., condition or tissue)
G - gene
AG - array by gene interaction (spot effect)
VG - variety by gene interaction (DE if VG1g ≠ VG2g)
* Normalization
ANOVA : Gene Specific Summary
Obtain parameter estimates via least squares
$\widehat{VG}_{1g} - \widehat{VG}_{2g}$ is of most interest when goal is to identify DE genes.
Source            df     SS        MS
Array             1      92.34     92.34
Dye               1      0.74      0.74
Variety           1      2.97      2.97
Gene              1285   1885.89   1.47
AG                1285   160.01    0.12
VG                1285   1357.28   1.06
Residual          1285   82.75     0.0644
Corrected Total   5143   3581.99
(Table 3, page 23)
ANOVA : Hypothesis tests
Given evaluated test statistics, which are unusually large in magnitude ?
Informal assessment:
Plots of $\widehat{VG}_{1g} - \widehat{VG}_{2g}$
More precise assessment:
Bootstrap to obtain confidence intervals for VG1-VG2.
ANOVA : Bootstrap to Identify DE genes
Calculate Residuals
$\hat{\epsilon}_{ijkg} = \log(y_{ijkg}) - \widehat{\log(y_{ijkg})}$; distribution of residuals => f
Sample from C·f to get $\epsilon^{*}$
The scale factor C ensures that the empirical distribution has variance equal to that of the true residuals (Wu, 1986, Annals of Statistics).
Simulate Data
$\log(y_{ijkg})^{*} = \widehat{\log(y_{ijkg})} + \epsilon^{*}_{ijkg}$
Fit the model to simulated data and calculate $\widehat{VG}^{*}_{1g} - \widehat{VG}^{*}_{2g}$
ANOVA : Comments on ANOVA Approach
No adjustments for multiple tests !
The authors state “this may or may not be necessary based on the intended purpose of the analysis” (page 8).
ANOVA modelling framework provides a method of normalization* by accounting for array, dye, gene, ... effects.
Residual distribution on log scale is non-normal, but constant error variance assumption is not grossly violated.
Significance Analysis of Microarrays (SAM)
Significance Analysis of Microarrays Applied to the Ionizing Radiation Response
by Tusher, Tibshirani, and Chu
PNAS 98 (9): 5116-5121, 2001.
**Additional Details in Parmigiani et al., 2003.
SAM : Outline
• Data: 2 wild type human lymphoblastoid cell lines (1,2) harvested in unirradiated or irradiated (U,I) state 4 hours after treatment. RNA samples were labelled and divided into two identical aliquots (A,B) prior to hybridization onto Affy chips. (U1A, U1B, U2A, U2B, I1A, I1B, I2A, I2B).
• Normalization via reference set obtained from average of intensity values across subsets of arrays.
• Gene summary: Construction of statistic for each gene. Evaluation of the statistic uses data from the entire array.
• Hypothesis test at each gene (accounts for multiple tests).
SAM : Normalization
• Generate reference set by averaging each gene expression level across the 8 hybridizations.
• Cube root scatter plot intensity values from each data set against reference (this handles negatives and Tusher et al. report that it resolved vast majority of lowly expressed genes).
• A linear least squares fit to the cube root scatter plot is used to calibrate each hybridization.
SAM : Post Normalization
SAM : Gene Specific Summary
The relative difference measure d j for gene j:
$d_j = \frac{\bar{x}_{j1} - \bar{x}_{j2}}{s_j + s_0}$
To ensure that the variance of d j is independent of gene expression, s0 (a small positive constant) is added to the denominator.
PNAS manuscript: The coefficient of variation of d j was computed as a function of s j in moving windows across the data and s0 was chosen to minimize the CV.
Parmigiani et al. text: “adaptively chosen”. Taken as median of all s (i).
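A minimal sketch of the relative difference for one gene. The slides do not define sj; the pooled standard error used here follows the definition in Tusher et al. as best recalled and should be treated as an assumption, as should the group 1 minus group 2 sign convention. `s0` is supplied by the caller.

```python
import math
from statistics import mean

def sam_d(x1, x2, s0):
    """Relative difference d_j = (mean1 - mean2) / (s_j + s0) for one gene."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    # Pooled within-group sum of squares
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    # Gene-specific scatter s_j (assumed pooled standard error)
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)
```

With `s0 = 0` this reduces to an ordinary pooled-variance t-statistic; a positive `s0` shrinks d for genes whose sj is very small.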
SAM Procedure to Identify DE Genes
[Figure: the 8 samples (columns 1, ..., 8) are relabeled in B = 36 permutations; permutation b gives relative differences d1b, d2b, ..., dmb, forming an m × B matrix]
To minimize potentially confounding effects between the two cell lines, they analyzed data by using 36 balanced permutations. A permutation is considered balanced for cell lines 1 and 2 if each group of 4 experiments contained two experiments from cell line 1 and 2 from cell line 2.
SAM Procedure to Identify DE Genes
[Figure: each column of the m × B (B = 36) matrix of permuted statistics djb is ordered, giving d(1)b, d(2)b, ..., d(m)b]
$d_{E,j} = \frac{1}{36} \sum_{b=1}^{36} d_{(j)b}$
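The expected relative differences can be sketched as follows, assuming `d_perm[b]` holds the m relative differences computed under permutation b. Sorting here is ascending; if d(1) is taken to be the largest value, the same averaging applies with the order reversed.

```python
def expected_d(d_perm):
    """Average the j-th ordered statistic over all permutations."""
    B = len(d_perm)
    ordered = [sorted(col) for col in d_perm]     # order each permutation's column
    m = len(ordered[0])
    return [sum(col[j] for col in ordered) / B for j in range(m)]
```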
SAM Procedure to Identify DE Genes
Plot observed, ordered, d ( j ) against d E,j
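[Figure: observed d(j) (y axis) vs. dE,j (x axis) with a band around the identity line; genes beyond the upper cutoff u or below the lower cutoff l are called DE]
* u need not equal | l |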
SAM: Estimate the False Discovery Rate (FDR)
Example from Tusher et al. 2001. 46 genes identified as DE using Δ = 1.2. For permutation 1, figure out how many genes you would have rejected using this Δ and assuming dj1 ( j = 1,2,...,m) is data.
Repeat for every permutation and calculate average number of false positives. Average = 8.4 => FDR = 8.4 / 46 = 0.183 (18.3%).
[Figure: d(j)1 from permutation 1 (y axis) vs. dE,j (x axis) with cutoffs u and l; 5 FD’s for this set.]
SAM: Defining s0 and Δ:
s0 is chosen to make CV of dj approximately constant as a function of sj. This dampens large values of dj that arise from genes with very small sj.
Generally, a constant CV (or approximately constant CV) is assumed in models of microarray data.
To determine Δ, fix the type I error rate α. Calculate $\widehat{FDR}_1, \widehat{FDR}_2, \ldots, \widehat{FDR}_n$ for n values of Δ. Take smallest Δ* such that $\widehat{FDR}^{*} < \alpha$.
There are other suggestions for calculating Δ (page ~281 of Parmigiani text, 2003) that involve controlling the pFDR. Control of pFDR is becoming more common.
SAM: Comments by Tusher et al. 2001
• Dudoit et al. 2002 method (using step down max T) is too conservative. It found zero genes for this data set !
• 8 arrays are not enough for p-values based on permutations such as those done in Dudoit et al. 2002.
• SAM does not have strong or weak control of FDR.
• SAM estimates FDR. The estimate can be > 1 .
SAM: Comments on Tusher et al. 2001
• They stress the well known problems that arise from using fold change (nice reference to cite).
• The optimal way in which to determine s0 and Δ are open problems. They have been addressed. See Parmigiani text, 2003.
• Application of SAM methodology to more than two conditions has not been evaluated. Utility will rely on construction of a good statistic (that can be hard).
• Intuitive approach. Implemented in Excel and R.
• SAM determines Δ and calculates FDR using the same data. This could introduce a bias. See page ~282 of Parmigiani text, 2003.
False Discovery Rate
              Do not reject H0   Reject H0   Total
H0 true       U                  V           m0
H0 false      T                  S           m-m0
Total         m-R                R           m
FDR: E(Q) where Q = V/R (R > 0) and 0 (R = 0)
E(Q) = E (Q | R > 0) Pr (R > 0)
Benjamini - Hochberg Procedure to Control the FDR
• Let P1,…,Pm denote the p-values from m tests.
• Order the p-values: P(1) ≤ P(2) ≤ … ≤ P(m).
• Let k* = max{ k : P(k) ≤ α k/m }
• Reject all the null hypotheses for which Pi ≤ P(k*).
• This ensures FDR ≤ (m0/m) α ≤ α
• Result does not depend on m0 (the number of true nulls) or the distribution of p-values under H1.
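The steps above can be sketched as follows. This is a minimal illustration of the step-up rule, assuming NumPy is available; the function name and the example p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean array marking the hypotheses rejected by the step-up rule."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices that sort the p-values
    sorted_p = p[order]
    thresholds = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(sorted_p <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        k_star = below[-1]                     # zero-based position of k*
        reject[order[: k_star + 1]] = True     # reject everything up to P(k*)
    return reject

# Hypothetical p-values from m = 10 tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)
print(int(rejected.sum()))
```

Note that every p-value at or below P(k*) is rejected, even one that sits above its own line value α k/m.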
Benjamini - Hochberg Procedure to Control the FDR
[Figure: ordered p-values plotted against k/m (axis ticks 1/m, 20/m, 40/m, ..., 1) together with a line of slope α; k*/m is marked, where k* = max {k : p(k) ≤ α k/m}, 1 ≤ k ≤ m.]
Empirical Bayes for Microarrays (EBarrays)
On Differential Variability of Expression Ratios: Improving Statistical Inference About Gene Expression Changes from Microarray Data
by M.A. Newton, C.M. Kendziorski, C.S. Richmond, F.R. Blattner, and K.W. Tsui
Journal of Computational Biology 8: 37-52, 2001.
On Parametric Empirical Bayes Methods for Comparing Multiple Groups Using Replicated Gene Expression Profiles
by C.M. Kendziorski, M.A. Newton, H. Lan and M.N. Gould
Statistics in Medicine, to appear, 2003.
**Additional Details in Parmigiani et al., 2003.
EBarrays: Outline
• Data: E. coli under 3 treatments, 1 control; 4 cDNA arrays, ~4200 spots. Rat mammary glands from parentals and congenics; 24 Affymetrix chips, ~26,000 intensities.
• Model Development: Hierarchical Mixture Model accounts for known sources of variability.
• Normalization: EBarrays assumes data has been normalized for effects within and between arrays.
• Gene summary: Posterior probability of DE for each gene. Evaluation uses data across the entire array.
• Hypothesis test at each gene (“naturally” accounts for multiple tests).
EBarrays: Data
E.coli K-12 cell lines: 4 samples labelled in red (control, IPTG-a, IPTG-b, HS) and 4 in green (all control).
EBarrays: Data
10 Affy chips from non-treated; 14 from treated (DMBA)
EBarrays: Model Development
Measurement Error:
$x_i \mid \theta_{x,i} \sim G(a, \theta_{x,i})$
$y_i \mid \theta_{y,i} \sim G(a, \theta_{y,i})$
Actual Expression:
$\theta_{x,i}, \theta_{y,i} \sim IG(a_0, \nu)$
$z_i = 1$ if $\theta_{x,i} \neq \theta_{y,i}$; $z_i = 0$ if $\theta_{x,i} = \theta_{y,i}$
$Z \sim B(p)$
EBarrays: Model Fit
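$p_A(x_i, y_i \mid a, a_0, \nu) = p(x_i \mid a, a_0, \nu)\, p(y_i \mid a, a_0, \nu)$
$p_0(x_i, y_i \mid a, a_0, \nu) = \int_0^\infty p(x_i \mid \theta)\, p(y_i \mid \theta)\, p(\theta \mid a_0, \nu)\, d\theta$
$l_c(a, a_0, \nu, p) = \sum_k \left\{ z_k \ln p_A(x_k, y_k) + (1 - z_k) \ln p_0(x_k, y_k) + z_k \ln p + (1 - z_k) \ln(1 - p) \right\}$
E-step: $\hat{z}_k = P(z_k = 1 \mid x_k, y_k, a, a_0, \nu, p) = \frac{p\, p_A(x_k, y_k)}{p\, p_A(x_k, y_k) + (1 - p)\, p_0(x_k, y_k)}$
M-step: Maximize the resulting form in $(a, a_0, \nu, p)$.
Here
$p(x_i \mid a, a_0, \nu) = \int_0^\infty p(x_i \mid \theta_{x,i})\, p(\theta_{x,i} \mid a_0, \nu)\, d\theta_{x,i}$
and
$p(y_i \mid a, a_0, \nu) = \int_0^\infty p(y_i \mid \theta_{y,i})\, p(\theta_{y,i} \mid a_0, \nu)\, d\theta_{y,i}$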
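A stripped-down version of the E and M steps may help fix ideas. The sketch below is a simplification, not the EBarrays implementation: it assumes the densities under the DE and EE hypotheses have already been evaluated for each gene with the hyperparameters held fixed, so that only the mixing proportion p is updated. The function name and the density values are hypothetical.

```python
import numpy as np

def em_mixing_proportion(dens_alt, dens_null, p_init=0.5, n_iter=200, tol=1e-10):
    """EM for the mixing proportion p of a two-component mixture with fixed densities."""
    p_a = np.asarray(dens_alt, dtype=float)    # density of each gene under DE
    p_0 = np.asarray(dens_null, dtype=float)   # density of each gene under EE
    p = p_init
    for _ in range(n_iter):
        # E-step: posterior probability that each gene is DE.
        z_hat = p * p_a / (p * p_a + (1.0 - p) * p_0)
        # M-step for p alone: the average of the posterior probabilities.
        p_new = z_hat.mean()
        converged = abs(p_new - p) < tol
        p = p_new
        if converged:
            break
    z_hat = p * p_a / (p * p_a + (1.0 - p) * p_0)
    return p, z_hat

# Hypothetical density values for four genes.
p_hat, z_hat = em_mixing_proportion([2.0, 0.1, 1.5, 0.05], [0.2, 1.0, 0.3, 1.2])
odds = z_hat / (1.0 - z_hat)   # posterior odds of DE for each gene
print(round(p_hat, 3), np.round(z_hat, 3))
```

In the full model the M-step would also update the hyperparameters, which generally requires numerical optimization.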
EBarrays: Model Diagnostics (Marginal Densities)
EBarrays: Gene Specific Summary
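$\text{odds} = \frac{P(z_i = 1 \mid D)}{P(z_i = 0 \mid D)}$
$P(z_i = 1 \mid D) = \int_0^1 P(z_i = 1 \mid x_i, y_i, p)\, P(p \mid D)\, dp$
$\text{odds} \approx \frac{p_A(x_i, y_i)}{p_0(x_i, y_i)} \cdot \frac{\hat{p}}{1 - \hat{p}}$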
EBarrays: Contour Plots of Odds
EBarrays: Model Diagnostics (Gamma QQ plots on 4 group comparison - DMBA treated)
EBarrays: Model Diagnostics (CV plots on 4 group comparison - DMBA treated)
EBarrays: Results on 4 group comparison (DMBA treated)
Gene ID COP CI CII WF P0 P1 P2 P3
J00801 3066 4777 995 9083 0.05 0.95 0 0
0.04 0.96 0 0
L08100 4368 1278 14162 0 1 0 0
0 1 0 0
J00772 392 122 679 0.04 0.96 0 0
0.97 0.02 0.00 0.01
EBarrays: Results on 4 group comparison (DMBA treated)
EBarrays: Threshold
• The rule “classify into the pattern of expression with the highest posterior probability” is the rule which minimizes the posterior expected number of false positives and negatives (under 0-1 loss).
• For two conditions, this is the same as “classify into the pattern of expression with posterior probability > 0.5 ”.
• EBarrays reports posterior probabilities; user can decide on threshold.
EBarrays: Threshold (Meng Chen)
• 500, 1000 and 2000 genes
• P(DE) = 0.05, 0.1, …, 0.5
• 2 conditions with 20 samples each.
• EBarrays, non-informative prior, 5 iterations, 0-1 loss.
• Each point is the average of pFDR for 10 runs.
• The plane is the Linear Model fit.
• pFDR increases when P(DE) becomes larger, but is still relatively low !
EBarrays: Some Ideas on pFDR control (Meng Chen)
• In Bayesian framework, one chooses a loss function to specify the relative cost of a false positive to a false negative; then, a rule is derived to minimize the Bayes risk.
• As stated previously, EBarrays reports posterior probabilities; user decides on threshold (under 0-1 loss, rule is to take the pattern with the highest posterior probability).
• The posterior expected false discovery rate can be controlled by adjusting the threshold.
• For example, one can decide beforehand at what level to control the pFDR and then use the rule that controls it at that level.
• Which one makes more sense?
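One way to turn posterior probabilities into a list with a target posterior expected FDR is sketched below. It is an illustration under stated assumptions, not the rule used in the simulations on these slides: genes are ranked by posterior probability of DE, and the list is extended as long as the average of (1 - posterior probability) over the list stays at or below the target. The function name and the probabilities are hypothetical.

```python
import numpy as np

def posterior_fdr_calls(post_prob_de, target_fdr=0.05):
    """Call the largest top-ranked gene list whose posterior expected FDR is <= target."""
    prob = np.asarray(post_prob_de, dtype=float)
    order = np.argsort(-prob)                  # most confident genes first
    local_error = 1.0 - prob[order]            # posterior probability a call is false
    running_fdr = np.cumsum(local_error) / np.arange(1, prob.size + 1)
    ok = np.nonzero(running_fdr <= target_fdr)[0]
    called = np.zeros(prob.size, dtype=bool)
    if ok.size > 0:
        called[order[: ok[-1] + 1]] = True
    return called

# Hypothetical posterior probabilities of DE for five genes.
called = posterior_fdr_calls([0.99, 0.97, 0.90, 0.60, 0.20], target_fdr=0.05)
print(called.tolist())
```

Under 0-1 loss the cut would instead fall at posterior probability 0.5, so the two rules can give different lists from the same probabilities.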
EBarrays: Which one makes more sense ? (Meng Chen)
The posterior probability threshold is the deciding point in EBarrays. Reject null if Pr(DE|data) > threshold.
• 1000 genes, 2 conditions with 20 samples each. P(DE) = 0.2
• Increasing the threshold appears to decrease FDR.
• A large threshold corresponds to a bigger penalty on false positives, which makes sense.
EBarrays: Which one makes more sense ? A Simulation Study (Meng Chen)
• Simulations were carried out to compare the risk resulting from EBarrays and BH.
• 1000 genes, 2 conditions of 10 samples each.
• ~ N(,1). = 2,3,4,5.
• The risk is a function of the true proportion of DE genes.
• Didn’t replicate the runs.
EBarrays: Which one makes more sense ? Results (Meng Chen)
EBarrays: Comments on Empirical Bayes Approach
• Hierarchical model was developed to identify significant differential expression.
• Model accounts for measurement error process and for natural fluctuations in absolute expression levels.
• Multiple conditions are handled in the same way as two conditions (no extra work required!).
• Threshold can be adjusted to target a specific pFDR.
• R-library available at www.biostat.wisc.edu/~kendzior/ (soon in Bioconductor).
• In addition to identifying DE genes, EBarrays provides improved (shrinkage) estimates of expression.
Identifying DE genes: General Approach
• Use data to evaluate a test statistic (determine a gene specific summary) for every gene. Could use only data at that gene (Dudoit et al.). Could use data from all genes (Newton et al., Storey et al.).
• Evaluate method (model) used to generate test statistics. Were assumptions reasonable ? Does model fit well ? Does it provide additional information ?
• Perform hypothesis test at each gene. Determine threshold. Perhaps adjust for multiple tests.
EBarrays: Shrinkage Estimates of Fold Change
The posterior distribution of true differential expression at a given spot:
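$p(\rho_i \mid x_i, y_i, a, a_0, \nu) \propto \rho_i^{-(a + a_0 + 1)} \left[ \frac{1}{\rho_i} + \frac{y_i + \nu}{x_i + \nu} \right]^{-2(a + a_0)}$
$\hat{\rho}_i = \frac{x_i + \nu}{y_i + \nu}$
Use marginal maximum likelihood to determine $(a, a_0, \nu)$.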
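The effect of shrinkage on fold change is easy to see numerically. The sketch below assumes the estimate takes the form (x + ν)/(y + ν), as in Newton et al. 2001; the function name, the intensities, and the value ν = 10 are hypothetical (in practice ν would come from the marginal maximum likelihood fit).

```python
import numpy as np

def shrunken_fold_change(x, y, nu):
    """Fold change estimate (x + nu) / (y + nu); nu pulls low-intensity ratios toward 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return (x + nu) / (y + nu)

# Three hypothetical spots with the same naive ratio but very different intensities.
x = np.array([2.0, 200.0, 2000.0])
y = np.array([1.0, 100.0, 1000.0])
naive = x / y
shrunk = shrunken_fold_change(x, y, nu=10.0)
print(naive, np.round(shrunk, 3))
```

All three spots have a naive fold change of 2, but the low-intensity spot is shrunk most strongly toward 1, which is why shrinkage can re-rank genes.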
EBarrays: Shrinkage Plots
EBarrays: Shrinkage Estimates Provide Error Reduction
EBarrays: Shrinkage Estimates of Expression Re-rank Genes
CLASSIFICATION AND CLUSTERING
Problems to be Addressed via Classification or Clustering Methods
Unsupervised Learning: Identification of new groups of profiles or genes
Hierarchical clustering analysis, SVD, PCA, K-means, Model Based Approaches, SOM*
(* Golub et al. implementation uses both unsupervised and supervised methods).
Supervised Learning: Classification into known classes (usually done on profiles)
Discriminant Analysis methods
Variable Selection: Identification of predictors (usually genes) that characterize known profile classes
Why cluster microarray data ??
11.03   1.22   0.92   2.61  -0.29   1.31   1.15   0.54   1.98  10.70
 0.00  10.61   2.40   2.16   0.60  -0.22   1.64   0.89  10.64   0.21
 0.12   0.46  10.30   0.14   1.56   2.29   1.30  11.18   2.06   0.14
-0.92   0.97  -0.11  10.78   2.57   2.26   6.64   1.39   1.22   2.13
 0.59   0.29   0.30   2.14   9.83  10.42  -0.50   1.66   0.29   0.46
 2.14   2.19   2.19   0.01   9.93   9.84  -0.57  -0.52   1.36   0.48
-0.77   0.65   1.51  10.39   0.90   0.92  10.26   0.69   2.13  -0.10
 0.35   0.07   7.66   0.18   0.70   0.51   2.24   9.99   2.26  -0.15
 1.67  11.30   1.48   1.89   1.29   0.72   0.39   0.94   8.41   1.20
11.26   0.43   2.12   1.10   1.40   1.48   2.04   0.96   0.93   8.55
Simple Answer: To recognize patterns that aren’t easy to see.
Why cluster microarray data ??
Current methods for classifying human tumors rely on a variety of morphological, clinical, and molecular variables.
There are still uncertainties in diagnosis.
Existing tumor classes are most likely heterogeneous.
Microarrays may be used to characterize the molecular variations among tumors by monitoring gene expression profiles on a genomic scale.
This may lead to more reliable classification of tumors !!
Nice motivation provided by Dudoit et al., 2002, JASA (cancer is used to illustrate)
Clustering Algorithms (Unsupervised)
“Model Free” algorithms (aka. combinatorial algorithms) directly assign each observation to a group or cluster without consideration of an underlying probability model.
* Popular
* Intuitive
* (Fairly) Easy to Implement
“Model Based” algorithms: many assume data are i.i.d. from some population with pdf f where f is a mixture of component density functions. Each component describes one of the clusters. The model can be fit using ML or Bayesian methods.
Model Free Algorithms
The most popular clustering algorithms directly assign each observation to a group or cluster without regard to a probability model describing the data.
Each observation is labelled i ∈ {1,2,...,N}
Each cluster is labelled k ∈ {1,2,...,K} (K < N)
Each observation is assigned to one (and only one) cluster.
C: i -> k ( C (i) = k )
Consider the distance d(xi, xi’) between every pair of observations.
Find C* that achieves some goal (i.e., minimize some summary function of the d’s).
K - Means
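Goal: Minimize W(C)
Algorithm:
1. For a given C, calculate the means of each cluster {m1, m2, ..., mK}
2. Assign each data value to cluster with closest cluster mean
3. Repeat 1 & 2 until convergence.
$d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \lVert x_i - x_{i'} \rVert^2$
$W(C) = \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \lVert x_i - x_{i'} \rVert^2$
$C(i) = \arg\min_{1 \le k \le K} \lVert x_i - m_k \rVert^2$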
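The three algorithm steps can be sketched as follows. This is a minimal illustration assuming NumPy, squared Euclidean distance, and initial means drawn at random from the data; the function name, the seed, and the toy data are hypothetical.

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Alternate between assigning points to the nearest mean and recomputing means."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from K distinct observations chosen at random.
    means = X[rng.choice(X.shape[0], size=K, replace=False)].copy()
    labels = np.full(X.shape[0], -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every cluster mean.
        dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                              # assignments stopped changing
        labels = new_labels
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:               # keep the old mean if a cluster empties
                means[k] = members.mean(axis=0)
    return labels, means

# Two well-separated toy groups of three points each.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, means = k_means(X, K=2)
print(labels.tolist())
```

On less cleanly separated data the result depends on the starting means, so in practice the algorithm is rerun from several initial configurations.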
Comments on K - Means
1) Uses quantitative data values
2) No ordering of objects within a cluster.
3) Number of clusters, K, must be chosen in advance
4) As K changes, cluster membership can change in arbitrary ways. (i.e., clusters need not be nested).
5) The algorithm on the previous page guarantees convergence, but convergence may be to a local min.
Comments on K - Means
(4). To choose K, oftentimes several different values of K are considered
[Figure: distance plotted against K.]
(5). To identify if min is local, should start from many different configurations (never guaranteed).
Final Note on K - Means
K-means is similar to K-medoids
In K-medoids, instead of finding mean values of clusters, “centers” of clusters are found.
“Centers” are points that minimize the total distance to other points in that cluster.
Hierarchical Clustering
Specify measure of distance (dissimilarity) between pairs of observations (ie, construct a distance matrix).
Produce hierarchical representations in which clusters at each level are created by merging clusters at the next lower level.
Lowest Level: Each cluster is a single observation.
Highest Level: One cluster contains all data.
Hierarchical Clustering (continued)
Bottom up: Start at lowest level (single observations) and merge selected pair (most similar) into cluster. Using definition of distance between observations and clusters, continue...
Top down: Recursively split data.
Each level of the hierarchy represents particular grouping of the data into disjoint clusters of observations
Generally, significance measures on clusters are not considered. Exception: Fraley & Raftery, UW-TR, 1998.
Bottom Up Clustering: 3 common approaches
Single Linkage: Distance taken to be minimum distance among all pairwise distances
Complete Linkage: Distance taken to be maximum distance among all pairwise distances.
Average Linkage: Distance measure is averaged across all pairwise distances.
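$d_{SL}(G, H) = \min_{i \in G,\ i' \in H} d_{ii'}$
$d_{CL}(G, H) = \max_{i \in G,\ i' \in H} d_{ii'}$
$d_{AL}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}$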
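The three between-cluster distances differ only in how they summarize the same table of pairwise distances. The sketch below assumes NumPy and Euclidean distance between observations; the function names and the two toy clusters are hypothetical.

```python
import numpy as np

def pairwise_distances(G, H):
    """Euclidean distance between every observation in G and every observation in H."""
    G = np.asarray(G, dtype=float)
    H = np.asarray(H, dtype=float)
    diff = G[:, None, :] - H[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def linkage_distances(G, H):
    """Single, complete, and average linkage distances between clusters G and H."""
    d = pairwise_distances(G, H)
    return {"single": d.min(), "complete": d.max(), "average": d.mean()}

# Two toy clusters of two observations each.
G = [[0.0, 0.0], [0.0, 1.0]]
H = [[3.0, 0.0], [4.0, 0.0]]
result = linkage_distances(G, H)
print({name: round(float(value), 3) for name, value in result.items()})
```

The average always lies between the single and complete linkage values, which is one way to see why the three methods agree when clusters are compact and well separated.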
Bottom Up Clustering: Comments
If data exhibits strong clustering features (quantified by measure d) and each of the clusters is well separated from the others, then the 3 methods will produce similar results.
With Single Linkage, there is a tendency to combine observations linked by a series of close intermediate observations (“chaining”). The clusters might not be compact.
With Complete Linkage, compact clusters are obtained; but the distance between clusters might be small.
Hierarchical Clustering: Eisen et al., PNAS, 1998
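For genes i and j,
$d(x_i, x_j) = \frac{1}{N} \sum_{k=1}^{N} \left( \frac{x_{i,k} - x_{i,OS}}{\Phi_{x_i}} \right) \left( \frac{x_{j,k} - x_{j,OS}}{\Phi_{x_j}} \right)$
$\Phi_G = \sqrt{ \sum_{i=1}^{N} \frac{(G_i - G_{OS})^2}{N} }$
GOS = 0; Note that when GOS is mean of observations on G, d is the correlation coefficient.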
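The role of the offset can be checked numerically. The sketch below assumes NumPy; the function name and the two toy expression profiles are hypothetical. With offsets of 0 it gives the uncentered measure, and with offsets equal to the profile means it reproduces the Pearson correlation coefficient.

```python
import numpy as np

def eisen_similarity(x, y, x_offset=0.0, y_offset=0.0):
    """Average product of offset-adjusted, scaled expression values for two genes."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    phi_x = np.sqrt(np.mean((x - x_offset) ** 2))
    phi_y = np.sqrt(np.mean((y - y_offset) ** 2))
    return np.mean(((x - x_offset) / phi_x) * ((y - y_offset) / phi_y))

# Hypothetical expression profiles for two genes over four arrays.
x = np.array([1.0, 2.0, 3.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 6.0])
uncentered = eisen_similarity(x, y)                    # offsets of 0
centered = eisen_similarity(x, y, x.mean(), y.mean())  # offsets at the means
print(round(float(uncentered), 4), round(float(centered), 4))
```

With log ratios, an offset of 0 corresponds to the reference state, so the uncentered version treats a gene that is consistently shifted from the reference differently from one that fluctuates around it.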
Hierarchical Clustering: Eisen et al.
Comments on Hierarchical Clustering
Intuitive
Biological information can be (but often is not) incorporated into measures of distance
Nice as a descriptive or diagnostic tool
Cluster order is arbitrary
Where one “cuts” the tree is arbitrary
No confidence measures are applied to clusters.
CAUTION!
Model Based Approaches for Unsupervised Clustering
Model Assumptions are made
Confidence or Probability of genes in particular groups can be assessed.
*** Many methods to identify DE genes can be thought of as model based clustering approaches where clustering is done using gene specific summaries.
EBarrays: Results on 4 group comparison (DMBA treated)
EBarrays: Data
10 Affy chips from non-treated; 14 from treated (DMBA)
EBarrays: Identification of a few interesting genes
Gene ID COP CI CII WF P0 P1 P2 P3
J00801 3066 4777 995 9083 0.05 0.95 0 0
0.04 0.96 0 0
L08100 4368 1278 14162 0 1 0 0
0 1 0 0
J00772 392 122 679 0.04 0.96 0 0
0.97 0.02 0.00 0.01
POE: Probability of Expression
A Statistical Framework for expression based molecular classification in cancer
by G. Parmigiani, E.S. Garrett, R. Anbazhagan, E. Gabrielson
Journal of Royal Statistical Society 64: 717-736 (with discussion), 2002.
**Additional Details in Parmigiani et al., 2003.
POE: Probability of Expression
POE:
Model gene expression using latent categories (on, off, baseline; i.e., over, under, neutral)
Use model to
1. Remove noise prior to clustering
2. Define molecular subclasses
3. Determine probability that particular gene is in a class
POE:
For each gene and tumor, calculate the probability that the gene is expressed at baseline, over-expressed, or under-expressed in that tumor.
Identify clusters of genes based on this probability. Identify representative (“seed”) genes within each cluster.
Identify patterns (“profiles”) of expression across seed genes. For each tumor, calculate the posterior probability of each expression pattern (“profile”).
THIS CLASSIFIES TUMORS !
[Figure: data matrix with rows g1, g2, ..., gm (m genes) and columns t1, t2, ..., tN (N tumors).]
POE: Basic Idea Behind Model
A given gene g can be over-expressed, under-expressed, or neutral.
Suppose there are K tumor classes
If gene g* is related to tumor class, then the distribution of expression values of g* will be different in at least one of the K classes.
Currently assumes classes are not known (unsupervised), but they are working on extending this.
POE
• Notation:
$e_{gt} = -1$: gene g has abnormally low expression in tumor t
$e_{gt} = 0$: gene g has normal expression in tumor t
$e_{gt} = 1$: gene g has abnormally high expression in tumor t
• Modeling observed gene expression, agt:
$a_{gt} \mid (e_{gt} = e) \sim f_{e,g}(\cdot), \quad e \in \{-1, 0, 1\}$
• For gene g, the proportions of differentially expressed tumors in the population of unclassified tumors are
$\pi_g^{+} = P(e_{gt} = 1), \quad \pi_g^{-} = P(e_{gt} = -1)$
Garrett JSM 2002
POE: Quantities of Interest
$p_{gt}^{+} = P(e_{gt} = 1 \mid a_{gt}, \pi_g^{+}, \pi_g^{-}, f_{1,g}, f_{0,g}) = \frac{\pi_g^{+} f_{1,g}(a_{gt})}{\pi_g^{+} f_{1,g}(a_{gt}) + (1 - \pi_g^{+} - \pi_g^{-}) f_{0,g}(a_{gt})}$
Interpretation: The probability that gene g in tumor t is over expressed given observed expression and the model parameters
$p_{gt}^{-} = P(e_{gt} = -1 \mid a_{gt}, \pi_g^{+}, \pi_g^{-}, f_{-1,g}, f_{0,g}) = \frac{\pi_g^{-} f_{-1,g}(a_{gt})}{\pi_g^{-} f_{-1,g}(a_{gt}) + (1 - \pi_g^{+} - \pi_g^{-}) f_{0,g}(a_{gt})}$
Interpretation: The probability that gene g in tumor t is under expressed given observed expression and the model parameters
Garrett JSM 2002
POE: Distributional Assumptions
$f_{-1,g}(\cdot) = U(-\kappa_g^{-} + \alpha_t + \mu_g,\ \alpha_t + \mu_g)$
$f_{0,g}(\cdot) = N(\alpha_t + \mu_g,\ \sigma_g)$
$f_{1,g}(\cdot) = U(\alpha_t + \mu_g,\ \alpha_t + \mu_g + \kappa_g^{+})$
Empirical Bayes Approach: Could put priors on unknowns and integrate to give predictive distributions. Then maximize the marginal likelihood to identify unknowns.
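With a normal density for baseline expression and uniform densities on either side of it, the class probabilities for a single measurement follow from Bayes' rule. The sketch below is an illustration only: it treats all parameters as known, takes the second argument of the normal as a standard deviation (an assumption), and uses the full three-component denominator, which coincides with a two-component one wherever the two uniform supports do not overlap. All names and parameter values are hypothetical.

```python
import math

def uniform_pdf(a, low, high):
    """Density of a uniform distribution on [low, high]."""
    return 1.0 / (high - low) if low <= a <= high else 0.0

def normal_pdf(a, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (a - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def poe_probabilities(a, center, sd, kappa_minus, kappa_plus, pi_minus, pi_plus):
    """Posterior probabilities (under, baseline, over) for one observed expression a."""
    w_minus = pi_minus * uniform_pdf(a, center - kappa_minus, center)
    w_plus = pi_plus * uniform_pdf(a, center, center + kappa_plus)
    w_zero = (1.0 - pi_minus - pi_plus) * normal_pdf(a, center, sd)
    total = w_minus + w_plus + w_zero
    return w_minus / total, w_zero / total, w_plus / total

# Hypothetical parameters: baseline centered at 0 with sd 1, uniform widths of 10.
params = dict(center=0.0, sd=1.0, kappa_minus=10.0, kappa_plus=10.0,
              pi_minus=0.1, pi_plus=0.1)
far = poe_probabilities(4.0, **params)    # well above baseline
near = poe_probabilities(0.5, **params)   # close to baseline
print([round(v, 3) for v in far], [round(v, 3) for v in near])
```

A measurement four standard deviations above the center is assigned almost entirely to the over-expressed class, while one half a standard deviation away stays with the baseline class.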
POE: Distributional Assumptions
α_t: Sample expression for normal expression levels.
μ_g: gene effect in gene g for normal expression.
(κ_g+ / σ_g) > r where r is approximately 5.
[Figure: the three densities f -1,g, f 0,g, and f 1,g, with the normal component f 0,g centered at α_t + μ_g.]
POE: Distributional Assumptions
After fitting, parameters can be used to “de noise” data or to cluster genes.
p gt +, p gt -, and p gt 0 are the most important quantities.
[Equations: hierarchical priors. The gene effect μ_g is normal, the σ_g variance term is gamma, κ_g+ and κ_g- are exponential, and logit(π_g+) and logit(π_g-) are normal, each conditional on its own hyperparameters.]
POE: “De-noised” measures of expression
a g t ~ N( gt, g)
Eappa gtgtgtgtgtgtgt (|,)()()
Normal class: g t = t + g with g unknown
Elevated class: g t - t - g ~ U ( 0, k g+)
Low expression class: g t - t - g ~ U ( k g-, 0)
Posterior means of gt can be used as estimates of expression values.
POE: Cluster genes
1. Choose DE pattern of interest. Indicate what proportion of genes are over-expressed and under-expressed.
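2. For each gene, using p gt+ and p gt-, calculate the probability that the samples have a pattern of the type specified. Sort by this probability.
$P(e_{g1}, \ldots, e_{gT} \mid \cdot) = \prod_{t} (p_{gt}^{+})^{I(e_{gt} = 1)} (p_{gt}^{-})^{I(e_{gt} = -1)} (1 - p_{gt}^{+} - p_{gt}^{-})^{I(e_{gt} = 0)}$
3. Calculate a J x J matrix of “gene agreement”:
$r_{gm} = \sum_{i=1}^{I} \left( p_{gi}^{+} p_{mi}^{+} + p_{gi}^{-} p_{mi}^{-} + p_{gi}^{0} p_{mi}^{0} \right)$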
POE: Cluster genes
4. Identify genes with “high coherence”
5. Out of this group, pick gene with the highest probability calculated from step 2 (seed gene).
6. Group genes if they are similar to the seed gene.
7. Remove this group and repeat.
Should repeat process using different initial patterns.
Results in some number of seed genes (for example, say 4).
POE: Comments on seed genes (pg. 379 of Parmigiani et al. text)
Any number of seed genes can be used to create a collection of profiles. s genes gives 3s possible profiles.
“ Profiles based on 4 or more genes are seldom required with the sample sizes and signal-to-noise ratio achievable.”
POE: Creating molecular profiles
For each tumor t, calculate the posterior probabilities of each expression profile using p sg1, t +, p sg1, t -, p sg1, t 0, ...
This classifies tumors !
Each of the 4 seed genes could be under-expressed (-1), over-expressed (1), or neither (0). This gives 34 = 81 patterns of expression (or profiles).
P1 P2 P3 ...... P81
sg1 0 0 -1 1
sg2 0 1 0 1
sg3 0 0 -1 1
sg4 0 0 1 1
POE: Quotes from Garrett & Parmigiani in Parmigiani et al. text 2003
“ The benefit of [the Bayesian hierarchical modeling] approach is that it borrows strength across genes using the entire genomic distribution instead of fitting a separate independent model for each gene.”
“Hierarchical Bayesian models have been shown to have appealing properties in estimation of large vectors of related quantities.”
Comments on POE
Unsupervised, model based clustering approach. The model accounts for measurement errors.
NOT intended for gene clustering.
Uses scale-independent measures of expression which allows combination of data across platforms
Defines a molecular profile based on a small number of genes. This could be useful clinically.
TIME SERIES ANALYSIS OF MICROARRAY DATA
Analysis of Microarray Time Series
Manuscript in Progress
by
M. Yuan and C.M. Kendziorski
Methods for Microarray Time Series Data
Every method that we know considers TS data in one condition. General goal is to cluster genes with similar expression patterns over time.
We consider TS data in multiple conditions. Group genes based on differential expression patterns over time.
Microarray Time Series: Example
• Two treatments
• 6 time points: 0, 2, 6, 24, 48, 120 Hours
• Number of Genes: 12625
Microarray Time Series: Apply EBarrays at each time point
• Differentially expressed genes identified by EBarrays
0 Hr   2 Hrs   6 Hrs   24 Hrs   48 Hrs   120 Hrs
0      9       0       28       170      333
• From 24 hrs to 48 hrs, from 48 hrs to 120 hrs
           Pr(DE|DE)   Pr(DE)
48 Hrs     6/28        170/12625
120 Hrs    36/170      333/12625
Microarray Time Series: Correlation
• What if there is no correlation?
Pr(DE|DE)=Pr(DE)
• Why should we care about correlation?
Posterior odds: Pr(DE)f(x|DE) / [Pr(EE)f(x|EE)]
If Pr(DE) is large, it is easier to claim DE
If Pr(DE) is small, it is harder to claim DE
Microarray Time Series: How much information did we lose ?
• Given that a gene is DE at 24 Hours
We could claim DE at 48 Hours if f(x|DE) > 3.67 f(x|EE)
If we do not consider correlation, we claim DE at 48 Hours if f(x|DE) > 73.26 f(x|EE)
• Given that a gene is DE at 48 Hours
We could claim DE at 120 Hours if f(x|DE) > 3.72 f(x|EE)
If we do not consider correlation, we claim DE at 120 Hours if f(x|DE) > 36.91 f(x|EE)
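These cutoffs follow from requiring posterior odds greater than 1, that is Pr(DE) f(x|DE) > Pr(EE) f(x|EE), so the likelihood ratio must exceed (1 - Pr(DE)) / Pr(DE). The sketch below recomputes them from the counts given earlier; the function name and dictionary keys are hypothetical. Under this rule the marginal 48-hour cutoff is 12455/170, about 73.26 (the ratio 12625/170, without subtracting the DE genes, is about 74.26).

```python
from fractions import Fraction

def likelihood_ratio_cutoff(prob_de):
    """Smallest f(x|DE)/f(x|EE) for which the posterior odds of DE exceed 1."""
    return (1 - prob_de) / prob_de

cutoffs = {
    "48h given DE at 24h": likelihood_ratio_cutoff(Fraction(6, 28)),
    "48h ignoring correlation": likelihood_ratio_cutoff(Fraction(170, 12625)),
    "120h given DE at 48h": likelihood_ratio_cutoff(Fraction(36, 170)),
    "120h ignoring correlation": likelihood_ratio_cutoff(Fraction(333, 12625)),
}
for name, value in cutoffs.items():
    print(name, round(float(value), 2))
```

The gap between the conditional and marginal cutoffs is what is lost when each time point is analyzed independently.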
Microarray Time Series: HMM Model Structure
• Pattern process S[t]: Pattern of expression at time t.
Treatment vs Control: DE or EE
Compare treatments: e.g., 5 patterns for 3 trts
Markov Process
• Expression vector x[t1],…,x[tK]: K expressions observed at time t.
Distributed according to S[t]
Conditionally independent given S[t]
Microarray Time Series: HMM Model Structure
[Figure: for each gene (Gene 1, Gene 2, ..., Gene n), the hidden patterns S[1], S[2], S[3], … are linked by an HMM, and each S[t] generates the expression vector x[t1],…,x[tK].]
Microarray Time Series: Options to Specify HMM
• Marginal expression distribution using EBarrays
• Transition matrix Pr(S[t]|S[t-1]) free of time → Homogeneous HMM
• Force Pr(DE|S[t-1]=DE)=Pr(DE|S[t-1]=EE) → Independent analysis
• Homogeneous independent analysis: constraint that Pr(DE) is constant over time
Microarray Time Series: Estimation Using EM
• Infer unknown pattern process from observed expression data – Maximum a Posteriori (MAP): max Pr(S[t]=i|X)
• Unknowns:
Parameters associated with f(x|S)
Parameters associated with pattern process
• EM algorithm
E Step: parameters → pattern process (Baum-Welch)
M Step: pattern process → parameters (MLE)
Microarray Time Series: Cluster Genes Into Patterns
• Maximum a posteriori
max Pr(S[t], t=1,…,T|X)
• Viterbi Algorithm
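The Viterbi recursion for a single gene can be sketched as follows. This is a generic dynamic-programming illustration assuming NumPy and known parameters, with states coded 0 = EE and 1 = DE; the function name, the transition matrix, and the emission likelihoods are hypothetical.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most probable state path; log_emit has shape (T, n_states)."""
    T, n_states = log_emit.shape
    score = np.empty((T, n_states))
    back = np.zeros((T, n_states), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        # cand[i, j]: best log score ending in state i at t-1, then moving to j.
        cand = score[t - 1][:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(T - 2, -1, -1):             # trace the best path backwards
        path[t] = back[t + 1, path[t + 1]]
    return path

# Hypothetical parameters for one gene observed at four time points.
init = np.log([0.9, 0.1])
trans = np.log([[0.9, 0.1],
                [0.3, 0.7]])                   # rows: from EE, from DE
emit = np.log([[0.90, 0.10],
               [0.05, 0.95],
               [0.10, 0.90],
               [0.90, 0.10]])                  # f(x[t] | EE), f(x[t] | DE)
path = viterbi(init, trans, emit)
print(path.tolist())
```

Working with logarithms avoids numerical underflow when the number of time points or the dimension of the expression vectors is large.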
Microarray Time Series: Simulation Study
• 6 time points
• 2 treatments
• 1500 genes
• Proportion of DE at the first time point is 0.1
• Pr(DE|EE)=0.1
• Pr(DE|DE)=0.1, 0.5, 0.7
Microarray Time Series: Simulation Results
P(DE|DE)   Method   Time 1    Time 2    Time 3    Time 4    Time 5    Time 6
0.1        EB       55/60     57/59     66/67     82/101    93/107    98/115
           HMM      57/62     62/67     68/69     81/98     85/94     96/108
0.5        EB       95/106    92/102    123/142   126/136   139/151   137/145
           HMM      97/109    116/129   139/155   145/155   152/173   146/160
0.7        EB       63/75     123/132   165/191   203/230   158/173   172/184
           HMM      72/88     144/154   191/213   236/258   214/237   201/224
Microarray Time Series: EBarrays vs. HMM
• Differentially expressed genes identified by EBarrays based HMM
0 Hr   2 Hrs   6 Hrs   24 Hrs   48 Hrs   120 Hrs
0      9       0       138      333      475
• How many more genes identified:
0 Hr   2 Hrs   6 Hrs   24 Hrs   48 Hrs   120 Hrs
0      0       0       110      163      142
Microarray Time Series: Could it be so good ?
• Simulate data for the last two time points with parameters estimated from the HMM
• Performance comparison
Method   48 Hours   120 Hours
EB       259/269    374/390
HMM      302/321    469/498
Microarray Time Series: Comments
• Correlation over time does exist in most studies.
• Taking correlation over time into account can significantly improve the efficiency of methods to identify DE genes.
• HMMs provide a flexible way to model the correlation over time
• EBarrays based HMM is a useful option to analyze microarray time series data
• Technical Report coming soon...
EXPERIMENTAL DESIGN
Experimental Design Questions and Overview
What to print/spot on the array ?
How many pieces of one gene ?
Replicates of a gene ?
Housekeeping or other control spots ?
How to arrange spots/genes on the array ?
Spatial Bias
Print Tip Bias
cDNA ?
Affy (done)
cDNA (~done)
Affy (varies ~11)
Experimental Design Questions and Overview (continued)
What to hybridize onto the array ?
What is reference/control ?
How is labelling done ?
Should samples be pooled ?
cDNA ?
Affy (can be ?)
cDNA (Dye swaps ?)
Affy (done)
cDNA (?)
Affy (?)
Actual Design
The actual design considers questions previously stated, but also includes issues of replication.
How many arrays should one use ?
How should samples be allocated to arrays ?
Answers to these questions will depend on three sources of variation: Biological, Technical, and Measurement Error.
In addition, the goal of the experiment should affect its design !
Sources of Variation
Biological Variation: Subject to subject variation. Intrinsic to the organisms studied. This CAN NEVER be reduced, but its effect can perhaps be reduced (e.g. by pooling biological samples).
Technical Variation: Introduced during extraction, labelling, and hybridization. Quantified (estimated) by hybridizing multiple mRNA samples from the same individual to many arrays. Also called array to array variation (caution: so are other sources of variation).
Measurement Error: Introduced when reading signals. Measured within a single array. Multiple spots on one array can reduce the effect of measurement error.
Loop Design
Graphical representation of the loop design indicates
Which differences can be estimated
Precision of the estimates
For example,
A and B can only be compared if there is a path from A to B
Here, A and B can be compared directly or through C. The direct comparison (log(A/B)) is less variable than
log (A/B) = log (A/C) + log (C/B)
[Figure: loop design connecting A, B, and C.]
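The claim that the direct comparison is less variable than the one through C can be made concrete under a simple assumption: each array contributes a log ratio with independent error of the same variance σ², so an estimate built by adding log ratios along a path has variance equal to the number of arrays on the path times σ². The function name and the unit variance below are hypothetical, and biological variation is ignored.

```python
def path_variance(n_arrays_on_path, sigma2=1.0):
    """Variance of a log ratio obtained by summing independent array log ratios."""
    return n_arrays_on_path * sigma2

direct = path_variance(1)      # log(A/B) measured on one array
through_c = path_variance(2)   # log(A/C) + log(C/B)
print(direct, through_c)
```

The same bookkeeping applies to longer paths in larger loops, which is why some comparisons in a loop design are less precise than others.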
Graphical Representation of Loop Designs
[Figure: the simplest loop design (A, B, C), shown alongside repeated A-B array diagrams, one labelled 5.]
Comparison Among Sources of mRNA
Consider sources A, B, and C to be of interest
Dudoit and Speed method (mult t) 3 times (AB, AC, BC). This is not optimal, but will give adjusted p-values.
ANOVA (Kerr et al., Wolfinger et al.) Better to use all the data at once, but there is no accounting for multiple tests with these approaches. Rank ordering provided might be useful.
EBarrays (Newton et al., Kendziorski et al.) Can handle multiple conditions and accounts for multiple tests. Must specify patterns. Computational issues.
Evaluation of Designs
Each approach gives a gene specific summary score. The scores depend on biological, technical, and measurement error variability.
Different designs result in different allocations of the variance components.
Evaluation of designs is often done by considering the variability associated with the resulting gene expression estimate.
Evaluation of Designs
log (A/B) = log (A/R) - log (B/R)
Yang & Speed, NRG, 2002
Feasibility of Design III decreases as the number of conditions increases. With 6 samples, there are 15 pairwise comparisons.
Kerr and Churchill proposed loop designs. No longer strongly recommended (by many including Churchill). Some comparisons are less precise than others. Problems with robustness.
Notes on Designs
Compare A with D
log (A/D) = log (A/B) + log (B/C) + log (C/D)
log (A/D) = log (A/F) + log (F/E) + log (E/D)
What if arrays C and E are bad ?
A Closer Look at the Loop Design
[Figure: loop design connecting A, B, C, D, E, and F.]
Table 2 from Yang & Speed
Yang & Speed, NRG, 2002
Time Course Experiments
If main focus of the experiment is relative change between T2, T3, T4 and initial time point T1, then a reference design is good.
T1 T2 T3 T4
Here, all comparisons are made with equal efficiency.
If a stable reference is available (T1), this will allow comparisons to be made over a relatively long period of time.
Author note on Reference Design
Gary Churchill has traditionally argued against a standard reference design since almost half the measurements are made on the reference sample (which might be of little or no interest) and the variance can be increased relative to other designs.
However, in his NG Reviews article (December 2002), he notes the advantages:
Paths connecting 2 samples are 2 steps long.
Good way to handle comparisons across time.
Replication (Review and Note)
Technical Replicates: Yang and Speed also include measurement error here. They define such replicates as ones where target mRNA is from the same extraction (different than GC definition reviewed earlier).
Biological Replicates: mRNA samples from different individuals, different cell lines.
Where to spend resources ?
Gary Churchill (NG Reviews, December 2002):
Correlation from duplicate spots on one array (~95 %)
Same target to multiple arrays (~60-80 %)
Samples from individual inbred mice (~30 %)
Yang & Speed (NG Reviews, August 2002):
Technical replicates generally involve a smaller degree of variation in measurements than biological replicates.
Where to spend resources ?
Gary Churchill (NG Reviews, December 2002):
When measurement is expensive, it is preferable to add experimental units rather than technical replicates.
When the variability of measurements exceeds the variability between experimental units, technical replication can increase precision.
When variability between experimental samples is large and units are not too costly, it may be worthwhile to pool samples.
Summary
In short,
Identify goals of the experiment (which comparisons are most important ?)
Identify options
Calculate variability associated with all options
Research options to see how they work in practice !!
Choose design based on variability, feasibility, and cost.