
Statistical Methods for Microarrays

Christina Kendziorski

Landon Sego

Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison

BASIC BIOLOGY

Introduction to Basic Biology and Microarray Experiments

• What is a DNA microarray measuring?

Gene expression.

• The novelty of a microarray is that it quantifies the abundance of thousands of genes simultaneously—which gives biologists a global perspective.

• The biological processes that give rise to microarray data can be viewed as information transfer processes.

• Data collection for microarray experiments is not a trivial task and requires imaging technology and image processing tools.

Nguyen, et al. 2002

Review of DNA molecule

Nguyen, et al. 2002

The Central Dogma of Molecular Biology

Terminology

• Amino acid: The basic building block of proteins (or polypeptides)

• mRNA: Messenger RNA is an RNA strand complementary to a DNA template

• Transcription: The process where the DNA template is copied/transcribed to mRNA

• Gene expression: A gene is expressed if its DNA has been transcribed to RNA; gene expression is the level of transcription of the DNA of the gene

• RT: Reverse transcription is an experimental procedure to synthesize a DNA strand (cDNA) which is complementary to an mRNA template

Nguyen, et al. 2002

Terminology

• cDNA/cRNA: Complementary DNA is synthesized from mRNA during RT and, similarly, in the context of oligo arrays, complementary RNA is RNA synthesized during in vitro transcription

• dNTP: Deoxyribonucleoside triphosphate; denotes any of dATP, dCTP, dGTP, dTTP, or dUTP; molecular building blocks for making DNAs in RT, PCR, or in vitro replication; free dNTPs in solution (which are not yet incorporated into the nucleic acid strand) have three phosphates which provide the necessary energy for cDNA synthesis

Nguyen, et al. 2002

Terminology

• Primer: A short, single strand of RNA or DNA that can initiate chain growth from a template

• Oligo(dT): Primer with sequence TTTT… used to initiate cDNA synthesis during RT

• Reverse transcriptase: An enzyme that catalyzes the synthesis of cDNA during RT

• Poly(A) tail: A sequence of A (AAA …) at the 3' end of mRNA; oligo(dT) is used in RT to recognize mRNA by its poly(A) tail

• Target cDNAs: Mixture of cDNAs obtained from the experimental and reference mRNAs

Nguyen, et al. 2002

Terminology

• Probe cDNAs: Immobilized cDNA printed on the array

• Hybridization: Process of bringing into contact the target and probe for binding in microarrays; also refers to the binding of two DNA strands generally

• PCR: Polymerase chain reaction is a procedure to amplify (mass-replicate) a segment of DNA

• Oligonucleotide: A short fragment of DNA (usually in single-stranded form) which is often chemically synthesized. It can be used as a probe or primer. 'Oligo' is Greek for 'few'.

Nguyen, et al. 2002

The Central Dogma of Molecular Biology

Basic model for gene expression

• Two different levels of gene expression:

Transcription level—where RNA is made from DNA.

Translation level—where protein is made from mRNA.

• Microarrays measure gene expression at the transcription level.

Nguyen, et al. 2002

DNA → mRNA → amino acid → protein → cell phenotype → organism phenotype

(DNA → mRNA: transcription; mRNA → protein: translation)

Gene expression and abundance

• Quantification of mRNA abundances

Quantification of amount of gene expression

• Gene is expressed if its DNA has been transcribed to RNA

• A “high level of expression” would imply transcription has occurred many times and there are many copies of mRNA in the tissue.

• “low level of expression” implies fewer copies of mRNA

Nguyen, et al. 2002

DNA transcription

• DNA transcription is the information transfer process directly relevant to DNA microarray experiments because quantification of the type and amount of this copied information is the goal of the microarray experiment.

• Transcription occurs in 3 stages: initiation, elongation, and termination.

• After transcription, the mRNA is further processed by removing non-coding segments, called introns.

Nguyen, et al. 2002

DNA transcription - Initiation

http://www.brooklyn.cuny.edu/bc/ahp/BioInfo/graphics/Transcription.02.GIF

• Promoter regions on the DNA chain provide the signal for the initiation of transcription. Promoter regions recruit an enzyme (protein) called RNA polymerase II to the transcription initiation site.

DNA transcription - Elongation

http://www.brooklyn.cuny.edu/bc/ahp/BioInfo/graphics/Transcription.02.GIF

• During elongation, the RNA polymerase moves along the DNA and extends the RNA chain by adding free nucleotides with base A, G, C, or U to match the T, C, G, or A nucleotides of the DNA template strand, respectively.
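The pairing rule in the elongation step can be sketched in a few lines of Python. This is only an illustration of the base-by-base matching described above; the function name `transcribe` and the `PAIRING` table are hypothetical, and strand direction (5'/3') and all later processing are ignored.

```python
# Pairing rule from the slide: template bases T, C, G, A are matched
# by RNA bases A, G, C, U, respectively.
PAIRING = {"T": "A", "C": "G", "G": "C", "A": "U"}


def transcribe(template: str) -> str:
    """Return the RNA bases paired, position by position, with a DNA template strand."""
    return "".join(PAIRING[base] for base in template.upper())


print(transcribe("TACGGT"))  # AUGCCA
```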

DNA transcription – Termination and processing

• When the RNA polymerase reaches the template strand signal for termination, the newly synthesized RNA is released from the DNA template.

• Before the message is transported to the cytoplasm, some important posttranscriptional processing occurs.

• For example, a sequence of A’s is added to the RNA strand at the 3' end. This sequence of A’s is called the poly(A) tail.

• Non-coding regions of the mRNA (called introns) are removed in a process called splicing.

Nguyen, et al. 2002

DNA transcription and RNA processing

Nguyen, et al. 2002

One gene ≠ One protein

• Relationship between protein and mRNA is not one to one—so the simplified model shown on the previous slide is only an approximation.

• Exon: Coding region of DNA

• Intron: Non-coding region

Gene 1 Gene 2

template DNA strand

1 gaattccacattgtttgctgcacgttggattttgaaatgctagggaactttgggagactc
61 atatttctgggctagaggatctgtggaccacaagatctttttatgatgacagtagcaatg

...

421 gagctacaagggcctggtgcatccagggtgatctagtaattgc agaacagcaagtgct ag
481 ctctccctccccttccacagctctgggtgtgggagggggttgtccagcctccagcagcat
541 ggggagggccttggtcagcctctgggtgccagcagggcaggggcggagtcctggggaatg
601 aaggttttatagggctcctgggggaggctccccagccccaagcttaccacctgcacccgg
661 agagctgtgtcaccatgtgggtcccggttgtcttcctcaccctgtccgtgacgtggattg
721 gtgagaggggccatggttggggggatgcaggagagggagccagccctgactgtcaagctg
781 aggctctttcccccccaacccagcaccccagcccagacagggagctgggctcttttctgt

...

6301 cctagagaaggctgtgagccaaggagggagggtcttcctttggcatgggatggggatgaa
6361 gtaaggagagggactggaccccctggaagctgattcactatggggggaggtgtattgaag
6421 tcctccagacaaccctcagatttgatgatttcctagtagaactcacagaaataaagagct
6481 cttatactgt

Success Story: KLK3 (PSA Gene)

Success Story

Dhanasekaran et al., Nature, 2001

GETTING EXPRESSION MEASUREMENTS:

cDNA ARRAYS and AFFY CHIPS

cDNA Microarray Experimental Procedure: Overview

• Given a biological sample of cells, there are two possibilities: a set of genes is either expressed in the cells or it is not.

• cDNA arrays are designed to measure gene expression in an experimental sample relative to a reference (control) sample.

• In current practice, cDNAs from the experimental and reference samples are labeled with different fluorescent dyes, mixed, and hybridized onto the array.

• The measured fluorescence intensity for each sample is assumed to be proportional to transcript abundance (of course, this is conditional on factors such as spot characteristics, hybridization efficiency, level of dye incorporation, etc.).

Nguyen, et al. 2002

cDNA Array Data

[Figure: diagram with labels Tissue Sample, mRNA, Labelled cDNA, Microarray, and Data; nominal levels 1 and 2]

cDNA Microarray Experimental Procedure

1. Fabrication of array: preparing glass slide, selecting probe DNA sequences, and depositing (printing) the probe cDNA onto the slide.

2. Sample preparation: Isolating total RNA (mRNA and other RNAs) from experimental and reference samples of interest.

3. cDNA synthesis and labeling: making cDNAs from the experimental and reference samples and labeling each sample with a fluorescent dye.

4. Hybridization: applying experimental and reference cDNA mixture to the array, letting the target and probe cDNA bind, then washing off the excess.

5. Data collection: measurement of fluorescent intensities using a confocal microscope.

Nguyen, et al. 2002

Fabrication of the array

• What cDNA sequences (probes) should be printed on the array?

• Ideally, all genes would be printed. But in most cases we don’t know the sequences of all genes.

• cDNA libraries (GenBank, UniGene etc.) can be used to select the cDNA sequences.

• Sometimes cDNAs may be spotted (printed) onto the array without knowing what gene the cDNA corresponds to.

• Sometimes a piece of an expressed gene, known as an expressed sequence tag (EST), can be spotted onto the array.

Nguyen, et al. 2002


The MGuide, Version 2.0: The Brown Lab's complete guide to microarraying for the molecular biologist.

http://cmgm.stanford.edu/pbrown/mguide


• The array itself is a pre-treated glass slide to which the cDNA probes will be attached.

• The selected cDNA sequences are amplified (mass-replicated) using PCR.

• After amplification, the solution containing the amplified cDNA probes is deposited on the array using a set of microspotting pins.

• Ideally, the amount of solution deposited by the pins should be uniform, but this is not completely achieved in practice:

– Glass slides and/or treatment are not uniform.

– There are pin effects.

– Spots are not uniform.

Nguyen, et al. 2002


• The drops of solution containing the cDNA probes form the spots on the array, each spot corresponding to a gene, an EST, or another sequence.

• As a product of PCR, the cDNA probes that are spotted onto the array are double stranded.

• When the target cDNA is applied to the array, the double stranded probes are denatured (separated) in a heating process to allow the target cDNA to bind to the probe strands.

Nguyen, et al. 2002

Sample preparation

• mRNA is extracted from tissue samples. For example:

– one sample from tumor tissue (experimental sample)

– one sample from normal tissue (reference sample).

• During the sample preparation process, operator differences and heterogeneous tissue can contribute significantly to the variability.

• However, these factors are not normally considered in microarray studies.

Nguyen, et al. 2002

Synthesis, labeling, and hybridization

http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html

Simplified summary of cDNA synthesis and labeling

• RNA is isolated from experimental and reference cell pools.

• Free nucleotides (dNTPs), oligo(dT), and reverse transcriptase are added to the solution of total RNA to initiate cDNA synthesis.

• Fluorescent dye molecules (labels) are incorporated into the cDNA.

• Typically, the cDNA from the experimental sample is dyed red (Cy5) and cDNA from the reference sample is dyed green (Cy3).

Nguyen, et al. 2002

Dye incorporation

• There are several methods for adding the labels to the cDNA:

– Direct incorporation labeling method

– Amino-modified (amino-allyl) nucleotide method

– Primer tagging method

• The gene expression measurement is affected by the labeling method used.

Nguyen, et al. 2002

Direct Incorporation Method

• The following ingredients are added to the solution of total RNA to initiate cDNA synthesis:

1. Oligo(dT)

2. Reverse transcriptase

3. Free nucleotides: dATP, dCTP, dGTP, and dTTP

4. Labeled uracil nucleotides: dUTPs with a dye molecule attached. Cy5-dUTP (red) is added to the experimental solution and Cy3-dUTP (green) is added to the reference solution.

• Experimental and reference solutions are mixed and hybridized onto the array.

Nguyen, et al. 2002

Amino Modified Nucleotide Method

• The following ingredients are added to the solution of total RNA to initiate cDNA synthesis:

1. Oligo(dT)

2. Reverse transcriptase

3. Free nucleotides: dATP, dCTP, dGTP, and dTTP

4. Modified amino-allyl dUTPs are added to both experimental and reference solutions.

• After cDNA synthesis, Cy5 and Cy3 are added to the experimental and reference solutions, respectively. The dye couples with the amino-modified dUTP nucleotides.

• Experimental and reference solutions are mixed and hybridized onto the array.

Nguyen, et al. 2002

Primer Tagging Method

• The following ingredients are added to the solution of total RNA to initiate cDNA synthesis:

1. Reverse transcriptase

2. Free (unlabeled and unmodified) nucleotides: dATP, dCTP, dGTP and dTTP

3. Oligo(dT) primer with capture sequence TTTT---- for experimental sample and capture sequence TTTT+++ for reference sample.

• After cDNA synthesis, experimental and reference solutions are mixed and hybridized onto the array.

• After washing, the array is incubated with Cy5- and Cy3-labeled molecules called dendrimers. The dendrimers attach to the corresponding capture sequences.

Nguyen, et al. 2002

Comparing the dye incorporation methods

• In all three methods, the red and green dyes may incorporate unequally. Likewise, spectral overlap can occur between the fluorescence of the two dyes.

• In the direct incorporation method, the Cy3-dUTP and Cy5-dUTP molecules exhibit some steric hindrance, which contributes to inefficient and nonuniform incorporation of the dye into the cDNA.

• In the amino-modified method, the amino-allyl is a smaller molecule with less steric hindrance, so the amino-modified dUTPs are incorporated into the cDNAs more uniformly and with higher frequency than in the direct incorporation method.

Nguyen, et al. 2002

Comparing the dye incorporation methods

• In both the direct incorporation and amino-modified methods, the abundance of labeled uracil nucleotides is influenced by the composition as well as the length of the cDNA strand.

• The resulting fluorescent intensity depends on the abundance of uracil nucleotides that were incorporated into the cDNA strand.

• The primer tagging method attempts to correct this problem by attaching one dendrimer to each cDNA strand. Each dendrimer contains approximately 250 fluorescent Cy5 or Cy3 molecules. Hence there is approximately one intensity signal per cDNA molecule.

Nguyen, et al. 2002

cDNA Microarray Experimental Procedure: Hybridization

• The solutions containing the experimental and reference labeled cDNAs are mixed and applied to the array, which contains the probe cDNAs in each spot.

• Target and probe sequences bind by base pairing (hybridization). Note that binding can occur between sequences that are similar but not identical (cross-hybridization).

• After sufficient time is allowed for hybridization, the array goes through a series of washes to eliminate all unbound target cDNAs and solution.

• The washing procedure must be stringent enough to remove all extraneous material but at the same time not remove the bound cDNAs—the signals of interest.

Nguyen, et al. 2002

cDNA Microarray Experimental Procedure: Concepts

• Consider one spot on the array. This spot contains cDNA probes for a gene of interest, say gene A.

• If there are target cDNAs in the mixed solution complementary to the probe cDNAs of gene A, they should bind together by base pairing (hybridization).

• If, for example, gene A is expressed in both the experimental and reference samples, we expect cDNA from both samples to bind with the probe cDNA, and this spot will then show both red and green fluorescence.

Nguyen, et al. 2002

cDNA Microarray Experimental Procedure: Data Collection

• After the slides are prepared and the hybridization step is complete, the expression level of each gene is measured.

• The expression levels of a gene in the experimental or reference cells are measured by the spot intensities of the fluorescent dyes. We assume that a spot of high fluorescence indicates high expression of the corresponding gene.

• The array is scanned using a confocal laser microscope. Images of each spot on the array are produced, processed, and analyzed to measure the expression of each gene.

Nguyen, et al. 2002


Nguyen, et al. 2002

cDNA Microarray Experimental Procedure – Image quality

• There are a number of factors that influence the image quality, such as noise fluorescence (fluorescence from non-dye sources), pollution of the fluorescent signal, photo-bleaching, etc.

• Raw data consists of two images, one image obtained from the red channel and one from the green channel.

• Which pixels in the target area represent signal, and which represent background? Pixels must be categorized as one or the other.

• A measurement is made of the intensity of the fluorescence for the spot and the intensity of the noise fluorescence from the background.

Nguyen, et al. 2002

cDNA Microarray – Quantifying gene expression

Cy5 (red) signal intensities: $I^{R}_{nm} = [x^{R}_{ij}]$

Cy5 (red) background intensities: $B^{R}_{nm} = [b^{R}_{ij}]$

for i = 1,…,n samples (columns 1, 2, …, n), j = 1,…,m genes (rows 1, 2, …, m)

Notation for Cy3 (green) intensities is the same with the exception of a G superscript.


• Many analyses are based on background corrected intensities using the measurements

$r_{ij} = x^{R}_{ij} - b^{R}_{ij}$ and $g_{ij} = x^{G}_{ij} - b^{G}_{ij}$

• Intensity ratios are also used:

$r_{ij} / g_{ij}$ (represents the abundance of gene expression in the experimental sample relative to the reference sample)

• Negative intensities (where the background is stronger than the signal) can occur; these issues will be discussed later.
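As a small worked illustration of the quantities above, the sketch below computes the background-corrected red and green intensities for one spot and their ratio. The function names are hypothetical, and returning `None` when a corrected intensity is not positive is just one possible convention for the negative-intensity case, not something prescribed by the slides.

```python
def corrected(signal, background):
    """Background-corrected intensity, e.g. r = x^R - b^R."""
    return signal - background


def intensity_ratio(x_r, b_r, x_g, b_g):
    """Ratio r/g for one spot, or None if either corrected intensity is not positive."""
    r = corrected(x_r, b_r)
    g = corrected(x_g, b_g)
    if r <= 0 or g <= 0:
        # Assumption: flag the spot rather than report a meaningless ratio.
        return None
    return r / g


print(intensity_ratio(900, 100, 300, 100))  # 4.0
print(intensity_ratio(80, 100, 300, 100))   # None
```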


Multiple sources of variability: biological and technical

• The multiple experimental and biological processes in the cDNA microarray procedure each contribute to the overall variability.

• For example, a significant amount of variability may be introduced during the selection of the tissue samples.

• A significant amount of variability also arises during measurement and image processing.

• It is not easy to identify which portions of the microarray procedure are contributing most to the overall variability—and often times the variability at a given step of the procedure is poorly understood.

Recap: cDNA Array Data

[Figure: diagram with labels Tissue Sample, mRNA, Labelled cDNA, Microarray, and Data; nominal levels 1 and 2]

Affymetrix Chip Experimental Procedure

• Affymetrix is another type of commonly used microarray technology.

• Rather than attach the pre-synthesized cDNA probes to the chip, oligonucleotides are chemically synthesized (grown) directly on the chip.

• An oligonucleotide is a short fragment of DNA (usually in single-stranded form) which is often chemically synthesized. It can be used as a probe or primer. 'Oligo' is Greek for 'few'.

• The oligonucleotides are synthesized to match known gene sequences. The process of synthesizing the oligonucleotides conceptually resembles semi-conductor fabrication by using masks, light exposure, and deposition of nucleotides.


• Each gene or EST (expressed sequence tag) is represented on the array by 11-20 features. Each feature consists of an oligonucleotide that is a perfect match (PM) to a segment of a gene.

• For each PM, there is a corresponding oligo that is identical to the PM except for a single mismatch (MM) at the central base of the oligonucleotide.

Nguyen, et al. 2002

Affymetrix – Identifying gene expression

• Only one tissue sample is applied to each Affymetrix chip.

• The Affymetrix chip currently has 11 PM features for each gene. These 11 PM features serve as unique sequence detectors and the corresponding 11 MM features serve as controls.

• Under relatively ideal conditions, when the gene is expressed in the cell sample, high intensity is expected for the PM feature and low intensity for the MM feature.

• It is assumed that differences observed between the PM and MM feature intensities are due to hybridization kinetics of the different feature sequences and nonspecific background RNA hybridizations.

Nguyen, et al. 2002

Affymetrix – Sample labeling

• Affymetrix uses a one-color detection scheme (one sample on one array).

• Targets are biotin labeled cRNA, rather than cDNA.

• Double stranded cDNA is synthesized using RT. Then the targets, biotinylated cRNA, are synthesized using in vitro transcription.

• Biotinylated cRNA are cRNA that have biotin molecules attached to them.

• The biotinylated cRNA is fragmented (to reduce segment length) and hybridized to the array.

• The slide is washed and fluorescent dye is applied. The dye couples with the biotin on the cRNA.

Nguyen, et al. 2002

Recap: Affymetrix Chip Experimental Procedure

Affymetrix – Advantages over cDNA Microarrays

• With the one-dye system, unequal incorporation between dyes is not an issue—nor is spectral overlap between dyes.

• The reference sample is no longer needed. This reduces the required biological materials needed for experiments.

• The possibility of genomic DNA being labeled during RT is avoided because the modified biotinylated nucleotides are incorporated during IVT.

• Fragmentation of the target cRNAs ensures that most target lengths are within a reasonable range, thus avoiding target folding.

Nguyen, et al. 2002

Affymetrix – Image processing

• Affymetrix software uses a gridding procedure to locate the features on the array.

• Each feature consists of about 64 pixels.

• The features are scanned and the intensity value for a feature is computed as the 75th percentile of the intensities for the pixels in that feature (excluding the boundary pixels).

• Signal intensities are corrected for background noise intensities.

Nguyen, et al. 2002
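The feature summary described above (75th percentile of the non-boundary pixels) can be sketched as follows. This is an illustration only: the function name is hypothetical, the feature is assumed to be a square grid of pixel values, and the linear-interpolation percentile from Python's `statistics.quantiles` is an assumption, since the slides do not say how the percentile is computed.

```python
import statistics


def feature_intensity(pixels):
    """75th percentile of a square feature's pixels, excluding the boundary pixels."""
    interior = [value for row in pixels[1:-1] for value in row[1:-1]]
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is the 75th percentile.
    return statistics.quantiles(interior, n=4, method="inclusive")[2]


# An 8 x 8 feature (64 pixels): bright boundary pixels, interior values 1..36.
feature = [[1000] * 8 for _ in range(8)]
for k in range(36):
    feature[1 + k // 6][1 + k % 6] = k + 1

print(feature_intensity(feature))  # 27.25
```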

Affymetrix – Quantifying gene expression levels

• Recall that each gene is represented by k = 1,…,K features, each feature having a pair of perfect match and mismatch intensity measurements (PMk, MMk).

• Average Difference (AvDiff) method, for each gene:

1. Calculate the difference for each feature: dk = PMk - MMk

2. Remove all dk that lie more than 3 SD from the trimmed mean (trimmed mean calculated by excluding the largest and smallest dk)

3. Take the average of the remaining dk

• Most reported results from Affymetrix arrays are based on analyses that use AvDiff but with various ways to filter the outliers.

Nguyen, et al. 2002
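The AvDiff steps can be sketched as below. This is a minimal illustration, not Affymetrix's implementation: the function name `avdiff` is hypothetical, and it assumes the SD is computed from the same trimmed set (largest and smallest dk excluded) as the trimmed mean, which the slide does not state. It needs at least four features per gene.

```python
import statistics


def avdiff(pm, mm, n_sd=3.0):
    """Average difference for one gene from paired PM and MM intensities."""
    # Step 1: per-feature differences d_k = PM_k - MM_k.
    d = [p - m for p, m in zip(pm, mm)]
    # Trimmed mean and SD: exclude the largest and smallest d_k (SD choice is an assumption).
    trimmed = sorted(d)[1:-1]
    center = statistics.mean(trimmed)
    spread = statistics.stdev(trimmed)
    # Step 2: drop every d_k more than n_sd SDs away from the trimmed mean.
    kept = [x for x in d if abs(x - center) <= n_sd * spread]
    # Step 3: average what remains.
    return statistics.mean(kept)


mm = [100] * 11
pm = [110, 112, 111, 109, 110, 113, 111, 110, 112, 111, 600]  # last feature is an outlier
print(avdiff(pm, mm))  # 10.9
```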


• As another way to measure gene expression, Efron et al. (2001) investigated

avg{dk = log(PMk) – c log(MMk), k = 1,…,K}

for various scale factors c.

• Yet another approach to measure gene expression:

$\text{Signal} = \exp\left\{ \frac{1}{K} \sum_{k=1}^{K} w_k \log(PM_k - CT_k) \right\}$

where CTk is the change threshold and wk is a weight.

• Naef et al. note that the information content of MM features is not clear; they proposed expression indexes using only the PM features.
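The two log-scale summaries above can be written out directly. In this sketch the function names are hypothetical, and the weights wk and change thresholds CTk are simply taken as inputs, because the rules that produce them are not given on the slides; every PMk is assumed to exceed its CTk so the logarithm is defined.

```python
import math


def efron_index(pm, mm, c):
    """avg{ log(PM_k) - c * log(MM_k) } over the K features of one gene."""
    return sum(math.log(p) - c * math.log(m) for p, m in zip(pm, mm)) / len(pm)


def signal(pm, ct, w):
    """exp{ (1/K) * sum_k w_k * log(PM_k - CT_k) }; requires PM_k > CT_k."""
    k = len(pm)
    return math.exp(sum(wk * math.log(p - t) for p, t, wk in zip(pm, ct, w)) / k)


# With unit weights the signal is the geometric mean of PM_k - CT_k.
print(signal([110, 1010], [10, 10], [1, 1]))  # about 316.23
```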


From Statistical Algorithms Reference Guide, Affymetrix, 2002:

“When the mismatch intensity is lower than the perfect match intensity, then the mismatch is informative and provides an estimate of stray signal. Rules are employed to ensure that negative signal values are not calculated. Negative values do not make physiological sense, and make further data processing, such as log transformations, difficult.”

PROCESSING IMAGE FILES TO GIVE ROBUST ESTIMATES OF INTENSITY


Cy5 (red) signal intensities: $I^{R}_{nm} = [x^{R}_{ij}]$

Cy5 (red) background intensities: $B^{R}_{nm} = [b^{R}_{ij}]$

for i = 1,…,n samples (columns 1, 2, …, n), j = 1,…,m genes (rows 1, 2, …, m)

Notation for Cy3 (green) intensities is the same with the exception of a G superscript.


• Many analyses are based on background corrected intensities using the measurements

$r_{ij} = x^{R}_{ij} - b^{R}_{ij}$ and $g_{ij} = x^{G}_{ij} - b^{G}_{ij}$

• Intensity ratio is of interest: $r_{ij} / g_{ij}$

• Represents the abundance of gene expression in experimental sample relative to reference sample

cDNA Microarray

http://www.beatson.gla.ac.uk/infrastructure.htm

Microarray technology. Figure showing part of a 10,000 element cDNA microarray hybridised with a cancer cell line RNA labelled with the fluorescent marker, Cy3 (green) and a control, labelled with Cy5 (red). This identifies genes with differential expression. (Dr N.I. Barr)

Considerations when obtaining data

One array: Assigning coordinates to each spot (and measuring distances between spots)

One spot A: Which pixels to use as signal and background?

One spot B: Given pixel information, how to assign a summary foreground and background value? How do you combine the two to give an intensity estimate?

Considerations when obtaining data

Addressing or gridding: Assign coordinates to each spot. Obtain foreground gridlines and background grid lines.

Segmentation: Classify pixels as foreground or background.

Intensity Extraction: Given pixel information, calculate summary measures for foreground and background.

Background Correction: Correct foreground for background to give estimate of intensity.

Addressing or gridding

[Figure: one grid of spots, with dimensions labeled 21 and 19]

Format of array is known:

• Distance between rows and columns of grids

• Translation of grid

• Distance between rows and columns of spots within each grid

• Translation of spots

• Overall position of array in the image

• Rotation of array

Addressing or gridding

• Addressing records all of this information.

• Important information for use later is foreground grid lines and background grid lines.

• Foreground grid lines (fgl) show spot locations

• Background grid lines (bgl) separate spots

Segmentation

Fixed Circle Segmentation: Fix diameter for every spot and draw circles of this diameter.

Adaptive Circle Segmentation: Allow diameters to change from spot to spot. Takes a long time.

Adaptive Shape Segmentation

Histogram Segmentation

Adaptive Shape Segmentation (SRG)

SRG = seeded region growing

1. Start with collection of background and foreground pixels (seeds).

2. Identify neighboring pixels for each collection and calculate the mean intensity for background neighbors and foreground neighbors.

3. Identify pixel in foreground neighbors that is closest to foreground mean and include it in collection. Do the same for background.

4. Continue.
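The steps above can be sketched with a priority queue. This is a simplified illustration of seeded region growing, not the algorithm as implemented in Spot: the function name and the `"fg"`/`"bg"` labels are hypothetical, 4-connected neighbors are assumed, and a pixel's distance to a region mean is computed when the pixel is queued rather than refreshed as the mean changes.

```python
import heapq


def seeded_region_growing(image, seeds):
    """image: list of rows of intensities; seeds: {(row, col): label}. Returns a label grid."""
    rows, cols = len(image), len(image[0])
    labels = [[None] * cols for _ in range(rows)]
    total, count = {}, {}
    # Step 1: start each region from its seed pixels.
    for (r, c), lab in seeds.items():
        labels[r][c] = lab
        total[lab] = total.get(lab, 0.0) + image[r][c]
        count[lab] = count.get(lab, 0) + 1

    heap = []

    def push_neighbors(r, c, lab):
        # Step 2: queue unlabeled neighbors by distance to the region's current mean.
        mean = total[lab] / count[lab]
        for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc] is None:
                heapq.heappush(heap, (abs(image[rr][cc] - mean), rr, cc, lab))

    for (r, c), lab in seeds.items():
        push_neighbors(r, c, lab)

    # Steps 3-4: repeatedly absorb the neighbor closest to its region mean.
    while heap:
        _, r, c, lab = heapq.heappop(heap)
        if labels[r][c] is not None:
            continue
        labels[r][c] = lab
        total[lab] += image[r][c]
        count[lab] += 1
        push_neighbors(r, c, lab)
    return labels


# 5 x 5 target area: a bright 3 x 3 spot (100) on a dim background (10).
image = [[100 if 1 <= r <= 3 and 1 <= c <= 3 else 10 for c in range(5)] for r in range(5)]
result = seeded_region_growing(image, {(2, 2): "fg", (0, 0): "bg"})
print(result[1][1], result[0][4])  # fg bg
```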

Selection of foreground and background seeds

http://stat-www.berkeley.edu/users/terry/zarray/Talks/image/image102000.ppt

• Foreground seeds are chosen by finding the maximum of the combined intensity surface over a small region centered within the square (single point within the square).

• Background seeds are constructed as crosses based on the fitted background grid.


• Chen, et al.: target mask is chosen larger than the spot.

• Take 8 random samples from the patch.

• Take the lowest 8 from the mask.

• Apply the Wilcoxon rank-sum (WRS) test to the two samples.

• Reject: signal is defined to be the 8 values from the mask and all pixels in the mask with intensities ≥ the smallest of the 8.

• Do not reject: repeat with some number of the 8 mask values replaced with pixels of higher intensity.

[Figure labels: target site, target mask, target patch]


Plot histogram of pixels in mask:

• Average of pixels between the 5th and 20th percentiles used as background

• Average of pixels between the 80th and 95th percentiles used as foreground

Advantage: Simple

Disadvantage: Large mask might include other spots
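A minimal sketch of the histogram method follows, assuming the percentile bands are taken by rank position in the sorted pixel values of the mask. The function names are hypothetical and this is not QuantArray's actual implementation; the mask needs enough pixels for each band to be non-empty.

```python
import statistics


def percentile_band_mean(values, lo_pct, hi_pct):
    """Mean of the sorted values whose rank lies between two percentiles."""
    data = sorted(values)
    n = len(data)
    lo = n * lo_pct // 100
    hi = n * hi_pct // 100
    return statistics.mean(data[lo:hi])


def histogram_segmentation(mask_pixels):
    """Return (foreground, background) summaries for the pixels in one target mask."""
    background = percentile_band_mean(mask_pixels, 5, 20)
    foreground = percentile_band_mean(mask_pixels, 80, 95)
    return foreground, background


pixels = list(range(1, 101))  # toy mask with intensities 1..100
print(histogram_segmentation(pixels))  # (88, 13)
```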

Segmentation

Methods: Software

Fixed Circle: ScanAlyze, GenePix, QuantArray

Adaptive Circle: GenePix

Adaptive Shape: Spot, SRG & one other method

Histogram: QuantArray

Histogram method and Chen, et al. method give you summary measures of pixel intensities. Other methods simply divide the pixels into foreground and background.

Different background adjustment methods


• Region inside red circle represents the spot mask.

• Local background calculation by different methods:

Green: used in QuantArray;

Blue: used in ScanAlyze;

Pink: used in Spot.

Different background adjustment methods

• Histogram based techniques give foreground and background measures directly.

• Other methods simply divide information into foreground pixels and background pixels and you need to perform the calculations.

• Some give summary measures of foreground and background intensity.

Estimating background

ScanAlyze: Median of all pixels outside spot mask (circle) that are within square centered at spot center.

QuantArray: Median of all pixels in concentric circles outside of (and some distance from) spot mask.

GenePix: Median of all pixels in “valleys” surrounding spot mask.

Spot: Option implemented in GenePix and other options that utilize information (e.g. morphological opening) from the entire array to give spot specific adjustment.

Morphological opening:

• Image with spot intensities removed is estimated. This provides estimate of background for the entire slide.

• Background is value of this image at spot center.
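Morphological opening can be illustrated with a plain-Python sketch: an erosion (local minimum) followed by a dilation (local maximum) over a square window larger than a spot removes the spots and leaves a background image. The function names and the edge handling (windows clipped at the image border) are assumptions made for this example, not the Spot implementation.

```python
def _window_filter(image, half, reduce_fn):
    """Apply reduce_fn (min or max) over a (2*half+1)-wide square window, clipped at edges."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [image[rr][cc]
                      for rr in range(max(0, r - half), min(rows, r + half + 1))
                      for cc in range(max(0, c - half), min(cols, c + half + 1))]
            out_row.append(reduce_fn(window))
        out.append(out_row)
    return out


def morphological_opening(image, half):
    """Erosion followed by dilation; the window must be larger than a spot."""
    return _window_filter(_window_filter(image, half, min), half, max)


# 9 x 9 image: background 10 with one 3 x 3 spot of intensity 200 in the center.
image = [[200 if 3 <= r <= 5 and 3 <= c <= 5 else 10 for c in range(9)] for r in range(9)]
background_image = morphological_opening(image, half=2)  # 5 x 5 window
print(background_image[4][4])  # 10: background estimate at the spot center
```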

Comparison of methods considers different background adjustments

• Methods considered are implemented in the packages. Broadly classified into 4 categories:

• Unless otherwise specified, spot foreground intensities are calculated by taking the mean intensity of the pixels within the spot mask.

1. Local adjustment: use median intensity of pixels in region just outside spot mask

2. Morphological opening

3. Constant background

4. No adjustment

Description of image analysis methods

QA.fix.nbg

Software: QuantArray.

Segmentation: Spot intensity is the mean of pixel values between the 45th and 85th percentiles within a fixed circle of 9 pixels in diameter.

Background: None.

QA.hist.nbg

Software: QuantArray.

Segmentation: Spot intensity is the mean of pixel values between the 80th and 95th percentiles of an 11-by-11 pixel square.

Background: None.

GP

Software: GenePix.

Segmentation: Proprietary algorithm that results in adaptively sized circles.

Background: Median from “valley of spot”.


Description of image analysis methods

SA

Software: ScanAlyze.

Segmentation: Fixed circles, 10 pixels in diameter.

Background: Median value in local square region.

S.morph

Software: Spot.

Segmentation: Seeded region growing.

Background: Based on morphological opening. The structuring element is a square region with sides of length 2.5 times the approximate spot to spot separation.


Comparing image analysis methods

Foreground intensities

Background intensities



Some observations:

• Higher intensities show tighter correlation.

• Background estimates: low correlation implies very little useful information.

• SA local median has smallest variability followed by S.morph and GP. QA highly variable. Very high QA values come from concentric circle method and probably mean other spots are being included.

• S.morph lowers background.

Comparing image analysis methods

[Figure: panels (A) and (B), each plotting Foreground – Background (vertical axis) against Background (horizontal axis)]

(A) Morphological background adjustment method of Spot (S.morph)

(B) QA.fix (often gives increasing BG estimates)

Only values from the lower half of the foreground intensity distribution are displayed.


Data

8 apo AI knockouts (Cy5)

1 Reference (Cy3)

8 Normals (Cy5)

1 Reference (Cy3)

Ref: pooled cDNA from 8 normals

6384 probes

257 (~4%) known to be related to lipid metabolism

From Callow, et al. 2000, Dudoit, et al. 2002

Comparison of t-denominators of different methods


Comparison of the t-denominators (estimating between slide variability) for different image analysis methods in the apo AI experiment.

Comparison of t-values of different methods


Gap between p-values for the 8 known DE genes is largest for SA, S.morph, and S.const.

Also labeled: S.nbg and S.valley


• Morphological opening provides lower estimates of background than other methods.

─ M.O. estimates are less variable than other approaches.

─ Accuracy (assessed by finding known DE genes) was not compromised.

• In terms of finding DE genes,

Spot ScanAlyze

GenePix QuantArray


• Choice of intensity estimation method has larger impact on log intensity ratios than segmentation method.

• Means or medians over large neighborhoods can be noisy.

• No background adjustment results in decreased ability to find DE genes.

• Recommend morphological opening method.*

* No comments on false positives or false negatives

Considerations following cDNA array intensity estimation

• Dye bias

• Print tip effects

• Spatial effects

• Array effects

Normalization attempts to minimize the effect of these systematic variations, making substantive differences easier to find.

Simple Problem

One array (A1) is brighter than a second array (A2) and you would like to compare the two.

[Figure: two panels comparing $\log_2(R/G)$ for array 1 and array 2]

Simple solution

• Scale intensities to have the same mean or median.

• Problems with this?

• Assume “shift effect” is constant across array.

• Doesn’t account for spatial effects.

Yang, et al., 2002

$\log_2(R/G)^{*} = \log_2(R/G) - c$

Methods?

$\log_2(R/G) \rightarrow \log_2(R/G) - c = \log_2\{R/(kG)\}$

Standard Practice (in most software)

c is a constant such that normalized log-ratios have zero mean or median.

Our Preference:

c is a function of overall spot intensity and print-tip-group.

What genes to use?

• All genes on the array

• Constantly expressed genes (house keeping)

• Controls

– Spiked controls (e.g. plant genes)

– Genomic DNA titration series

• Other set of genes

Within-slide normalization

http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html

M vs. A

$M = \log_2(R/G)$

$A = \log_2\sqrt{RG}$


• Assumption: Changes roughly symmetric

• First panel: smooth density of log2G and log2R.

• Second panel: M vs. A plot with median set to zero
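The M vs. A transformation and the constant (median) normalization described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `ma_values` and `median_normalize` and the simulated intensities are hypothetical and do not come from any microarray package.

```python
import math
import random
import statistics

def ma_values(R, G):
    """Return lists (M, A) from red and green spot intensities (all > 0)."""
    M = [math.log2(r) - math.log2(g) for r, g in zip(R, G)]          # log-ratio
    A = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(R, G)]  # mean log-intensity
    return M, A

def median_normalize(M):
    """Global normalization: subtract a constant c so that median(M) is zero."""
    c = statistics.median(M)
    return [m - c for m in M], c

# Toy data (hypothetical): the red channel is uniformly about twice too bright.
random.seed(0)
G = [random.lognormvariate(8.0, 1.0) for _ in range(1000)]
R = [2.0 * g * random.lognormvariate(0.0, 0.2) for g in G]
M, A = ma_values(R, G)
M_norm, c = median_normalize(M)
print(round(c, 2), round(statistics.median(M_norm), 6))
```

Because a single constant is subtracted, this removes an overall dye imbalance but cannot remove a bias that changes with A.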

Normalization - Median


Normalization – lowess


• Global lowess

• Assumption: changes roughly symmetric at all intensities.

Build the smooth function S(x) pointwise:

1. Take a point, x0. Find the K nearest neighbors of x0, N(x0). The number of neighbors K is chosen by the user as a percentage of the total number of points (they use 40%).

2. Calculate $\Delta(x_0) = \max_{x \in N(x_0)} |x_0 - x|$.

3. Assign weights to the points in N(x0).

4. Calculate the weighted least squares fit of y on N(x0). Take $\hat{y}_0 = S(x_0)$.

5. Repeat . . .

Recall:

$M = \log_2(R/G)$, $A = \log_2\sqrt{RG}$

Instead of

$\log_2(R/G)^{*} = \log_2(R/G) - c$

Use

$\log_2(R/G)^{*} = \log_2(R/G) - c(A)$

Where c(·) is the smooth curve through the M-A plot (lowess fit to the M-A plot)

Yang, et al.
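The pointwise construction of the smooth curve and the intensity-dependent correction M − c(A) can be sketched as follows. This is a simplified illustration, not the lowess routine used by Yang et al.: the tricube weights are an assumed (conventional) choice, no robustness iterations are performed, and the names `lowess_fit`, `lowess_normalize` and `ls_slope` are hypothetical.

```python
import math
import random

def lowess_fit(x, y, frac=0.4):
    """Evaluate a local-linear smooth S(x_i) at every x_i (no robustness steps)."""
    n = len(x)
    k = max(2, int(math.ceil(frac * n)))           # K nearest neighbours
    fitted = []
    for x0 in x:
        # Step 1: indices of the K points closest to x0.
        nbrs = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        # Step 2: largest distance inside the neighbourhood.
        delta = max(abs(x[i] - x0) for i in nbrs)
        # Step 3: tricube weights (assumed choice; the slides do not name one).
        w = [(1.0 - (abs(x[i] - x0) / delta) ** 3) ** 3 if delta > 0 else 1.0
             for i in nbrs]
        # Step 4: weighted least squares line, evaluated at x0.
        sw = sum(w)
        mx = sum(wi * x[i] for wi, i in zip(w, nbrs)) / sw
        my = sum(wi * y[i] for wi, i in zip(w, nbrs)) / sw
        sxx = sum(wi * (x[i] - mx) ** 2 for wi, i in zip(w, nbrs))
        sxy = sum(wi * (x[i] - mx) * (y[i] - my) for wi, i in zip(w, nbrs))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(M, A, frac=0.4):
    """Return M - c(A), where c is the lowess fit to the M-A plot."""
    c = lowess_fit(A, M, frac)
    return [m - ci for m, ci in zip(M, c)]

def ls_slope(x, y):
    """Ordinary least squares slope of y on x (used only to check the result)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

# Toy data (hypothetical): a dye bias that grows linearly with intensity A.
random.seed(1)
A = [random.uniform(6.0, 14.0) for _ in range(300)]
M = [0.3 * (a - 10.0) + random.gauss(0.0, 0.2) for a in A]
M_norm = lowess_normalize(M, A)
print(round(ls_slope(A, M), 3), round(ls_slope(A, M_norm), 3))
```

Applying `lowess_normalize` separately to the spots of each print-tip group would give the print-tip-group version, with c replaced by a group-specific curve.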

Assumption: For every print group, changes roughly symmetric at all intensities.

Normalization – print-tip-group


M vs. A – after print-tip-group normalization


Within print-tip-group normalization is reasonable when:

1. Only a relatively small proportion of the genes will vary significantly in expression between the 2 mRNA samples

or

2. There is symmetry in the expression levels of the up/down regulated genes.

3. There is no correlation between groups of DE genes and print tips

• Consider location normalized intensities for print tip group i:

$\log_2(R/G)^{*} = \log_2(R/G) - c_i(A)$

• Suppose $X \sim N(0, a_i^2\sigma^2)$, where

$\sigma^2$: variance of true log-ratios

$a_i$: effect of ith print tip group

• Can get estimates of ai’s and adjust.

Taking scale into account

Assumptions:

• All print-tip-groups have the same spread.

• True ratio is $\mu_{ij}$, where i represents different print-tip-groups and j represents different spots.

• Observed is $M_{ij}$, where $M_{ij} = a_i\mu_{ij}$ and $\sum_{i=1}^{I} \log a_i^2 = 0$.

• Robust estimate of $a_i$ is

$\hat{a}_i = \frac{MAD_i}{\sqrt[I]{\prod_{i=1}^{I} MAD_i}}$

where $MAD_i = \mathrm{median}_j \{ |y_{ij} - \mathrm{median}_j(y_{ij})| \}$
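The robust scale estimate can be sketched as below: each print-tip group's MAD is divided by the geometric mean of all the MADs, which makes the log scale factors sum to zero. The function names and the simulated groups are hypothetical, and the sketch assumes every group has a non-zero MAD.

```python
import math
import random
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def scale_factors(groups):
    """groups: one list of location-normalized log-ratios per print-tip group."""
    mads = [mad(g) for g in groups]
    # Geometric mean of the MADs, so that sum_i log(a_i^2) = 0.
    geo_mean = math.exp(sum(math.log(m) for m in mads) / len(mads))
    return [m / geo_mean for m in mads]

def scale_normalize(groups):
    """Divide each group's log-ratios by its estimated scale factor a_i."""
    a = scale_factors(groups)
    return [[m / ai for m in g] for g, ai in zip(groups, a)]

# Toy data (hypothetical): three print-tip groups with different spreads.
random.seed(2)
groups = [[random.gauss(0.0, sd) for _ in range(200)] for sd in (0.5, 1.0, 2.0)]
a = scale_factors(groups)
scaled = scale_normalize(groups)
print([round(ai, 2) for ai in a], [round(mad(g), 3) for g in scaled])
```

After scaling, every group has the same MAD, which is the point of the adjustment and also the reason it can mask real differences in spread.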

Within print-tip-group box plots for print-tip-group normalized M


Effect of location + scale normalization


Problem

• If differences in scale were due largely to DE genes, adjusting for scale might mask your ability to find those genes.

• Again, if few genes are expected to be DE, this might not be an issue.

Alternative method

• $\log_2(R/G)^{*} = \log_2(R/G) - c_i(A)$

• where ci(·) is determined by both genes in the ith print-tip-group and other genes.

• “Composite” normalization uses MSP (titration series) genes.

• Could also use other housekeeping genes.

Comparing different normalization methods


Summary

• Print tip normalization works well under two assumptions:

1. MSP genes have minimal sample specific bias and can cover wide intensity range. Composite normalization necessary with divergent samples.

2. Adjusting for scale might compromise ability to find DE genes. Could have opposite effect (false positives).

• Kerr, et al. and Wolfinger, et al. perform only global normalization.

* maanova has extra normalization options

Recall Affy Measures of expression

• GeneChip® older software uses Avg.diff

$\text{Avg.diff} = \frac{1}{|A|}\sum_{i \in A} (PM_i - MM_i)$

with A a set of suitable pairs chosen by software.

• Log PMi / MMi was also used.

http://stat-www.berkeley.edu/users/terry/Classes/s246.2002/Week16/week16.ppt

Affy Measures of expression


GeneChip® newest version (MAS 5.0) uses something else, namely

$\log(\text{Signal}) = \text{TukeyBiweight}\{\log(PM_i - CT_i)\}$

with CT a version of MM that is never bigger than PM. Here TukeyBiweight can be regarded as a kind of robust/resistant mean.

Affy Measures of expression

Rules to determine CT (change threshold) for each probe pair:

1. MM < PM: CT = MM

2. MM ≥ PM:

A. If MM < PM for most probe pairs, an adjusted MM value is used, based on the bi-weight mean of the ratio PM/MM

B. If MM ≥ PM for most probe pairs, MM is replaced with a value that is “slightly smaller” than PM

$\log(\text{Signal}) = TB\{\log(PM_i - CT_i)\}$, i.e. $\text{Signal} = \exp\left(TB\{\log(PM_i - CT_i)\}\right)$

Affy Measures of Expression

Determine weights:

• Calculate median of log(PM-CT) values across the probe set.

• Probe pair weights are determined by distance to median. Closer pairs get higher weights.
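The weighting idea above (weights that fall off with distance from the probe-set median) is what a one-step Tukey biweight does. A minimal sketch follows. The tuning constant `c = 5`, the small `eps`, the use of base-2 logs, and the function names are assumptions made for illustration; they are not taken from the slides.

```python
import math
import statistics

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean with weights shrinking away from the median."""
    m = statistics.median(values)
    s = statistics.median([abs(v - m) for v in values])   # MAD of the values
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * s + eps)                        # scaled distance to the median
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0    # far-away values get weight 0
        num += w * v
        den += w
    return num / den

def signal(pm, ct):
    """Signal on the raw scale from PM and CT values of one probe set (PM > CT assumed)."""
    logs = [math.log2(p - t) for p, t in zip(pm, ct)]
    return 2.0 ** tukey_biweight(logs)

# A single wild value barely moves the biweight, unlike the ordinary mean.
vals = [10.0, 10.1, 9.9, 10.05, 9.95, 50.0]
print(round(tukey_biweight(vals), 3), round(statistics.mean(vals), 3))
```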

Affymetrix Normalization and Scaling (pre MAS 5.0)

Global Normalization: Baseline array is chosen out of a set of arrays. Average Intensity* for this array is calculated. Intensities on any other (non-baseline) array A1 in the set are multiplied by normalization factor (NF) to make Average Intensity* of A1 equal to Average Intensity* of baseline array.

Global Scaling: Target intensity is chosen and each array in a set of interest is scaled by some factor (SF1, SF2, ..., SFN) to give Average Intensity* equal to target intensity.

Average Intensity*: Average of Average Difference values for every probe set except highest and lowest 2%.


Li and Wong brought attention to the fact that AvDiff, as a measure of expression, has not been studied extensively. They proposed:

Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection

Cheng Li and Wing Hung Wong* PNAS, January 2001, 98:1, p. 31-36.

* Ph.D. student of Grace Wahba, University of Wisconsin-Madison, graduated in 1980

* Recipient of COPSS Prize

— Li and Wong, text, 2003

Model-based analysis of oligonucleotide arrays

Consider I array samples and one gene:

Goal: Estimate the abundance level of the gene in the I samples.

Data: There are 2×I×20 measurements used to obtain estimates (I×20 PMs and I×20 MMs).

θi : Denotes “expression index” for gene in the ith sample.

Assume: Measured intensity is proportional to θi and proportionality constant depends on probe (indexed by j).

What is PMij = βjθi ?

[Figure: arrays i = 1, ..., I, each with 20 features (probe pairs) for the gene]

Li and Wong, 2001


Li and Wong, 2001

For MM, denote the proportionality constant by αj

For PM, denote the proportionality constant by βj

νj: baseline response for jth probe pair due to nonspecific hybridization

αj: rate of increase of MM response for jth probe

ϕj : additional rate of increase in PM response

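$PM_{ij} = \nu_j + \theta_i\beta_j + \epsilon_{ij}$

$MM_{ij} = \nu_j + \theta_i\alpha_j + \epsilon_{ij}$

$\beta_j = \alpha_j + \phi_j$: PM intensity increases at a higher rate than MM intensity (β > α)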


Li and Wong, 2001

Fig. 1. Black curves are the PM and MM data of gene A in the first six arrays. Light curves are the fitted values to model 1. Probe pairs are labeled 1 to 20 on the horizontal axis.


Li and Wong, 2001

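Recall the model:

$PM_{ij} = \nu_j + \theta_i\beta_j + \epsilon_{ij}$

$MM_{ij} = \nu_j + \theta_i\alpha_j + \epsilon_{ij}$

$\beta_j = \alpha_j + \phi_j$

Currently, there is a “strong preference” to base all computations on y = PM – MM for each probe pair. Subtracting the deterministic portions of the equations above gives:

$y_{ij} = PM_{ij} - MM_{ij} = \theta_i\phi_j + \epsilon_{ij}$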


Li and Wong, 2001

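Consider

$y_{ij} = PM_{ij} - MM_{ij} = \theta_i\phi_j + \epsilon_{ij}$

with

$\epsilon_{ij} \sim N(0, \sigma^2)$

Assume this identifiability constraint:

$\sum_j \phi_j^2 = J$

Fix $\phi$ at $\tilde{\phi}$ and fit for $\theta$ using least squares.

Fix $\theta$ at $\tilde{\theta}$ and fit for $\phi$ using least squares.

Iterate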
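The alternating least squares scheme can be sketched as below. This is an illustration of the iteration only, not the dChip implementation: it omits outlier handling, uses a fixed number of iterations instead of a convergence check, and the name `fit_li_wong` and the simulated data are hypothetical.

```python
import random

def fit_li_wong(y, n_iter=50):
    """y[i][j]: PM - MM difference for array i, probe j. Returns (theta, phi)."""
    I, J = len(y), len(y[0])
    phi = [1.0] * J                  # starting values already satisfy sum(phi^2) = J
    theta = [0.0] * I
    for _ in range(n_iter):
        # Fix phi: the least squares estimate of each theta_i.
        sp = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(J)) / sp for i in range(I)]
        # Fix theta: the least squares estimate of each phi_j.
        st = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(I)) / st for j in range(J)]
        # Rescale so that sum_j phi_j^2 = J; theta absorbs the factor.
        scale = (sum(p * p for p in phi) / J) ** 0.5
        phi = [p / scale for p in phi]
        theta = [t * scale for t in theta]
    return theta, phi

# Simulated probe set (hypothetical values): 6 arrays, 20 probes, small noise.
random.seed(3)
theta_true = [100.0, 500.0, 1000.0, 2000.0, 300.0, 800.0]
raw = [random.uniform(0.5, 1.5) for _ in range(20)]
norm = (sum(p * p for p in raw) / 20) ** 0.5
phi_true = [p / norm for p in raw]
y = [[t * p + random.gauss(0.0, 2.0) for p in phi_true] for t in theta_true]
theta_hat, phi_hat = fit_li_wong(y)
print([round(t) for t in theta_hat])
```

The rescaling step is what makes the two sets of parameters identifiable: without it, multiplying every θ by a constant and dividing every ϕ by the same constant would give the same fit.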


Li and Wong, 2001

Fig 2. Black curves are the PM-MM difference data of gene A in the first six arrays. Light curves are the fitted values to model 2.


Li and Wong, 2001

Fig 3. Plots of residuals (y axis) versus fitted value (x axis) for additive model (A) and multiplicative model (B).

(A) $y_{ij} = \theta_i + \phi_j + \epsilon_{ij}$

(B) $y_{ij} = \theta_i\phi_j + \epsilon_{ij}$


Li and Wong, 2001

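Consider model for one array:

$y_j = \theta\phi_j + \epsilon_j$

Suppose ϕ’s are obtained from many arrays. Treat them as known.

Given ϕ’s, the LS estimate for θ is

$\hat{\theta} = \frac{\sum_j y_j\phi_j}{\sum_j \phi_j^2} = \frac{\sum_j y_j\phi_j}{J}$

$E(\hat{\theta}) = \frac{1}{J}\sum_j \phi_j E(y_j) = \theta$

$\mathrm{Var}(\hat{\theta}) = \frac{1}{J^2}\sum_j \phi_j^2\,\mathrm{Var}(y_j) = \frac{\sigma^2}{J}$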

Li and Wong, 2001

• Regarding θ’s as fixed, one can proceed similarly to get estimates and standard errors for ϕ.

• Note: A conditional analysis is done here. This assumes certain effects are known.

• In practice, the effects are estimated. The uncertainty in this estimation is not considered when computing standard errors.

• What is $\hat{\sigma}^2$?

$\mathrm{SE}(\hat{\theta}) = \sqrt{\hat{\sigma}^2/J}$, with $\hat{\sigma}^2 = \frac{1}{J-1}\sum_j (y_j - \hat{y}_j)^2$


Li and Wong, 2001

Recall: θi Denotes “expression index” for gene in the ith sample.

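Question: Given $\hat{\theta}$’s and SE[$\hat{\theta}$]’s, how would you use them?

Recall, for one array:

$y_j = \theta\phi_j + \epsilon_j$

After fitting the model you would have:

$\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_I$

$\mathrm{SE}(\hat{\theta}_1), \mathrm{SE}(\hat{\theta}_2), \ldots, \mathrm{SE}(\hat{\theta}_I)$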


Li and Wong, 2001

Fig 4. (A) Six arrays of probe set 1,248. (B) Plot of standard error (SE, y axis) vs. θ. The probe pattern (black curve) of array 4 is inconsistent with other arrays, leading to unsatisfactory fitted curve (light) and large standard errors of θ4.

Black curves are PM-MM data. Light curves are fitted model.


Li and Wong, 2001

Recall: ϕj denotes the additional rate of increase (in excess of the MM rate) in PM intensity for probe j.

Question: Given $\hat{\phi}$’s and SE[$\hat{\phi}$]’s, how would you use them?

Recall, for one array:

$y_j = \theta\phi_j + \epsilon_j$

After fitting the model you would have:

$\hat{\phi}_1, \hat{\phi}_2, \ldots, \hat{\phi}_{20}$


Li and Wong, 2001

Fig 6. (A) Probe 17 of probe set 1,222 is not concordant with other probes (black arrows) and is numerically identified by the outstanding standard error of ϕ17. (B) Plot of standard error (SE, y axis) vs. ϕ.



Li and Wong, 2001

Fig 7. (A) Probe set 3,562 has a single high-leverage probe 12, and the fitted light curves almost coincide with the black data curve. (B) ϕ12 is large compared with the other ϕ’ s close-to-zero value. Note that Affymetrix’s superscoring method works here by consistently excluding this probe.



Li and Wong, 2001

Li and Wong note that

“the MM responses do contain information on the expression index, and that this information can only be recovered by analyzing the PM and MM responses separately.”

Processing Probe Level Data

• A number of expression summary measures are obtained using PM and MM probes intensities.

• Recent results suggest that MM may be detecting signal along with PM.

• If this is the case, using MM could introduce noise and give biased estimates of the nominal expression level.

PM and MM values for 20 probes from 12 spike-in arrays from varying concentration experiment plotted vs. concentration

MM May Be Tracking Signal

• Some researchers suggested to use other MM sequences in order to alleviate this tracking.

• MM could be created by changing more than one base in PM sequence and by placing MM bases in different positions in the MM sequence (Nimblegen chips).

MM May Be Tracking Signal - What to do ?

• Other researchers suggest only using PM (Robust Multiarray Average)

• This approach would allow space currently used for MM to be used for other PM, thus allowing for twice as many sequences of interest to be printed onto an array.

MM May Be Tracking Signal - What to do ?

• Bolstad, et al., Bioinformatics, 2003

• Irizarry, et al., Biostatistics, 2003

• Irizarry, et al., text, 2003

• Irizarry, et al., NAR, 2003

Robust Multi-array Average (RMA)


OVERVIEW

Uses only PM (ignores MM)

• Adjust for background on the raw intensity scale

• Take log2 of background adjusted PM

• Carry out quantile normalization of log2(PM-BG), with chips in suitable sets

• Conduct a robust multi-array analysis (RMA) of the quantities
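The quantile normalization step in this overview can be sketched as follows: every chip is forced to share one reference distribution, taken here as the mean of the sorted values across chips. The function name `quantile_normalize` and the tiny example are hypothetical, ties are broken arbitrarily, and this is not the Bioconductor implementation.

```python
def quantile_normalize(columns):
    """columns: one list of log2 intensities per chip, all of the same length."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across chips.
    ref = [sum(sc[k] for sc in sorted_cols) / len(columns) for k in range(n)]
    out = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])   # probe indices by increasing value
        norm = [0.0] * n
        for rank, i in enumerate(order):
            norm[i] = ref[rank]                          # replace value by reference quantile
        out.append(norm)
    return out

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(chips))
```

After this step every chip has exactly the same set of values; only the assignment of values to probes differs between chips.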

RMA: Measures of Expression


RMA

Background correct, normalize, and log2 the PM intensities. Call this transformation T.

$T(PM_{ij}) = e_i + a_j + \epsilon_{ij}$, with $\epsilon_{ij} \sim N(0, \sigma^2)$

ei = log2 expression on ith array

aj = log2 probe effect for probe j


RMA:

$T(PM_{ij}) = e_i + a_j + \epsilon_{ij}$

Recall dChip:

$PM_{ij} = \nu_j + \theta_i\beta_j + \epsilon_{ij}$

$y_{ij} = PM_{ij} - MM_{ij} = \theta_i\phi_j + \epsilon_{ij}$

NAR 2003:

“[Our model] is quite different from the additive model in PM-MM that was found unsatisfactory in Li and Wong, most likely because of the very strong mean variance dependence that would be present in such an additive model.”

Why we take log2

http://biosun01.biostat.jhsph.edu/~ririzarr/Talks/nci-2002.ppt.gz

Dilution Study (www.genelogic.com)

• Amounts: 1.25, 2.5, 5, 7.5, 10, 20 μg

• Two tissues: LIVER and CNS

• 5 reps at each amount, giving 30 arrays per tissue

• 12,626 genes

Comparing RMA and MAS 5.0

• Precision of expression estimates (estimated by SD of replicate arrays)

• Consistency of fold change estimates

• Specificity and sensitivity (different methods used to assess DE genes)

• Normalization is done within replicate groups. The assumption that most genes do not change across non-replicate groups does not hold here.

(note that two different normalization methods were used: quantile for RMA and affy.scale.value for MAS)

• Expression measures for RMA and MAS Signal 5.0 were estimated using rma and expresso functions of Bioconductor package Affy

Comparing RMA and MAS 5.0 – Normalization

MAS 5.0

• Squared correlation coefficient across replicates was calculated over all 120 pairs of replicates (10 per group of replicates)

        RMA       MAS 5.0 Signal
R2      0.9947    0.9917

Strong probe affinity implies R2 ≈ 1.

• The difference was significant (p-value 1.152560e-07)

Methods and Results

• SD across replicate arrays were computed for all genes.

• LOESS curves were fitted to scatter plot of SD versus mean expressed values.

Methods and Results

[Figure: Loess curves of SD across replicates (y axis) vs. expression (x axis) for all genes; one curve for MAS 5.0 and one for RMA]

Consistency of fold change

• For one gene, fit a line to expression estimate vs. concentration on the log-log scale. Then calculate the “average lines” (average $\hat{\beta}$’s across genes).

• Since every fold increase in concentration should have the same fold increase in expression measure, a line fitted on log-log scale should have slope 1.

$\hat{\beta}_{\text{MAS 5.0}} = 0.65$

$\hat{\beta}_{\text{RMA}} = 0.67$

• Consistency of fold change was examined by comparing fold change estimates between arrays with different concentrations of target mRNA.

• Slopes over all genes for two different conditions of average expression versus concentration were calculated and on average were:

                RMA     MAS Signal 5.0
liver tissue    0.53    0.53
CNS samples     0.56    0.59


• Fold change between CNS and Liver tissue was estimated for all genes using 10 arrays in the lowest and 10 arrays in the highest concentration group.

• Number of genes showing inconsistency of fold change estimate by at least 2-fold;

RMA 23

MAS 5.0 81


Irizarry, et al., 2003

RMA: fold change estimate for 20 μg vs. fold change estimate for 1.25 μg

MAS 5.0: fold change estimate for 20 μg vs. fold change estimate for 1.25 μg


• In general it appears that RMA has better precision and similar accuracy as MAS Signal 5.0.

• RMA had slightly better consistency of fold change estimate.

Conclusions

Specificity and sensitivity – Spike-in data

• 11 control cRNA’s spiked in at different concentrations on each array. Other genes should be same across arrays.

• Choose 10 pairs of arrays from spike-in experiment.

• Compute FC for each gene under RMA, dChip, MAS 5.0.

• For some cut-off C, compute proportion of non-spiked genes where FC > C (false positives) and proportion of spiked genes where FC > C (true positives).


Fold change for Affymetrix Spike-in experiment


Test Statistic for Affymetrix Spike-in experiment

http://www.bioconductor.org/workshops/JAX02/jax-B.pdf

• Overall RMA does better than Li & Wong (dChip), which in turn does better than MAS 5.0 using FC.

• The simple t = est log FC / SE(est log FC) seems best for use with MAS and RMA.

• MAS looks bad here because we use single chip summaries in our analysis. They need a multi-chip version of their Signal Log Ratio. When done, it will look like the final step in RMA.

• With RMA and Li & Wong, nominal SEs are not as good as observed ones and p-values are better than (log) fold change.

Conclusions from replicate chip ROC curves

Figure 5. Box plots showing the distribution of observed fold changes for non-spiked in genes. The different colors represent the different quantiles. The relationship of color and quantile is demonstrated in the first box from the left.

Log Fold Change of Non-Differentially-Expressed Genes


Conclusions from single chip comparison ROC curves

http://www.bioconductor.org/workshops/JAX02/jax-B.pdf

• On the basis of the data just presented, and much more:

• With FC, RMA is best, LW (Li & Wong) next. MAS does not do well here.

• With p-values, RMA is as good as, and usually better than MAS, which is next. MAS does best on Affymetrix spike-in data sets. LW (dChip) does not do so well here.

• All judgments are comparative. Everyone does well in absolute terms, but some do better.

• In general it appears that RMA has better precision and similar accuracy as MAS Signal 5.0

• RMA had slightly better consistency of fold change estimate

More Conclusions

Comment on MM


NAR, 2003:

“It is possible that information about non-specific binding is contained in the MM values, but empirical results demonstrate that mathematical subtraction does not translate to biological subtraction. We have found that, until a better solution is proposed, simply ignoring these values is preferable.”

METHODS TO IDENTIFY DE GENES

Mult-t

Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments

by Dudoit, Yang, Callow, and Speed

Statistica Sinica 12 (2002), 111-139.

**Additional Details in Parmigiani et al., 2003.

Mult-t : Outline

• Data: AI Knockout & SRBI transgenic mice. AI, SRBI are two genes involved in HDL metabolism.

• Image: Segmentation and background correction (Yang et al.).

• Normalization: Spatial and intensity dependent effects.

• Gene summary: Construction of t-statistic for each gene. Evaluation of the statistic at a gene uses only data at that gene.

• Hypothesis test at each gene (accounts for multiple tests).

Mult-t : Normalization

Use lowess ( ) to identify curves through points grouped by print tips.

(log2 R/G)’ = (log2 R/G) - cj(A)

cj(A) is lowess ( ) fit to M vs. A for print tip j.

Mult-t : Gene Specific Summary

Compute Welch t-statistic for every gene

$t_j = \frac{\bar{x}_{2j} - \bar{x}_{1j}}{\sqrt{s_{1j}^2/n_1 + s_{2j}^2/n_2}}$

• Tj and tj: Random variable and realization of random variable for every gene j.

• Hj0: j th null is true.

• Hj1: j th null is false.

Mult-t : Hypothesis tests

Given evaluated test statistics, which are unusually large in magnitude ?

Informal assessment:

QQ plots

MA plots

Other (numerator vs. denominator of t-stat)

More precise assessment:

For Hj, pj = P( | Tj | > | tj | | H j 0); determine how small pj should be so that you reject, given that many (m) tests are done.


[Data matrix: m genes (rows) by n samples (columns), with entries Xji]

j = 1, 2,…, m genes (6384)

i = 1, 2, …, n samples (16); n1 + n2 = n (n1 = n2 = 8)

Xji = log2 (Rji/Gji) is relative (transformed, normalized, and background corrected) expression level for jth gene on ith array.


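$H_1: \mu_{11} = \mu_{12}$

$H_2: \mu_{21} = \mu_{22}$

⋮

$H_m: \mu_{m1} = \mu_{m2}$

[Diagram: each row (gene) of the m × n data matrix yields a test statistic t1, t2, ..., tm for the hypotheses H1, H2, ..., Hm]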

Determine distribution of test statistics under null

For n reasonably large,

T-stat ~ tv

Mult-t : Determine distribution of test statistics under null

Since n is generally not large in microarray experiments, build up the distribution of test statistics under the null via permutation.

[Diagram: the n sample labels (n1 in group 1, n2 in group 2) are permuted B times; recomputing the statistic for every gene gives an m × B matrix with entries tj,b]

$p_j^{*} = \frac{1}{B}\sum_{b=1}^{B} I(|t_{j,b}| > |t_j|)$
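The gene-wise Welch statistic and its permutation p-value can be sketched as below. The sketch draws B random relabellings rather than enumerating all of them, uses the strict inequality from the formula above (so a p-value of exactly 0 is possible), and assumes no gene has zero variance in both groups. The function names and the simulated data are hypothetical.

```python
import math
import random
import statistics

def welch_t(x1, x2):
    """Welch two-sample t-statistic, group 2 minus group 1."""
    v1, v2 = statistics.variance(x1), statistics.variance(x2)
    return (statistics.mean(x2) - statistics.mean(x1)) / math.sqrt(v1 / len(x1) + v2 / len(x2))

def permutation_pvalues(X, n1, B=1000, seed=0):
    """X[j] holds the n values of gene j; the first n1 columns form group 1."""
    rng = random.Random(seed)
    n = len(X[0])
    t_obs = [welch_t(row[:n1], row[n1:]) for row in X]
    counts = [0] * len(X)
    cols = list(range(n))
    for _ in range(B):
        rng.shuffle(cols)                        # one relabelling, shared by all genes
        for j, row in enumerate(X):
            perm = [row[i] for i in cols]
            if abs(welch_t(perm[:n1], perm[n1:])) > abs(t_obs[j]):
                counts[j] += 1
    return t_obs, [c / B for c in counts]

# Toy data (hypothetical): 20 genes, 8 + 8 samples, gene 0 truly shifted.
random.seed(4)
n1 = 8
X = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(20)]
X[0] = [v + (5.0 if i >= n1 else 0.0) for i, v in enumerate(X[0])]
t_obs, p_perm = permutation_pvalues(X, n1, B=200)
print(round(t_obs[0], 2), p_perm[0])
```

Using the same relabelling for every gene within a permutation keeps the between-gene dependence intact, which the multiple-testing adjustments that follow rely on.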

Mult-t : Notes on Permutations

Computationally, getting the distribution of test statistic via permutations is reasonable. Getting the distribution of the p-values might not be.

If you have, say, 6 samples total (3 in each group), what’s the smallest p-value you could obtain via permutations ?

Mult-t : Adjusting for multiple tests

• Family Wise Error Rate (FWER): probability of at least one type I error for all tests considered.

• Goal: Control FWER

Strong Control: Control for any combination of true and false nulls.

Weak Control: Control for the complete null (all nulls true).

Mult-t : Adjusting for multiple tests

• Procedures to control FWER:

Bonferroni:

Reject Hj if pj < α/m

pj* = min (m pj, 1)

Sidak:

pj* = 1 - (1 - pj)^m

Westfall & Young:

pj*’s obtained via reordering of permutation matrix.
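The two single-step adjustments listed above (Bonferroni and Sidak) are simple enough to write out directly. The function names are hypothetical; the formulas are the ones on the slide, with m the number of tests.

```python
def bonferroni(pvals):
    """Bonferroni adjusted p-values: min(m * p, 1)."""
    m = len(pvals)
    return [min(m * p, 1.0) for p in pvals]

def sidak(pvals):
    """Sidak adjusted p-values: 1 - (1 - p)^m."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]

raw = [0.01, 0.2, 0.5]
print(bonferroni(raw))
print(sidak(raw))
```

A hypothesis is rejected when its adjusted p-value falls below the chosen FWER level; the Sidak values are never larger than the Bonferroni ones.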

Mult-t : Westfall and Young’s Procedure

Order observed t-statistics: | t rm | < | t rm-1 | < … < | t r2 | < | t r1 |

[Diagram: the m × B matrix of permutation statistics tj,b is reordered to follow the observed ranking r1, ..., rm and replaced by the successive maxima uj,b defined as follows]

u m, b = | t rm, b |

u m-1, b = max (u m, b, | t rm-1, b |)

:

u 1, b = max (u 2, b, | t r1, b |)

Mult-t : Westfall and Young’s Step Down Max T Procedure

Order observed t-statistics: | t rm | < | t rm-1 | < … < | t r2 | < | t r1 |

[Diagram: the m × B matrix of permutation statistics tj,b, reordered and replaced by the successive maxima uj,b]

$p_{r_j}^{*} = \frac{1}{B}\sum_{b=1}^{B} I(|u_{j,b}| > |t_{r_j}|)$

...and enforce monotonicity
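The step-down max T computation can be sketched as follows, given observed statistics and an m × B matrix of permutation statistics. It follows the slide's strict inequality; the function name `maxt_adjusted` and the tiny inputs are hypothetical.

```python
def maxt_adjusted(t_obs, t_perm):
    """t_obs[j]: observed statistic; t_perm[j][b]: statistic of gene j under permutation b."""
    m, B = len(t_obs), len(t_perm[0])
    # r[0] indexes the largest |t|, r[m-1] the smallest.
    r = sorted(range(m), key=lambda j: -abs(t_obs[j]))
    counts = [0] * m
    for b in range(B):
        u = 0.0
        # Successive maxima, built from the least significant gene upward.
        for k in range(m - 1, -1, -1):
            u = max(u, abs(t_perm[r[k]][b]))
            if u > abs(t_obs[r[k]]):
                counts[k] += 1
    adj_sorted = [c / B for c in counts]
    # Enforce monotonicity: adjusted p-values cannot decrease down the ranking.
    for k in range(1, m):
        adj_sorted[k] = max(adj_sorted[k], adj_sorted[k - 1])
    adj = [0.0] * m
    for k, j in enumerate(r):
        adj[j] = adj_sorted[k]               # back to the original gene order
    return adj

t_obs_small = [3.0, 1.0]
t_perm_small = [[0.5, 3.5, 1.2, 0.1],
                [2.0, 0.2, 0.3, 0.4]]
print(maxt_adjusted(t_obs_small, t_perm_small))
```

The matrix `t_perm` must be built with the same relabelling applied to all genes in each column, otherwise the maxima do not reflect the joint null distribution.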

Mult-t : Westfall and Young’s Step Down Max T Procedure

• Less conservative than Bonferroni, Sidak, Holm’s

• Provides Strong Control of FWER

• Max T = Min P when the t-statistics are identically distributed. Generally, this is not the case; and, again, the minP algorithm is more computationally intensive.

Mult-t : Data

Experiment 1

8 AI Knock outs (Cy 5), 1 Reference (Cy 3)

8 Normals (Cy 5), 1 Reference (Cy 3)

Experiment 2

8 SRBI Transgenics (Cy 5), 1 Reference (Cy 3)

8 Normals (Cy 5), 1 Reference (Cy 3)

6382 probes

257 (~4%) related to lipid metabolism

Reference: Pooled cDNA from 8 normals

Q: What does this mean for permutation tests ?

Mult-t : Histogram and QQ plot of t-statistics

Mult-t : Max T adjusted and unadjusted p-values

Comments on Mult-t

• Welch’s t-statistic is used. Welch proposed solution to Behrens-Fisher problem. Implicit assumptions guide choice of the test statistic even though “no assumptions are made regarding distribution of the test statistics”.

• Permutations are advantageous for a number of reasons, but do not provide useful results when sample sizes are small.

• Permutation test not valid for this experimental design.

• Method is compared with single slide methods !

• Page 132, Newton et al. ... false positives... Consider definition of a false positive here !

Mult-t : Comparison of methods

SRBI data. Newton et al (orange); Chen et al. (purple)

Analysis of Variance for Microarrays (ANOVA)

Analysis of Variance for Gene Expression Microarray Data

by Kerr, Martin, and Churchill

Journal of Computational Biology 7: 819-837, 2000.

Bootstrapping Cluster Analysis: Assessing the Reliability of Conclusions from Microarray Experiments

by Kerr and Churchill

PNAS 98 (16): 8961 - 8965, 2001.


ANOVA : Outline

• Data: Human liver and human muscle tissue hybridized to two cDNA arrays. Final data set had 1286 spots.

• Normalization via terms in ANOVA model (“global analysis”)

• Gene summary: Construction of statistic for each gene. Evaluation of the statistic uses data from that gene (“local analysis”).

• Hypothesis test at each gene (uses bootstrap; does not account for multiple tests).

ANOVA: Model Development

[Design diagram: liver and muscle cDNA samples hybridized to Array 1 and Array 2]

A; i indexes array ( i=1,2 )

D; j indexes dye ( j=1,2 )

V; k indexes variety ( k = 1,2 )

G; g indexes gene (g = 1, 2, ..., N = 1286)

ANOVA : Model

log(yijkg) = μ + Ai + Dj + Vk + Gg + (AG)ig + (VG)kg + εijkg

μ - overall average signal (*)

A - array (*)

D - dye (*)

V - variety (i.e., condition or tissue)

G - gene

AG - array by gene interaction (spot effect)

VG - variety by gene interaction (DE if VG1g ≠ VG2g)

* Normalization

ANOVA : Gene Specific Summary

Obtain parameter estimates via least squares

$\widehat{VG}_{1g} - \widehat{VG}_{2g}$ is of most interest when goal is to identify DE genes.

Source             df      SS         MS
Array              1       92.34      92.34
Dye                1       0.74       0.74
Variety            1       2.97       2.97
Gene               1285    1885.89    1.47
AG                 1285    160.01     0.12
VG                 1285    1357.28    1.06
Residual           1285    82.75      0.0644
Corrected Total    5143    3581.99

(Table 3, page 23)

ANOVA : Hypothesis tests

Given evaluated test statistics, which are unusually large in magnitude ?

Informal assessment:

Plots of $\widehat{VG}_{1g} - \widehat{VG}_{2g}$

More precise assessment:

Bootstrap to obtain confidence intervals for VG1-VG2.

ANOVA : Bootstrap to Identify DE genes

Calculate Residuals

$\hat{\epsilon}_{ijkg} = \log(y_{ijkg}) - \log(\hat{y}_{ijkg})$; distribution of residuals => f

Sample from C f to get $\epsilon_b^{*}$

The scale factor C ensures that the empirical distribution has variance equal to that of the true residuals (Wu, 1986, Annals of Statistics).

Simulate Data

$\log(y_{ijkg}^{*}) = \log(\hat{y}_{ijkg}) + \epsilon_b^{*}$

Fit the model to simulated data and calculate $\widehat{VG}_{1g}^{*} - \widehat{VG}_{2g}^{*}$

ANOVA : Comments on ANOVA Approach

No adjustments for multiple tests !

The authors state “this may or may not be necessary based on the intended purpose of the analysis” (page 8).

ANOVA modelling framework provides a method of normalization* by accounting for array, dye, gene, ... effects.

Residual distribution on log scale is non-normal, but constant error variance assumption is not grossly violated.

Significance Analysis of Microarrays (SAM)

Significance Analysis of Microarrays Applied to the Ionizing Radiation Response

by Tusher, Tibshirani, and Chu

PNAS 98 (9): 5116-5121, 2001.


SAM : Outline

• Data: 2 wild type human lymphoblastoid cell lines (1,2) harvested in unirradiated or irradiated (U,I) state 4 hours after treatment. RNA samples were labelled and divided into two identical aliquots (A,B) prior to hybridization onto Affy chips. (U1A, U1B, U2A, U2B, I1A, I1B, I2A, I2B).

• Normalization via reference set obtained from average of intensity values across subsets of arrays.

• Gene summary: Construction of statistic for each gene. Evaluation of the statistic uses data from the entire array.

• Hypothesis test at each gene (accounts for multiple tests).

SAM : Normalization

• Generate reference set by averaging each gene expression level across the 8 hybridizations.

• Cube root scatter plot intensity values from each data set against reference (this handles negatives and Tusher et al. report that it resolved vast majority of lowly expressed genes).

• A linear least squares fit to the cube root scatter plot is used to calibrate each hybridization.

SAM : Post Normalization

SAM : Gene Specific Summary

The relative difference measure d j for gene j:

$d_j = \frac{\bar{x}_{2j} - \bar{x}_{1j}}{s_j + s_0}$

To ensure that the variance of d j is independent of gene expression, s0 (a small positive constant) is added to the denominator.

PNAS manuscript: The coefficient of variation of d j was computed as a function of s j in moving windows across the data and s0 was chosen to minimize the CV.

Parmigiani et al. text: “adaptively chosen”. Taken as median of all s (i).
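A minimal sketch of the relative difference statistic follows. Two points are assumptions rather than content of the slides: s_j is computed as the pooled standard error used by Tusher et al., and s0 defaults to the median of the s_j (the simpler of the two choices mentioned above) instead of the CV-minimizing value. The function names are hypothetical.

```python
import math
import statistics

def pooled_se(x1, x2):
    """Pooled standard error of the difference in group means (assumed form of s_j)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = statistics.mean(x1), statistics.mean(x2)
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    return math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))

def sam_d(X, n1, s0=None):
    """X[j]: expression values of gene j; the first n1 columns are condition 1."""
    s = [pooled_se(row[:n1], row[n1:]) for row in X]
    if s0 is None:
        s0 = statistics.median(s)        # assumed default: median of the s_j
    d = [(statistics.mean(row[n1:]) - statistics.mean(row[:n1])) / (sj + s0)
         for row, sj in zip(X, s)]
    return d, s0

X_small = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]]
d_plain, _ = sam_d(X_small, 4, s0=0.0)   # s0 = 0 gives an ordinary pooled t-statistic
d_fudge, s0_used = sam_d(X_small, 4)
print(round(d_plain[0], 4), round(d_fudge[0], 4))
```

Adding s0 shrinks every statistic, and shrinks it most for genes whose s_j is small, which is the dampening effect the slide describes.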

SAM Procedure to Identify DE Genes

[Diagram: the labels of the 8 samples are permuted B = 36 times; recomputing the statistic for every gene gives an m × B matrix with entries dj,b]

To minimize potentially confounding effects between the two cell lines, they analyzed data by using 36 balanced permutations. A permutation is considered balanced for cell lines 1 and 2 if each group of 4 experiments contained two experiments from cell line 1 and 2 from cell line 2.


[Diagram: each column of the m × B matrix of permutation statistics is ordered, giving d(1)b, d(2)b, ..., d(m)b for b = 1, ..., B = 36]

$d_{E,j} = \frac{1}{36}\sum_{b=1}^{36} d_{(j)b}$


Plot observed, ordered, d ( j ) against d E,j

[Figure: d ( j ) (y axis) vs. d E,j (x axis); genes with d ( j ) beyond the upper threshold u or the lower threshold l are called DE]

* u need not equal | l |

SAM: Estimate the False Discovery Rate (FDR)

Example from Tusher et al. 2001. 46 genes identified as DE using Δ = 1.2. For permutation 1, figure out how many genes you would have rejected using this Δ and assuming dj1 ( j = 1,2,...,m) is data.

Repeat for every permutation and calculate average number of false positives. Average = 8.4 => FDR = 8.4 / 46 = 0.183 (18.3%).

[Figure: d ( j )1 for permutation 1 plotted against d E,j with thresholds u and l; 5 FD’s for this set.]

SAM: Defining s0 and Δ:

s0 is chosen to make CV of dj approximately constant as a function of sj. This dampens large values of dj that arise from genes with very small sj. Generally, a constant CV (or approximately constant CV) is assumed in models of microarray data.

To determine Δ, fix the type I error rate α. Calculate hatFDR1, hatFDR2, ..., hatFDRn for n values of Δ. Take the smallest Δ* such that hatFDRΔ* < α.

There are other suggestions for calculating Δ (page ~281 of Parmigiani text 2003) that involve controlling the pFDR. Control of pFDR is becoming more common.

SAM: Comments by Tusher et al. 2001

• Dudoit et al. 2002 method (using step down max T) is too conservative. It found zero genes for this data set !

• 8 arrays are not enough for p-values based on permutations such as those done in Dudoit et al. 2002.

• SAM does not have strong or weak control of FDR.

• SAM estimates FDR. The estimate can be > 1 .

SAM: Comments on Tusher et al. 2001

• They stress the well known problems that arise from using fold change (nice reference to cite).

• The optimal ways in which to determine s0 and Δ are open problems. They have been addressed. See Parmigiani text, 2003.

• Application of SAM methodology to more than two conditions has not been evaluated. Utility will rely on construction of a good statistic (that can be hard).

• Intuitive approach. Implemented in Excel and R.

• SAM determines Δ and calculates FDR using the same data. This could introduce a bias. See page ~282 of Parmigiani text, 2003.

False Discovery Rate

              Do not reject H0    Reject H0    Total
H0 true       U                   V            m0
H0 false      T                   S            m-m0
Total         m-R                 R            m

FDR: E(Q) where Q = V/R (R > 0) and 0 (R = 0)

E(Q) = E (Q | R > 0) Pr (R > 0)

Benjamini - Hochberg Procedure to Control the FDR

• Let P1,…,Pm denote the p-values from m tests.

• Order the p-values: P(1) ≤ P(2) ≤ … ≤ P(m).

• Let k* = max{ k: P(k) ≤ α k/m }

• Reject all the null hypotheses for which Pi ≤ P(k*).

• This ensures FDR ≤ α (m0/m) ≤ α

• Result does not depend on m0 (the number of true nulls) or the distribution of p-values under H1.
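The step-up rule can be sketched in a few lines, using the usual ≤ comparison of the k-th ordered p-value with αk/m. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices by increasing p-value
    k_star = 0
    for k, i in enumerate(order, start=1):
        if pvals[i] <= alpha * k / m:
            k_star = k                                 # keep the largest k that passes
    reject = [False] * m
    for i in order[:k_star]:
        reject[i] = True
    return reject

p_example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_example, alpha=0.05))
```

Note that the rule looks for the largest k that passes, so a p-value can be rejected even if it exceeds its own threshold, as long as a larger p-value further down the list passes.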

Benjamini - Hochberg Procedure to Control the FDR

[Figure: ordered p-values (y axis) plotted against k/m (x axis, from 1/m to 1) together with a line of slope α; k*/m marks k* = max {k : p(k) ≤ α k/m}, 1 ≤ k ≤ m]

Empirical Bayes for Microarrays (EBarrays)

On Differential Variability of Expression Ratios: Improving Statistical Inference About Gene Expression Changes from Microarray Data

by M.A. Newton, C.M. Kendziorski, C.S. Richmond, F.R. Blattner, and K.W. Tsui

Journal of Computational Biology 8: 37-52, 2001.

On Parametric Empirical Bayes Methods for Comparing Multiple Groups Using Replicated Gene Expression Profiles

by C.M. Kendziorski, M.A. Newton, H. Lan and M.N. Gould

Statistics in Medicine, to appear, 2003.


EBarrays: Outline

• Data: E.coli under 3 treatments, 1 control; 4 cDNA arrays, ~4200 spots. Rat mammary glands from parentals and congenics; 24 Affymetrix chips, ~26,000 intensities.

• Model Development: Hierarchical Mixture Model accounts for known sources of variability.

• Normalization: EBarrays assumes data has been normalized for effects within and between arrays.

• Gene summary: Posterior probability of DE for each gene. Evaluation uses data across the entire array.

• Hypothesis test at each gene (“naturally” accounts for multiple tests).

EBarrays: Data

E.coli K-12 cell lines: 4 samples labelled in red (control, IPTG-a, IPTG-b, HS) and 4 in green (all control).

EBarrays: Data

10 Affy chips from non-treated; 14 from treated (DMBA)

EBarrays: Model Development

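Measurement Error:

$x_i \sim G(a, \theta_{x,i})$

$y_i \sim G(a, \theta_{y,i})$

Actual Expression:

$\theta_{x,i}, \theta_{y,i} \sim IG(a_0, \nu)$

$z_i = 1$ if $\theta_{x,i} \neq \theta_{y,i}$; $z_i = 0$ if $\theta_{x,i} = \theta_{y,i}$

$Z \sim B(p)$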


EBarrays: Model Fit

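With $\eta = (a, a_0, \nu)$ collecting the hyperparameters:

$$p_A(x_i, y_i \mid \eta) = p(x_i \mid \eta)\, p(y_i \mid \eta)$$

$$p_0(x_i, y_i \mid \eta) = \int_0^\infty p(x_i \mid \theta)\, p(y_i \mid \theta)\, p(\theta)\, d\theta$$

where

$$p(x_i \mid \eta) = \int_0^\infty p(x_i \mid \theta_{x,i})\, p(\theta_{x,i})\, d\theta_{x,i} \quad \text{and} \quad p(y_i \mid \eta) = \int_0^\infty p(y_i \mid \theta_{y,i})\, p(\theta_{y,i})\, d\theta_{y,i}$$

Complete-data log likelihood:

$$l_c(p, \eta) = \sum_k \left\{ z_k \ln p_A(x_k, y_k) + (1 - z_k) \ln p_0(x_k, y_k) + z_k \ln p + (1 - z_k) \ln(1 - p) \right\}$$

E-step:

$$\hat{z}_k = P(z_k = 1 \mid x_k, y_k, p, \eta) = \frac{p\, p_A(x_k, y_k)}{p\, p_A(x_k, y_k) + (1 - p)\, p_0(x_k, y_k)}$$

M-step: Maximize the resulting form in $(p, \eta)$.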
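A minimal sketch of the E-step/M-step alternation for a two-component mixture, under a simplifying assumption: the component log densities are supplied as fixed arrays (hypothetical inputs `log_pA` and `log_p0`), so only the mixing proportion p is updated. The full procedure also maximizes over the remaining model parameters, which this sketch does not attempt.

```python
import numpy as np

def em_mixing_proportion(log_pA, log_p0, p_init=0.5, n_iter=50):
    """EM for the mixing proportion p, holding both component densities fixed.

    log_pA[k], log_p0[k]: log density of gene k under the DE and null components.
    Returns the final p and the posterior probabilities from the last E-step.
    """
    log_pA = np.asarray(log_pA, dtype=float)
    log_p0 = np.asarray(log_p0, dtype=float)
    p = p_init
    for _ in range(n_iter):
        # E-step: posterior probability that each gene is differentially expressed.
        log_num = np.log(p) + log_pA
        log_den = np.logaddexp(log_num, np.log1p(-p) + log_p0)
        z_hat = np.exp(log_num - log_den)
        # M-step: with the densities fixed, the maximizing p is the mean of z_hat.
        p = float(np.clip(z_hat.mean(), 1e-12, 1 - 1e-12))
    return p, z_hat

# Hypothetical component densities for four genes.
p_hat, z_hat = em_mixing_proportion(
    np.log([0.9, 0.8, 0.1, 0.05]), np.log([0.1, 0.2, 0.9, 0.95])
)
```

Working on the log scale with `np.logaddexp` avoids underflow when a density is very small for some genes.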

EBarrays: Model Diagnostics (Marginal Densities)

EBarrays: Gene Specific Summary

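$$\text{odds}_i = \frac{P(z_i = 1 \mid D)}{P(z_i = 0 \mid D)}$$

$$P(z_i = 1 \mid D) = \int_0^1 P(z_i = 1 \mid x_i, y_i, p)\, P(p \mid D)\, dp$$

$$\text{odds}_i \approx \frac{p_A(x_i, y_i)}{p_0(x_i, y_i)} \cdot \frac{\hat{p}}{1 - \hat{p}}$$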

EBarrays: Contour Plots of Odds

EBarrays: Model Diagnostics (Gamma QQ plots on 4 group comparison - DMBA treated)

EBarrays: Model Diagnostics (CV plots on 4 group comparison - DMBA treated)

EBarrays: Results on 4 group comparison (DMBA treated)

Gene ID COP CI CII WF P0 P1 P2 P3

J00801 3066 4777 995 9083 0.05 0.95 0 0

0.04 0.96 0 0

L08100 4368 1278 14162 0 1 0 0

0 1 0 0

J00772 392 122 679 0.04 0.96 0 0

0.97 0.02 0.00 0.01


EBarrays: Threshold

• The rule “classify into the pattern of expression with the highest posterior probability” is the rule which minimizes the posterior expected number of false positives and negatives (under 0-1 loss).

• For two conditions, this is the same as “classify into the pattern of expression with posterior probability > 0.5 ”.

• EBarrays reports posterior probabilities; user can decide on threshold.

EBarrays: Threshold (Meng Chen)

• 500, 1000 and 2000 genes

• P(DE) = 0.05, 0.1, …, 0.5

• 2 conditions with 20 samples each.

• EBarrays, non-informative prior, 5 iterations, 0-1 loss.

• Each point is the average of pFDR for 10 runs.

• The plane is the Linear Model fit.

• pFDR increases when P(DE) becomes larger, but is still relatively low !

EBarrays: Some Ideas on pFDR control (Meng Chen)

• In Bayesian framework, one chooses a loss function to specify the relative cost of a false positive to a false negative; then, a rule is derived to minimize the Bayes risk.

• As stated previously, EBarrays reports posterior probabilities; user decides on threshold (under 0-1 loss, rule is to take the pattern with the highest posterior probability).

•The posterior expected false discovery rate can be controlled by adjusting the threshold.

• For example, one can decide beforehand at what level to control the pFDR and then use the rule that controls it at that level.

•Which one makes more sense?
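One way to read "adjusting the threshold" is sketched below, assuming the posterior expected FDR of a list is estimated by the average of 1 - Pr(DE | data) over the genes on the list. That estimator, the function name `threshold_for_target_fdr`, and the example probabilities are assumptions for illustration, not something the slides specify.

```python
import numpy as np

def threshold_for_target_fdr(post_prob_de, target_fdr=0.05):
    """Pick the largest list of genes whose estimated posterior expected FDR <= target.

    Returns (threshold, called): genes with posterior probability >= threshold are called DE.
    """
    p = np.asarray(post_prob_de, dtype=float)
    order = np.argsort(-p)                     # most confident genes first
    local_error = 1.0 - p[order]               # posterior probability each call is false
    running_fdr = np.cumsum(local_error) / np.arange(1, p.size + 1)
    ok = np.nonzero(running_fdr <= target_fdr)[0]
    called = np.zeros(p.size, dtype=bool)
    if ok.size == 0:
        return None, called                    # no list meets the target
    n_reject = ok[-1] + 1
    called[order[:n_reject]] = True
    return p[order[n_reject - 1]], called

# Hypothetical posterior probabilities of DE for six genes.
post = [0.99, 0.98, 0.95, 0.90, 0.60, 0.20]
threshold, called = threshold_for_target_fdr(post, target_fdr=0.05)
```

The running average is nondecreasing once genes are sorted by posterior probability, so the last index that meets the target gives the largest admissible list.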

EBarrays: Which one makes more sense ? (Meng Chen)

$\tau$ is the deciding point in EBarrays. Reject null if Pr(DE|data) > $\tau$.

• 1000 genes, 2 conditions with 20 samples each. P(DE) = 0.2

• Increasing $\tau$ appears to decrease FDR.

• Large $\tau$ corresponds to bigger penalty to false positives, which makes sense.

EBarrays: Which one makes more sense ? A Simulation Study (Meng Chen)

• Simulations were carried out to compare the risk resulting from EBarrays and BH.

• 1000 genes, 2 conditions of 10 samples each.

• ~ N(,1). = 2,3,4,5.

• The risk is a function of the true proportion of DE genes.

• Didn’t replicate the runs.

EBarrays: Which one makes more sense ? Results (Meng Chen)

EBarrays: Comments on Empirical Bayes Approach

• Hierarchical model was developed to identify significant differential expression.

• Model accounts for measurement error process and for natural fluctuations in absolute expression levels.

• Multiple conditions are handled in the same way as two conditions (no extra work required!).

• Threshold can be adjusted to target a specific pFDR.

• R-library available at www.biostat.wisc.edu/~kendzior/ (soon in Bioconductor).

• In addition to identifying DE genes, EBarrays provides improved (shrinkage) estimates of expression.

Identifying DE genes: General Approach

• Use data to evaluate a test statistic (determine a gene specific summary) for every gene. Could use only data at that gene (Dudoit et al.). Could use data from all genes (Newton et al., Storey et al.).

• Evaluate method (model) used to generate test statistics. Were assumptions reasonable ? Does model fit well ? Does it provide additional information ?

• Perform hypothesis test at each gene. Determine threshold. Perhaps adjust for multiple tests.

EBarrays: Shrinkage Estimates of Fold Change

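The posterior distribution of true differential expression $\rho_i$ at a given spot:

$$p(\rho_i \mid x_i, y_i, a, a_0, \nu) \propto \rho_i^{-(a + a_0 + 1)} \left[ \frac{1}{\rho_i} + \frac{y_i + \nu}{x_i + \nu} \right]^{-2(a + a_0)}$$

$$\hat{\rho}_i = \frac{x_i + \nu}{y_i + \nu}$$

Use marginal maximum likelihood to determine $(a, a_0, \nu)$.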
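A small sketch of how shrinkage changes fold-change estimates, assuming the estimate has the form (x + ν)/(y + ν) as in Newton et al. (2001). The value of `nu` below is hypothetical; in practice it would come from the marginal maximum likelihood fit.

```python
import numpy as np

def fold_change_estimates(x, y, nu):
    """Return (naive, shrunken) fold-change estimates for paired intensities x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    naive = x / y                    # raw ratio, unstable at low intensity
    shrunk = (x + nu) / (y + nu)     # ratio pulled toward 1, most strongly at low intensity
    return naive, shrunk

# Two spots with the same raw ratio but very different absolute intensities.
x = np.array([20.0, 2000.0])
y = np.array([5.0, 500.0])
naive, shrunk = fold_change_estimates(x, y, nu=50.0)
```

Both spots have a raw ratio of 4, but the low-intensity spot is shrunk much further toward 1, which is why shrinkage estimates can re-rank genes.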

EBarrays: Shrinkage Plots

EBarrays: Shrinkage Estimates Provide Error Reduction

EBarrays: Shrinkage Estimates of Expression Re-rank Genes

CLASSIFICATION AND CLUSTERING

Problems to be Addressed via Classification or Clustering Methods

Unsupervised Learning: Identification of new groups of profiles or genes

Hierarchical clustering analysis, SVD, PCA, K-means, Model Based Approaches, SOM* (*the Golub et al. implementation uses both unsupervised and supervised methods).

Supervised Learning: Classification into known classes (usually done on profiles)

Discriminant Analysis methods

Variable Selection: Identification of predictors (usually genes) that characterize known profile classes

Why cluster microarray data ??

11.03   1.22   0.92   2.61  -0.29   1.31   1.15   0.54   1.98  10.70
 0.00  10.61   2.40   2.16   0.60  -0.22   1.64   0.89  10.64   0.21
 0.12   0.46  10.30   0.14   1.56   2.29   1.30  11.18   2.06   0.14
-0.92   0.97  -0.11  10.78   2.57   2.26   6.64   1.39   1.22   2.13
 0.59   0.29   0.30   2.14   9.83  10.42  -0.50   1.66   0.29   0.46
 2.14   2.19   2.19   0.01   9.93   9.84  -0.57  -0.52   1.36   0.48
-0.77   0.65   1.51  10.39   0.90   0.92  10.26   0.69   2.13  -0.10
 0.35   0.07   7.66   0.18   0.70   0.51   2.24   9.99   2.26  -0.15
 1.67  11.30   1.48   1.89   1.29   0.72   0.39   0.94   8.41   1.20
11.26   0.43   2.12   1.10   1.40   1.48   2.04   0.96   0.93   8.55

Simple Answer: To recognize patterns that aren’t easy to see.


Current methods for classifying human tumors rely on a variety of morphological, clinical, and molecular variables.

There are still uncertainties in diagnosis.

Existing tumor classes are most likely heterogeneous.

Microarrays may be used to characterize the molecular variations among tumors by monitoring gene expression profiles on a genomic scale.

This may lead to more reliable classification of tumors !!

Nice motivation provided by Dudoit et al., 2002, JASA (cancer is used to illustrate)

Clustering Algorithms (Unsupervised)

“Model Free” algorithms (a.k.a. combinatorial algorithms) directly assign each observation to a group without consideration of an underlying probability model.

* Popular

* Intuitive

* (Fairly) Easy to Implement

“Model Based” algorithms: many assume data are i.i.d. from some population with pdf f where f is a mixture of component density functions. Each component describes one of the clusters. The model can be fit using ML or Bayesian methods.

Model Free Algorithms

The most popular clustering algorithms directly assign each observation to a group or cluster without regard to a probability model describing the data.

Each observation is labelled $i \in \{1, 2, ..., N\}$

Each cluster is labelled $k \in \{1, 2, ..., K\}$ (K < N)

Each observation is assigned to one (and only one) cluster.

C: i -> k ( C (i) = k )

Consider the distance d(xi, xi’) between every pair of observations.

Find C* that achieves some goal (i.e., minimize some summary function of the d’s).

K - Means

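Goal: Minimize W(C), where

$$d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \|x_i - x_{i'}\|^2$$

$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \|x_i - x_{i'}\|^2$$

Algorithm:

1. For a given C, calculate the means of each cluster {m1, m2, ..., mK}

2. Assign each data value to the cluster with the closest cluster mean:

$$C(i) = \arg\min_{1 \le k \le K} \|x_i - m_k\|^2$$

3. Repeat 1 & 2 until convergence.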
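The three steps can be sketched with NumPy as follows. The initialization (K distinct observations drawn at random), the function name `k_means`, and the toy data are assumptions for this sketch; the slides do not prescribe a starting configuration.

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Alternate between assigning points to the nearest mean and updating the means."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Assumed initialization: K distinct observations chosen at random.
    means = X[rng.choice(X.shape[0], size=K, replace=False)]
    labels = np.full(X.shape[0], -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every cluster mean.
        dist2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                               # assignments stable: converged
        labels = new_labels
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:                # keep the old mean if a cluster empties
                means[k] = members.mean(axis=0)
    return labels, means

# Two well-separated toy groups.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, means = k_means(X, K=2)
```

Rerunning with several values of `seed` is a cheap way to check whether the solution found is only a local minimum.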

Comments on K - Means

1) Uses quantitative data values

2) No ordering of objects within a cluster.

3) Number of clusters, K, must be chosen in advance

4) As K changes, cluster membership can change in arbitrary ways. (i.e., clusters need not be nested).

5) The algorithm on the previous page guarantees convergence, but convergence may be to a local min.

Comments on K - Means

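(4). To choose K, several values of K are often tried and the resulting within-cluster distance is compared.

[Figure: distance plotted against K.]

(5). To check whether a minimum is only local, start the algorithm from many different configurations (a global minimum is never guaranteed).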

Final Note on K - Means

K-means is similar to K-medoids

In K-medoids, instead of finding mean values of clusters, “centers” of clusters are found.

“Centers” are points that minimize the total distance to other points in that cluster.
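A short sketch of the difference between a cluster mean and a medoid, assuming Euclidean distance (the slides leave the distance unspecified). The function name `medoid` and the toy cluster are illustrative.

```python
import numpy as np

def medoid(points):
    """Return the observation with the smallest total distance to the others."""
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))   # pairwise Euclidean distances
    return points[dist.sum(axis=1).argmin()]

# One cluster containing an outlying observation.
cluster = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
center = medoid(cluster)          # an actual observation in the cluster
mean = cluster.mean(axis=0)       # need not coincide with any observation
```

Unlike the mean, the medoid is always one of the observations, and it needs only a distance matrix rather than quantitative coordinates.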

Hierarchical Clustering

Specify measure of distance (dissimilarity) between pairs of observations (ie, construct a distance matrix).

Produce hierarchical representations in which clusters at each level are created by merging clusters at the next lower level.

Lowest Level: Each cluster is a single observation.

Highest Level: One cluster contains all data.

Hierarchical Clustering (continued)

Bottom up: Start at lowest level (single observations) and merge selected pair (most similar) into cluster. Using definition of distance between observations and clusters, continue...

Top down: Recursively split data.

Each level of the hierarchy represents particular grouping of the data into disjoint clusters of observations

Generally, significance measures on clusters are not considered. Exception: Fraley & Raftery, UW-TR, 1998.

Bottom Up Clustering: 3 common approaches

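For two clusters G and H, containing $N_G$ and $N_H$ observations:

Single Linkage: Distance taken to be the minimum distance among all pairwise distances.

$$d_{SL}(G, H) = \min_{i \in G,\ i' \in H} d_{ii'}$$

Complete Linkage: Distance taken to be the maximum distance among all pairwise distances.

$$d_{CL}(G, H) = \max_{i \in G,\ i' \in H} d_{ii'}$$

Average Linkage: Distance measure is averaged across all pairwise distances.

$$d_{AL}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}$$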
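The three between-cluster distances can be compared directly on a toy example. This sketch assumes Euclidean distance between observations; the function name `linkage_distances` and the data are illustrative.

```python
import numpy as np

def linkage_distances(G, H):
    """Single, complete, and average linkage distances between clusters G and H."""
    G = np.asarray(G, dtype=float)
    H = np.asarray(H, dtype=float)
    # d[i, j] is the Euclidean distance between observation i of G and j of H.
    d = np.sqrt(((G[:, None, :] - H[None, :, :]) ** 2).sum(axis=2))
    return {"single": d.min(), "complete": d.max(), "average": d.mean()}

G = [[0.0, 0.0], [1.0, 0.0]]
H = [[3.0, 0.0], [5.0, 0.0]]
dists = linkage_distances(G, H)
```

Bottom-up clustering repeatedly merges the pair of clusters with the smallest such distance, so the choice of linkage changes which pair is merged next.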

Bottom Up Clustering: Comments

If data exhibits strong clustering features (quantified by measure d) and each of the clusters is well separated from the others, then the 3 methods will produce similar results.

With Single Linkage, there is a tendency to combine observations linked by a series of close intermediate observations (“chaining”). The clusters might not be compact.

With Complete Linkage, compact clusters are obtained; but the distance between clusters might be small.

Hierarchical Clustering: Eisen et al., PNAS, 1998

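For genes i and j,

$$d(x_i, x_j) = \frac{1}{N} \sum_{k=1}^{N} \left( \frac{x_{i,k} - x_{i,OS}}{\Phi_{x_i}} \right) \left( \frac{x_{j,k} - x_{j,OS}}{\Phi_{x_j}} \right), \qquad \Phi_G = \sqrt{\sum_{i=1}^{N} \frac{(G_i - G_{OS})^2}{N}}$$

$G_{OS} = 0$; note that when $G_{OS}$ is the mean of observations on G, d is the correlation coefficient.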
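A sketch of the similarity measure described above, written so that the offset is a parameter. The function name `eisen_similarity` and the example vectors are assumptions; the check against `np.corrcoef` illustrates the remark that a mean offset recovers the correlation coefficient.

```python
import numpy as np

def eisen_similarity(x, y, x_offset=0.0, y_offset=0.0):
    """Offset-based similarity between two expression profiles of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    phi_x = np.sqrt(np.mean((x - x_offset) ** 2))   # scale of x around its offset
    phi_y = np.sqrt(np.mean((y - y_offset) ** 2))
    return np.mean(((x - x_offset) / phi_x) * ((y - y_offset) / phi_y))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

uncentered = eisen_similarity(x, y)                        # offsets fixed at 0
centered = eisen_similarity(x, y, x.mean(), y.mean())      # offsets set to the means
pearson = np.corrcoef(x, y)[0, 1]
```

With offsets at 0 the measure treats the reference state as meaningful, so two profiles that are both consistently above zero score as similar even if they fluctuate differently.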

Hierarchical Clustering: Eisen et al.

Comments on Hierarchical Clustering

Intuitive

Biological information can be (but often is not) incorporated into measures of distance

Nice as a descriptive or diagnostic tool

Cluster order is arbitrary

Where one “cuts” the tree is arbitrary

No confidence measures are applied to clusters.

CAUTION!

Model Based Approaches for Unsupervised Clustering

Model Assumptions are made

Confidence or Probability of genes in particular groups can be assessed.

*** Many methods to identify DE genes can be thought of as model based clustering approaches where clustering is done using gene specific summaries.

EBarrays: Data

10 Affy chips from non-treated; 14 from treated (DMBA)

EBarrays: Identification of a few interesting genes

Gene ID COP CI CII WF P0 P1 P2 P3

J00801 3066 4777 995 9083 0.05 0.95 0 0

0.04 0.96 0 0

L08100 4368 1278 14162 0 1 0 0

0 1 0 0

J00772 392 122 679 0.04 0.96 0 0

0.97 0.02 0.00 0.01

POE: Probability of Expression

A Statistical Framework for expression based molecular classification in cancer

by G. Parmigiani, E.S. Garrett, R. Anbazhagan, E. Gabrielson

Journal of Royal Statistical Society 64: 717-736 (with discussion), 2002.

POE: Probability of Expression

POE:

Model gene expression using latent categories (on, off, baseline), i.e. over, under, neutral

Use model to

1. Remove noise prior to clustering

2. Define molecular subclasses

3. Determine probability that particular gene is in a class

POE:

For each gene and tumor, calculate the probability that the gene is expressed at baseline, over-expressed, or under-expressed in that tumor.

Identify clusters of genes based on this probability. Identify representative (“seed”) genes within each cluster.

Identify patterns (“profiles”) of expression across seed genes. For each tumor, calculate the posterior probability of each expression pattern (“profile”).

THIS CLASSIFIES TUMORS !

[Figure: data matrix with rows $g_1, g_2, \ldots, g_m$ (m genes) and columns $t_1, t_2, \ldots, t_N$ (N tumors).]

POE: Basic Idea Behind Model

A given gene g can be over-expressed, under-expressed, or neutral.

Suppose there are K tumor classes

If gene g* is related to tumor class, then the distribution of expression values of g* will be different in at least one of the K classes.

Currently assumes classes are not known (unsupervised), but they are working on extending this.

POE

• Notation:

$$e_{gt} = \begin{cases} -1 & \text{gene } g \text{ has abnormally low expression in tumor } t \\ \phantom{-}0 & \text{gene } g \text{ has normal expression in tumor } t \\ \phantom{-}1 & \text{gene } g \text{ has abnormally high expression in tumor } t \end{cases}$$

• Modeling observed gene expression, $a_{gt}$:

$$a_{gt} \mid (e_{gt} = e) \sim f_{e,g}(\cdot), \qquad e \in \{-1, 0, 1\}$$

• For gene g, the proportions of differentially expressed tumors in the population of unclassified tumors are

$$\pi_g^+ = P(e_{gt} = 1), \qquad \pi_g^- = P(e_{gt} = -1)$$

Garrett JSM 2002

POE: Quantities of Interest

$$p_{gt}^+ = P(e_{gt} = 1 \mid a_{gt}, \pi_g^+, \pi_g^-, f_{1,g}, f_{0,g}) = \frac{\pi_g^+ f_{1,g}(a_{gt})}{\pi_g^+ f_{1,g}(a_{gt}) + (1 - \pi_g^+ - \pi_g^-) f_{0,g}(a_{gt})}$$

Interpretation: The probability that gene g in tumor t is over expressed given observed expression and the model parameters

$$p_{gt}^- = P(e_{gt} = -1 \mid a_{gt}, \pi_g^+, \pi_g^-, f_{-1,g}, f_{0,g}) = \frac{\pi_g^- f_{-1,g}(a_{gt})}{\pi_g^- f_{-1,g}(a_{gt}) + (1 - \pi_g^+ - \pi_g^-) f_{0,g}(a_{gt})}$$

Interpretation: The probability that gene g in tumor t is under expressed given observed expression and the model parameters

Garrett JSM 2002

POE: Distributional Assumptions

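$$f_{-1,g}(\cdot) = U(-\kappa_g^- + \alpha_t + \mu_g,\ \alpha_t + \mu_g)$$

$$f_{0,g}(\cdot) = N(\alpha_t + \mu_g,\ \sigma_g)$$

$$f_{1,g}(\cdot) = U(\alpha_t + \mu_g,\ \alpha_t + \mu_g + \kappa_g^+)$$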

Empirical Bayes Approach: Could put priors on unknowns and integrate to give predictive distributions. Then maximize the marginal likelihood to identify unknowns.


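$\alpha_t$: Sample expression for normal expression levels.

$\mu_g$: gene effect in gene g for normal expression.

$(\kappa_g^+ / \sigma_g) > r$ where r is approximately 5.

[Figure: the three component densities $f_{-1,g}$, $f_{0,g}$, and $f_{1,g}$, with $f_{0,g}$ centered at $\alpha_t + \mu_g$.]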


After fitting, parameters can be used to “de-noise” data or to cluster genes.

$p_{gt}^+$, $p_{gt}^-$, and $p_{gt}^0$ are the most important quantities.
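A sketch of how the three probabilities could be computed for one gene and one tumor, assuming the mixture of a uniform "low" component, a normal "baseline" component, and a uniform "high" component. All argument names are hypothetical, `sigma` is treated as a standard deviation, and `center` stands for the baseline mean of the gene in that tumor; none of these conventions is confirmed by the slides.

```python
import math

def poe_probabilities(a, center, sigma, kappa_plus, kappa_minus, pi_plus, pi_minus):
    """Return (p_plus, p_minus, p_zero) for an observed expression value a."""
    # Normal density for the baseline class.
    f0 = math.exp(-0.5 * ((a - center) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    # Uniform densities for the over- and under-expressed classes.
    f_plus = 1.0 / kappa_plus if center <= a <= center + kappa_plus else 0.0
    f_minus = 1.0 / kappa_minus if center - kappa_minus <= a <= center else 0.0
    w_plus = pi_plus * f_plus
    w_minus = pi_minus * f_minus
    w_zero = (1.0 - pi_plus - pi_minus) * f0
    total = w_plus + w_minus + w_zero
    return w_plus / total, w_minus / total, w_zero / total

# A value four standard deviations above baseline, and one close to baseline.
far = poe_probabilities(4.0, 0.0, 1.0, 10.0, 10.0, 0.1, 0.1)
near = poe_probabilities(0.5, 0.0, 1.0, 10.0, 10.0, 0.1, 0.1)
```

Because the three probabilities are on a common 0 to 1 scale for every gene, they can be compared across genes and platforms in a way raw intensities cannot.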

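Gene-level parameters are given hierarchical distributions, with N, G, and E denoting normal, gamma, and exponential distributions whose arguments are hyperparameters:

$$\mu_g \sim N(\cdot, \cdot)$$

$$\sigma_g^{-2} \sim G(\cdot, \cdot)$$

$$\kappa_g^+ \sim E(\cdot)$$

$$\kappa_g^- \sim E(\cdot)$$

$$\mathrm{logit}(\pi_g^+) \sim N(\cdot, \cdot)$$

$$\mathrm{logit}(\pi_g^-) \sim N(\cdot, \cdot)$$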

POE: “De-noised” measures of expression

a g t ~ N( gt, g)

Eappa gtgtgtgtgtgtgt (|,)()()

Normal class: g t = t + g with g unknown

Elevated class: g t - t - g ~ U ( 0, k g+)

Low expression class: g t - t - g ~ U ( k g-, 0)

Posterior means of gt can be used as estimates of expression values.

POE: Cluster genes

1. Choose DE pattern of interest. Indicate what proportion of genes are over-expressed and under-expressed.

2. For each gene, using $p_{gt}^+$ and $p_{gt}^-$, calculate the probability that the samples have a pattern of the type specified. Sort by this probability.

$$P(e_{g1}, \ldots, e_{gT} \mid \text{data}) = \prod_t (p_{gt}^+)^{I(e_{gt} = 1)}\, (p_{gt}^-)^{I(e_{gt} = -1)}\, (1 - p_{gt}^+ - p_{gt}^-)^{I(e_{gt} = 0)}$$

3. Calculate a J x J matrix of “gene agreement”:

$$r_{gm} = \sum_{i=1}^{I} \left( p_{gi}^+ p_{mi}^+ + p_{gi}^- p_{mi}^- + p_{gi}^0 p_{mi}^0 \right)$$

POE: Cluster genes

4. Identify genes with “high coherence”

5. Out of this group, pick gene with the highest probability calculated from step 2 (seed gene).

6. Group genes if they are similar to the seed gene.

7. Remove this group and repeat.

Should repeat process using different initial patterns.

Results in some number of seed genes (for example, say 4).

POE: Comments on seed genes (pg. 379 of Parmigiani et al. text)

Any number of seed genes can be used to create a collection of profiles. s genes give $3^s$ possible profiles.

“ Profiles based on 4 or more genes are seldom required with the sample sizes and signal-to-noise ratio achievable.”

POE: Creating molecular profiles

For each tumor t, calculate the posterior probabilities of each expression profile using $p_{sg1,t}^+$, $p_{sg1,t}^-$, $p_{sg1,t}^0$, ...

This classifies tumors !

Each of the 4 seed genes could be under-expressed (-1), over-expressed (1), or neither (0). This gives $3^4 = 81$ patterns of expression (or profiles).

P1 P2 P3 ...... P81

sg1 0 0 -1 1

sg2 0 1 0 1

sg3 0 0 -1 1

sg4 0 0 1 1
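The profile table can be enumerated mechanically. The sketch below assumes, purely for illustration, that the seed genes are treated as independent given the data, so that the probability of a profile is the product of the per-gene class probabilities; the slides do not state how the profile probabilities are combined, and the numbers are made up.

```python
from itertools import product

def profile_probabilities(seed_probs):
    """Probability of every profile for one tumor.

    seed_probs: one dict per seed gene mapping -1, 0, 1 to that gene's
    under-expressed, baseline, and over-expressed probabilities.
    """
    result = {}
    for profile in product((-1, 0, 1), repeat=len(seed_probs)):
        prob = 1.0
        for state, probs in zip(profile, seed_probs):
            prob *= probs[state]      # independence across seed genes is assumed
        result[profile] = prob
    return result

# Hypothetical class probabilities for four seed genes in one tumor.
example = [
    {-1: 0.10, 0: 0.20, 1: 0.70},
    {-1: 0.80, 0: 0.10, 1: 0.10},
    {-1: 0.05, 0: 0.90, 1: 0.05},
    {-1: 0.30, 0: 0.30, 1: 0.40},
]
profiles = profile_probabilities(example)
best = max(profiles, key=profiles.get)
```

With four seed genes the dictionary has 81 entries, one per profile, and the most probable profile assigns the tumor to a molecular class.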

POE: Quotes from Garrett & Parmigiani in Parmigiani et al. text 2003

“ The benefit of [the Bayesian hierarchical modeling] approach is that it borrows strength across genes using the entire genomic distribution instead of fitting a separate independent model for each gene.”

“Hierarchical Bayesian models have been shown to have appealing properties in estimation of large vectors of related quantities.”

Comments on POE

Unsupervised, model based clustering approach. The model accounts for measurement errors.

NOT intended for gene clustering.

Uses scale-independent measures of expression which allows combination of data across platforms

Defines a molecular profile based on a small number of genes. This could be useful clinically.

TIME SERIES ANALYSIS OF MICROARRAY DATA

Analysis of Microarray Time Series

Manuscript in Progress

by

M. Yuan and C.M. Kendziorski

Methods for Microarray Time Series Data

Every method that we know considers TS data in one condition. General goal is to cluster genes with similar expression patterns over time.

We consider TS data in multiple conditions. Group genes based on differential expression patterns over time.

Microarray Time Series: Example

• Two treatments

• 6 time points: 0, 2, 6, 24, 48, 120 Hours

• Number of Genes: 12625

Microarray Time Series: Apply EBarrays at each time point

• Differentially expressed genes identified by EBarrays at each time point:

Time       0 Hr   2 Hrs   6 Hrs   24 Hrs   48 Hrs   120 Hrs
DE genes   0      9       0       28       170      333

• From 24 hrs to 48 hrs, and from 48 hrs to 120 hrs:

Time       Pr(DE|DE)   Pr(DE)
48 Hrs     6/28        170/12625
120 Hrs    36/170      333/12625

Microarray Time Series: Correlation

• What if there is no correlation?

Pr(DE|DE)=Pr(DE)

• Why should we care about correlation?

Posterior odds: Pr(DE) f(x|DE) / [Pr(EE) f(x|EE)]

If Pr(DE) is large, it is easier to claim DE

If Pr(DE) is small, it is harder to claim DE


Microarray Time Series: How much information did we lose ?

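Each threshold below is Pr(EE)/Pr(DE), the likelihood ratio at which the posterior odds of DE exceed 1 (for example, (22/28)/(6/28) = 3.67 and (12455/12625)/(170/12625) = 73.26).

• Given that a gene is DE at 24 Hours

We could claim DE at 48 Hours if f(x|DE) > 3.67 f(x|EE)

If we do not consider correlation, we claim DE at 48 Hours if f(x|DE) > 73.26 f(x|EE)

• Given that a gene is DE at 48 Hours

We could claim DE at 120 Hours if f(x|DE) > 3.72 f(x|EE)

If we do not consider correlation, we claim DE at 120 Hours if f(x|DE) > 36.91 f(x|EE)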
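The likelihood-ratio thresholds can be recomputed from the counts reported for this data set. The sketch assumes the rule "claim DE when the posterior odds exceed 1", which gives a threshold of Pr(EE)/Pr(DE); the function name and dictionary keys are illustrative.

```python
def likelihood_ratio_threshold(prob_de):
    """Smallest f(x|DE)/f(x|EE) at which the posterior odds of DE exceed 1."""
    if not 0.0 < prob_de < 1.0:
        raise ValueError("prob_de must lie strictly between 0 and 1")
    return (1.0 - prob_de) / prob_de

thresholds = {
    # Conditional on the gene being DE at the previous time point.
    "48 Hrs given DE at 24 Hrs": likelihood_ratio_threshold(6 / 28),
    "120 Hrs given DE at 48 Hrs": likelihood_ratio_threshold(36 / 170),
    # Ignoring the previous time point (marginal proportion of DE genes).
    "48 Hrs marginal": likelihood_ratio_threshold(170 / 12625),
    "120 Hrs marginal": likelihood_ratio_threshold(333 / 12625),
}
```

The gap between the conditional and marginal thresholds is a direct measure of how much evidence is discarded when each time point is analysed separately.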

Microarray Time Series: HMM Model Structure

• Pattern process S[t]: Pattern of expression at time t.

Treatment vs Control: DE or EE

Compare treatments: e.g., 5 patterns for 3 treatments

Markov Process

• Expression vector x[tk]: K expressions observed at time t.

Distributed according to S[t]

Conditionally independent given S[t]

Microarray Time Series: HMM Model Structure

[Figure: for each gene (Gene 1, Gene 2, ..., Gene n), a hidden chain of patterns S[1], S[2], S[3], ... is linked by the HMM to the observed expression vectors x[11],…,x[1K]; x[21],…,x[2K]; x[31],…,x[3K]; ...]

Microarray Time Series: Options to Specify HMM

• Marginal expression distribution using EBarrays

• Transition matrix Pr(S[t]|S[t-1]) free of time: Homogeneous HMM

• Force Pr(DE|S[t-1]=DE) = Pr(DE|S[t-1]=EE): Independent analysis

• Constraint that Pr(DE) is constant over time: Homogeneous independent analysis

Microarray Time Series: Estimation Using EM

• Infer unknown pattern process from observed expression data. Maximum a Posteriori (MAP): max Pr(S[t]=i|X)

• Unknowns: Parameters associated with f(x|S). Parameters associated with pattern process.

• EM algorithm. E Step: parameters → pattern process (Baum-Welch). M Step: pattern process → parameters (MLE).

Microarray Time Series: Cluster Genes Into Patterns

• Maximum a posteriori

max Pr(S[t], t=1,…,T|X)

• Viterbi Algorithm
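A compact sketch of the Viterbi recursion for one gene, assuming the initial, transition, and emission log probabilities are already available (in the setting above the emission terms would come from the fitted expression model). The function name and all numbers are hypothetical.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most probable state path for a single gene.

    log_init[s]: log Pr(S[1] = s); log_trans[i, j]: log Pr(S[t] = j | S[t-1] = i);
    log_emit[t, s]: log f(x[t] | S[t] = s).
    """
    log_emit = np.asarray(log_emit, dtype=float)
    T, S = log_emit.shape
    score = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans   # cand[i, j]: best path ending i -> j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(T - 2, -1, -1):                 # trace the best path backwards
        path[t] = back[t + 1, path[t + 1]]
    return path

# States: 0 = EE, 1 = DE. The third time point is ambiguous on its own.
log_init = np.log([0.9, 0.1])
log_trans = np.log([[0.9, 0.1], [0.3, 0.7]])
log_emit = np.log([[0.9, 0.1], [0.05, 0.95], [0.55, 0.45], [0.1, 0.9]])
path = viterbi(log_init, log_trans, log_emit)
```

In this toy example the ambiguous third time point is labelled DE because its neighbours are clearly DE, which is the kind of borrowing across time that a separate analysis at each time point cannot do.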

Microarray Time Series: Simulation Study

• 6 time points

• 2 treatments

• 1500 genes

• Proportion of DE at the first time point is 0.1

• Pr(DE|EE)=0.1

• Pr(DE|DE)=0.1, 0.5, 0.7

Microarray Time Series: Simulation Results

P(DE|DE)   Method   Time 1    Time 2    Time 3    Time 4    Time 5    Time 6
0.1        EB       55/60     57/59     66/67     82/101    93/107    98/115
0.1        HMM      57/62     62/67     68/69     81/98     85/94     96/108
0.5        EB       95/106    92/102    123/142   126/136   139/151   137/145
0.5        HMM      97/109    116/129   139/155   145/155   152/173   146/160
0.7        EB       63/75     123/132   165/191   203/230   158/173   172/184
0.7        HMM      72/88     144/154   191/213   236/258   214/237   201/224

Microarray Time Series: EBarrays vs. HMM

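• Differentially expressed genes identified by the EBarrays-based HMM, and how many more genes are identified than by EBarrays at each time point:

Time                     0 Hr   2 Hrs   6 Hrs   24 Hrs   48 Hrs   120 Hrs
DE genes (HMM)           0      9       0       138      333      475
Additional genes         0      0       0       110      163      142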

Microarray Time Series: Could it be so good ?

• Simulate data for the last two time points with parameters estimated from the HMM

• Performance comparison

Method   48 Hours   120 Hours
EB       259/269    374/390
HMM      302/321    469/498

Microarray Time Series: Comments

• Correlation over time does exist in most studies.

• Taking correlation over time into account can significantly improve the efficiency of methods for identifying DE genes.

• HMMs provide a flexible way to model the correlation over time.

• The EBarrays-based HMM is a useful option for analyzing microarray time series data.

• Technical Report coming soon...

EXPERIMENTAL DESIGN

Experimental Design Questions and Overview

What to print/spot on the array ?

How many pieces of one gene ?

Replicates of a gene ?

Housekeeping or other control spots ?

How to arrange spots/genes on the array ?

Spatial Bias

Print Tip Bias

cDNA ?

Affy (done)

cDNA (~done)

Affy (varies ~11)

Experimental Design Questions and Overview (continued)

What to hybridize onto the array ?

What is reference/control ?

How is labelling done ?

Should samples be pooled ?

cDNA ?

Affy (can be ?)

cDNA (Dye swaps ?)

Affy (done)

cDNA (?)

Affy (?)

Actual Design

The actual design considers questions previously stated, but also includes issues of replication.

How many arrays should one use ?

How should samples be allocated to arrays ?

Answers to these questions will depend on three sources of variation: Biological, Technical, and Measurement Error.

In addition, the goal of the experiment should affect its design !

Sources of Variation

Biological Variation: Subject to subject variation. Intrinsic to the organisms studied. This CAN NEVER be reduced, but its effect can perhaps be reduced (e.g. by pooling biological samples).

Technical Variation: Introduced during extraction, labelling, and hybridization. Quantified (estimated) by hybridizing multiple mRNA samples from the same individual to many arrays. Also called array to array variation (caution: so are other sources of variation).

Measurement Error: Introduced when reading signals. Measured within a single array. Multiple spots on one array can reduce the effect of measurement error.

Loop Design

Graphical representation of the loop design indicates

Which differences can be estimated

Precision of the estimates

For example,

A and B can only be compared if there is a path from A to B

Here, A and B can be compared directly or through C. The direct comparison (log(A/B)) is less variable than

log (A/B) = log (A/C) + log (C/B)
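A short worked comparison of the two routes, assuming each array contributes an independent log-ratio error with common variance $\sigma^2$ (a simplifying assumption not stated on the slide):

```latex
\operatorname{Var}\!\left[\log(A/B)\right]_{\text{direct}} = \sigma^2,
\qquad
\operatorname{Var}\!\left[\log(A/C) + \log(C/B)\right] = \sigma^2 + \sigma^2 = 2\sigma^2 .
```

Under that assumption every extra step in the path between two samples adds another $\sigma^2$, so longer paths give less precise comparisons.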

[Figure: loop design linking samples A, B, and C.]

Graphical Representation of Loop Designs

Simplest loop design

[Figure: graphs of two designs: a loop through A, B, and C, and a two-sample design in which A and B are joined by 5 arrays.]

Comparison Among Sources of mRNA

Consider sources A, B, and C to be of interest

Dudoit and Speed method (mult t) 3 times (AB, AC, BC). This is not optimal, but will give adjusted p-values.

ANOVA (Kerr et al., Wolfinger et al.) Better to use all the data at once, but there is no accounting for multiple tests with these approaches. Rank ordering provided might be useful.

EBarrays (Newton et al., Kendziorski et al.) Can handle multiple conditions and accounts for multiple tests. Must specify patterns. Computational issues.

Evaluation of Designs

Each approach gives a gene specific summary score. The scores depend on biological, technical, and measurement error variability.

Different designs result in different allocations of the variance components.

Evaluation of designs is often done by considering the variability associated with the resulting gene expression estimates.

Evaluation of Designs

log (A/B) = log (A/R) - log (B/R)

Yang & Speed, NRG, 2002

Notes on Designs

Feasibility of Design III decreases as the number of conditions increases. With 6 samples, there are 15 pairwise comparisons.

Kerr and Churchill proposed loop designs. No longer strongly recommended (by many including Churchill). Some comparisons are less precise than others. Problems with robustness.

A Closer Look at the Loop Design

[Figure: loop design connecting samples A, B, C, D, E, and F in a cycle.]

Compare A with D

log (A/D) = log (A/B) + log (B/C) + log (C/D)

log (A/D) = log (A/F) + log (F/E) + log (E/D)

What if arrays C and E are bad ?

Table 2 from Yang & Speed

Yang & Speed, NRG, 2002

Time Course Experiments

If main focus of the experiment is relative change between T2, T3, T4 and initial time point T1, then a reference design is good.

T1 T2 T3 T4

Here, all comparisons are made with equal efficiency.

If a stable reference is available (T1), this will allow comparisons to be made over a relatively long period of time.

Author note on Reference Design

Gary Churchill has traditionally argued against a standard reference design since almost half the measurements are made on the reference sample (which might be of little or no interest) and the variance can be increased relative to other designs.

However, in his NG Reviews article (December 2002), he notes the advantages:

Paths connecting 2 samples are 2 steps long.

Good way to handle comparisons across time.

Replication (Review and Note)

Technical Replicates: Yang and Speed also include measurement error here. They define such replicates as ones where target mRNA is from the same extraction (different than GC definition reviewed earlier).

Biological Replicates: mRNA samples from different individuals, different cell lines.

Where to spend resources ?

Gary Churchill (NG Reviews, December 2002):

Correlation from duplicate spots on one array (~95 %)

Same target to multiple arrays (~60-80 %)

Samples from individual inbred mice (~30 %)

Yang & Speed (NG Reviews, August 2002):

Technical replicates generally involve a smaller degree of variation in measurements than biological replicates.

Where to spend resources ?

Gary Churchill (NG Reviews, December 2002):

When measurement is expensive, it is preferable to add experimental units rather than technical replicates.

When the variability of measurements exceeds the variability between experimental units, technical replication can increase precision.

When variability between experimental samples is large and units are not too costly, it may be worthwhile to pool samples.

Summary

In short,

Identify goals of the experiment (which comparisons are most important ?)

Identify options

Calculate variability associated with all options

Research options to see how they work in practice !!

Choose design based on variability, feasibility, and cost.
