Structure and Analysis of Affymetrix Arrays

Monnie McGee

Department of Statistical Science

Southern Methodist University

UTSW Microarray Analysis Course, October 28, 2005 – p.1/56

Outline

Brief Review of Spotted Array Technology

Structure of Affymetrix Arrays

Exploratory Data Analysis

Affymetrix Data Files

Obtaining Gene Expression Values

Software

Microarray Measurements

All raw measurements are fluorescence intensities

Target cDNA (or mRNA) is fluorescently labeled

Molecules in dye are excited using a laser

Measurement is a count of the photons emitted

Entire slide or chip is scanned, and the result is a digital image

Image is processed to locate probes and assign intensity measurements to each probe

Microarray Technologies

Two-Channel Spotted Arrays:

Robotic microspotting

Probes are 300 to 3000 base pairs in length

Long-oligo arrays: probes are uniformly 60 to 90 bp

Commercial arrays using inkjet technology

Single-Channel Arrays:

High-density short-oligo (25 bp) arrays (Affymetrix, Nimblegen)

Spotted Arrays

Diagram courtesy of Columbia Department of Computer Science

The Affymetrix Chip

Human Genome U133 Plus 2.0 Array

Courtesy of Affymetrix

Some Definitions

Probes = 25 bp sequences

Probe sets = 11 to 20 probes corresponding to a particular gene or EST

Chip contains 54K probe sets

In situ Synthesis of Probes

Image Courtesy of Affymetrix

Probe Selection: HG-U133 Plus 2.0

Sequence data for new content obtained from dbEST, GenBank, and RefSeq.

Draft assembly of Human Genome (NCBI Build 31) used to assess sequence orientation and quality.

Probes selected from the 600 bases most proximal to the 3′ end of each transcript.

Probe selection regions defined by the following:

3′ ends of RefSeq and complete CDS mRNA sequences

Eight or more 3′ EST reads terminating at the same position (evidence for polyadenylation)

3′ end of the assembly (consensus end).

Details found in Affymetrix Technical Note (2003).

Types of Probe Sets

No suffix: predicted to perfectly match a single transcript

“_a” suffix: recognize multiple alternative transcripts from the same gene

“_s” suffix: common probes among multiple transcripts from separate genes

“_x” suffix: contain some probes that are identical or highly similar to other sequences.

mRNA Hybridizes to Probes

Image Courtesy of Affymetrix

Sizes of Various GeneChips

Arrays for 27 organisms

Arabidopsis (2), Drosophila (2), Mouse (5), Human (8), Yeast (2)

Arabidopsis: 24K genes, 11 pairs per probe set

C. elegans: 22.5K genes, 11 pairs per probe set

Drosophila: 13.5K genes, 14 pairs per probe set

Human HG-U133 Plus 2.0: 54K genes, 11-20 pairs per probe set.

Source: http://www.affymetrix.com/support/technical/datasheets.affx

Perfect Match vs. Mismatch

PM Probe = 25 bp probe perfectly complementary to a specific region of a gene

MM Probe = 25 bp probe agreeing with a PM apart from the middle base.

The middle base is replaced by its complement (A ⇐⇒ T, C ⇐⇒ G)

Image Courtesy of Affymetrix

Riddle of the Mismatches

Mismatches were designed to capture non-specific hybridization

Hypothesized True Signal = PM - MM

Problem: Approximately 30% of the mismatches are greater than their corresponding perfect matches.

WHY ?

PM and MM Example

Target Transcript for Human recA gene:

ctcagcttaagtcatggaattctagaggatgtatctcacaagtaggatcaag

c t c a g c t t a a g t c a t g g a a t t c t a g PM1

c t c a g c t t a a g t g a t g g a a t t c t a g MM1

t c a g c t t a a g t c a t g g a a t t c t a g a PM2

t c a g c t t a a g t c t t g g a a t t c t a g a MM2

a t t c t a g a g g a t g t a t c t c a c a a g t PM3

a t t c t a g a g g a t c t a t c t c a c a a g t MM3

a g g a t g t a t c t c a c a a g t a g g a t c a PM4

a g g a t g t a t c t c t c a a g t a g g a t c a MM4

Morals: Large Overlap of sequences and variable GC content
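
The PM/MM pairs on this slide can be reproduced with a short sketch. It assumes the mismatch is made by complementing the middle (13th) base of the 25-mer, which is what the four pairs shown do; the start offsets are read off the slide, and `mismatch_probe` and `gc_content` are hypothetical helper names.

```python
# Complement table for the four bases (lower case, as on the slide).
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def mismatch_probe(pm: str) -> str:
    """Return the MM probe: the PM sequence with its middle base complemented."""
    mid = len(pm) // 2  # index 12 for a 25-mer
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

def gc_content(seq: str) -> float:
    """Fraction of bases that are g or c."""
    return sum(base in "gc" for base in seq) / len(seq)

transcript = "ctcagcttaagtcatggaattctagaggatgtatctcacaagtaggatcaag"
starts = [0, 1, 18, 25]  # offsets of PM1..PM4, read off the slide
probes = [transcript[s:s + 25] for s in starts]

for k, pm in enumerate(probes, start=1):
    print(f"PM{k} {pm}  GC={gc_content(pm):.2f}")
    print(f"MM{k} {mismatch_probe(pm)}")
```

PM1 and PM2 differ by a one-base shift, which is the "large overlap" the slide points to.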

Other Sources of Variation

Systematic

Amount of RNA in biopsy extraction; efficiencies of RNA extraction, reverse transcription, labeling, and photodetection; GC content of probes

Similar effect on many measurements

Corrections can be estimated from data

Calibration corrections

Stochastic

PCR yield, DNA quality, spotting efficiency, spot size, non-specific hybridization, stray signal

Too random to be explicitly accounted for in a model

Noise components & “Schmutz”

Quality Control

We wish to find and eliminate problem probes before analyzing the data

Problems may be local (scratch on the array, inadequate washing) or global (background set too high)

Look at image plots, histograms, MA plots, boxplots, etc.

Contaminated Image

Image courtesy of http://www.biostat.harvard.edu/complab/dchip

Why Normalize ?

Ensure that differences in intensities are truly due to differential expression, not printing, hybridization, or scanning artifacts

Must be done before any analysis that involves comparison of intensities within or between slides

Procedures depend on the array technology

Dilution Data

Human liver tissue hybridized to human array HGU95A

Large range of proportions and dilutions

Our data hybridized at 10.0 and 20.0 µg

Two replicate arrays for each generated cRNA

Each array replicate was processed in a different scanner

For more information, see http://qlotus02.genelogic.com/datasets.nsf/

Histograms from Dilution Study

[Figure: density histograms for the dilution study arrays. x-axis: log intensity; y-axis: density]

Boxplots

[Figure: boxplots for arrays X20A, X20B, X10A, and X10B]

Small part of dilution study

M-A Plots

Plot of log fold change for gene $j$ ($M_j$) versus the average log intensity for that gene ($A_j$).

[Figure: M-A plot, 10B vs pseudo-median reference chip. x-axis: A; y-axis: M. Median: −0.535, IQR: 0.207]
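
A minimal sketch of how the M and A values for such a plot could be computed. It assumes the "pseudo-median reference chip" is the probe-wise median of the log2 intensities across all arrays; the slides do not define it, so treat that and the function name `ma_values` as assumptions.

```python
import numpy as np

def ma_values(intensities: np.ndarray, chip: int):
    """M and A for one chip against a pseudo-median reference chip.

    intensities: probes x chips matrix of raw (positive) intensities.
    """
    log_x = np.log2(intensities)
    reference = np.median(log_x, axis=1)      # assumed: per-probe median over chips
    m = log_x[:, chip] - reference            # log fold change
    a = (log_x[:, chip] + reference) / 2.0    # average log intensity
    return m, a

def m_summary(m: np.ndarray):
    """Median and interquartile range of M, as annotated on the plot."""
    q1, q3 = np.percentile(m, [25, 75])
    return float(np.median(m)), float(q3 - q1)
```

A well-behaved array should show a median of M near zero and a small IQR.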

Exploratory Data Analysis

Exploratory Data Analysis (cont’d)

Affymetrix Files

CDF file: Chip description file, describes which probes go into which probe sets

DAT file: TIFF image file, 10^7 pixels, ∼ 50 MB

CEL file: Probe intensities, ∼ 600,000 numbers

CHP file: Gene expression values as calculated by GeneChip Operating Software (GCOS)

Probe sets correspond to genes, gene fragments, or ESTs

Affymetrix DAT file

Scan of whole chip (left) and top left-hand corner (right) of Arabidopsis thaliana Genome Array.

Images courtesy of NASCA Arrays Help.

From DAT to CEL

CEL files contain fluorescence intensity values for all probe pairs and all probe sets.

Use gridding to estimate location of probe cell centers

Remove outer 36 pixels → 8 × 8 pixels

PM (MM) intensity is the 75th percentile of the 8 × 8 pixel values

Background: Average of the lowest 2% of probe cells is subtracted
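
The pixel-to-cell step can be sketched as follows, assuming each probe cell arrives as a 10 × 10 pixel block (so the outer ring is the 36 pixels the slide mentions). The function name `cell_intensity` is hypothetical, and NumPy's default percentile interpolation may differ from what the scanner software uses.

```python
import numpy as np

def cell_intensity(cell_pixels: np.ndarray) -> float:
    """75th percentile of the inner pixels of one probe cell.

    cell_pixels: 2-D block of pixel values for a single cell
    (assumed 10 x 10, so dropping the border leaves 8 x 8).
    """
    inner = cell_pixels[1:-1, 1:-1]          # remove the outer ring of pixels
    return float(np.percentile(inner, 75))   # robust to a few bright pixels

# Example: a bright border (e.g. bleed from neighbours) is ignored.
cell = np.full((10, 10), 100.0)
cell[1:-1, 1:-1] = 5.0
print(cell_intensity(cell))
```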

Analysis Tasks

Identify up- and down-regulated genes.

Find groups of genes with similar expression profiles.

Find groups of experiments (tissues) with similar expression profiles.

Find genes that explain observed differences among tissues (feature selection).

From CEL to Gene Expression

Computing expression values for each probe set requires three steps, which begin with probe-level data:

Central Dogma of Microarray Analysis:

Background correction (local vs. global)

Normalization (baseline array vs. complete data)

Summarization (single vs. multiple chips)

From CEL to Gene Expression

The “Big Four” algorithms for correcting, normalizing, and summarizing probe-level data.

Microarray Analysis Suite 5.0 (MAS5 - Affymetrix, 2001, 2003)

Model Based Expression Index (MBEI - Li and Wong, 2001a,b)

Robust Multichip Analysis (RMA - Irizarry et al., 2003)

Significance Analysis of Microarrays (SAM - Tusher, Tibshirani, and Chu, 2001)

Background Correction in MAS 5.0

Affymetrix proposed two methods: location specific adjustment and ideal mismatch.

Location Specific Adjustment:

Array is split into $K$ rectangular zones, denoted $Z_k$, $k = 1, \ldots, K$. The default for $K$ is 16.

Control cells and masked cells are not used in the calculation

Intensities within each zone are ranked and the lowest 2% is chosen as the background $b$ for that zone ($b_{Z_k}$)

The standard deviation of that lowest 2% is calculated as an estimate of the background variability $n$ for each zone ($n_{Z_k}$).

Background Correction in MAS 5.0

Result is smoothed via the following formula

$$w_k(x, y) = \frac{1}{d_k^2(x, y) + \psi}$$

The background is given by

$$b(x, y) = \frac{1}{\sum_{k=1}^{K} w_k(x, y)} \sum_{k=1}^{K} w_k(x, y)\, b_{Z_k}$$

where $d_k(x, y)$ is the Euclidean distance between chip coordinate $(x, y)$ and the center of the $k$th zone and $\psi$ is a smoothing parameter (100 by default).
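
The distance-weighted smoothing above can be sketched directly from the two formulas. The function name `smoothed_background` and the toy zone layout are illustrative assumptions; only the weighting rule and the default of 100 for the smoothing parameter come from the slide.

```python
import numpy as np

def smoothed_background(x, y, zone_centers, zone_background, psi=100.0):
    """Background b(x, y) as a weighted average of the zone backgrounds.

    zone_centers: K x 2 array-like of zone centre coordinates.
    zone_background: length-K array-like of b_{Z_k} values.
    """
    centers = np.asarray(zone_centers, dtype=float)
    b_zone = np.asarray(zone_background, dtype=float)
    d2 = (centers[:, 0] - x) ** 2 + (centers[:, 1] - y) ** 2  # squared distances
    w = 1.0 / (d2 + psi)                                      # w_k(x, y)
    return float(np.sum(w * b_zone) / np.sum(w))

# Toy example with two zones: halfway between them the weights are equal.
print(smoothed_background(5, 0, [(0, 0), (10, 0)], [10.0, 20.0]))
```

The same function applied to the zone noise values gives the smoothed noise $n(x, y)$.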

LSA Continued

Calculate a local noise background $n$ based on the standard deviation of the lowest 2% of the background in that zone ($n_{Z_k}$).

Weight $n_{Z_k}$ using the same formula as for smoothing of the background correction

Set a threshold and floor such that no value is adjusted below that threshold.

Compute the Adjusted Intensity, $A(x, y)$, via

$$A(x, y) = \max\bigl(I'(x, y) - b(x, y),\; f\, n(x, y)\bigr)$$

where $I'(x, y) = \max(I(x, y), 0.5)$, $I(x, y)$ is the cell intensity at chip coordinates $(x, y)$, and $f$ (default 0.5) is the threshold.
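
As a small worked sketch of the adjusted-intensity rule, reading the floored intensity as max(I, 0.5) of the raw cell intensity I. The function and parameter names are hypothetical.

```python
def adjusted_intensity(intensity, background, noise, noise_frac=0.5, floor=0.5):
    """Adjusted intensity A = max(I' - b, f * n), with I' = max(I, floor).

    intensity:  raw cell intensity I(x, y)
    background: smoothed background b(x, y)
    noise:      smoothed noise n(x, y)
    noise_frac: the threshold f (0.5 by default on the slide)
    """
    floored = max(intensity, floor)
    return max(floored - background, noise_frac * noise)

print(adjusted_intensity(100.0, 30.0, 4.0))  # well above background
print(adjusted_intensity(31.0, 30.0, 4.0))   # clipped to f * n
```

The second call shows why the threshold exists: subtraction alone would leave a value smaller than the local noise.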

Affy Method 2: Ideal Mismatch

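$$IM_{ij} = \begin{cases}
MM_{ij}, & MM_{ij} < PM_{ij} \\
\dfrac{PM_{ij}}{2^{SB_i}}, & MM_{ij} \ge PM_{ij} \text{ and } SB_i > \tau_c \\
\dfrac{PM_{ij}}{2^{\tau_c / \left(1 + (\tau_c - SB_i)/\tau_s\right)}}, & MM_{ij} \ge PM_{ij} \text{ and } SB_i \le \tau_c
\end{cases}$$

where $\tau_s$ is a cutoff describing the variability of the probe pairs within the probe set, and $\tau_c$ is some tolerance level.

Defaults: $\tau_c = 0.03$, $\tau_s = 10$.

Now Signal $= T_{bi}(PV_{i,1}, \ldots, PV_{i,n_i})$, where

$$PV_{i,j} = \log_2\bigl(\max(PM_{ij} - IM_{ij}, \delta)\bigr), \quad \text{for } \delta \text{ small.}$$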
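
The three-case ideal mismatch rule and the probe value can be written out as a short sketch. The specific background `sb` is taken as an input here, and the value chosen for the small constant `delta` is an assumption, since the slide only says it is small.

```python
import math

def ideal_mismatch(pm, mm, sb, tau_c=0.03, tau_s=10.0):
    """Ideal mismatch IM for one probe pair, given the probe set's SB."""
    if mm < pm:
        return mm                      # MM is usable as is
    if sb > tau_c:
        return pm / 2.0 ** sb          # scale PM down by the typical log ratio
    # SB too small to trust: shrink PM only slightly
    return pm / 2.0 ** (tau_c / (1.0 + (tau_c - sb) / tau_s))

def probe_value(pm, mm, sb, delta=2.0 ** -20):
    """PV = log2(max(PM - IM, delta)); delta is an assumed small constant."""
    return math.log2(max(pm - ideal_mismatch(pm, mm, sb), delta))

print(ideal_mismatch(100.0, 40.0, 1.0))   # MM < PM, so IM = MM
print(ideal_mismatch(100.0, 120.0, 1.0))  # MM >= PM, SB large
```

In every case IM stays below PM, so the logarithm in the probe value is always of a positive number.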

Normalization in MAS 5.0

Let $X$ be a $p \times n$ matrix with columns representing arrays and rows representing probes or probe sets.

Pick a column of $X = \log(X)$ to serve as the baseline array, say column $j$.

1. Compute the (trimmed) mean of column $j$. Call this $\tilde{X}_j$.

2. Compute the (trimmed) mean of column $i$. Call this $\tilde{X}_i$.

3. Compute $\beta_i = \tilde{X}_j / \tilde{X}_i$.

4. Multiply the elements of column $i$ by $\beta_i$.

Repeat steps 2 to 4 for all columns.
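
The scaling steps above can be sketched as follows. The 2% trim from each tail is an assumed default (the slide only says "trimmed"), and `scale_to_baseline` is a hypothetical name.

```python
import numpy as np

def trimmed_mean(values, trim=0.02):
    """Mean after dropping a fraction `trim` of values from each tail."""
    v = np.sort(np.asarray(values, dtype=float))
    k = int(trim * v.size)
    return float(v[k:v.size - k].mean())

def scale_to_baseline(X, baseline=0, trim=0.02):
    """Scale every column so its trimmed mean matches the baseline column."""
    X = np.asarray(X, dtype=float)
    target = trimmed_mean(X[:, baseline], trim)
    betas = np.array([target / trimmed_mean(X[:, i], trim)
                      for i in range(X.shape[1])])
    return X * betas, betas   # broadcasting multiplies column i by beta_i

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
normalized, betas = scale_to_baseline(X)
print(betas)
```

This is a linear, single-factor correction per array, in contrast to the non-linear and quantile methods described later.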

Summarization in MAS 5.0

A signal (expression) value is calculated by combining the probe intensities for each probe pair within a probe set.

Find a typical log ratio of PM to MM over the probe pairs $j$ in probe set $i$, known as the Specific Background

$$SB_i = T_{bi}\bigl(\log_2(PM_{ij}) - \log_2(MM_{ij}) : j = 1, \ldots, n_i\bigr)$$

where $T_{bi}$ is the Tukey Biweight.

If $SB_i$ is large, values from the probe set are used directly to construct the ideal mismatch (IM) for a probe pair.

If $SB_i$ is small (as defined by $\tau_c$), smooth MM to use more of the PM value as IM.
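
A sketch of the specific background, using a one-step Tukey biweight. The tuning constants `c = 5` and `eps = 1e-4` are assumptions (they are the values usually quoted for MAS 5.0, but the slides do not state them), so treat this as an illustration rather than the exact Affymetrix routine.

```python
import numpy as np

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that ignores far outliers."""
    x = np.asarray(values, dtype=float)
    center = np.median(x)
    mad = np.median(np.abs(x - center))          # median absolute deviation
    u = (x - center) / (c * mad + eps)           # scaled distance from centre
    w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
    return float(np.sum(w * x) / np.sum(w))

def specific_background(pm, mm):
    """SB_i: typical log2(PM/MM) over the probe pairs of one probe set."""
    return tukey_biweight(np.log2(pm) - np.log2(mm))

# The extreme value 50 receives zero weight.
print(tukey_biweight([1.0, 1.1, 0.9, 1.0, 50.0]))
```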

Content of the CHP file

Data analysis output for a Single Array Analysis includes the following:

List of probes (transcripts)

Stat Pairs: Number of probe pairs to interrogate each gene

Stat Pairs Used: Number of pairs used to calculate signal

Signal: Raw Adjusted Intensity

Detection Call: presence or absence of transcript

Detection P-value: p-value used to determine presence or absence of transcript

What is a P-value?

The probability of obtaining a test statistic as extreme as, or more extreme than, the one observed, assuming that the null hypothesis of the test is true.

For probe pairs, the null hypothesis is that there is no significant difference in intensity between PM and MM values for the same probe pair.

Absolute Analysis of One Array

Four steps to calculating presence/absence of transcripts:

1. Remove saturated probe pairs and ignore probe pairs where $PM \approx MM + \tau$ (default: $\tau = 0.015$).

2. Calculate discrimination scores ($R_i$) for each probe pair

$$R_i = \frac{PM_i - MM_i}{PM_i + MM_i}$$

3. Use Wilcoxon’s signed-rank test to calculate a p-value for each pair

4. Compare the p-value with preset significance levels as follows:

Present if $p < \alpha_1$ (default: $\alpha_1 = 0.04$).

Marginal if $\alpha_1 \le p < \alpha_2$

Absent if $p \ge \alpha_2$ (default: $\alpha_2 = 0.06$).
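
Steps 2 and 4 can be sketched without committing to a particular implementation of the signed-rank test: the p-value is taken as an input. The Marginal band is read as the interval from the first cutoff (inclusive) up to the second; function names are hypothetical.

```python
def discrimination_score(pm, mm):
    """R = (PM - MM) / (PM + MM); ranges from -1 to 1."""
    return (pm - mm) / (pm + mm)

def detection_call(p, alpha1=0.04, alpha2=0.06):
    """Map a detection p-value to Present / Marginal / Absent."""
    if p < alpha1:
        return "Present"
    if p < alpha2:
        return "Marginal"
    return "Absent"

print(discrimination_score(300.0, 100.0))
for p in (0.01, 0.05, 0.20):
    print(p, detection_call(p))
```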

Comparisons of Multiple Arrays

Let $\gamma_1$ and $\gamma_2$ be user-defined thresholds for change calls such that $0 < \gamma_1 < \gamma_2 < 1$.

$p$ = Change p-value, calculated using a signed-rank test comparing PM and MM differences for each probe pair in a probe set present on both arrays being compared.

Possible Outcomes:

Increase ($p < \gamma_1$)

Marginal Increase ($\gamma_1 \le p < \gamma_2$)

No Change ($\gamma_2 \le p \le 1 - \gamma_2$)

Marginal Decrease ($1 - \gamma_2 < p \le 1 - \gamma_1$)

Decrease ($p > 1 - \gamma_1$)

Source: http://www.wadsworth.org/genomics/microarray/
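
The five change calls partition the interval from 0 to 1, which a short sketch makes explicit. The threshold values used in the example (0.01 and 0.05) are arbitrary illustrations, not Affymetrix defaults, and the Marginal Decrease band is read as lying strictly above 1 − γ2 and up to 1 − γ1.

```python
def change_call(p, gamma1, gamma2):
    """Map a change p-value to one of the five comparison calls."""
    if not 0.0 < gamma1 < gamma2 < 1.0:
        raise ValueError("need 0 < gamma1 < gamma2 < 1")
    if p < gamma1:
        return "Increase"
    if p < gamma2:
        return "Marginal Increase"
    if p <= 1.0 - gamma2:
        return "No Change"
    if p <= 1.0 - gamma1:
        return "Marginal Decrease"
    return "Decrease"

for p in (0.001, 0.02, 0.5, 0.97, 0.995):
    print(p, change_call(p, gamma1=0.01, gamma2=0.05))
```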

Marginal Calls

What do I do with Marginal Calls?

Ignore them (treat them as absent)

Include them (treat them as present)

Include them with some probability (detection filter - McClintick et al., 2003)

Examine literature

Examine other arrays for the call of that same transcript

Rules for the inclusion of marginal calls seem to be an open research question.

Problem: Multiple Comparisons

The Type I error rate is the probability of rejecting the null hypothesis when it is true (1 - specificity).

$\alpha_1$ and $\gamma_1$ are meant to control P(Type I Error).

If $\alpha_1 = 0.04$, there are 4 chances in 100 that we will obtain a false positive result.

For absolute analysis, approximately 600,000 statistical tests are done for each array.

At $\alpha = 0.04$, we expect $600{,}000 \times 0.04 = 24{,}000$ false positive results!

Solutions: Bonferroni Adjustment, False Discovery Rate, etc.
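
The arithmetic on this slide, and the two remedies it names, can be sketched as below. The Benjamini-Hochberg step-up procedure is used here as one common way to control the false discovery rate; the slide does not say which FDR method is meant, so that choice is an assumption.

```python
import numpy as np

def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if every null hypothesis is true."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha):
    """Per-test cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of rejected hypotheses under the BH step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])   # largest rank meeting its cutoff
        reject[order[:k + 1]] = True
    return reject

print(expected_false_positives(600_000, 0.04))
print(bonferroni_threshold(600_000, 0.04))
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.9]))
```

Bonferroni is the stricter of the two; with 600,000 tests its per-test cutoff is below one in ten million.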

Model Based Expression Index

Fit the following model using multiple chips for one gene:

$$y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$$

where

$\theta_i$ is the expression index in chip $i$

$\phi_j$ is a scaling factor characterizing probe pair $j$

$\epsilon_{ij}$ are normal errors

Least squares estimates for the parameters are obtained by iteratively fitting the set of $\theta$s and $\phi$s, treating the other set as known.

Standard errors of $\theta$ used to identify array outliers

Standard errors of $\phi$ used to identify probe outliers

MBEI model can also be based on PM-only values
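
The alternating least-squares fit described above can be sketched in a few lines. The identifiability constraint used here (the squared probe factors sum to the number of probe pairs) is an assumption taken from the usual statement of the Li and Wong model; the slide does not give it, and outlier handling is left out entirely.

```python
import numpy as np

def fit_mbei(Y, n_iter=50):
    """Alternating least squares for y_ij = theta_i * phi_j + error.

    Y: chips x probe-pairs matrix of PM - MM differences for one gene.
    Returns (theta, phi) with sum(phi**2) equal to the number of probe pairs.
    """
    Y = np.asarray(Y, dtype=float)
    n_probes = Y.shape[1]
    phi = np.ones(n_probes)
    for _ in range(n_iter):
        theta = Y @ phi / (phi @ phi)              # fit theta with phi known
        phi = Y.T @ theta / (theta @ theta)        # fit phi with theta known
        phi *= np.sqrt(n_probes / (phi @ phi))     # assumed constraint
    theta = Y @ phi / (phi @ phi)
    return theta, phi

# Noise-free toy data: the fit should recover the generating values.
theta_true = np.array([100.0, 200.0, 400.0])
phi_true = np.array([0.5, 1.0, 1.5, 1.0])
phi_true *= np.sqrt(4 / np.sum(phi_true ** 2))
theta_hat, phi_hat = fit_mbei(np.outer(theta_true, phi_true))
print(theta_hat)
```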

Normalization in MBEI

Non–linear, baseline array method

1. Pick a column of $X$ to serve as the baseline array, say column $j$. For MBEI, the common baseline array is one having median overall brightness.

2. Fit a smooth non-linear relationship mapping column $i$ to the baseline. Call this $\hat{f}_i$.

3. Normalized values for column $i$ are given by $\hat{f}_i(X_i)$.

4. Repeat 2 and 3 for all columns of $X$.

Various non-linear relationships are possible: cross-validated splines, running median lines, loess smoothers, etc.
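
A rough sketch of the baseline-array idea. Two things here are stand-ins rather than the method on the slide: a low-degree polynomial (via `numpy.polyfit`) replaces the splines, running medians, or loess smoothers mentioned, and "overall brightness" is taken to be the median intensity of each array. The fit also uses every probe, with no selection of a subset.

```python
import numpy as np

def pick_baseline(X):
    """Index of the array whose median intensity is the median across arrays."""
    brightness = np.median(X, axis=0)
    return int(np.argsort(brightness)[len(brightness) // 2])

def normalize_to_baseline(X, degree=2):
    """Map each column onto the baseline with a fitted smooth curve."""
    X = np.asarray(X, dtype=float)
    j = pick_baseline(X)
    out = np.empty_like(X)
    for i in range(X.shape[1]):
        coeffs = np.polyfit(X[:, i], X[:, j], degree)  # f_i: column i -> baseline
        out[:, i] = np.polyval(coeffs, X[:, i])        # normalized column i
    return out, j

base = np.linspace(1.0, 10.0, 50)
X = np.column_stack([0.5 * base, base, 2.0 * base + 1.0])
normalized, baseline = normalize_to_baseline(X)
print(baseline)
```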

Summarization in MBEI

For each probe set $n = 1, \ldots, N_P$, fit the model

$$\log_2\bigl(y_{ij}^{(n)}\bigr) = \beta_j^{(n)} + \alpha_i^{(n)} + \epsilon_{ij}^{(n)}$$

where $\alpha_i^{(n)}$ is a probe effect and $\epsilon_{ij}^{(n)}$ are errors.

Use standard linear regression techniques to fit the model.

The estimated $\beta_j^{(n)}$ are the base 2 log expression values.

Outlier arrays, probes, and individual intensities are removed prior to summarization.

Background Correction in RMA

Assumption: $X = S + Y$

where

$X$ = observed probe-level intensity

$S \sim \mathrm{Exp}(\alpha)$ = true signal

$Y \sim TN(\mu, \sigma^2)$ = background noise

Reference: Irizarry et al., Biostatistics, 2003

RMA for the Right–Brained ...

Image courtesy of Terry Speed

Parameter Estimation

Background-corrected intensity is $E_{ij} = E(S_{ij} \mid X_{ij})$, where $i = 1, \ldots, G$ and $j = 1, \ldots, J$.

We need to estimate $\mu$, $\sigma$, and $\alpha$.

How does RMA estimate the parameters?

$\mu$ = mode of observations to the left of the overall mode

$\sigma$ = sample standard deviation for observations to the left of the overall mode

$\alpha$ = mode of observations to the right of the overall mode
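Once $\mu$, $\sigma$, and $\alpha$ are in hand, $E(S \mid X = x)$ can be evaluated directly. Under the model $X = S + Y$, the signal given $X = x$ follows a normal distribution with mean $a = x - \mu - \sigma^2\alpha$ and standard deviation $\sigma$, truncated to $(0, x)$. The sketch below evaluates the mean of that truncated normal. It assumes $\alpha$ is the exponential rate (a parameterization the slides do not spell out), and `rma_background_adjust` is a hypothetical name. Software implementations may simplify the expression.

```python
from statistics import NormalDist

def rma_background_adjust(x, mu, sigma, alpha):
    """E(S | X = x) for S ~ Exp(rate alpha) and Y ~ N(mu, sigma^2) truncated at zero."""
    std = NormalDist()  # standard normal, used for its pdf and cdf
    a = x - mu - sigma ** 2 * alpha
    b = sigma
    num = std.pdf(a / b) - std.pdf((x - a) / b)
    den = std.cdf(a / b) + std.cdf((x - a) / b) - 1.0
    return a + b * num / den

# Illustrative parameter values only; they are not estimates from real data.
bright = rma_background_adjust(200.0, mu=100.0, sigma=20.0, alpha=0.01)
dim = rma_background_adjust(50.0, mu=100.0, sigma=20.0, alpha=0.01)
```

For a bright probe the result is close to $x - \mu - \sigma^2\alpha$. For a probe below the background mean, the result stays positive, which is the practical appeal of this correction over simple subtraction.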


Normalization in RMA

Quantile Normalization Algorithm

Given $n$ arrays of length $p$, form matrix $X$ of dimension $p \times n$ where each array is a column.

Sort each column of $X$ to give $X_{\text{sort}}$.

Take the mean across rows of $X_{\text{sort}}$.

Assign this mean to each element in the row to get the quantile-equalized $X'_{\text{sort}}$.

Rearrange each column of $X'_{\text{sort}}$ to have the same ordering as the original matrix $X$ to obtain $X_{\text{normalized}}$.
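The quantile normalization steps translate almost line for line into code. A minimal sketch in plain Python, assuming no missing values. Ties are broken by original row order rather than averaged, which is a simplification. `quantile_normalize` is a hypothetical name.

```python
def quantile_normalize(x):
    """x: list of p rows (probes), each a list of n values (one per array)."""
    p, n = len(x), len(x[0])
    columns = [[x[r][c] for r in range(p)] for c in range(n)]
    # orders[c][k] is the row index of the k-th smallest value in column c.
    orders = [sorted(range(p), key=col.__getitem__) for col in columns]
    # Mean across each row of the column-sorted matrix.
    row_means = [sum(columns[c][orders[c][k]] for c in range(n)) / n
                 for k in range(p)]
    # Put each mean back at the position its rank came from.
    normalized = [[0.0] * n for _ in range(p)]
    for c in range(n):
        for k, r in enumerate(orders[c]):
            normalized[r][c] = row_means[k]
    return normalized

# Toy data: 3 probes (rows) on 2 arrays (columns).
x = [[2.0, 8.0],
     [5.0, 4.0],
     [4.0, 14.0]]
x_normalized = quantile_normalize(x)
```

After normalization every column contains the same set of values (3.0, 6.0, 9.5), arranged according to the original ranks within that column.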


Summarization in RMA

Median Polish Algorithm (Tukey 1977, Bolstad 2004)

Fits the following model

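$$\log_2\left(y_{ij}^{(n)}\right) = \mu^{(n)} + \theta_j^{(n)} + \alpha_i^{(n)} + \epsilon_{ij}^{(n)}$$

with constraints

$$\operatorname{median}(\theta_j) = \operatorname{median}(\alpha_i) = 0$$

$$\operatorname{median}_i(\epsilon_{ij}) = \operatorname{median}_j(\epsilon_{ij}) = 0.$$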


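Median Polish Algorithm

Form a matrix for each probe set $n$ such that the probes are in rows and the arrays are in columns.

Add a row and a column to give a matrix of the form:

$$\begin{matrix} e_{11} & \cdots & e_{1 N_A} & a_1 \\ \vdots & \ddots & \vdots & \vdots \\ e_{I_n 1} & \cdots & e_{I_n N_A} & a_{I_n} \\ b_1 & \cdots & b_{N_A} & m \end{matrix}$$

where, initially, $e_{ij} = \log_2\left(y_{ij}^{(n)}\right)$ and $a_i = b_j = m = 0$.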


Median Polish (continued)

Take the median of each row (across columns), subtract it from each element in that row, and add it to the final column.

Take the median of each column (across rows), subtract it from each element in that column, and add it to the final row.

Continue until the changes become small or zero.

In conclusion: $\hat{\mu} = m$, $\hat{\theta}_j = b_j$, and $\hat{\alpha}_i = a_i$.
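The sweeps can be written out as follows. This is a sketch rather than the Bioconductor implementation: `median_polish`, `max_iter`, and `tol` are hypothetical names, the input is assumed to be on the log2 scale already, and the stopping rule (all medians below `tol`) is one reasonable reading of "changes become small".

```python
from statistics import median

def median_polish(e, max_iter=10, tol=1e-8):
    """e[i][j]: log2 intensity of probe i on array j. Returns (m, a, b, residuals)."""
    resid = [list(row) for row in e]
    n_rows, n_cols = len(resid), len(resid[0])
    a = [0.0] * n_rows   # final column: probe (row) effects
    b = [0.0] * n_cols   # final row: array (column) effects
    m = 0.0              # corner cell: overall effect
    for _ in range(max_iter):
        # Row sweep, including the final row (b, m) of the bordered matrix.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            resid[i] = [v - row_med[i] for v in resid[i]]
            a[i] += row_med[i]
        shift = median(b)
        b = [v - shift for v in b]
        m += shift
        # Column sweep, including the final column (a, m).
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for i in range(n_rows):
            resid[i] = [v - col_med[j] for j, v in enumerate(resid[i])]
        b = [bj + cj for bj, cj in zip(b, col_med)]
        shift = median(a)
        a = [v - shift for v in a]
        m += shift
        if max(abs(v) for v in row_med + col_med) < tol:
            break
    return m, a, b, resid

# Toy probeset, exactly additive: m = 10, a = (-1, 0, 2), b = (-3, 0, 1).
e = [[6.0, 9.0, 10.0],
     [7.0, 10.0, 11.0],
     [9.0, 12.0, 13.0]]
m, a, b, resid = median_polish(e)
```

Under the model above, the fitted log2 expression value for array $j$ is then $\hat{\mu} + \hat{\theta}_j$, that is, `m + b[j]`.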


Significance Analysis of Microarrays

Algorithm to determine “significantly” expressed genes

Original article mentions use of GeneChip Analysis Suite software for background correction, normalization, and summarization.

Assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements.

If the score exceeds a threshold, use permutations of repeated measurements to estimate the percentage of genes identified by chance.

More Information: http://www-stat.stanford.edu/~tibs/SAM/.
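The per-gene score can be illustrated for a two-group comparison. The sketch follows the form of the statistic in Tusher et al. (2001): the difference in group means divided by a pooled standard error plus a small constant `s0`. The function name `sam_score` is hypothetical, the default `s0=0.0` is only a placeholder (the choice of `s0` is not covered here), and the permutation step is not shown.

```python
from math import sqrt

def sam_score(group1, group2, s0=0.0):
    """Relative difference d = (mean1 - mean2) / (s + s0) for one gene."""
    n1, n2 = len(group1), len(group2)
    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2
    # Pooled within-group sum of squares over the repeated measurements.
    ss = (sum((v - mean1) ** 2 for v in group1)
          + sum((v - mean2) ** 2 for v in group2))
    s = sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (mean1 - mean2) / (s + s0)

# Toy log2 expression values for one gene in two conditions.
d = sam_score([10.0, 11.0, 12.0], [7.0, 8.0, 9.0])
```

A positive `s0` keeps genes with very small variance from producing extreme scores.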


Microarray Software

Open Source

Bioconductor: Calculates RMA, MBEI, MAS5, http://www.bioconductor.org

dChip (MBEI only, http://biosun1.harvard.edu/complab/dchip/)

Significance Analysis of Microarrays (SAM)

Generalized Probe Model (GPM; Fan et al., 2005, http://qge.fhcrc.org/probeplus)

Commercial

GCOS, MAS 5.0 (Affymetrix)

S-Plus ArrayAnalyzer: Calculates RMA, MBEI, MAS5*

Iobion GeneTraffic: RMA, MBEI, MAS5*


References

1. Affymetrix, Inc. (2001). "Statistical Algorithms Reference". Data Analysis Fundamentals Technical Manual, Chapter 5. www.affymetrix.com.

2. Affymetrix Technical Note: Design and Performance of the GeneChip Human Genome U133 Plus 2.0 and Human Genome U133A Plus 2.0 Arrays (2003). www.affymetrix.com.

3. Affymetrix, Inc. (2002). Statistical Algorithms Description Document. www.affymetrix.com.

4. Bolstad, Ben (2004). Low Level Analysis of High-density Oligonucleotide Array Data: Background, Normalization and Summarization. Dissertation. University of California, Berkeley.

5. Fan W, Pritchard JI, Olson JM, Khalid N, and Zhao LP (2005). A class of models for analyzing gene expression analysis array data. BMC Genomics, 6:16, http://www.biomedcentral.com/1471-2164/6/16/.

6. Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. P. (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31 (4): e15.

7. Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., and Speed, T. P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4: 249–264.


References Continued

8. Li, C. and Wong, W. H. (2001). Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proceedings of the National Academy of Sciences, 98 (1): 31-36.

9. Li, C. and Wong, W. H. (2001). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology, 2 (8): research0032.1-0032.11.

10. McClintick JN, Jerome RE, Nicholson CR, Crabb DW, Edenberg HJ (2003). Reproducibility of oligonucleotide arrays using small samples. BMC Genomics, 4:4, http://www.biomedcentral.com/1471-2164/4/4.

11. Naef, F and Magnasco (2003). Solving the riddle of the bright mismatches: Labeling and effective binding in oligonucleotide arrays. Physical Review, 68.

12. Tukey JW (1977). Exploratory Data Analysis. Addison-Wesley, Reading, Massachusetts.

13. Tusher VG, Tibshirani R and Chu G (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences 98: 5116-5121 (Apr 24).
