1

Statistical methods for interpreting microarray data

Terry Speed
Department of Statistics, UC Berkeley
Walter & Eliza Hall Institute of Medical Research

Workshop on Molecular and Statistical Genomic Epidemiology
Paris, May 9-11, 2005

2

My plan today: two illustrations (of Statistical methods for…)

Low level analysis: calling genotypes from Affymetrix SNP chip data.

Similar projects are underway for analysing chip data for DNA copy number determination, DNA resequencing, whole genome tiling arrays for global expression and ChIP-chip studies, and whole genome exon arrays for exon and gene-level expression. Also QA/QC.

Higher level analysis: one experiment to identify genes involved in the host response to Leishmania major, not atypical of the special experiments we do.

3

No time to mention today

Middle level analysis, many examples, e.g. the summarization and ranking of genes using microarray time course data.

4

The Affymetrix SNP Chip

[Figure: a 1.28 cm × 1.28 cm array with > 100,000 features; each feature is 8 µm × 8 µm and holds > 1 million identical 25 bp probes.]

5

Affymetrix SNP chip terminology

Genomic DNA: TAGCCATCGGTANGTACTCAATGAT (N marks a SNP)

Perfect Match probe for Allele A: ATCGGTAGCCATTCATGAGTTACTA

Perfect Match probe for Allele B: ATCGGTAGCCATCCATGAGTTACTA

Genomic DNA with the SNP position left blank: GTAGCCATCGGTA GTACTCAATGAT
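The relationship between the genomic sequence and the two perfect match probes on this slide can be checked in a few lines. This is a minimal sketch, assuming (as the sequences shown suggest) that each probe is the base-by-base complement of the genomic strand, and that N takes the values A and G, the complements of the T and C seen in the two probes. The function name `pm_probe` is made up for illustration.

```python
# Base-by-base complement table for DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def pm_probe(genomic: str, allele: str) -> str:
    """Complement of the genomic sequence with N replaced by the given allele."""
    return genomic.replace("N", allele).translate(COMPLEMENT)

genomic = "TAGCCATCGGTANGTACTCAATGAT"
pm_a = pm_probe(genomic, "A")  # probe for allele A
pm_b = pm_probe(genomic, "G")  # probe for allele B
print(pm_a)
print(pm_b)
```

Both results are 25 bases long and differ only at the central (13th) base.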

6

Affymetrix SNP probe tiling strategy, 1

[Figure: SNP tiling strategy at the SNP 0 position. The genomic sequence TAGCCATCGGTA N GTA C TCAATGATCAGCT carries an A / G SNP at N. The central probe quartet holds a PM and an MM probe for allele A and for allele B; the PM probes are ATCGGTAGCCAT T CAT G AGTTACTA (allele A) and ATCGGTAGCCAT C CAT G AGTTACTA (allele B).]

7

Affymetrix SNP probe tiling strategy, 2

[Figure: SNP tiling strategy at the +4 position. The genomic sequence is TAGCCATCGGTA N GTA C TCAATGATCAGCT, with the A / G SNP at N. The +4 offset probe quartet holds a PM and an MM probe for allele A and for allele B; the probe fragments shown are GTAGCCAT T or GTAGCCAT C, followed by CAT G AGTTACTAGTCG or CAT C AGTTACTAGTCG.]

8

Affymetrix SNP probe tiling strategy, 3

Quartet   7     6     5     4     3     2     1
          MMB   MMB   MMB   MMB   MMB   MMB   MMB
          PMB   PMB   PMB   PMB   PMB   PMB   PMB
          MMA   MMA   MMA   MMA   MMA   MMA   MMA
          PMA   PMA   PMA   PMA   PMA   PMA   PMA

The columns are labelled as the central quartet, with offset quartets on either side of it.

Repeated on the opposite strand: 56 probes in all. More recently, 40: just 4 offset quartets instead of 6.
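The probe counts quoted on this slide follow from simple enumeration. The sketch below lists one label per probe for a single SNP; the tuple layout and the name `probe_set` are illustrative choices, not Affymetrix terminology.

```python
from itertools import product

def probe_set(n_quartets: int):
    """Enumerate (strand, quartet, allele, probe type) labels for one SNP."""
    strands = ["+", "-"]
    quartets = range(1, n_quartets + 1)
    alleles = ["A", "B"]
    kinds = ["PM", "MM"]
    return list(product(strands, quartets, alleles, kinds))

print(len(probe_set(7)))  # 1 central + 6 offset quartets: 56
print(len(probe_set(5)))  # 1 central + 4 offset quartets: 40
```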

9

Affymetrix SNP identification

[Figure: fake (idealized) image for 3 samples on one SNP, with genotypes AA, AB and BB.]

The current vendor-supplied genotype-calling algorithm, DM, seeks the best-fitting pattern of the above kind, including no call (NC). It is a mix of normal likelihood-based model selection and a Wilcoxon test. There is no training, and it is single chip.

10

DM (no NCs) vs HapMap

Rows: DM calls. Columns: HapMap calls.

DM \ HapMap        AA        AB        BB        NC
AA            339,502     1,249         9     1,420
AB                457   355,168       544     1,745
BB                 25     1,132   327,415     1,452

11,446 SNPs, 90 samples
99.67% concordance (both called)
3,416 discordant calls
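The two summary figures on this slide can be recomputed from the table. The sketch below takes rows as DM calls and columns as HapMap calls, which is an assumption about the orientation of the flattened table; the result does not depend on it, since only the diagonal and the total are used. The HapMap no-call column is left out because concordance is quoted for cases where both sources made a call.

```python
import numpy as np

# Rows: DM calls (AA, AB, BB); columns: HapMap calls (AA, AB, BB).
dm_vs_hapmap = np.array([
    [339_502,   1_249,       9],
    [    457, 355_168,     544],
    [     25,   1_132, 327_415],
])

def concordance(table):
    """Return (discordant count, fraction concordant) for a square call table."""
    total = table.sum()
    agree = np.trace(table)  # diagonal = calls on which both sources agree
    return int(total - agree), agree / total

discordant, rate = concordance(dm_vs_hapmap)
print(discordant, f"{rate:.2%}")  # 3416 99.67%
```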

11

Why attempt an improvement over DM?

• Perhaps the error rate is too high?

• There is reason to believe it can be improved by
  a) using the training/test set paradigm;
  b) carrying out multi-chip analyses, which identify and exploit probe behaviour; and
  c) exploiting the massive parallelism across SNPs.

• The 100K SNPs were selected from a much larger screening set using DM. For the 500K and >1M SNP chips, a higher yield is desirable, and perhaps a better genotype-calling algorithm could achieve this.

12

Robust Linear Model with the Mahalanobis distance classifier

• RLMM, pronounced “REALM”
• Based on an RMA-like model
  – Uses PM only
  – Linear additive multi-chip model on log scale
  – A- and B-probe and chip effects
  – Robustly estimated parameters

• Classification using Mahalanobis’ distance

13

RLMM: single SNP, multi-chip model

For SNP n we fit the following models for the A- and B-probes to quantile-normalized PM values $y_{i,A,j}^{(n)}$ and $y_{i,B,j}^{(n)}$:

$$\log_2\bigl(y_{i,A,j}^{(n)}\bigr) = \theta_{A,i}^{(n)} + \beta_{A,j}^{(n)} + \varepsilon_{ij},$$

$$\log_2\bigl(y_{i,B,j}^{(n)}\bigr) = \theta_{B,i}^{(n)} + \beta_{B,j}^{(n)} + \varepsilon_{ij},$$

where $\theta_{A,i}^{(n)}$ and $\theta_{B,i}^{(n)}$ are the A- and B-effects for sample i, and $\beta_{A,j}^{(n)}$ and $\beta_{B,j}^{(n)}$ are the relative probe affinities, subject to $\sum_j \beta_{A,j}^{(n)} = \sum_j \beta_{B,j}^{(n)} = 0$.

As errors are likely to be contaminated due to outlier probes, we use a robust linear model to estimate the parameters.
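To illustrate how an additive sample-plus-probe model can be fitted robustly, here is a minimal sketch using Tukey's median polish, the fitting procedure used by RMA. It is a stand-in chosen for brevity: the slides say only that a robust linear model is used, not which estimator. The data are made up, with one deliberately contaminated probe value.

```python
import numpy as np

def median_polish(log_y, n_iter=10):
    """Fit log_y[i, j] ~ theta[i] + beta[j] by alternating row and column medians.

    Rows are samples, columns are probes. Returns (theta, beta, residuals),
    with beta shifted so that it sums to zero.
    """
    resid = np.array(log_y, dtype=float)
    theta = np.zeros(resid.shape[0])
    beta = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        row_med = np.median(resid, axis=1)
        theta += row_med
        resid -= row_med[:, None]
        col_med = np.median(resid, axis=0)
        beta += col_med
        resid -= col_med[None, :]
    shift = beta.mean()  # enforce the sum-to-zero constraint on beta
    return theta + shift, beta - shift, resid

# Made-up A-probe data: 4 samples by 5 probes, exactly additive plus one outlier.
theta_true = np.array([10.0, 11.0, 12.0, 9.0])
beta_true = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
log_y = theta_true[:, None] + beta_true[None, :]
log_y[0, 0] += 5.0  # a single outlier probe value

theta_hat, beta_hat, resid = median_polish(log_y)
print(theta_hat)
print(resid[0, 0])
```

In this toy case the outlier ends up in the residual rather than distorting the sample effect, which is the behaviour the slide asks of a robust fit. The same call would be repeated for the B-probes.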

14

RLMM: outline of the algorithm

1. Quantile normalize PM intensities across chips.
2. For SNP n, obtain estimates of $(\theta_{A,i}^{(n)}, \theta_{B,i}^{(n)})$ for each sample i in the training set using the previous model.
3. Estimate the mean vectors $(\mu_{AA}^{(n)}, \mu_{AB}^{(n)}, \mu_{BB}^{(n)})$ and covariance matrices $(\Sigma_{AA}^{(n)}, \Sigma_{AB}^{(n)}, \Sigma_{BB}^{(n)})$ of the 2-dimensional vectors $(\theta_{A,i}^{(n)}, \theta_{B,i}^{(n)})$ using samples from the AA, AB and BB groups in the training set.
4. Obtain estimates $(\theta_{A,i}^{(n)}, \theta_{B,i}^{(n)})$ for each sample i in the test set.
5. Classify each sample in the test set to the genotype group closest to it in Mahalanobis’ distance.
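Step 1 can be sketched as follows. This is the usual textbook form of quantile normalization (sort each chip, average across chips at each rank, then map the averages back), with ties broken by position for simplicity; the slides do not spell out the exact variant used. The small matrix is made up.

```python
import numpy as np

def quantile_normalize(x):
    """Give every column (chip) of x the same empirical distribution.

    x has one row per probe and one column per chip.
    """
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)  # rank of each value within its chip
    mean_sorted = np.sort(x, axis=0).mean(axis=1)      # average intensity at each rank
    return mean_sorted[ranks]

pm = np.array([[5.0, 4.0, 3.0],
               [2.0, 1.0, 4.0],
               [3.0, 4.0, 6.0],
               [4.0, 2.0, 8.0]])
pm_norm = quantile_normalize(pm)
print(pm_norm)
```

After normalization every column contains the same set of values, and the ordering of probes within each chip is unchanged.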

15

Mahalanobis’ distance

Introduced by P.C. Mahalanobis in 1936. A Euclidean-type metric which takes into account the variances and covariances, here $\Sigma_g$, between the components $\theta_A$ and $\theta_B$ of $\theta = (\theta_A, \theta_B)$:

$$D_g^2(\theta) = (\theta - \mu_g)' \, \Sigma_g^{-1} \, (\theta - \mu_g),$$

where $D_g^2(\theta)$ is the generalized squared distance of the θ vector from the mean $\mu_g$ of genotype group g = AA, AB or BB. We choose the g with smallest $D_g^2(\theta)$.

Note: we are not using ^’s to designate estimates, trusting to context.
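The classification rule can be sketched in a few lines of NumPy. The training points below are synthetic (made-up cluster centres and spread, for illustration only), and the helper names `fit_groups`, `mahalanobis_sq` and `classify` are hypothetical, not taken from any RLMM software.

```python
import numpy as np

def fit_groups(theta, labels):
    """Mean vector and covariance matrix of the (theta_A, theta_B) pairs per genotype."""
    theta = np.asarray(theta, dtype=float)
    labels = np.asarray(labels)
    return {
        str(g): (theta[labels == g].mean(axis=0),
                 np.cov(theta[labels == g], rowvar=False))
        for g in np.unique(labels)
    }

def mahalanobis_sq(theta, mu, sigma):
    """Squared distance (theta - mu)' Sigma^{-1} (theta - mu)."""
    d = np.asarray(theta, dtype=float) - mu
    return float(d @ np.linalg.solve(sigma, d))

def classify(theta, groups):
    """Genotype whose group has the smallest squared Mahalanobis distance."""
    return min(groups, key=lambda g: mahalanobis_sq(theta, *groups[g]))

# Synthetic training set: 20 samples per genotype around made-up centres.
rng = np.random.default_rng(0)
centres = {"AA": (12.0, 9.0), "AB": (11.0, 11.0), "BB": (9.0, 12.0)}
train_theta = np.vstack([rng.normal(c, 0.2, size=(20, 2)) for c in centres.values()])
train_labels = np.repeat(list(centres), 20)

groups = fit_groups(train_theta, train_labels)
print(classify([11.8, 9.2], groups))
```

Solving the linear system rather than inverting the covariance matrix explicitly is a numerical convenience, not something the slides prescribe.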

16

From raw intensities to θ values: AA

[Figure: SNP 5 data from 13 AA samples (horizontally), for the probe groups PMA+, PMA-, PMB+ and PMB-. Annotations point out a relatively low (high) intensity probe and a relatively dim chip.]

17

From raw intensities to θ values: AB

[Figure: SNP 5 data from 39 AB samples, for the probe groups PMA+, PMA-, PMB+ and PMB-.]

18

From raw intensities to θ values: BB

[Figure: SNP 5 data from 75 BB samples, for the probe groups PMA+, PMA-, PMB+ and PMB-.]

19

SNP 5: θ- and residual plots

[Figure: plot of the (θA, θB) pairs for SNP 5, with genotype groups AA, AB and BB, alongside the corresponding residual plot.]

Every sample has its (θA, θB) pair: plot them! Do likewise for the residuals in the fitted model.

Residuals are useful for QC; here skewed because of + strand failure.

Similar plots are used by AB, Chemicon and Illumina.

New sample points are assigned to the “closest” genotype.

20

SNP 200655: θ- and residual plots

Plots for a more satisfactory SNP.

21

[Figure: four panels, one per SNP: A-1706313 (DM NoCalls = 10%), A-1659973 (NoCalls = 23%), A-1726964 (NoCalls = 19%) and A-1657538 (DM NoCalls = 6%).]

Here are four SNPs with some harder calls: the genotype groups are closer together and internally more straggly.

The DM default makes NCs on these.

22

Empirical Bayes Multi-SNP model

Averaging of genotype centers $(\mu_{AA}^{(n)}, \mu_{AB}^{(n)}, \mu_{BB}^{(n)})$ and covariance matrices $(\Sigma_{AA}^{(n)}, \Sigma_{AB}^{(n)}, \Sigma_{BB}^{(n)})$ across SNPs n leads to

• an empirically estimated conjugate Gaussian prior,
• giving prior estimates of genotype means and covariance matrices for all SNPs,
• which, when combined with the data for a particular SNP, gives
• better estimates of genotype group means and covariance matrices, and hence better genotypic assignments for that SNP.

Main benefit: better genotype prediction when there are few or no training samples with a given genotype.

23

RLMM (no NCs) vs HapMap

Rows: RLMM calls. Columns: HapMap calls.

RLMM \ HapMap      AA        AB        BB        NC
AA            339,756       476        12     1,440
AB                196   356,575       184     1,699
BB                 32       498   327,772     1,478

11,446 SNPs, 90 samples, LOOCV
99.86% concordance (both called)
1,398 discordant calls

24

Availability

A version of RLMM will go into the open source R-based Bioconductor package before the end of this summer.

25

Leishmaniasis

26

[Figure panels: BALB/c, C57BL/6]

27

L. major response loci in mice

• lmr1 Chromosome 17
  – MHC region
  – BALB/c susceptible

• lmr2 Chromosome 9
  – BALB/c susceptible

• lmr3 X Chromosome
  – C57BL/6 susceptible in the presence of BALB/c homozygosity at lmr1

28

lmr1, lmr2, and lmr3 affect the course of disease

[Figure: average lesion score (0 to 5) against week post infection (2 to 12) for B/c.lmr3, BALB/c, B/c.lmr1 and B/c.lmr2. * p < 0.05]

29

Compound congenics

C.lmr1/2
• BALB/c background
• lmr1 and lmr2 from C57BL/6
• Predict: more resistant than BALB/c

B6.lmr1/2
• C57BL/6 background
• lmr1 and lmr2 from BALB/c
• Predict: more susceptible than C57BL/6

30

Course of infection in strains congenic for lmr loci

[Figure: average lesion score (0 to 5) against weeks post infection (2 to 12) for B/c.lmr3, BALB/c, B/c.lmr1, B/c.lmr2, B/c.lmr1/2, B6.lmr1/2, B6.lmr1, B6.lmr2 and C57BL/6.]

31

Summary of challenge infections

• All three loci confirmed to play a role in response to L. major infection

• Having all three resistance alleles (C.lmr1/2/3) or all three susceptibility alleles (B6.lmr1/2/3) does NOT recapitulate the parental phenotype in every mouse

• There are possibly other genes involved

32

Infected macrophages

[Figure panels: C57BL/6, B6.lmr1/2]

33

Design of microarray experiment

[Figure: design diagram with eight boxes: C57BL/6 uninfected, C57BL/6 infected, B6.lmr1/2 uninfected, B6.lmr1/2 infected, BALB/c uninfected, BALB/c infected, B/c.lmr1/2 uninfected and B/c.lmr1/2 infected.]

Boxes indicate bone marrow derived macrophage samples arrayed on Affymetrix chips; red arrows indicate comparisons of interest.

34

Uninfected B6.lmr1/2 vs uninfected C57BL/6

• 83 genes t* > 5
  – Antigen presentation
  – Receptors
  – Cell surface
  – Chemokines
  – Inflammatory response
  – Cytoskeleton
  – Extracellular matrix

• 9 genes in C57BL/6
  – Cell cycle
  – Mitochondrial
  – Signal transduction
  – Transcription factors

*Analysis carried out with RMA and limma, t here denoting the moderated Student t-statistic; qq-plots also used.
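For readers unfamiliar with the moderated t-statistic, the sketch below shows its form for a two-group comparison of one gene: the gene's pooled variance is shrunk towards a prior variance before the usual t ratio is formed. In limma the prior degrees of freedom and prior variance are estimated from all genes; here `d0` and `s0_sq` are supplied by hand as hypothetical values, and the expression values are made up.

```python
import numpy as np

def moderated_t(x1, x2, d0, s0_sq):
    """Two-group t-statistic with the variance shrunk towards s0_sq.

    d0 is the prior degrees of freedom; d0 = 0 gives the ordinary pooled t.
    """
    n1, n2 = len(x1), len(x2)
    d = n1 + n2 - 2
    s_sq = ((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1)) / d
    s_tilde_sq = (d0 * s0_sq + d * s_sq) / (d0 + d)  # shrunken variance
    return (np.mean(x1) - np.mean(x2)) / np.sqrt(s_tilde_sq * (1 / n1 + 1 / n2))

group_a = [1.0, 2.0, 3.0]  # made-up log2 expression values
group_b = [4.0, 5.0, 6.0]
t_ordinary = moderated_t(group_a, group_b, d0=0.0, s0_sq=1.0)
t_moderated = moderated_t(group_a, group_b, d0=4.0, s0_sq=4.0)
print(t_ordinary, t_moderated)
```

Here the prior variance is larger than the gene's own, so the moderated statistic is smaller in magnitude than the ordinary one.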

35

Genes differently differentially expressed*

• Over 20 genes common to both arms of the experiment
• Some immunological genes and others
• Genes involved in tissue remodelling, wound repair and extracellular matrix deposition
  – Metalloproteinases
  – Cytokines involved in extracellular matrix deposition
  – Collagens

Hypothesis: wound repair is important

*Again analysis done in limma, this time a 2×2 factorial analysis.
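One reading of "differently differentially expressed" in a 2×2 factorial design is a non-zero strain-by-infection interaction: the response to infection differs between the congenic and the parental strain. The sketch below computes that contrast from group means for one gene. It is an interpretation rather than the authors' stated procedure, the values are made up, and limma would estimate the same contrast within a linear model with moderated statistics.

```python
import numpy as np

def interaction_contrast(parent_uninf, parent_inf, congenic_uninf, congenic_inf):
    """Difference between the infection responses of the two strains (log2 scale)."""
    parent_response = np.mean(parent_inf) - np.mean(parent_uninf)
    congenic_response = np.mean(congenic_inf) - np.mean(congenic_uninf)
    return congenic_response - parent_response

# Made-up log2 expression values for one gene, two arrays per group.
contrast = interaction_contrast(
    parent_uninf=[7.0, 7.0], parent_inf=[9.0, 9.0],
    congenic_uninf=[7.0, 7.0], congenic_inf=[8.0, 8.0],
)
print(contrast)  # -1.0: infection induces the gene less in the congenic strain
```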

36

Is a lesion a wound which fails to heal?

37

Rate of wound healing

[Figure: lesion size (mm, 0 to 2.5) against days (0 to 11) for BALBc, C.lmr1/2, C57BL6 and BL6.lmr1/2.]

38

Collagen bundles in congenics

[Figure: image grid with columns C57BL/6, B6.Clmr1/2, C.B6lmr1/2 and BALB/c, and rows for uninfected punch biopsies and L. major infected.]

39

Conclusions

• lmr1, lmr2 and lmr3 affect progression of disease
• Expression of Th1/Th2 cytokines is not mediated by lmr1, lmr2, or lmr3 loci at any time during infection (not shown)
• Early difference in cytokine response not seen (not shown)
• Microarray analysis of macrophages has identified genes involved in wound healing as being important.
• Wound healing experiments show that collagen deposition is indeed different between congenics and parentals.

40

Acknowledgements

Nusrat Rabbee, UCB
Simon Cawley, Affymetrix

Simon Foote
Emanuela Handman
Colleen Elso
Lynden Roberts
Anuratha Sakthiandeswaren
Joan Curtis
Denise Bullen
Beena Kumar
Lynn Buckingham
Fleur Rodda
Claire, Kerry and Melissa (Kew)
Tracey Baldwin

Funding: HHMI, NIH, NHMRC, Gene CRC, NSF

Gordon Smyth
Russell Thompson
Ken Simpson

All WEHI

42

DM vs HapMap (no NCs)

Rows: DM calls. Columns: HapMap calls.

DM \ HapMap        AA        AB        BB        NC
AA            339,502     1,249         9     1,420
AB                457   355,168       544     1,745
BB                 25     1,132   327,415     1,452

11,446 SNPs, 90 samples
99.67% concordance (valid calls)
3,416 discordant calls

43

Comparison Table: RLMM vs DM

(n = 11,446 SNPs)
99.7% concordance
Total discordant calls: 2866

Rows: DM calls. Columns: RLMM calls.

DM \ RLMM          AA        AB        BB
AA            341,211       945        24
AB                445   356,899       592
BB                 28       832   329,164