

ORIGINAL ARTICLE

Genomic DNA pooling for whole-genome association scans in complex disease: empirical demonstration of efficacy in rheumatoid arthritis

S Steer1,10, V Abkevich2,10, A Gutin2, HJ Cordell3, KL Gendall4, ME Merriman4, RA Rodger4, KA Rowley4, P Chapman5, P Gow6, AA Harrison7, J Highton8, PBB Jones9, J O’Donnell5, L Stamp5, L Fitzgerald2, D Iliev2, A Kouzmine2, T Tran2, MH Skolnick2, KM Timms2, JS Lanchbury2 and TR Merriman4

1Kings College London School of Medicine at Guy’s, Department of Rheumatology, King’s and St Thomas’, London, UK; 2Myriad Genetics Inc., Salt Lake City, UT, USA; 3Institute of Human Genetics, University of Newcastle, Newcastle, UK; 4Department of Biochemistry, University of Otago, Dunedin, New Zealand; 5Department of Rheumatology, Christchurch Hospital, Christchurch, New Zealand; 6Department of Rheumatology, Middlemore Hospital, Auckland, New Zealand; 7Wellington School of Medicine, University of Otago, Wellington, New Zealand; 8Otago School of Medicine, University of Otago, Dunedin, New Zealand and 9Department of Rheumatology, QE Hospital, Rotorua, New Zealand

A pragmatic approach that balances the benefit of a whole-genome association (WGA) experiment against the cost of individual genotyping is to use pooled genomic DNA samples. We aimed to determine the feasibility of this approach in a WGA scan in rheumatoid arthritis (RA) using the validated human leucocyte antigen (HLA) and PTPN22 associations as test loci. A total of 203 269 single-nucleotide polymorphisms (SNPs) on the Affymetrix 100K GeneChip and Illumina Infinium microarrays were examined. A new approach to the estimation of allele frequencies from Affymetrix hybridization intensities was developed involving weighting for quality signals from the probe quartets. SNPs were ranked by z-scores, combined from United Kingdom and New Zealand case–control cohorts. Within a 1.7 Mb HLA region, 33 of the 257 SNPs and at PTPN22, 21 of the 45 SNPs, were ranked within the top 100 associated SNPs genome wide. Within PTPN22, individual genotyping of SNP rs1343125 within MAGI3 confirmed association and provided some evidence for association independent of the PTPN22 620W variant (P = 0.03). Our results emphasize the feasibility of using genomic DNA pooling for the detection of association with complex disease susceptibility alleles. The results also underscore the importance of the HLA and PTPN22 loci in RA aetiology.
Genes and Immunity (2007) 8, 57–68. doi:10.1038/sj.gene.6364359; published online 7 December 2006

Keywords: genome scan; association; DNA; pooling

Introduction

Rheumatoid arthritis (RA) is a chronic debilitating autoimmune disease caused by inflammation of synovial tissue. Although it clearly has a genetic basis,1 the genetic causes of RA remain poorly defined. Until now, most insight into genetic aetiology has come from the study of functional candidate genes. Genetic association with alleles of the class II antigen-presenting molecule human leucocyte antigen (HLA)-DRB1 on chromosome 6p has been established for decades (odds ratios (ORs) = 2.5–3.0), with the shared epitope, defined mainly by subtypes of DRB1*04 and *01, prominent in Caucasians.2 Recently, the 620W allele of the PTPN22 gene (which encodes the lymphoid tyrosine phosphatase) has been confirmed as a determinant of RA by extensive replication of association in Caucasian patient cohorts.3 Other genes are implicated in RA susceptibility, with CTLA4 and PADI4 the closest to being ‘confirmed’,4,5 although their effect (OR = 1.1–1.3) is less than that of PTPN22 (OR = 1.5–2.0).

Microarray-based technology to enable whole-genome scanning for association (WGA) has evolved to the point where this approach to elucidating the genetic basis for common disease has become feasible.6–8 By the simultaneous genotyping of hundreds of thousands of single-nucleotide polymorphisms (SNPs) using the widely available Affymetrix and Illumina technologies, WGA scanning offers the promise of disease gene discovery through linkage disequilibrium (LD) to causal DNA changes. Although the optimal study design for a WGA experiment is a matter for debate, identification and validation of the genes encoding complement factor H, insulin-induced gene 2 and interferon-induced helicase as determinants of age-related macular degeneration, body mass index and type 1 diabetes, respectively,9–11 do provide confidence that this approach can be widely applied to complex disease.

Received 2 August 2006; revised and accepted 25 October 2006; published online 7 December 2006

Correspondence: Dr TR Merriman, Biochemistry Department, 710 Cumberland Street, Dunedin 9054, New Zealand. E-mail: [email protected]
10 These authors contributed equally to this work.

Genes and Immunity (2007) 8, 57–68 © 2007 Nature Publishing Group All rights reserved 1466-4879/07 $30.00

www.nature.com/gene

Whole-genome association studies can be very expensive if case–control or family-based cohorts of 1000 or more subjects are individually genotyped. This is likely to limit the number of primary discovery experiments that can be conducted. A pragmatic approach that balances the benefit of a WGA experiment against the cost of individual genotyping is to use pooled genomic DNA samples, followed by individual genotyping for validation in an expanded or independent sample.12,13

Pooled DNA samples have been analysed for several generations of DNA-based genetic markers, such as microsatellites and candidate SNPs, using several technologies, including the Affymetrix GeneChip Mapping 100K Array.14–20 To date, however, the empirical validity of DNA pooling and genotyping using array technology has not been sufficiently demonstrated to enable researchers to apply the method with confidence in WGA experiments.

Here our aim was to investigate whether DNA pooling was an effective approach for a WGA study in an empirical setting with the Affymetrix GeneChip Mapping 100K and Illumina Infinium microarray platforms, with specific testing for association of the established HLA and PTPN22 loci with RA. To maximize accuracy of pooled allele frequency estimates (and hence power of WGA scanning) using the Affymetrix GeneChip Mapping 100K Array, a novel algorithm to account for the quality of individual probe quartets was developed. We also improved our WGA scan by overlapping analysis of pools of case–control cohorts from two racially and clinically similar populations (Caucasians from New Zealand (NZ) and the United Kingdom (UK)).

Results

Forty different sets of pooled samples were run on Affymetrix 100K GeneChip microarrays (data not shown). Data from replicate microarrays were compared and the median standard deviation (s.d.) in allele frequency was 2.9%, in comparison to 7.4% using the Affymetrix algorithm (based on averaging relative allele signal (RAS)1 (sense probes) and RAS2 (antisense probes) values). The median s.d. in allele frequency obtained from comparison of data obtained from running 14 different sets of pooled samples on replicate Illumina Infinium microarrays was similar at 2.7% (data not shown).

In order to evaluate how well allele frequency could be estimated from a pool, 39 individually typed Centre d’Etude du Polymorphisme Humain (CEPH) samples were pooled and genotyped across an Affymetrix 100K GeneChip microarray. Figure 1 shows the relationship between minor allele frequencies calculated from the pool and actual frequencies derived from the individually typed samples for 200 randomly selected SNPs. The median absolute difference between the corresponding frequencies was 3.1% and mean dispersion 3.4%. The median s.d. of predicted allele frequency for the same CEPH pool run on four different chips was 3.2%.
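The two accuracy summaries used above (median absolute difference between pooled and individually typed frequencies, and median s.d. across replicate arrays) can be sketched as follows. This is only an illustration: the function names and the input values are hypothetical, and the paper does not say whether the sample or population s.d. was used, so the sample s.d. here is an assumption.

```python
from statistics import median, stdev


def median_abs_difference(pooled, individual):
    """Median absolute difference between pooled estimates and
    frequencies obtained from individual genotypes (one value per SNP)."""
    if len(pooled) != len(individual):
        raise ValueError("need one pooled and one individual value per SNP")
    return median(abs(p - q) for p, q in zip(pooled, individual))


def median_replicate_sd(replicates_by_snp):
    """Median, across SNPs, of the sample s.d. of the allele-frequency
    estimates obtained from replicate arrays of the same pool."""
    return median(stdev(reps) for reps in replicates_by_snp)


# Hypothetical allele frequencies for four SNPs (not data from the paper).
pooled = [0.12, 0.31, 0.48, 0.25]
individual = [0.10, 0.35, 0.45, 0.26]
print(median_abs_difference(pooled, individual))

# Hypothetical estimates from four replicate arrays for two SNPs.
replicates = [[0.10, 0.12, 0.11, 0.13], [0.40, 0.44, 0.42, 0.42]]
print(median_replicate_sd(replicates))
```

Medians are used in both summaries, as in the text, so a handful of poorly performing SNPs does not dominate the result.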

In order to compare the ability to detect association using our DNA pooling approach with the ability to detect association based on individual genotyping, a data set of 250 cases and 250 matched controls was considered. Assuming that two pools, one for cases and one for controls, were run on four replicate microarrays each, we were able to estimate s.d. for the typical SNP (see Patients and methods, Detection of association). We defined the minimal detectable difference in allele frequency (MDDAF) as an expected (in infinitely large data sets) allele frequency difference between cases and controls at which power to detect association is 80%, as Δf = z*σ, where z* is a threshold value for a z-score to detect association. Assuming that an association is detected whenever a P-value is below P = 4 × 10⁻⁵ (z* ≈ 4.13) and σ0 = 0.028, we calculated MDDAF as a function of minor allele frequency. Figure 2 shows the loss in ability to detect allele frequency differences using DNA pooling compared to individual genotyping. This loss appears to be acceptable.
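A minimal sketch of how an MDDAF curve of this kind could be computed is given below. The formula for the s.d. is not part of this excerpt (the text refers to Patients and methods), so the variance model here is an assumption: binomial sampling variance of the allele frequency in each group plus a measurement term, with the per-array measurement s.d. (taken to be the quoted value 0.028) averaged over the replicate arrays of each pool. The function names are hypothetical.

```python
import math


def sd_allele_freq_difference(maf, n_cases, n_controls, sigma0=0.0, replicates=1):
    """s.d. of the case-control allele-frequency difference.

    Assumed model: sampling variance of 2N chromosomes per group, plus
    (for pooling) a measurement variance sigma0**2 / replicates per pool.
    sigma0 = 0 corresponds to individual genotyping without measurement error.
    """
    sampling_var = maf * (1.0 - maf) * (1.0 / (2 * n_cases) + 1.0 / (2 * n_controls))
    measurement_var = 2.0 * sigma0 ** 2 / replicates  # one term per pool
    return math.sqrt(sampling_var + measurement_var)


def mddaf(maf, n_cases, n_controls, z_star=4.13, sigma0=0.0, replicates=1):
    """Minimal detectable difference in allele frequency, delta_f = z* x sigma."""
    return z_star * sd_allele_freq_difference(maf, n_cases, n_controls, sigma0, replicates)


for maf in (0.1, 0.2, 0.3, 0.4, 0.5):
    pooled = mddaf(maf, 250, 250, sigma0=0.028, replicates=4)
    individual = mddaf(maf, 250, 250)
    print(f"MAF {maf:.1f}: pooling {pooled:.3f}, individual genotyping {individual:.3f}")
```

Under these assumptions the pooled curve sits above the individual-genotyping curve at every minor allele frequency, and the gap is largest in relative terms for rare alleles, where the fixed measurement term dominates the sampling term.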

Figure 1 Comparison between allele frequencies estimated from the pooling experiment and allele frequencies based on individual genotypes of 94 Caucasian CEPH individuals. Each point represents one of 200 randomly selected SNPs from the Affymetrix 100K GeneChip microarray. The diagonal line represents equal allele frequencies. (Axes: x, allele frequency based on individual genotypes; y, allele frequency estimated from pooling experiment; both range from 0 to 1.)

Figure 2 Ability to detect association using the DNA pooling approach. The MDDAF is plotted as a function of minor allele frequency at a significance level P = 0.001. (Axes: x, minor allele frequency, 0 to 0.5; y, minimal detectable difference in allele frequency, 0 to 0.18. Curves: pooling (250 cases + 250 controls); individual genotyping (150 cases + 150 controls); individual genotyping (250 cases + 250 controls).)

Genomic DNA pooling for WGA
S Steer et al

Genes and Immunity

We then performed a WGA scan in RA and specifically examined association at two loci previously unambiguously associated with RA. If association at HLA and PTPN22 could be detected using DNA pooling, this would be empirical demonstration of efficacy of this technique in WGA analysis in RA genetics and would warrant later examination of the entire genome for novel disease-associated loci. NZ and UK case and control pools were hybridized, in quadruplicate, to Affymetrix 100K GeneChip and Illumina Infinium microarrays and a combined z-score determined for all SNPs. The estimated allele frequencies, z-scores and genome-wide ranks of SNPs within the class II/III HLA and PTPN22 windows are presented in Tables 1 and 2, respectively. The maximal z-score at HLA was 5.554 (P = 3 × 10⁻⁸) and at PTPN22 was 5.511 (P = 4 × 10⁻⁸). Both of these remain significant after correcting for the total number of SNPs analysed (Pc < 0.01).
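The relationship between the maximal z-scores, their P-values and the corrected P-values can be checked with a short calculation. This sketch assumes a two-sided normal test and a Bonferroni correction over the 203 269 SNPs examined; the excerpt does not name the correction method, so Bonferroni is an assumption, and the function names are hypothetical.

```python
import math

N_SNPS = 203269  # total number of SNPs examined, from the abstract


def two_sided_p(z):
    """Two-sided P-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni(p, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p * n_tests)


for locus, z in (("HLA", 5.554), ("PTPN22", 5.511)):
    p = two_sided_p(z)
    print(f"{locus}: z = {z}, P = {p:.1e}, corrected P = {bonferroni(p, N_SNPS):.4f}")
```

Under these assumptions both corrected values fall below 0.01, in line with the statement that the two associations remain significant after correction.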

At HLA, a total of 33 SNPs (representing 18 ‘unique hits’ after exclusion of SNPs exhibiting complete intermarker LD with at least one other SNP) were ranked within the top 100 associated SNPs genome wide from both the Affymetrix and Illumina microarrays (Table 1). HapMap CEPH Utah (CEU) genotyping data were available on 181 of the 257 HLA SNPs on the microarrays and the LD relationships between these are shown in Figure 3. Twenty-six of the disease-associated SNPs (for 15 of which HapMap data were available) were clustered within four closely related LD blocks encompassing the predicted gene C6orf10, BTNL2 and HLA-DRA. These blocks are defined by HapMap markers rs9296015–rs3129941 (blocks 9–10, Figure 3), and rs2076530–rs1041885 (blocks 11–12, Figure 3). Outside the C6orf10 and HLA-DR regions, seven other SNPs (for six of which HapMap data were available) within the major histocompatibility complex class III region provided evidence for association. Six of these were clustered around the complement component C2 (rs3020664, rs1042663 and rs541862), complement factor B, RD RNA binding protein (rs760070) and superkiller viralicidic activity 2-like homologue (rs438999) loci, of which four exhibited strong intermarker LD (r² > 0.75; LD block 4 in Figure 3). The seventh, rs9296009, is intergenic and lies approximately 2.5 kb p-telomeric of proline-rich transmembrane protein 1.

Table 1 Predicted minor allele frequencies, z-scores and genome-wide rankings of all HLA SNPs ranking in the top 100

Affymetrix

| SNP a | Position (bp) | Combined z | P-value | UK z | UK case freq. | UK control freq. | NZ z | NZ case freq. | NZ control freq. | Rank |
|---|---|---|---|---|---|---|---|---|---|---|
| rs9296009 (T) | 32222493 | 3.917 | 9 × 10⁻⁵ | 2.768 | 0.356 | 0.262 | 3.169 | 0.334 | 0.230 | 21 |
| rs9296021 (T) | 32405668 | 4.086 | 4 × 10⁻⁵ | 2.19 | 0.507 | 0.421 | 4.377 | 0.536 | 0.364 | 11 |
| rs9268542 (G) | 32492699 | 3.807 | 10⁻⁴ | 2.737 | 0.553 | 0.431 | 3.044 | 0.552 | 0.414 | 23 |
| rs9268614 (G) | 32510756 | 5.548 | 3 × 10⁻⁸ | 4.143 | 0.451 | 0.272 | 5.441 | 0.467 | 0.227 | 1 |
| rs2395173 (A) | 32512837 | 4.93 | 8 × 10⁻⁷ | −3.414 | 0.212 | 0.330 | −4.642 | 0.175 | 0.327 | 3 |
| rs2395178 (C) | 32513340 | 4.951 | 8 × 10⁻⁷ | −3.414 | 0.212 | 0.330 | −4.642 | 0.175 | 0.327 | 2 |
| rs2395182 (G) | 32521295 | 4.258 | 2 × 10⁻⁵ | −2.95 | 0.164 | 0.252 | −3.616 | 0.143 | 0.245 | 7 |
| rs2227139 (C) | 32521437 | 3.636 | 3 × 10⁻⁴ | −2.959 | 0.336 | 0.440 | −2.465 | 0.346 | 0.430 | 41 |

Illumina

| SNP a | Position (bp) | Combined z | P-value | UK z | UK case freq. | UK control freq. | NZ z | NZ case freq. | NZ control freq. | Rank |
|---|---|---|---|---|---|---|---|---|---|---|
| rs3020644 (G) | 32002605 | 4.28 | 2 × 10⁻⁵ | 3.549 | 0.477 | 0.316 | 2.504 | 0.428 | 0.313 | 30 |
| rs1042663 (A) | 32013109 | 3.733 | 2 × 10⁻⁴ | −2.367 | 0.065 | 0.112 | 2.912 | 0.059 | 0.114 | 58 |
| rs653414 (A) | 32015147 | 3.733 | 2 × 10⁻⁴ | 2.367 | 0.065 | 0.112 | 2.912 | 0.059 | 0.114 | 59 |
| rs541862 (G) | 32024930 | 3.733 | 2 × 10⁻⁴ | −2.367 | 0.065 | 0.112 | −2.912 | 0.059 | 0.114 | 60 |
| rs760070 (G) | 32027935 | 3.733 | 2 × 10⁻⁴ | −2.367 | 0.065 | 0.112 | −2.912 | 0.059 | 0.114 | 61 |
| rs438999 (C) | 32036285 | 3.733 | 2 × 10⁻⁴ | −2.367 | 0.065 | 0.112 | −2.912 | 0.059 | 0.114 | 62 |
| rs482194 (C) | 32368537 | 5.197 | 2 × 10⁻⁷ | 3.090 | 0.496 | 0.389 | 4.259 | 0.526 | 0.380 | 4 |
| rs537757 (A) | 32376479 | 5.197 | 2 × 10⁻⁷ | −3.090 | 0.496 | 0.389 | −4.259 | 0.526 | 0.380 | 5 |
| rs477005 (C) | 32378478 | 5.197 | 2 × 10⁻⁷ | 3.090 | 0.496 | 0.389 | 4.259 | 0.526 | 0.380 | 6 |
| rs3132958 (A) | 32405879 | 4.167 | 3 × 10⁻⁵ | 2.743 | 0.159 | 0.246 | 3.150 | 0.142 | 0.239 | 32 |
| rs9405090 (G) | 32406350 | 5.197 | 2 × 10⁻⁷ | 3.090 | 0.496 | 0.389 | 4.259 | 0.526 | 0.380 | 7 |
| rs9268368 (C) | 32441933 | 5.197 | 2 × 10⁻⁷ | 3.090 | 0.496 | 0.389 | 4.259 | 0.526 | 0.380 | 8 |
| rs9268384 (G) | 32444564 | 5.197 | 2 × 10⁻⁷ | 3.090 | 0.496 | 0.389 | 4.259 | 0.526 | 0.380 | 9 |
| rs3129941 (A) | 32445664 | 4.167 | 3 × 10⁻⁵ | 2.743 | 0.159 | 0.246 | 3.150 | 0.142 | 0.239 | 33 |
| rs2076530 (G) | 32471794 | 5.554 | 3 × 10⁻⁸ | 2.785 | 0.580 | 0.473 | 5.069 | 0.636 | 0.442 | 1 |
| rs2076529 (C) | 32471933 | 5.554 | 3 × 10⁻⁸ | 2.785 | 0.580 | 0.473 | 5.069 | 0.636 | 0.442 | 2 |
| rs2076525 (G) | 32478594 | 5.073 | 4 × 10⁻⁷ | 2.787 | 0.429 | 0.326 | 4.387 | 0.484 | 0.321 | 10 |
| rs2076523 (G) | 32478813 | 3.488 | 4 × 10⁻⁴ | 1.821 | 0.496 | 0.427 | 3.112 | 0.528 | 0.411 | 90 |
| rs9268492 (C) | 32483258 | 5.073 | 4 × 10⁻⁷ | −2.787 | 0.429 | 0.326 | −4.387 | 0.484 | 0.321 | 11 |
| rs9268494 (C) | 32483330 | 5.073 | 4 × 10⁻⁷ | 2.787 | 0.429 | 0.326 | 4.387 | 0.484 | 0.321 | 12 |
| rs9268497 (A) | 32483402 | 5.073 | 4 × 10⁻⁷ | −2.787 | 0.429 | 0.326 | −4.387 | 0.484 | 0.321 | 13 |
| rs14004 (A) | 32515687 | 3.5 | 5 × 10⁻⁴ | −2.120 | 0.470 | 0.384 | −2.83 | 0.477 | 0.362 | 86 |
| rs7192 (T) | 32519624 | 3.989 | 7 × 10⁻⁵ | 3.127 | 0.246 | 0.359 | 2.515 | 0.274 | 0.365 | 40 |
| rs7195 (A) | 32520517 | 3.989 | 7 × 10⁻⁵ | 3.127 | 0.246 | 0.359 | 2.515 | 0.274 | 0.365 | 41 |
| rs9268832 (A) | 32535767 | 3.676 | 2 × 10⁻⁴ | 3.157 | 0.301 | 0.422 | 2.042 | 0.363 | 0.442 | 66 |

Abbreviations: HLA, human leucocyte antigen; NZ, New Zealand; SNP, single-nucleotide polymorphism; UK, United Kingdom.
a The minor allele is indicated in brackets.


Inspection of LD in the CEPH CEU families did not identify any SNP markers that were in strong LD with the known associated HLA-DRB1*0401 allele. The top-ranked Affymetrix 100K GeneChip SNP was one of the markers closest to HLA-DRA (rs9268614), 155 kb from HLA-DRB1, that exhibited some LD with HLA-DRB1*0401 in the CEPH CEU individuals (r² = 0.26). The extended NZ cohort was individually genotyped for rs9268614 and we tested for the possibility of an effect independent of HLA-DRB1*04. Association was confirmed, with allele G occurring at a significantly higher frequency in cases than controls (Table 3a; P = 5.4 × 10⁻¹⁵). This SNP was in LD with DRB1*04 alleles (r² = 0.49 in NZ controls and 0.66 in the NZ extended case cohort). Conditional analysis of rs9268614 on the presence of DRB1*04 alleles showed weak evidence for independent association (P = 0.02). Analysis of haplotypes between rs9268614 and the DRB1*04 allele revealed

Table 2 Predicted minor allele frequencies, z-scores and genome-wide rankings of all SNPs in the PTPN22 window

Affymetrix

| SNP a | Position (bp) | Combined z | P-value | UK z | UK case freq. | UK control freq. | NZ z | NZ case freq. | NZ control freq. | Rank b | r² c |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rs1418958 (A) | 113822878 | 3.770 | 2 × 10⁻⁴ | 1.659 | 0.296 | 0.242 | 4.394 | 0.355 | 0.212 | 28 | 0.243 |
| rs1343125 (C) | 113823078 | 3.770 | 2 × 10⁻⁴ | 1.659 | 0.296 | 0.242 | 4.394 | 0.355 | 0.212 | 29 | 0.243 |
| rs10489936 (G) | 113867451 | 0.840 | 0.4 | −0.074 | 0.315 | 0.319 | −1.135 | 0.287 | 0.325 | | 0.065 |
| rs1080307 (G) | 113928491 | 0.840 | 0.4 | −0.074 | 0.315 | 0.319 | −1.135 | 0.287 | 0.325 | | 0.065 |
| rs1217380 (C) | 114055378 | 4.174 | 3 × 10⁻⁵ | 3.154 | 0.327 | 0.218 | 3.260 | 0.322 | 0.210 | 9 | 0.435 |
| rs3811021 (C) | 114068705 | 1.803 | 0.07 | −1.303 | 0.162 | 0.207 | −1.302 | 0.203 | 0.251 | | 0.041 |
| rs1970559 (C) | 114089190 | 2.448 | 0.01 | −1.073 | 0.202 | 0.237 | −2.510 | 0.179 | 0.261 | | 0.046 |
| rs1217407 (A) | 114105790 | 4.440 | 9 × 10⁻⁶ | 3.354 | 0.327 | 0.218 | 3.504 | 0.322 | 0.210 | 4 | 0.435 |
| rs1217410 (C) | 114108858 | 4.419 | 10⁻⁵ | 3.354 | 0.327 | 0.218 | 3.504 | 0.322 | 0.210 | 5 | |
| rs2476602 (A) | 114108997 | 2.589 | 0.01 | −1.137 | 0.202 | 0.237 | −2.686 | 0.179 | 0.261 | | 0.046 |
| rs1539438 (C) | 114142398 | 0.037 | 1.0 | 0.071 | 0.348 | 0.345 | −0.138 | 0.385 | 0.391 | | 0.079 |

Illumina

| SNP a | Position (bp) | Combined z | P-value | UK z | UK case freq. | UK control freq. | NZ z | NZ case freq. | NZ control freq. | Rank b | r² c |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rs10858000 | 113787838 | 4.639 | 4 × 10⁻⁶ | −2.352 | 0.288 | 0.217 | −4.208 | 0.298 | 0.175 | 14 | |
| rs6537790 (C) | 113807250 | 0.016 | 1.0 | −0.688 | 0.315 | 0.286 | 0.710 | 0.335 | 0.366 | | |
| rs1343128 (A) | 113820294 | 4.639 | 4 × 10⁻⁶ | −2.352 | 0.288 | 0.217 | −4.208 | 0.298 | 0.175 | 15 | 0.243 |
| rs1217201 (G) | 113835815 | 4.639 | 4 × 10⁻⁶ | 2.352 | 0.288 | 0.217 | 4.208 | 0.298 | 0.175 | 16 | 0.255 |
| rs1217236 (G) | 113844754 | 4.639 | 4 × 10⁻⁶ | 2.352 | 0.288 | 0.217 | 4.208 | 0.298 | 0.175 | 17 | |
| rs1235000 (C) | 113846067 | 1.503 | 0.13 | −0.895 | 0.277 | 0.322 | −1.23 | 0.302 | 0.335 | | |
| rs1217231 (G) | 113848965 | 2.905 | 0.004 | 1.933 | 0.406 | 0.481 | 2.175 | 0.415 | 0.499 | | |
| rs1217228 (A) | 113851401 | 4.639 | 4 × 10⁻⁶ | −2.352 | 0.288 | 0.217 | −4.208 | 0.298 | 0.175 | 18 | |
| rs1217226 (G) | 113851964 | 4.639 | 4 × 10⁻⁶ | 2.352 | 0.288 | 0.217 | 4.208 | 0.298 | 0.175 | 19 | |
| rs3761931 (A) | 113869065 | 0.689 | 0.5 | −0.422 | 0.297 | 0.312 | −0.553 | 0.258 | 0.275 | | 0.065 |
| rs1217192 (G) | 113874236 | 3.516 | 4 × 10⁻⁴ | 2.848 | 0.370 | 0.265 | 2.125 | 0.308 | 0.232 | 80 | |
| rs1217193 (G) | 113874661 | 4.412 | 10⁻⁵ | −2.312 | 0.320 | 0.248 | −3.928 | 0.323 | 0.205 | 20 | 0.435 |
| rs3747998 (C) | 113898180 | 0.689 | 0.5 | −0.422 | 0.297 | 0.312 | −0.553 | 0.258 | 0.275 | | 0.065 |
| rs1777237 (G) | 113899413 | 4.412 | 10⁻⁵ | 2.312 | 0.320 | 0.248 | 3.928 | 0.323 | 0.205 | 21 | |
| rs1230673 (C) | 113906555 | 2.565 | 0.01 | −0.847 | 0.291 | 0.319 | −2.781 | 0.247 | 0.337 | | 0.06 |
| rs1146182 (G) | 113908925 | 4.412 | 10⁻⁵ | 2.312 | 0.320 | 0.248 | 3.928 | 0.323 | 0.205 | 22 | 0.435 |
| rs1146187 (T) | 113924613 | 4.412 | 10⁻⁵ | −2.312 | 0.320 | 0.248 | −3.928 | 0.323 | 0.205 | 23 | |
| rs13524 (G) | 113940242 | 2.565 | 0.01 | −0.847 | 0.291 | 0.319 | −2.781 | 0.247 | 0.337 | | 0.06 |
| rs6698586 (C) | 113961424 | 4.412 | 10⁻⁵ | 2.312 | 0.320 | 0.248 | 3.928 | 0.323 | 0.205 | 24 | 0.435 |
| rs1981319 (T) | 113979687 | 4.412 | 10⁻⁵ | 2.312 | 0.320 | 0.248 | 3.928 | 0.323 | 0.205 | 25 | |
| rs2273758 (T) | 113992759 | 4.319 | 2 × 10⁻⁵ | −3.003 | 0.462 | 0.355 | −3.105 | 0.433 | 0.325 | 27 | 0.202 |
| rs1935836 (G) | 114021214 | 2.565 | 0.01 | −0.847 | 0.291 | 0.319 | −2.781 | 0.247 | 0.337 | | 0.06 |
| rs3789598 (A) | 114030769 | 4.319 | 2 × 10⁻⁵ | −3.003 | 0.462 | 0.355 | −3.105 | 0.433 | 0.325 | 28 | 0.202 |
| rs3789600 (C) | 114046264 | 3.114 | 0.002 | 1.557 | 0.493 | 0.438 | 2.847 | 0.517 | 0.418 | | 0.165 |
| rs1217413 (C) | 114069792 | 5.511 | 4 × 10⁻⁸ | 3.304 | 0.306 | 0.183 | 4.49 | 0.353 | 0.180 | 3 | |
| rs2476600 (A) | 114081776 | 3.114 | 0.002 | −1.557 | 0.493 | 0.438 | −2.847 | 0.517 | 0.418 | | 0.165 |
| rs1217395 (G) | 114086477 | 4.412 | 10⁻⁵ | 2.312 | 0.320 | 0.248 | 3.928 | 0.323 | 0.205 | 26 | 0.435 |
| rs1970559 (C) | 114089190 | 2.208 | 0.03 | −1.215 | 0.221 | 0.262 | −1.907 | 0.198 | 0.261 | | |
| rs1217406 (A) | 114105195 | 3.114 | 0.002 | −1.557 | 0.493 | 0.438 | −2.847 | 0.517 | 0.418 | | |
| rs1217419 (T) | 114113946 | 3.114 | 0.002 | −1.557 | 0.493 | 0.438 | −2.847 | 0.517 | 0.418 | | |
| rs1235005 (C) | 114129479 | 2.671 | 0.008 | 1.342 | 0.462 | 0.414 | 2.435 | 0.481 | 0.396 | | 0.161 |
| rs6665194 (G) | 114129885 | 2.671 | 0.008 | −1.342 | 0.462 | 0.414 | −2.435 | 0.481 | 0.396 | | 0.206 |
| rs1217385 (T) | 114130247 | 2.671 | 0.008 | −1.342 | 0.462 | 0.414 | −2.435 | 0.481 | 0.396 | | 0.209 |
| rs1217401 (G) | 114150993 | 1.923 | 0.05 | −0.328 | 0.274 | 0.285 | −2.391 | 0.247 | 0.326 | | 0.065 |
| rs1217397 (T) | 114159607 | 1.923 | 0.05 | 0.328 | 0.274 | 0.285 | 2.391 | 0.247 | 0.326 | | 0.037 |
| rs3761936 (C) | 114161704 | 3.274 | 0.001 | 2.295 | 0.301 | 0.230 | 2.335 | 0.263 | 0.195 | | 0.597 |

Abbreviations: NZ, New Zealand; SNP, single-nucleotide polymorphism; UK, United Kingdom.
a The minor allele is indicated in brackets.
b The genome-wide rank for SNPs ranked in the top 100 is shown.
c r² = CEU intermarker LD with PTPN22 R620W (rs2476601) (www.hapmap.org).


global association (P = 4.4 × 10⁻³⁰) and confirmed the major effect of the presence or absence of the DRB1*04 allele.

At PTPN22, 21 SNPs (representing seven ‘unique hits’ after exclusion of SNPs exhibiting complete intermarker LD with at least one other SNP) were ranked within the top 100 associated SNPs genome wide (Table 2). LD relationships between the associated SNPs for which HapMap CEPH CEU genotyping data were available are shown in Figure 4. These data suggest that the PTPN22

Figure 3 Intermarker LD between SNPs within the class II and class III HLA window. Only those SNPs contained on the Affymetrix 100K GeneChip and Illumina Infinium microarrays and for which CEPH CEU genotype data were available are shown. SNPs ranked in the top 100 are arrowed. Haplotype blocks (n = 22) generated in Haploview (www.broad.mit.edu/mpg/haploview) are outlined by a solid black line. Block numbers referred to in the text are from left to right. Horizontal lines indicate genes; A = C2/CFB/RDBP/SKIV2L, B = PRRT1, C = C6orf10, D = BTNL2, E = DRA, F = HLA-DRB1.

Table 3a Association analysis of rs9268614 and rs1343125 in the extended NZ cohort

| Variant | Cohort | Allele (1/2) | Genotype 1/1, n (%) | Genotype 1/2, n (%) | Genotype 2/2, n (%) | Allele 1, n (%) | Allele 2, n (%) | Allelic OR (95% CI) | Allelic P |
|---|---|---|---|---|---|---|---|---|---|
| rs9268614 a | Case | T/G | 317 (37.0) | 414 (48.3) | 126 (14.7) | 1048 (61.1) | 666 (38.9) | 2.36 (1.99–2.81) | 5.4 × 10⁻¹⁵ |
| | Control | T/G | 352 (62.7) | 180 (32.1) | 29 (5.2) | 884 (78.8) | 238 (21.2) | | |
| rs1343125 a | Case | T/C | 427 (50.7) | 358 (42.5) | 57 (6.8) | 1212 (72.0) | 472 (28.0) | 1.40 (1.17–1.68) | 1.8 × 10⁻⁴ |
| | Control | T/C | 329 (59.1) | 214 (38.4) | 14 (2.5) | 872 (78.3) | 242 (21.7) | | |

Abbreviations: NZ, New Zealand; OR, odds ratio.
a HWE P-values (cases, controls): rs9268614 (0.63, 0.34), rs1343125 (0.12, 0.002).
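As a worked check on Table 3a, the allelic odds ratios can be recomputed from the allele counts. The sketch below assumes a Woolf (logit) 95% confidence interval; the excerpt does not state how the intervals were derived, so that choice is an assumption, and the function name is hypothetical.

```python
import math


def allelic_odds_ratio(case_risk, case_other, control_risk, control_other, z=1.96):
    """Allelic odds ratio with a Woolf (logit) confidence interval.

    Arguments are allele counts: risk allele and other allele in cases,
    then risk allele and other allele in controls.
    """
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    se_log_or = math.sqrt(
        1 / case_risk + 1 / case_other + 1 / control_risk + 1 / control_other
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Allele counts taken from Table 3a (allele 2 is treated as the risk allele).
for snp, counts in (
    ("rs9268614 (G)", (666, 1048, 238, 884)),
    ("rs1343125 (C)", (472, 1212, 242, 872)),
):
    odds_ratio, lower, upper = allelic_odds_ratio(*counts)
    print(f"{snp}: OR = {odds_ratio:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

With these counts the sketch gives values close to those tabulated for both SNPs, which suggests the tabulated intervals are at least consistent with a logit-type interval.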

Table 3b Association analysis of haplotypes of DRB1-rs9268614 and R620W-rs1343125 in the extended NZ cohort

| Haplotype | Estimated case chromosomes (freq) | Estimated control chromosomes (freq) | OR (95% CI) | P a |
|---|---|---|---|---|
| (rs9268614-DRB1) | | | | |
| T-*04 | 78 (0.047) | 42 (0.038) | 1.3 (0.89–1.91) | 0.18 |
| G-*04 | 573 (0.346) | 169 (0.152) | 2.97 (2.44–3.60) | 8.0 × 10⁻³¹ |
| T-NOT*04 | 933 (0.564) | 832 (0.752) | 0.42 (0.36–0.50) | 5.9 × 10⁻²⁵ |
| G-NOT*04 | 70 (0.043) | 63 (0.057) | 1 | 0.10 |
| (R620W-rs1343125) b | | | | |
| 620W-C | 197 (0.117) | 92 (0.083) | 1.54 (1.18–2.00) | 0.003 |
| 620W-T | 52 (0.030) | 19 (0.017) | 1.86 (1.09–3.18) | 0.028 |
| 620R-C | 274 (0.163) | 150 (0.135) | 1 | 0.053 |
| 620R-T | 1157 (0.689) | 851 (0.765) | 0.68 (0.57–0.81) | 1.3 × 10⁻⁵ |

Abbreviation: NZ, New Zealand.
a Global P = 4.4 × 10⁻³⁰ for the HLA markers and P = 8 × 10⁻⁵ for the PTPN22 markers.
b rs1343125-R620W D′ and r² values in the NZ controls are 0.78 and 0.24, respectively (0.69 and 0.24 in CEPH CEU (www.hapmap.org)).


620W variant is the major disease-causing allele in the extended haplotype block; there was correlation between amount of LD with the R620W variant (rs2476601) and z-score (correlation coefficient = 0.64). Four of the disease-associated SNPs (rs1343128, rs1418958, rs1343125 and rs1217201) were clustered at the telomeric end of the region within the membrane-associated guanylate kinase-related 3 (MAGI3) gene. Although it is possible that the association at the MAGI3 SNPs is due to LD with the PTPN22 620W variant (Figure 4), given the presence of a disease-associated haplotype in US Caucasians that is independent of PTPN22 620W,21 we hypothesized that the MAGI3 SNPs themselves, or variants in LD, defined a disease association distinct from the 620W association. The MAGI3 SNP rs1343125 was genotyped across the extended NZ RA case–control cohort. This confirmed association of the C allele with disease (Table 3a; P = 1.8 × 10⁻⁴). Disease association of rs1343125-R620W haplotypes was then analysed (Table 3b). The T allele at rs1343125 was present on a protective haplotype with the 620R allele (OR = 0.68; P = 1.3 × 10⁻⁵) whereas there was weak evidence for over-representation of the C-620R haplotype in cases compared to controls (P = 0.05). Association analysis at rs1343125 conditional on PTPN22 R620W also provided evidence for an effect independent of PTPN22 R620W (P = 0.03).

Discussion

This paper reports methodology for WGA scanning in complex disease using pooled genomic DNA samples and the Affymetrix 100K GeneChip and Illumina Infinium microarray platforms. The empirical efficacy of the method for detecting loci of moderate to strong effect in complex disease was demonstrated by detection of the HLA and PTPN22 loci in RA (OR > 1.5). Our data emphasize the importance of the HLA and PTPN22 loci (relative to the rest of the genome) in the aetiology of RA. Using control population allele frequencies of 0.19 for HLA-DRB1*04 and 0.099 for PTPN22 R620W, and genotype relative risk estimates from the extended NZ RA cohort, the estimated population attributable risk22 for each locus was 41.7 and 11.4%, respectively. Considering that the environment also contributes to the aetiology of RA, it is unlikely that more than several other genetic variants of effect greater than PTPN22 R620W remain to be discovered. Of course, it is most important to acknowledge that the genotyping platforms used here tag, at r² ≥ 0.8, only approximately 50% of common variation in the human genome in Caucasians.23 However, an association analysis of >6500 nonsynonymous SNPs (nsSNPs) has also emphasized the importance of PTPN22 R620W in autoimmunity;11 this nsSNP was the most associated with type 1 diabetes outside HLA.

The use of DNA pooling has potential as an extremely cost-effective method to identify a reduced set of potentially disease-associated SNPs suitable for follow-up in the second phase of a WGA experiment. The key to using DNA pooling in WGA scanning is reducing variability in estimation of allele frequency in genomic DNA pools. Given the current wider use and longer availability of the Affymetrix GeneChip Mapping 100K Array, we focused on developing an improved algorithm able to minimize variation in estimation of allele frequencies in DNA pools. Previous methods have improved the Affymetrix algorithm by averaging the RAS scores corresponding to each of the sense and antisense probe sets, applying the k

Figure 4 Intermarker LD between SNPs within the PTPN22 haplotype block. Only those SNPs contained on the Affymetrix 100K GeneChip and Illumina Infinium microarrays and for which CEPH CEU genotype data were available are shown. Those with z ≥ 4.0 are red, 4.0 > z ≥ 3.0 are blue, 3.0 > z ≥ 2.0 are green and z < 2.0 are black. The genes are not marked to scale. *Other relevant SNPs shown are PTPN22 R620W (rs2476601) and rs12760457 (defines protective ‘haplotype 5’21); neither was included in the WGA scan.


correction factor (often used in pooling experiments to correct for unequal efficiencies in measuring allele signals24) and repeated measurement of DNA pools.19,25,26 The fundamental difference between our algorithm and previous algorithms for estimating allele frequencies in DNA pools using the Affymetrix GeneChip Mapping 100K Array19,26 is in the calculation of RAS scores for each SNP. The Affymetrix algorithm obtains RAS as a median value from five quartet sense (RAS1) and five quartet antisense (RAS2) sequences (containing match and mismatch SNP probes). For most SNPs, this is sufficient to distinguish between homozygous and heterozygous genotypes in analysis of individual samples.6 However, the s.d. of the estimated allele frequency in pooled samples using the Affymetrix algorithm (7.4%) is simply too great to apply to WGA scanning. To address this, we calculated all 10 RAS scores (corresponding to the five sense and five antisense probe quartets) and summed these scores weighted with coefficients that were inversely proportional to the square of their variability. The s.d. of the difference in allele frequency between pools of cases and controls consists of two components, the first coming from the limited size of the pools and the second from imprecise measurement of the allele frequency in the pool (see above). Although the s.d. will always decrease with increasing number of pooled samples, it cannot become lower than its second component. Thus the power of the study will quickly plateau after the size of the pools becomes sufficiently large for the second component of the s.d. to become larger than the first. Such conditions will be obtained for pools with ~400 samples for Affymetrix 100K chips and with ~550 samples for Illumina 100K chips. In our study, the NZ case pool approaches the optimal sample size for the Affymetrix chip, but other sample sizes would need to be increased to improve power to detect loci with statistically smaller effects than PTPN22. The approach of combining data from the NZ and UK cohorts also reduced the noise in the WGA scan data observed when the cohorts were analysed separately (Figure 5); the HLA and PTPN22 associations are considerably more obvious when the cohorts are analysed together (Figure 5b and d) than separately (Figure 5a and c).

Where economic considerations are not central to the design of a WGA study, individual genotyping is preferable to estimation of allele frequencies by DNA pooling, for several reasons: DNA pooling does not allow population stratification to be detected and controlled, and the ability to study haplotypes and to undertake gene–gene interaction studies is lost. However, if cost is an impediment then this DNA pooling methodology will enable WGA scanning. We have demonstrated the efficacy of DNA pooling on both the Affymetrix 100K GeneChip and Illumina Infinium platforms and have demonstrated its utility on the Illumina HapMap300 chip (data not shown). In general, cost considerations still limit current WGA scans using individual genotyping to only one platform.

Figure 5 Z-score plots of SNPs for the separate (a) and combined (b) analyses of the NZ (red dots) and UK (blue dots) WGA Affymetrix data and for the separate (c) and combined (d) analyses of the Illumina data. The scale is in units of 100 kbp, beginning at Chr 1 and finishing at Chr X. PTPN22 is at 113 Mb, HLA at 3900 Mb.

Carlton et al.21 identified one disease-susceptibility PTPN22 haplotype (haplotype ‘2’, uniquely defined by the 620W allele) and one disease-protective haplotype in US Caucasians (haplotype ‘5’). Our data confirm the existence of a disease-protective haplotype in NZ Caucasians: the T allele of the MAGI3 SNP rs1343125 in combination with PTPN22 620R defined a protective haplotype (Table 3b; P = 1.3 × 10^-5). Our data also provide evidence that there is an additional risk haplotype, carrying the rs1343125 risk C allele with the protective PTPN22 620R allele. Combined with the analysis of association of rs1343125 conditional on genotype at R620W (P = 0.03), these findings are evidence of a second RA locus in the extended PTPN22 haplotype block (Table 3b). MAGI3 can be considered a candidate RA susceptibility gene; MAGI3 associates with the Notch-activating Delta proteins27 and NOTCH signalling has been implicated in the pathophysiology of RA.28 The possibility of a RA susceptibility determinant elsewhere in the PTPN22 haplotype block is further supported by the observation that the haplotype ‘5’-defining PTPN22 SNP rs12760457 (ref. 21) is in complete LD with four MAGI3 SNPs (Figure 4; rs10489936, rs3761931, rs3747998 and rs1080307). None of these SNPs were associated with RA in our study (Table 2; z = 0.84 for the SNPs on the Affymetrix 100K GeneChip and z = 0.69 for the SNPs on the Illumina Infinium chip). There is, however, a group of SNPs within PHTF1 (putative homeodomain transcription factor 1) and the 3′ end of PTPN22 (rs2273758, rs3789598, rs3789600 and rs2476600) that are in moderate LD (0.3 < r² < 0.4) with the four MAGI3 SNPs and rs12760457, and for which there is stronger evidence for association with RA (3.1 < z < 4.2). These PHTF1/PTPN22 SNPs exhibit low LD with PTPN22 R620W (rs2476601) (r² ≤ 0.2). These WGA data do indicate that further analysis of the PTPN22 region is warranted; understanding the relationship of these SNPs with the protective haplotype ‘5’ previously identified21 should be most informative in identifying a possible second RA susceptibility determinant in the PTPN22 region.

The HLA region, and to a lesser extent the PTPN22 locus, dominated the top-ranked SNPs (Tables 1 and 2). This is atypical in the context of other WGA scans.9,10,29 PTPN22 maps within an unusually large block of extended LD (>300 kb) that contains several other genes. This large haplotype block, combined with the selection of gene-centric SNPs on the Illumina Infinium microarray,7 goes some way to explaining the number of top-ranked PTPN22 SNPs in our WGA data. Dominance of SNPs in the HLA region is less surprising, given previous evidence for the existence of multiple RA susceptibility loci in this region. The HLA region was one of the top seven regions detected as associated with RA in Japanese as a result of the WGA genotyping of microsatellites over pooled genomic DNA samples.30 The strongest associated HLA SNP was the 41st ranked Affymetrix SNP in our genome-wide scan (rs2227139) and maps within the same HLA-DRA-containing block of LD (32 489 917–32 521 295 bp) as the highest ranked Affymetrix SNP (rs9268614) in our WGA scan. This block is flanked by the BTNL2 and HLA-DRB3 loci, as well as C6orf10 (for which no transcripts have been identified to date). Several previous studies have reported associations in the telomeric class III region, independent of HLA-DRB1, particularly centred on the lymphotoxin α and tumour necrosis factor-α loci.31–35 Our WGA scan data did not provide evidence for association with RA of SNPs in this region (all z < 0.8), but did demonstrate association with SNPs in and around loci encoding components of the complement pathway, complement component C2 and complement factor B. Previous studies that have demonstrated association in and around these loci have not been able to consistently show that association is independent of the HLA-DRB1*04 haplotype.36,37 However, taken together, data generated by us and others30–35,38,39 strongly suggest multiple RA susceptibility loci within the HLA region, with the HLA class II association the strongest. A comprehensive HLA SNP genotyping experiment is warranted in RA, using sufficiently large cohorts to enable detection of effects independent of HLA-DRB1.

We have empirically demonstrated that a WGA scan using DNA pooling and combination of data from independent cohorts is an effective method for detecting association at a genome-wide level of significance to complex disease loci of relatively large effect. However, additional strategies will be needed to enable detection of genuine association to loci situated in regions of lower LD than PTPN22 (which would have lower numbers of associated SNPs) and to loci of weaker effect (OR 1.2–1.5, e.g. CTLA4, PADI4). Lowering the threshold of significance for selection of SNPs for follow-up analysis may be necessary for SNPs within functionally relevant genes and genes mapping within areas of linkage to disease. This can be achieved using the false-discovery rate (FDR) principle, which maximizes power by controlling the fraction of false rejections rather than the type I error rate. An FDR approach for weighting WGA scan P-values on the basis of previous linkage data has been proposed.40 Finally, replication of putative associations to loci of weak effect will be vital in additional cohorts of large size. Guidelines for conducting WGA scans have been published.41 However, it is likely that optimal strategies for novel disease gene discovery using WGA scanning will be refined in an empirical process, of which this study and others9,20,29 represent the first steps. The use of DNA pooling should facilitate development of these WGA scanning strategies.

Patients and methods

Clinical samples

The NZ cohort consisted of 384 patients (268 females and 116 males) and 296 healthy controls (148 females and 148 males) and the UK cohort of 241 patients (184 females and 57 males) and 262 healthy controls (116 females and 146 males). All patients satisfied the American College of Rheumatology criteria for the classification of RA.42 Clinical characteristics for the cohorts are (NZ/UK, ±s.d.): 69.4/76.3% females, 42.7±14.5/48.3±13.5 years of age at onset, 17.6±9.0/13.5±10.8 years disease duration, 81.7/64.7% rheumatoid factor positive and 82.1/77.2% positive for the shared epitope.2 Ethical approval for the study in NZ was given by the Otago ethics committee (as lead committee), and in the UK by the Lewisham Hospital and Guy’s and St Thomas’ Hospitals local research ethics committees. All subjects were white Caucasian and gave written informed consent.

Genomic DNA preparation and pooling

The NZ genomic DNA samples were all prepared from white blood cells pelleted from whole blood using a standard Gu-HCl-based white blood cell lysis and chloroform extraction protocol; the UK case genomic DNA samples were all extracted using a standard Tris-HCl-based white blood cell lysis and phenol/chloroform extraction protocol, and the UK control genomic DNA samples were extracted from immortalized cell lines using Qiagen spin column separation technology. DNA samples were electrophoresed on agarose gels and samples with intact genomic DNA showing no smearing were selected for pooling. Intact genomic DNA was diluted to 50 ng/µl concentration based on Quant-iT PicoGreen (Invitrogen, Eugene, Oregon) quantitation and the concentration was then confirmed by repeating the PicoGreen analysis. Concentrations were adjusted based on these results and then PicoGreen analysis was repeated. This process was repeated until all samples consistently measured 50 ng/µl. Pools were constructed by combining equal volumes of each DNA. All pipetting steps were of volumes greater than 2 µl to minimize pipetting error. Four replicates from each pool were prepared and hybridized to Affymetrix GeneChip Mapping 100K Array and Illumina Infinium microarrays according to the manufacturers’ protocols.

Estimation of pooled allele frequency using Affymetrix GeneChip Mapping 100K Array

A new approach was developed to estimate allele frequency by combining hybridization intensities for each SNP from the 10 different oligonucleotide probe quartets consisting of perfect-match (PM) and mismatch (MM) pairs for both alleles (five in the sense direction and five in the antisense direction). In the current version of the Affymetrix algorithm, signals from perfectly matched probes are adjusted by the signal from mismatched probes:43

$$PM'_A = PM_A - MM \quad \text{and} \quad PM'_B = PM_B - MM$$

$PM_A$ is the signal from a probe perfectly matched to allele A, $PM_B$ is the signal from a probe perfectly matched to allele B, and $MM = (MM_A + MM_B)/2$ is the averaged signal from probes mismatched both to alleles A and B. If either $PM'_A$ or $PM'_B$ is negative its value is made equal to zero. For each of the quartets a RAS can be calculated:

$$RAS = \frac{\max(PM'_A, 0)}{\max(PM'_A, 0) + \max(PM'_B, 0)}$$

These RAS values allow clear distinction between homozygous and heterozygous genotypes for the majority of SNPs. However, there are three reasons (A, B and C) why they do not provide a sufficiently accurate estimate of average allele frequency among pooled samples. A: Not all RAS values for the same SNP are equally good for determination of allele frequency (e.g., some of them might have a higher noise level than others). B: RAS values are always between 0 and 1, which makes the distribution of RAS values different from normal for homozygous genotypes in individual samples and for SNPs with low allele frequency in pools. In particular, variation of RAS values for homozygous calls is artificially reduced. This creates bias towards higher heterozygosity (average RAS values even for homozygous calls appear to be higher than 0 and lower than 1) and makes any further statistical analysis difficult. C: The intensity of fluorescence of different alleles is not necessarily equal, creating a bias towards overestimating the frequency of the allele with higher expression. In our analysis we addressed these problems in order to decrease the s.d. First of all, we slightly changed the definition of RAS values:

$$RAS = \frac{PM_A - MM}{PM_A + PM_B - 2MM}$$

This change allows RAS to be <0 and >1, thus eliminating bias towards higher heterozygosity and resolving problem B (we tried to exclude the averaged mismatch signal MM from the above equation but found that precision in estimation of allele frequency decreased for the majority of SNPs (data not shown)). We started by genotyping 94 individual CEPH samples. Then for each quartet we calculated the median RAS values for genotypes AA, AB and BB ($RAS_{AA}$, $RAS_{AB}$ and $RAS_{BB}$). We used median values rather than means to decrease the effect of outliers. For simplicity we assume in our further explanations that $RAS_{AA} \geq RAS_{AB} \geq RAS_{BB}$. To estimate the average allele frequency among pooled samples, linear interpolation was first used to calculate the frequency of allele A based on the data from each quartet. If in the pool $RAS \geq RAS_{AB}$ the following equation was used to estimate allele frequency:

$$f = 0.5\,\frac{RAS - RAS_{AA}}{RAS_{AB} - RAS_{AA}}$$

If $RAS_{AB} \geq RAS$ the following equation was used instead:

$$f = 1 - 0.5\,\frac{RAS - RAS_{BB}}{RAS_{AB} - RAS_{BB}}$$
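The modified RAS and the per-quartet interpolation can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names and the example intensities are assumptions, and the interpolation is oriented so that a pool reading equal to the AA, AB or BB reference median maps to an allele A frequency of 1, 0.5 or 0, the convention used for the individually genotyped samples.

```python
def modified_ras(pm_a, pm_b, mm_a, mm_b):
    """Modified relative allele signal for one probe quartet.

    Unlike the clipped definition, the result may fall below 0 or above 1.
    """
    mm = (mm_a + mm_b) / 2.0          # averaged mismatch signal
    denom = pm_a + pm_b - 2.0 * mm
    if denom == 0:
        raise ValueError("PM_A + PM_B - 2*MM is zero; RAS is undefined")
    return (pm_a - mm) / denom


def interpolate_frequency(ras, ras_aa, ras_ab, ras_bb):
    """Piecewise-linear estimate of the allele A frequency from one quartet.

    ras_aa, ras_ab and ras_bb are the reference medians from individually
    genotyped samples, assumed to satisfy ras_aa >= ras_ab >= ras_bb.
    Orientation (an assumption): ras_aa -> 1, ras_ab -> 0.5, ras_bb -> 0.
    """
    if ras >= ras_ab:
        return 0.5 + 0.5 * (ras - ras_ab) / (ras_aa - ras_ab)
    return 0.5 * (ras - ras_bb) / (ras_ab - ras_bb)


# Hypothetical intensities for a single quartet measured on a pool.
pool_ras = modified_ras(pm_a=1000.0, pm_b=200.0, mm_a=100.0, mm_b=100.0)
pool_freq = interpolate_frequency(pool_ras, ras_aa=0.95, ras_ab=0.50, ras_bb=0.05)
```

Using two separate line segments, one on each side of the heterozygote median, lets an off-centre heterozygote signal be absorbed by the calibration instead of biasing the estimate.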

This approach allowed minimization of the bias from problems B and C. After that we combined these frequencies with weights reflecting the quality of the corresponding RAS values, which allowed resolution of problem A:

$$f = \sum_q w_q f_q$$

where the sum is over q = 1 to 10 and $w_q$ is a weight for quartet q. To determine these parameters we used genotyping data of the individually typed CEPH samples (with the 0.25 quality parameter cutoff the average call rate was 97.8% for the Affymetrix Xba chip and 96.8% for the Hind chip). Each individually genotyped sample s can be considered as a pool consisting of a single sample. The allele frequency $f_s$ in this pool is fully determined by the individual’s genotype: $f_s = 1$ for the AA genotype, $f_s = 0.5$ for the AB genotype and $f_s = 0$ for the BB genotype. In order to determine optimal weights $w_q$ for quartets, we minimized the differences between predicted allele frequencies $f = \sum_q w_q f_q$ and known frequencies $f_s$. More precisely, the following expression over $w_q$ was minimized:

$$E = \sum_s \left( \sum_q w_q f_q - f_s \right)^2$$

where s denotes individually genotyped samples (s = 1, …, n), and $f_s$ is 0, 0.5 or 1.0 according to the genotype of the individual.
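Minimizing E over the weights is an ordinary least-squares problem, so it can be sketched with NumPy's `numpy.linalg.lstsq`. The code below is an assumption-laden illustration, not the authors' software: the function names are hypothetical, and the calibration matrix is simulated stand-in data (94 samples, 10 quartets with different noise levels) instead of real CEPH intensities.

```python
import numpy as np


def fit_quartet_weights(f_quartets, f_known):
    """Least-squares weights w_q minimizing sum_s (sum_q w_q f_qs - f_s)^2.

    f_quartets: array of shape (n_samples, n_quartets) holding the
                per-quartet frequency estimates for each genotyped sample.
    f_known:    array of shape (n_samples,) with values 0, 0.5 or 1.
    """
    design = np.asarray(f_quartets, dtype=float)
    target = np.asarray(f_known, dtype=float)
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return weights


def combine_quartets(weights, f_quartets_pool):
    """Weighted pool estimate f = sum_q w_q f_q for one SNP."""
    return float(np.dot(weights, f_quartets_pool))


# Simulated calibration data (hypothetical): true genotype frequencies plus
# quartet-specific Gaussian noise, with noisier quartets at higher indices.
rng = np.random.default_rng(0)
f_known = rng.choice([0.0, 0.5, 1.0], size=94)
noise_sd = np.linspace(0.02, 0.20, 10)
f_quartets = f_known[:, None] + rng.normal(0.0, noise_sd, size=(94, 10))

weights = fit_quartet_weights(f_quartets, f_known)
```

On data like these the fit tends to down-weight the noisier quartets, which is the behaviour the weighting is meant to produce; by construction the fitted weights can never give a larger E than equal weighting does.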

Estimation of pooled allele frequency using the Illumina Infinium array

A similar approach was used to estimate average allele frequency among pooled samples using the Illumina Infinium array. For each SNP there are two probes in this array, one corresponding to each allele. We started with analysis of genotyping data for 120 individual DNA samples provided to us by Illumina (the average call rate for the Infinium microarray was 99.96%). Median signal was determined for both alleles for genotypes AA ($A_{AA}$ and $B_{AA}$), AB ($A_{AB}$ and $B_{AB}$) and BB ($A_{BB}$ and $B_{BB}$). Three RAS values, one for each genotype, were estimated using the following formulae:

$$RAS_{AA} = \frac{A_{AA}}{kB_{AA} + A_{AA}}, \quad RAS_{AB} = \frac{A_{AB}}{kB_{AB} + A_{AB}}, \quad RAS_{BB} = \frac{A_{BB}}{kB_{BB} + A_{BB}}$$

Parameter k compensates for over-expression of one of the alleles. It was estimated from the formula $RAS_{AB} = (RAS_{AA} + RAS_{BB})/2$.

If in the pool $RAS \geq RAS_{AB}$ the following equation was used to estimate allele frequency:

$$f = 0.5\,\frac{RAS - RAS_{AA}}{RAS_{AB} - RAS_{AA}}$$

If $RAS_{AB} \geq RAS$ the following equation was used instead:

$$f = 1 - 0.5\,\frac{RAS - RAS_{BB}}{RAS_{AB} - RAS_{BB}}$$
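The condition used to estimate k has no simple closed form once background signal is present, so one way to solve it is a one-dimensional root search. The sketch below uses plain bisection; the bracket, the iteration count, the function names and the median signals are all assumptions chosen for illustration, and the text does not say which numerical method was actually used.

```python
def ras(a_signal, b_signal, k):
    """Relative allele signal with the B channel rescaled by k."""
    return a_signal / (k * b_signal + a_signal)


def estimate_k(medians, lo=1e-3, hi=1e3, iterations=200):
    """Find k such that RAS_AB = (RAS_AA + RAS_BB) / 2 by bisection.

    medians maps each genotype ("AA", "AB", "BB") to its median
    (A signal, B signal) pair from individually genotyped samples.
    """
    def imbalance(k):
        r_aa = ras(medians["AA"][0], medians["AA"][1], k)
        r_ab = ras(medians["AB"][0], medians["AB"][1], k)
        r_bb = ras(medians["BB"][0], medians["BB"][1], k)
        return r_ab - 0.5 * (r_aa + r_bb)

    g_lo, g_hi = imbalance(lo), imbalance(hi)
    if g_lo * g_hi > 0:
        raise ValueError("bracket [lo, hi] does not contain a sign change")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        g_mid = imbalance(mid)
        if g_lo * g_mid <= 0:
            hi = mid                  # root lies in the lower half
        else:
            lo, g_lo = mid, g_mid     # root lies in the upper half
    return 0.5 * (lo + hi)


# Hypothetical medians in which the B channel is about twice as bright as A.
example_medians = {
    "AA": (1000.0, 40.0),
    "AB": (500.0, 1000.0),
    "BB": (40.0, 2000.0),
}
k_example = estimate_k(example_medians)
```

With k fixed in this way the heterozygote RAS sits midway between the two homozygote values, so the two interpolation segments have equal length.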

SNPs in LD

Estimated allele frequencies for SNPs in total LD were averaged with weights based on their performance:

$$f_0 = \frac{\sum_i f_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}$$

where $\sigma_i$ is the SNP-specific s.d. of the allele frequency estimation, $f_i$ is the estimate of allele frequency of one of the group of SNPs in LD, and $f_0$ is the averaged estimate of allele frequency for this group of SNPs.
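The averaging formula is a standard inverse-variance weighted mean. A minimal Python sketch follows; the function name and the two example SNPs are hypothetical.

```python
def inverse_variance_mean(freqs, sds):
    """Average allele-frequency estimates, weighting each by 1 / s.d.^2."""
    if len(freqs) != len(sds) or not freqs:
        raise ValueError("freqs and sds must be non-empty and of equal length")
    weights = [1.0 / (sd * sd) for sd in sds]
    return sum(w * f for w, f in zip(weights, freqs)) / sum(weights)


# Two SNPs in total LD: the first is measured twice as precisely as the
# second, so it receives four times the weight.
f0_example = inverse_variance_mean([0.30, 0.34], [0.01, 0.02])
```

Here the combined estimate lies four-fifths of the way towards the more precise SNP.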

Exclusion of SNPs from analysis

Of the 116 204 Affymetrix GeneChip Mapping 100K Array SNPs, 15.35% were excluded from analysis on the basis of not being informative in the CEPH samples (12.69%), location in duplicated regions (0.27%; whereby there was a 100% match of the 50 surrounding bases to two different positions within the human genome), deviation from Hardy–Weinberg equilibrium (HWE) (1.19%; P < 1 × 10^-5), and poor performance in estimating pooled allele frequency in CEPH individuals (1.20%; difference >12%). In the case of the Illumina Infinium array, 4.08% of the 109 365 SNPs were excluded from analysis on the basis of not being informative (4.07%) or presence in duplicated regions (0.01%) (insufficient Caucasian CEPH samples were genotyped to obtain data on SNPs deviating from HWE or performing poorly in allele frequency estimates from genomic DNA pools).

Differential bias in genotype scoring between case and control cohorts can arise when DNA samples originate from different laboratories; this was observed using the MegAllele (ParAllele BioScience) technology, with the authors having no reason to believe that the differential bias phenomenon is specific to this genotyping platform.44 This is unlikely to occur with our method: all SNPs with a high ‘half-call’ rate that contribute to this phenomenon44 would be excluded from further analysis owing to high s.d.

Detection of association

The difference in allele frequency between pools was expressed as a z-score:

$$z = \frac{|f_{cases} - f_{controls}|}{\sigma}$$

where $f_{cases}$ is the frequency of a SNP allele in cases, $f_{controls}$ is the same in controls and $\sigma$ is the s.d., whose square consists of two parts:

$$\sigma^2 = \sigma_s^2 + \sigma_m^2$$

The first part, $\sigma_s^2$, represents a sampling error owing to the limited number of cases and controls, and is the only error in the situation where cases and controls are genotyped individually. It can be expressed as:

$$\sigma_s^2 = f(1 - f)\,\frac{N_{cases} + N_{controls}}{N_{cases} N_{controls}}$$

where $N_{cases}$ and $N_{controls}$ are the number of chromosomes in cases and controls, respectively, and f is the population allele frequency. The second part, $\sigma_m^2$, is the error owing to imprecise measurement of the allele frequency in a pool:

$$\sigma_m^2 = \frac{\sigma_0^2}{n_{cases}} + \frac{\sigma_0^2}{n_{controls}} = \sigma_0^2\,\frac{n_{cases} + n_{controls}}{n_{cases} n_{controls}}$$

where $n_{cases}$ and $n_{controls}$ are the numbers of microarray replicates for pools of cases and controls, respectively (for these studies $n_{cases} = 4$ and $n_{controls} = 4$), and $\sigma_0$ is the SNP-specific s.d. of the allele frequency estimation. Under the null hypothesis of no association, the z-score is normally distributed with mean = 0 and variance = 1. Therefore a z-score can be translated into a significance level expressed as a P-value.
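The z-score with its sampling and measurement components can be sketched as follows. Three things here are assumptions not stated in the text: the population frequency f is estimated as the chromosome-weighted mean of the two pool frequencies, the P-value is taken as two-sided, and the allele frequencies and per-SNP s.d. in the example are invented. The chromosome counts correspond to the NZ pools (384 cases and 296 controls, two chromosomes each).

```python
import math
from statistics import NormalDist


def pooled_z_score(f_cases, f_controls, n_chrom_cases, n_chrom_controls,
                   sd0, n_rep_cases=4, n_rep_controls=4):
    """Return (z, p) for a case-control difference in pooled allele frequency."""
    # Assumed estimator of the population allele frequency f.
    f = ((f_cases * n_chrom_cases + f_controls * n_chrom_controls)
         / (n_chrom_cases + n_chrom_controls))
    # Sampling variance from the finite number of chromosomes in each pool.
    var_sampling = (f * (1.0 - f) * (n_chrom_cases + n_chrom_controls)
                    / (n_chrom_cases * n_chrom_controls))
    # Measurement variance from imprecise array estimates, reduced by replicates.
    var_measure = (sd0 ** 2 * (n_rep_cases + n_rep_controls)
                   / (n_rep_cases * n_rep_controls))
    z = abs(f_cases - f_controls) / math.sqrt(var_sampling + var_measure)
    p = 2.0 * (1.0 - NormalDist().cdf(z))   # two-sided, by assumption
    return z, p


# Hypothetical SNP measured on the NZ pools.
z_example, p_example = pooled_z_score(0.30, 0.22, 768, 592, sd0=0.02)
# Same SNP with a perfect measurement, i.e. as if genotyped individually.
z_individual, _ = pooled_z_score(0.30, 0.22, 768, 592, sd0=0.0)
```

Comparing the two calls shows the cost of pooling: the measurement term can only lower the z-score relative to individual genotyping, and adding chromosomes to the pools does not shrink it.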

Composite z-scores were calculated in four steps (A–D). Step A. For each study SNPs were ordered according to their z-score (z-score1_UK, z-score1_NZ) and corresponding rank numbers assigned to each SNP. Step B. P-values were then calculated for each SNP to attain the given rank in each study: P_i = R_i/N where R_i is the rank of SNP i and N is the total number of SNPs. After that we converted these P-values back to z-scores (z-score2_UK, z-score2_NZ) assuming a normal distribution. Step C. For each SNP with the allele frequency in cases lower than in controls, z-score3 = −z-score2. Otherwise, z-score3 = z-score2. Step D. The composite z-score was calculated as |z-score3_UK + z-score3_NZ|/√2. Composite z-scores were obtained from Illumina Infinium-derived allele frequency estimates by applying Steps C and D only, whereas Steps A–D were applied to Affymetrix 100K GeneChip-derived allele frequency estimates because strong fluctuations in signal intensity in some SNPs led to deviation from the normal distribution of original z-scores. Therefore Steps A and B were added to enforce a Gaussian (normal) distribution of z-scores. Unfortunately, this also decreased the power of our method, weakening the effect from genuinely disease-associated SNPs whose z-scores do not fall in a Gaussian distribution.
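Steps A to D can be sketched with the standard library alone. Several details below are assumptions, since the text does not specify them: rank 1 is given to the largest z-score, ties are broken by input order, and the rank-based P-value is converted back with the two-sided relation z = Φ⁻¹(1 − P/2), which maps P = 1 to z = 0. The function names and example values are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def rank_normalised_z(z_scores):
    """Steps A and B: replace each z-score by the normal quantile of its rank."""
    n = len(z_scores)
    order = sorted(range(n), key=lambda i: z_scores[i], reverse=True)
    z2 = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        p = rank / n                                  # P_i = R_i / N
        z2[idx] = _STD_NORMAL.inv_cdf(1.0 - p / 2.0)  # two-sided, by assumption
    return z2


def composite_z(z_uk, z_nz, sign_uk, sign_nz):
    """Steps C and D: sign each z-score by direction, then combine.

    sign_* is +1 where the allele is more frequent in cases than in
    controls and -1 where it is less frequent.
    """
    return [abs(su * zu + sn * zn) / math.sqrt(2.0)
            for zu, zn, su, sn in zip(z_uk, z_nz, sign_uk, sign_nz)]


# Four hypothetical SNPs scored in two cohorts.
z2_uk = rank_normalised_z([3.0, 1.0, 0.2, 2.0])
z2_nz = rank_normalised_z([2.5, 0.4, 1.1, 0.9])
combined = composite_z(z2_uk, z2_nz, [1, -1, 1, 1], [1, 1, 1, -1])
```

Signing the scores before summing means a SNP only gains from combination when both cohorts point in the same direction; opposite directions partly cancel.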

Target regions examined

For PTPN22, analysis was limited to SNPs within a large conserved haplotype block of ~365 kb (113.800–114.165 Mb), which contained all of PTPN22 (114.004–114.127 Mb). In the case of the HLA region, analysis extended to the most centromeric class II locus (DPA3). The HLA class II and III physical window was 31.500–33.207 Mb. Between the Affymetrix GeneChip Mapping 100K and Illumina Infinium arrays there were 257 SNPs included in the analysis within the HLA window (60 SNPs on the Affymetrix GeneChip Mapping 100K array and 202 SNPs on the Illumina Infinium array; five of these SNPs appeared on both arrays) and 45 within the PTPN22 window (11 SNPs on the Affymetrix GeneChip Mapping 100K array and 35 SNPs on the Illumina Infinium array; one of these SNPs appeared on both arrays). Haplotype block structure and intermarker LD relationships were inferred from CEU Caucasian data available from HapMap (www.hapmap.org).

Individual genotyping

Individual genotyping was carried out for the HLA rs9268614 and MAGI3 rs1343125 SNPs over 869 NZ cases and 563 controls (the same cohort genotyped for rs2476601 (PTPN22 R620W; ref. 45)) using PCR-RFLP. For rs9268614, PCR with primers ACACGGGCCATGAAGGAATCTGAA and GTTGAAGGCAGGAATGAGTGTGGT created a 229 bp product that was cleaved by HhaI into 150 and 79 bp products in the presence of the G allele. For rs1343125, PCR with primers CTATCTACTGACCATTCTGGTATC and TCCTCTATAGTGTGAAATTGAGGG created a 305 bp product that was cleaved by MspI into fragments of 207 and 98 bp. In addition, 11.3% of the samples were genotyped for rs1343125 using SNPlex (Applied Biosystems, Foster City, CA, USA); there was 100% concordance in genotypes between the two methodologies. This SNP was not in HWE in the NZ controls (Table 3a; P = 0.002); a possible explanation is a 63.1 kb deletion within MAGI3 encompassing rs1343125.46 Allele frequencies at individual SNPs were compared between cases and controls using standard χ² statistics, with Fisher’s P-values reported. Alleles were tested for deviation from HWE using the SHEsis software platform.47

COCAPHASE from UNPHASED48 was used to implement the expectation-maximization (EM) algorithm for haplotype estimation and to perform haplotype association testing (unconditional logistic regression based on the maximum-likelihood frequency estimates from the EM algorithm) between cases and controls. Individual haplotypes were tested for association by grouping all others together, and haplotype-specific ORs were generated in the same way. Ninety-five percent confidence intervals (CIs) were calculated from estimated haplotype ‘counts’ using Woolf’s method; this is likely to underestimate the 95% CI owing to uncertain haplotype phase. Conditional analysis of one locus on the allele at a second locus, in LD with the first locus and known to be associated, was also performed in COCAPHASE; this implements the test of equality of ORs for haplotypes identical at conditioning loci.
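Woolf's method places a normal-theory interval on the log odds ratio, with standard error equal to the square root of the summed reciprocal cell counts. The sketch below shows that calculation for one haplotype against all others grouped together. The function name and the counts are hypothetical, and treating EM-estimated (possibly fractional) counts as if they were observed is the simplification that makes the interval too narrow.

```python
import math


def woolf_ci(hap_cases, other_cases, hap_controls, other_controls, z=1.959964):
    """Odds ratio and Woolf (logit) confidence interval for a 2 x 2 table.

    Counts may be fractional, as with EM-estimated haplotype counts,
    but every cell must be positive.
    """
    cells = (hap_cases, other_cases, hap_controls, other_controls)
    if min(cells) <= 0:
        raise ValueError("all four counts must be positive")
    odds_ratio = (hap_cases * other_controls) / (other_cases * hap_controls)
    se_log_or = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical estimated haplotype counts (not taken from Table 3b).
or_example, ci_low, ci_high = woolf_ci(50.0, 150.0, 30.0, 170.0)
```

The interval is symmetric on the log scale, so the product of its limits equals the squared odds ratio.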

Acknowledgements

This work was supported by the Health Research Council of New Zealand, the Arthritis and Rheumatism Council in the United Kingdom, Myriad Genetics Inc. and NHS Research and Development funding for recruitment carried out at Guy’s and St Thomas’ and Lewisham hospitals. We thank NZ research nurses Gael Hewett and Sue Yeoman, UK research nurse Janet Grumley, and Bhaneeta Lad for technical assistance, and Cathryn Lewis and Sheila Fisher for statistical advice.

References

1 MacGregor AJ, Snieder H, Rigby AS, Koskenvuo M, Kaprio J, Aho K, Silman AJ. Characterising the quantitative genetic contribution to rheumatoid arthritis using data from twins. Arthritis Rheum 2000; 43: 30–37.

2 Gregersen PK, Silver J, Winchester RJ. The shared epitope hypothesis. An approach to understanding the molecular genetics of susceptibility to rheumatoid arthritis. Arthritis Rheum 1987; 30: 1205–1213.

3 Gregersen PK, Lee HS, Batliwalla F, Begovich AB. PTPN22: setting thresholds for autoimmunity. Semin Immunol 2006; 18: 214–223.

4 Iwamoto T, Ikari K, Nakamura T, Kuwahara M, Toyama Y, Tomatsu T et al. Association between PADI4 and rheumatoid arthritis: a meta-analysis. Rheumatology 2006; 45: 804–807.

5 Plenge RM, Padyukov L, Remmers EF, Purcell S, Lee AT, Karlson EW et al. Replication of putative candidate-gene associations with rheumatoid arthritis in >4000 samples from North America and Sweden: association of susceptibility with PTPN22, CTLA4, and PADI4. Am J Hum Genet 2005; 77: 1044–1060.

6 Matsuzaki H, Dong S, Loi H, Di X, Liu G, Hubbell E et al. Genotyping over 100 000 SNPs on a pair of oligonucleotide arrays. Nat Methods 2004; 1: 109–111.

7 Gunderson KL, Steemers FJ, Lee G, Mendoza LG, Chee MS. A genome-wide scalable SNP genotyping assay using microarray technology. Nat Genet 2005; 37: 549–554.

8 Gunderson KL, Kuhn KM, Steemers FJ, Ng P, Murray SS, Shen R. Whole-genome genotyping of haplotype tag single nucleotide polymorphisms. Pharmacogenomics 2006; 7: 641–648.

9 Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C et al. Complement factor H polymorphism in age-related macular degeneration. Science 2005; 308: 385–389.

10 Herbert A, Gerry NP, McQueen MB, Heid IM, Pfeufer A, Illig T et al. A common genetic variant is associated with adult and childhood obesity. Science 2006; 312: 279–283.

11 Smyth DJ, Cooper JD, Bailey R, Field S, Burren O, Smink LJ et al. A genome-wide association study of nonsynonymous SNPs identifies a type 1 diabetes locus in the interferon-induced helicase (IFIH1) region. Nat Genet 2006; 38: 617–619.

12 Zuo Y, Zuo G, Zhao H. Two-stage designs in case–control association analysis. Genetics 2006; 173: 1747–1760.


13 Skol AD, Scott LJ, Abecasis GR, Boehnke M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 2006; 38: 209–213.

14 Fisher PJ, Turic D, Williams NM, McGuffin P, Asherson P, Ball D et al. DNA pooling identifies QTLs on chromosome 4 for general cognitive ability in children. Hum Mol Genet 1999; 8: 915–922.

15 Bansal A, van den Boom D, Kammerer S, Honisch C, Adam G, Cantor CR et al. Association testing by DNA pooling: an effective initial screen. Proc Natl Acad Sci USA 2002; 99: 16871–16874.

16 Begovich AB, Carlton VE, Honigberg LA, Schrodi SJ, Chokkalingam AP, Alexander HC et al. A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am J Hum Genet 2004; 75: 330–337.

17 Spector TD, Reneland RH, Mah S, Valdes AM, Hart DJ, Kammerer S et al. Association between a variation in LRCH1 and knee osteoarthritis. Arthritis Rheum 2006; 54: 524–532.

18 Spinola M, Meyer P, Kammerer S, Falvella FS, Boettger ME, Hoyal CR et al. Association of the PDCD5 locus with lung cancer risk and prognosis in smokers. J Clin Oncol 2006; 24: 1672–1678.

19 Meaburn E, Butcher LM, Schalkwyk LC, Plomin R. Genotyping pooled DNA using 100K SNP microarrays: a step towards genomewide association scans. Nucl Acids Res 2006; 34: e27.

20 Downes K, Barratt BJ, Akan P, Bumpstead SJ, Taylor SD, Clayton DG, Deloukas P. SNP allele frequency estimation in DNA pools and variance components analysis. Biotechniques 2004; 5: 840–845.

21 Carlton VE, Hu X, Chokkalingam AP, Schrodi SJ, Brandon R, Alexander HC et al. PTPN22 genetic variation: evidence for multiple variants associated with rheumatoid arthritis. Am J Hum Genet 2005; 77: 567–581.

22 Altshuler D, Hirschhorn JN, Klannemark M, Lindgren CM, Vohl MC, Nemesh J et al. The common PPARgamma Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes. Nat Genet 2000; 26: 76–80.

23 Barrett JC, Cardon LR. Evaluating coverage of genome-wide association studies. Nat Genet 2006; 38: 659–662.

24 Moskvina V, Norton N, Williams N, Holmans P, Owen M, O’Donovan M. Streamlined analysis of pooled genotype data in SNP-based association studies. Genet Epidemiol 2005; 28: 273–282.

25 Meaburn E, Butcher LM, Liu L, Fernandes C, Hansen V, Al-Chalabi A et al. Genotyping DNA pools on microarrays: Tackling the QTL problem of large samples and large numbers of SNPs. BMC Genomics 2005; 6: 52–59.

26 Kirov G, Nikolov I, Georgieva L, Moskvina V, Owen MJ, O’Donovan MC. Pooled DNA genotyping on Affymetrix SNP genotyping arrays. BMC Genomics 2006; 7: 27.

27 Wright GJ, Leslie JD, Ariza-McNaughton L, Lewis J. Delta proteins and MAGI proteins: an interaction of Notch ligands with intracellular scaffolding molecules and its significance for zebrafish development. Development 2004; 131: 5659–5669.

28 Ando K, Kanazawa S, Tetsuka T, Ohta S, Jiang X, Tada T et al. Induction of Notch signaling by tumor necrosis factor in rheumatoid synovial fibroblasts. Oncogene 2003; 22: 7796–7803.

29 Arking DE, Pfeufer A, Post W, Kao WHL, Newton-Cheh C, Ikeda M et al. A common genetic variant in the NOS1 regulator NOS1AP modulates cardiac repolarization. Nat Genet 2006; 38: 644–651.

30 Tamiya G, Shinya M, Imanishi T, Ikuta T, Makino S, Okamoto K et al. Whole genome association study of rheumatoid arthritis using 27 039 microsatellites. Hum Mol Genet 2005; 14: 2305–2321.

31 Newton J, Brown MA, Milicic A, Ackerman H, Darke C, Wilson JN et al. The effect of HLA-DR on susceptibility to rheumatoid arthritis is influenced by the associated lymphotoxin alpha-tumor necrosis factor haplotype. Arthritis Rheum 2003; 48: 90–96.

32 Newton JL, Harney SM, Timms AE, Sims AM, Rockett K, Darke C et al. Dissection of class III major histocompatibility complex haplotypes associated with rheumatoid arthritis. Arthritis Rheum 2004; 50: 2122–2129.

33 Jawaheer D, Li W, Graham RR, Chen W, Damle A, Xiao X et al. Dissecting the genetic complexity of the association between human leukocyte antigens and rheumatoid arthritis. Am J Hum Genet 2002; 71: 585–594.

34 Singal DP, Li J, Lei K. Genetics of rheumatoid arthritis (RA): two separate regions in the major histocompatibility complex contribute to susceptibility to RA. Immunol Lett 1999; 69: 301–306.

35 Zanelli E, Jones G, Pascual M, Eerligh P, van der Slik AR, Zwinderman AH et al. The telomeric part of the HLA region predisposes to rheumatoid arthritis independently of the class II loci. Hum Immunol 2001; 62: 75–84.

36 Dyer PA, Thomson W, Sanders PA, Grennan DM. Are major histocompatibility system class III products independent markers for susceptibility to rheumatoid arthritis? Dis Markers 1986; 4: 151–155.

37 Fielder AH, Ollier W, Lord DK, Burley MW, Silman A, Awad J et al. HLA class III haplotypes in multicase rheumatoid arthritis families. Hum Immunol 1989; 25: 75–85.

38 Okamoto K, Makino S, Yoshikawa Y, Takaki A, Nagatsuka Y, Ota M et al. Identification of I kappa BL as the second major histocompatibility complex-linked susceptibility locus for rheumatoid arthritis. Am J Hum Genet 2003; 72: 303–312.

39 Brintnell W, Zeggini E, Barton A, Thomson W, Eyre S, Hinks A et al. Evidence for a novel rheumatoid arthritis susceptibility locus on chromosome 6p. Arthritis Rheum 2004; 50: 3823–3830.

40 Roeder K, Bacanu SA, Wasserman L, Devlin B. Using linkage genome scans to improve power of association in genome scans. Am J Hum Genet 2006; 78: 243–252.

41 Ehm MG, Nelson MR, Spurr NK. Guidelines for conducting and reporting whole genome/large-scale association studies. Hum Mol Genet 2005; 14: 2485–2488.

42 Arnett FC, Edworthy SM, Bloch DA, McShane DJ, Fries JF, Cooper NS et al. The American Rheumatism Association 1987 revised criteria for the classification of rheumatoid arthritis. Arthritis Rheum 1988; 31: 315–324.

43 Liu WM, Di X, Yang G, Matsuzaki H, Huang J, Mei R et al. Algorithms for large-scale genotyping microarrays. Bioinformatics 2003; 19: 2397–2403.

44 Clayton DG, Walker NM, Smyth DJ, Pask R, Cooper JD, Maier LM et al. Population structure, differential bias and genomic control in a large-scale, case–control association study. Nat Genet 2005; 37: 1243–1246.

45 Simkins HM, Merriman ME, Highton J, Chapman PT, O’Donnell JL, Jones PB et al. Association of the PTPN22 locus with rheumatoid arthritis in a New Zealand Caucasian cohort. Arthritis Rheum 2005; 52: 2222–2225.

46 Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 2006; 38: 75–81.

47 Shi YY, He L. SHEsis, a powerful software platform for analyses of linkage disequilibrium, haplotype construction, and genetic association at polymorphic loci. Cell Res 2005; 15: 97–98.

48 Dudbridge F. Pedigree disequilibrium tests for multilocus haplotypes. Genet Epidemiol 2003; 25: 115–121.
