

Genomics 84 (2004) 899–912

www.elsevier.com/locate/ygeno

Linkage disequilibrium maps constructed with common SNPs are useful for first-pass disease association screens

P. Taillon-Miller a,1, S.F. Saccone b,1, N.L. Saccone c,1, S. Duan a, E.F. Kloss a, E.G. Lovins d, R. Donaldson a, A. Phong d, C. Ha d, L. Flagstad a, S. Miller a, A. Drendel a, D. Lind d, R.D. Miller a, J.P. Rice b, P-Y. Kwok d,e,*

a Department of Dermatology, Washington University School of Medicine, St. Louis, MO 63110, USA

b Department of Psychiatry, Washington University School of Medicine, St. Louis, MO 63110, USA

c Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA

d Cardiovascular Research Institute, University of California at San Francisco, Long 1332A, San Francisco, CA 94143, USA

e Department of Dermatology, University of California at San Francisco, Long 1332A, San Francisco, CA 94143, USA

Received 10 August 2004; accepted 10 August 2004

Available online 17 September 2004

Abstract

To develop an efficient strategy for mapping genetic factors associated with common diseases, we constructed linkage disequilibrium

(LD) maps of human chromosomes 5, 7, 17, and X. These maps consist of common single nucleotide polymorphisms at an average

intermarker distance of 100 kb. The genotype data from these markers in a panel of American samples of European descent were analyzed to

produce blocks of markers in strong pair-wise LD. Power calculations were used to guide block definitions and predicted that high-level LD

maps would be useful in initial genome scans for susceptibility alleles in case-control association studies of complex diseases. As anticipated,

LD blocks on the X chromosome were larger and covered more of the chromosome than those found on the autosomes.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Polymorphism; Single nucleotide; Linkage disequilibrium; Genomics; Sample size

0888-7543/$ - see front matter © 2004 Elsevier Inc. All rights reserved.

doi:10.1016/j.ygeno.2004.08.009

* Corresponding author. Fax: (415) 476 2283.

E-mail address: [email protected] (P-Y. Kwok).

1 These authors contributed equally to this publication.

Introduction

With the finished reference human genome sequence available and haplotype maps for human panels from three continents being constructed [1], there is intense discussion and debate on how to take advantage of the genomic information to study common diseases with a genetic approach. In an ideal world, the functions of all the genes and their regulatory elements would be known and the entire genome of an individual could be sequenced quickly and inexpensively. Therefore, one could obtain the complete genome sequences of a group of patients suffering from a disease and compare them to those of a group of appropriately selected control individuals and identify genes or regulatory elements with functional differences between the two groups. But until this level of biological understanding and technical capability is reached, one has to rely on indirect tools in the search for disease-predisposing alleles. While genetic linkage studies of families have been very successful in identifying genes responsible for simple traits, it has been proposed that association studies using single nucleotide polymorphisms (SNPs) will be most useful to identify genes involved in complex diseases [2]. Association testing relies on the principle that nearby SNPs are often in linkage disequilibrium (LD); however, LD is not a simple function of the distance between markers. Instead, one observes a complicated pattern of regions of extensive LD punctuated by regions of no LD across the genome [3–9]. This pattern of LD is a result of many factors, including genetic drift, admixture, migration, population structure, variable mutation rates, variable recombination rates, gene conversion, and natural selection [10].

Defining the patterns of LD and the underlying haplotype

structure across the human genome is expected to aid in

elucidating the causes of complex human diseases by

helping researchers design association studies that make

efficient use of SNP markers [11]. Several groups have

demonstrated that the haplotype structures are complex,

requiring many markers to define the major haplotypes

found in even small regions of the genome (10 to 20 kb)

[4,6,8,9,12,13]. What is often overlooked, however, is the

fact that many contiguous “haplotype blocks” are highly

correlated and that markers in neighboring haplotype blocks

are often in strong LD with each other [5,9]. These

observations prompted us to evaluate whether high-level

LD maps would capture useful long-range LD structure and

allow selection of representative “tag” SNPs in regions of

strong LD. A hierarchical strategy of performing initial

genome scans with reduced-density maps in case-control

association studies could then provide an efficient way to

identify genetic factors associated with common diseases. In

this strategy a secondary screen would likely be required to

localize a disease variant to a specific gene once a region of

association has been detected. Variants in regions with no

LD could be missed in the initial pass, as would those with

very low frequency or multiple alleles, but these cases will

be difficult with any screening strategy. For such cases a

second pass with increased density in regions of low LD

might be necessary. However, a whole genome scan

following this approach could potentially use many fewer

SNPs than has been previously proposed, thus making such

studies practical. The approach could also be used to study

genetic effects that have been localized to chromosomal

regions in prior family studies, or to survey candidate genes

of special interest, as it provides a means for prioritization of

SNPs for follow-up genotyping.

To develop this approach, we present the LD maps of human chromosomes 5, 7, 17, and X (covering ~20% of the human genome) consisting of common SNPs chosen according to allele frequency data generated from three panels of 42 individuals (European, Asian, and African American) and genotyped in a panel of 94 American samples of European descent. LD block definitions were guided by power calculations, which predict that reduced-density maps will be useful in genetic analyses of common diseases.

Results

We selected 5344 common SNPs (each with previously estimated minor allele frequencies of ≥10%) in each of three panels of individuals of European, Asian, and African descent (a list of SNPs is provided in Supplemental Tables S1a–S1d) across human chromosomes 5, 7, 17, and X. These SNPs were genotyped in 94 genetically unrelated samples from the CEPH collection (Supplemental Tables S2a and S2b) using a primer extension assay with fluorescence polarization detection [14]. On the X chromosome, only males were genotyped so we were able to determine complete haplotypes across the chromosome (Supplemental Table S3). The genotypes generated from the autosomal SNPs are given in Supplemental Tables S4a–S4c.

For each chromosome, pair-wise linkage disequilibrium between all markers with minor allele frequencies (MAF) ≥10% was measured using D, D′, and r². Because the results were very similar between SNPs with MAF ≥10% and those with MAF ≥20%, only the data from the SNPs with MAF ≥20% will be presented. The LD results from all pair-wise comparisons with SNPs with MAF ≥10% are not given with this publication but are available at http://snp.wustl.edu/snp_research/ld_blocks; edited results for markers with MAF ≥20% are also available at this Web site.

Table 1 summarizes the statistics of the SNPs genotyped in this study. As predicted, 5098 (>95%) of the SNPs genotyped were indeed common in the CEPH panel we studied, with MAF ≥10%. In fact, 4405 (83%) of the SNPs had a MAF ≥20% in the CEPH panel. The mean intermarker distance between the SNPs with MAF ≥20% was found to be 151, 106, 148, and 124 kb in chromosomes 5, 7, 17, and X, respectively. But the mean intermarker distance can be skewed by large gaps between sequencing contigs, the centromeric region, and regions in which very few SNPs were available at the time of this study. Fig. 4c shows that for the complete marker set 15% were closer than 5 kb, about 50% were closer than 25 kb, and more than 70% (3130 SNPs, MAF ≥20%) were within 100 kb of their nearest neighbor. When only these 3130 SNPs are considered, in fact the mean intermarker distance goes down to 26–28 kb on each of the four chromosomes. Thus for the majority of the SNPs analyzed, map density is much better than the overall average of ~100–150 kb.

Table 1
Characteristics of SNPs analyzed

Chromosome  Chromosome   SNPs       SNPs with  SNPs with  Mean intermarker
            length (Mb)  genotyped  MAF ≥10%   MAF ≥20%   distance (kb), MAF ≥20%
5           179.7        1442       1389       1188       151
7           156.4        1753       1659       1470       106
17          82.3         662        638        550        148
X           149.5        1487       1412       1197       125

Table 2
Sample sizes to detect disease association in a case–control study with 90% power at a significance level of 0.001, assuming a multiplicative model

A. Disease allele frequency = 0.5; marker allele frequency = 0.5

γ     λS     K     f11    N at |D′| = 0.5  N at |D′| = 0.6  N at |D′| = 0.7
1.5   1.042  0.05  0.072  1882             1306             958
1.5   1.042  0.1   0.144  1690             1172             860
1.5   1.042  0.2   0.288  1336             927              681
2     1.11   0.05  0.089  675              467              342
2     1.11   0.1   0.178  606              420              308
2     1.11   0.2   0.356  480              333              244
4     1.36   0.05  0.128  205              141              103
4     1.36   0.1   0.256  185              127              93
4     1.36   0.2   0.512  147              101              74

B. Disease allele frequency = 0.2; marker allele frequency = 0.5

γ     λS     K     f11    N at |D′| = 0.5  N at |D′| = 0.6  N at |D′| = 0.7  N at |D′| = 0.8
2     1.11   0.05  0.139  2712             1882             1382             1057
2     1.11   0.1   0.278  2434             1690             1240             949
2     1.11   0.2   0.556  1924             1336             981              751
4     1.57   0.05  0.313  532              368              270              205
4     1.57   0.1   0.625  478              331              242              185

γ, genotypic relative risk; λS, sibling relative risk; K, population prevalence; f11, penetrance for d1d1 genotype; N, number of cases, which equals the number of controls.

Power calculations

Our primary motivation for constructing maps of LD

blocks is to provide the most efficient framework for

association studies of common, complex diseases. The idea

is that within LD blocks, one or more “tag” SNPs in strong

LD with the other markers in the block can be chosen to

represent the block (“LD tags”), thereby reducing the

amount of genotyping needed for a genome-wide association

study. We propose to define LD blocks using a D′ threshold

that yields LD tags with good power for a case-control

disease association study, assuming that the strength

of LD between alleles of the disease gene and a genotyped

marker in the block is similar to the LD strength within the

block. We have tested this assumption using marker data in

place of the disease locus (see Using blocks to predict LD).

Given this assumption, a single SNP chosen from each LD

block can be used as an LD tag for a genome-wide

association screen. Using LD tags would not capture all

the haplotype diversity for the block, but would allow

detection of association in an initial screen while signifi-

cantly reducing genotyping time and costs; follow-up

analysis may then use a greater number of conventional

“haplotype tag” SNPs to characterize the most common

haplotypes for fine mapping. Because there are typically

numerous SNPs from which to choose, each LD tag may be

chosen to be as common as possible, and we have opted to

use LD tag markers with MAFs close to 0.5 for calculating

power. We have also made specific parameter choices for

the disease models to demonstrate this power-based

approach to LD map construction; these choices may be

varied depending upon the requirements of a particular

study. Details of the power calculations are provided under

Materials and methods.

Tables 2A and 2B give power results for a marker with minor allele frequency of pm1 = 0.5 and two choices of disease allele frequencies: pd1 = 0.5 and pd1 = 0.2. Sample sizes (N) for the number of cases and an equal number of controls required for 90% power at a significance level of 0.001 are reported as a function of D′, as calculated from the noncentrality parameter for the appropriate noncentral χ² distribution. We chose 90% power rather than the usual 80% because we would want to avoid false negative results in an initial screen. The results show that requiring the absolute value of D′ to be greater than 0.7 is a useful threshold for defining LD blocks. That is, if we use this threshold to define LD blocks and assume that |D′| between alleles of the disease gene and any typed marker from the block is also 0.7 or greater, then realistic sample sizes (approximately 1000 cases and 1000 controls) are able to detect disease association with 90% power at a significance level of 0.001 for disease models corresponding to very modest genetic effects of λS (sibling relative risk) = 1.04 and γ (genotypic relative risk) = 1.5, when pd1 = 0.5. Note that when pd1 = 0.2, corresponding to a more realistic disease allele frequency, power to detect association at the marker is reduced; however, for a threshold of |D′| = 0.7, a sample of approximately 1000 cases and 1000 controls still gives 90% power at a significance level of 0.001 for disease models corresponding to a slightly larger sibling relative risk of 1.11 and γ = 2.
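The sample-size reasoning above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' procedure (which is described under Materials and methods): it assumes Hardy-Weinberg proportions, a multiplicative model, positive LD between the disease and marker alleles, and a 1-df allele-count test with pooled variance, for which the noncentral χ² power reduces to a normal calculation. The function names are hypothetical. Under these assumptions the sketch gives values close to several Table 2 entries, for example roughly 958 cases for γ = 1.5, K = 0.05, pd1 = pm1 = 0.5, and |D′| = 0.7.

```python
from statistics import NormalDist


def marker_freqs(gamma, K, p_d, p_m, d_prime):
    """Marker-allele frequency in cases and in controls (assumed model)."""
    q_d, q_m = 1.0 - p_d, 1.0 - p_m
    # Disease-allele frequency among cases under multiplicative penetrances.
    p_case = p_d * gamma / (q_d + p_d * gamma)
    # Controls: population frequency is a K : (1 - K) mixture of cases and controls.
    p_ctrl = (p_d - K * p_case) / (1.0 - K)
    # Positive LD: D = D' * Dmax, with Dmax = min(p_d * q_m, q_d * p_m).
    D = d_prime * min(p_d * q_m, q_d * p_m)
    m_given_d1 = p_m + D / p_d   # P(marker allele | disease allele)
    m_given_d2 = p_m - D / q_d   # P(marker allele | normal allele)

    def mix(p):
        return p * m_given_d1 + (1.0 - p) * m_given_d2

    return mix(p_case), mix(p_ctrl)


def cases_needed(gamma, K, p_d, p_m, d_prime, alpha=0.001, power=0.90):
    """Approximate number of cases (= number of controls) for the given power."""
    m_case, m_ctrl = marker_freqs(gamma, K, p_d, p_m, d_prime)
    p_bar = (m_case + m_ctrl) / 2.0
    nd = NormalDist()
    z = nd.inv_cdf(1.0 - alpha / 2.0) + nd.inv_cdf(power)
    # Noncentrality with 2N alleles per group: N * delta^2 / (p_bar * q_bar).
    return z * z * p_bar * (1.0 - p_bar) / (m_case - m_ctrl) ** 2


if __name__ == "__main__":
    for dp in (0.5, 0.6, 0.7):
        print(dp, round(cases_needed(1.5, 0.05, 0.5, 0.5, dp)))
```

Because the required N scales with the inverse square of the case-control difference in marker allele frequency, and that difference is proportional to D′, lowering the |D′| threshold from 0.7 to 0.5 roughly doubles the sample size.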

Needless to say, for disease genes with stronger genetic

effects, smaller sample sizes are sufficient; alternatively,

larger sample sizes will yield comparable power even if

smaller thresholds for |D′| are chosen (cf. the table entries

for γ = 4). Additional power calculations indicate that with a

threshold of |D′| = 0.7, reasonable power to detect genetic

effects for a common disease is maintained for small

departures of the marker allele frequency from 0.5. For

example, consider a disease with prevalence of 0.2 and γ

ranging from 2 to 3. If pm1 = 0.4 and pd1 = 0.2, and

considering both possible phases of co-occurrence of the

alleles, the range of sample sizes runs from N = 229 to N =

1451. If the prevalence is lowered to 0.1, sample sizes are

still under 2000 cases and 2000 controls. Thus as long as

LD block tags can be chosen to have allele frequency close

to 0.5, reasonable power is maintained.

We therefore chose a |D′| threshold of 0.7 to define our

primary blocks, described below, to maximize the LD block

size while minimizing the sample size needed for a useful

range of disease models. We also report LD block results for

|D′| thresholds of 0.6, 0.8, 0.9, and 1.0.

Linkage disequilibrium blocks

We define a linkage disequilibrium block to be a consecutive set of at least three markers along with a threshold T and a disequilibrium coefficient L (D′ or r²) such that all pairs of markers in the block satisfy |L| ≥ T. By a weak linkage disequilibrium block we mean a set of at least three consecutive markers along with three parameters Tl, Tu, and F and a disequilibrium coefficient L such that all pairs of markers in the block satisfy |L| ≥ Tl (the lower threshold) and at least F% of the pairs of markers satisfy |L| ≥ Tu (the upper threshold). Guided by our power analyses, we will focus on LD blocks constructed using a threshold T = 0.7. Although our power calculations were based on D′, they may be reformulated to correspond to r² levels via the formula $r = D/\sqrt{p_1 p_2 q_1 q_2}$.
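As a rough illustration of the coefficients used in these definitions, the sketch below computes D, D′, and r² for two biallelic SNPs from phased haplotype counts (as would be available for the male X-chromosome data). The function name and the 2×2 count layout are assumptions made for this example, not the authors' software; p1, p2 and q1, q2 are taken to be the allele frequencies at the first and second marker, respectively.

```python
def ld_stats(n11, n12, n21, n22):
    """Return (D, D', r^2) from haplotype counts.

    n11: haplotypes carrying allele 1 at both SNPs
    n12: allele 1 at SNP A, allele 2 at SNP B
    n21: allele 2 at SNP A, allele 1 at SNP B
    n22: allele 2 at both SNPs
    """
    n = n11 + n12 + n21 + n22
    p1 = (n11 + n12) / n          # allele 1 frequency at SNP A
    q1 = (n11 + n21) / n          # allele 1 frequency at SNP B
    p2, q2 = 1.0 - p1, 1.0 - q1
    D = n11 / n - p1 * q1
    # Dmax depends on the sign of D.
    d_max = min(p1 * q2, p2 * q1) if D >= 0 else min(p1 * q1, p2 * q2)
    denom = p1 * p2 * q1 * q2
    if d_max == 0 or denom == 0:  # a monomorphic marker: LD is undefined
        return D, 0.0, 0.0
    return D, D / d_max, D * D / denom


if __name__ == "__main__":
    print(ld_stats(40, 10, 10, 40))
```

With both markers at allele frequency 0.5, Dmax is 0.25, so r² equals D′² in that special case; for unequal allele frequencies r² is smaller than D′², which is consistent with the r²-based blocks in Table 3B being fewer and shorter than the D′-based blocks.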

Tables 3A and 3B contain summary statistics for the

average and maximum number of the LD blocks as well

as the average and maximum size of the blocks on each

chromosome as found by our algorithm for the coefficients

D′ and r², respectively. To indicate the extent of these

blocks, Table 3 also gives the percentage of the total

number of SNP markers that fall within blocks, and also

statistics Rmax, Rhap, and “coverage,” defined as follows.

From the point of view of reducing genotyping, the best

case scenario would be when a single SNP can be chosen

to represent each LD block; note that for SNPs not lying

within any block, no “reduction” takes place and each

such SNP must still be included in the genotyping. The

original number of SNPs is then reduced by a percentage

we call Rmax. For our chromosome X data, if there exists

a subset of SNPs in a block that determines all possible

haplotypes observed, we may choose to use this subset to

represent the block. We then obtain a set of SNPs

consisting of these representatives for every block together

with all SNPs not in blocks, and the original number of

SNPs is then reduced by a percentage we call Rhap. An

example of an LD block with significant haplotype

reduction is given in Fig. 1, in which all the haplotypes

found in the males we genotyped are unambiguously

defined by only 3 SNPs in an LD block consisting of 13

SNPs. We define the coverage to be the percentage of the

chromosome covered by blocks while ignoring gaps

greater than 100 kb. This definition allows us to account

for variable spacing of markers and focus on the parts of

the chromosomes that are most densely covered by our

SNP map. See Fig. 2 for a graph of the coverage by D′

threshold. Additional details on coverage are provided

under Materials and methods.
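The block definition given earlier in this section (at least three consecutive markers with every pair at or above the threshold) can be illustrated with a simple greedy scan. This is a hypothetical sketch only: the authors' block-finding algorithm is described under Materials and methods and may differ, for example in how competing or overlapping blocks are resolved. Here `ld` is assumed to be a symmetric matrix of pairwise coefficients for markers in chromosomal order.

```python
def find_blocks(ld, threshold=0.7, min_size=3):
    """Greedy left-to-right scan for non-overlapping LD blocks.

    Returns a list of (first_index, last_index) pairs, both inclusive.
    """
    n = len(ld)
    blocks = []
    start = 0
    while start < n:
        end = start + 1  # exclusive end of the candidate block
        # Extend while the next marker is in LD with every marker already in the block.
        while end < n and all(abs(ld[i][end]) >= threshold for i in range(start, end)):
            end += 1
        if end - start >= min_size:
            blocks.append((start, end - 1))
            start = end      # continue after the block
        else:
            start += 1       # too short: try the next starting marker
    return blocks


if __name__ == "__main__":
    n = 5
    ld = [[0.1] * n for _ in range(n)]
    for i in range(3):
        for j in range(3):
            ld[i][j] = 0.9   # markers 0..2 are in strong pairwise LD
    print(find_blocks(ld))
```

A greedy scan of this kind is not guaranteed to find the largest possible blocks, since starting one marker later can sometimes yield a longer run; it is meant only to make the all-pairs criterion concrete.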

Chromosomal coverage

At the |D′| threshold of 0.7 the autosomes exhibit block

coverage of roughly 25%, while the coverage on the X

chromosome is substantially higher at 43%. Our definition

of coverage is in fact conservative compared to the

percentage of markers that occur within blocks, as the

latter measure is uniformly higher than our measure of

coverage (Table 3). While the percentage of markers in

blocks does not account for the variable spacing of our

SNP markers, it still provides a useful measure and reveals

that blocks are extensive across the chromosomes in these

data. With a |D′| threshold of 0.7, roughly 40% of

autosomal SNPs and 60% of chromosome X SNPs fall

within blocks. A graphical representation of the LD block

structure of all four chromosomes is given in Fig. 3. It is

clear from this figure that LD blocks are found across the

entire length of all four chromosomes and in some cases

can be extensive. The markers included in each LD block

are in Supplemental Table S5.

The reduction (Rmax) in the total number of SNPs,

obtained by taking one tag SNP per block while still

including every SNP that does not lie in a block, is

approximately 29% on the autosomes and 44% on X.

Hence, in theory, we can genotype just 71% of our markers

and still retain a significant amount of power in a genome-

wide association screen. On the X chromosome, only 53%

of the SNPs have to be genotyped in a genome-wide

association study instead of 73% of the SNPs needed to

define all the common haplotypes (e.g., for a threshold of

0.7, Rmax = 47% versus Rhap = 27%).
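The arithmetic behind Rmax can be written out explicitly. The sketch below assumes non-overlapping blocks, so that the retained set is one SNP per block plus every SNP outside blocks; the function name is hypothetical. As an example it uses the chromosome X figures at a |D′| threshold of 0.7 from Tables 1 and 3: 1197 SNPs with MAF ≥20%, 145 blocks, and 56.47% of markers within blocks (about 676 SNPs).

```python
def r_max(total_snps, snps_in_blocks, n_blocks):
    """Percentage reduction when each block is represented by a single SNP.

    Assumes blocks do not overlap, so SNPs outside blocks plus one tag per
    block make up the retained set.
    """
    kept = (total_snps - snps_in_blocks) + n_blocks
    return 100.0 * (total_snps - kept) / total_snps


if __name__ == "__main__":
    in_blocks = round(0.5647 * 1197)      # about 676 SNPs fall within blocks
    print(round(r_max(1197, in_blocks, 145), 1))
```

Under this assumption the reduction is simply (SNPs in blocks minus number of blocks) divided by the total, which reproduces the tabulated chromosome X value of about 44% at this threshold.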

Overall, the coverage results were similar for the

autosomes while the coverage on chromosome X was

nearly twice as much. Indeed, even when we require that

|D′| = 1 the coverage on X is still 22%, roughly three times

that of the autosomes (Fig. 2). These observations confirm the hypothesis that linkage disequilibrium on X is more extensive than that on the autosomes.

Table 3

Threshold  Number of  Mean size  Max size  Mean width  Max width  % of markers   Rmax  Rhap  Coverage
           blocks     (SNPs)     (SNPs)    (kb)        (kb)       within blocks  (%)   (%)   (%)

A. LD blocks using D′

Chromosome X
0.6        148        4.9        14        129.3       2410       60.65          48.3  21.6  47.4
0.7        145        4.7        13        104.7       987        56.47          44.4  20.5  43.2
0.8        137        4.5        13        91.2        987        51.13          39.7  19.9  36.5
0.9        124        4.3        13        76.3        689        44.11          33.8  18.2  28.4
1          107        4.1        12        67          689        36.51          27.6  16    21.7

Chromosome 5
0.6        111        4.3        15        92.6        536        40.49          31.1  -     28.4
0.7        106        4.3        13        82.3        536        38.13          29.2  -     25.4
0.8        101        4.1        11        69          536        34.51          26    -     22.1
0.9        87         3.8        11        47          323        28.11          20.8  -     15.8
1          64         3.5        7         26.8        171        18.69          13.3  -     6.5

Chromosome 7
0.6        158        4.1        10        79.3        636        43.61          32.9  -     27.1
0.7        149        4          11        74.1        471        40.48          30.3  -     23.4
0.8        138        3.9        10        63.5        446        36.67          27.3  -     19.9
0.9        115        3.8        11        51.7        446        30.07          22.2  -     14.9
1          76         3.4        10        40.3        446        17.76          12.6  -     8.3

Chromosome 17
0.6        58         3.9        6         71.8        426        40.91          30.4  -     29.9
0.7        53         3.8        6         90.8        963        36.55          26.9  -     26.9
0.8        45         3.6        6         52.6        339        29.45          21.3  -     19.1
0.9        39         3.5        6         47.7        339        24.73          17.6  -     15.8
1          25         3.2        4         18          74         14.36          9.8   -     5

B. LD blocks using r²

Chromosome X
0.6        98         4.2        13        73.3        700        34.5           26.3  16    21.4
0.7        91         4.1        13        72.4        689        31.08          23.5  14.9  19.3
0.8        71         4.2        11        76.5        689        24.64          18.7  13.2  15
0.9        61         3.9        9         56.3        527        19.8           14.7  11.8  10.6
1          39         3.5        8         36.2        169        11.53          8.3   8.3   5.4

Chromosome 5
0.6        61         3.9        10        59.9        525        19.78          14.7  -     11.7
0.7        52         3.6        10        46.4        366        15.91          11.5  -     8.6
0.8        39         3.6        8         42.8        366        11.7           8.4   -     5.7
0.9        28         3.2        4         28.8        222        7.58           5.2   -     3.2
1          7          3.1        4         23.1        67         1.85           1.3   -     0.8

Chromosome 7
0.6        76         3.5        8         49.6        357        17.96          12.8  -     8.6
0.7        63         3.3        6         35.1        189        14.35          10.1  -     6.7
0.8        46         3.3        6         29          189        10.41          7.3   -     3.8
0.9        33         3.2        6         17          62         7.07           4.8   -     1.8
1          8          3.2        4         7.1         20         1.7            1.2   -     0.2

Chromosome 17
0.6        28         3.2        4         38.1        163        16.18          11.1  -     10.7
0.7        23         3.2        4         39.3        163        13.45          9.3   -     8.9
0.8        15         3.2        4         24.7        148        8.73           6     -     3
0.9        11         3.2        4         24.8        148        6.55           4.6   -     1.9
1          5          3.2        4         8.8         21         2.91           2     -     0.4

Rmax is the reduction, expressed as a percentage, in the total number of markers if one SNP from each block is selected and all remaining SNPs not in blocks are also selected. Rhap is the reduction if SNPs are selected on the basis of complete haplotype determination within blocks. The coverage is the percentage of the chromosome covered by blocks if gaps larger than 100 kb are ignored.

Fig. 1. An LD block of 13 markers spanning 100 kb on chromosome X with strong haplotype reduction. The upper table contains the values of |D′| for the marker pairs while the lower table lists the haplotypes and their frequencies. Markers 1 and 7 form a complete subset for the common haplotypes (frequency greater than 5%), while markers 1, 6, and 13 form a complete subset for all the haplotypes observed for this set of SNPs.
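The idea of a subset of SNPs that determines all observed haplotypes in a block, which underlies Rhap and the Fig. 1 example, can be illustrated with a brute-force search. This is a sketch under assumptions: haplotypes are given as equal-length strings of allele codes (as for the phased male X-chromosome data), and an exhaustive search over column subsets is used, which is practical only for small blocks. It is not the authors' selection procedure.

```python
from itertools import combinations


def smallest_determining_subset(haplotypes):
    """Smallest set of SNP positions that distinguishes all observed haplotypes.

    `haplotypes` is a list of equal-length strings, one character per SNP.
    Returns a tuple of 0-based SNP indices.
    """
    distinct = sorted(set(haplotypes))
    n_snps = len(distinct[0])
    for k in range(1, n_snps + 1):
        for cols in combinations(range(n_snps), k):
            projected = {tuple(h[c] for c in cols) for h in distinct}
            if len(projected) == len(distinct):
                return cols           # every haplotype is still unique
    return tuple(range(n_snps))


if __name__ == "__main__":
    observed = ["0000", "0011", "1100", "1111", "0000"]
    print(smallest_determining_subset(observed))
```

In this toy input SNPs 0 and 1 always agree, as do SNPs 2 and 3, so one SNP from each pair is enough; restricting `observed` to haplotypes above a frequency cutoff would give the analogous subset for common haplotypes only.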

Fig. 2. Coverage as a function of |D′| threshold.

Weak LD blocks

To investigate the structure of weak LD blocks we applied our algorithm with parameters that were quite relaxed: the lower cutoff was set at 0 and the percentage F was set at 80%. The requirement for a block is therefore that 80% of the marker pairs must be in LD according to the upper cutoff. The results are tabulated in Supplemental Table S6. Even though the weak LD blocks allow a fairly large number of “LD gaps” in the blocks, this is not accompanied by a large increase in the coverage. At the D′ threshold of 0.7, when these gaps are allowed, the coverage increases by about 9% on X and 7–14% on the autosomes. Note that the “strict” definition may give greater assurance that a single SNP from the block would be an appropriate representative of the block for a genome-wide disease association analysis; we expected that the trade-off would be reduced block coverage of the chromosome compared to the less restrictive definitions. However, we observed that coverage for LD blocks was in fact comparable to that for weak blocks.
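The weak-block criterion (all pairs at or above a lower threshold Tl, and at least F% of pairs at or above an upper threshold Tu) can be expressed as a simple predicate. The sketch below is illustrative only; the parameter names are hypothetical, `ld` is assumed to be a symmetric matrix of pairwise coefficients, and the defaults mirror the relaxed settings described above (lower cutoff 0, F = 80%) with an upper cutoff of 0.7.

```python
from itertools import combinations


def is_weak_block(ld, markers, lower=0.0, upper=0.7, frac=0.8, min_size=3):
    """Check the weak LD block criterion for a run of consecutive marker indices."""
    if len(markers) < min_size:
        return False
    values = [abs(ld[i][j]) for i, j in combinations(markers, 2)]
    if any(v < lower for v in values):
        return False                      # some pair falls below the lower threshold
    strong = sum(v >= upper for v in values)
    return strong >= frac * len(values)   # at least F% of pairs reach the upper threshold


if __name__ == "__main__":
    n = 5
    ld = [[0.9] * n for _ in range(n)]
    for i in (0, 1):                      # two of the ten pairs form "LD gaps"
        ld[i][4] = ld[4][i] = 0.1
    print(is_weak_block(ld, list(range(n))))
```

Setting `lower` equal to `upper` recovers the strict definition, in which no LD gaps are tolerated.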

Fig. 3. A blue diamond indicates the location and size of each LD block. The scale of the size of the LD block is located on the left y axis. The location on the chromosome in kilobases is on the x axis. At the bottom of each graph represented by a black square is the location of each marker with a minor allele frequency >20%. A red circle indicates the number of markers in each LD block with the scale on the right y axis. For the markers included in each LD block and their locations on the chromosome please refer to Table S5 in the supplemental materials.

Using blocks to predict LD

As detailed under Materials and methods, we formulated a test to determine if an LD block actually predicts whether additional “untyped” markers that are physically in the block are also in LD with that block or whether a newly inserted marker will break up the block. This test provides important support for the strategy of representing blocks with single SNPs, as it gives insight into whether it is reasonable to expect an untyped polymorphism in a block to have the expected strength of LD with the rest of the block, so that power to detect association with it is maintained. The results of our test are summarized in Table 4. We see that indeed, LD blocks are excellent predictors of the strength of LD with other SNPs in our data. For example, on chromosome 5, if a collection of markers A1, A2, ..., An is an LD block with |D′| ≥ 0.8 (the first threshold T1), then any other marker B that is physically in this block will be, with 97.1% certainty, in LD with A1, A2, ..., An with |D′| ≥ 0.7 (the second threshold T2).

Table 4
Results of test to determine if LD blocks predict LD

T1    T2    p (%), by chromosome
            X      5      7      17
0.7   0.7   96.1   94.1   94.3   92.8
0.8   0.7   98.2   97.1   96.1   91.9
0.8   0.8   96.7   95.5   94.1   86.2

T1, the |D′| threshold for all pairs in the computed block of SNPs, A1, ..., An; T2, the |D′| threshold required for pairings with the additional SNP, B; p, the empirical conditional probability that each pairing of B with Ai satisfies |D′| ≥ T2.
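One way to estimate an empirical conditional probability of the kind reported in Table 4 is a leave-one-out scan over the typed markers. The sketch below is a guess at the general shape of such a test, not the procedure from Materials and methods: it assumes a fixed window of consecutive markers, holds out each interior marker B in turn, and counts how often B reaches T2 with every remaining marker given that the remaining markers satisfy T1 among themselves. The window size and function name are hypothetical.

```python
from itertools import combinations


def block_prediction_rate(ld, t1=0.8, t2=0.7, window=4):
    """Fraction of held-out interior markers in LD (>= t2) with a block defined at t1.

    Returns None when no window qualifies as a block.
    """
    n = len(ld)
    trials = hits = 0
    for start in range(n - window + 1):
        idx = list(range(start, start + window))
        for b in idx[1:-1]:                       # hold out each interior marker
            a = [i for i in idx if i != b]
            if all(abs(ld[i][j]) >= t1 for i, j in combinations(a, 2)):
                trials += 1                       # A1..An form a block at T1
                if all(abs(ld[b][i]) >= t2 for i in a):
                    hits += 1                     # B is in LD with every Ai at T2
    return hits / trials if trials else None


if __name__ == "__main__":
    ld = [[0.9] * 4 for _ in range(4)]
    print(block_prediction_rate(ld))
```

Because the held-out marker lies physically between typed block markers, it plays the role of the untyped polymorphism in the argument above.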

Discussion

The results of our study show that regions of strong LD

are commonly found on both autosomes and the X

chromosome with the X more extensively covered by

longer LD blocks. In fact, blocks of strong LD are easily

found with high-level SNP maps on chromosomes 5, 7,

17, and X consisting of common SNPs at ~100 kb density.

Power calculations predict that these LD blocks can be

represented by LD tag SNPs in genome-wide association

studies to identify alleles associated with common diseases

within LD blocks. As the number of common genotyped

SNPs continues to increase, we will be able to cover the

genome with LD blocks. Our results further predict that

from an overall map of 30,000 to 50,000 SNPs with minor

allele frequencies of ≥20% in each of the major

populations in the world, a subset of LD tag SNPs will

be useful in the initial genome-wide scanning for common

disease alleles.

The common SNPs selected for our study were identified

prior to the completion of the human genome sequence and

the launching of the International HapMap Project. Large

gaps therefore were found between contigs and the coverage

of common SNPs was incomplete. Despite these limitations,

LD blocks, defined by the D′ statistic, covered about 25% of

the autosomes and 40% of chromosome X where there was

sufficient SNP density. On average the blocks spanned

about 80 kb on the autosomes and 100 kb on X.

We designed an algorithm that reveals a block structure

whose usefulness is measured both by sharp reduction of the

number of SNPs needed in a genetic association study and

by the extent to which the LD blocks cover a chromosome.

In addition, our block definitions have the advantage of

being linked directly to the goal of using LD blocks to aid

disease association studies. As the motivation for our block

definitions differs from other important motivations such as

representation of haplotypes or study of possible evolu-

tionary models, the resulting block characteristics may differ

from those of other studies without detracting from our

conclusions. Our use of power calculations to guide block

definitions directly does rely on the assumption that the

strength of LD among the markers in the block is similar to

that of the LD that would be observed between those

markers and a disease gene if the disease gene occurred in

the block, but was untyped. We tested this assumption

directly within our data using genotyped markers in place of

the disease gene and found that for |D′| thresholds of 0.7 to

0.8, if an additional marker physically within an existing LD

block was introduced, it would satisfy the same |D′|

threshold with every block marker in the great majority of

cases. Note also that when this assumption is satisfied,

selecting a single LD tag SNP to represent the block, rather

than the more conventional approach of representing all

common haplotypes with a group of tag SNPs, may be

sufficient to detect a disease association in an initial screen,

provided appropriate parameters are chosen for the power

calculations. Further investigations that will use dense

marker data to help determine the most appropriate

strategies for SNP reduction are under way.

The above findings about the predictive ability of these

LD blocks are also consistent with our observations about

weak LD blocks. LD structure is complex and it is well

known that runs of SNPs in LD can be interrupted by SNPs

not in LD with the others [12,13]. To test this effect, we also

studied weak LD blocks, which allowed LD gaps within

blocks. Interestingly, we found that although the weak

definition would be expected to allow for larger blocks

encompassing more of the chromosome, in fact chromoso-

mal coverage increased only minimally.

The power analyses reported here considered only certain

disease models and thus should be viewed as only a guide to

selecting LD thresholds for block definition. We chose

marker allele frequencies of 0.5 for simplicity to illustrate

our approach; however, it is important to note that power is

affected by the frequency of the marker allele that tends to

co-occur with the disease allele as others have pointed out as

well [15,16]. Our overall approach provides a template

strategy to define LD blocks and prioritize SNPs for a given

disease association study; appropriate disease models and

the desired level of power may be chosen depending on the

particular study. Our report has focused on LD blocks for D′

thresholds of 0.7 and above since these thresholds correspond

to power to detect common disease genes of modest

effect for the selected models. However, for disease genes of

stronger effects, smaller sample sizes or reduced |D′|

thresholds (and therefore LD blocks having greater chromosomal

coverage) can yield sufficient power (Table 2).

Thus a given association genome screen can be designed

according to whether a project wishes to constrain recruiting

costs or genotyping costs.

Our hierarchical association mapping strategy is geared

toward detection of common disease alleles. The debate

about the "common disease, common gene" hypothesis is

important and may be resolved by further study and

successful identification of such genes. Larger sample sizes,

stronger LD thresholds, or alternative strategies may be

necessary to allow the approach proposed here to move

beyond detecting common susceptibility alleles. Never-

theless, for detecting common alleles, a hierarchical

approach using an initial reduced-density SNP map offers

an efficient and economical alternative to large-scale

genotype studies.

In the initial phase of genetic association studies, it is

important to identify regions of the human genome

associated with a disease quickly and cost effectively.

Equally important is the ability to exclude large regions of

the genome not associated with the disease alleles. If N is

the number of individuals, M the number of markers being

genotyped, Cg the cost of genotyping a single marker for a

single individual, and CT the total cost, then CT = NMCg. N

is determined by our power calculations and M is

determined by the block structure for a given D′ cutoff.

Although our maps are not populated with enough SNPs to

determine M accurately, efforts such as the International

HapMap Project [1] are providing genotyping data for an

extremely dense set of SNPs across the genome. An

important next step for our approach will be to apply it to

these data to construct high-level, reduced-density associa-

tion screen maps, which can be compared to the full-density

data. Our present results strongly suggest that for the initial

genome-wide association studies of common disease,

screening maps of common SNPs at a modest density can

maintain power to detect disease association while reducing

genotyping costs compared to the approach of constructing

haplotype maps to define the common haplotypes.
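
The cost relation CT = NMCg given above can be illustrated with a short sketch. The sample size, marker counts, and per-genotype cost below are hypothetical values chosen only to show how reducing M scales the total; they are not figures from this study.

```python
def total_genotyping_cost(n_individuals, n_markers, cost_per_genotype):
    """Total cost C_T = N * M * C_g for a single-stage screen."""
    return n_individuals * n_markers * cost_per_genotype


# Hypothetical inputs: 1000 individuals, $0.10 per genotype.
full_density = total_genotyping_cost(1000, 300_000, 0.10)
screening_map = total_genotyping_cost(1000, 30_000, 0.10)
print(f"full density: ${full_density:,.0f}; screening map: ${screening_map:,.0f}")
```

Because the relation is linear in both N and M, a tenfold reduction in markers can be traded against a tenfold increase in sample size at the same total cost.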

Materials and methods

SNP markers

All SNPs used in this study had known frequencies,

mapped to a unique location according to the dbSNP map

(build 112), and met our criteria for primer design

(discussed below).

All SNP markers were selected from public databases or

selected and characterized for allele frequencies by our

group. They were then submitted to the SNP Consortium

(TSC) and dbSNP databases (dbSNP, http://www.ncbi.nlm.

nih.gov/SNP/; TSC, http://snp.cshl.org/).

SNPs with allele frequencies from 10 to 90% in all three

population samples of 42 individuals each (European

American, African American, and Asian American; "the TSC set") were selected for genotyping. We also included a

small number of SNPs in which the allele frequencies were

between 10 and 90% in two of the three samples but the

information was not available for the third sample. In a

small number of cases from dbSNP, if the frequencies from

an equivalent set of samples met the criteria the SNP was

included.

We tracked the progress of this field by first discovering

SNPs, then characterizing SNPs, and finally taking advant-

age of the wealth of characterized SNPs available in the

public databases. We used a top-down hierarchical approach

to covering the chromosome with SNPs. SNPs were picked

across the chromosome at specific intervals without regard

for gene densities or gene locations. In the first pick, we

characterized the allele frequency of one SNP every 25 kb across

the entire human genome with the SNPs and maps available

at that time. The SNPs with the desired frequencies were

then funneled into the genotyping project for chromosomes

X and 7. A second pick for characterization was done for

both chromosomes X and 7 at 10 kb, and chromosome X

went through a third round of characterizing SNPs at 5-kb

intervals. We also picked SNPs for chromosomes X and 7

from public databases when more characterized SNPs

became available. This study began with chromosomes X

and 7 and then later expanded to 20% of the genome with

the addition of chromosomes 5 and 17. Chromosome 5 and

17 SNPs were picked entirely from public databases that

included our own submissions. A list of the SNPs used in

this study is in supplemental Tables S1a–S1d.

SNP marker quality

All 5344 SNP assays used in this study (supplemental

Tables S1a-S1d) had a sample genotyping success rate

greater than 80%. Chromosome 7 had the best chromosome

coverage with 1753 SNPs for an average intermarker

distance of 89 kb and chromosome 5 had the least coverage

with 1442 markers and an average consecutive intermarker

distance of 125 kb. A breakdown of the number of SNPs at

10 and 20% MAF and 20% with intermarker distances <100

kb is shown in Table 1.

Fig. 4A shows the distribution of the MAF and Fig. 4B

shows the observed heterozygosity for all the SNP markers

used in this study. SNPs for this study were picked to have

a MAF of 10–50%. Although individual SNPs could have

different genotype frequencies from the predicted frequencies,

>95% of the SNP assays had a MAF >10% after

genotyping and >80% of the SNPs had a MAF >20% after

genotyping (Table 1). The estimate of heterozygosity

provided by dbSNP was also a good predictor of the

SNP frequency and was on average within 5% for each

bin. Between 5 and 6% of all SNPs genotyped had

heterozygosities less than 0.2 with the remaining 94–95%

being >0.2.

The distance from one SNP marker to the next is shown

in Fig. 4C. Approximately 50% of the markers are closer

than 25 kb and 70–80% of the markers are closer than 100

kb. The inset bar graph shows the 0–25,000 bp range. All

distances used were from dbSNP, build 112.

Fig. 4. (A) The distribution of the minor allele frequency in the complete SNP marker set. (B) The distribution of the observed heterozygosity in the complete

SNP marker set. (C) The distance in base pairs (bp) from one SNP marker to the next SNP marker on the chromosome.

DNA samples for genotyping

The DNA samples used in this study were genetically

unrelated individuals from the Utah and French CEPH

families (supplemental Tables S2a and S2b). We used as

many of the families as possible. Most families have two

individuals, a few have one, and one family has three

individuals (a father and two grandfathers for X panel). The

samples were primarily grandparents with a few parents. We

used half males and half females for the autosome panel

(chromosomes 5, 7, and 17). For the X chromosome we

used a panel of 94 males so complete haplotypes could be

determined.

Genotyping using FP-TDI

FP-TDI is a homogeneous assay for SNP genotyping

based on fluorescence polarization (FP) detection. When

a fluorescent dye molecule is excited by plane-polarized

light, the fluorescence emitted by the molecule is also

polarized. The degree of polarization depends on the size

of the molecule. A small molecule tumbles and rotates in

solution so fast it causes depolarization of the fluores-

cence. A larger molecule has less molecular motion and

the fluorescence is polarized when it reaches the detector.

For a detailed description of the method see Refs.

[14,17–21].

The FP-TDI assay requires three unlabeled oligonu-

cleotides for each SNP. Two serve as PCR primers and

the third is a SNP probe that is complementary to the

template sequence with its 3′ end annealed to the target

one base before the polymorphic site. The entire reaction

is done in one reaction tube without separation or

purification.

Because of this simplicity the FP-TDI assay is

extremely flexible and can be adapted to any number of

assays and to all levels of throughput. To meet the needs

of the project outlined here we increased our throughput to

a level that was capable of producing 4 million genotypes/

year. This was accomplished by creating a modular

approach to genotyping. Each equipment portion of the

module consisted of a Perkin-Elmer (Boston, MA, USA)

Evolution P3 liquid handling machine with a "96-well

head," 24 384-well thermocycler blocks, one Perkin-Elmer

fluorescence polarization 384-well plate reader, and two

low-speed centrifuges with deep well buckets. With

efficient use of time and equipment up to two runs of

24 × 384 samples (18,432 genotypes) can be accomplished

in a 10-h day with three people working staggered

schedules.

Genotyping procedure

All pipetting steps were done by hand with a Rainin

electronic multichannel pipettor, the Perkin-Elmer Apricot

Personal Pipettor, the Robbins Hydra, or preferably with

the Perkin-Elmer Evolution P3 depending on the stage of

the project.

All PCRs and SNP detection reactions were designed

for standard melting temperature (Tm) and used standard

cycling conditions. The two PCR primers and the one SNP

detection primer were ordered arrayed in a 96-well tray at

a standard concentration for ease of handling. Primers were

ordered as 6 nmol lyophilized in a 96-well plate and then

150 µl of water was added to each well to make a 40 µM stock solution.

We also prepared in advance the DNA template. To make

dried DNA plates, 3 µl (0.8 ng/µl) of DNA from a 96-well

DNA plate was dispensed to all four quadrants of 384-well

black plates with the Perkin–Elmer Evolution P3 using a 96-

tip head. The plates were spun down and air-dried at room

temperature overnight. The plates were stacked and wrap-

ped with plastic film and kept desiccated at room tempe-

rature until needed. One allele combination per day was

genotyped whenever possible.

PCR

Volumes are for one reaction. Samples contained

genomic DNA dried previously, 0.5 µl 10× PCR buffer

(200 mM Tris-HCl (pH 8.4) and 500 mM KCl), 0.25 µl

MgCl2 (50 mM), 0.1 µl dNTPs (2.5 mM), 0.02 µl any Hot

Start Taq (5 U/µl), 2.13 µl water, final volume 3.0 µl. Then

3 µl of primers (40 µM) was added.
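
As a bookkeeping aid, the per-reaction volumes listed above can be scaled to a plate-sized master mix. The sketch below is illustrative only: the 10% pipetting overage is an assumption that is not part of the protocol, and the function name is hypothetical.

```python
# Per-reaction PCR mix volumes in microliters, as listed in the protocol.
PCR_MIX_UL = {
    "10x PCR buffer": 0.5,
    "MgCl2 (50 mM)": 0.25,
    "dNTPs (2.5 mM)": 0.1,
    "hot-start Taq (5 U/ul)": 0.02,
    "water": 2.13,
}


def scale_mix(per_reaction_ul, n_reactions, overage=0.10):
    """Scale each component to n_reactions, adding a fractional overage.

    The overage (default 10%) is an assumed allowance for pipetting loss.
    """
    factor = n_reactions * (1 + overage)
    return {name: round(vol * factor, 2) for name, vol in per_reaction_ul.items()}


# The components should add up to the stated 3.0 ul final volume.
assert abs(sum(PCR_MIX_UL.values()) - 3.0) < 1e-9

for name, volume in scale_mix(PCR_MIX_UL, n_reactions=384).items():
    print(f"{name}: {volume} ul")
```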

PCR thermocycling program

The cycling program was 1 cycle of 95°C for 2 min; 35

cycles of 92°C for 10 s, 58°C for 20 s, 68°C for 30 s; and 1

cycle of 68°C for 10 min. The temperature of the

thermocycler was reduced to 4°C until samples were

removed for the next step.

SNP detection reactions

For the remaining steps we used the AcycloPrime-FP

SNP Detection System Kit (Perkin–Elmer Life Sciences).

EXO-SAP reaction to remove excess primers and

unincorporated dNTPs

The 10× PCR clean-up reagent was diluted with PCR

clean-up buffer provided by the kit. Of this, 2 µl was added

to each well, and the plates were spun down.

EXO-SAP cycling program

The cycling program was 37°C for 60 min, 80°C for 15

min, and the temperature of the thermocycler was reduced to

4°C until samples were removed for the next step.

TDI reaction mix for SNP detection

For one reaction, the volumes were 2.0 µl 10× buffer,

0.05 µl AcycloPol, 1.0 µl terminator mix, and 4.95 µl water,

for a total volume of 8.00 µl.

Pipetting steps for TDI (SNP detection)

Five microliters of 1 µM SNP primers was dispensed into

the relevant quadrants of a 384-well plate, and the plate was

spun down. Eight microliters of TDI mix was dispensed into

the plate, and again the plate was spun down.

TDI cycling program

The cycling program was 1 cycle of 95°C for 2 min and

10–40 cycles of 95°C for 15 s, 55°C for 30 s (20 cycles are

standard). The temperature of the thermocycler was reduced

to 4°C until samples were removed for the next step.

Reading plates

To read the plates, they were spun down and stacked in

the Perkin-Elmer Victor or Perkin-Elmer EnVision stacker.

Plates were read for both dyes. Analysis of the data was

accomplished with the Excel Macro software provided by

the manufacturer. Data were then saved as a tab-delimited

text document that could be imported into the linkage

disequilibrium analysis software. Our lab currently uses the

SNPScorer software, which outputs results directly into a

database.

Primer design for FP-TDI

To achieve the high throughput needed for this project

all primers were designed with a strict criterion for Tm

and all reactions were run under a single set of

parameters.

PCR primer design

PCR primers were designed using Primer3, release 0.9

(with code available at http://www.genome.wi.mit.edu/genome_software/other/primer3.html; Ref. [22]) using the

parameters previously described [23]. Minor changes for

this application follow: TARGET = SNP_Position-20 bases,

20 bases; PRIMER_OPT_TM = 55, PRIMER_MAX_TM = 56,

PRIMER_MIN_TM = 54; PRIMER_PRODUCT_SIZE_RANGE = 80–400;

PRIMER_PRODUCT_OPT_SIZE = 250.

SNP primer design

The shortest primer with a Tm of 50–55°C and a minimum

length of 16 bp and a maximum length of 40 bp was picked

on both the forward and the reverse strand. The SNP primer

is complementary to the sequence and ends with the base

adjacent to the SNP site. The Allawi and Santa-Lucia nearest

neighbor sequence-dependent thermodynamic parameters

[24] were used to determine Tm.

If both a forward and a reverse SNP primer could be

chosen in the first step, one was selected by assigning

penalties to various types of repeated sequences. This

minimized failures due to repeated sequences near the SNP.

If the penalties were equal the shortest primer was chosen

because it was the least expensive. If there was no clear

preference the default was the forward primer. Sometimes

SNP primers could not be chosen because of the melting

temperature and length criteria. This could be due to extreme

GC content, a neighboring SNP within the potential primer

site, or a reported SNP that was not a single-base biallelic

polymorphism.

In our experience, FP-TDI assays can be developed for

all SNPs found in unique sequence. Because of the stringent

design of our PCR conditions, 95% of them are successful

[25] and 80% of the FP-TDI genotyping assays work the

first time. Sixty percent of the failed assays work when

repeated with the SNP primer from the other strand, which

gives a final assay success rate of >90%. Our genotyping

error rate is <0.02%.

We periodically develop FP-TDI genotyping assays for

all uniquely mapped SNPs found in public databases. These

assays and the expanded protocols for the methods outlined

above are available at http://snp.wustl.edu.

Linkage disequilibrium analysis

Pair-wise linkage disequilibrium was computed for all

possible two-way comparisons of SNPs, with significance

levels determined by the χ² statistic for the corresponding 2 × 2 table (1 degree of freedom). Haplotype frequencies for

autosomes were estimated using the E-M algorithm of

Excoffier and Slatkin [26] as implemented for pairs of

markers in the LDMAX program of the GOLD software

package [27]. For the X chromosome, haplotypes were

directly observable. SAS (SAS Institute, Cary, NC, USA)

together with LDMAX was used to compute χ² statistics and linkage disequilibrium measures $D$, $D'$, and $r^2$, defined as follows. For two loci $L_1$ and $L_2$, each with two alleles 1 and 2, let $p_i$ be the frequency of allele 1 and $q_i = 1 - p_i$ be the frequency of allele 2, at locus $i$ ($i = 1, 2$). Assume $p_i \le q_i$, that is, allele 1 is the minor allele. Let $p_{jk}$ be the frequency of the $jk$ haplotype. The coefficient of disequilibrium is $D = p_{11} - p_1 p_2$. The labeling of the alleles may affect the sign of $D$ but not its absolute value. The normalized disequilibrium coefficient is obtained by dividing $D$ by its maximum possible (absolute) value given $p_1$ and $p_2$: $D' = D/|D|_{\max}$, where $|D|_{\max} = \min(p_1 p_2, q_1 q_2)$ if $D < 0$ and $|D|_{\max} = \min(q_1 p_2, p_1 q_2)$ if $D > 0$. The correlation coefficient is $r = D/\sqrt{p_1 q_1 p_2 q_2}$ (if we define the random variable $X_i$ on individuals to be 1 if the individual has allele 1 at locus $i$ and 0 otherwise, then $D = \mathrm{cov}(X_1, X_2)$ and $\mathrm{var}(X_i) = p_i q_i$, so $r$ is the usual correlation coefficient of these two random variables). SNP heterozygosities were also computed. Heterozygosity for a biallelic marker with allele frequencies $p$ and $q = 1 - p$ is given by $H = 1 - p^2 - q^2 = 2p(1 - p)$.
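
The pairwise measures defined above can be sketched in a few lines. This is a minimal illustration, not the LDMAX or SAS implementation used in the study. It takes the haplotype frequency p11 as given rather than estimating it by E-M, and for D < 0 it uses the standard bound min(p1p2, q1q2) as the largest attainable |D|. The function names are hypothetical.

```python
from math import sqrt


def ld_measures(p11, p1, p2):
    """Return (D, D', r^2) for two biallelic loci.

    p11 is the frequency of the haplotype carrying allele 1 at both loci;
    p1 and p2 are the allele-1 frequencies. Both loci are assumed polymorphic.
    """
    q1, q2 = 1 - p1, 1 - p2
    d = p11 - p1 * p2
    # Largest attainable |D| given the allele frequencies and the sign of D.
    d_max = min(p1 * p2, q1 * q2) if d < 0 else min(p1 * q2, q1 * p2)
    d_prime = d / d_max if d_max > 0 else 0.0
    r = d / sqrt(p1 * q1 * p2 * q2)
    return d, d_prime, r * r


def heterozygosity(p):
    """Expected heterozygosity H = 1 - p^2 - q^2 = 2p(1 - p)."""
    return 2 * p * (1 - p)


# Example: every copy of allele 1 at locus 1 sits on an allele-1 haplotype at locus 2.
d, d_prime, r2 = ld_measures(p11=0.3, p1=0.3, p2=0.4)
print(f"D = {d:.3f}, D' = {d_prime:.3f}, r^2 = {r2:.3f}")
```

The example shows the familiar contrast between the two normalizations: complete LD in the D′ sense does not imply r² = 1 unless the allele frequencies at the two loci match.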

Power calculations and linkage disequilibrium blocks

We chose to define LD blocks using threshold values for

D′ that correspond to good power for a case-control disease

association study of realistic sample size. The assumption is

that the strength of LD between alleles of the disease gene

and a genotyped marker in the block is similar to the LD

strength within the block. For such a case-control study, the

Pearson χ² test statistic for a 2 × 2 contingency table may

be used to test for allelic association. To calculate the

expected power for particular values of D′, we assumed a

range of disease models specified by disease gene frequen-

cies at a biallelic disease gene and penetrances for the three

genotype classes. To aid interpretation, we also parame-

terized each model in terms of the population prevalence (K)

of the disease, the disease gene frequency, and two of the

three penetrances. We considered both multiplicative

models and more general models, but will focus on the

multiplicative models here. Hence we assume that if d1 is

the disease allele, the penetrances for the genotype classes

are $f_{d_1 d_2} = \gamma f_{d_2 d_2}$ and $f_{d_1 d_1} = \gamma^2 f_{d_2 d_2}$, where $\gamma$ is the

"genotypic relative risk." For each multiplicative model we

computed the corresponding sibling relative risk ($\lambda_S$) using

corrected formulas derived as in Risch and Merikangas [2].

We assumed marker allele frequencies to be 0.5 and

considered two cases for disease allele frequency pd1: pd1 =

0.5 (which would result in the maximum power for this

marker allele frequency) and pd1 = 0.2, corresponding to a

more realistic, common disease allele frequency. While

these are specific choices and of course do not represent the

full range of scenarios that may occur in an actual disease

association study, they provide a reasonable setting to guide

our LD block definitions. We propose this overall approach

as a general strategy for defining LD blocks for disease

association studies; the range of models and desired power

level may then be chosen depending on the specific study.

The power calculations rely on determining the expected

marker allele frequencies in cases versus controls given a

particular disease model (penetrances), disease and marker

allele frequencies, and D′ value (which determines haplotype

frequencies). Sham [28] derives the necessary formulas for

case and control allele frequencies in terms of haplotype

frequencies (chapter 4, section 4.6). We then used noncentral

χ² distributions to calculate sample sizes to detect association

between disease and marker in a χ² test, with specified

power and significance level. That is, for a given significance

level α, asymptotically, the χ² statistic follows a χ²

distribution with 1 df under the null hypothesis and follows

a noncentral χ² with 1 df under the alternative hypothesis.

The noncentrality parameter is given by the χ² statistic

computed with parameters taking on the values corresponding

to the alternative hypothesis and is proportional to the

sample size (see footnote 2). Weir [29] discusses further details of the use of

the noncentral χ² for power in an analogous context; the

Web-interface power calculator of Purcell et al. [30] also

uses the noncentral χ² in the same way as our calculations.
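
The power calculation described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code. It assumes random mating, positive LD between marker allele M1 and disease allele d1, a multiplicative model, and equal numbers of cases and controls. Case and control allele frequencies are obtained by direct enumeration of haplotype pairs instead of the closed-form expressions in Sham [28]. A noncentral χ² with 1 df is the square of a unit-variance normal with mean √λ, which lets the standard library's normal distribution stand in for a noncentral χ² routine. The baseline penetrance `f0` and all function names are hypothetical.

```python
from itertools import product
from math import sqrt
from statistics import NormalDist


def allele_freqs_case_control(p_m, p_d, d_prime, f0, grr):
    """Return (P(M1 | case), P(M1 | control)) under a multiplicative model.

    p_m, p_d: frequencies of marker allele M1 and disease allele d1.
    f0: penetrance with no copies of d1; grr: genotypic relative risk.
    """
    d = d_prime * min(p_m * (1 - p_d), (1 - p_m) * p_d)  # positive D assumed
    # Each haplotype: (carries M1?, copies of d1, frequency).
    haps = [
        (1, 1, p_m * p_d + d),
        (1, 0, p_m * (1 - p_d) - d),
        (0, 1, (1 - p_m) * p_d - d),
        (0, 0, (1 - p_m) * (1 - p_d) + d),
    ]
    prevalence = m1_case = m1_control = 0.0
    for (m_a, d_a, f_a), (_, d_b, f_b) in product(haps, haps):
        penetrance = f0 * grr ** (d_a + d_b)
        prevalence += f_a * f_b * penetrance
        m1_case += m_a * f_a * f_b * penetrance
        m1_control += m_a * f_a * f_b * (1 - penetrance)
    return m1_case / prevalence, m1_control / (1 - prevalence)


def noncentrality(n_cases, a, b):
    """Noncentrality for N cases and N controls; a, b = P(M1|case), P(M1|control)."""
    return 2 * n_cases * ((a - b) ** 2 / (a + b) + (a - b) ** 2 / (2 - a - b))


def power(n_cases, a, b, alpha=0.05):
    """Power of the 1-df allelic chi-square test at significance level alpha."""
    normal = NormalDist()
    z = normal.inv_cdf(1 - alpha / 2)
    shift = sqrt(noncentrality(n_cases, a, b))
    return 1 - normal.cdf(z - shift) + normal.cdf(-z - shift)


# Illustrative inputs only: marker frequency 0.5, disease allele 0.2, D' = 0.8.
a, b = allele_freqs_case_control(0.5, 0.2, 0.8, f0=0.05, grr=1.5)
for n in (250, 500, 1000):
    print(n, round(power(n, a, b), 3))
```

Because the noncentrality parameter is linear in N, the sample size needed for a target power follows by solving for the required noncentrality once and dividing by its per-subject value.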

After useful D′ threshold values were determined from

our power calculations, LD blocks were defined using all

SNPs having a minor allele frequency of 20% or greater.

Our primary LD blocks were defined by requiring all

marker pairs in a given block to have |D′| greater than or

equal to the chosen threshold. We also studied weak LD

blocks with less restrictive definitions described below.

Algorithms for finding linkage disequilibrium blocks

Our power calculations suggest the following strategy for

reducing the number of markers genotyped for an association

study. If |D′| ≥ T for a pair of SNPs then select only

one of them for genotyping, where T is a threshold of LD

strength to be determined from the power calculations. If the

same level of LD existed between a given SNP and several

other SNPs, then one might select that single SNP to

represent the entire group. However, by requiring the more

stringent criterion that all pairs of markers within the group

satisfy the LD threshold T, we allow ourselves more choices

for tag SNP selection and also increase our confidence that a

chosen tag SNP is likely to be a good representative, not

only for other markers in the block but also for potentially

untyped markers in the region of the block.

Footnote 2: The noncentrality parameter $\lambda_{nc}$ for the noncentral χ² distribution corresponding to the alternative hypothesis, computed from the appropriate 2 × 2 table, is

$$\lambda_{nc} = 2N\left[\frac{\left(P(M_1 \mid \text{case}) - P(M_1 \mid \text{control})\right)^2}{P(M_1 \mid \text{case}) + P(M_1 \mid \text{control})} + \frac{\left(P(M_2 \mid \text{case}) - P(M_2 \mid \text{control})\right)^2}{P(M_2 \mid \text{case}) + P(M_2 \mid \text{control})}\right]$$

where $N$ is the number of cases, which equals the number of controls; $P(M_i \mid \text{case})$ is the allele frequency of $M_i$ in cases, and $P(M_i \mid \text{control})$ is the allele frequency of $M_i$ in controls.

We therefore define a linkage disequilibrium block to be

a consecutive (relative to the map in use) set of at least three

markers along with a cutoff threshold T and a disequilibrium

coefficient L (D′ or r²) such that all pairs of markers in the

block satisfy |L| ≥ T. Furthermore, in the interest of relaxing

the threshold requirement potentially to increase chromoso-

mal coverage and reduce the number of tag SNPs required

for a genome screen, we also examine weak LD blocks. By

a weak linkage disequilibrium block we mean a set of at

least three consecutive markers along with three parameters

Tl, Tu, and F and a disequilibrium coefficient L such that all

pairs of markers in the block satisfy |L| ≥ Tl (the lower

threshold), and at least F% of the pairs of markers satisfy

|L| ≥ Tu (the upper threshold).
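
The two block definitions can be written as simple predicates over a matrix of pairwise LD values. The sketch below is a direct reading of the definitions, with hypothetical function names; `ld[a][b]` is assumed to hold the chosen coefficient (D′ or r²) for markers a and b, and `markers` is assumed to be a run of consecutive marker indices.

```python
from itertools import combinations


def is_block(ld, markers, threshold, min_markers=3):
    """True if every pair of markers satisfies |L| >= threshold."""
    return len(markers) >= min_markers and all(
        abs(ld[a][b]) >= threshold for a, b in combinations(markers, 2)
    )


def is_weak_block(ld, markers, t_lower, t_upper, f_percent, min_markers=3):
    """True if all pairs reach t_lower and at least f_percent% reach t_upper."""
    if len(markers) < min_markers:
        return False
    values = [abs(ld[a][b]) for a, b in combinations(markers, 2)]
    if min(values) < t_lower:
        return False
    share = 100 * sum(v >= t_upper for v in values) / len(values)
    return share >= f_percent


# Toy matrix: markers 0-2 are in strong LD; marker 3 is only weakly linked.
ld = [
    [1.0, 0.9, 0.8, 0.3],
    [0.9, 1.0, 0.75, 0.3],
    [0.8, 0.75, 1.0, 0.3],
    [0.3, 0.3, 0.3, 1.0],
]
print(is_block(ld, [0, 1, 2], 0.7), is_block(ld, [0, 1, 2, 3], 0.7))
print(is_weak_block(ld, [0, 1, 2, 3], t_lower=0.2, t_upper=0.7, f_percent=50))
```

A strong block is the special case of a weak block with Tl = Tu, for any F.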

Once a set of nonoverlapping LD blocks has been

determined we would like to know what proportion of the

chromosome is covered by LD blocks. We could divide the

sum of the widths of the blocks by the width of the

chromosome but this would be in some sense misleading

since the denominator includes "large gaps" such as the

centromere and gaps between contigs. Therefore, we omit

these regions and consider only that portion of the chromo-

some where the SNPs are suitably dense. More precisely,

define the densely covered region of the chromosome to be the

union of all the intervals between consecutive SNPs that are

within some distance W of each other. In other words, every

part of the densely covered region must lie between two SNPs

that are within W of each other. We then define the coverage C by taking the

intersection of the blocks with the densely covered region and

dividing the total size of this intersection by the size of the densely

covered region. In our analysis, we chose W to be 100 kb.
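
The coverage definition can be made concrete with a short sketch. It assumes sorted marker positions in base pairs, non-overlapping blocks given as (start, end) positions, and that "within W" means a gap of at most W; the function names are hypothetical.

```python
def dense_intervals(positions, w=100_000):
    """Intervals between consecutive SNPs that lie within w bp of each other."""
    return [(a, b) for a, b in zip(positions, positions[1:]) if b - a <= w]


def coverage(positions, blocks, w=100_000):
    """Fraction of the densely covered region that lies inside LD blocks.

    blocks: non-overlapping (start_bp, end_bp) intervals.
    """
    dense = dense_intervals(positions, w)
    dense_size = sum(b - a for a, b in dense)
    covered = sum(
        max(0, min(b, end) - max(a, start))
        for a, b in dense
        for start, end in blocks
    )
    return covered / dense_size if dense_size else 0.0


# Toy map: the 200-kb gap between 100 kb and 300 kb is excluded from the denominator.
positions = [0, 50_000, 100_000, 300_000, 350_000]
print(coverage(positions, blocks=[(0, 100_000)]))
```

In the toy map the block spans 100 kb of a 150-kb dense region, whereas dividing by the full 350-kb span would understate how much of the usable map is covered.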

We have designed an algorithm that takes ordered

markers and produces a set of nonoverlapping blocks and

attempts to maximize the block coverage. This is consistent

with our goal of using blocks as a tool for reducing SNP

density while retaining power in an initial association

screen. We therefore wish block coverage to be as extensive

as possible to allow for maximum genotyping reduction

according to our strategy, but are not necessarily interested

in block size or blocks as indicators of haplotype diversity.

Beginning with the first marker and then proceeding, in

order, to the following markers, we find the largest block

containing each marker (these blocks may include markers

that either precede or follow the current marker). A block is

ignored if it is contained in the block previously discovered

by the algorithm. If a block overlaps with the previous block

but is not contained in it we consider the following four

options: (1) truncate the previous block by removing a

minimal number of markers so that it no longer overlaps

with the current block, (2) truncate the current block in the

same way, (3) discard the previous block, or (4) discard the

current block. The option that produces the greatest amount

of coverage is chosen, with ties yielding to the order of the

options.
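
One way to read the procedure above is sketched below. Several details are assumptions where the description leaves room for interpretation: "largest" is taken to mean the most markers (ties go to the leftmost start), a truncated block that falls below three markers is dropped, truncation of the current block is from its left edge, and coverage is measured as the summed physical span of the blocks rather than their intersection with the densely covered region. The function name is hypothetical.

```python
from itertools import combinations


def find_blocks(pos, ld, threshold, min_markers=3):
    """Greedy sketch returning non-overlapping blocks as (first, last) indices.

    pos: sorted marker positions (bp); ld[a][b]: pairwise D' (or r^2).
    """
    n = len(pos)

    def is_block(s, e):
        idx = range(s, e + 1)
        return len(idx) >= min_markers and all(
            abs(ld[a][b]) >= threshold for a, b in combinations(idx, 2)
        )

    def largest_block(i):
        # Assumption: "largest" = most markers; ties go to the leftmost start.
        best = None
        for s in range(i + 1):
            for e in range(i, n):
                if is_block(s, e) and (best is None or e - s > best[1] - best[0]):
                    best = (s, e)
        return best

    def tidy(blocks):
        # Assumption: a truncated block with too few markers is dropped.
        return [b for b in blocks if b[1] - b[0] + 1 >= min_markers]

    def disjoint(blocks):
        return all(nxt[0] > prv[1] for prv, nxt in zip(blocks, blocks[1:]))

    def covered(blocks):
        return sum(pos[e] - pos[s] for s, e in blocks)

    blocks = []
    for i in range(n):
        cur = largest_block(i)
        if cur is None:
            continue
        if not blocks or cur[0] > blocks[-1][1]:
            blocks.append(cur)  # no overlap with the previous block
            continue
        prev, head = blocks[-1], blocks[:-1]
        if prev[0] <= cur[0] and cur[1] <= prev[1]:
            continue  # contained in the previous block: ignore
        options = [
            head + [(prev[0], cur[0] - 1), cur],   # (1) truncate previous
            head + [prev, (prev[1] + 1, cur[1])],  # (2) truncate current
            head + [cur],                          # (3) discard previous
            head + [prev],                         # (4) discard current
        ]
        valid = [o for o in map(tidy, options) if disjoint(o)]
        blocks = max(valid, key=covered)  # max() keeps the first of tied options
    return blocks
```

The exhaustive search in `largest_block` is adequate for a map of a few thousand markers with short blocks, but a production version would cache pairwise checks and bound the search window.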

When haplotypes are known we are interested in finding

subsets of SNPs that determine all observed haplotypes

within blocks. We call these complete subsets, and our

algorithm for finding them consists of testing all subsets of

size 1, 2, 3, and so on, and reporting the first complete

subset found. Since for the autosomes only the genotypes

are known, this algorithm is applied only to chromosome X.
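
The complete-subset search can be sketched as an exhaustive enumeration by increasing subset size. The representation of haplotypes as equal-length strings and the function name are assumptions made for illustration; the list of observed haplotypes is assumed to be non-empty.

```python
from itertools import combinations


def smallest_complete_subset(haplotypes):
    """Return the first, smallest set of SNP indices that distinguishes
    all observed haplotypes in a block.

    haplotypes: non-empty list of equal-length allele strings, one per chromosome.
    """
    distinct = set(haplotypes)
    n_snps = len(next(iter(distinct)))
    for size in range(1, n_snps + 1):
        for subset in combinations(range(n_snps), size):
            projected = {tuple(h[i] for i in subset) for h in distinct}
            if len(projected) == len(distinct):
                return subset  # every observed haplotype remains distinguishable


# Three distinct haplotypes over three SNPs; no single SNP separates all of them.
observed = ["AAG", "AAG", "CTG", "CTA"]
print(smallest_complete_subset(observed))
```

The search always terminates because the full set of SNPs in the block trivially distinguishes all observed haplotypes; its cost grows combinatorially with block size, which is acceptable for blocks of a handful of markers.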

Using blocks to predict LD

One of the applications of determining LD blocks is to

reduce the number of markers required for an association

study by selecting LD tag SNPs from the blocks. An

underlying assumption is that if the disease gene is physically

in the block then it can be expected to be in LD with a tag SNP

from the block. We have designed a test to determine if LD

blocks work as predictors this way.

Suppose A1, A2, . . ., An are SNPs that form an LD block.

If B is an additional SNP that lies physically between A1 and

An we expect that B would be part of the block; that is, B is in

linkage disequilibrium with A1, A2, . . ., An. More precisely,

we assume that |D′| ≥ T1 for each of the pairs in the set A1, A2,

. . ., An. We then determine the empirical conditional

probability P that |D′| ≥ T2, the lower cutoff, for each of

the pairings of B with A1, A2, . . ., An. If this probability is

high then we conclude that LD blocks are good predictors of

LD with additional markers in the region. In other words, if

we already have a block, then the block structure will most

likely be preserved after new markers are added to it. We use

two cutoffs to allow for some flexibility in our test.
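
The test can be sketched as a leave-one-out procedure. How the trials were actually assembled is not spelled out above, so the enumeration below is an assumption: every run of consecutive markers is considered, each interior marker B is withheld in turn, and a trial is recorded whenever the remaining markers form a block at threshold T1. Function names are hypothetical.

```python
from itertools import combinations


def leave_one_out_trials(ld, t1, min_markers=3):
    """List (block_markers, withheld_marker) pairs.

    A trial is recorded when the run s..e, minus one interior marker b,
    has all pairwise |D'| >= t1 and at least min_markers markers.
    """
    n = len(ld)
    trials = []
    for s in range(n):
        for e in range(s + min_markers, n):
            for b in range(s + 1, e):
                rest = [m for m in range(s, e + 1) if m != b]
                if all(abs(ld[x][y]) >= t1 for x, y in combinations(rest, 2)):
                    trials.append((rest, b))
    return trials


def predictive_probability(ld, trials, t2):
    """Empirical P that the withheld marker has |D'| >= t2 with every block marker."""
    if not trials:
        return float("nan")
    hits = sum(all(abs(ld[b][a]) >= t2 for a in block) for block, b in trials)
    return hits / len(trials)


# Toy matrix: markers 0, 2, 3 are in strong LD; marker 1 is moderately linked.
ld = [
    [1.0, 0.5, 0.9, 0.9],
    [0.5, 1.0, 0.5, 0.5],
    [0.9, 0.5, 1.0, 0.9],
    [0.9, 0.5, 0.9, 1.0],
]
trials = leave_one_out_trials(ld, t1=0.7)
print(trials, predictive_probability(ld, trials, t2=0.4))
```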

Acknowledgment

Some of this work was supported by grants from the

National Institutes of Health: HG01720 (P.Y.K.), DA15129

(N.L.S.), and AA07580 (S.F.S.).

Appendix A. Supplementary data

Supplementary data for this article may be found on

ScienceDirect or at http://snp.wustl.edu/snp_research/

ld_blocks.

References

[1] The International HapMap Consortium, The International HapMap

project, Nature 426 (2003) 789.

[2] N. Risch, K. Merikangas, The future of genetic studies of complex

human diseases, Science 273 (1996) 1516–1517.

[3] P. Taillon-Miller, et al., Juxtaposed regions of extensive and minimal

linkage disequilibrium in human Xq25 and Xq28, Nat. Genet. 25

(2000) 324.

[4] M.J. Daly, J.D. Rioux, S.F. Schaffner, T.J. Hudson, E.S. Lander, High-

resolution haplotype structure in the human genome, Nat. Genet. 29

(2001) 229–232.

[5] D.E. Reich, et al., Linkage disequilibrium in the human genome,

Nature 411 (2001) 199–204.

[6] S.B. Gabriel, et al., The structure of haplotype blocks in the human

genome, Science 296 (2002) 2225–2229.

[7] N. Patil, et al., Blocks of limited haplotype diversity revealed by high-

resolution scanning of human chromosome 21, Science 294 (2001)

1719.

[8] E. Dawson, et al., A first-generation linkage disequilibrium map of

human chromosome 22, Nature 418 (2002) 544.

[9] M.S. Phillips, et al., Chromosome-wide distribution of haplotype

blocks and the role of recombination hot spots, Nat. Genet. 33 (2003)

382.

[10] K.G. Ardlie, L. Kruglyak, M. Seielstad, Patterns of linkage

disequilibrium in the human genome, Nat. Rev. Genet. 3 (2002)

299–309.

[11] A.G. Clark, et al., Linkage disequilibrium and inference of ancestral

recombination in 538 single-nucleotide polymorphism clusters across

the human genome, Am. J. Hum. Genet. 73 (2003) 285–300.

[12] C.S. Carlson, et al., Additional SNPs and linkage-disequilibrium

analyses are necessary for whole-genome association studies in

humans, Nat. Genet. 33 (2003) 518–521.

[13] D.C. Crawford, et al., Evidence for substantial fine-scale variation in

recombination rates across the human genome, Nat. Genet. 36 (2004)

700–706.

[14] T.M. Hsu, P.-Y. Kwok, Homogeneous Primer Extension Assay with

Fluorescence Polarization Detection, in Single Nucleotide Poly-

morphisms: Method And Protocols, Humana Press, Totowa, NJ,

2002, pp. 177–188.

[15] K.T. Zondervan, L.R. Cardon, The complex interplay among

factors that influence allelic association, Nat. Rev. Genet. 5 (2004)

89–100.

[16] R.M. Pfeiffer, M.H. Gail, Sample size calculations for population and

family-based case-control association studies on marker genotypes,

Genet. Epidemiol. 25 (2003) 136–148.

[17] X. Chen, L. Levine, P.-Y. Kwok, Fluorescence polarization in

homogeneous nucleic acid analysis, Genome Res. 9 (1999) 492.

[18] T.M. Hsu, P.-Y. Kwok, Homogeneous primer extension assay with

fluorescence polarization detection, Methods Mol. Biol. 212 (2003)

177.

[19] T.M. Hsu, X. Chen, S. Duan, R.D. Miller, P.-Y. Kwok, Universal SNP

genotyping assay with fluorescence polarization detection, Biotechni-

ques 31 (2001) 560.

[20] P.-Y. Kwok, SNP genotyping with fluorescence polarization detection,

Hum. Mutat. 19 (2002) 315.

[21] R.A. Greene, et al., A Novel Method for SNP Analysis Using

Fluorescence Polarization. P10131, Perkin-Elmer Life Sciences,

Boston, 2001.

[22] S. Rozen, H. Skaletsky, Primer3 on the WWW for general users

and for biologist programmers, Methods Mol. Biol. 132 (2000)

365.

[23] E.F. Vieux, P.-Y. Kwok, R.D. Miller, Primer design for PCR and

sequencing in high-throughput analysis of SNPs, Biotechniques 32

(2002) S28.

[24] R. Owczarzy, et al., Predicting sequence-dependent melting stability of

short duplex DNA oligomers, Biopolymers 44 (1997) 217.

[25] R.D. Miller, S. Duan, E.G. Lovins, E.F. Kloss, P.-Y. Kwok, Efficient

high-throughput resequencing of genomic DNA, Genome Res. 13

(2003) 717.

[26] L. Excoffier, M. Slatkin, Maximum-likelihood estimation of molec-

ular haplotype frequencies in a diploid population, Mol. Biol. Evol. 12

(1995) 921.

[27] G.R. Abecasis, W.O.C. Cookson, GOLD-graphical overview of

linkage disequilibrium, Bioinformatics 16 (2000) 182.

[28] P. Sham, Statistics in Human Genetics, Arnold, London, 1998.

[29] B.S. Weir, Genetic Data Analysis II, Sinauer, Sunderland, MA,

1996.

[30] S. Purcell, S.S. Cherny, P.C. Sham, Genetic power calculator: design

of linkage and association genetic mapping studies of complex traits,

Bioinformatics 19 (2003) 149.