introduction to linkage analysis course “study design and data analysis for genetic studies”,...

Introduction to linkage analysis

Course “Study Design and Data Analysis for Genetic Studies”, Universidad del Zulia, Maracaibo, Venezuela, 9-10 April 2005

Harald H.H. Göring

“Marker” loci

There are many different types of polymorphisms, e.g.:

• single nucleotide polymorphism (SNP):

AAACATAGACCGGTT

AAACATAGCCCGGTT

• microsatellite/variable number of tandem repeat (VNTR):

AAACATAGCACACA----CCGGTT

AAACATAGCACACACACCGGTT

• insertion/deletion (indel):

AAACATAGACCACCGGTT

AAACATAG--------CCGGTT

• restriction fragment length polymorphism (RFLP)

…

Tracing chromosomal inheritance using “marker” locus genotypes

1/2 3/4

1/5 4/5

5/5 5/5

1/4 5/5

Tracing chromosomal inheritance (fully informative situation)

Linkage analysis: locus with known genotypes

1/2 3/3 2/4 1/1

1/3

2/3 1/3

1/2

Where do the observed genotypes “fit”?

Linkage analysis

In linkage analysis, one evaluates statistically whether or not the alleles at 2 loci co-segregate during meiosis more often than expected by chance. If the evidence of increased co-segregation is convincing, one generally concludes that the 2 loci are “linked”, i.e. are located on the same chromosome (“syntenic loci”). The degree of co-segregation provides an estimate of the proximity of the 2 loci, with near complete co-segregation for very tightly linked loci.

Let’s step back…

to Mendel

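One of Mendel’s pea crosses

[Figure: cross of two pea lines over the generations P1, F1 and F2. The F1 generation illustrates Mendel’s law of uniformity; the F2 generation illustrates Mendel’s law of independent assortment.]

F2 phenotype classes:

observed: 315 108 101 32

~ ratio: 9 : 3 : 3 : 1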

Mendel’s law of segregation

P1: 1/1 × 2/2
F1: 1/2 × 1/2
F2: 1/1 (25%), 1/2 (50%), 2/2 (25%) (in expectation)

P1: a/a × b/b
F1: a/b × a/b
F2: a/a (25%), a/b (50%), b/b (25%) (in expectation)

Mendel’s law of independent assortment

P1: 1/1 a/a × 2/2 b/b
F1: 1/2 a/b × 1/2 a/b
F2 (in expectation):

genotype    frequency
1/1 a/a     6.25%
1/1 a/b     12.5%
1/1 b/b     6.25%
1/2 a/a     12.5%
1/2 a/b     25%
1/2 b/b     12.5%
2/2 a/a     6.25%
2/2 a/b     12.5%
2/2 b/b     6.25%

Assume we did this experiment and observed the following:

F2: 1/1 a/a (25%), 1/2 a/b (50%), 2/2 b/b (25%)

→ non-independent assortment

Co-segregation (due to linkage)

[Figure: P1 generation (diploid): 1a/1a and 2b/2b → gametes (haploid): 1a and 2b → F1 generation (diploid): 1a/2b → gametes (haploid): 1a and 2b → F2 generation (diploid): 1a/1a, 1a/2b and 2b/2b.]

Recombination

Recombination between 2 loci is said to have occurred if an individual received, from one parent, alleles (at these 2 loci) that originated in 2 different grandparents.

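Who is a recombinant?

Grandparents: 1/1 a/a and 2/2 b/b
Parents: 1/2 a/b (phase known: haplotypes 1a and 2b) and 3/3 c/c
Offspring genotypes: 1/3 a/c, 2/3 b/c, 1/3 a/c, 1/3 b/c, 1/3 a/c, 1/3 a/c, 2/3 b/c, 2/3 b/c, 2/3 a/c, 2/3 b/c
Recombination status (N = non-recombinant, R = recombinant): N N N R N N N N R N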

Grandparents: 1/1 a/a and 2/2 b/b
Parent: 1/2 a/b (haplotypes 1a and 2b)
Possible gametes: 1a (N), 1b (R), 2a (R), 2b (N)

Possible explanations for recombination

[Figure: a parent with haplotypes 1a and 2b transmits a recombinant gamete (1b or 2a). Three panels:]

I. different chromosomes

II. homologous recombination during meiosis

III. genotyping error

Recombination fraction

The recombination fraction between 2 loci is defined as the proportion of meioses resulting in a recombinant gamete. For loci on different chromosomes (or for loci far apart on the same, large chromosome), the recombination fraction is 0.5. Such loci are said to be unlinked. For loci close together on the same chromosome, the recombination fraction is < 0.5. Such loci are said to be linked. The closer the loci, the smaller the recombination fraction (ϑ → 0).

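Estimation of recombination fraction

Grandparents: 1/1 a/a and 2/2 b/b
Parents: 1/2 a/b (phase known: haplotypes 1a and 2b) and 3/3 c/c
Offspring genotypes: 1/3 a/c, 2/3 b/c, 1/3 a/c, 1/3 b/c, 1/3 a/c, 1/3 a/c, 2/3 b/c, 2/3 b/c, 2/3 a/c, 2/3 b/c
Recombination status: N N N R N N N N R N

$$\hat{\vartheta} = \frac{\#R}{\#R + \#N} = \frac{2}{2 + 8} = \frac{2}{10} = 0.2$$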
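The counting estimate above can be sketched in a few lines of Python. This is only an illustration for the fully informative, phase-known case; the function name `estimate_theta` and the string encoding of the N/R calls are assumptions made for this example, not part of the course material.

```python
def estimate_theta(status: str) -> float:
    """Estimate the recombination fraction by direct counting.

    `status` is a whitespace-separated string of per-meiosis calls,
    'N' (non-recombinant) or 'R' (recombinant). Hypothetical helper.
    """
    calls = status.split()
    n_rec = calls.count("R")
    n_nonrec = calls.count("N")
    if n_rec + n_nonrec == 0:
        raise ValueError("no informative meioses")
    # proportion of informative meioses that are recombinant
    return n_rec / (n_rec + n_nonrec)


# The 10 offspring of the example pedigree
print(estimate_theta("N N N R N N N N R N"))  # 0.2
```

Direct counting only works when every meiosis can be scored unambiguously; the likelihood-based approach on the following slides handles the cases where it cannot.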

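Missing phase information: Who is a recombinant??

Parents: 1/2 a/b (phase unknown) and 3/3 c/c
Offspring genotypes: 1/3 a/c, 2/3 b/c, 1/3 a/c, 1/3 b/c, 1/3 a/c, 1/3 a/c, 2/3 b/c, 2/3 b/c, 2/3 a/c, 2/3 b/c
Recombination status under phase 1 (parental haplotypes 1a and 2b): N N N R N N N N R N
Recombination status under phase 2 (parental haplotypes 1b and 2a): R R R N R R R R N R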

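Missing phase and genotype information: Who is a recombinant??

Parents: ?/? a/b (marker genotype 1/2 treated as unknown) and 3/3 c/c
Offspring genotypes: 1/3 a/c, 2/3 b/c, 1/3 a/c, 1/3 b/c, 1/3 a/c, 1/3 a/c, 2/3 b/c, 2/3 b/c, 2/3 a/c, 2/3 b/c

Who is a recombinant???

Parents: ?/? a/b and ?/? c/c (both marker genotypes unknown)
Offspring genotypes: 1/3 a/c, 2/3 b/c, 1/3 a/c, 1/3 b/c, 1/3 a/c, 1/3 a/c, 2/3 b/c, 2/3 b/c, 2/3 a/c, 2/3 b/c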

Likelihood

• The likelihood of a hypothesis (e.g. specific parameter value(s)) on a given dataset, L(hypothesis|data), is defined to be proportional to the probability of the data given the hypothesis, P(data|hypothesis):

L(hypothesis|data) = constant * P(data|hypothesis)

• Because of the proportionality constant, a likelihood by itself has no interpretation.

• The likelihood ratio (LR) of 2 hypotheses is meaningful if the 2 hypotheses are nested (i.e., one hypothesis is contained within the other):

$$LR = \frac{L(H_1 \mid \text{data})}{L(H_0 \mid \text{data})} = \frac{c\,P(\text{data} \mid H_1)}{c\,P(\text{data} \mid H_0)} = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}$$

• Under certain conditions, maximum likelihood estimates are asymptotically unbiased and asymptotically efficient. Likelihood theory describes how to interpret a likelihood ratio.

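Evaluating the evidence of linkage: lod score

The lod (logarithm of odds) score is defined as the logarithm (to the base 10) of the likelihood ratio of 2 hypotheses on a given dataset:

$$\mathrm{lod} = \log_{10} \frac{L(H_1 \mid \text{data})}{L(H_0 \mid \text{data})}$$

In linkage analysis, typically the different hypotheses refer to different values of the recombination fraction:

$$Z(\vartheta) = \log_{10} \frac{L(\text{linkage at specific recombination fraction} \mid \text{data})}{L(\text{no linkage} \mid \text{data})} = \log_{10} \frac{L(\vartheta \mid \text{data})}{L(\vartheta = 0.5 \mid \text{data})}$$

$$Z_{\max} = \log_{10} \frac{\max_{\vartheta} L(\vartheta \mid \text{data})}{L(0.5 \mid \text{data})}$$

Asymptotically, $2\ln(10)\,Z_{\max} \sim 0.5\,\chi^2_{(1)}$, i.e. a 50:50 mixture of a point mass at 0 and a χ² distribution with 1 degree of freedom.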

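Example lod score calculation

Grandparents: 1/1 a/a and 2/2 b/b
Parents: 1/2 a/b (phase known: haplotypes 1a and 2b) and 3/3 c/c
Offspring genotypes: 1/3 a/c, 2/3 b/c, 1/3 a/c, 1/3 b/c, 1/3 a/c, 1/3 a/c, 2/3 b/c, 2/3 b/c, 2/3 a/c, 2/3 b/c
Who is a recombinant? N N N R N N N N R N

$$Z(\vartheta) = \log_{10} \frac{L(\vartheta \mid \text{data})}{L(\vartheta = 0.5 \mid \text{data})} = \log_{10} \frac{c\,P(\text{data} \mid \vartheta)}{c\,P(\text{data} \mid \vartheta = 0.5)} = \log_{10} \frac{\vartheta^{2} (1 - \vartheta)^{8}}{0.5^{2} (1 - 0.5)^{8}}$$

ϑ      Z(ϑ)
0      −∞
0.1    0.644
0.2    0.837
0.3    0.725
0.4    0.439
0.5    0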
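The phase-known lod score of this example (2 recombinants, 8 non-recombinants) can be reproduced with a short Python sketch. The function name `lod_known_phase` and its argument names are hypothetical; the formula is the one given on the slide.

```python
import math


def lod_known_phase(theta: float, n_rec: int, n_nonrec: int) -> float:
    """Z(theta) = log10[ theta^R (1-theta)^N / 0.5^(R+N) ] for phase-known meioses."""
    lik = theta ** n_rec * (1.0 - theta) ** n_nonrec
    lik_null = 0.5 ** (n_rec + n_nonrec)
    if lik == 0.0:
        # e.g. theta = 0 although recombinants were observed
        return -math.inf
    return math.log10(lik / lik_null)


# Tabulate the lod score curve for the example pedigree
for theta in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"{theta:.1f}  {lod_known_phase(theta, n_rec=2, n_nonrec=8):.3f}")
```

The printed values agree with the table on the slide to within rounding in the third decimal, and the curve peaks at the counting estimate of 0.2.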

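Example lod score calculation (missing phase information)

Missing phase information: Who is a recombinant??

Parents: 1/2 a/b (phase unknown) and 3/3 c/c
Offspring genotypes: 1/3 a/c, 2/3 b/c, 1/3 a/c, 1/3 b/c, 1/3 a/c, 1/3 a/c, 2/3 b/c, 2/3 b/c, 2/3 a/c, 2/3 b/c
Recombination status under phase 1 (parental haplotypes 1a and 2b): N N N R N N N N R N
Recombination status under phase 2 (parental haplotypes 1b and 2a): R R R N R R R R N R

$$P(\text{data} \mid \vartheta) = P(\text{phase 1})\,P(\text{data} \mid \text{phase 1}, \vartheta) + P(\text{phase 2})\,P(\text{data} \mid \text{phase 2}, \vartheta)$$

$$Z(\vartheta) = \log_{10} \frac{c\,P(\text{data} \mid \vartheta)}{c\,P(\text{data} \mid \vartheta = 0.5)} = \log_{10} \frac{\frac{1}{2}\vartheta^{2}(1 - \vartheta)^{8} + \frac{1}{2}\vartheta^{8}(1 - \vartheta)^{2}}{\frac{1}{2}\,0.5^{2}(1 - 0.5)^{8} + \frac{1}{2}\,0.5^{8}(1 - 0.5)^{2}}$$

ϑ      Z(ϑ)
0      −∞
0.1    0.343
0.2    0.536
0.3    0.427
0.4    0.175
0.5    0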
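When phase is unknown, the likelihood averages over the two equally likely phases. The sketch below implements that average for a single doubly heterozygous parent; `lod_unknown_phase` and its arguments are illustrative names, and equal prior phase probabilities of 1/2 are assumed, as on the slide.

```python
import math


def lod_unknown_phase(theta: float, k: int, n: int) -> float:
    """Lod score when one phase implies k recombinants out of n meioses
    and the other phase implies n - k, each phase having prior probability 1/2."""
    lik = (0.5 * theta ** k * (1.0 - theta) ** (n - k)
           + 0.5 * theta ** (n - k) * (1.0 - theta) ** k)
    lik_null = 0.5 ** n  # both phases give 0.5^n when theta = 0.5
    if lik == 0.0:
        return -math.inf
    return math.log10(lik / lik_null)


for theta in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"{theta:.1f}  {lod_unknown_phase(theta, k=2, n=10):.3f}")
```

Compared with the phase-known case, the maximum lod score drops by roughly log10(2) ≈ 0.3, which is the price of not knowing the phase in a single family of this size.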

Example lod score calculation (missing phase and genotype information)

Who is a recombinant???

Parents: ?/? a/b and ?/? c/c (both marker genotypes unknown)
Offspring genotypes: 1/3 a/c, 2/3 b/c, 1/3 a/c, 1/3 b/c, 1/3 a/c, 1/3 a/c, 2/3 b/c, 2/3 b/c, 2/3 a/c, 2/3 b/c

$$P(\text{data} \mid \vartheta, \text{allele frequencies}) = \sum_{\substack{\text{parental genotypes} \\ \text{and phases}}} P(\text{parental genotypes and phases}) \times P(\text{data} \mid \text{parental genotypes}, \text{phase}, \vartheta)$$

Assuming 3 equally frequent alleles, i.e. P(1) = P(2) = P(3) = 0.333:

ϑ      Z(ϑ)
0      -0.304
0.1    0.204
0.2    0.346
0.3    0.264
0.4    0.096
0.5    0

Assuming P(1) = 0.495, P(2) = 0.495, P(3) = 0.010:

ϑ      Z(ϑ)
0      -0.378
0.1    0.183
0.2    0.332
0.3    0.253
0.4    0.091
0.5    0

Lod score curves (for previous example pedigree)

[Figure: lod score (y-axis, -1 to 1) plotted against recombination fraction (x-axis, 0 to 0.5) for three situations: known phase, known genotypes; unknown phase, known genotypes; unknown phase, unknown genotypes.]

Interpretation of lod score

• The traditional threshold for declaring evidence of linkage statistically significant is a lod score of 3, or a likelihood ratio of 1000:1, meaning the likelihood of linkage on the data is 1000 times higher than the likelihood of no linkage on the data.

• Asymptotically, a lod score of 3 has a point-wise significance level (p-value) of 0.0001. In other words, the probability of obtaining a lod score of at least this magnitude by chance at a given position is 0.0001.

• Due to the many linkage tests being conducted as part of a genome-wide linkage scan, a lod score of 3 has a genome-wide significance level of ~0.05.

P-value

The p-value is defined as the probability of obtaining an outcome at least as extreme as observed by chance (i.e. when the null hypothesis is true).

Example: Testing whether a coin is fair

H0: P(head) = 0.5

H1: P(head) ≠ 0.5 (2-sided alternative hypothesis).

You observe 1 head out of 10 coin tosses. The p-value then is the probability of observing exactly 1 head in 10 trials (observed outcome), or 0 heads in 10 trials (more extreme outcome), or 9 (equally extreme outcome) or 10 (more extreme outcome) heads in 10 trials.

$$p = \sum_{i \in \{0, 1, 9, 10\}} \binom{10}{i}\, 0.5^{i} (1 - 0.5)^{10 - i} = \frac{1 + 10 + 10 + 1}{1024} = \frac{22}{1024} \approx 0.021$$
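The exact binomial p-values used in these examples can be checked with a small Python sketch. The helper names are hypothetical, and "at least as extreme" is taken here to mean "no more probable under H0 than the observed outcome", which is one common convention and coincides with the slide's enumeration when P(head) = 0.5.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Sum the probabilities of all outcomes no more probable than the observed one."""
    observed = binom_pmf(k, n, p)
    return sum(binom_pmf(i, n, p) for i in range(n + 1)
               if binom_pmf(i, n, p) <= observed * (1.0 + 1e-12))


def one_sided_lower_p(k: int, n: int, p: float = 0.5) -> float:
    """P(at most k successes), e.g. at most k recombinants under no linkage."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


print(two_sided_p(1, 10))        # coin example: 22/1024
print(one_sided_lower_p(0, 10))  # 0 recombinants in 10 meioses: 1/1024
```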

P-value

Example: Testing whether 2 loci are linked

H0: P(recombination) = 0.5

H1: P(recombination) < 0.5 (1-sided alternative hypothesis).

You observe 0 recombinants and 10 non-recombinants in 10 informative meioses. The p-value then is the probability of observing exactly 0 recombinants in 10 trials (observed outcome; there is no more extreme outcome).

$$p = \binom{10}{0}\, 0.5^{0} (1 - 0.5)^{10 - 0} = \frac{1}{1024} \approx 0.001$$

Lod score

For the same example (0 recombinants and 10 non-recombinants in 10 informative meioses, H0: P(recombination) = 0.5, H1: P(recombination) < 0.5), the maximum of the likelihood is reached at ϑ = 0:

$$Z_{\max} = \log_{10} \frac{\max_{\vartheta} L(\vartheta \mid \text{data})}{L(0.5 \mid \text{data})} = \log_{10} \frac{L(\vartheta = 0 \mid \text{data})}{L(0.5 \mid \text{data})} = \log_{10} \frac{1}{1/1024} = \log_{10} 1024 \approx 3$$

In the ideal case, 10 fully informative meioses may suffice to obtain significant evidence of linkage.

Lod score and significance level

lod score (point-wise) p-value

0.588 0.05

1.175 0.01

2.000 ~0.001

3.000 0.0001

4.000 ~0.00001

5.000 ~0.000001
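The table can be approximately reproduced from the asymptotic result quoted earlier in the lecture, namely that 2 ln(10) times the lod score follows a 50:50 mixture of a point mass at 0 and a χ² distribution with 1 degree of freedom. The following sketch uses only the Python standard library; the function name is hypothetical, and the identity P(χ²₁ > x) = erfc(√(x/2)) is a standard property of the χ² distribution with 1 degree of freedom.

```python
import math


def lod_to_pointwise_p(lod: float) -> float:
    """Asymptotic point-wise p-value of a (maximized, 1-parameter) lod score."""
    if lod <= 0.0:
        return 1.0  # every outcome is at least as extreme as a lod score of 0
    chi2 = 2.0 * math.log(10.0) * lod
    # half the upper-tail probability of a chi-square with 1 degree of freedom
    return 0.5 * math.erfc(math.sqrt(chi2 / 2.0))


for lod in (0.588, 1.175, 2.0, 3.0, 4.0, 5.0):
    print(f"{lod:.3f}  {lod_to_pointwise_p(lod):.6f}")
```

These are large-sample approximations; for a single small pedigree the exact p-value can differ, as the 10-meioses example (exact p ≈ 0.001 at a lod score of about 3) illustrates.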

Linkage analysis reduces multiple testing problem

• Linkage analysis is so useful because it greatly reduces the multiple testing problem: ~3,000,000,000 bp of DNA are interrogated in ~500 independent linkage tests for human data. This is possible because a meiotic recombination event occurs on average only once every 100,000,000 bp.

• No specification of prior hypotheses is therefore necessary, as all possible hypotheses can be screened.

Linkage analysis: trait locus with unknown genotypes

?/? ?/? ?/? ?/?

?/?

?/? ?/?

?/?


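Statistical gene mapping with trait phenotypes

[Diagram: the unobserved trait locus genotypes are connected to the observed marker genotypes through genetic distance (linkage, allelic association) and to the observed trait phenotypes through etiology (?). The correlation to be detected is the one between observed marker genotypes and observed trait phenotypes.]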

Many different types of linkage methods

• penetrance model-based linkage analysis (“classical” linkage analysis)

• penetrance model-free linkage analysis (“model-free” or “non-parametric” linkage analysis)
– affected sib-pair linkage analysis
– affected relative-pair linkage analysis
– regression-based linkage analysis
– variance components-based linkage analysis
– …

Variation with each linkage method

• 2-point analysis vs. multiple 2-point analysis vs. multi-point analysis

• exact calculation vs. approximation (e.g., MCMC)

• qualitative trait vs. quantitative traits

• rare “simple mendelian” diseases vs. common “complex multifactorial diseases”

• …

Penetrance-model-based linkage analysis

Segregation analysis

In segregation analysis, one attempts to characterize the mode of inheritance of a trait, by statistically examining the segregation pattern of the trait through a sample of related individuals.

In a way, heritability analysis is a form of segregation analysis. In heritability analysis, the analysis is not focused on characterization of the segregation pattern per se, but on quantification of inheritance assuming a given mode of inheritance (such as, generally, additivity/co-dominance).

Relationship between genotypes and phenotypes (penetrances) at the ABO blood group locus

penetrance: P(phenotype given genotype)

            Phenotype (blood group)
Genotype    A    B    AB   O
A/A         1    0    0    0
A/B         0    0    1    0
A/O         1    0    0    0
B/B         0    1    0    0
B/O         0    1    0    0
O/O         0    0    0    1

Probability model correlating trait phenotypes and trait locus genotypes: penetrances

penetrance: P(phenotype given genotype)

Ex.: fully-penetrant dominant disease without “phenocopies”

              Phenotype
Genotype      unaffected   affected
+/+           1            0
D/+ or +/D    0            1
D/D           0            1
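A penetrance table can be turned around with Bayes' rule to obtain genotype probabilities given a phenotype. The sketch below does this for a single individual considered in isolation (no relatives), under two assumptions that are not part of the slides: Hardy-Weinberg genotype frequencies and a hypothetical disease allele frequency of 0.01. All names are illustrative.

```python
# P(affected | genotype) for the fully penetrant dominant model without phenocopies
PENETRANCE_DOMINANT = {"+/+": 0.0, "D/+": 1.0, "D/D": 1.0}


def genotype_priors(q: float) -> dict:
    """Hardy-Weinberg genotype frequencies for disease allele frequency q (assumption)."""
    return {"+/+": (1 - q) ** 2, "D/+": 2 * q * (1 - q), "D/D": q ** 2}


def genotype_posterior(affected: bool, q: float, penetrance: dict) -> dict:
    """P(genotype | phenotype) by Bayes' rule for an individual without relatives."""
    priors = genotype_priors(q)
    joint = {g: priors[g] * (penetrance[g] if affected else 1.0 - penetrance[g])
             for g in priors}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


print(genotype_posterior(True, 0.01, PENETRANCE_DOMINANT))
print(genotype_posterior(False, 0.01, PENETRANCE_DOMINANT))
```

With a rare disease allele, an affected individual is almost certainly D/+ rather than D/D, which is why pedigree drawings under a rare dominant model label affected individuals as D/+.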

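Statistical gene mapping with trait phenotypes: “simple” dominant inheritance model

[Diagram: as before, the trait locus genotypes are connected to the observed marker genotypes through genetic distance (linkage, allelic association), and the correlation to be detected is the one between observed marker genotypes and observed trait phenotypes. Under the “simple” dominant model the trait locus genotypes follow from the phenotypes: affected = D/+, not affected = +/+.]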

Linkage analysis: trait locus (genotypes based on assumed dominant inheritance model)

+/+

D/+

+/+

+/+

D/+

D/+

D/+

+/+


Example of multipoint lod score curve: Pseudoxanthoma elasticum

[Figure: multipoint lod scores (y-axis: lod score, 0 to 9) plotted against map position in cM (x-axis, 0 to 30) for three analyses: AR, AD and NPL.]

From: Le Saux et al (1999) Pseudoxanthoma elasticum maps to an 820 kb region of the p13.1 region of chromosome 16. Genomics 62:1-10

Genetic heterogeneity

[Figure: four scenarios, each drawn along a time axis:]

• locus homogeneity, allelic homogeneity

• locus homogeneity, allelic heterogeneity

• locus heterogeneity, allelic homogeneity (at each locus)

• locus heterogeneity, allelic heterogeneity (at each locus)

Pros and cons of penetrance-model-based linkage analysis

+ potentially very powerful (under suitable penetrance model)
+ statistically well-behaved

- requires specification of penetrance model; not powerful at all under unsuitable penetrance model

dominant inheritance: P(aff.|DD or D+) = 1, P(aff.|++) = 0

recessive inheritance: P(aff.|DD) = 1, P(aff.|++ or D+) = 0

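Effects of model misspecification

Marker genotypes: parents 1/2 and 3/4; offspring 1/3, 1/4 and 2/3

Trait locus genotypes implied by the assumed model:

model       parents         offspring             parents informative for linkage?
dominant    +/+ and D/+     D/+, +/+ and D/+      uninformative, informative
recessive   D/+ and D/D     D/D, D/+ and D/D      informative, uninformative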

Pros and cons of penetrance-model-based linkage analysis

+ potentially very powerful (under suitable penetrance model)
+ statistically well-behaved

- requires specification of penetrance model; not powerful at all under unsuitable penetrance model
- modeling flexibility limited
- computationally intensive

“Mendelian” vs. “complex” traits

“simple mendelian” disease

• genotypes of a single locus cause disease

• often little genetic (locus) heterogeneity (sometimes even little allelic heterogeneity); little interaction between genotypes at different genes

• often hardly any environmental effects

• often low prevalence

• often early onset

• often clear mode of inheritance

• “good” pedigrees for gene mapping can often be found

• often straightforward to map

“complex multifactorial” disease

• genotypes of a single locus merely increase risk of disease

• genotypes of many different genes (and various environmental factors) jointly and often interactively determine the disease status

• important environmental factors

• often high prevalence

• often late onset

• no clear mode of inheritance

• not easy to find “good” pedigrees for gene mapping

• difficult to map

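A quantitative trait is not necessarily complex

[Diagram: the trait locus genotypes are connected to the observed marker genotypes through genetic distance (linkage, allelic association) and to the observed trait phenotypes through etiology given ascertainment. The correlation to be detected is the one between observed marker genotypes and observed trait phenotypes.]

$$P(G_{\text{trait locus}} \mid Ph_{\text{trait}}) \to 1$$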

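Fundamental problem in complex trait gene mapping

[Diagram: the trait locus genotypes are connected to the observed marker genotypes through genetic distance (linkage, allelic association) and to the observed trait phenotypes through etiology given ascertainment. The correlation to be detected is the one between observed marker genotypes and observed trait phenotypes.]

$$P(G_{\text{trait locus}} \mid Ph_{\text{trait}}) \to P(G_{\text{trait locus}}) \approx \text{small}$$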

Etiological complexity

[Diagram: the trait phenotype is influenced by gene 1, gene 2, gene 3 and other gene(s), each with genotype 1, genotype 2 and other genotypes, and by environmental factor 1, environmental factor 2, environmental factor 3 and other environmental factor(s).]

How to improve power to detect correlations between trait phenotypes and trait locus genotypes?

[Diagram: trait locus genotypes connected to observed trait phenotypes through etiology.]

$$P(G_{\text{trait locus}} \mid Ph_{\text{trait}}) \to 0 \text{ or } 1$$

How to simplify the etiological architecture?

• choose tractable trait
  – Are there sub-phenotypes within trait?
    • age of onset
    • severity
    • combination of symptoms (syndrome)
  – “endophenotype” or “biomarker” vs. disease
    • quantitative vs. qualitative (discrete)
    • Dichotomizing quantitative phenotypes leads to loss of information.
    • simple/cheap measurement vs. uncertain/expensive diagnosis
    • not as clinically relevant, but with simpler etiology

• given trait, choose appropriate study design/ascertainment protocol
  – study population
    • genetic heterogeneity
    • environmental heterogeneity
  – “random” ascertainment vs. ascertainment based on phenotype of interest
    • single or multiple probands
    • concordant or discordant probands
    • pedigrees with apparent “mendelian” inheritance?
    • inbred pedigrees?
  – data structures
    • singletons, small pedigrees, large pedigrees
  – account for/stratify by known genetic and environmental risk factors

Affected sib-pair linkage analysis

Identity-by-state (IBS) vs. identity-by-descent (IBD)

family   parents      sibs        sharing of allele 1 by the sibs
1        1/2 × 3/4    1/3, 1/4    IBD (also IBS)
2        1/2 × 1/3    1/2, 1/3    IBS (not IBD)
3        1/1 × 2/3    1/3, 1/2    ?
4        1/2 × 1/2    1/2, 1/2    ? (both or neither IBD)

If IBD then necessarily IBS (assuming absence of mutation event).

If IBS then not necessarily IBD (unless a locus is 100% informative, i.e. has an infinite number of alleles, each with infinitesimally small allele frequency).

Probabilistic inference of IBD

family                    1            2            3            4
parents                   1/2 × 3/4    1/2 × 1/3    1/1 × 2/3    1/2 × 1/2
sibs                      1/3, 1/4     1/2, 1/3     1/3, 1/2     1/2, 1/2
alleles shared IBD        1            0            0.5          1
alleles not shared IBD    1            2            1.5          1
proportion IBD            0.5          0            0.25         0.5
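The expected IBD counts in the table can be obtained by enumerating all equally likely transmissions from the two parents and keeping those consistent with the observed sib genotypes. The sketch below does this for a single marker in a nuclear family with both parents genotyped; the function name and the tuple encoding of genotypes are assumptions made for illustration.

```python
from itertools import product


def expected_ibd(father, mother, sib1, sib2):
    """Expected number of alleles (0 to 2) shared IBD by two sibs at one marker.

    Genotypes are pairs of allele labels, e.g. (1, 2). Each parent transmits
    either of its two alleles with probability 1/2 to each sib, so all 16
    transmission patterns are equally likely a priori.
    """
    ibd_counts = []
    for f1, m1, f2, m2 in product(range(2), repeat=4):
        g1 = sorted((father[f1], mother[m1]))
        g2 = sorted((father[f2], mother[m2]))
        if g1 == sorted(sib1) and g2 == sorted(sib2):
            # same paternal copy and/or same maternal copy transmitted to both sibs
            ibd_counts.append(int(f1 == f2) + int(m1 == m2))
    if not ibd_counts:
        raise ValueError("sib genotypes are incompatible with parental genotypes")
    return sum(ibd_counts) / len(ibd_counts)


# The four families of the slide
print(expected_ibd((1, 2), (3, 4), (1, 3), (1, 4)))  # 1.0
print(expected_ibd((1, 2), (1, 3), (1, 2), (1, 3)))  # 0.0
print(expected_ibd((1, 1), (2, 3), (1, 3), (1, 2)))  # 0.5
print(expected_ibd((1, 2), (1, 2), (1, 2), (1, 2)))  # 1.0
```

Dividing the result by 2 gives the proportion of alleles shared IBD shown in the last row of the table.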

Rationale of affected sib-pair linkage analysis

A pair of sibs affected with the same disorder is expected to share the alleles at the trait locus/loci---and also alleles at linked loci---more often (> 50 %) than a random pair of sibs (50 %).

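Basic concept of affected sib pair linkage analysis

[Pedigree: parents with marker genotypes 1/2 and 3/4; affected sibs with marker genotypes 1/3 and 1/4. For each parent the question is “IBD?”: the allele from the 1/2 parent is shared IBD, the allele from the 3/4 parent is not (NIBD).]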

Affected sib pair linkage analysis (mean test)

[Pedigree: parents 1/2 and 3/4; affected sibs 1/3 and 1/4. Allele from the 1/2 parent: IBD; allele from the 3/4 parent: NIBD.]

Conditional on the fact that both sibs are affected, test if:

$$n_{IBD} > n_{NIBD}\ ?$$
$$\Pr(IBD) > \Pr(NIBD)\ ?$$
$$\Pr(IBD) > 0.5\ ?$$

                           NIBD      IBD
counts in example ped.     1         1
total counts in dataset    n_NIBD    n_IBD

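Affected sib pair linkage analysis (mean test)

[Pedigree: parents 1/2 and 3/4; affected sibs 1/3 and 1/4. Allele from the 1/2 parent: IBD; allele from the 3/4 parent: NIBD.]

                    NIBD      IBD
probability         1 − φ     φ
counts in ex.       1         1
total counts        n_NIBD    n_IBD

$$2 \ln \frac{L(\hat{\varphi})}{L(\varphi = 0.5)} = 2 \ln \frac{(1 - \hat{\varphi})^{n_{NIBD}}\, \hat{\varphi}^{n_{IBD}}}{(1 - 0.5)^{n_{NIBD}}\, 0.5^{n_{IBD}}}$$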
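A minimal sketch of the likelihood ratio statistic for the mean test follows. The function name and the example counts (60 IBD, 40 NIBD) are hypothetical, and the estimate of φ is restricted to values ≥ 0.5, which is an assumption reflecting the one-sided alternative Pr(IBD) > 0.5.

```python
import math


def _n_log_p(n: float, p: float) -> float:
    """n * ln(p), with the convention 0 * ln(0) = 0."""
    return 0.0 if n == 0 else n * math.log(p)


def mean_test(n_ibd: float, n_nibd: float):
    """Return (phi_hat, 2 ln LR, lod) for IBD/NIBD counts from affected sib pairs."""
    n = n_ibd + n_nibd
    phi_hat = max(n_ibd / n, 0.5)  # one-sided: no evidence for linkage if sharing <= 50%
    loglik_hat = _n_log_p(n_nibd, 1.0 - phi_hat) + _n_log_p(n_ibd, phi_hat)
    loglik_null = n * math.log(0.5)
    stat = 2.0 * (loglik_hat - loglik_null)
    return phi_hat, stat, stat / (2.0 * math.log(10.0))


phi_hat, stat, lod = mean_test(n_ibd=60, n_nibd=40)
print(phi_hat, round(stat, 3), round(lod, 3))
```

The last returned value expresses the same statistic on the lod scale, using the relation 2 ln(10) × lod = 2 ln LR from the earlier slides.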

Penetrance-model based linkage analysis on affected sib pair

1/2 3/4

1/3 1/4

Trait locus genotypes are inferred probabilistically conditional on observed phenotypes according to an assumed inheritance model (number of alleles, allele frequencies and genotypic penetrances).

?/? ?/?

?/? ?/?

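Penetrance-model-based linkage analysis on affected sib pair assuming a rare recessive trait w/o “phenocopies”

[Pedigree: parents with marker genotypes 1/2 and 3/4 and trait locus genotypes D/+ and D/+; affected sibs with marker genotypes 1/3 and 1/4 and trait locus genotypes D/D and D/D.]

Conditional on the fact that both affected sibs inherited the D allele from each parent, test if:

$$\Pr(IBD) > \Pr(NIBD)\ ?$$
$$\vartheta^{2} + (1 - \vartheta)^{2} > 2\vartheta(1 - \vartheta)\ ?$$
$$\Pr(\text{non-recombinant}) > \Pr(\text{recombinant})\ ?$$
$$\vartheta < 0.5\ ?$$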

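Penetrance-based linkage analysis on affected sib pair (assuming a rare, recessive trait w/o “phenocopies”)

[Pedigree: parents with marker genotypes 1/2 and 3/4 and trait locus genotypes D/+ and D/+; affected sibs with marker genotypes 1/3 and 1/4 and trait locus genotypes D/D and D/D.]

$$2 \ln \frac{L(\hat{\varphi})}{L(\varphi = 0.5)} = 2 \ln \frac{(1 - \hat{\varphi})^{n_{NIBD}}\, \hat{\varphi}^{n_{IBD}}}{(1 - 0.5)^{n_{NIBD}}\, 0.5^{n_{IBD}}}$$

and because

$$\varphi = P(\text{rec.})^{2} + P(\text{non-rec.})^{2} = \vartheta^{2} + (1 - \vartheta)^{2}$$

$$2 \ln \frac{L(\hat{\varphi})}{L(\varphi = 0.5)} = 2 \ln \frac{[2\hat{\vartheta}(1 - \hat{\vartheta})]^{n_{NIBD}}\, [\hat{\vartheta}^{2} + (1 - \hat{\vartheta})^{2}]^{n_{IBD}}}{(1 - 0.5)^{n_{NIBD}}\, 0.5^{n_{IBD}}} = 2 \ln \frac{L(\hat{\vartheta})}{L(\vartheta = 0.5)}$$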

Relationship of affected sib-pair linkage analysis and penetrance-model-based linkage analysis

[Figure: ϑ = recombination fraction in pseudo-marker analysis (y-axis, 0 to 0.5) plotted against φ = P(2 affected sibs share the allele from a parent IBD) (x-axis, 0.5 to 1.0).]

For an affected sib-pair of unaffected parents, affected sib-pair linkage analysis and penetrance-model-based linkage analysis assuming a rare recessive trait w/o “phenocopies” are identical.
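The one-to-one correspondence between the IBD sharing probability φ and the recombination fraction ϑ follows from φ = ϑ² + (1 − ϑ)², which can be inverted on the interval 0 ≤ ϑ ≤ 0.5 as ϑ = (1 − √(2φ − 1))/2. A small sketch of both directions, with hypothetical function names:

```python
import math


def phi_from_theta(theta: float) -> float:
    """P(sibs share the allele from a parent IBD) = P(both non-rec.) + P(both rec.)."""
    return theta ** 2 + (1.0 - theta) ** 2


def theta_from_phi(phi: float) -> float:
    """Inverse of phi_from_theta for 0.5 <= phi <= 1 (root with theta <= 0.5)."""
    if not 0.5 <= phi <= 1.0:
        raise ValueError("phi must lie between 0.5 and 1")
    return 0.5 * (1.0 - math.sqrt(2.0 * phi - 1.0))


for phi in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(f"phi = {phi:.1f}  theta = {theta_from_phi(phi):.3f}")
```

Because the mapping is monotone, maximizing the likelihood over φ in [0.5, 1] or over ϑ in [0, 0.5] gives the same maximized likelihood ratio.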

Penetrance-based linkage analysis on affected sib pair

Assuming a rare, recessive trait w/o “phenocopies”, the father is no longer informative.

Penetrance-based linkage analysis is then no longer equivalent to affected sib pair linkage analysis.

1/2 3/4

1/3 1/4

D/D D/+

D/D D/D

“Pseudo-marker” analog of affected sib pair linkage analysis (mean test)

“pseudo-marker”

genotypes

1/2 3/4

1/3 1/4

D/+

D/+

D/D D/D

1/2 3/4

1/3 1/4

D/+

D/+

D/D D/D

Take home message regarding relationship of penetrance-model-based and “model-free” approaches to gene mapping:

• The perceived differences between penetrance-model-based and many popular “model-free” methods are more related to the underlying study design than the statistical methodology.

• A deterministic “pseudo-marker” genotype assignment algorithm can be used to mimic popular “model-free approaches”, allowing joint analysis of different data structures for linkage and/or LD in a framework identical to penetrance-based analysis.

• These “pseudo-marker” statistics are generally better behaved and more powerful than their conventional “model-free” analogs.

Regression-based methods for linkage analysis of quantitative traits

The basic rationale behind this approach (in its various forms) is that pairs of individuals (of a given relationship) with similar phenotypes are expected to be more similar to each other genetically at/near loci influencing the trait of interest than pairs of relatives (of the same relationship) who have dissimilar phenotypes. The degree of phenotypic similarity therefore should be reflected in the proportion of alleles that individuals share IBD at/near trait loci.

Haseman-Elston sib pair linkage test for quantitative traits

[Figure: scatter plot of the squared phenotypic difference between 2 sibs (y-axis) against the proportion of alleles shared IBD (x-axis: 0, 0.5, 1).]

Statistical inference:

Is the regression slope < 0?
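The regression underlying this test can be sketched as an ordinary least-squares fit of squared sib-pair differences on the IBD proportion. The data below are made up purely for illustration, the function name is hypothetical, and the sketch returns only the slope estimate; a formal one-sided test of slope < 0 would additionally need its standard error.

```python
def he_regression_slope(ibd_proportion, squared_difference):
    """Least-squares slope of squared phenotypic difference on IBD proportion."""
    n = len(ibd_proportion)
    mean_x = sum(ibd_proportion) / n
    mean_y = sum(squared_difference) / n
    sxx = sum((x - mean_x) ** 2 for x in ibd_proportion)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(ibd_proportion, squared_difference))
    return sxy / sxx


# Hypothetical sib pairs: IBD proportion at the test locus and squared trait difference
pi_hat = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
sq_diff = [5.0, 3.0, 2.5, 1.5, 1.0, 0.0]
print(he_regression_slope(pi_hat, sq_diff))  # negative slope suggests linkage
```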

Variance components-based linkage analysis

Rationale of variance components-based linkage analysis

The pattern of phenotypic similarity among pedigree members should be reflected by the pattern of IBD sharing among them at chromosomal loci influencing the trait of interest.

Variance components approach: multivariate normal distribution (MVN)

In variance components analysis, the phenotype is generally assumed to follow a multivariate normal distribution:

$$f(x) = \frac{1}{\sqrt{(2\pi)^{n}\, |\Omega|}} \exp\left(-\frac{1}{2} (x - \mu)'\, \Omega^{-1} (x - \mu)\right)$$

$$\ln f(x) = -\frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln|\Omega| - \frac{1}{2} (x - \mu)'\, \Omega^{-1} (x - \mu)$$

where n is the no. of individuals (in a pedigree), Ω is the n × n covariance matrix, x is the phenotype vector and μ is the mean phenotype vector.

Modeling the resemblance among relatives

heritability analysis:

$$\Omega = I\sigma_e^{2} + 2\Phi\sigma_g^{2}$$

linkage analysis:

$$\Omega = I\sigma_e^{2} + 2\Phi\sigma_g^{2} + \hat{\Pi}\sigma_q^{2}$$
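To make the covariance model concrete, the sketch below builds Ω for a single sib pair and evaluates the multivariate normal log-likelihood in closed form for n = 2. It assumes the standard values for a sib pair (2Φ has 1 on the diagonal and 0.5 off the diagonal; Π̂ has 1 on the diagonal and the estimated IBD proportion off the diagonal); the function names and the variance component values used in the example are hypothetical.

```python
import math


def omega_sib_pair(var_e: float, var_g: float, var_q: float, pi_hat: float):
    """Omega = I*var_e + 2*Phi*var_g + Pi_hat*var_q for two sibs."""
    diagonal = var_e + var_g + var_q
    off_diagonal = 0.5 * var_g + pi_hat * var_q
    return [[diagonal, off_diagonal], [off_diagonal, diagonal]]


def mvn_loglik_2d(x, mu, omega) -> float:
    """ln f(x) of a bivariate normal, using the closed-form 2 x 2 inverse."""
    (a, b), (c, d) = omega
    det = a * d - b * c
    r0, r1 = x[0] - mu[0], x[1] - mu[1]
    quad = (d * r0 * r0 - (b + c) * r0 * r1 + a * r1 * r1) / det
    # -(n/2) ln(2 pi) with n = 2, then the determinant and quadratic terms
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


# Hypothetical variance components; sib pair sharing both alleles IBD at the test locus
omega = omega_sib_pair(var_e=0.5, var_g=0.3, var_q=0.2, pi_hat=1.0)
print(omega)
print(mvn_loglik_2d(x=[0.4, 0.1], mu=[0.0, 0.0], omega=omega))
```

In an actual analysis the variance components are estimated by maximizing the sum of such log-likelihoods over all pedigrees, once with and once without the linkage component.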

Matrix of estimated allele sharing among relatives

Pedigree: parents P (1/2) and M (3/3); sibs S1, S2 and S3 (all 1/3)

$\Pi_{\text{expected}} = 2\Phi$:

      P     M     S1    S2    S3
P     1     0     0.5   0.5   0.5
M           1     0.5   0.5   0.5
S1                1     0.5   0.5
S2                      1     0.5
S3                            1

$\hat{\Pi}_{\text{estimated}}$:

      P     M     S1    S2    S3
P     1     0     0.5   0.5   0.5
M           1     0.5   0.5   0.5
S1                1     0.75  0.75
S2                      1     0.75
S3                            1

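Variance components-based lod score

$$\mathrm{lod} = \log_{10} \frac{L(H_1 \mid \text{data})}{L(H_0 \mid \text{data})} = \log_{10} \frac{\max_{\sigma_e^{2}, \sigma_g^{2}, \sigma_q^{2}} L(\sigma_e^{2}, \sigma_g^{2}, \sigma_q^{2} \mid \text{data})}{\max_{\sigma_e^{2}, \sigma_g^{2}} L(\sigma_e^{2}, \sigma_g^{2}, \sigma_q^{2} = 0 \mid \text{data})}$$

Asymptotically, $2\ln(10)\,\mathrm{lod} \sim 0.5\,\chi^2_{(1)}$.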

Sample size requirements to detect linkage to a QTL with a lod score of ≥ 3 and 80% power

[Figure: number of individuals (y-axis, logarithmic scale from 100 to 100,000) plotted against heritability due to QTL (x-axis, 0 to 0.5) for three study designs: pedigree, sibship (2) and sibship (4).]

Pros and cons of variance-components-based linkage analysis

+ no need to specify inheritance model
+ robust to allelic heterogeneity at a locus
+ modeling flexibility
+ computationally feasible even on large pedigrees

- generally assumes additive inheritance model
- modeling restrictions
- not always well-behaved statistically (depending on phenotypic distribution and ascertainment)
- generally less powerful than penetrance-model-based linkage analysis under suitable model

Choice of covariates

Covariates ought to be included in the likelihood model if they are known to influence the phenotype of interest and if their own genetic regulation does not overlap the genetic regulation of the target phenotype.

Typical examples include sex and age.

In the analysis of height, information on nutrition during childhood should probably be included during analysis. However, known growth hormone levels probably should not be.


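[Figure: two diagrams of the total phenotypic variance $\sigma_p^2$, the QTL variance $\sigma_q^2$ and the covariate variance $\sigma_{cov}^2$, one for each of the two cases below.]

First case:

$$h^{2}_{q,\text{without cov}} = \frac{\sigma_q^{2}}{\sigma_p^{2}} \approx 0.15 > 0.05 \approx \frac{\sigma_q^{2} - (\sigma_q^{2} \cap \sigma_{cov}^{2})}{\sigma_p^{2} - \sigma_{cov}^{2}} = h^{2}_{q,\text{with cov}}$$

Second case:

$$h^{2}_{q,\text{without cov}} = \frac{\sigma_q^{2}}{\sigma_p^{2}} \approx 0.15 < 0.2 \approx \frac{\sigma_q^{2} - (\sigma_q^{2} \cap \sigma_{cov}^{2})}{\sigma_p^{2} - \sigma_{cov}^{2}} = h^{2}_{q,\text{with cov}}$$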

Choice of covariates: special case of treatment/medication

[Figure: probability density of the phenotype for unaffected and affected individuals in two panels: before treatment/medication of affected individuals, and after (partially effective) treatment/medication of affected individuals. Annotation: apparent effect of covariate.]

• If medication is ineffective/partially effective, including treatment as a covariate is worse than ignoring it in the analysis.

• If medication is very effective, such that the phenotypic mean of individuals after treatment is equal to the phenotypic mean of the population as a whole, then including medication as a covariate has no effect.

• If medication is extremely effective, such that the phenotypic mean of individuals after treatment is “better” than the phenotypic mean of the population as a whole, then including medication as a covariate is better than ignoring it, but still far from satisfying.

• Either censor individuals or, better, infer or integrate over their phenotypes before treatment, based on information on efficacy etc.

Two-point vs. multi-point linkage analysis

• In linkage analysis, one always examines whether or not the alleles at 2 loci tend to co-segregate during meiosis.

• In “two-point” linkage analysis, chromosomal inheritance is inferred from the observed trait phenotypes on the one hand (locus 1) and from a single (genotyped) marker locus on the other hand (locus 2).

• In “multi-point” linkage analysis, chromosomal inheritance is inferred from the observed trait phenotypes on the one hand (locus 1) and from multiple (genotyped) marker loci on the other hand (locus 2).

Pros and cons of multi-point linkage analysis

+ Genotypes at multiple markers contain at least as much and generally more information to infer chromosomal inheritance than genotypes at a single marker, resulting in greater power to detect linkage.

+ The number of independent tests in genome-wide linkage analysis is somewhat reduced in multi-point linkage analysis vs. two-point linkage analysis.

- Multi-point linkage analysis requires knowledge of the genetic marker map (marker order and inter-marker recombination fractions). If this information is incorrect, power can be reduced and/or the false positive rate can be increased.

- Multi-point linkage analysis is more susceptible to genotyping errors.

- Multi-point linkage analysis typically assumes linkage equilibrium between markers. If this does not hold, power can be reduced and/or the false positive rate can be increased.

- Multi-point linkage analysis is computationally more demanding than two-point linkage analysis.

Genetic map vs. physical map

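[Figure: four markers m1, m2, m3 and m4 with inter-marker recombination fractions ϑ12, ϑ23 and ϑ34. Genetic map: positions x1, x2, x3, x4 in cM. Physical map: positions y1, y2, y3, y4 in Mb.]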

Genetic map distance vs. recombination fraction

Def. of recombination fraction: probability that recombination takes place between 2 chromosomal positions during meiosis

Recombination fractions are not additive, i.e., for 3 loci and recombination fractions ϑ12 and ϑ23, ϑ13 ≠ ϑ12 + ϑ23.

Def. of genetic map distance (Morgan, M): distance in which 1 recombination event is expected to take place or, equivalently, average distance between recombination events. centi-Morgan (cM) is equal to 1/100 Morgan.

Genetic map distances are additive, i.e. for 3 loci and map distances x12 cM and x23 cM, x13 = x12 + x23 cM.

Neither recombination fractions nor genetic map distances are easily converted into physical map distances.
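The difference between additive map distances and non-additive recombination fractions can be illustrated with a map function. The sketch below uses Haldane's map function, ϑ = ½(1 − e^(−2x)) with x in Morgans, which assumes no crossover interference; this choice is an assumption made for illustration and is not discussed in the lecture, and the function names are hypothetical.

```python
import math


def haldane_theta(distance_cm: float) -> float:
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def haldane_cm(theta: float) -> float:
    """Map distance in cM for a recombination fraction theta < 0.5 (Haldane)."""
    return -50.0 * math.log(1.0 - 2.0 * theta)


theta_12 = theta_23 = 0.1
x_12, x_23 = haldane_cm(theta_12), haldane_cm(theta_23)
x_13 = x_12 + x_23              # map distances add
theta_13 = haldane_theta(x_13)  # recombination fractions do not
print(round(x_13, 2), round(theta_13, 3))  # theta_13 is 0.18, not 0.1 + 0.1 = 0.2
```

For small distances the two scales nearly coincide (1 cM corresponds to a recombination fraction of about 0.01), while for large distances the recombination fraction levels off at 0.5.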

Why a genome-wide linkage scan may fail

• The sample size is too small.

• The marker genotypes are not sufficiently informative (low heterozygosity and/or large gaps in marker map).

• There is no major gene.

• The chosen analytical approach is unsuitable.

• Bad luck!

A fairytale of 2 traits

Heritability estimates

trait A trait B

45-82% 63-92%

Quantitative trait A (sample 1)

large, randomly ascertained pedigrees

no. of phenotyped individuals: 268

trait heritability estimate: 0.55

Quantitative trait B (sample 1)

large, randomly ascertained pedigrees

no. of phenotyped individuals: 324

trait heritability estimate: 0.88

Quantitative trait A (sample 1)

Quantitative trait A (samples 1--2)

Quantitative trait A (samples 1--3)

Quantitative trait A (samples 1--3 + combined)

Quantitative trait B (sample 1)

Quantitative trait B (samples 1--2)

quantitative trait A: lipoprotein A (concentration in serum)

quantitative trait B: height (in adults)

Heritability of adult height (additive heritability, adjusted for sex and age)

study      sample size   heritability estimate
TOPS       2199          0.78
FLS        705           0.83
GAIT       324           0.88
SAFHS      903           0.76
SAFDS      737           0.92
SHFS AZ    643           0.80
SHFS DK    675           0.81
SHFS OK    647           0.79
Jiri       616           0.63
total      7449

Polygenic or

oligogenic ?

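Height (9 samples)

Estimated QTL-specific heritability in each sample:

$\hat{h}^2_{q,\mathrm{GAIT}} = 0.29$
$\hat{h}^2_{q,\mathrm{TOPS}} = 0.03$
$\hat{h}^2_{q,\mathrm{FLS}} = 0$
$\hat{h}^2_{q,\mathrm{SAFDS}} = 0.08$
$\hat{h}^2_{q,\mathrm{SAFHS}} = 0$
$\hat{h}^2_{q,\mathrm{SHFS-AZ}} = 0.05$
$\hat{h}^2_{q,\mathrm{SHFS-DK}} = 0.01$
$\hat{h}^2_{q,\mathrm{SHFS-OK}} = 0.01$
$\hat{h}^2_{q,\mathrm{Jiri}} = 0$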
