Introduction to linkage analysis
Course “Study Design and Data Analysis for Genetic Studies”, Universidad del Zulia, Maracaibo, Venezuela, 9-10 April 2005
Harald H.H. Göring
“Marker” loci
There are many different types of polymorphisms, e.g.:
• single nucleotide polymorphism (SNP):
AAACATAGACCGGTT
AAACATAGCCCGGTT
• microsatellite/variable number of tandem repeat (VNTR):
AAACATAGCACACA----CCGGTT
AAACATAGCACACACACCGGTT
• insertion/deletion (indel):
AAACATAGACCACCGGTT
AAACATAG--------CCGGTT
• restriction fragment length polymorphism (RFLP)
…
Tracing chromosomal inheritance using “marker” locus genotypes
1/2 3/4
1/5 4/5
5/5 5/5
1/4 5/5
Tracing chromosomal inheritance (fully informative situation)
Linkage analysis: locus with known genotypes
1/2 3/3 2/4 1/1
1/3
2/3 1/3
1/2
Where do the observed genotypes “fit”?
Linkage analysis
In linkage analysis, one evaluates statistically whether or not the alleles at 2 loci co-segregate during meiosis more often than expected by chance. If the evidence of increased co-segregation is convincing, one generally concludes that the 2 loci are “linked”, i.e. are located on the same chromosome (“syntenic loci”). The degree of co-segregation provides an estimate of the proximity of the 2 loci, with near complete co-segregation for very tightly linked loci.
Let’s step back…
to Mendel
One of Mendel’s pea crosses
P1 → F1: Mendel’s law of uniformity
F1 → F2: Mendel’s law of independent assortment
F2 observed: 315 108 101 32
F2 ~ ratio: 9 : 3 : 3 : 1
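As a quick check of how closely the observed F2 counts follow the 9 : 3 : 3 : 1 ratio, the following minimal sketch computes the expected counts and a Pearson chi-square statistic. The variable names are illustrative only; the goodness-of-fit test itself is not part of the slides.

```python
# Goodness-of-fit of Mendel's observed F2 counts to a 9:3:3:1 ratio.
observed = [315, 108, 101, 32]
ratio = [9, 3, 3, 1]

total = sum(observed)
# Expected count per class under independent assortment.
expected = [total * r / sum(ratio) for r in ratio]
# Pearson chi-square statistic (3 degrees of freedom for 4 classes).
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(expected)
print(round(chi2, 3))
```

With 3 degrees of freedom, the 5% critical value of the chi-square distribution is about 7.81, so a statistic well below that value gives no reason to doubt the 9 : 3 : 3 : 1 ratio.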
P1 1 1 2 2
F1 1 2 1 2
F2 1 1 1 2 2 2
25% 50% 25% (in expectation)
Mendel’s law of uniformity
Mendel’s law of segregation
P1 a a b b
F1 a b a b
F2 a a a b b b
Mendel’s law of uniformity
Mendel’s law of segregation
25% 50% 25% (in expectation)
Mendel’s law of independent assortment
P1: 1 1 a a × 2 2 b b
F1: 1 2 a b × 1 2 a b
F2 genotypes and expected frequencies:
1 1 a a   6.25%
1 1 a b   12.5%
1 1 b b   6.25%
1 2 a a   12.5%
1 2 a b   25%
1 2 b b   12.5%
2 2 a a   6.25%
2 2 a b   12.5%
2 2 b b   6.25%
Assume we did this experiment and observed the following (non-independent assortment):
P1 generation (diploid): 1 1 a a × 2 2 b b
gametes (haploid): 1a, 2b
F1 generation (diploid): 1 2 a b (Mendel’s law of uniformity)
gametes (haploid): 1a, 2b
F2 generation (diploid): 1 1 a a (25%), 1 2 a b (50%), 2 2 b b (25%) (in expectation; Mendel’s law of segregation)
Co-segregation (due to linkage)
Recombination
Recombination between 2 loci is said to have occurred if an individual received, from one parent, alleles (at these 2 loci) that originated in 2 different grandparents.
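Who is a recombinant?
Grandparents: 1/1 a/a and 2/2 b/b
Father: 1 2, a b (haplotypes 1-a and 2-b); mother: 3/3 c/c
Offspring (marker genotype, trait genotype; N = non-recombinant, R = recombinant):
1 3, a c   N
2 3, b c   N
1 3, a c   N
1 3, b c   R
1 3, a c   N
1 3, a c   N
2 3, b c   N
2 3, b c   N
2 3, a c   R
2 3, b c   N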
1/1a/a
2/2b/b
1 2a b
1a
1b
2a
2b
N R R N
Possible explanations for recombination
I. different chromosomes
II. homologous recombination during meiosis
III. genotyping error
Recombination fraction
The recombination fraction between 2 loci is defined as the proportion of meioses resulting in a recombinant gamete. For loci on different chromosomes (or for loci far apart on the same, large chromosome), the recombination fraction is 0.5. Such loci are said to be unlinked. For loci close together on the same chromosome, the recombination fraction is < 0.5. Such loci are said to be linked. The closer the loci, the smaller the recombination fraction (approaching 0 for very tightly linked loci).
1/1a/a
2/2b/b
1 2a b
3/3c/c
1 3a c
2 3b c
1 3a c
1 3b c
1 3a c
1 3a c
2 3b c
2 3b c
2 3a c
2 3b c
N N N R N N N N R N
Estimation of recombination fraction
$$\hat{\vartheta} = \frac{\#R}{\#R + \#N} = \frac{2}{2 + 8} = \frac{2}{10} = 0.2$$
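The counting estimate can be sketched in a few lines. The function name `estimate_theta` and the string encoding of the meioses are assumptions made for illustration; the estimator is simply the proportion of recombinant meioses among all informative meioses.

```python
def estimate_theta(meioses: str) -> float:
    """Proportion of recombinants (R) among scored meioses (N or R)."""
    calls = meioses.split()
    n_rec = calls.count("R")
    n_nonrec = calls.count("N")
    return n_rec / (n_rec + n_nonrec)

# Scoring of the 10 offspring in the example pedigree (phase known).
theta_hat = estimate_theta("N N N R N N N N R N")
print(theta_hat)
```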
1/2a/b
3/3c/c
1 3a c
2 3b c
1 3a c
1 3b c
1 3a c
1 3a c
2 3b c
2 3b c
2 3a c
2 3b c
phase 1 (haplotypes 1-a and 2-b): N N N R N N N N R N
phase 2 (haplotypes 1-b and 2-a): R R R N R R R R N R
Missing phase information: Who is a recombinant?
?/? 3/3c/c
1 3a c
2 3b c
1 3a c
1 3b c
1 3a c
1 3a c
2 3b c
2 3b c
2 3a c
2 3b c
Missing phase and genotype information:
Who is a recombinant??
1/2a/b
?/? ?/?c/c
1 3a c
2 3b c
1 3a c
1 3b c
1 3a c
1 3a c
2 3b c
2 3b c
2 3a c
2 3b c
Missing phase and genotype information:
Who is a recombinant???
a/b
• The likelihood of a hypothesis (e.g. specific parameter value(s)) on a given dataset, L(hypothesis|data), is defined to be proportional to the probability of the data given the hypothesis, P(data|hypothesis):
L(hypothesis|data) = constant * P(data|hypothesis)
• Because of the proportionality constant, a likelihood by itself has no interpretation.
• The likelihood ratio (LR) of 2 hypotheses is meaningful if the 2 hypotheses are nested (i.e., one hypothesis is contained within the other):
$$LR = \frac{L(H_1 \mid \text{data})}{L(H_0 \mid \text{data})} = \frac{c\,P(\text{data} \mid H_1)}{c\,P(\text{data} \mid H_0)} = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}$$
• Under certain conditions, maximum likelihood estimates are asymptotically unbiased and asymptotically efficient. Likelihood theory describes how to interpret a likelihood ratio.
Likelihood
The lod (logarithm of odds) score is defined as the logarithm (to the base 10) of the likelihood ratio of 2 hypotheses on a given dataset:
$$\text{lod} = \log_{10} \frac{L(H_1 \mid \text{data})}{L(H_0 \mid \text{data})}$$
In linkage analysis, typically the different hypotheses refer to different values of the recombination fraction:
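$$Z(\vartheta) = \log_{10} \frac{L(\text{linkage at specific recombination fraction} \mid \text{data})}{L(\text{no linkage} \mid \text{data})} = \log_{10} \frac{L(\vartheta \mid \text{data})}{L(\vartheta = 0.5 \mid \text{data})}$$
$$Z_{max} = \log_{10} \frac{\max_{\vartheta} L(\vartheta \mid \text{data})}{L(0.5 \mid \text{data})}$$
Asymptotically, $2\ln(10)\,Z_{max} \sim 0.5\,\chi^2_{(1)}$, i.e. a 50:50 mixture of a point mass at 0 and a chi-square distribution with 1 degree of freedom.
Evaluating the evidence of linkage: lod score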
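With 2 recombinants and 8 non-recombinants among 10 phase-known meioses:
$$Z(\vartheta) = \log_{10} \frac{L(\vartheta \mid \text{data})}{L(\vartheta = 0.5 \mid \text{data})} = \log_{10} \frac{c\,P(\text{data} \mid \vartheta)}{c\,P(\text{data} \mid \vartheta = 0.5)} = \log_{10} \frac{\vartheta^{2}(1-\vartheta)^{8}}{0.5^{2}(1-0.5)^{8}}$$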
Example lod score calculation
ϑ      Z(ϑ)
0      −∞
0.1    0.644
0.2    0.837
0.3    0.725
0.4    0.439
0.5    0
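The lod score curve for the phase-known pedigree can be evaluated directly from the formula above. This is a minimal sketch; the function name `lod_known_phase` and its default counts (2 recombinants, 8 non-recombinants) are chosen here for illustration.

```python
from math import inf, log10

def lod_known_phase(theta, n_rec=2, n_nonrec=8):
    """Z(theta) = log10[ theta^R (1-theta)^N / 0.5^(R+N) ] for phase-known meioses."""
    if theta == 0 and n_rec > 0:
        return -inf  # a single recombinant rules out theta = 0
    n = n_rec + n_nonrec
    return log10(theta ** n_rec * (1 - theta) ** n_nonrec / 0.5 ** n)

for theta in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(theta, round(lod_known_phase(theta), 3))
```

The maximum is at ϑ = 0.2, the same value obtained by counting recombinants. Calling `lod_known_phase(0, n_rec=0, n_nonrec=10)` gives the lod score for 10 non-recombinant meioses, which is log10(1024), about 3.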
1/2a/b
3/3c/c
1 3a c
2 3b c
1 3a c
1 3b c
1 3a c
1 3a c
2 3b c
2 3b c
2 3a c
2 3b c
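phase 1 (haplotypes 1-a and 2-b): N N N R N N N N R N
phase 2 (haplotypes 1-b and 2-a): R R R N R R R R N R
Missing phase information: Who is a recombinant?
$$P(\text{data} \mid \vartheta) = P(\text{phase 1})\,P(\text{data} \mid \text{phase 1}, \vartheta) + P(\text{phase 2})\,P(\text{data} \mid \text{phase 2}, \vartheta)$$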
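$$Z(\vartheta) = \log_{10} \frac{c\,P(\text{data} \mid \vartheta)}{c\,P(\text{data} \mid \vartheta = 0.5)} = \log_{10} \frac{\frac{1}{2}\vartheta^{2}(1-\vartheta)^{8} + \frac{1}{2}\vartheta^{8}(1-\vartheta)^{2}}{\frac{1}{2}\,0.5^{2}(1-0.5)^{8} + \frac{1}{2}\,0.5^{8}(1-0.5)^{2}}$$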
Example lod score calculation (missing phase information)
ϑ      Z(ϑ)
0      −∞
0.1    0.343
0.2    0.536
0.3    0.427
0.4    0.175
0.5    0
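When phase is unknown, the likelihood averages over the two equally likely phases. The sketch below evaluates that mixture; the function name `lod_unknown_phase` and the parameters `k` (recombinants under phase 1) and `n` (meioses) are illustrative assumptions.

```python
from math import inf, log10

def lod_unknown_phase(theta, k=2, n=10):
    """Lod score when the two parental phases are equally likely a priori."""
    def likelihood(t):
        phase1 = t ** k * (1 - t) ** (n - k)   # k recombinants under phase 1
        phase2 = t ** (n - k) * (1 - t) ** k   # n - k recombinants under phase 2
        return 0.5 * phase1 + 0.5 * phase2
    numerator = likelihood(theta)
    if numerator == 0:
        return -inf
    return log10(numerator / likelihood(0.5))

for theta in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(theta, round(lod_unknown_phase(theta), 3))
```

Compared with the phase-known case, every finite lod score is lower by roughly log10(2) ≈ 0.3, which is the price of not knowing the phase.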
?/? ?/?c/c
1 3a c
2 3b c
1 3a c
1 3b c
1 3a c
1 3a c
2 3b c
2 3b c
2 3a c
2 3b c
Missing phase and genotype information:
Who is a recombinant???
a/b
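$$P(\text{data} \mid \vartheta, \text{allele frequencies}) = \sum_{\substack{\text{parental genotypes} \\ \text{and phases}}} P(\text{parental genotypes and phases}) \times P(\text{data} \mid \text{paternal genotype}, \text{phase}, \vartheta)$$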
Example lod score calculation (missing phase and genotype information)
Assuming 3 equally frequent alleles, i.e. P(1) = P(2) = P(3) = 0.333:
ϑ      Z(ϑ)
0      -0.304
0.1    0.204
0.2    0.346
0.3    0.264
0.4    0.096
0.5    0
Assuming P(1) = 0.495, P(2) = 0.495, P(3) = 0.010:
ϑ      Z(ϑ)
0      -0.378
0.1    0.183
0.2    0.332
0.3    0.253
0.4    0.091
0.5    0
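The sum over parental genotypes and phases can be carried out by brute force for a pedigree this small. The following is a sketch under stated assumptions: both parents are untyped at the marker, the marker has only the 3 alleles named on the slide, parental genotypes follow Hardy-Weinberg proportions, the father is a/b and the mother c/c at the trait locus, and the offspring are those of the example pedigree. All names (`CHILDREN`, `likelihood`, `lod`) are hypothetical.

```python
from itertools import product
from math import log10

# Offspring: (marker genotype, trait allele received from the a/b father).
# The c/c mother always transmits trait allele c.
CHILDREN = [((1, 3), "a"), ((2, 3), "b"), ((1, 3), "a"), ((1, 3), "b"),
            ((1, 3), "a"), ((1, 3), "a"), ((2, 3), "b"), ((2, 3), "b"),
            ((2, 3), "a"), ((2, 3), "b")]

def likelihood(theta, freqs):
    """P(data | theta, allele frequencies), up to a constant factor."""
    total = 0.0
    # fa / fb: father's marker allele on his a- / b-carrying chromosome
    # (ordered, so both phases are enumerated); m1, m2: mother's alleles.
    for fa, fb, m1, m2 in product(list(freqs), repeat=4):
        prob = freqs[fa] * freqs[fb] * freqs[m1] * freqs[m2]
        for genotype, trait in CHILDREN:
            same, other = (fa, fb) if trait == "a" else (fb, fa)
            p_child = 0.0
            # Either allele of the child may be the paternal one.
            for pat, mat in {genotype, genotype[::-1]}:
                p_pat = (1 - theta) * (pat == same) + theta * (pat == other)
                p_mat = 0.5 * (mat == m1) + 0.5 * (mat == m2)
                p_child += p_pat * p_mat
            prob *= p_child
        total += prob
    return total

def lod(theta, freqs):
    return log10(likelihood(theta, freqs) / likelihood(0.5, freqs))

equal = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
rare3 = {1: 0.495, 2: 0.495, 3: 0.010}
for theta in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(theta, round(lod(theta, equal), 3), round(lod(theta, rare3), 3))
```

The lod score at ϑ = 0 is finite here because one parental configuration (father 3/3, mother 1/2) makes the father uninformative, so the data do not strictly exclude ϑ = 0.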
Lod score curves (for previous example pedigree)
[Figure: lod score (y-axis, -1 to 1) plotted against the recombination fraction ϑ (x-axis, 0 to 0.5) for 3 situations: known phase, known genotypes; unknown phase, known genotypes; unknown phase, unknown genotypes.]
Interpretation of lod score
• The traditional threshold for declaring evidence of linkage statistically significant is a lod score of 3, or a likelihood ratio of 1000:1, meaning the likelihood of linkage on the data is 1000 times higher than the likelihood of no linkage on the data.
• Asymptotically, a lod score of 3 has a point-wise significance level (p-value) of 0.0001. In other words, the probability of obtaining a lod score of at least this magnitude by chance is 0.0001.
• Because many linkage tests are conducted as part of a genome-wide linkage scan, a lod score of 3 corresponds to a genome-wide significance level of ~0.05.
The p-value is defined as the probability of obtaining an outcome at least as extreme as observed by chance (i.e. when the null hypothesis is true).
Example: Testing whether a coin is fair
H0: P(head) = 0.5
H1: P(head) ≠ 0.5 (2-sided alternative hypothesis).
You observe 1 head out of 10 coin tosses. The p-value then is the probability of observing exactly 1 head in 10 trials (observed outcome), or 0 head in 10 trials (more extreme outcome), or 9 (equally extreme outcome) or 10 (more extreme outcome) heads in 10 trials.
$$p = \sum_{i \in \{0, 1, 9, 10\}} \binom{10}{i}\, 0.5^{i} (1 - 0.5)^{10-i} = \frac{1 + 10 + 10 + 1}{1024} = \frac{22}{1024} \approx 0.021$$
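The same arithmetic can be written as a short script. This is a minimal sketch; `binom_pmf` is a helper defined here, not something from the slides.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# 2-sided coin example: outcomes at least as extreme as 1 head in 10 tosses.
p_two_sided = sum(binom_pmf(k, 10) for k in (0, 1, 9, 10))

# 1-sided analogue: exactly 0 "successes" in 10 trials.
p_one_sided = binom_pmf(0, 10)

print(p_two_sided, p_one_sided)
```

The 1-sided value, 1/1024, is the p-value that applies when 0 recombinants are observed in 10 informative meioses.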
P-value
The p-value is defined as the probability of obtaining an outcome at least as extreme as observed by chance (i.e. when the null hypothesis is true).
Example: Testing whether 2 loci are linked
H0: P(recombination) = 0.5
H1: P(recombination) ≤ 0.5 (1-sided alternative hypothesis).
You observe 0 recombinant and 10 non-recombinant in 10 informative meioses. The p-value then is the probability of observing exactly 0 recombinants in 10 trials (observed outcome; there is no more extreme outcome).
$$p = \binom{10}{0}\, 0.5^{0} (1 - 0.5)^{10-0} = \frac{1}{1024} \approx 0.001$$
P-value
Lod score
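$$Z_{max} = \log_{10} \frac{\max_{\vartheta} L(\vartheta \mid \text{data})}{L(0.5 \mid \text{data})} = \log_{10} \frac{L(\vartheta = 0 \mid \text{data})}{L(0.5 \mid \text{data})} = \log_{10} \frac{1}{1/1024} = \log_{10} 1024 \approx 3$$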
In the ideal case, 10 fully informative meioses may suffice to obtain significant evidence of linkage.
Lod score and significance level
lod score (point-wise) p-value
0.588 0.05
1.175 0.01
2.000 ~0.001
3.000 0.0001
4.000 ~0.00001
5.000 ~0.000001
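The table follows from the asymptotic result that 2 ln(10) times the maximum lod score behaves like a 50:50 mixture of 0 and a chi-square variable with 1 degree of freedom. A minimal sketch of the conversion, with the hypothetical function name `pointwise_p_value`:

```python
from math import erfc, log, sqrt

def pointwise_p_value(lod):
    """Asymptotic point-wise p-value for a maximum lod score."""
    if lod <= 0:
        return 1.0
    # P(chi2_1 >= x) = erfc(sqrt(x / 2)); with x = 2 ln(10) lod this is
    # erfc(sqrt(ln(10) * lod)). The factor 0.5 is the mixture weight.
    return 0.5 * erfc(sqrt(log(10) * lod))

for lod in (0.588, 1.175, 2.0, 3.0, 4.0, 5.0):
    print(lod, pointwise_p_value(lod))
```

This relies on the asymptotic approximation only; for small samples, exact or simulation-based p-values may differ.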
Linkage analysis reduces multiple testing problem
• Linkage analysis is so useful because it greatly reduces the multiple testing problem: ~3,000,000,000 bp of DNA are interrogated in ~500 independent linkage tests for human data. This is possible because a meiotic recombination event occurs on average only once every 100,000,000 bp.
• No specification of prior hypotheses is therefore necessary, as all possible hypotheses can be screened.
Linkage analysis: trait locus with unknown genotypes
?/? ?/? ?/? ?/?
?/?
?/? ?/?
?/?
Where do the observed genotypes “fit”?
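Statistical gene mapping with trait phenotypes
[Diagram: observed marker genotypes and unobserved trait locus genotypes are connected by genetic distance (linkage, allelic association); unobserved trait locus genotypes and observed trait phenotypes are connected by etiology (?); the correlation to be detected is between observed marker genotypes and observed trait phenotypes.]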
Many different types of linkage methods
• penetrance model-based linkage analysis (“classical” linkage analysis)
• penetrance model-free linkage analysis (“model-free” or “non-parametric” linkage analysis)
– affected sib-pair linkage analysis
– affected relative-pair linkage analysis
– regression-based linkage analysis
– variance components-based linkage analysis
– …
Variation with each linkage method
• 2-point analysis vs. multiple 2-point analysis vs. multi-point analysis
• exact calculation vs. approximation (e.g., MCMC)
• qualitative trait vs. quantitative traits
• rare “simple mendelian” diseases vs. common “complex multifactorial diseases”
• …
Penetrance-model-based linkage analysis
Segregation analysis
In segregation analysis, one attempts to characterize the mode of inheritance of a trait, by statistically examining the segregation pattern of the trait through a sample of related individuals.
In a way, heritability analysis is a form of segregation analysis. In heritability analysis, the focus is not on characterizing the segregation pattern per se, but on quantifying inheritance assuming a given mode of inheritance (generally additivity/co-dominance).
Relationship between genotypes and phenotypes (penetrances) at the ABO blood group locus
penetrance: P(phenotype given genotype)
            Phenotype (blood group)
Genotype    A    B    AB   O
A/A         1    0    0    0
A/B         0    0    1    0
A/O         1    0    0    0
B/B         0    1    0    0
B/O         0    1    0    0
O/O         0    0    0    1
Probability model correlating trait phenotypes and trait locus genotypes: penetrances
penetrance: P(phenotype given genotype)
Ex.: fully-penetrant dominant disease without “phenocopies”
                Phenotype
Genotype        unaffected   affected
+/+             1            0
D/+ or +/D      0            1
D/D             0            1
Statistical gene mapping with trait phenotypes: “simple” dominant inheritance model
[Diagram: observed marker genotypes, unobserved trait locus genotypes and observed trait phenotypes, connected by genetic distance (linkage, allelic association) and by the correlation to be detected. Legend: affected = D/+, not affected = +/+.]
Linkage analysis: trait locus (genotypes based on assumed dominant inheritance model)
+/+
D/+
+/+
+/+
D/+
D/+
D/+
+/+
Where do the observed genotypes “fit”?
Example of multipoint lod score curve: Pseudoxanthoma elasticum
[Figure: multipoint lod scores (y-axis, 0 to 9) against map position in cM (x-axis, 0 to 30), with curves labelled AR, AD and NPL.]
From: Le Saux et al (1999) Pseudoxanthoma elasticum maps to an 820 kb region of the p13.1 region of chromosome 16. Genomics 62:1-10
Genetic heterogeneity
[Figure: 4 panels, each drawn along a time axis:]
• locus homogeneity, allelic homogeneity
• locus homogeneity, allelic heterogeneity
• locus heterogeneity, allelic homogeneity (at each locus)
• locus heterogeneity, allelic heterogeneity (at each locus)
Pros and cons of penetrance-model-based linkage analysis
+ potentially very powerful (under suitable penetrance model)
+ statistically well-behaved
- requires specification of penetrance model; not powerful at all under unsuitable penetrance model
dominant inheritance: P(aff.|DD or D+) = 1, P(aff.|++) = 0
recessive inheritance: P(aff.|DD) = 1, P(aff.|++ or D+) = 0
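Effects of model misspecification
marker genotypes: parents 1/2 and 3/4; offspring 1/3, 1/4, 2/3
trait locus genotypes under the dominant model: parents +/+ and D/+; offspring D/+, +/+, D/+
trait locus genotypes under the recessive model: parents D/+ and D/D; offspring D/D, D/+, D/D
Under either model, the parent who is heterozygous at the trait locus (D/+) is informative for linkage and the homozygous parent is uninformative, so the 2 models disagree about which parent is informative.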
Pros and cons of penetrance-model-based linkage analysis
+ potentially very powerful (under suitable penetrance model)
+ statistically well-behaved
- requires specification of penetrance model; not powerful at all under unsuitable penetrance model
- modeling flexibility limited
- computationally intensive
“Mendelian” vs. “complex” traits
“simple mendelian” disease
•genotypes of a single locus cause disease
•often little genetic (locus) heterogeneity (sometimes even little allelic heterogeneity); little interaction between genotypes at different genes
•often hardly any environmental effects
•often low prevalence
•often early onset
•often clear mode of inheritance
•“good” pedigrees for gene mapping can often be found
•often straightforward to map
“complex multifactorial” disease
•genotypes of a single locus merely increase risk of disease
•genotypes of many different genes (and various environmental factors) jointly and often interactively determine the disease status
•important environmental factors
•often high prevalence
•often late onset
•no clear mode of inheritance
•not easy to find “good” pedigrees for gene mapping
•difficult to map
A quantitative trait is not necessarily complex
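[Diagram: observed marker genotypes and unobserved trait locus genotypes are connected by genetic distance (linkage, allelic association); unobserved trait locus genotypes and observed trait phenotypes are connected by etiology given ascertainment; the correlation to be detected is between observed marker genotypes and observed trait phenotypes.]
$$P(G_{\text{trait locus}} \mid Ph_{\text{trait}}) \to 1$$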
Fundamental problem in complex trait gene mapping
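[Diagram: observed marker genotypes and unobserved trait locus genotypes are connected by genetic distance (linkage, allelic association); unobserved trait locus genotypes and observed trait phenotypes are connected by etiology given ascertainment; the correlation to be detected is between observed marker genotypes and observed trait phenotypes.]
$$P(G_{\text{trait locus}} \mid Ph_{\text{trait}}) \to P(G_{\text{trait locus}}) \approx \text{small}$$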
Etiological complexity
[Diagram: the trait phenotype is influenced by gene 1, gene 2, gene 3 and other gene(s), each shown with genotype 1, genotype 2 and other genotypes, and by environmental factor 1, environmental factor 2, environmental factor 3 and other environmental factor(s).]
How to improve power to detect correlations between trait phenotypes and trait locus genotypes?
[Diagram: unobserved trait locus genotypes and observed trait phenotypes, connected by etiology.]
$$P(G_{\text{trait locus}} \mid Ph_{\text{trait}}) \to 0 \text{ or } 1$$
How to simplify the etiological architecture?
• choose tractable trait
– Are there sub-phenotypes within trait?
   • age of onset
   • severity
   • combination of symptoms (syndrome)
– “endophenotype” or “biomarker” vs. disease
   • quantitative vs. qualitative (discrete)
   • Dichotomizing quantitative phenotypes leads to loss of information.
   • simple/cheap measurement vs. uncertain/expensive diagnosis
   • not as clinically relevant, but with simpler etiology
• given trait, choose appropriate study design/ascertainment protocol
– study population
   • genetic heterogeneity
   • environmental heterogeneity
– “random” ascertainment vs. ascertainment based on phenotype of interest
   • single or multiple probands
   • concordant or discordant probands
   • pedigrees with apparent “mendelian” inheritance?
   • inbred pedigrees?
– data structures
   • singletons, small pedigrees, large pedigrees
– account for/stratify by known genetic and environmental risk factors
Affected sib-pair linkage analysis
Identity-by-state (IBS) vs. identity-by-descent (IBD)
1 2 3 4
1 3 1 4
1 2 1 3
1 2 1 3
1 1 2 3
1 3 1 2
1 2 1 2
1 2 1 2
IBD (also IBS)
IBS (not IBD)
? (both or neither IBD)
If IBD then necessarily IBS (assuming absence of mutation event).
If IBS then not necessarily IBD (unless a locus is 100% informative, i.e. has an infinite number of alleles, each with infinitesimally small allele frequency).
Probabilistic inference of IBD
1 2 3 4
1 3 1 4
1 2 1 3
1 2 1 3
1 1 2 3
1 3 1 2
1 2 1 2
1 2 1 2
1 0 0.5 1
1 2 1.5 1
0.5 0 0.25 0.5
NIBD
IBD
Rationale of affected sib-pair linkage analysis
A pair of sibs affected with the same disorder is expected to share the alleles at the trait locus/loci, and also alleles at linked loci, more often (> 50 %) than a random pair of sibs (50 %).
Basic concept of affected sib pair linkage analysis
IBD? IBD?
IBD NIBD
1/2 3/4
1/3 1/4
IBD? IBD?
IBD NIBD
1/2 3/4
1/3 1/4
Affected sib pair linkage analysis (mean test)
IBD? IBD?
IBD NIBD
1/2 3/4
1/3 1/4
Conditional on the fact that both sibs are affected, test if:
$$n_{IBD} > n_{NIBD}\ ? \qquad \Pr(IBD) > \Pr(NIBD)\ ? \qquad \Pr(IBD) > 0.5\ ?$$
                           NIBD      IBD
probability                1 − φ     φ
counts in example ped.     1         1
total counts in dataset    n_NIBD    n_IBD
Affected sib pair linkage analysis (mean test)
IBD? IBD?
IBD NIBD
1/2 3/4
1/3 1/4
$$2\ln\frac{L(\hat{\varphi})}{L(\varphi = 0.5)} = 2\ln\frac{(1-\hat{\varphi})^{n_{NIBD}}\,\hat{\varphi}^{\,n_{IBD}}}{(1-0.5)^{n_{NIBD}}\,0.5^{\,n_{IBD}}}$$
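The mean test statistic can be computed from the two counts alone. The sketch below assumes the one-sided version of the test (only excess sharing counts as evidence, so the estimate of φ is not allowed below 0.5); the function name and the counts 60 and 40 are hypothetical illustration values, not data from the course.

```python
from math import log

def mean_test_statistic(n_ibd, n_nibd):
    """Return (phi_hat, 2 ln LR) for IBD / NIBD counts from affected sib pairs."""
    n = n_ibd + n_nibd
    phi_hat = max(n_ibd / n, 0.5)  # one-sided: excess sharing only

    def log_lik(phi):
        value = 0.0
        if n_ibd:
            value += n_ibd * log(phi)
        if n_nibd:
            value += n_nibd * log(1 - phi)
        return value

    return phi_hat, 2 * (log_lik(phi_hat) - log_lik(0.5))

print(mean_test_statistic(1, 1))     # the example pedigree alone
print(mean_test_statistic(60, 40))   # hypothetical dataset totals
```

Dividing the statistic by 2 ln(10) expresses it on the lod score scale.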
Penetrance-model based linkage analysis on affected sib pair
1/2 3/4
1/3 1/4
Trait locus genotypes are inferred probabilistically conditional on observed phenotypes according to an assumed inheritance model (number of alleles, allele frequencies and genotypic penetrances).
?/? ?/?
?/? ?/?
Penetrance-model-based linkage analysis on affected sib pair
assuming a rare recessive trait w/o “phenocopies”
1/2 3/4
1/3 1/4
D/+
D/+
D/D D/D
Conditional on the fact that both affected sibs inherited the D allele from each parent, test if:
$$\Pr(IBD) > \Pr(NIBD)\ ?$$
$$\vartheta^{2} + (1-\vartheta)^{2} > 2\vartheta(1-\vartheta)\ ?$$
$$\Pr(\text{non-recombinant}) > \Pr(\text{recombinant})\ ?$$
$$\vartheta < 0.5\ ?$$
Penetrance-based linkage analysis on affected sib pair
1/2 3/4
1/3 1/4
D/+
D/+
D/D D/D
(assuming a rare, recessive trait w/o
“phenocopies”)
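$$2\ln\frac{L(\hat{\varphi})}{L(\varphi = 0.5)} = 2\ln\frac{(1-\hat{\varphi})^{n_{NIBD}}\,\hat{\varphi}^{\,n_{IBD}}}{(1-0.5)^{n_{NIBD}}\,0.5^{\,n_{IBD}}}$$
and because
$$\varphi = P(\text{rec.})^{2} + P(\text{non-rec.})^{2} = \vartheta^{2} + (1-\vartheta)^{2}$$
$$= 2\ln\frac{[2\hat{\vartheta}(1-\hat{\vartheta})]^{n_{NIBD}}\,[\hat{\vartheta}^{2} + (1-\hat{\vartheta})^{2}]^{n_{IBD}}}{(1-0.5)^{n_{NIBD}}\,0.5^{\,n_{IBD}}} = 2\ln\frac{L(\hat{\vartheta})}{L(\vartheta = 0.5)}$$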
Relationship of affected sib-pair linkage analysis and penetrance-model-based linkage analysis
[Figure: x-axis: φ = P(2 affected sibs share the allele from a parent IBD), from 0.5 to 1.0; y-axis: ϑ = recombination fraction in pseudo-marker analysis, from 0 to 0.5.]
For an affected sib-pair of unaffected parents, affected sib-pair linkage analysis and penetrance-model-based linkage analysis assuming a rare recessive trait w/o “phenocopies” are identical.
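The correspondence between the sharing probability φ and the recombination fraction ϑ can be made explicit by inverting φ = ϑ² + (1 − ϑ)² on the interval 0 ≤ ϑ ≤ 0.5. This is a small sketch with function names chosen here for illustration.

```python
from math import sqrt

def phi_from_theta(theta):
    """P(sibs share the parental allele IBD): both non-recombinant or both recombinant."""
    return theta ** 2 + (1 - theta) ** 2

def theta_from_phi(phi):
    """Inverse on 0 <= theta <= 0.5; requires 0.5 <= phi <= 1."""
    return 0.5 * (1 - sqrt(2 * phi - 1))

for theta in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    phi = phi_from_theta(theta)
    print(theta, round(phi, 3), round(theta_from_phi(phi), 3))
```

Because the mapping is one-to-one on this range, maximizing the likelihood over φ and over ϑ gives the same test statistic.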
Penetrance-based linkage analysis on affected sib pair
Assuming a rare, recessive trait w/o “phenocopies”, the father is no longer informative.
Penetrance-based linkage analysis is then no longer equivalent to affected sib pair linkage analysis.
1/2 3/4
1/3 1/4
D/D D/+
D/D D/D
“Pseudo-marker” analog of affected sib pair linkage analysis (mean test)
“pseudo-marker”
genotypes
1/2 3/4
1/3 1/4
D/+
D/+
D/D D/D
1/2 3/4
1/3 1/4
D/+
D/+
D/D D/D
Take home message regarding relationship of penetrance-model-based and “model-free” approaches to gene mapping:
• The perceived differences between penetrance-model-based and many popular “model-free” methods are more related to the underlying study design than the statistical methodology.
• A deterministic “pseudo-marker” genotype assignment algorithm can be used to mimic popular “model-free” approaches, allowing joint analysis of different data structures for linkage and/or LD in a framework identical to penetrance-based analysis.
• These “pseudo-marker” statistics are generally better behaved and more powerful than their conventional “model-free” analogs.
Regression-based methods for linkage analysis of quantitative traits
The basic rationale behind this approach (in its various forms) is that pairs of individuals (of a given relationship) with similar phenotypes are expected to be more similar to each other genetically at/near loci influencing the trait of interest than pairs of relatives (of the same relationship) who have dissimilar phenotypes. The degree of phenotypic similarity therefore should be reflected in the proportion of alleles that individuals share IBD at/near trait loci.
Haseman-Elston sib pair linkage test for quantitative traits
[Figure: scatter plot of the squared phenotypic difference between 2 sibs (y-axis) against the proportion of alleles shared IBD (x-axis: 0, 0.5, 1).]
Statistical inference: Is the regression slope < 0?
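The regression at the heart of this test is an ordinary least-squares fit of squared sib-pair differences on IBD sharing. The sketch below uses made-up toy numbers purely to show the mechanics; the data, variable names and helper `ols_slope` are all hypothetical, and a real analysis would also need a standard error to test whether the slope is significantly negative.

```python
def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical toy data: one entry per sib pair.
ibd_sharing = [0, 0, 0.5, 0.5, 0.5, 0.5, 1, 1]
squared_diff = [4.1, 3.5, 2.2, 2.9, 1.8, 2.5, 0.9, 1.2]

slope = ols_slope(ibd_sharing, squared_diff)
print(slope)  # a negative slope points towards linkage
```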
Variance components-based linkage analysis
Rationale of variance components-based linkage analysis
The pattern of phenotypic similarity among pedigree members should be reflected by the pattern of IBD sharing among them at chromosomal loci influencing the trait of interest.
Variance components approach: multivariate normal distribution (MVN)
In variance components analysis, the phenotype is generally assumed to follow a multivariate normal distribution:
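$$f(x) = \frac{1}{\sqrt{(2\pi)^{n}\,|\Omega|}} \exp\left(-\frac{1}{2}(x-\mu)'\,\Omega^{-1}(x-\mu)\right)$$
$$\ln f(x) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln|\Omega| - \frac{1}{2}(x-\mu)'\,\Omega^{-1}(x-\mu)$$
n: no. of individuals (in a pedigree)
Ω: n × n covariance matrix
x: phenotype vector
μ: mean phenotype vector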
Modeling the resemblance among relatives
heritability analysis:
$$\Omega = I\sigma_e^{2} + 2\Phi\sigma_g^{2}$$
linkage analysis:
$$\Omega = I\sigma_e^{2} + 2\Phi\sigma_g^{2} + \hat{\Pi}\sigma_q^{2}$$
Matrix of estimated allele sharing among relatives
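Pedigree: parents P (1/2) and M (3/3); sibs S1, S2, S3 (each 1/3)
Expected allele sharing, $\Pi_{expected} = 2\Phi$:
      P     M     S1    S2    S3
P     1     0     0.5   0.5   0.5
M           1     0.5   0.5   0.5
S1                1     0.5   0.5
S2                      1     0.5
S3                            1
Estimated allele sharing, $\hat{\Pi}_{estimated}$:
      P     M     S1    S2    S3
P     1     0     0.5   0.5   0.5
M           1     0.5   0.5   0.5
S1                1     0.75  0.75
S2                      1     0.75
S3                            1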
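To make the variance components model concrete, the following sketch builds Ω from the 2 matrices above and evaluates the multivariate normal log-likelihood for one pedigree. It is written in plain Python with a hand-rolled Cholesky factorization so that it stays self-contained; the function names and the variance values used in the example call are assumptions for illustration, and a real analysis would maximize this log-likelihood over the variance components and sum it over pedigrees.

```python
from math import log, pi, sqrt

TWO_PHI = [[1, 0, .5, .5, .5], [0, 1, .5, .5, .5], [.5, .5, 1, .5, .5],
           [.5, .5, .5, 1, .5], [.5, .5, .5, .5, 1]]
PI_HAT = [[1, 0, .5, .5, .5], [0, 1, .5, .5, .5], [.5, .5, 1, .75, .75],
          [.5, .5, .75, 1, .75], [.5, .5, .75, .75, 1]]

def build_omega(var_e, var_g, var_q, two_phi=TWO_PHI, pi_hat=PI_HAT):
    """Omega = I var_e + 2Phi var_g + Pi_hat var_q."""
    n = len(two_phi)
    return [[(var_e if i == j else 0.0) + var_g * two_phi[i][j]
             + var_q * pi_hat[i][j] for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular L with L L' = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = sqrt(s) if i == j else s / low[j][j]
    return low

def mvn_loglik(x, mu, omega):
    """ln f(x) for a multivariate normal with mean mu and covariance omega."""
    n = len(x)
    low = cholesky(omega)
    z = []  # solves L z = x - mu, so z'z = (x-mu)' Omega^-1 (x-mu)
    for i in range(n):
        z.append((x[i] - mu[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    log_det = 2 * sum(log(low[i][i]) for i in range(n))
    return -0.5 * (n * log(2 * pi) + log_det + sum(v * v for v in z))

# Hypothetical variance components and phenotypes for P, M, S1, S2, S3.
omega = build_omega(var_e=0.4, var_g=0.4, var_q=0.2)
print(mvn_loglik([0.3, -0.1, 0.8, 0.6, 0.9], [0.0] * 5, omega))
```

Setting `var_q=0` gives the heritability (null) model, so the difference between the two maximized log-likelihoods, divided by ln(10), is the variance components lod score.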
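Variance components-based lod score
$$\text{lod} = \log_{10}\frac{L(H_1 \mid \text{data})}{L(H_0 \mid \text{data})} = \log_{10}\frac{\max_{\sigma_e^{2},\,\sigma_g^{2},\,\sigma_q^{2}} L(\sigma_e^{2}, \sigma_g^{2}, \sigma_q^{2} \mid \text{data})}{\max_{\sigma_e^{2},\,\sigma_g^{2}} L(\sigma_e^{2}, \sigma_g^{2}, \sigma_q^{2} = 0 \mid \text{data})}$$
Asymptotically, $2\ln(10)\,\text{lod} \sim 0.5\,\chi^2_{(1)}$.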
Sample size requirements to detect linkage to a QTL with a lod score of ≥ 3 and 80% power
[Figure: number of individuals (y-axis, logarithmic scale from 100 to 100,000) against heritability due to QTL (x-axis, 0 to 0.5), with curves for Pedigree, Sibship (2) and Sibship (4).]
Pros and cons of variance-components-based linkage analysis
+ no need to specify inheritance model
+ robust to allelic heterogeneity at a locus
+ modeling flexibility
+ computationally feasible even on large pedigrees
- generally assumes additive inheritance model
- modeling restrictions
- not always well-behaved statistically (depending on phenotypic distribution and ascertainment)
- generally less powerful than penetrance-model-based linkage analysis under suitable model
Choice of covariates
Covariates ought to be included in the likelihood model if they are known to influence the phenotype of interest and if their own genetic regulation does not overlap the genetic regulation of the target phenotype.
Typical examples include sex and age.
In the analysis of height, information on nutrition during childhood should probably be included during analysis. However, known growth hormone levels probably should not be.
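Choice of covariates
[Diagram: variance components $\sigma_p^2$, $\sigma_q^2$ and $\sigma_{cov}^2$.]
$$h^2_{q,\text{without cov}} = \frac{\sigma_q^2}{\sigma_p^2} \approx 0.15 > 0.05 \approx \frac{\sigma_q^2 - (\sigma_q^2 \cap \sigma_{cov}^2)}{\sigma_p^2 - \sigma_{cov}^2} = h^2_{q,\text{with cov}}$$
Choice of covariates
[Diagram: variance components $\sigma_p^2$, $\sigma_q^2$ and $\sigma_{cov}^2$.]
$$h^2_{q,\text{without cov}} = \frac{\sigma_q^2}{\sigma_p^2} \approx 0.15 < 0.2 \approx \frac{\sigma_q^2 - (\sigma_q^2 \cap \sigma_{cov}^2)}{\sigma_p^2 - \sigma_{cov}^2} = h^2_{q,\text{with cov}}$$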
Choice of covariates: special case of treatment/medication
[Figure: 2 panels showing the probability density of the phenotype for unaffected and affected individuals, (1) before treatment/medication of affected individuals and (2) after (partially effective) treatment/medication of affected individuals, with the apparent effect of the covariate marked.]
• If medication is ineffective/partially effective, including treatment as a covariate is worse than ignoring it in the analysis.
• If medication is very effective, such that the phenotypic mean of individuals after treatment is equal to the phenotypic mean of the population as a whole, then including medication as a covariate has no effect.
• If medication is extremely effective, such that the phenotypic mean of individuals after treatment is “better” than the phenotypic mean of the population as a whole, then including medication as a covariate is better than ignoring it, but still far from satisfying.
• Either censor individuals or, better, infer or integrate over their phenotypes before treatment, based on information on efficacy etc.
Two-point vs. multi-point linkage analysis
• In linkage analysis, one always examines whether or not the alleles at 2 loci tend to co-segregate during meiosis.
• In “two-point” linkage analysis, chromosomal inheritance is inferred from the observed trait phenotypes on the one hand (locus 1) and from a single (genotyped) marker locus on the other hand (locus 2).
• In “multi-point” linkage analysis, chromosomal inheritance is inferred from the observed trait phenotypes on the one hand (locus 1) and from multiple (genotyped) marker loci on the other hand (locus 2).
Pros and cons of multi-point linkage analysis
+ Genotypes at multiple markers contain at least as much and generally more information to infer chromosomal inheritance than genotypes at a single marker, resulting in greater power to detect linkage.
+ The number of independent tests in genome-wide linkage analysis is somewhat reduced in multi-point linkage analysis vs. two-point linkage analysis.
- Multi-point linkage analysis requires knowledge of the genetic marker map (marker order and inter-marker recombination fractions). If this information is incorrect, power can be reduced and/or the false positive rate can be increased.
- Multi-point linkage analysis is more susceptible to genotyping errors.
- Multi-point linkage analysis typically assumes linkage equilibrium between markers. If this does not hold, power can be reduced and/or the false positive rate can be increased.
- Multi-point linkage analysis is computationally more demanding than two-point linkage analysis.
Genetic map vs. physical map
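[Figure: 4 markers m1, m2, m3, m4 with inter-marker recombination fractions ϑ12, ϑ23, ϑ34, shown on the genetic map (positions x1, x2, x3, x4 in cM) and on the physical map (positions y1, y2, y3, y4 in Mb).]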
Genetic map distance vs. recombination fraction
Def. of recombination fraction: probability that recombination takes place between 2 chromosomal positions during meiosis
Recombination fractions are not additive, i.e., for 3 loci and recombination fractions ϑ12 and ϑ23, ϑ13 ≠ ϑ12 + ϑ23.
Def. of genetic map distance (Morgan, M): distance in which 1 recombination event is expected to take place or, equivalently, average distance between recombination events. One centi-Morgan (cM) is equal to 1/100 Morgan.
Genetic map distances are additive, i.e. for 3 loci and map distances x12 cM and x23 cM, x13 = x12 + x23 cM.
Neither recombination fractions nor genetic map distances are easily converted into physical map distances.
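The contrast between additive map distances and non-additive recombination fractions can be illustrated with a map function. The sketch below uses Haldane's map function, which assumes no crossover interference; that choice is an assumption made here for illustration and is not specified in the slides. The 20 cM distances are arbitrary example values.

```python
from math import exp, log

def theta_from_cm(x_cm):
    """Haldane's map function: recombination fraction for a distance in cM."""
    return 0.5 * (1 - exp(-2 * x_cm / 100))

def cm_from_theta(theta):
    """Inverse of Haldane's map function (theta < 0.5)."""
    return -50 * log(1 - 2 * theta)

x12, x23 = 20.0, 20.0                      # hypothetical map distances in cM
theta12, theta23 = theta_from_cm(x12), theta_from_cm(x23)
theta13 = theta_from_cm(x12 + x23)         # map distances add

print(theta12 + theta23, theta13)          # the sum overstates theta13
print(theta12 + theta23 - 2 * theta12 * theta23)  # equals theta13 without interference
```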
Why a genome-wide linkage scan may fail
• The sample size is too small.
• The marker genotypes are not sufficiently informative (low heterozygosity and/or large gaps in marker map).
• There is no major gene.
• The chosen analytical approach is unsuitable.
• Bad luck!
A fairytale of 2 traits
Heritability estimates
trait A trait B
45-82% 63-92%
Quantitative trait A (sample 1)
large, randomly ascertained pedigrees
no. of phenotyped individuals: 268
trait heritability estimate: 0.55
Quantitative trait B (sample 1)
large, randomly ascertained pedigrees
no. of phenotyped individuals: 324
trait heritability estimate: 0.88
[Figure slides: results for quantitative trait A shown cumulatively for sample 1, samples 1--2, samples 1--3, and samples 1--3 + combined; results for quantitative trait B shown cumulatively for sample 1 through samples 1--9.]
quantitative trait A: lipoprotein A (concentration in serum)
quantitative trait B: height (in adults)
Heritability of adult height (additive heritability, adjusted for sex and age)
study       sample size   heritability estimate
TOPS        2199          0.78
FLS         705           0.83
GAIT        324           0.88
SAFHS       903           0.76
SAFDS       737           0.92
SHFS AZ     643           0.80
SHFS DK     675           0.81
SHFS OK     647           0.79
Jiri        616           0.63
total       7449
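Polygenic or oligogenic?
Height (9 samples)
$\hat{h}^2_{q,\text{GAIT}} = 0.29$
$\hat{h}^2_{q,\text{TOPS}} = 0.03$
$\hat{h}^2_{q,\text{FLS}} = 0$
$\hat{h}^2_{q,\text{SAFDS}} = 0.08$
$\hat{h}^2_{q,\text{SAFHS}} = 0$
$\hat{h}^2_{q,\text{SHFS-AZ}} = 0.05$
$\hat{h}^2_{q,\text{SHFS-DK}} = 0.01$
$\hat{h}^2_{q,\text{SHFS-OK}} = 0.01$
$\hat{h}^2_{q,\text{Jiri}} = 0$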