data analysis module code: ca660 lecture block 7
Post on 21-Dec-2015
DATA ANALYSIS
Module Code: CA660
Lecture Block 7
Examples in Genomics and Trait Models
• Genetic traits may be controlled by a number of genes, usually unknown.
• Taking “genetic effect” as one genotypic term, a simple model is
$$y_{ij} = \mu + G_i + \varepsilon_{ij}$$
where $y_{ij}$ is the trait value for genotype i in replication j, $\mu$ is the mean, $G_i$ the genetic effect for genotype i and $\varepsilon_{ij}$ the errors.
• If we assume Normality (and want Random effects) and zero covariance between genetic effects and error, then
$$y \sim NID(\mu, \sigma_p^2),\quad G \sim NID(0, \sigma_g^2),\quad \varepsilon \sim NID(0, \sigma_e^2)$$
$$\sigma_p^2 = \sigma_g^2 + \sigma_e^2$$
Note: If the same genotype is replicated b times in an experiment, with phenotypic means used, the error variance is averaged over b.

Example - Trait Models contd.
• What about Environment and GE interactions? Extension to the Simple Model:
$$y_{ijk} = \mu + G_i + E_j + (GE)_{ij} + \varepsilon_{ijk}$$
ANOVA Table: Randomized Blocks within environment; number of sets/blocks within each environment = b = replications. Focus is on the genotype effect.

Source        dof            Expected MSQ
Environment   e-1            (know there are differences)
Blocks        (b-1)e         (again, know there are differences)
Genotypes     g-1            $\sigma_e^2 + b\sigma_{ge}^2 + be\sigma_g^2$
GE            (g-1)(e-1)     $\sigma_e^2 + b\sigma_{ge}^2$
Error         (b-1)(g-1)e    $\sigma_e^2$

(Genotypic effects are measured within blocks.)
Note: individuals are blocked within each of multiple environments, so the environmental effect is intrinsic to the error. The model form is standard, but the only meaningful comparisons are within environment, hence the form of random error = population variance = $\sigma_e^2$; so random effects of interest come from additional variances and ratios.

Example contd.
• HERITABILITY = Ratio of genotypic to phenotypic variance
$$H^2 = \frac{\sigma_g^2}{\sigma_p^2} = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}$$
• Depending on the relationship among genotypes, the interpretation of genotypic variance differs. It may contain additive, dominance and other interaction variances, $\sigma_a^2, \sigma_d^2, \sigma_l^2$.
(Above = heritability in broad terms).
• For some experimental or mating schemes, an additive genetic variance may be calculated. Narrow/specific sense heritability is then
$$H^2 = \frac{\sigma_a^2}{\sigma_p^2} = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_d^2 + \sigma_l^2 + \sigma_e^2}$$
• Again, if phenotypic means are used, obtain a mean-based heritability for b replications:
$$H^2 = \frac{\sigma_g^2}{\sigma_p^2} = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2/b}$$

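The heritability ratios above can be sketched in a few lines, assuming the variance components have already been estimated. The function names and the numeric variance components are illustrative assumptions, not values from the lecture.

```python
def broad_sense_heritability(var_g, var_e, b=1):
    """H^2 = var_g / (var_g + var_e / b); b = 1 gives the single-replicate form."""
    if var_g < 0 or var_e < 0 or b < 1:
        raise ValueError("variances must be non-negative and b >= 1")
    return var_g / (var_g + var_e / b)


def narrow_sense_heritability(var_a, var_d, var_l, var_e):
    """Additive variance over total phenotypic variance."""
    return var_a / (var_a + var_d + var_l + var_e)


# Made-up variance components, for illustration only.
h2_plot = broad_sense_heritability(var_g=2.0, var_e=6.0)        # 2 / (2 + 6)
h2_mean = broad_sense_heritability(var_g=2.0, var_e=6.0, b=3)   # 2 / (2 + 6/3)
h2_narrow = narrow_sense_heritability(1.0, 0.5, 0.5, 6.0)       # 1 / 8
print(h2_plot, h2_mean, h2_narrow)
```

Averaging over b replications shrinks only the error term, which is why the mean-based value is never smaller than the single-replicate one.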
Extended Example- Two related traits
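Have
$$y_{1ij} = \mu_1 + G_{1i} + \varepsilon_{1ij}$$
$$y_{2ij} = \mu_2 + G_{2i} + \varepsilon_{2ij}$$
where 1 and 2 denote traits, i the gene and j an individual in the population. Then y is the trait value, $\mu$ the overall mean, G the genetic effect and $\varepsilon$ the random error.
To quantify the relationship between the two traits, use the variance-covariance matrices for phenotypic ($\Sigma_p$), genetic ($\Sigma_g$) and environmental ($\Sigma_e$) effects:
$$\Sigma_p = \begin{pmatrix} \sigma_{p1}^2 & \sigma_{p12} \\ \sigma_{p12} & \sigma_{p2}^2 \end{pmatrix} = \Sigma_g + \Sigma_e = \begin{pmatrix} \sigma_{g1}^2 & \sigma_{g12} \\ \sigma_{g12} & \sigma_{g2}^2 \end{pmatrix} + \begin{pmatrix} \sigma_{e1}^2 & \sigma_{e12} \\ \sigma_{e12} & \sigma_{e2}^2 \end{pmatrix}$$
So the correlations between traits in terms of phenotypic, genetic and environmental effects are:
$$\rho_p = \frac{\sigma_{p12}}{\sqrt{\sigma_{p1}^2\,\sigma_{p2}^2}};\quad \rho_g = \frac{\sigma_{g12}}{\sqrt{\sigma_{g1}^2\,\sigma_{g2}^2}};\quad \rho_e = \frac{\sigma_{e12}}{\sqrt{\sigma_{e1}^2\,\sigma_{e2}^2}}$$
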
MAXIMUM LIKELIHOOD ESTIMATION
• Recall general points: Estimation, definition of the Likelihood function $L(\theta \mid x)$ for a vector of parameters $\theta$ and set of values x.
Find the most likely value of $\theta$ = maximise the Likelihood fn.
Also defined the Log-likelihood (Support fn. $S(\theta)$) and its derivative, the Score, together with the Information content per observation, which for a single-parameter likelihood is given by
$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\log L(\theta \mid x)\right)^2\right] = -E\left[\frac{\partial^2}{\partial\theta^2}\log L(\theta \mid x)\right]$$
• Why MLE? (Need to know the underlying distribution).
Properties: Consistency; sufficiency; asymptotic efficiency (linked to variance); unique maximum; invariance and, hence, most convenient parameterisation; usually MVUE; amenable to conventional optimisation methods.

VARIANCE, BIAS & CONFIDENCE
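• Variance of an Estimator: usual form
$$\sigma^2_{\hat\theta} = E\{[\hat\theta - E(\hat\theta)]^2\}$$
or, for k independent estimates,
$$\hat\sigma^2_{\hat\theta} = \frac{1}{k-1}\left[\sum_{i=1}^{k}\hat\theta_i^2 - \frac{1}{k}\Big(\sum_{i=1}^{k}\hat\theta_i\Big)^2\right]$$
• For a large sample, the variance of the MLE can be approximated by
$$\sigma^2_{\hat\theta} \approx \frac{1}{n\,I(\theta)}$$
and can also be estimated empirically, using re-sampling* techniques.
• Variance of a linear function (of several estimates): a common need in genomics analysis, e.g. heritability.
• Recall the Bias of the Estimator, $E(\hat\theta) - \theta$; then the Mean Square Error is defined to be
$$MSE = E(\hat\theta - \theta)^2$$
which expands to
$$E\{[\hat\theta - E(\hat\theta)] + [E(\hat\theta) - \theta]\}^2 = \sigma^2_{\hat\theta} + [E(\hat\theta) - \theta]^2$$
so we have the basis for C.I. and tests of hypothesis.
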
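As a rough illustration of the two routes to the variance of an MLE mentioned above, the sketch below compares the large-sample approximation 1/(n I(θ)) with a bootstrap (re-sampling) estimate for a binomial proportion. The data (20 successes in 100 trials), the number of re-samples and the seed are assumptions chosen only for the example.

```python
import random
import statistics

rng = random.Random(0)                 # fixed seed so the run is repeatable

n, x = 100, 20                         # illustrative data: 20 successes in 100 trials
sample = [1] * x + [0] * (n - x)
theta_hat = sum(sample) / n            # MLE of a binomial proportion, x / n

# Large-sample approximation 1 / (n I(theta)), with I(theta) = 1 / (theta (1 - theta)).
var_asymptotic = theta_hat * (1 - theta_hat) / n

# Empirical variance from k bootstrap re-samples (sampling with replacement).
k = 2000
boot = [sum(rng.choices(sample, k=n)) / n for _ in range(k)]
var_bootstrap = statistics.variance(boot)

print(var_asymptotic, var_bootstrap)
```

The two figures should be of the same order; the bootstrap value carries Monte Carlo noise that shrinks as k grows.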
COMMONLY-USED METHODS of obtaining MLE
• Analytical - solving $dL/d\theta = 0$ or $dS/d\theta = 0$ when simple solutions exist
• Grid search or likelihood profile approach
• Newton-Raphson iteration methods
• EM (expectation and maximisation) algorithm
N.B. The Log-likelihood is used because:
its maximum occurs at the same parameter value as that of the Likelihood;
it is easier to compute;
there is a close relationship between the statistical properties of the MLE and the Log-likelihood.

METHODS in brief
Analytical: recall the Binomial example earlier
$$\text{Score} = \frac{dS}{d\theta} = \frac{x}{\theta} - \frac{n-x}{1-\theta} = 0 \;\Rightarrow\; \hat\theta = \frac{x}{n}$$
• Example: For the Normal, the MLEs of mean and variance (taking derivatives w.r.t. mean and variance separately) are equivalent to the sample mean and the actual variance (i.e. divisor N); the latter is unbiased if the mean is known, biased if not.
• Invariance: one-to-one relationships are preserved
• Used: when the MLE has a simple solution

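A quick numerical check of the analytical Binomial result: the Score should vanish at x/n, be positive below it and negative above it. The counts x = 20, n = 100 are illustrative assumptions.

```python
def binomial_score(theta, x, n):
    """dS/dtheta for S(theta) = x log(theta) + (n - x) log(1 - theta)."""
    return x / theta - (n - x) / (1 - theta)


x, n = 20, 100                 # illustrative counts
theta_hat = x / n              # analytical MLE

print(binomial_score(theta_hat, x, n))                               # essentially zero
print(binomial_score(0.1, x, n) > 0, binomial_score(0.3, x, n) < 0)  # sign change around the MLE
```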
Methods for MLE’s contd.
Grid Search – Computational
Plot likelihood or log-likelihood vs parameter. Various features:
• Relative Likelihood = Likelihood / Max. Likelihood (ML set = 1).
Peak of R.L. can be visually identified or sought algorithmically, e.g.
$$S(\theta) = \text{Log}\left[\theta^{20}(1-\theta)^{80} + \theta^{80}(1-\theta)^{20}\right]$$
Plotting the likelihood over the parameter space range $0 \le \theta \le 1$ gives 2 peaks, symmetrical around $\theta = 0.5$: the likelihood profile for the well-known mixed linkage phase problem in linkage analysis.
If, e.g., we constrain $0 \le \hat\theta \le 0.5$, the MLE is $\hat\theta = 0.2$ = R.F. between genes (possible mixed linkage phase).

contd.
• Graphic/numerical implementation: take an initial estimate of $\theta$; the direction of search is determined by evaluating the likelihood at both sides of $\theta$.
The search takes the direction giving an increase. Initial search increments are large, e.g. 0.1; when the likelihood change starts to decrease or becomes negative, stop and refine the increment.
• Multiple peaks: can miss the global maximum; computationally intensive.
• Multiple parameters: grid search. Interpretation of Likelihood profiles can be difficult.

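The coarse-then-refine search described above can be sketched as follows for the mixed linkage phase support function. The helper name `grid_search` and the particular increments (0.1, then 0.001) are assumptions for illustration; this is a plain exhaustive grid rather than the direction-following search of the slide.

```python
import math


def support(theta):
    """S(theta) = log[theta^20 (1-theta)^80 + theta^80 (1-theta)^20]."""
    return math.log(theta**20 * (1 - theta)**80 + theta**80 * (1 - theta)**20)


def grid_search(f, lo, hi, step):
    """Return the grid point in [lo, hi] at which f is largest."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n_steps + 1)]
    return max(grid, key=f)


# Coarse pass with a large increment, then refine around the coarse maximum.
coarse = grid_search(support, 0.1, 0.5, 0.1)
theta_hat = grid_search(support, coarse - 0.1, min(coarse + 0.1, 0.5), 0.001)

# Searching the other half of the range finds the mirror-image peak.
mirror = grid_search(support, 0.5, 0.9, 0.001)
print(coarse, theta_hat, mirror)
```

Constraining the search to θ ≤ 0.5 is what selects one of the two symmetrical peaks.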
Example
• Recall Exs 2, Q. 8.
Data used to show a linkage relationship between a marker and a “rust-resistant” gene.
Escapes = individuals who are susceptible, but show no disease (rust) phenotype under experimental conditions. So define $\alpha, \theta$ as the proportion of escapes and the R.F. respectively.
$1 - \alpha$ is the penetrance for the disease trait, i.e. P{individual with susceptible genotype has disease phenotype}.
Purpose of expt.: typically to estimate the R.F. between marker and gene.
• Use: Support function = Log-Likelihood
$$S(\alpha, \theta) = 168\log(1 - \theta + \theta\alpha) + 3\log(\theta - \theta\alpha) + 52\log(\theta + \alpha - \theta\alpha) + 163\log(1 - \theta - \alpha + \theta\alpha)$$

Example contd.
• Set the 1st derivatives (Scores) w.r.t. $\theta$ and $\alpha$ = 0. The expected value of the Score w.r.t. $\theta$ is zero (see analogies in classical sampling/hypothesis testing). Similarly for $\alpha$. Here, however, there is no simple analytical solution, so cannot solve directly for either.
• Using grid search, the likelihood reaches its maximum at $\hat\theta = 0.02$ (R.F.), $\hat\alpha = 0.22$ (proportion of escapes).
• In general, this type of experiment tests H0: independence between marker and gene ($\theta = 0.5$) and H0: no escapes ($\alpha = 0$). Uses Likelihood Ratio Test statistics (the ML equivalent of $\chi^2$).
• N.B.: Moment estimates solve a slightly different problem, because there is no info. on expected frequencies (not the same as MLE).

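A two-parameter grid search for this example might look like the sketch below. It assumes the four class probabilities 1 − θ + θα, θ − θα, θ + α − θα and 1 − θ − α + θα for the counts 168, 3, 52 and 163; both these arguments and the symbol α for the proportion of escapes are a reconstruction rather than something the source states explicitly. The grid increment of 0.002 is also an arbitrary choice.

```python
import math

COUNTS = (168, 3, 52, 163)   # observed class counts appearing in the support function


def support(alpha, theta):
    """Log-likelihood S(alpha, theta) under the assumed class probabilities."""
    probs = (
        1 - theta + theta * alpha,          # count 168
        theta - theta * alpha,              # count 3
        theta + alpha - theta * alpha,      # count 52
        1 - theta - alpha + theta * alpha,  # count 163
    )
    return sum(c * math.log(p) for c, p in zip(COUNTS, probs))


step = 0.002
grid = [i * step for i in range(1, 250)]    # 0.002 ... 0.498, avoids log(0)

# Exhaustive search over all (alpha, theta) pairs on the grid.
alpha_hat, theta_hat = max(
    ((a, t) for a in grid for t in grid),
    key=lambda pair: support(*pair),
)
print(round(alpha_hat, 3), round(theta_hat, 3))
```

Under these assumptions the maximum lands close to the quoted estimates of roughly 0.22 for escapes and 0.02 for the R.F.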
MLE Estimation Methods contd.Newton-Raphson Iteration
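Have Score($\theta$) = 0 from previously. N-R consists of replacing the Score by the linear terms of its Taylor expansion, so if $\theta''$ is a solution and $\theta'$ = 1st guess,
$$\left.\frac{dS(\theta)}{d\theta}\right|_{\theta''} = \left.\frac{dS(\theta)}{d\theta}\right|_{\theta'} + (\theta'' - \theta')\left.\frac{d^2S(\theta)}{d\theta^2}\right|_{\theta'} = 0$$
$$\Rightarrow\quad \theta'' = \theta' - \frac{\left[dS(\theta)/d\theta\right]_{\theta'}}{\left[d^2S(\theta)/d\theta^2\right]_{\theta'}}$$
Repeat with $\theta''$ replacing $\theta'$. Each iteration fits a parabola to the Likelihood Fn.
[Figure: Likelihood function (L.F.) with the 1st and 2nd iterations marked; annotation: variance of Log-L, i.e. $S(\theta)$.]
• Problems: multiple peaks, zero Information, extreme estimates.
• Multiple parameters: need matrix notation, where the S matrix, e.g., has elements = derivatives of $S(\theta_1, \theta_2)$ w.r.t. $\theta_1$ and $\theta_2$ respectively. Similarly, the Information matrix has terms of the form
$$-E\left[\frac{\partial^2 S(\theta_1,\theta_2)}{\partial\theta_1^2}\right],\quad -E\left[\frac{\partial^2 S(\theta_1,\theta_2)}{\partial\theta_1\,\partial\theta_2}\right],\ \text{etc.}$$
Estimates are
$$\theta'' = \theta' + \frac{1}{N}\,I^{-1}(\theta')\,S(\theta')$$
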
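A minimal single-parameter sketch of the Newton-Raphson update, applied to a Binomial support function so the answer can be checked against x/n. The data, starting value, tolerance and function names are illustrative assumptions.

```python
def newton_raphson(score, second_derivative, theta, tol=1e-8, max_iter=50):
    """Iterate theta'' = theta' - S'(theta') / S''(theta') until the step is below tol."""
    for _ in range(max_iter):
        step = score(theta) / second_derivative(theta)
        theta -= step
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton-Raphson did not converge")


# Illustrative Binomial support S = x log(theta) + (n - x) log(1 - theta).
x, n = 20, 100


def score(t):
    return x / t - (n - x) / (1 - t)


def curvature(t):
    return -x / t**2 - (n - x) / (1 - t)**2


theta_hat = newton_raphson(score, curvature, theta=0.1)
print(theta_hat)   # should agree with x / n
```

A poor starting value can send the iterate outside (0, 1) or towards the wrong peak, which is the "multiple peaks, extreme estimates" caveat in practice.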
Methods contd.
Expectation-Maximisation Algorithm - Iterative. Incomplete data.
(Much genomic data fits this situation, e.g. linkage analysis with marker genotypes of F2 progeny. Usually 9 categories are observed for a 2-locus, 2-allele model, but 16 = complete info., while 14 give info. on linkage. Some are hidden, but if the linkage parameter is known, expected frequencies can be predicted, as you know, and the complete data restored using expectation.)
• Steps: (1) Expectation estimates statistics of the complete data, given the observed incomplete data.
• (2) Maximisation uses the estimated complete data to give the MLE.
• Iterate till convergence (no further change).

E-M contd.
Implementation
• Initial guess, $\theta'$, chosen (e.g. $\theta'$ = 0.25, say, for the R.F.).
• Taking this as “true”, the complete data is estimated by distributional statements, e.g. P(individual is recombinant, given observed genotype) for R.F. estimation.
• MLE estimate $\theta''$ computed.
• This, for the R.F., is the sum of recombinants / N.
• Thus the MLE, for $f_i$ the observed count, is
$$\theta'' = \frac{1}{N}\sum_i f_i\,P(R \mid G_i)$$
Convergence: $\theta'' = \theta'$ or $|\theta'' - \theta'| <$ tolerance (e.g. 0.00001).

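The E and M steps above can be sketched as a short loop. The three genotype classes, their counts and the form of P(R | G) for the ambiguous class are hypothetical, chosen only so the example runs; they are not data from the lecture.

```python
def em_recombination(counts, p_recombinant, theta=0.25, tol=1e-5, max_iter=1000):
    """EM estimate of a recombination fraction.

    counts[i]        : observed count f_i for class i
    p_recombinant[i] : function theta -> P(R | G_i) under the current theta
    """
    n_total = sum(counts)
    for _ in range(max_iter):
        # E-step: expected number of recombinants given the observed classes.
        expected = sum(f * p(theta) for f, p in zip(counts, p_recombinant))
        # M-step: MLE from the restored complete data, recombinants / N.
        theta_new = expected / n_total
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    raise RuntimeError("EM did not converge")


# Hypothetical classes: known non-recombinant, known recombinant, ambiguous.
counts = [70, 10, 20]
p_recombinant = [
    lambda t: 0.0,
    lambda t: 1.0,
    lambda t: t**2 / (t**2 + (1 - t)**2),   # assumed form for the ambiguous class
]

theta_hat = em_recombination(counts, p_recombinant)
print(theta_hat)
```

Only the ambiguous class changes between iterations, so convergence here is fast; with more hidden information it is typically slower.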
LIKELIHOOD : C.I. and H.T.
• Likelihood Ratio Test: c.f. with $\chi^2$.
• Principal advantage of G is Power, as unknown parameters are involved in the hypothesis test.
Have: the Likelihood of $\theta$ taking a value $\theta_A$ which maximises it, i.e. its MLE, and the likelihood under H0: $\theta = \theta_N$ (e.g. $\theta_N$ = 0.5).
• Form of L.R. Test Statistic
$$G = 2\,\text{Log}\frac{L(\theta_A \mid x)}{L(\theta_N \mid x)}$$
or, conventionally,
$$G = -2\,\text{Log}\frac{L(\theta_N \mid x)}{L(\theta_A \mid x)}$$
(choose whichever is easier to interpret).
• Distribution of G ~ approx. $\chi^2$ (d.o.f. = difference in dimension of parameter spaces for $L(\theta_A)$, $L(\theta_N)$).
• Goodness of Fit: notation as for $\chi^2$, $G \sim \chi^2_{n-1}$:
$$G = 2\sum_{i=1}^{n} O_i\,\text{Log}\frac{O_i}{E_i}$$
• Independence: notation again as for $\chi^2$:
$$G = 2\sum_{i=1}^{r}\sum_{j=1}^{c} O_{ij}\,\text{Log}\frac{O_{ij}}{E_{ij}}$$

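Power-Example extended
• Under H0: $\theta = 0.5$,
$$E\{G\} = 2n\{0.5\,\text{Log}\,0.5 + 0.5\,\text{Log}\,0.5 - \text{Log}\,0.5\} = 0$$
• At level of significance $\alpha$ = 0.05, suppose the true $\theta = \theta_1 = 0.2$, so if n = 25
$$E\{G\} = 50\{0.2\,\text{Log}\,0.2 + 0.8\,\text{Log}\,0.8 - \text{Log}\,0.5\} = 9.6$$
(e.g. in genomics this might apply where R.F. = 0.2 between two genes, as opposed to 0.5. Natural logs are used, though either is possible in practice; hence the generic form “Log” rather than Ln here. Assume Ln throughout for genetic/genomic examples unless otherwise indicated.)
• Rejection region at the 0.05 level is $G > \chi^2_1 = 3.84$.
• If we sketch the curves, P{LRTS falls in the acceptance region} = 0.13 = $\beta$ = Prob. of a false negative when the actual value of $\theta$ = 0.2.
• If the sample size is increased, e.g. n = 50, E{G} = 19 and it is easy to show that P{False negative} = 0.01.
• Generally: Power for these tests is given by
$$P\{\chi^2_{df,\;nE\{G_{unit}\}} > \chi^2_{\alpha,\,df}\}$$
i.e. a non-central $\chi^2$ with non-centrality $nE\{G_{unit}\}$, where $E\{G_{unit}\}$ is the expected G per observation.
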
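The figures in this power example can be checked with the sketch below. It treats G as a non-central chi-squared with 1 d.o.f. and non-centrality E{G}, and evaluates that tail through the Normal distribution; reading the power formula this way, and the helper names, are assumptions.

```python
import math


def normal_cdf(z):
    """Standard Normal distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def expected_g(theta, n, theta0=0.5):
    """E{G} = 2n [theta log(theta/theta0) + (1-theta) log((1-theta)/(1-theta0))]."""
    return 2 * n * (theta * math.log(theta / theta0)
                    + (1 - theta) * math.log((1 - theta) / (1 - theta0)))


def power_1df(noncentrality, critical=3.84):
    """P{non-central chi-squared (1 d.o.f.) > critical}, using Z ~ N(sqrt(ncp), 1)."""
    root, crit = math.sqrt(noncentrality), math.sqrt(critical)
    return 1 - (normal_cdf(crit - root) - normal_cdf(-crit - root))


for n in (25, 50):
    eg = expected_g(0.2, n)
    beta = 1 - power_1df(eg)          # probability of a false negative
    print(n, round(eg, 1), round(beta, 2))
```

Doubling n doubles the non-centrality, which is why the false-negative rate falls so sharply between n = 25 and n = 50.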
Likelihood C.I.'s - method
• Example: Consider the following Likelihood function
$$L(\theta) = (1-\theta)^a\,\theta^b$$
where $\theta$ is the unknown parameter and a, b are observed counts.
• For 4 data sets observed, A: (a,b) = (8,2), B: (a,b) = (16,4), C: (a,b) = (80,20), D: (a,b) = (400,100).
• Likelihood estimates can be plotted vs possible parameter values, with MLE = peak value,
e.g. MLE $\hat\theta$ = 0.2, Lmax = 0.0067 for A, and Lmax = 0.000045 for B etc.
Set A: Log Lmax - Log L = Log(0.0067) - Log(0.00091) = 2 gives the 95% C.I., so $\theta$ = (0.035, 0.496), corresponding to L = 0.00091, is the 95% C.I. for A.
Similarly, manipulating this expression, the Likelihood value corresponding to the 95% confidence interval is given as L = Lmax/7.389 (since $e^2$ = 7.389).
Note: Usually plot the Log-likelihood vs parameter, rather than the Likelihood. As sample size increases, the C.I. becomes narrower and more symmetric.

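The "drop of 2 in log-likelihood" interval can be found numerically with a simple grid, as sketched below for the four data sets, assuming the likelihood L(θ) = (1 − θ)^a θ^b (which gives MLE 0.2 for each set). The grid increment of 0.001 and the function names are assumptions.

```python
import math


def log_lik(theta, a, b):
    """log L(theta) for L(theta) = (1 - theta)^a * theta^b."""
    return a * math.log(1 - theta) + b * math.log(theta)


def support_interval(a, b, drop=2.0, step=0.001):
    """MLE plus the smallest and largest grid points within `drop` of the maximum."""
    theta_hat = b / (a + b)
    max_ll = log_lik(theta_hat, a, b)
    grid = [i * step for i in range(1, int(round(1 / step)))]
    inside = [t for t in grid if max_ll - log_lik(t, a, b) <= drop]
    return theta_hat, min(inside), max(inside)


data_sets = {"A": (8, 2), "B": (16, 4), "C": (80, 20), "D": (400, 100)}
for label, (a, b) in data_sets.items():
    theta_hat, lower, upper = support_interval(a, b)
    print(label, theta_hat, round(lower, 3), round(upper, 3))
```

For set A this lands close to the interval quoted above, and the intervals tighten around 0.2 from A to D as the counts grow.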
Multiple Populations: Extensions to G - Example
• Recall Mendel’s data (earlier) and the Extensions to $\chi^2$ for the same. In brief:

                   Round             Wrinkled
Plant            O       E         O       E         G     dof   p-value
1               45    42.75       12    14.25      0.49     1     0.49
2                                                  0.09     1     0.77
3                                                  0.10     1     0.75
4                                                  1.30     1     0.26
5                                                  0.01     1     0.93
6                                                  0.71     1     0.40
7                                                  0.79     1     0.38
8                                                  0.63     1     0.43
9                                                  1.06     1     0.30
10                                                 0.17     1     0.68
Total          336               101               5.34    10
Pooled         336   327.75      101   109.25      0.85     1     0.36
Heterogeneity                                      4.50     9     0.88

Multiple Populations - summary
• Parallels $\chi^2$
• Partitions therefore
$$G_{Total} = 2\sum_{i=1}^{p}\sum_{j=1}^{n} O_{ij}\,\log\frac{O_{ij}}{E_{ij}}$$
$$G_{Pooled} = 2\sum_{j=1}^{n}\left(\sum_{i=1}^{p} O_{ij}\right)\log\frac{\sum_{i=1}^{p} O_{ij}}{\sum_{i=1}^{p} E_{ij}}$$
and $G_{heterogeneity} = G_{Total} - G_{Pooled}$ (n = no. classes, p = no. populations)
Example: Recall Backcross (AaBb x aabb) - Goodness of fit (2-locus model). For each of 4 crosses, a Total GoF statistic can be calculated according to the expected segregation ratio 1:1:1:1 (assumes no segregation distortion for both loci and no linkage between loci).
For each locus, GoF is calculated using marginal counts, assuming each genotype segregates 1:1.
The difference between the Total and the 2 individual locus GoF statistics is the Log-LRTS (or chi-squared statistic) contributed by association/linkage between the 2 loci.

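The partition of G can be illustrated with the Round/Wrinkled counts that survive in the Mendel table (plant 1 and the pooled row). The total of 5.34 is taken from the table rather than recomputed, since the per-plant counts for plants 2 to 10 are not shown; the function names are assumptions.

```python
import math


def g_statistic(observed, expected):
    """G = 2 * sum(O * log(O / E)), natural logs."""
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))


def p_value_1df(g):
    """Upper tail of chi-squared with 1 d.o.f.: P(X > g) = erfc(sqrt(g / 2))."""
    return math.erfc(math.sqrt(g / 2))


g_plant1 = g_statistic((45, 12), (42.75, 14.25))
g_pooled = g_statistic((336, 101), (327.75, 109.25))

g_total = 5.34                          # sum of the ten per-plant G values, from the table
g_heterogeneity = g_total - g_pooled    # carries 10 - 1 = 9 d.o.f.

print(round(g_plant1, 2), round(g_pooled, 2), round(g_heterogeneity, 2))
print(round(p_value_1df(g_pooled), 2))
```

The recomputed plant 1 and pooled values agree with the tabulated G to two decimals, and the heterogeneity agrees up to rounding in the table.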
Example: Marker Screening
Screening for Polymorphism (different detectable alleles): look at the stages involved.
Genomic map: based on genome variation at locations (from molecular assay or traditional trait observations).
(1) Screening polymorphic genetic markers is Exptal step 1: usually assay a large number of possible genetic markers in a small progeny set = random sample of the mapping population.
If a marker does not show polymorphism for the set of progeny, then the marker is non-informative and will not be used for data analysis.

Example contd.
(2) Progeny size for screening: based on power, convenience etc.,
e.g. False positive = monomorphic marker determined to be polymorphic. Rare, since m-m cannot produce segregating genotypes if these are determined accurately.
False negatives are high, particularly for a small sample, e.g. for markers segregating 1:1 in (i) Backcross, recombinant inbred lines, doubled haploid lines, or for (ii) F2 with codominant markers (segregating 1:2:1).
So, e.g. (i) P{sampling all individuals with same genotype} = $2(0.5)^n$
(ii) P{false negative for single marker, n=5} = $2(0.25)^5 + 0.5^5 = 0.0332$
Hence Power curves as before.

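The two false-negative probabilities can be tabulated for a few progeny sizes as below. The function names and the choice of n values are illustrative assumptions.

```python
def false_negative_1_to_1(n):
    """P{all n progeny share one genotype} when two classes segregate 1:1."""
    return 2 * 0.5**n


def false_negative_f2_codominant(n):
    """Same probability when three classes segregate 1:2:1 (F2, codominant marker)."""
    return 2 * 0.25**n + 0.5**n


for n in (5, 10, 20):
    print(n, false_negative_1_to_1(n), round(false_negative_f2_codominant(n), 4))
```

Both probabilities fall geometrically with n, so even a modest increase in the screening set sharply reduces the chance of discarding a polymorphic marker.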
Example contd. S.R 1:1 vs 3:1- use LRTS
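• Detection of departure from S.R. of 1:1
$$G = 2\left[O_1\log\frac{O_1}{0.5n} + O_2\log\frac{O_2}{0.5n}\right] = 2\,[O_1\,\text{Log}\,O_1 + O_2\,\text{Log}\,O_2 - n\,\text{Log}(0.5n)]$$
n = sample size; $O_1$, $O_2$ observed counts of the 2 genotypic classes.
• For a true S.R. of 3:1, with $O_1$ the genotypic frequency of the dominant genotype, the T.S. parametric value is approx.
$$E[G] \approx 2\left[E(O_1)\,\text{Log}\frac{E(O_1)}{0.5n} + E(O_2)\,\text{Log}\frac{E(O_2)}{0.5n}\right] = 2\left[0.75n\,\text{Log}\frac{0.75n}{0.5n} + 0.25n\,\text{Log}\frac{0.25n}{0.5n}\right] = 0.2616n$$
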
Example contd.
• To reject a S.R. of 1:1 at the 0.05 significance level, a LogLRTS of at least 3.84 (critical value for rejection) is required.
• Statistical Power
$$P\{\chi^2_{1,\,E[G]} > 3.84\}$$
• For n = 15, then, power is
$$P\{\chi^2_{1,\,3.924} > 3.84\}$$
• For a power of 90%, n ≈ 40 is needed.
• If the problem is expressed the other way, i.e. calculating the Expected LRTS for rejecting a 3:1 S.R. when the true value is S.R. 1:1, this is 0.2877n and n ≈ 35 is needed.

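The per-individual expected LRTS and the sample size for a target power can be sketched as follows. As before, the power is evaluated as the tail of a non-central chi-squared with 1 d.o.f. via the Normal distribution, which is an assumption about how the formula is meant to be read; the function names are illustrative.

```python
import math


def normal_cdf(z):
    """Standard Normal distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def power_1df(noncentrality, critical=3.84):
    """P{non-central chi-squared (1 d.o.f.) > critical}."""
    root, crit = math.sqrt(noncentrality), math.sqrt(critical)
    return 1 - (normal_cdf(crit - root) - normal_cdf(-crit - root))


def expected_g_per_individual(true_probs, null_probs):
    """E[G] / n = 2 * sum(p * log(p / q)) for true proportions p and null proportions q."""
    return 2 * sum(p * math.log(p / q) for p, q in zip(true_probs, null_probs))


def smallest_n(unit_g, target=0.9):
    """Smallest sample size whose power reaches the target."""
    n = 1
    while power_1df(n * unit_g) < target:
        n += 1
    return n


unit_3to1 = expected_g_per_individual((0.75, 0.25), (0.5, 0.5))   # about 0.2616
unit_1to1 = expected_g_per_individual((0.5, 0.5), (0.75, 0.25))   # about 0.2877

print(round(power_1df(15 * unit_3to1), 2))
print(smallest_n(unit_3to1), smallest_n(unit_1to1))
```

The sample sizes this search returns are in the neighbourhood of the n ≈ 40 and n ≈ 35 quoted above; small differences come from rounding and from how the 90% threshold is applied.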
Maximum Likelihood Benefits
• Good Confidence Intervals: coverage probability realised and interval biologically meaningful
• MLE: good estimator of a CI
  - MSE consistent: $\lim_{n\to\infty} E(\hat\theta - \theta)^2 = 0$
  - Absence of Bias: $E(\hat\theta) = \theta$; does not “stand-alone”, minimum variance important
  - Asymptotically Normal: $(\hat\theta - \theta)/\sigma_{\hat\theta} \sim N(0,1)$ as $n \to \infty$
  - Precise: large sample
  - Biological inference valid
  - Biological range realistic