GENERALIZED LINEAR MIXED MODELS

Claudia von Brömssen

Dept of economics

Unit of applied statistics and mathematics

Introduction

Methods used yesterday all depend on the independence of observations. All collected data should be

- true replicates

- not clustered

- not measured several times over a time period

Independence and replicates

1. Does fish length of species A vary with land use?

river 1 (lies in forest): 25, 27, 34, 22, 26

river 2 (lies in agricultural area): 42, 36, 29, 35

river 3 (lies in a mixed area): 34, 27, 32, 41

2. Does fish length of species A vary between rivers?

river 1: 25, 27, 34, 22, 26

river 2: 42, 36, 29, 35

river 3: 34, 27, 32, 41

Why can question 2 be answered with statistical methods, but not question 1?


2. Does fish length of species A vary between rivers?

river 1, river 2, river 3

Population 1: all fish in river 1 (and correspondingly for rivers 2 and 3)



Observations: individual fish of species A in each river = independent fish representing the population



river 1: lies in forest

river 2: lies in agricultural area

river 3: lies in a mixed area

Population 1: fish in rivers in forests

Population 2: fish in rivers in agricultural areas

Population 3: fish in rivers in mixed areas

Observations: the 5 observations in river 1 represent the river, but not the population. There are no true replicates, only pseudoreplicates.



forest area: rivers 1a, 1b and 1c

agricultural area: rivers 2a, 2b, 2c and 2d

mixed area: rivers 3a, 3b and 3c

Population 1: fish in rivers in forests

Population 2: fish in rivers in agricultural areas

Population 3: fish in rivers in mixed areas

Rivers 1a, 1b and 1c are replicates and represent population 1.
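The fish example can be sketched in R. The data values are the ones listed above; the names `fish`, `river`, `land_use` and `fish_length` are chosen here for illustration only.

```r
# Fish lengths from the example above (variable names are illustrative)
fish <- data.frame(
  river       = factor(rep(c("river1", "river2", "river3"), times = c(5, 4, 4))),
  land_use    = factor(rep(c("forest", "agricultural", "mixed"), times = c(5, 4, 4))),
  fish_length = c(25, 27, 34, 22, 26,  42, 36, 29, 35,  34, 27, 32, 41)
)

# Question 2: individual fish are replicates within each river,
# so a one-way ANOVA between rivers is possible
fit_river <- aov(fish_length ~ river, data = fish)
summary(fit_river)

# Question 1: every land use is represented by exactly one river,
# so land use and river cannot be separated
rivers_per_land_use <- tapply(fish$river, fish$land_use,
                              function(r) length(unique(r)))
rivers_per_land_use

# With rivers as replicates there is only one value (one river mean) per
# land use, which leaves nothing to estimate the variation between rivers
river_means <- aggregate(fish_length ~ land_use + river, data = fish, FUN = mean)
river_means
```

With several rivers per land use (rivers 1a, 1b, 1c and so on) the table of river means would contain true replicates for each land use.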

Experimental units

When conducting experiments, the experimental unit is the smallest unit that can receive an individual treatment:

If you have cows in a box, each cow can get its own diet -> to compare diets, cows are the experimental units; several cows getting the same diet are replicates.

If you treat plots in a forest with a special treatment, the plots are the experimental units. If instead each leaf can get a different treatment, the leaves are the experimental units.

Experimental units

Experimental units (= independent observations) are needed to quantify the variation in the data:

How much variation can we expect from completely unconnected individuals/subjects/sites?

Dependent data

Often it is easier or of special interest to collect dependent data:

Time series/repeated measurements: we are interested in how the treatment affects the experimental unit over time.

Clustered/hierarchical data: it is easier and gives a better representation to collect several leaves from several trees within the same experimental plot.

Dependent data

If dependence between observations is ignored in the analysis, this can lead to biased estimates and an underestimation of the variation, which in turn gives p-values that are too low.
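A small simulation can illustrate the underestimation of variation. This is only a sketch: the number of plots, the number of measurements per plot and the standard deviations are assumed values, not taken from any data set in this course.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n_plots    <- 6    # true replicates (assumed)
n_per_plot <- 10   # repeated measurements per plot (assumed)

plot_id     <- rep(seq_len(n_plots), each = n_per_plot)
plot_effect <- rnorm(n_plots, mean = 0, sd = 2)   # variation between plots
y <- 20 + plot_effect[plot_id] + rnorm(n_plots * n_per_plot, sd = 0.5)

# Naive analysis: all 60 values treated as independent observations
se_naive <- sd(y) / sqrt(length(y))

# Analysis respecting the design: one mean per plot, 6 independent values
plot_means <- tapply(y, plot_id, mean)
se_plots   <- sd(plot_means) / sqrt(n_plots)

c(naive = se_naive, plot_level = se_plots)
```

The naive standard error is smaller because it counts the repeated measurements as if they were new, unconnected plots.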

If you want to conduct a study that includes dependent data, plan this thoroughly before data collection.

Dependent data

Observe that I am talking about dependencies/independence of observations.

Dependencies between variables are desirable for multivariate methods.

Dependencies between explanatory variables in general or generalised linear models can be a problem if correlations are very high.

Models for dependent observations - examples

If it is important to follow a treatment over time, we could make observations on the same plot several times (several days after the treatment, several months after the treatment, …)

Data for each plot has a time series structure and measurements on the same plot are not independent.

The time series structure is incorporated in the model. We often call these models 'repeated measures models'.

Models for dependent observations - examples

To improve the estimates we could choose to take several measurements on the same plot (at the same time point).

This data structure is called clustered or hierarchical and we can use the data to get some idea of how large the variation within the plot is.

Mixed models

Data with such structures are analysed with mixed models where different types of random factors or random effects account for the dependencies in the data.

Mixed models in R can be fitted with different functions/packages, all with some restrictions. We will use the functions glmer (package lme4) and glmmPQL (package MASS).

Examples - Lophodermium

For the Lophodermium data set there were actually 2 forests observed at each site:

  sample  site forest Latitud veg_period vegetation_zone status
1 Sk1G07     1      1    55.9        205         Nemoral Healthy
2 Th1G07     2      1    56.7        205         Nemoral Healthy
3 Th2G07     2      2    56.5        205         Nemoral Healthy
4 Bo1G07     3      1    58.6        205         Nemoral Healthy
5 Bo2G07     3      2    58.6        205         Nemoral Healthy
6 Asa1G07    4      1    57.2        185            Hemi Healthy
7 Asa2G07    4      2    57.2        185            Hemi Healthy

Examples - Lophodermium

Since for most sites we now have 2 forests observed, the two forests at the same site cannot really be regarded as independent of each other.

The results from these two forests are probably similar because they are geographically close.

We can assume a hierarchical structure. In the model this amounts to estimating variance components for the sites and for the forests within each site.

Fixed and random effects

The model contains a factor effect (e.g. healthy/sick) and a random effect (e.g. of the forest within each site).

Generally, the factor effects or fixed effects are the ones we are interested in modelling, whereas the random effects are there to reflect the study design or experimental design.
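As a sketch only, a logistic mixed model of the kind described here could be written as below. The notation is assumed for illustration: $\alpha_k$ stands for the fixed effect of status $k$, $a_i$ for the random effect of site $i$, and $b_{j(i)}$ for the random effect of forest $j$ within site $i$.

```latex
\operatorname{logit}(p_{ijk}) = \mu + \alpha_k + a_i + b_{j(i)}, \qquad
a_i \sim N(0, \sigma_A^2), \quad b_{j(i)} \sim N(0, \sigma_B^2)
```

Here $p_{ijk}$ would be the probability behind the observed proportion, and the two variances $\sigma_A^2$ and $\sigma_B^2$ are the variance components for site and for forest within site.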


If we only look at the random effect, site:

$\mu_1 = \mu + a_1$

$\mu_2 = \mu + a_2$

$\mu_3 = \mu + a_3$

$E(a_i) = 0$; $a_i$ gives different values for each site.

$\sigma_A^2$: variation in the proportion of X6 between the sites

The different sites are included in the experiment since they represent different conditions.


It is usually not interesting to learn more about the individual levels of a random factor.

If we used site as a fixed factor, we would estimate the mean proportion of species 6 for each of the sites. We would make 18 estimates, one for each site except one.

When we treat site as a random factor, we only estimate one parameter – the variance between the different sites.
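The difference in the number of parameters can be checked with a design matrix. The sketch assumes 19 sites, so that 18 site estimates result as stated above, and two observations per site; both numbers are used only for illustration.

```r
# Assumption for illustration: 19 sites with two observations each
n_sites <- 19
site <- factor(rep(seq_len(n_sites), each = 2))

# Site as a fixed factor: intercept plus one column per site except the first
X_fixed <- model.matrix(~ site)
n_site_parameters_fixed <- ncol(X_fixed) - 1   # subtract the intercept

# Site as a random factor: a single variance parameter,
# whatever the number of sites
n_site_parameters_random <- 1

c(fixed = n_site_parameters_fixed, random = n_site_parameters_random)
```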


The hierarchical structure

[Figure: the hierarchical structure, showing site 1, site 2 and site 3, the experimental unit, and several measurements on the same unit]


Since the forests can be affected by the common factor site we do not see them as independent.

Forests within the same site can be more similar than forests from different sites.

We model this effect by including the random factor site in the model.


We also make several measurements on each forest: we measure both sick and healthy needles in each of the forests, and we observe all forests in both 2006 and 2007.

The fixed factors status and needle_cohort are nested within forest. In agricultural experiments this type of model is often called a split-plot model.


The factor 'site' is on the large-scale level. It coincides with latitude and to some extent with vegetation_zone. Both forests are observed at the same level of 'site'.

The factors status and year are on the small-scale level. They can be observed separately for each of the forests.

Loph: Consideration regarding the factor site

In this study design data was collected at different sites. At each site both healthy and diseased needles were collected during both 2006 and 2007. Some measurements are however missing.

The correct model yesterday would also have needed to include the site variable to adjust for local levels. In our model, however, this role was taken by the latitude variable.

We could choose to replace the latitude with the site variable (which gives less information) or use the site variable as a random factor and keep latitude in the model as well.

Mixed models for Lophodermium

With the type of model we use now we can easily include the factor 'site' as a random variable.

Forest is also included as a random variable.

We assume that both sites and forests are randomly selected from all sites and forests available.


We now need to switch to an R package that can fit mixed models. There are several of them, but we start with the glmer function from the lme4 package.

In glmer we write the model basically in the same way as in glm, but we can include random variables by putting them in parentheses:

(1|site) for a random site

(1|site/forest) for a random forest within a random site
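In lme4 the term (1|site/forest) is shorthand for (1|site) + (1|site:forest). The base R sketch below, using the site and forest codes from the data listing above, shows why the nesting matters; the object name `forest_in_site` is made up here.

```r
# Site and forest codes of the first samples in the data listing
site   <- c(1, 2, 2, 3, 3, 4, 4)
forest <- c(1, 1, 2, 1, 2, 1, 2)

# Forests are numbered 1 and 2 within every site, so 'forest' alone would
# wrongly treat forest 1 at site 2 and forest 1 at site 3 as the same forest
nlevels(factor(forest))

# Nesting creates one level per forest-within-site combination
forest_in_site <- interaction(site, forest, drop = TRUE)
nlevels(forest_in_site)
levels(forest_in_site)
```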


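library(lme4)  # provides glmer

# Binomial model for the proportion of X6 reads
Model1 <- glmer(cbind(X6_reads, reads - X6_reads) ~ Latitud + status + needle_cohort + (1 | site/forest), family = binomial, data = Loph2)

# Poisson model for the number of X6 reads, with log_reads as offset
Model3 <- glmer(X6_reads ~ Latitud + status + needle_cohort + (1 | site/forest), family = poisson, offset = log_reads, data = Loph2)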


Random effects:
 Groups      Name        Variance Std.Dev.
 forest:site (Intercept) 1.807    1.344
 site        (Intercept) 0.000    0.000
Number of obs: 69, groups:  forest:site, 20; site, 10

Fixed effects:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       -28.06088    5.62657   -4.99 6.13e-07 ***
Latitud             0.41701    0.09307    4.48 7.44e-06 ***
statusHealthy      -5.33618    0.13961  -38.22  < 2e-16 ***
needle_cohort2007   0.46179    0.03535   13.06  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
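The columns of the fixed effects table are connected by simple arithmetic. As a sketch, the z value and p-value for Latitud can be recomputed from the estimate and standard error in the first output above; the conversion to an odds ratio assumes the logit link of the binomial model.

```r
# Estimate and standard error for Latitud, copied from the output above
est <- 0.41701
se  <- 0.09307

z_value <- est / se                   # Wald z statistic
p_value <- 2 * pnorm(-abs(z_value))   # two-sided p-value, normal distribution

# Effect on the odds scale: odds ratio per one-unit increase in Latitud
odds_ratio <- exp(est)

round(c(z = z_value, odds_ratio = odds_ratio), 2)
p_value
```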


The quasibinomial and quasipoisson families do not work with glmer. Instead we need to account for overdispersion with yet another method: a random residual.

The idea is simply to estimate a separate variance for the residuals of the model, by adding a random effect with one level per observation, and to adjust the standard errors and p-values for it.

# Two equivalent ways to add the random residual when every sample is a single observation
Model1b <- glmer(cbind(X6_reads, reads - X6_reads) ~ Latitud + status + needle_cohort + (1 | site/forest) + (1 | sample), family = binomial, data = Loph2)

Model1a <- glmer(cbind(X6_reads, reads - X6_reads) ~ Latitud + status + needle_cohort + (1 | site/forest/sample), family = binomial, data = Loph2)


Random effects:
 Groups               Name        Variance  Std.Dev.
 sample:(forest:site) (Intercept) 1.996e+00 1.4127259
 forest:site          (Intercept) 8.703e-01 0.9329018
 site                 (Intercept) 1.508e-08 0.0001228
Number of obs: 69, groups:  sample:(forest:site), 69; forest:site, 20; site, 10

Fixed effects:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       -28.61700    5.62712  -5.086 3.67e-07 ***
Latitud             0.42339    0.09242   4.581 4.63e-06 ***
statusHealthy      -6.34987    0.54205 -11.715  < 2e-16 ***
needle_cohort2007   0.40321    0.42026   0.959    0.337
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
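The random-residual approach can be tried out on simulated data. The sketch below assumes that the lme4 package is installed. The design, all effect sizes and the data frame `d` are made up and only loosely mimic the Lophodermium study.

```r
library(lme4)

set.seed(42)  # arbitrary seed

# Simulated design (all names and numbers are made up):
# 10 sites, 2 forests per site, 2 status levels x 2 cohorts per forest
d <- expand.grid(status = c("Sick", "Healthy"),
                 needle_cohort = c("2006", "2007"),
                 forest = 1:2,
                 site = 1:10)
d$sample  <- factor(seq_len(nrow(d)))   # one level per observation
forest_id <- as.integer(interaction(d$site, d$forest, drop = TRUE))

# Linear predictor with site, forest and observation-level random effects
eta <- -1 + 1.5 * (d$status == "Sick") +
  rnorm(10, sd = 0.5)[d$site] +
  rnorm(20, sd = 0.5)[forest_id] +
  rnorm(nrow(d), sd = 1)                # extra-binomial variation

d$reads <- 200
d$x <- rbinom(nrow(d), size = d$reads, prob = plogis(eta))

# Binomial mixed model with a random residual (1 | sample)
fit <- glmer(cbind(x, reads - x) ~ status + needle_cohort +
               (1 | site/forest) + (1 | sample),
             family = binomial, data = d)

VarCorr(fit)   # estimated variance components
```

Refitting the same model without (1 | sample) and comparing the standard errors of the fixed effects shows how much the random residual matters for the p-values.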


manyglm and edgeR do not seem to offer any way to account for hierarchical or other mixed structures.

Mixed models for Fungi

In the second example of yesterday's computer lab we analysed fungi data at a number of sites.

The sites in this example are the replicates made in the experiment, i.e. the experimental units.

At each site we observe a specific combination of tree type, CO2 (yes/no) and Warmed (yes/no). At each experimental unit we make 3 observations (=the three different horizons).


The structure is similar to the Lophodermium example, where we also had several measurements at each site, but in this case these measurements also have meaning – they represent different soil layers.

This means that measurements have a meaning and a specific order.


In such cases we usually assume that there is a correlation between measurements at the same site.

If a site has a high probability of species 3 in one horizon, this will tend to be the case in all horizons.

Part of this correlation between horizons is described by the model – we include horizon as factor.

There can, however, still be correlations in the residuals of the model = data is not independent.


[Figure: neighbouring horizons at a site are correlated; horizons further apart are also correlated, but less]

We can assume that the observations made at the same site are correlated with each other. Observations made close to each other are more strongly correlated than observations further apart.
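One common way to express "closer means more correlated" is a first-order autoregressive, AR(1), structure. The sketch below builds such a correlation matrix for the three horizons in base R; the value of `rho` is an assumption for illustration. In glmmPQL a structure of this kind can be requested through its `correlation` argument, for example with `corAR1` from the nlme package.

```r
# Assumed correlation between neighbouring horizons (illustrative value)
rho <- 0.6
horizon <- 1:3

# AR(1)-type correlation: rho raised to the distance between horizons
R <- outer(horizon, horizon, function(i, j) rho^abs(i - j))
dimnames(R) <- list(paste0("horizon", horizon), paste0("horizon", horizon))
R
```

Horizons 1 and 2 get correlation `rho`, while horizons 1 and 3 get `rho^2`, which is smaller.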


Correlation between layers can be estimated and the standard errors and p-values are adjusted accordingly.

The correlations are estimated on the residuals, i.e. after the model is fitted, to see if there is any remaining dependence between the layers.


Since horizon actually is a rather important factor in the model we should also consider interactions between the other factors and horizon.

The effect of tree type could be different at different soil horizons. For X3 however we will not be able to estimate this interaction.
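In R model formulas an interaction is requested with `*` or `:`. The sketch below only inspects the terms of two formulas; the variable names `tree_type`, `CO2`, `Warmed` and `horizon` are assumed here and may differ from the column names in the lab data.

```r
# Hypothetical variable names for the fungi data
f_main <- ~ tree_type + CO2 + Warmed + horizon
f_int  <- ~ tree_type * horizon + CO2 + Warmed

# Terms in the main-effects formula
attr(terms(f_main), "term.labels")

# tree_type * horizon expands to both main effects plus tree_type:horizon
attr(terms(f_int), "term.labels")
```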

Generalised linear models - overview

We use logistic regression or Poisson regression as base models.

For DNA sequencing data or similar data specific procedures often use the negative binomial distribution, since overdispersion is almost always observed.
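A negative binomial model can be fitted with `glm.nb` from the MASS package. The sketch below uses simulated counts; the factor `group`, the group means and the dispersion value are made up for illustration.

```r
library(MASS)  # provides glm.nb

set.seed(7)  # arbitrary seed
group <- factor(rep(c("A", "B"), each = 50))       # made-up factor
mu    <- ifelse(group == "A", 10, 20)              # assumed group means
reads <- rnbinom(length(mu), size = 2, mu = mu)    # overdispersed counts

fit_pois <- glm(reads ~ group, family = poisson)
fit_nb   <- glm.nb(reads ~ group)

# Compare the standard errors of the two models
summary(fit_pois)$coefficients
summary(fit_nb)$coefficients

fit_nb$theta   # estimated dispersion parameter of the negative binomial
```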

Generalised linear models - overview

For these types of models you need to have the data observed as counts.

If your response variable is a proportion that cannot be traced back to counts, you use general linear models with a normal distribution for the error term. Sometimes this will require a transformation of the observed data before the model can be fitted. (Look at residual plots to check if the residuals are normally distributed and have equal variances.)

If normality does not hold, use nonparametric methods.
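As a sketch of this workflow, the example below simulates proportions in two groups, fits a general linear model after a logit transformation and draws the two residual plots. The data, the factor `group` and the choice of the logit as transformation are assumptions for illustration; other transformations may suit real data better.

```r
set.seed(3)  # arbitrary seed

# Made-up proportions in two groups, strictly between 0 and 1
d <- data.frame(group = factor(rep(c("A", "B"), each = 20)))
d$prop <- plogis(rnorm(40, mean = ifelse(d$group == "A", -1, 0), sd = 0.7))

# General linear model on the logit-transformed proportions
fit <- lm(qlogis(prop) ~ group, data = d)

# Residual checks: equal variances and normality
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
qqnorm(resid(fit))
qqline(resid(fit))

# One nonparametric alternative if normality does not hold
kruskal.test(prop ~ group, data = d)
```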

Overdispersion - overview

There are several ways to handle overdispersion in data

- use quasi-distributions (this often does not work in mixed settings)

- use the negative binomial distribution (not available in all packages, e.g. not in glm)

- use a random residual (requires the use of mixed models even if the model itself is not mixed)
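Before choosing one of these options it helps to quantify the overdispersion. A common check is the Pearson chi-square statistic divided by the residual degrees of freedom. The sketch below uses simulated counts; the variable `x` and all parameter values are made up.

```r
set.seed(11)  # arbitrary seed
x <- rep(c(0, 1), each = 100)                        # made-up explanatory variable
y <- rnbinom(200, size = 1, mu = exp(2 + 0.5 * x))   # counts with extra variation

fit_pois <- glm(y ~ x, family = poisson)

# Pearson chi-square divided by residual degrees of freedom;
# values clearly above 1 indicate overdispersion
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
dispersion

# The quasi-Poisson fit uses this same estimate to inflate the standard errors
fit_quasi <- glm(y ~ x, family = quasipoisson)
summary(fit_quasi)$dispersion
```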

Overdispersion - overview

Always check that the design is well represented in the model.

Leaving out design variables (factors that are used to define the data collection) will almost always lead to overdispersion.

Mixed models - overview

If your data is collected according to a specific experimental plan or study design you need to account for this structure in the analysis.

If you do not do this, you will be left with faulty variation estimates = wrong p-values (usually too low p-values).

Leaving out the study design variables can also lead to overdispersion.

Mixed models - overview

Typical mixed models are

- repeated measures models, where an experimental unit is observed several times (in time or space)

- hierarchical models, where several observations are made within the experimental unit (but with no specific order)
