GENERALIZED LINEAR MIXED MODELS Claudia von Brömssen Dept of economics Unit of applied statistics and mathematics

Post on 14-Dec-2015

GENERALIZED LINEAR MIXED MODELS

Claudia von Brömssen

Dept of economics

Unit of applied statistics and mathematics

Introduction

Methods used yesterday all depend on the independence of observations. All collected data should be

- true replicates

- not clustered

- not measured several times over a time period

Independence and replicates

1. Does fish length of species A vary with land use?

river 1 (lies in forest): 25, 27, 34, 22, 26

river 2 (lies in agricultural area): 42, 36, 29, 35

river 3 (lies in a mixed area): 34, 27, 32, 41

2. Does fish length of species A vary between rivers?

river 1: 25, 27, 34, 22, 26

river 2: 42, 36, 29, 35

river 3: 34, 27, 32, 41

Why can question 2 be answered with statistical methods, but not question 1?

Independence and replicates

2. Does fish length of species A vary between rivers?

river 1:

river 2:

river 3:

Population 1: all fish in river 1

Population 2: all fish in river 2

Population 3: all fish in river 3

Observations: individual fish of species A in each river = independent fish representing the population

Independence and replicates

1. Does fish length of species A vary with land use?

river 1: lies in forest

river 2: lies in agricultural area

river 3: lies in a mixed area

Population 1: fish in rivers in forests

Population 2: fish in rivers in agricultural areas

Population 3: fish in rivers in mixed areas

Observations: the 5 observations in river 1 represent that river, but not the population of fish in forest rivers. They are not true replicates but pseudoreplicates.
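
The difference between the two questions can be sketched in R with the fish lengths given on the slides. This is only an illustration: the data frame and variable names (fish, land_use_of and so on) are made up here. Question 2 can be handled by a one-way analysis of variance between rivers, while for question 1 each land-use class is represented by a single river, so land use and river cannot be separated.

```r
# Fish lengths as given on the slides (variable names are illustrative).
fish <- data.frame(
  length = c(25, 27, 34, 22, 26,  42, 36, 29, 35,  34, 27, 32, 41),
  river  = factor(rep(c("river1", "river2", "river3"), times = c(5, 4, 4)))
)

# Land use is a property of the river, so it is constant within each river.
land_use_of <- c(river1 = "forest", river2 = "agricultural", river3 = "mixed")
fish$land_use <- factor(land_use_of[as.character(fish$river)])

# Question 2: compare rivers. Individual fish are replicates within a river.
river_means <- tapply(fish$length, fish$river, mean)
print(river_means)
print(summary(aov(length ~ river, data = fish)))

# Question 1: count how many rivers represent each land-use class.
# With exactly one river per class, land use is completely confounded
# with river and there is no replication at the land-use level.
rivers_per_land_use <- tapply(fish$river, fish$land_use,
                              function(r) length(unique(r)))
print(rivers_per_land_use)
```

Any difference found by the analysis of variance is a difference between these three rivers. It says nothing about land use in general until several rivers per land-use class are sampled.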

Independence and replicates

1. Does fish length of species A vary with land use?

forest area: rivers 1a, 1b and 1c

agricultural area: rivers 2a, 2b, 2c and 2d

mixed area: rivers 3a, 3b and 3c

Population 1: fish in rivers in forests

Population 2: fish in rivers in agricultural areas

Population 3: fish in rivers in mixed areas

Rivers 1a, 1b and 1c are replicates and represent population 1.

Experimental units

When conducting experiments, the experimental unit is the smallest unit that can receive an individual treatment:

If you have cows in a box, each cow can get its own diet. To compare diets, the cows are the experimental units, and several cows getting the same diet are replicates.

If you treat plots in a forest with a special treatment, the plots are the experimental units. If instead each leaf can get a different treatment, the leaves are the experimental units.

Experimental units

Experimental units (= independent observations) are needed to quantify the variation in the data:

How much variation can we expect from completely unconnected individuals/subjects/sites?

Dependent data

Often it is easier or of special interest to collect dependent data

Time series/repeated measurements: we are interested in how the treatment affects the experimental unit over time.

Clustered/hierarchical data: it is easier and gives a better representation to collect several leaves from several trees within the same experimental plot.

Dependent data

If dependence between observations is ignored in the analysis, this can lead to bias in the estimates and to an underestimation of the variation, which in turn gives p-values that are misleadingly low.

If you want to make a study that includes dependent data, plan this thoroughly before data collection.
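
A small simulation can illustrate why ignoring dependence gives too low p-values. The sketch below is not taken from the course material: all settings (6 plots, 10 measurements per plot, the standard deviations) are assumptions chosen for illustration. A treatment without any true effect is applied to whole plots. The data are then analysed once as if all measurements were independent, and once using the plot means, so that the plot is the unit of analysis.

```r
set.seed(1)

# One simulated experiment; returns the p-values of the two analyses.
simulate_once <- function(n_plots = 6, n_per_plot = 10,
                          sd_plot = 2, sd_resid = 1) {
  plot_id <- rep(seq_len(n_plots), each = n_per_plot)
  # Treatment is applied to whole plots and has no true effect.
  treat_by_plot <- factor(rep(c("A", "B"), length.out = n_plots))
  treatment <- treat_by_plot[plot_id]
  # Measurements from the same plot share a common plot effect.
  plot_effect <- rnorm(n_plots, sd = sd_plot)
  y <- plot_effect[plot_id] + rnorm(n_plots * n_per_plot, sd = sd_resid)

  # Naive analysis: all 60 measurements treated as independent.
  p_naive <- t.test(y ~ treatment, var.equal = TRUE)$p.value
  # Analysis on plot means: the 6 plots are the replicates.
  plot_means <- as.numeric(tapply(y, plot_id, mean))
  p_means <- t.test(plot_means ~ treat_by_plot, var.equal = TRUE)$p.value

  c(naive = p_naive, plot_means = p_means)
}

p_values <- replicate(500, simulate_once())
# Proportion of simulated experiments with p < 0.05 although no effect exists.
false_positive_rate <- rowMeans(p_values < 0.05)
print(false_positive_rate)
```

Under these assumptions the naive analysis rejects the true null hypothesis far more often than the nominal 5 %, while the analysis on plot means stays close to it.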

Dependent data

Observe that I am talking about dependence/independence of observations.

Dependencies between variables are desirable for multivariate methods.

Dependencies between explanatory variables in general or generalised linear models can be a problem if the correlations are very high.

Models for dependent observations - examples

If it is important to follow a treatment over time, we could make observations on the same plot several times (several days after the treatment, several months after the treatment, …).

The data for each plot have a time series structure, and measurements on the same plot are not independent.

The time series structure is incorporated in the model. We often call these models ’repeated measures models’.

Models for dependent observations - examples

To improve the estimates we could choose to take several measurements on the same plot (but at the same time point).

This data structure is called clustered or hierarchical, and we can use the data to get some idea of how large the variation within the plot is.

Mixed models

Data with such structures are analysed with mixed models where different types of random factors or random effects account for the dependencies in the data.

Mixed models in R can be fitted with different functions/packages, all with some restrictions. We will use the functions glmer (package lme4) and glmmPQL (package MASS).

Examples - Lophodermium

For the Lophodermium data set there were actually 2 forests observed at each site:

   sample site forest Latitud veg_period vegetation_zone  status
1  Sk1G07    1      1    55.9        205         Nemoral Healthy
2  Th1G07    2      1    56.7        205         Nemoral Healthy
3  Th2G07    2      2    56.5        205         Nemoral Healthy
4  Bo1G07    3      1    58.6        205         Nemoral Healthy
5  Bo2G07    3      2    58.6        205         Nemoral Healthy
6 Asa1G07    4      1    57.2        185            Hemi Healthy
7 Asa2G07    4      2    57.2        185            Hemi Healthy

Examples - Lophodermium

Since for most sites we now have 2 forests observed, the two forests at the same site cannot really be regarded as independent of each other.

The results from these two forests are probably similar because they are geographically close.

We can assume a hierarchical structure. In the model this amounts to estimating variance components for the site and for the forests within each site.

Fixed and random effects

[Model equation not preserved in the transcript.] In the model, one term is a factor effect (e.g. healthy/sick) and another is a random effect (e.g. of the forest within each site).

Generally, the factor effects (fixed effects) are the ones we are interested in modelling, whereas the random effects are there to reflect the study design or experimental design.
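
As a sketch, a model with one fixed factor and one random effect can be written in the following generic form. The symbols are assumptions chosen for illustration and are not necessarily those used on the original slide: $\eta_{ij}$ is the linear predictor on the link scale (e.g. the logit of a proportion), $\alpha_i$ is the fixed effect of status level $i$ (healthy/sick), and $b_j$ is the random effect of forest $j$.

```latex
\eta_{ij} = \mu + \alpha_i + b_j, \qquad b_j \sim N(0, \sigma_B^2)
```

The fixed effects $\alpha_i$ are estimated individually, whereas for the random effect only its variance $\sigma_B^2$ is estimated.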

Fixed and random effects

If we only look at the random effect, site:

$\mu_1 = \mu + a_1$

$\mu_2 = \mu + a_2$

$\mu_3 = \mu + a_3$

$E(a_i) = 0$; $a_i$ gives different values for each site.

$\sigma_A^2$: variation in the proportion of X6 between the sites

The different sites are included in the experiment since they represent different conditions.

Fixed and random effects

It is usually not interesting to learn more about the individual levels of a random factor.

If we used site as a fixed factor, we would estimate the mean proportion of species 6 for each of the sites. We would make 18 estimates, one for each site except one.

When we treat site as a random factor, we only estimate one parameter – the variance between the different sites.

Fixed and random effects

The hierarchical structure

[Figure: sites 1, 2 and 3 are the experimental units; several measurements are made on the same unit.]

Fixed and random effects

Since the forests can be affected by the common factor site we do not see them as independent.

Forests within the same site can be more similar than forests from different sites.

We model this effect by including the random factor site in the model.

Fixed and random effects

We also make several measurements on each forest: we measure both sick and healthy needles in each of the forests, and we observe all forests in both 2006 and 2007.

The fixed factors status and needle_cohort are nested within forest. In agricultural experiments this type of model is often called a split-plot model.

Fixed and random effects

The factor ’site’ is on the large-scale level. It coincides with latitude and to some extent with vegetation_zone. Both forests are observed on the same level of ’site’.

The factors status and year are on the small-scale level. They can be observed separately for each of the forests.

Loph: Consideration regarding the factor site

In this study design data was collected at different sites. At each site both healthy and diseased needles were collected during both 2006 and 2007. Some measurements are however missing.

A correct model yesterday would also have needed to include the site variable to adjust for local levels. In our model, however, this role was taken by the latitude variable.

We could choose to replace the latitude with the site variable (which gives less information) or use the site variable as a random factor and keep latitude in the model as well.

Mixed models for Lophodermium

With the type of model we use now, we can easily include the factor ’site’ as a random variable.

Forest is also included as a random variable.

We assume that both sites and forests are randomly selected from all sites and forests available.

Mixed models for Lophodermium

We now need to switch to an R package that can fit mixed models. There are several of them, but we start with the glmer function.

In glmer we write the model basically the same way as in glm, but we can include random variables by putting them in parentheses:

(1|site) for a random site

(1|site/forest) for a random forest within a random site
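
Why forest has to be specified within site can be sketched with the first rows of the Lophodermium table shown earlier. The forest labels 1 and 2 are reused at every site, so forest on its own does not identify a physical forest; only the combination of site and forest does. In lme4, (1|site/forest) is shorthand for (1|site) + (1|site:forest). The data frame name loph_head is made up for this illustration.

```r
# First rows of the Lophodermium table (sample, site and forest only).
loph_head <- data.frame(
  sample = c("Sk1G07", "Th1G07", "Th2G07", "Bo1G07",
             "Bo2G07", "Asa1G07", "Asa2G07"),
  site   = factor(c(1, 2, 2, 3, 3, 4, 4)),
  forest = factor(c(1, 1, 2, 1, 2, 1, 2))
)

# The forest column alone has only two labels ...
n_forest_labels <- nlevels(loph_head$forest)

# ... but each site:forest combination is a different physical forest.
forest_in_site <- interaction(loph_head$site, loph_head$forest, drop = TRUE)
n_forests <- nlevels(forest_in_site)

print(c(labels = n_forest_labels, forests = n_forests))
print(table(site = loph_head$site, forest = loph_head$forest))
```

Using (1|forest) alone would wrongly treat forest 1 at every site as the same group, which is why the nested form is used here.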

Mixed models for Lophodermium

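library(lme4)  # provides glmer

# Binomial model: reads of X6 out of the total number of reads
Model1 <- glmer(cbind(X6_reads, reads - X6_reads) ~ Latitud + status + needle_cohort + (1 | site/forest),
                family = binomial, data = Loph2)

# Poisson model with an offset (log_reads is assumed to be a column of Loph2 holding log(reads))
Model3 <- glmer(X6_reads ~ Latitud + status + needle_cohort + (1 | site/forest),
                family = poisson, offset = log_reads, data = Loph2)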

Mixed models for Lophodermium

Random effects:
 Groups      Name        Variance Std.Dev.
 forest:site (Intercept) 1.807    1.344
 site        (Intercept) 0.000    0.000
Number of obs: 69, groups: forest:site, 20; site, 10

Fixed effects:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       -28.06088    5.62657   -4.99 6.13e-07 ***
Latitud             0.41701    0.09307    4.48 7.44e-06 ***
statusHealthy      -5.33618    0.13961  -38.22  < 2e-16 ***
needle_cohort2007   0.46179    0.03535   13.06  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Mixed models for Lophodermium

The quasibinomial and quasipoisson families do not work with glmer. Instead we need to account for overdispersion with yet another method: a random residual.

The idea is to estimate a separate variance for the residuals of the model, by adding a random effect with one level per observation (here: sample), and to adjust the p-values for that.

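# Random residual written as a separate random effect for sample
Model1b <- glmer(cbind(X6_reads, reads - X6_reads) ~ Latitud + status + needle_cohort + (1 | site/forest) + (1 | sample),
                 family = binomial, data = Loph2)

# Random residual written as sample nested within forest within site
Model1a <- glmer(cbind(X6_reads, reads - X6_reads) ~ Latitud + status + needle_cohort + (1 | site/forest/sample),
                 family = binomial, data = Loph2)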

Mixed models for Lophodermium

Random effects:
 Groups               Name        Variance  Std.Dev.
 sample:(forest:site) (Intercept) 1.996e+00 1.4127259
 forest:site          (Intercept) 8.703e-01 0.9329018
 site                 (Intercept) 1.508e-08 0.0001228
Number of obs: 69, groups: sample:(forest:site), 69; forest:site, 20; site, 10

Fixed effects:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       -28.61700    5.62712  -5.086 3.67e-07 ***
Latitud             0.42339    0.09242   4.581 4.63e-06 ***
statusHealthy      -6.34987    0.54205 -11.715  < 2e-16 ***
needle_cohort2007   0.40321    0.42026   0.959    0.337
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
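
The effect of the random residual on the tests can be checked by hand. The sketch below recomputes the Wald z values and two-sided p-values from the estimates and standard errors printed above, and compares the standard error of needle_cohort2007 with the one from the model without a random residual. Only the numbers are taken from the output; the object names are made up for this illustration.

```r
# Estimates and standard errors from the model with a random residual.
est <- c(Latitud = 0.42339, statusHealthy = -6.34987,
         needle_cohort2007 = 0.40321)
se  <- c(Latitud = 0.09242, statusHealthy = 0.54205,
         needle_cohort2007 = 0.42026)

# Wald test: z = estimate / standard error, two-sided normal p-value.
z <- est / se
p <- 2 * pnorm(-abs(z))
print(round(cbind(estimate = est, se = se, z = z, p = p), 4))

# Standard error of needle_cohort2007 in the model without random residual.
se_without <- 0.03535
se_ratio <- unname(se["needle_cohort2007"] / se_without)
print(se_ratio)
```

The estimate for needle_cohort2007 changes only slightly between the two models, but its standard error becomes more than ten times larger, which is why the effect is no longer significant once overdispersion is accounted for.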

Mixed models for Lophodermium

Manyglm and edgeR do not seem to offer any way to account for hierarchical or other mixed structures.

Mixed models for Fungi

In the second example of yesterday's computer lab we analysed fungi data at a number of sites.

The sites in this example are the replicates made in the experiment, i.e. the experimental units.

At each site we observe a specific combination of tree type, CO2 (yes/no) and Warmed (yes/no). At each experimental unit we make 3 observations (=the three different horizons).

Mixed models for Fungi

The structure is similar to the Lophodermium example, where we also had several measurements at each site, but in this case these measurements also have meaning – they represent different soil layers.

This means that measurements have a meaning and a specific order.

Mixed models for Fungi

In such cases we usually assume that there is a correlation between measurements at the same site.

If a site has a high probability of species 3, this will be the case at all horizons.

Part of this correlation between horizons is described by the model – we include horizon as factor.

There can, however, still be correlations in the residuals of the model = data is not independent.

Mixed models for Fungi

[Figure: adjacent horizons are correlated; horizons further apart are also correlated, but less.]

We can assume that the observations made at the same site are correlated with each other. Observations made close to each other are more correlated than observations further apart.
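
One common way to express "closer means more correlated" is a first-order autoregressive, AR(1), structure, where the correlation between two horizons is $\rho$ raised to the number of steps between them. Whether this is the structure used in the course is an assumption, and the value of rho and the horizon labels below are made up for illustration.

```r
# Hypothetical labels for the three soil horizons at one site.
horizons <- c("horizon1", "horizon2", "horizon3")

# Assumed correlation between neighbouring horizons (illustration only).
rho <- 0.6

# Number of steps between each pair of horizons.
lag <- abs(outer(seq_along(horizons), seq_along(horizons), "-"))

# AR(1) correlation matrix: correlation decays as rho^lag.
R <- rho^lag
dimnames(R) <- list(horizons, horizons)
print(R)
```

With rho = 0.6, neighbouring horizons have correlation 0.6 and the top and bottom horizons 0.36. In a fitted model rho is estimated from the data rather than fixed in advance.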

Mixed models for Fungi

Correlation between layers can be estimated and the standard errors and p-values are adjusted accordingly.

The correlations are estimated on the residuals, i.e. after the model is fitted, to see if there is any remaining dependence between the layers.

Mixed models for Fungi

Since horizon actually is a rather important factor in the model we should also consider interactions between the other factors and horizon.

The effect of tree type could be different at different soil horizons. For X3 however we will not be able to estimate this interaction.

Generalised linear models - overview

We use logistic regression or Poisson regression as base models.

For DNA sequencing data or similar data, specialised procedures often use the negative binomial distribution, since overdispersion is almost always observed.

Generalised linear models - overview

For these types of models you need to have the data observed as counts.

If your response variable is a proportion and cannot be traced back to counts, use general linear models with a normal distribution for the error term. Sometimes this will require a transformation of the observed data before the model can be fitted. (Look at residual plots to check if the residuals are normally distributed and have equal variances.)

If normality does not hold, use nonparametric methods.

Overdispersion - overview

There are several ways to handle overdispersion in data

- to use quasidistributions (this often does not work in mixed settings)

- to use the negative binomial distribution (not available in all packages, e.g. not in glm)

- to use a random residual (demands the use of mixed models even if the model itself is not mixed)
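
Before choosing one of these options it helps to check how large the overdispersion is. The sketch below uses simulated data, not the course data: the settings (200 samples, 100 reads each, sample-to-sample variation on the logit scale) are assumptions for illustration. It computes the Pearson dispersion statistic from an ordinary binomial glm and shows how the quasibinomial family rescales the standard error accordingly.

```r
set.seed(42)

# Simulated read counts: 200 samples with 100 reads each (illustration only).
n_obs <- 200
reads <- rep(100, n_obs)
# Extra sample-to-sample variation on the logit scale causes overdispersion.
p_sample <- plogis(-1 + rnorm(n_obs, sd = 1))
x_reads  <- rbinom(n_obs, size = reads, prob = p_sample)

# Ordinary binomial model that ignores the extra variation.
fit <- glm(cbind(x_reads, reads - x_reads) ~ 1, family = binomial)

# Pearson dispersion statistic: close to 1 if there is no overdispersion.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)

# The quasibinomial family inflates standard errors by sqrt(dispersion).
fit_quasi <- glm(cbind(x_reads, reads - x_reads) ~ 1, family = quasibinomial)
se_ratio <- coef(summary(fit_quasi))[1, "Std. Error"] /
            coef(summary(fit))[1, "Std. Error"]
print(c(se_ratio = se_ratio, sqrt_dispersion = sqrt(dispersion)))
```

A dispersion statistic far above 1, as here, means that standard errors from the plain binomial model are much too small.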

Overdispersion - overview

Always check that the design is well represented in the model.

Leaving out design variables (factors that are used to define the data collection) will almost always lead to overdispersion.

Mixed models - overview

If your data is collected according to a specific experimental plan or study design you need to account for this structure in the analysis.

If you do not do this, you will be left with faulty estimates of the variation and therefore wrong p-values (usually too low).

Leaving out the study design variables can also lead to overdispersion.

Mixed models - overview

Typical mixed models are

- repeated measures models, where an experimental unit is observed several times (in time or space)

- hierarchical models, where several observations are made within the experimental unit (but with no specific order)