**Quebec Centre for Biodiversity Science**

R Workshop Series

R Workshop Series

**Workshop 6: Linear mixed effects models**

**Website**

:

http://qcbs.ca/wiki/r/workshop6

:

http://qcbs.ca/wiki/r/workshop6

**Objectives**

1. What is a linear mixed model (LMM) and why should I care?

2. How do I implement, check assumptions, compare, and present the results of my LMM's in R?

1. What is a LMM and why should I care?

Ecological and biological data can be messy!

Inherent structure to data

Many covariates

Low sample sizes

1. What is a LMM and why should I care?

Introduction to the data set

1. What is a LMM and why should I care?

Q: Does fish trophic position increase with fish size?

1. What is a LMM and why should I care?

Throughout the workshop there will be a series of challenges denoted by a rubix cube.

During challenges collaborate with your neighbors!

1. What is a LMM and why should I care?

Introduction to the data set

Open the workshop script in R studio

Make plots 1-3, observe the data and try to get a sense of what is occurring.

1. What is a LMM and why should I care?

Solution

1. What is a LMM and why should I care?

Solution

1. What is a LMM and why should I care?

Solution

1. What is a LMM and why should I care?

Group discussion

Do we expect these relationships to be the same across lakes? What might differ?

For our question we just want to know if there is an overall effect of length on trophic position.

This effect might vary slightly by species due to things like differences in growth rates or by lake because of differences in food availability, but we don't care about these unmeasured factors we just want to control for them.

1. What is a LMM and why should I care?

1. Separate:

Estimate 6 intercepts and slopes for each species (i.e. 6 lakes)

Sample size (n) = 10 for each analysis (i.e. 10 fish/species/lake)

Low chance of finding an effect due to low n.

1. What is a LMM and why should I care?

2. Lump:

Huge sample size!!

What about pseudo replication? (i.e. fish within a lake and within a species will be correlated)

Look at all that noise in the data! I bet some of it is due to differences among species and lakes!

1. What is a LMM and why should I care?

How might we analyze this data? You might:

1. Separate:

Run separate analyses for each species in each lake.

2. Lump:

Run one analysis ignoring lake and species.

...

LMM's are a balance between separating and lumping. They:

1. Estimate slope and intercept parameters for each species and lake (separating).

2. Use all the data available (lumping) while accounting for pseudoreplication and controlling for differences among lakes and species.

1. What is a LMM and why should I care?

But how?

A. Intercepts and/or slopes are allowed to vary by lake and species.

B. Intercepts, slopes, and associated confidence intervals are adjusted to account for the structure of data.

1. What is a LMM and why should I care?

A. Allow intercepts and or slopes to vary for a given factor.

Only need to estimate mean and standard deviation of ND instead of *3 intercepts

Assume intercepts come from a normal distribution (ND)

This is done including species from all lakes

A. Allow intercepts and or slopes to vary for a given factor.

Same thing for lake

Estimate 2 parameters (mean and SD) instead of 6 intercepts - saves degrees of freedom

This is done including individuals of all species

A. Allow intercepts and or slopes to vary for a given factor.

Same concept for slopes just harder to visualize

Again, only estimate mean and SD instead of 3 slopes

B. Adjust intercepts and slopes to account for the structure of data.

B. Adjust confidence intervals of intercepts and slopes to account for pseudoreplication.

Based on:

Intraclass correlation coefficient (ICC)

- How much variation is there among groups versus within groups?

Will treat these more like one observation

Will treat individual observations more like they are independent

smaller effective sample size larger confidence intervals

larger effective sample size smaller confidence intervals

**Challenge**

The mixed model protocol in R

Step 1: A priori model building and data exploration

Step 2: Coding potential models and model selection

Step 3: Model validation

Step 4: Interpreting results

and visualizing the model

The mixed model protocol in R

Step 1: A priori model building and data exploration

Always make sure you've done "Housekeeping" on your data before you start building models!

The mixed model protocol in R

Step 4: Interpreting results

>summary(M8)

<--Variation in Lake intercepts

<--Variation in Species intercepts

<--Variation in Species slopes

<-- Residual variation

The mixed model protocol in R

--> Data exploration

Is the data in the right structure?

>str(data)

The mixed model protocol in R

--> Data exploration

Look at the distribution of samples across factors:

>table(data$Lake)

>table(data$Species)

*This data set is perfectly balanced, but mixed models can deal with unbalanced designs (like we so often have in Ecology!).

The mixed model protocol in R

Step 1: A priori model building and data exploration

We are interested in finding out if trophic position can be predicted by length, while accounting for variation between species and lakes

What we know:

So we now want a model that looks

something like this:

Trophic Positionijk ~ Fish Lengthi + Lakej + Speciesk

Look at the distribution of continuous variables

>hist(data$Fish_Length

>hist(data$Trophic_Pos)

Major skews can lead to problems with model homogeneity down the road. So if necessary, make transformations. In this case, the data seems OK.

The mixed model protocol in R

--> Data exploration

Check for collinearity between variables

>plot(data)

>cor()

No risk of collinearity with only 1 continuous explanatory variable

It is important not to include 2 collinear variables in the same model, otherwise their effects on the response variable will be confounded.

The mixed model protocol in R

--> Data exploration

Consider the scale of your data

If two variables in the same model have different ranges of scale, the criteria the mixed models use to come up with parameter estimates are likely to return 'convergence errors'.

Z-correction standardizes your variables and can adjust for this scaling problem

z = (x - mean(x))/sd(x)

The mixed model protocol in R

--> Data exploration

Consider the scale of your data

Length:

Long scale

Trophic position:

Short scale

The mixed model protocol in R

--> Data exploration

Consider the scale of your data

Because the data we are dealing with has such variable scales, we z-correct it.

> data$Z_Length<-(data$Fish_Length-mean(data$Fish_Length))/sd(data$Fish_Length)

> data$Z_TP<-(data$Trophic_Pos-mean(data$Trophic_Pos))/sd(data$Trophic_Pos)

z correct Length:

z correct Trophic position:

--> Data exploration

The mixed model protocol in R

--> Data exploration

To know if a mixed model is necessary for your data set you must establish if it is actually important to account for variation in the factors that might be affecting the relationship that you're interested in

Lake and Species in this case

If a certain species or lake is data poor it will rely more heavily on the group level model for the intercept and slope in this species or lake.

The mixed model protocol in R

--> Data exploration

> lm.test<-lm(Z_TP~Z_Length, data=data)

How?

Create a linear model without factors

Calculate the residuals of this linear model

Plot the residuals of the linear model against the factors of interest

> lm.test.resid<-rstandard(lm.test)

The mixed model protocol in R

--> Data exploration

> plot(lm.test.resid~ data$Fish_Species,

xlab = "Species", ylab="Standardized residuals")

abline(0,0, lty=2)

> plot(lm.test.resid~ data$Lake, xlab = "Lake", ylab="Standardized residuals")

abline(0,0, lty=2)

#Species Effect

#Lake Effect

The mixed model protocol in R

Step 1: A priori model building and data exploration

Step 2: Coding potential models and model selection

Step 3: Model validation

Step 4: Interpreting results

and visualizing the model

The mixed model protocol in R

Step 2: Coding potential models and model selection

We know we want our model to say something like this:

Trophic Positionijk ~ Fish Lengthi + Lakej + Speciesk

#Species Effect

#Lake Effect

Plot the residuals of the linear model against the factors of interest

B. Adjust confidence intervals of intercepts and slopes to account for pseudoreplication.

*The patterns suggest there likely is some residual variance that could be explained by these factors, so they should definitely be included

B. Adjust confidence intervals of intercepts and slopes to account for pseudoreplication.

In R, this turns into:

> lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE)

The mixed model protocol in R

Step 2: Coding potential models and model selection

Coding mixed models in R:

> lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE)

"linear mixed model" function

indicates varying intercepts

Estimation method

The mixed model protocol in R

Step 2: Coding potential models and model selection

Coding mixed models in R:

> lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE)

But what if we also want the slopes to vary?

The mixed model protocol in R

Step 2: Coding potential models and model selection

Coding mixed models in R:

REML (Restricted Maximum Likelihood) is the default estimation method in the "lmer" function. Generally, REML is preferred to Maximum Likelihood (ML) to compare nested random effect models, but it is safest to use ML when comparing fixed effect models. (Which we will do shortly!)

Note on estimation methods

(1|factor)

(1+ variable|factor)

The mixed model protocol in R

How will the ICC and confidence intervals be affected in these two scenarios?

1. Fish trophic positions are not variable among lakes?

2. Fish trophic positions are similar within lakes but variable among lakes?

Challenge!

> lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE)

Re-write the following code so that the slopes of the trophic position/length relationship also vary by lake and species

The mixed model protocol in R

Challenge Solution!

> lmer(Z_TP~Z_Length + (1+Z_Length|Lake) + (1+Z_Length|Species), data = data, REML=TRUE)

**Solution**

Q1. Fish trophic positions are not variable among lakes?

A1. Low ICC smaller confidence intervals

Q2. Fish trophic positions are similar within lakes but variable among lakes?

A2. High ICC larger confidence intervals

The mixed model protocol in R

Step 2: Coding potential models and model selection

To determine if you've built the best linear mixed model based on your a priori knowledge you must compare this "a-priori" model to possible alternative models

With the data set we are working with there are many potential models that might best fit the data

The mixed model protocol in R

Challenge!

Make a list of 7 alternative models to the following model that can be built and compared:

M1<-lmer(Z_TP~Z_Length + (1|Fish_Species) + (1|Lake), data=data)

M2<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1+Z_Length|Lake), data=data)

M3<-lmer(Z_TP~Z_Length + (1|Fish_Species), data=data)

M4<-lmer(Z_TP~Z_Length + (1|Lake), data=data)

M5<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species), data=data)

M6<-lmer(Z_TP~Z_Length + (1+Z_Length|Lake), data=data)

M7<-lmer(Z_TP~Z_Length + (1|Fish_Species) + (1+Z_Length|Lake), data=data)

M8<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1|Lake), data=data)

The mixed model protocol in R

Challenge Solution!

#full model with varying intercepts

#full model with varying intercepts and slopes

#no Lake, varying intercepts only

#no Species, varying intercepts only

#No Lake, varying intercepts and slopes

#No Species, varying intercepts and slopes

#Full model with varying intercepts and slopes only varying by lake

#Full model with varying intercepts and slopes only varying by species

The mixed model protocol in R

Step 2: Coding potential models and model selection

Now that we have a list of potential models, we can

compare them

to get a better idea of what is going on in our data set and to select the one(s) with the most predictive power.

> M0<-lm(Z_TP~Z_Length,data=data)

#Bonus Model!

It is always useful to build the simple linear model without any varying intercept and slope factors to see the difference in AICc values. Although "lm" does not use the same estimation method as lmer, the AICc values can be compared between the two if REML = FALSE in all lmer models

*Note - If we had different fixed effects among models we would have to indicate “REML=FALSE” to compare with likelihood methods like AIC.

The mixed model protocol in R

Step 2: Coding potential models and model selection

> AICc(M1)

To find out of the AICc value of a model use:

To group all the AICc values of the different models into one table that is easy to read use:

> AICc<-c(AICc(M0), AICc(M1), AICc(M2), AICc(M3), AICc(M4), AICc(M5), AICc(M6), AICc(M7), AICc(M8))

> Model<-c("M0", "M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8")

> AICtable<-data.frame(Model=Model, AICc=AICc)

The mixed model protocol in R

Step 2: Coding potential models and model selection

> AICtable

What do these AICc values tell us?

The mixed model protocol in R

Step 2: Coding potential models and model selection

> AICtable

What is the structure of the best-fit model?

The model with the lowest AIC value has the most predictive power given the data.

Some suggest that if models are within 2 AICc units they are equally plausible.

In this case, we can take a closer look at M8 and M2, but all other models have such high AICc values that we know that they are not explaining as much variation in the data as M8 and should not be considered as plausible options to explain the patterns in the data

M8<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1|Lake), data=data, REML=FALSE)

This tells us that both the slopes and intercepts of the relationship between TP and Length vary at species level factor in the best fit-model.

The mixed model protocol in R

Step 2: Coding potential models and model selection

What is the structure of the best-fit model?

M8<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1|Lake), data=data, REML=TRUE)

The mixed model protocol in R

Challenge!

Take 2 minutes with your neighbour to draw out the model structure of M2. Biologically, how does it differ from M8? Why is it not surprising that it's AICc value was 2nd best?

> M2<-lmer(Z_TP~Z_Length + (1+Z_Length|Lake) + (1+Z_Length|Fish_Species), data = data, REML=FALSE)

The mixed model protocol in R

Challenge Solution!

In-class group discussion

The mixed model protocol in R

Step 1: A priori model building and data exploration

Step 2: Coding potential models and model selection

Step 3: Model validation

Step 4: Interpreting results

and visualizing the model

The mixed model protocol in R

A. Look at homogeneity

-plot models fitted values vs residuals values

B. Look at independence

-plot residuals vs each covariate in the model

-plot residuals vs each covariate not in the model

C. Look at normality

-histogram

The mixed model protocol in R

Step 3: Model validation

A. Look at homogeneity

-plot models fitted values vs residuals values

>E1 <- resid(M8)

>F1<-fitted(M8)

>plot(x = F1, y = E1, xlab = "Fitted Values",

ylab = "Normalized residuals")

>abline(h = 0, lty = 2)

*Even spread of the residuals suggest that the model is a good fit for the data.

*A plot like this would suggest that there variation in the dataset that the model was unable to account for

The mixed model protocol in R

Step 3: Model validation

B. Look at independence

i) plot residuals vs each covariate in the model

>plot(x = data$Z_Length, y = E1, xlab = "Z Length",ylab = "Normalized residuals")

>abline(h = 0, lty = 2)

The mixed model protocol in R

Step 3: Model validation

B. Look at independence

ii) plot residuals vs each covariate not in the model

If you observe patterns in these plots you will know that there is variation in your dataset that could be accounted for by these covariates that were not included in the model, and so you should re-consider the inclusion of this variable in your model

Seeing as how there are no covariates we measured that we decided not to include in the model this point is not applicable to this analysis.

The equal spread above and below zero indicate that there are no independence problems with this variable

Clustering is due to the structure of the data, in which only fish of certain length categories were collected

The mixed model protocol in R

Step 3: Model validation

B. Look at independence

i) plot residuals vs each covariate in the model

>boxplot(E1 ~ Fish_Species,

ylab = "Normalized residuals",

data = data, xlab = "Species")

>abline(h = 0, lty = 2)

>boxplot(E1 ~ Lake,

ylab = "Normalized residuals",

data = data, xlab = "Lake")

>abline(h = 0, lty = 2)

No visible patterns observed, which suggests there are no problems with independence in relation to the Lake and Species variables

Step 3: Model validation

You must to check to make sure the model has met all assumptions it makes in its calculations

The mixed model protocol in R

C. Look at normality of residuals

Step 3: Model validation

Normally distributed residuals indicate that your model predictions are not biased high or low.

> E1 <- resid(M8)

> hist(E1)

The mixed model protocol in R

A. Look at homogeneity

-plot models fitted values vs residuals values

B. Look at independence

-plot residuals vs each covariate in the model

-plot residuals vs each covariate not in the model

C. Look at normality

-histogram

Step 3: Model validation

The mixed model protocol in R

Step 1: A priori model building and data exploration

Step 2: Coding potential models and model selection

Step 3: Model validation

Step 4: Interpreting results

and visualizing the model

What is "variance"?

SD^2

What is a "p-value"?

The estimated slope +/- the 95% confidence interval

SE*2

Insignificant slope

Significant slope

The mixed model protocol in R

Challenge!

a) What is the slope and confidence interval of the Z_Length variable in the M8 model?

b) Is the Z_Length slope significantly different from 0?

The mixed model protocol in R

Challenge solution!

>summary(M8)

a) What is the slope and confidence interval of the Z_Length variable in the M8 model?

-Slope = 0.422

-CI = 0.09*2 = 0.18

b) Is the Z_Length slope significantly different to 0?

-Yes, because the CI does not overlap with 0

The mixed model protocol in R

Step 4: Interpreting results

and visualizing the model

Mixed model resources

Quebec Centre for Biodiversity Science

R Workshop Series

Workshop 6: Linear mixed effects models

Feedback: https://docs.google.com/spreadsheet/ccc?key=0AhCQzc0AsZ0OdHZoWE1PUi1kNmttZV96VEViY0sxVEE#gid=0

Differences between nlme and lme4

Challenge!

Challenge!

Discuss the data set your currently working on with you neighbour and determine if a mixed model would be appropriate for it.

If so, work with your neighbour to write out the code you would use in R for this model.

If not, come up with a fictive ecological data set for which a mixed model could be useful and write out the model

Challenge!

Solution

Challenge!

Solution

Challenge!

Solution

Challenge!

Solution

> lmer(Mercury~Length*Habitat_Type+ (1|Site)

Situation:

You collected 200 fish from 12 different sites evenly distributed across 4 habitat types found within 1 lake. You measured the length of each fish and the quantity of mercury in its tissue. You are mainly interested in knowing if habitat is a good predictor of mercury concentration.

-What mixed model could you use for this data set?

Classroom discussion of examples

Challenge!

Solution

Challenge!

Solution

Challenge!

Challenge!

Challenge!

Challenge!

Mixed models

and ecological data

Mixed models

and ecological data

Mixed models are really good at accounting for variation in ecological data while not loosing too many degrees of freedom

*Note the more levels you have in a factor the more accurately you can estimate the mean and standard deviation of the ND. Three may be a little low but it is easier to visualize!

Challenge!

Challenge!

Take two minutes to sketch out different ways you could plot out the results of M8.

*hint: consider the different "levels" of the model

Challenge!

Solution

Challenge!

Solution

The various intercepts and slopes generated by the model can be placed into figures to better visualize the results of a mixed model.

a) All data grouped

b) The species level figure

c) The lake level figure

The mixed model protocol in R

Step 4: Interpreting results

and visualizing the model

To make these figures you must obtain the coefficients (intercept and slope) of each component of the model

Overall model coefficients (aka: "coefs") can be found in the summary of the model

Coefs for each of the model levels (in this case: Lake and Species) can be obtained using the "coef" function

a) All data grouped

b) The species level figure

c) The lake level figure

The mixed model protocol in R

How to visualize model results

All data grouped

1-Obtain coefficients of interest

Intercept = -0.0009059

Slope = 0.4222697

> plot <- ggplot(aes(Z_Length,Z_TP),

data=data)

> Plot_AllData <- plot + geom_point()

+ xlab("Length (mm)")

+ ylab("Trophic Position")

+ labs(title="All Data") + fig

> Plot_AllData

+ geom_abline(intercept = -0.0009059,

slope =0.4222697)

#Plot all data into one figure

#Add an abline line with coefs

2- Illustrate coefs in a figure

>summary(M8)

The Species level figure

1-Obtain coefficients of interest

>coef(M8)

2-Plot the data color coded by species and add regression lines for each species using extracted coefs

*See code for details concerning the figure

The Lake level figure

1-Obtain coefficients of interest

>coef(M8)

2-Plot the data color coded by species and add regression lines for each species using extracted coefs

*See code for details concerning the figure

Situation:

You collected biodiversity estimates from 1000 quadrats that are within 10 different sites which are within 10 different forests. You measured the productivity in each quadrat as well. You are mainly interested in knowing if productivity is a good predictor of biodiversity.

-What mixed model could you use for this data set?

> lmer(Bio_Div ~ Productivity + (1|Forest/Site))

# Here the random effects are nested (i.e. Sites within forest) and not crossed

1. What is a LMM and why should I care?

Group discussion

Do we expect all species to increase in trophic position with length? In the exact same way?

The mixed model protocol in R

--> Data exploration

Check for collinearity between variables

What are some additional measures we could have gotten in the wild that might have been

collinear with length?

The mixed model protocol in R

Create a linear model without factors

Calculate the residuals of this linear model

Plot the residuals of the linear model against the factors of interest

How?

The mixed model protocol in R

Step 2: Coding potential models and model selection

Models can be compared using the "AICc" function from the "AICcmodavg" package

The Akaike Information Criterion (AIC) is a

measure of model quality

that can be used to compare models

AICc corrects for bias created by small sample size when estimating AIC

Fitted values

Normalized residuals

Fixed VS Random effect

In the literature of LMMs, you will meet those terms often.

There are many possible definitions of fixed and random effects and we chose to present definitions that are easy to apply to your data.

1. What is a LMM and why should I care?

Fixed effect

data is usually gathered from all it's possible levels

making conclusions about the levels for which the data was gathered

1. What is a LMM and why should I care?

Random effect

only categorical variables =

Random factors

data only includes a random sample of the factor's many possible levels, which are all of interest.

Often grouping factors

1. What is a LMM and why should I care?