Loading presentation...

Present Remotely

Send the link below via email or IM

Copy

Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.

DeleteCancel

Make your likes visible on Facebook?

Connect your Facebook account to Prezi and let your likes appear on your timeline.
You can change this under Settings & Account at any time.

No, thanks

QCBS R Workshop 6

No description
by

CSBQ QCBS

on 5 September 2016

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of QCBS R Workshop 6

Quebec Centre for Biodiversity Science

R Workshop Series

Workshop 6: Linear mixed effects models
Website
:
http://qcbs.ca/wiki/r/workshop6

Objectives
1. What is a linear mixed model (LMM) and why should I care?

2. How do I implement, check assumptions, compare, and present the results of my LMM's in R?
1. What is a LMM and why should I care?
Ecological and biological data can be messy!

Inherent structure to data
Many covariates
Low sample sizes
1. What is a LMM and why should I care?
Introduction to the data set
1. What is a LMM and why should I care?
Q: Does fish trophic position increase with fish size?
1. What is a LMM and why should I care?
Throughout the workshop there will be a series of challenges denoted by a rubix cube.
During challenges collaborate with your neighbors!
1. What is a LMM and why should I care?
Introduction to the data set

Open the workshop script in R studio

Make plots 1-3, observe the data and try to get a sense of what is occurring.
1. What is a LMM and why should I care?
Solution
1. What is a LMM and why should I care?
Solution
1. What is a LMM and why should I care?
Solution
1. What is a LMM and why should I care?



Group discussion

Do we expect these relationships to be the same across lakes? What might differ?
For our question we just want to know if there is an overall effect of length on trophic position.

This effect might vary slightly by species due to things like differences in growth rates or by lake because of differences in food availability, but we don't care about these unmeasured factors we just want to control for them.
1. What is a LMM and why should I care?
1. Separate:
Estimate 6 intercepts and slopes for each species (i.e. 6 lakes)

Sample size (n) = 10 for each analysis (i.e. 10 fish/species/lake)

Low chance of finding an effect due to low n.

1. What is a LMM and why should I care?
2. Lump:
Huge sample size!!

What about pseudo replication? (i.e. fish within a lake and within a species will be correlated)

Look at all that noise in the data! I bet some of it is due to differences among species and lakes!

1. What is a LMM and why should I care?
How might we analyze this data? You might:

1. Separate:
Run separate analyses for each species in each lake.

2. Lump:
Run one analysis ignoring lake and species.
...
LMM's are a balance between separating and lumping. They:

1. Estimate slope and intercept parameters for each species and lake (separating).

2. Use all the data available (lumping) while accounting for pseudoreplication and controlling for differences among lakes and species.
1. What is a LMM and why should I care?
But how?

A. Intercepts and/or slopes are allowed to vary by lake and species.

B. Intercepts, slopes, and associated confidence intervals are adjusted to account for the structure of data.

1. What is a LMM and why should I care?
A. Allow intercepts and or slopes to vary for a given factor.
Only need to estimate mean and standard deviation of ND instead of *3 intercepts
Assume intercepts come from a normal distribution (ND)
This is done including species from all lakes
A. Allow intercepts and or slopes to vary for a given factor.
Same thing for lake
Estimate 2 parameters (mean and SD) instead of 6 intercepts - saves degrees of freedom
This is done including individuals of all species
A. Allow intercepts and or slopes to vary for a given factor.
Same concept for slopes just harder to visualize
Again, only estimate mean and SD instead of 3 slopes
B. Adjust intercepts and slopes to account for the structure of data.

B. Adjust confidence intervals of intercepts and slopes to account for pseudoreplication.

Based on:

Intraclass correlation coefficient (ICC)
- How much variation is there among groups versus within groups?
Will treat these more like one observation
Will treat individual observations more like they are independent
smaller effective sample size larger confidence intervals
larger effective sample size smaller confidence intervals
Challenge
The mixed model protocol in R
Step 1: A priori model building and data exploration

Step 2: Coding potential models and model selection

Step 3: Model validation


Step 4: Interpreting results
and visualizing the model
The mixed model protocol in R
Step 1: A priori model building and data exploration
Always make sure you've done "Housekeeping" on your data before you start building models!
The mixed model protocol in R
Step 4: Interpreting results
>summary(M8)
<--Variation in Lake intercepts
<--Variation in Species intercepts
<--Variation in Species slopes
<-- Residual variation
The mixed model protocol in R
--> Data exploration
Is the data in the right structure?
>str(data)
The mixed model protocol in R
--> Data exploration
Look at the distribution of samples across factors:
>table(data$Lake)

>table(data$Species)

*This data set is perfectly balanced, but mixed models can deal with unbalanced designs (like we so often have in Ecology!).

The mixed model protocol in R
Step 1: A priori model building and data exploration
We are interested in finding out if trophic position can be predicted by length, while accounting for variation between species and lakes
What we know:
So we now want a model that looks
something like this:
Trophic Positionijk ~ Fish Lengthi + Lakej + Speciesk
Look at the distribution of continuous variables
>hist(data$Fish_Length
>hist(data$Trophic_Pos)
Major skews can lead to problems with model homogeneity down the road. So if necessary, make transformations. In this case, the data seems OK.
The mixed model protocol in R
--> Data exploration
Check for collinearity between variables
>plot(data)
>cor()
No risk of collinearity with only 1 continuous explanatory variable

It is important not to include 2 collinear variables in the same model, otherwise their effects on the response variable will be confounded.
The mixed model protocol in R
--> Data exploration
Consider the scale of your data
If two variables in the same model have different ranges of scale, the criteria the mixed models use to come up with parameter estimates are likely to return 'convergence errors'.

Z-correction standardizes your variables and can adjust for this scaling problem
z = (x - mean(x))/sd(x)
The mixed model protocol in R
--> Data exploration
Consider the scale of your data
Length:
Long scale
Trophic position:
Short scale
The mixed model protocol in R
--> Data exploration
Consider the scale of your data
Because the data we are dealing with has such variable scales, we z-correct it.
> data$Z_Length<-(data$Fish_Length-mean(data$Fish_Length))/sd(data$Fish_Length)

> data$Z_TP<-(data$Trophic_Pos-mean(data$Trophic_Pos))/sd(data$Trophic_Pos)


z correct Length:
z correct Trophic position:
--> Data exploration
The mixed model protocol in R
--> Data exploration
To know if a mixed model is necessary for your data set you must establish if it is actually important to account for variation in the factors that might be affecting the relationship that you're interested in

Lake and Species in this case
If a certain species or lake is data poor it will rely more heavily on the group level model for the intercept and slope in this species or lake.
The mixed model protocol in R
--> Data exploration
> lm.test<-lm(Z_TP~Z_Length, data=data)

How?
Create a linear model without factors
Calculate the residuals of this linear model

Plot the residuals of the linear model against the factors of interest

> lm.test.resid<-rstandard(lm.test)
The mixed model protocol in R
--> Data exploration

> plot(lm.test.resid~ data$Fish_Species,
xlab = "Species", ylab="Standardized residuals")
abline(0,0, lty=2)


> plot(lm.test.resid~ data$Lake, xlab = "Lake", ylab="Standardized residuals")
abline(0,0, lty=2)
#Species Effect

#Lake Effect

The mixed model protocol in R
Step 1: A priori model building and data exploration

Step 2: Coding potential models and model selection

Step 3: Model validation


Step 4: Interpreting results
and visualizing the model
The mixed model protocol in R
Step 2: Coding potential models and model selection
We know we want our model to say something like this:
Trophic Positionijk ~ Fish Lengthi + Lakej + Speciesk
#Species Effect

#Lake Effect

Plot the residuals of the linear model against the factors of interest
B. Adjust confidence intervals of intercepts and slopes to account for pseudoreplication.

*The patterns suggest there likely is some residual variance that could be explained by these factors, so they should definitely be included
B. Adjust confidence intervals of intercepts and slopes to account for pseudoreplication.

In R, this turns into:
> lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE)
The mixed model protocol in R
Step 2: Coding potential models and model selection
Coding mixed models in R:
> lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE)
"linear mixed model" function
indicates varying intercepts
Estimation method
The mixed model protocol in R
Step 2: Coding potential models and model selection
Coding mixed models in R:
> lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE)
But what if we also want the slopes to vary?
The mixed model protocol in R
Step 2: Coding potential models and model selection
Coding mixed models in R:
REML (Restricted Maximum Likelihood) is the default estimation method in the "lmer" function. Generally, REML is preferred to Maximum Likelihood (ML) to compare nested random effect models, but it is safest to use ML when comparing fixed effect models. (Which we will do shortly!)
Note on estimation methods
(1|factor)
(1+ variable|factor)
The mixed model protocol in R
How will the ICC and confidence intervals be affected in these two scenarios?

1. Fish trophic positions are not variable among lakes?

2. Fish trophic positions are similar within lakes but variable among lakes?
Challenge!
> lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE)
Re-write the following code so that the slopes of the trophic position/length relationship also vary by lake and species
The mixed model protocol in R
Challenge Solution!
> lmer(Z_TP~Z_Length + (1+Z_Length|Lake) + (1+Z_Length|Species), data = data, REML=TRUE)
Solution
Q1. Fish trophic positions are not variable among lakes?

A1. Low ICC smaller confidence intervals

Q2. Fish trophic positions are similar within lakes but variable among lakes?

A2. High ICC larger confidence intervals
The mixed model protocol in R
Step 2: Coding potential models and model selection
To determine if you've built the best linear mixed model based on your a priori knowledge you must compare this "a-priori" model to possible alternative models

With the data set we are working with there are many potential models that might best fit the data
The mixed model protocol in R
Challenge!
Make a list of 7 alternative models to the following model that can be built and compared:

M1<-lmer(Z_TP~Z_Length + (1|Fish_Species) + (1|Lake), data=data)


M2<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1+Z_Length|Lake), data=data)


M3<-lmer(Z_TP~Z_Length + (1|Fish_Species), data=data)


M4<-lmer(Z_TP~Z_Length + (1|Lake), data=data)


M5<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species), data=data)


M6<-lmer(Z_TP~Z_Length + (1+Z_Length|Lake), data=data)


M7<-lmer(Z_TP~Z_Length + (1|Fish_Species) + (1+Z_Length|Lake), data=data)


M8<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1|Lake), data=data)

The mixed model protocol in R
Challenge Solution!
#full model with varying intercepts
#full model with varying intercepts and slopes
#no Lake, varying intercepts only
#no Species, varying intercepts only
#No Lake, varying intercepts and slopes
#No Species, varying intercepts and slopes
#Full model with varying intercepts and slopes only varying by lake
#Full model with varying intercepts and slopes only varying by species
The mixed model protocol in R
Step 2: Coding potential models and model selection



Now that we have a list of potential models, we can
compare them
to get a better idea of what is going on in our data set and to select the one(s) with the most predictive power.

> M0<-lm(Z_TP~Z_Length,data=data)
#Bonus Model!
It is always useful to build the simple linear model without any varying intercept and slope factors to see the difference in AICc values. Although "lm" does not use the same estimation method as lmer, the AICc values can be compared between the two if REML = FALSE in all lmer models
*Note - If we had different fixed effects among models we would have to indicate “REML=FALSE” to compare with likelihood methods like AIC.
The mixed model protocol in R
Step 2: Coding potential models and model selection
> AICc(M1)
To find out of the AICc value of a model use:
To group all the AICc values of the different models into one table that is easy to read use:
> AICc<-c(AICc(M0), AICc(M1), AICc(M2), AICc(M3), AICc(M4), AICc(M5), AICc(M6), AICc(M7), AICc(M8))

> Model<-c("M0", "M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8")

> AICtable<-data.frame(Model=Model, AICc=AICc)

The mixed model protocol in R
Step 2: Coding potential models and model selection
> AICtable
What do these AICc values tell us?
The mixed model protocol in R
Step 2: Coding potential models and model selection
> AICtable
What is the structure of the best-fit model?
The model with the lowest AIC value has the most predictive power given the data.

Some suggest that if models are within 2 AICc units they are equally plausible.

In this case, we can take a closer look at M8 and M2, but all other models have such high AICc values that we know that they are not explaining as much variation in the data as M8 and should not be considered as plausible options to explain the patterns in the data
M8<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1|Lake), data=data, REML=FALSE)
This tells us that both the slopes and intercepts of the relationship between TP and Length vary at species level factor in the best fit-model.
The mixed model protocol in R
Step 2: Coding potential models and model selection
What is the structure of the best-fit model?
M8<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1|Lake), data=data, REML=TRUE)
The mixed model protocol in R
Challenge!
Take 2 minutes with your neighbour to draw out the model structure of M2. Biologically, how does it differ from M8? Why is it not surprising that it's AICc value was 2nd best?
> M2<-lmer(Z_TP~Z_Length + (1+Z_Length|Lake) + (1+Z_Length|Fish_Species), data = data, REML=FALSE)
The mixed model protocol in R
Challenge Solution!
In-class group discussion
The mixed model protocol in R
Step 1: A priori model building and data exploration

Step 2: Coding potential models and model selection

Step 3: Model validation


Step 4: Interpreting results
and visualizing the model
The mixed model protocol in R
A. Look at homogeneity
-plot models fitted values vs residuals values

B. Look at independence
-plot residuals vs each covariate in the model
-plot residuals vs each covariate not in the model

C. Look at normality
-histogram
The mixed model protocol in R
Step 3: Model validation

A. Look at homogeneity
-plot models fitted values vs residuals values

>E1 <- resid(M8)
>F1<-fitted(M8)
>plot(x = F1, y = E1, xlab = "Fitted Values",
ylab = "Normalized residuals")
>abline(h = 0, lty = 2)
*Even spread of the residuals suggest that the model is a good fit for the data.
*A plot like this would suggest that there variation in the dataset that the model was unable to account for
The mixed model protocol in R
Step 3: Model validation

B. Look at independence
i) plot residuals vs each covariate in the model



>plot(x = data$Z_Length, y = E1, xlab = "Z Length",ylab = "Normalized residuals")
>abline(h = 0, lty = 2)
The mixed model protocol in R
Step 3: Model validation

B. Look at independence

ii) plot residuals vs each covariate not in the model

If you observe patterns in these plots you will know that there is variation in your dataset that could be accounted for by these covariates that were not included in the model, and so you should re-consider the inclusion of this variable in your model

Seeing as how there are no covariates we measured that we decided not to include in the model this point is not applicable to this analysis.
The equal spread above and below zero indicate that there are no independence problems with this variable
Clustering is due to the structure of the data, in which only fish of certain length categories were collected
The mixed model protocol in R
Step 3: Model validation

B. Look at independence
i) plot residuals vs each covariate in the model


>boxplot(E1 ~ Fish_Species,
ylab = "Normalized residuals",
data = data, xlab = "Species")
>abline(h = 0, lty = 2)

>boxplot(E1 ~ Lake,
ylab = "Normalized residuals",
data = data, xlab = "Lake")
>abline(h = 0, lty = 2)
No visible patterns observed, which suggests there are no problems with independence in relation to the Lake and Species variables
Step 3: Model validation
You must to check to make sure the model has met all assumptions it makes in its calculations


The mixed model protocol in R
C. Look at normality of residuals

Step 3: Model validation

Normally distributed residuals indicate that your model predictions are not biased high or low.
> E1 <- resid(M8)
> hist(E1)
The mixed model protocol in R
A. Look at homogeneity
-plot models fitted values vs residuals values

B. Look at independence
-plot residuals vs each covariate in the model
-plot residuals vs each covariate not in the model

C. Look at normality
-histogram
Step 3: Model validation


The mixed model protocol in R
Step 1: A priori model building and data exploration

Step 2: Coding potential models and model selection

Step 3: Model validation


Step 4: Interpreting results
and visualizing the model
What is "variance"?
SD^2
What is a "p-value"?
The estimated slope +/- the 95% confidence interval
SE*2
Insignificant slope
Significant slope
The mixed model protocol in R
Challenge!
a) What is the slope and confidence interval of the Z_Length variable in the M8 model?

b) Is the Z_Length slope significantly different from 0?
The mixed model protocol in R
Challenge solution!
>summary(M8)
a) What is the slope and confidence interval of the Z_Length variable in the M8 model?
-Slope = 0.422
-CI = 0.09*2 = 0.18

b) Is the Z_Length slope significantly different to 0?
-Yes, because the CI does not overlap with 0
The mixed model protocol in R
Step 4: Interpreting results
and visualizing the model
Mixed model resources
Quebec Centre for Biodiversity Science

R Workshop Series
Workshop 6: Linear mixed effects models
Feedback: https://docs.google.com/spreadsheet/ccc?key=0AhCQzc0AsZ0OdHZoWE1PUi1kNmttZV96VEViY0sxVEE#gid=0
Differences between nlme and lme4

Challenge!

Discuss the data set your currently working on with you neighbour and determine if a mixed model would be appropriate for it.

If so, work with your neighbour to write out the code you would use in R for this model.
If not, come up with a fictive ecological data set for which a mixed model could be useful and write out the model



Challenge!
Solution


Challenge!
Solution



> lmer(Mercury~Length*Habitat_Type+ (1|Site)

Situation:
You collected 200 fish from 12 different sites evenly distributed across 4 habitat types found within 1 lake. You measured the length of each fish and the quantity of mercury in its tissue. You are mainly interested in knowing if habitat is a good predictor of mercury concentration.
-What mixed model could you use for this data set?


Classroom discussion of examples

Challenge!
Solution


Challenge!


Challenge!


Mixed models
and ecological data

Mixed models are really good at accounting for variation in ecological data while not loosing too many degrees of freedom

*Note the more levels you have in a factor the more accurately you can estimate the mean and standard deviation of the ND. Three may be a little low but it is easier to visualize!

Challenge!

Take two minutes to sketch out different ways you could plot out the results of M8.


*hint: consider the different "levels" of the model

Challenge!
Solution

The various intercepts and slopes generated by the model can be placed into figures to better visualize the results of a mixed model.
a) All data grouped
b) The species level figure
c) The lake level figure
The mixed model protocol in R
Step 4: Interpreting results
and visualizing the model
To make these figures you must obtain the coefficients (intercept and slope) of each component of the model
Overall model coefficients (aka: "coefs") can be found in the summary of the model
Coefs for each of the model levels (in this case: Lake and Species) can be obtained using the "coef" function
a) All data grouped
b) The species level figure
c) The lake level figure
The mixed model protocol in R
How to visualize model results
All data grouped
1-Obtain coefficients of interest
Intercept = -0.0009059
Slope = 0.4222697

> plot <- ggplot(aes(Z_Length,Z_TP),
data=data)
> Plot_AllData <- plot + geom_point()
+ xlab("Length (mm)")
+ ylab("Trophic Position")
+ labs(title="All Data") + fig


> Plot_AllData
+ geom_abline(intercept = -0.0009059,
slope =0.4222697)


#Plot all data into one figure
#Add an abline line with coefs
2- Illustrate coefs in a figure
>summary(M8)
The Species level figure
1-Obtain coefficients of interest
>coef(M8)
2-Plot the data color coded by species and add regression lines for each species using extracted coefs
*See code for details concerning the figure
The Lake level figure
1-Obtain coefficients of interest
>coef(M8)
2-Plot the data color coded by species and add regression lines for each species using extracted coefs
*See code for details concerning the figure
Situation:
You collected biodiversity estimates from 1000 quadrats that are within 10 different sites which are within 10 different forests. You measured the productivity in each quadrat as well. You are mainly interested in knowing if productivity is a good predictor of biodiversity.

-What mixed model could you use for this data set?


> lmer(Bio_Div ~ Productivity + (1|Forest/Site))

# Here the random effects are nested (i.e. Sites within forest) and not crossed


1. What is a LMM and why should I care?


Group discussion

Do we expect all species to increase in trophic position with length? In the exact same way?
The mixed model protocol in R
--> Data exploration
Check for collinearity between variables

What are some additional measures we could have gotten in the wild that might have been
collinear with length?
The mixed model protocol in R
Create a linear model without factors
Calculate the residuals of this linear model
Plot the residuals of the linear model against the factors of interest
How?
The mixed model protocol in R
Step 2: Coding potential models and model selection

Models can be compared using the "AICc" function from the "AICcmodavg" package

The Akaike Information Criterion (AIC) is a
measure of model quality
that can be used to compare models

AICc corrects for bias created by small sample size when estimating AIC
Fitted values
Normalized residuals

Fixed VS Random effect

In the literature of LMMs, you will meet those terms often.

There are many possible definitions of fixed and random effects and we chose to present definitions that are easy to apply to your data.

1. What is a LMM and why should I care?

Fixed effect

data is usually gathered from all it's possible levels
making conclusions about the levels for which the data was gathered


1. What is a LMM and why should I care?

Random effect

only categorical variables =
Random factors
data only includes a random sample of the factor's many possible levels, which are all of interest.
Often grouping factors


1. What is a LMM and why should I care?
Full transcript