Introducing 

Prezi AI.

Your new presentation assistant.

Refine, enhance, and tailor your content, source relevant images, and edit visuals quicker than ever before.

Loading…
Transcript

1. What is a LMM and why should I care?

Objectives

1. What is a LMM and why should I care?

Introduction to the data set

Q: Does fish trophic position increase with fish size?

Throughout the workshop there will be a series of challenges denoted by a rubix cube.

Ecological and biological data can be messy!

  • Inherent structure to data
  • Many covariates
  • Low sample sizes

Introduction to the data set

Open the workshop script in R studio

Make plots 1-3, observe the data and try to get a sense of what is occurring.

1. What is a linear mixed model (LMM) and why should I care?

2. How do I implement, check assumptions, compare, and present the results of my LMM's in R?

...

During challenges collaborate with your neighbors!

Quebec Centre for Biodiversity Science

R Workshop Series

1. What is a LMM and why should I care?

Solution

1. Separate:

Group discussion

  • Do we expect these relationships to be the same across lakes? What might differ?

Group discussion

  • Do we expect all species to increase in trophic position with length? In the exact same way?

How might we analyze this data? You might:

1. Separate: Run separate analyses for each species in each lake.

2. Lump: Run one analysis ignoring lake and species.

  • Estimate 6 intercepts and slopes for each species (i.e. 6 lakes)

  • Sample size (n) = 10 for each analysis (i.e. 10 fish/species/lake)

  • Low chance of finding an effect due to low n.

Workshop 6: Linear mixed effects models

1. What is a LMM and why should I care?

A. Allow intercepts and or slopes to vary for a given factor.

A. Allow intercepts and or slopes to vary for a given factor.

2. Lump:

Assume intercepts come from a normal distribution (ND)

Same thing for lake

Same concept for slopes just harder to visualize

Only need to estimate mean and standard deviation of ND instead of *3 intercepts

  • Huge sample size!!

  • What about pseudo replication? (i.e. fish within a lake and within a species will be correlated)

  • Look at all that noise in the data! I bet some of it is due to differences among species and lakes!

Fixed effect

  • data is usually gathered from all it's possible levels
  • making conclusions about the levels for which the data was gathered

But how?

A. Intercepts and/or slopes are allowed to vary by lake and species.

B. Intercepts, slopes, and associated confidence intervals are adjusted to account for the structure of data.

Estimate 2 parameters (mean and SD) instead of 6 intercepts - saves degrees of freedom

LMM's are a balance between separating and lumping. They:

1. Estimate slope and intercept parameters for each species and lake (separating).

2. Use all the data available (lumping) while accounting for pseudoreplication and controlling for differences among lakes and species.

Website: http://qcbs.ca/wiki/r/workshop6

Fixed VS Random effect

In the literature of LMMs, you will meet those terms often.

There are many possible definitions of fixed and random effects and we chose to present definitions that are easy to apply to your data.

Random effect

  • only categorical variables = Random factors
  • data only includes a random sample of the factor's many possible levels, which are all of interest.
  • Often grouping factors

Again, only estimate mean and SD instead of 3 slopes

This is done including species from all lakes

This is done including individuals of all species

For our question we just want to know if there is an overall effect of length on trophic position.

This effect might vary slightly by species due to things like differences in growth rates or by lake because of differences in food availability, but we don't care about these unmeasured factors we just want to control for them.

*Note the more levels you have in a factor the more accurately you can estimate the mean and standard deviation of the ND. Three may be a little low but it is easier to visualize!

Solution

Challenge

B. Adjust intercepts and slopes to account for the structure of data.

B. Adjust confidence intervals of intercepts and slopes to account for pseudoreplication.

smaller effective sample size larger confidence intervals

larger effective sample size smaller confidence intervals

If a certain species or lake is data poor it will rely more heavily on the group level model for the intercept and slope in this species or lake.

Q1. Fish trophic positions are not variable among lakes?

A1. Low ICC smaller confidence intervals

Q2. Fish trophic positions are similar within lakes but variable among lakes?

A2. High ICC larger confidence intervals

How will the ICC and confidence intervals be affected in these two scenarios?

1. Fish trophic positions are not variable among lakes?

2. Fish trophic positions are similar within lakes but variable among lakes?

Based on:

Intraclass correlation coefficient (ICC) - How much variation is there among groups versus within groups?

Will treat individual observations more like they are independent

Will treat these more like one observation

The mixed model protocol in R

--> Data exploration

Step 1: A priori model building and data exploration

Check for collinearity between variables

Look at the distribution of continuous variables

Step 1: A priori model building and data exploration

--> Data exploration

Check for collinearity between variables

What we know:

Look at the distribution of samples across factors:

>hist(data$Fish_Length

>hist(data$Trophic_Pos)

Always make sure you've done "Housekeeping" on your data before you start building models!

  • It is important not to include 2 collinear variables in the same model, otherwise their effects on the response variable will be confounded.

Step 2: Coding potential models and model selection

We are interested in finding out if trophic position can be predicted by length, while accounting for variation between species and lakes

>table(data$Species)

>table(data$Lake)

Is the data in the right structure?

  • What are some additional measures we could have gotten in the wild that might have been

collinear with length?

Step 3: Model validation

>plot(data)

>cor()

>str(data)

So we now want a model that looks

something like this:

Step 4: Interpreting results

and visualizing the model

Trophic Positionijk ~ Fish Lengthi + Lakej + Speciesk

  • No risk of collinearity with only 1 continuous explanatory variable

Major skews can lead to problems with model homogeneity down the road. So if necessary, make transformations. In this case, the data seems OK.

*This data set is perfectly balanced, but mixed models can deal with unbalanced designs (like we so often have in Ecology!).

The mixed model protocol in R

--> Data exploration

How?

--> Data exploration

Consider the scale of your data

How?

  • Plot the residuals of the linear model against the factors of interest
  • Create a linear model without factors
  • Because the data we are dealing with has such variable scales, we z-correct it.

Trophic position:

Short scale

Length:

Long scale

> lm.test<-lm(Z_TP~Z_Length, data=data)

  • Create a linear model without factors
  • Calculate the residuals of this linear model
  • Plot the residuals of the linear model against the factors of interest

#Species Effect

#Lake Effect

  • Calculate the residuals of this linear model

> lm.test.resid<-rstandard(lm.test)

To know if a mixed model is necessary for your data set you must establish if it is actually important to account for variation in the factors that might be affecting the relationship that you're interested in

  • Lake and Species in this case

z correct Length:

  • If two variables in the same model have different ranges of scale, the criteria the mixed models use to come up with parameter estimates are likely to return 'convergence errors'.

  • Z-correction standardizes your variables and can adjust for this scaling problem
  • z = (x - mean(x))/sd(x)
  • Plot the residuals of the linear model against the factors of interest

> data$Z_Length<-(data$Fish_Length-mean(data$Fish_Length))/sd(data$Fish_Length)

#Species Effect

#Lake Effect

z correct Trophic position:

> plot(lm.test.resid~ data$Lake, xlab = "Lake", ylab="Standardized residuals")

abline(0,0, lty=2)

> plot(lm.test.resid~ data$Fish_Species,

xlab = "Species", ylab="Standardized residuals")

abline(0,0, lty=2)

> data$Z_TP<-(data$Trophic_Pos-mean(data$Trophic_Pos))/sd(data$Trophic_Pos)

*The patterns suggest there likely is some residual variance that could be explained by these factors, so they should definitely be included

The mixed model protocol in R

Challenge!

Step 2: Coding potential models and model selection

  • Coding mixed models in R:

Step 1: A priori model building and data exploration

Step 2: Coding potential models and model selection

  • We know we want our model to say something like this:

Step 2: Coding potential models and model selection

Trophic Positionijk ~ Fish Lengthi + Lakej + Speciesk

> lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE)

  • In R, this turns into:

indicates varying intercepts

"linear mixed model" function

Re-write the following code so that the slopes of the trophic position/length relationship also vary by lake and species

Step 3: Model validation

But what if we also want the slopes to vary?

Note on estimation methods

> lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE)

Step 4: Interpreting results

and visualizing the model

Estimation method

REML (Restricted Maximum Likelihood) is the default estimation method in the "lmer" function. Generally, REML is preferred to Maximum Likelihood (ML) to compare nested random effect models, but it is safest to use ML when comparing fixed effect models. (Which we will do shortly!)

> lmer(Z_TP~Z_Length + (1|Lake) + (1|Species), data = data, REML=TRUE)

(1|factor)

(1+ variable|factor)

The mixed model protocol in R

The mixed model protocol in R

Challenge Solution!

The mixed model protocol in R

Challenge!

Step 2: Coding potential models and model selection

Challenge Solution!

#full model with varying intercepts

To find out of the AICc value of a model use:

#full model with varying intercepts and slopes

Step 2: Coding potential models and model selection

  • Now that we have a list of potential models, we can compare them to get a better idea of what is going on in our data set and to select the one(s) with the most predictive power.

> AICc(M1)

Make a list of 7 alternative models to the following model that can be built and compared:

Step 2: Coding potential models and model selection

  • Models can be compared using the "AICc" function from the "AICcmodavg" package

  • The Akaike Information Criterion (AIC) is a measure of model quality that can be used to compare models

  • AICc corrects for bias created by small sample size when estimating AIC

#no Lake, varying intercepts only

#no Species, varying intercepts only

To group all the AICc values of the different models into one table that is easy to read use:

Step 2: Coding potential models and model selection

  • To determine if you've built the best linear mixed model based on your a priori knowledge you must compare this "a-priori" model to possible alternative models

  • With the data set we are working with there are many potential models that might best fit the data

M1<-lmer(Z_TP~Z_Length + (1|Fish_Species) + (1|Lake), data=data)

M2<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1+Z_Length|Lake), data=data)

M3<-lmer(Z_TP~Z_Length + (1|Fish_Species), data=data)

M4<-lmer(Z_TP~Z_Length + (1|Lake), data=data)

M5<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species), data=data)

M6<-lmer(Z_TP~Z_Length + (1+Z_Length|Lake), data=data)

M7<-lmer(Z_TP~Z_Length + (1|Fish_Species) + (1+Z_Length|Lake), data=data)

M8<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1|Lake), data=data)

> lmer(Z_TP~Z_Length + (1+Z_Length|Lake) + (1+Z_Length|Species), data = data, REML=TRUE)

#No Lake, varying intercepts and slopes

#No Species, varying intercepts and slopes

#Full model with varying intercepts and slopes only varying by lake

> AICc<-c(AICc(M0), AICc(M1), AICc(M2), AICc(M3), AICc(M4), AICc(M5), AICc(M6), AICc(M7), AICc(M8))

> Model<-c("M0", "M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8")

> AICtable<-data.frame(Model=Model, AICc=AICc)

#Full model with varying intercepts and slopes only varying by species

*Note - If we had different fixed effects among models we would have to indicate “REML=FALSE” to compare with likelihood methods like AIC.

#Bonus Model!

It is always useful to build the simple linear model without any varying intercept and slope factors to see the difference in AICc values. Although "lm" does not use the same estimation method as lmer, the AICc values can be compared between the two if REML = FALSE in all lmer models

> M0<-lm(Z_TP~Z_Length,data=data)

The mixed model protocol in R

The mixed model protocol in R

Challenge!

The mixed model protocol in R

Challenge Solution!

Step 2: Coding potential models and model selection

Step 1: A priori model building and data exploration

What is the structure of the best-fit model?

What do these AICc values tell us?

What is the structure of the best-fit model?

> AICtable

M8<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1|Lake), data=data, REML=TRUE)

Step 2: Coding potential models and model selection

Take 2 minutes with your neighbour to draw out the model structure of M2. Biologically, how does it differ from M8? Why is it not surprising that it's AICc value was 2nd best?

In-class group discussion

M8<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1|Lake), data=data, REML=FALSE)

Step 3: Model validation

  • The model with the lowest AIC value has the most predictive power given the data.

  • Some suggest that if models are within 2 AICc units they are equally plausible.

  • In this case, we can take a closer look at M8 and M2, but all other models have such high AICc values that we know that they are not explaining as much variation in the data as M8 and should not be considered as plausible options to explain the patterns in the data

Step 4: Interpreting results

and visualizing the model

This tells us that both the slopes and intercepts of the relationship between TP and Length vary at species level factor in the best fit-model.

> M2<-lmer(Z_TP~Z_Length + (1+Z_Length|Lake) + (1+Z_Length|Fish_Species), data = data, REML=FALSE)

The mixed model protocol in R

Step 3: Model validation

B. Look at independence

ii) plot residuals vs each covariate not in the model

B. Look at independence

i) plot residuals vs each covariate in the model

Step 3: Model validation

B. Look at independence

i) plot residuals vs each covariate in the model

A. Look at homogeneity

-plot models fitted values vs residuals values

Step 3: Model validation

  • You must to check to make sure the model has met all assumptions it makes in its calculations

>boxplot(E1 ~ Lake,

ylab = "Normalized residuals",

data = data, xlab = "Lake")

>abline(h = 0, lty = 2)

C. Look at normality of residuals

>boxplot(E1 ~ Fish_Species,

ylab = "Normalized residuals",

data = data, xlab = "Species")

>abline(h = 0, lty = 2)

>plot(x = data$Z_Length, y = E1, xlab = "Z Length",ylab = "Normalized residuals")

>abline(h = 0, lty = 2)

  • Normally distributed residuals indicate that your model predictions are not biased high or low.

>E1 <- resid(M8)

>F1<-fitted(M8)

>plot(x = F1, y = E1, xlab = "Fitted Values",

ylab = "Normalized residuals")

>abline(h = 0, lty = 2)

  • If you observe patterns in these plots you will know that there is variation in your dataset that could be accounted for by these covariates that were not included in the model, and so you should re-consider the inclusion of this variable in your model

  • Seeing as how there are no covariates we measured that we decided not to include in the model this point is not applicable to this analysis.

> E1 <- resid(M8)

> hist(E1)

A. Look at homogeneity

-plot models fitted values vs residuals values

B. Look at independence

-plot residuals vs each covariate in the model

-plot residuals vs each covariate not in the model

C. Look at normality

-histogram

*Even spread of the residuals suggest that the model is a good fit for the data.

  • The equal spread above and below zero indicate that there are no independence problems with this variable
  • Clustering is due to the structure of the data, in which only fish of certain length categories were collected

Normalized residuals

No visible patterns observed, which suggests there are no problems with independence in relation to the Lake and Species variables

Fitted values

*A plot like this would suggest that there variation in the dataset that the model was unable to account for

The mixed model protocol in R

The mixed model protocol in R

Challenge solution!

The mixed model protocol in R

The mixed model protocol in R

Challenge!

The mixed model protocol in R

Step 4: Interpreting results

>summary(M8)

Step 4: Interpreting results

and visualizing the model

Step 1: A priori model building and data exploration

Step 3: Model validation

a) What is the slope and confidence interval of the Z_Length variable in the M8 model?

-Slope = 0.422

-CI = 0.09*2 = 0.18

b) Is the Z_Length slope significantly different to 0?

-Yes, because the CI does not overlap with 0

  • What is "variance"?

SD^2

  • The various intercepts and slopes generated by the model can be placed into figures to better visualize the results of a mixed model.

Step 2: Coding potential models and model selection

a) What is the slope and confidence interval of the Z_Length variable in the M8 model?

b) Is the Z_Length slope significantly different from 0?

<--Variation in Lake intercepts

A. Look at homogeneity

-plot models fitted values vs residuals values

B. Look at independence

-plot residuals vs each covariate in the model

-plot residuals vs each covariate not in the model

C. Look at normality

-histogram

<--Variation in Species intercepts

Step 3: Model validation

<--Variation in Species slopes

<-- Residual variation

Step 4: Interpreting results

and visualizing the model

>summary(M8)

  • What is a "p-value"?

The estimated slope +/- the 95% confidence interval

SE*2

Insignificant slope

Significant slope

Challenge!

Challenge!

Solution

The mixed model protocol in R

How to visualize model results

The mixed model protocol in R

Step 4: Interpreting results

a) All data grouped

and visualizing the model

a) All data grouped

All data grouped

1-Obtain coefficients of interest

>summary(M8)

Intercept = -0.0009059

Slope = 0.4222697

2- Illustrate coefs in a figure

#Plot all data into one figure

Take two minutes to sketch out different ways you could plot out the results of M8.

b) The species level figure

  • To make these figures you must obtain the coefficients (intercept and slope) of each component of the model
  • Overall model coefficients (aka: "coefs") can be found in the summary of the model
  • Coefs for each of the model levels (in this case: Lake and Species) can be obtained using the "coef" function

The Species level figure

1-Obtain coefficients of interest

>coef(M8)

The Lake level figure

1-Obtain coefficients of interest

>coef(M8)

2-Plot the data color coded by species and add regression lines for each species using extracted coefs

*See code for details concerning the figure

c) The lake level figure

*hint: consider the different "levels" of the model

c) The lake level figure

2-Plot the data color coded by species and add regression lines for each species using extracted coefs

*See code for details concerning the figure

Challenge!

Mixed models

and ecological data

Challenge!

Solution

Mixed models are really good at accounting for variation in ecological data while not loosing too many degrees of freedom

> lmer(Bio_Div ~ Productivity + (1|Forest/Site))

# Here the random effects are nested (i.e. Sites within forest) and not crossed

Situation:

You collected biodiversity estimates from 1000 quadrats that are within 10 different sites which are within 10 different forests. You measured the productivity in each quadrat as well. You are mainly interested in knowing if productivity is a good predictor of biodiversity.

-What mixed model could you use for this data set?

Challenge!

Mixed model resources

Challenge!

Solution

Quebec Centre for Biodiversity Science

R Workshop Series

  • Differences between nlme and lme4

> lmer(Mercury~Length*Habitat_Type+ (1|Site)

Classroom discussion of examples

Workshop 6: Linear mixed effects models

Situation:

You collected 200 fish from 12 different sites evenly distributed across 4 habitat types found within 1 lake. You measured the length of each fish and the quantity of mercury in its tissue. You are mainly interested in knowing if habitat is a good predictor of mercury concentration.

-What mixed model could you use for this data set?

Discuss the data set your currently working on with you neighbour and determine if a mixed model would be appropriate for it.

If so, work with your neighbour to write out the code you would use in R for this model.

If not, come up with a fictive ecological data set for which a mixed model could be useful and write out the model

Feedback: https://docs.google.com/spreadsheet/ccc?key=0AhCQzc0AsZ0OdHZoWE1PUi1kNmttZV96VEViY0sxVEE#gid=0

> plot <- ggplot(aes(Z_Length,Z_TP),

data=data)

> Plot_AllData <- plot + geom_point()

+ xlab("Length (mm)")

+ ylab("Trophic Position")

+ labs(title="All Data") + fig

> Plot_AllData

+ geom_abline(intercept = -0.0009059,

slope =0.4222697)

#Add an abline line with coefs

Learn more about creating dynamic, engaging presentations with Prezi