Logit transformation of the data

often used with lm(m) for percentages or proportions when the binomial distribution is not appropriate. When not selections from fixed quantities (e.g. percent cover, school grades, etc).

Log-normal distribution in glm

, avoids having to log-transform the data.

Gamma distribution

. Similar to log-normal, more versatile.

Tweedie distribution

. Versatile family of distributions. Useful for data with a mix of zeros and positive values (not necessarily counts).

Zero-inflated Poisson or zero-inflated negative binomial

. When the data comprise an excess number of zeros, that arise from a different process than the process that generates the counts.

**Quebec Centre for Biodiversity Science**

R Workshop Series

R Workshop Series

**Workshop 7: Generalized linear (mixed) models**

**Website**

:

http://qcbs.ca/wiki/r/workshop7

:

http://qcbs.ca/wiki/r/workshop7

Poisson GLM behind the scenes

Modeling count data

positive values: you do not count -7 individuals

integer values: you do not count 7.56 individuals

exhibits larger variance for large values

The Poisson distribution

What is count data?

Count data are characterized by:

The Poisson distribution specifies the probability of a discrete random variable Y and is given by:

If µ is known we can compute the probability for different values of y

Properties

µ

is the parameter of the Poisson distribution

specifies the probably only for integer values

probability for negative values is null (P(Y<0) = 0)

mean = variance (allows for heterogeneity)

A Poisson GLM will model the value of µ as a function of different explanatory variables

Step 1.

We assume Yi follows a Poisson distribution with mean and variance µi

Fitting a Poisson GLM in R

Three steps

µ

i corresponds to the expected number of individuals

Poisson GLM behind the scenes

Step 2.

We specify the systematic part of the model just as in a linear model

Step 3.

The link between the mean of Yi and the systematic part is a logarithmic

Or

glm.poisson = glm(Faramea.occidentalis~Elevation, data=faramea, family=poisson)

Specify the distribution and link function

summary(glm.poisson)

The function

glm()

allows you to specify a Poisson GLM

As with

lm()

you can access the outputs of the model using the function summary()

Call:

glm(formula = Faramea.occidentalis ~ Elevation, family = poisson,

data = faramea)

Deviance Residuals:

Min 1Q Median 3Q Max

-3.3319 -2.7509 -1.5451 0.1139 11.3995

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.7687001 0.1099136 16.092 < 2e-16 ***

Elevation -0.0027375 0.0006436 -4.253 2.11e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 414.81 on 42 degrees of freedom

Residual deviance: 388.12 on 41 degrees of freedom

AIC: 462.01

Parameter estimates

The deviance

In our model the unknown parameters are the intercept (alpha) and the slope of elevation (Beta)

Remember that to estimate the unknown parameters maximum likelihood estimation is used

Model summary

The residual deviance is defined as twice the difference between the log likelihood of a model that provides a perfect fit and the log likelihood of our model

Res dev

= 2log(

L

(

y

;

y

)) - 2log(

L

(

y

;

µ

))

In a Poisson GLM, the residual deviance should equal the residual degrees of freedoms

388.12 >> 41

Overdispersion

When the residual deviance is higher than the residual degrees of freedom we say that the model is

overdispersed

Occurs when the variance in the data is even higher than the mean, hence the Poisson distribution is not the best choice (many zeros, many very high values, missing covariates, etc)

Solutions

2.

Choose another distribution:

the negative binomial

1.

Correct for overdispersion using

quasi-Poisson GLM

Quasi-Poisson GLM

The variance of the distribution will account for

overdispersion

:

is the dispersion parameter. It will be estimated prior to estimate parameters. Correcting for overdispersion will not affect parameter estimates but will affect their

significance

. Indeed, the standard errors of the parameters are multiplied by

the

systematic part

and the

link function

remain the same

Some marginally significant p-values may no longer hold!

Dispersion parameter

1

5

Poisson GLM

10

>15-20

Quasi-Poisson GLM

Negative binomial

Fitting a quasi-Poisson GLM in R

glm.quasipoisson = glm(Faramea.occidentalis~Elevation, data=faramea, family=quasipoisson)

Specify the new family

Call:

glm(formula = Faramea.occidentalis ~ Elevation, family = quasipoisson,

data = faramea)

Deviance Residuals:

Min 1Q Median 3Q Max

-3.3319 -2.7509 -1.5451 0.1139 11.3995

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.768700 0.439233 4.027 0.000238 ***

Elevation -0.002738 0.002572 -1.064 0.293391

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 15.96936)

Null deviance: 414.81 on 42 degrees of freedom

Residual deviance: 388.12 on 41 degrees of freedom

AIC: NA

Number of Fisher Scoring iterations: 10

summary(glm.quasipoisson)

The standard errors of the parameters are multiplied by

Same estimates but

AIC is not defined

Negative binomial GLM

Fitting a negative binomial in R

Negative binomial GLMs are favor when overdispersion is high

NB is not in the

glm()

function so you need to install and charge the MASS library

install.packages("MASS")

glm.negbin = glm.nb(Faramea.occidentalis~Elevation, data=faramea)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.369226 0.473841 5.00 5.73e-07 ***

Elevation -0.007038 0.002496 -2.82 0.00481 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.2593) family taken to be 1)

Null deviance: 41.974 on 42 degrees of freedom

Residual deviance: 36.343 on 41 degrees of freedom

AIC: 182.51

Number of Fisher Scoring iterations: 1

Theta: 0.2593

Std. Err.: 0.0755

2 x log-likelihood: -176.5090

Plotting the final GLM model

Step 1

Plot the data and the use estimates of the parameters to draw model line

Use

summary()

to get the parameters

Step 2

Use the standard errors to build the confidence envelope

It corresponds to a combination of two distributions (

Poisson

and

gamma

)

It assumes that the

Yi

are Poisson distributed with the mean μ assumed to follow a gamma distribution!

It has

two parameters

µ

and

k.

k

controls for the dispersion parameter (smaller

k

indicates higher dispersion)

**GLM with count data**

**GLM with binary data**

Binary variables

Presence/Absence of a species

Presence/Absence of a disease

Success/Failure to observe behaviour

Survival/Death of organisms

Wish to determine if P/A ~ Environment***

*** Called a logistic regression or logit model

Binary variables

In R, binary variables are coded with 1 and 0:

Site Presence

A 1

B 0

C 1

D 1

E 0

F 1

Species is present!

Species is absent!

**GLMMs**

**Why be normal?**

Logistic regression

The function!

glm()

logit.reg <- glm(formula, data, family)

Binary variables

Clearly not normally distributed!

Binary variables

Expected values can be out of the [0,1] range with lm():

To move away from traditional linear models, need to specify two things:

1) probability distribution

AND

2) a link function

Probability distribution

The Bernoulli distribution is well suited for binary response variables

E(

Y

) =

p

Var(

Y

) =

p

* (1-

p

)

The Link Function

For a

simple linear model

of a normally distributed continuous response variable, the equation for the expected values is:

μ = Xβ

where μ is the expected value of the response variable

X is the model matrix (

i.e.

your data)

β is the vector of estimated parameters

(

i.e.

the intercept & slope)

Xβ is called the

linear predictor

The Link Function

μ = Xβ

is only true for normally distributed data

If this is not the case, must use a transformation on the expected values μ

g

(μ) = Xβ

The Link Function

g

(μ) = log

For binary data, the link function is called the

logit:

1-μ

μ

1) Get the odds (μ / 1-μ)

2) log-transform them

The Link Function

g

(μ) = log

1-μ

μ

1) Get the odds (μ / 1-μ)

2) log-transform them

The odds puts our expected values on a 0 to +Inf scale

The log transformation puts our expected values on a -Inf to +Inf

The expected values can now be

linearly

related to the linear predictor

μ = expected values (probability that Y = 1)

Exercise 1

Build a logistic regression model using the mites data

setwd(~/workshop6)

mites <- read.csv("mites.csv")

str(mites)

str(mites)

'data.frame': 70 obs. of 9 variables:

$ Galumna : int 8 3 1 1 2 1 1 1 2 5 ...

$ pa : int 1 1 1 1 1 1 1 1 1 1 ...

$ totalabund: int 140 268 186 286 199 209 162 126 123 166 ...

$ prop : num 0.05714 0.01119 0.00538 0.0035 0.01005 ...

$ SubsDens : num 39.2 55 46.1 48.2 23.6 ...

$ WatrCont : num 350 435 372 360 204 ...

$ Substrate : Factor w/ 7 levels "Barepeat","Interface",..: 4 3 2 4 4 4 4 2 3 4 ...

$ Shrub : Factor w/ 3 levels "Few","Many","None": 1 1 1 1 1 1 1 2 2 2 ...

$ Topo : Factor w/ 2 levels "Blanket","Hummock": 2 2 2 2 2 2 2 1 1 2 ...

Exercise 1

Build a model of the presence of Galumna sp. as a function of water content and topography

logit.reg <- glm(pa ~ WatrCont + Topo, data=mites,

family = binomial(link = "logit"))

summary(logit.reg)

Call:

glm(formula = pa ~ WatrCont + Topo, family = binomial, data = mites)

Deviance Residuals:

Min 1Q Median 3Q Max

-2.0387 -0.5589 -0.1594 0.4112 2.0252

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 4.464402 1.670622 2.672 0.007533 **

WatrCont -0.015813 0.004535 -3.487 0.000489 ***

TopoHummock 2.090757 0.735348 2.843 0.004466 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 91.246 on 69 degrees of freedom

Residual deviance: 48.762 on 67 degrees of freedom

AIC: 54.762

Number of Fisher Scoring iterations: 6

summary(logit.reg)

Challenge 1

Using the 'bacteria' dataset, model the presence of

H. influenzae

as a function of treatment and week of test. Start with a full model and reduce it to the most parsimonious model.

install.packages("MASS")

library(MASS)

data(bacteria)

str(bacteria)

?bacteria

Solution

model.bact1 <- glm(y ~ trt * week,

data = bacteria, family = binomial)

model.bact2 <- glm(y ~ trt + week,

data = bacteria, family = binomial)

anova(model.bact1, model.bact2, model.bact3,

test = "LRT")

model.bact3 <- glm(y ~ week, data = bacteria,

family = binomial)

Analysis of Deviance Table

Model 1: y ~ trt * week

Model 2: y ~ trt + week

Model 3: y ~ trt

Resid. Df Resid. Dev Df Deviance Pr(>Chi)

1 214 203.12

2 216 203.81 -2 -0.6854 0.709836

3 217 210.72 -1 -6.9140 0.008552 **

Based on these results, we choose model #2 as the best candidate to model the data.

Interpreting the output

Let's go back to the summary of our 'logit.reg' model to see the coefficients:

summary(logit.reg)$coefficients

Estimate Std. Error z value Pr(>|z|)

(Intercept) 4.46440 1.67062 2.67230 0.00753

WatrCont -0.01581 0.00454 -3.48673 0.00049

TopoHummock 2.09076 0.73535 2.84322 0.00447

Interpreting the output

Remember we used a logit transformation on the expected values!

To properly interpret the regression parameters, we have to use a 'reverse' function:

The natural exponential function to obtain the odds: e

x

The inverse logit function to obtain the probabilities:

logit = 1 / (1 + 1/e )

-1

x

Interpreting the output

On the odds scale for water content:

exp(logit.reg$coefficient[2])

WatrCont

0.9843118

On the probability scale for water content:

1 / (1 + 1/exp(logit.reg$coefficient[2]))

WatrCont

0.4960469

Predictive Power and goodness of fit

Get the pseudo-R , the analogue of the R for models fitted by maximum likelihood:

null deviance - residual deviance

null deviance

pseudo-R =

2

2

2

variance explained by the model

Exercise 2

Compute the coefficient of discrimination D

Predictive Power and goodness of fit

To assess goodness of fit, diagnostic plots are not useful, instead must use

Hosmer-Lemeshow test:

Compare observed and expected number of outcomes

Similar to a Chi square test

Perform a Hosmer-Lemeshow test*

HLtest(Rsq(logit.reg))

Challenge 2

Using the model created with bacteria dataset, assess goodness of fit and predictive power.

Think how predictive power could be improved for this model.

1)

2)

Solution

null.d <- model.bact2$null.deviance

resid.d <- model.bact2$deviance

bact.pseudoR2 <- (null.d -resid.d) / null.d

1)

2)

Adding

informative

explanatory variables could increase the explanatory power of the model

HLtest(Rsq(model.bact2)

Modeling count data

What is count data?

faramea <- read.csv(‘faramea.csv’, header = TRUE)

Import the 'faramea.csv' into R.

Let's investigate the histogram of the number of

Faramea occidentalis

The number of trees of the species

Faramea occidentalis

was assessed in 43 quadrats in Barro Colorado Island (Panama). For each quadrat, environmental characteristics were also recorded such as elevation or precipitation.

Modeling count data

the

Poisson distribution

seems to be the perfect choice to model this data, hence

Poisson GLMs

are usually the good way to start modeling count data.

How to model count data?

Does elevation influence the abundance of

F. occidentalis

?

Slope of '

Elevation

'

One intercept

Solution

# Poisson GLM

glm.p = glm(Galumna~WatrCont+SubsDens, data=mites, family=poisson)

# quasi-Poisson GLM

glm.qp = update(glm.p,family=quasipoisson)

# model selection

drop1(glm.qp, test = "Chi")

# or

glm.qp2 = glm(Galumna~WatrCont, data=mites, family=quasipoisson)

anova(glm.qp2, glm.qp, test="Chisq")

Residual deviance

Residual degrees of freedom

glm.quasipoisson = update(glm.poisson,family=quasipoisson)

Create a new GLM using the '

quasipoisson

' family or update the previous one

= 4!

0.0006436*4 =

0.00257

null.model <- glm(Faramea.occidentalis~1, data=faramea, family=quasipoisson)

anova(null.model, glm.quasipoisson, test="Chisq")

Try also deviance analysis to test for the effect of Elevation

Model 1: Faramea.occidentalis ~ 1

Model 2: Faramea.occidentalis ~ Elevation

Resid. Df Resid. Dev Df Deviance Pr(>Chi)

1 42 414.81

2 41 388.12 1 26.686 0.1961

Fitting a negative binomial in R

summary(glm.negbin)

k

summary(glm.negbin)$coefficients[1,1]

summary(glm.negbin)$coefficients[2,1]

summary(glm.negbin)$coefficients[1,2]

summary(glm.negbin)$coefficients[2,2]

Plotting the final GLM model

The effect of elevation on the number of

F. occidentalis

is significant using the NB GLM but the relationship is weak.

Challenge 3

Use the

mites

dataset! Model the abundance of the species

Galumna

as a function of the substrate characteristics (water content [

WatrCont

] and density [

SubsDens

])

mites <- read.csv("mites.csv", header = TRUE)

N.S.

Upper limit

Lower limit

Do you need to account for overdispersion?

Which covariates have a significant effect?

Select the best model!

Drop each term in turn and compare the full model with a nested model using the command:

Tips

Model selection for GLMs

Specify manually a nested model, call it for example MyGLM2, and use the command:

drop1(MyGLM, test = "Chi")

anova(MyGLM, MyGLM2, test = "Chi")

setwd("~/Desktop")

mites <- read.csv('mites.csv')

head(mites)

str(mites)

Limitations of linear (mixed) models

Load dataset and fit a LM:

The dataset that you just loaded is a subset of the 'Oribatid mite dataset':

Exploring relationships

Can we see any relationship(s) between

Galumna

and the 5 environmental variables?

Exploring relationships

plot(mites)

Exploring relationships

Can we see any relationship(s) between

Galumna

and the 5 environmental variables?

A negative relationship between

Galumna

and water content?

par(mfrow=c(1,3))

plot(Galumna ~ WatrCont, data = mites, xlab = 'Water content', ylab='Abundance')

boxplot(WatrCont ~ pa, data = mites, xlab='Presence/Absence', ylab = 'Water content')

plot(prop ~ WatrCont, data = mites, xlab = 'Water content', ylab='Proportion')

par(mfrow=c(1,1))

Testing linearity

Fit linear models to test whether abund, pa, and/or prop varies as a function of water content.

Testing linearity

Fit linear models to test whether abund, pa, and/or prop varies as a function of water content.

lm.abund <- lm(Galumna ~ WatrCont, data = mites)

summary(lm.abund)

lm.pa <- lm(pa ~ WatrCont, data = mites)

summary(lm.pa)

lm.prop <- lm(prop ~ WatrCont, data = mites)

summary(lm.prop)

Testing linearity

Significant relationship in all models!

Wait a minute...

plot(Galumna ~ WatrCont, data = mites)

abline(lm.abund)

plot(lm.abund)

Testing linearity

Even worse for other models:

# Proportion

plot(prop ~ WatrCont, data = mites)

abline(lm.prop)

plot(lm.prop)

# Presence/Absence

plot(pa ~ WatrCont, data = mites)

abline(lm.pa)

plot(lm.pa)

Model assumptions

Common in Ecology that assumptions of homogeneity of variance and normality are not met.

Main reason why we need GLMs!

Let's revisit the assumptions of lm...

Model assumptions

Equation of lm:

y =

β

+

β

x +

ε

where:

yi = predicted value of response variable

β0 = intercept

β1 = slope

xi = explanatory variable

εi

= model residuals drawn from a normal distribution with a varying mean but a constant variance

**

** Key point!

Residuals (the distance between each observation and the regression line) can be predicted by drawing random values from a normal distribution.

Normally distributed residuals

Recall:

Normal distributions have two parameters, μ (mean) and σ (variance):

i

i

i

0

1

2

Normally distributed residuals

Another way to write the lm equation is:

yi ~ N(μ = β0 + β1xi, σ ),

Which literally means that yi is drawn from a normal distribution with parameters μ (which depends on xi) and σ (which has the same value for all Ys)

Lets predict

Galumna

abund as a function of water content using the lm we fitted earlier...

2

2

Model prediction

We need regression coefficients (betas) and sigma:

coef(lm.abund)

summary(lm.abund)$sigma

What are the parameters of the normal distribution used to model y when water content = 300?

yi ~ N(μ = β0 + β1xi, σ )

2

Model prediction

At x = 300, residuals should follow a normal distribution with μ = 1.63 and σ = 1.51.

At x = 400, we get μ = 1.02 and σ = 1.51, etc.

Graphically, this is our model:

Biological data & distributions

Problems:

σ is not homogeneous, yet lm forces a constant σ

Predicted values should be integers

Statisticians have described a multitude of distributions that correspond to different types of data

A distribution provides the probability of observing each possible outcome of an experiment or survey (e.g. abund = 8

Galumna

)

Distributions can be

discrete

(only includes integers

or

continuous

(includes fractions)

All distributions have

parameters

that dictate the shape of the distribution (e.g. μ and σ for the normal)

Biological data & distributions

Galumna

abund follows a discrete distribution (can only take integer values).

A useful distribution to model abundance data is the “Poisson” distribution:

a discrete distribution with a single parameter, λ (lambda), which defines both the mean and the variance of the distribution:

Biological data & distributions

Galumna

seems to follow a Poisson distribution with a low value of λ:

hist(mites$Galumna)

mean(mites$Galumna)

Biological data & distributions

Presence-absence takes yet another form:

only 0s and 1s

Poisson distribution would not be appropriate to model this variable

hist(mites$pa)

Biological data & distributions

“Bernoulli” distribution

Only two possible outcomes in its range: success (1) or failure (0)

One parameter, p, the probability of success

We can use the Bernouilli distribution to calculate the probability

Galumna

present (1) vs. absent (0)

Biological data & distributions

Binomial distribution:

When there are multiple trials (each with a success/failure), the Bernoulli distribution expands into the binomial

Additional parameter, n, for number of trials

Predicts the probability of observing a given proportion of successes, p, out of a known total number of trials, n.

Biological data & distributions

Binomial distribution:

Used to model data where the number of successes are integers and where the number of trials, n, is known.

Main difference with Poisson distribution:

The binomial has an upper limit to its range, corresponding to n. Consequently, it is right-skewed at low p values but left-skewed at high p values

vs.

Biological data & distributions

Getting back to our problem... can switch the distribution of error terms (εi) from normal to Poisson:

yi ~ Poisson(λ = β0 + β1xi)

Problems solved!

1. λ varies with x (water content)

which means residual variance will also vary with x, which means that we just relaxed the homogeneity of variance assumption!

2. Predicted values will now be integers instead of fractions

3. The model will never predict negative values (Poisson is strictly positive)

Biological data & distributions

This is

almost

a Poisson GLM, which looks like this:

Probabilities (in orange) are now integers, and both the variance and the mean of the distribution decline as λ decreases with increasing water content.

Review of LMM Workshop:

Structure in the dataset or correlation among observations can result in

lack of independence among observations

sampled from same sites or time points

Account for this by including

random effect terms

Random effects:

Parameter is a sample from the population, i.e. the subjects you happen to be working with

Explains the variance of the response variable

Fixed effects:

Parameter is reproducible, i.e. would be the same across studies

Explain the mean of the response variable

Review: Linear Mixed Models

Extension of GLMs to account for additional structure in dataset

Follows similar steps introduced in LMM Workshop

1) LMMs incorporate random effects

2) GLMs handle non-normal data (letting errors take on different distribution families - e.g. Poisson or negative binomial)

Generalized Linear Mixed Models

Shrinkage estimates

How to run a GLMM in R

dat.tf <- read.csv("Banta_TotalFruits.csv")

str(dat.tf)

Import the Arabidopsis dataset 'Banta_TotalFruits.csv' into R.

The effect of nutrient availability and herbivory (

fixed effects

)

on the fruit production of the mouse-ear cress (

Arabidopsis thaliana

) was evaluated by measuring 625 plants across 9 different populations, each comprised of 2 to 3 different genotypes (

random effects

)

# popu factor with a level for each population

# gen factor with a level for each genotype

# nutrient factor with levels for low (value = 1) or high (value = 8)

# amd factor with levels for no damage or simulated herbivory

# total.fruits integer indicating the number of fruits per plant

Exploring variance

dat.tf <- within(dat.tf,

{

# genotype x nutrient x clipping

gna <- interaction(gen,nutrient,amd)

gna <- reorder(gna, total.fruits, mean)

# population x nutrient x clipping

pna <- interaction(popu,nutrient,amd)

pna <- reorder(pna, total.fruits, mean)

})

To illustrate heterogeneity in variance we will first create boxplots of the response variable versus different environmental factors

Let's create new variables that represents every combination of

nutrient

x

clipping

x

random factor

# Boxplot of total fruits vs genotype x nutrient x clipping interaction

library(ggplot2)

ggplot(data = dat.tf, aes(factor(x = gna),y = log(total.fruits + 1))) +

geom_boxplot(colour = "skyblue2", outlier.shape = 21,

outlier.colour = "skyblue2") +

theme_bw() + theme(axis.text.x=element_text(angle=90)) +

stat_summary(fun.y=mean, geom="point", colour = "red")

* Similarly, the boxplot of total fruits vs population x nutrient x clipping interaction shows a large amount of heterogeneity among populations.

Choosing error distribution

The response variable is count data which suggests to that a

Poisson distribution

should be used (i.e. the variance is equal to the mean)

However, as we will soon see, the variance increases with the mean much more rapidly than expected under the Poisson distribution

Choosing error distribution

As we just saw, there is a large amount of heterogeneity among group variances even when the response variable is transformed

If we plot the

group variances vs group means

(

example with genotype x nutrient x clipping grouping shown here

), we can appreciate that the Poisson family is the least appropriate distribution (i.e. variances increase much faster than the means)

Poisson GLMM

Given the mean-variance relationship, we will most likely need a model with overdispersion

but let's start with a Poisson model:

To run a GLMM in R we simply need to use the

glmer

function of the lme4 package

library(lme4)

mp1 <- glmer(total.fruits ~ nutrient*amd + rack + status +

(1|popu)+

(1|gen),

data=dat.tf, family="poisson")

Specify the family

Specify the random effects

We can check for overdispersion using the

overdisp_fun

function (Bolker et al. 2011) which divides the Pearson residuals by the residual degrees of freedom and tests whether ratio is greater than unity

Overdispersion check

# Download the glmm_funs.R code from the wiki page and source it to run the function in R

source("glmm_funs.R")

# Overdispersion?

overdisp_fun(mp1)

Random effects are often called

shrinkage estimates

because they represent a weighted average of the data and the overall fit (fixed effect)

The random effect shrinkage toward the overall fit (fixed effect) is more severe if the within-group variability is large compared to the among-group variability

Review: LMMs

Exploring variance

factor(genotype x nutrient x clipping)

> overdisp_fun(mp1)

chisq ratio p logp

15755.86833 25.57771 0.00000 -6578.47027

Ratio is significantly >> 1

As expected, we need to try a different distribution where the variance increase more rapidly than the mean

Negative binomial GLMM

(Poisson-gamma)

Recall that the negative binomial (or Poisson-gamma) distribution meets the assumption that the

variance is proportional to the square of the mean

mnb1 <- glmer.nb(total.fruits ~ nutrient*amd + rack + status +

(1|popu)+

(1|gen),

data=dat.tf, control=glmerControl(optimizer="bobyqa"))

# Control argument specifies the way we optimize the parameter values

# Overdispersion?

overdisp_fun(mnb1)

Ratio is now much closer to 1 although p-value is still less than 0.05

Poisson-lognormal GLMM

Another option is the

Poisson-lognormal

distribution.

This can be achieved simply by placing an observation-level random effect in the formula.

mpl1 <- glmer(total.fruits ~ nutrient*amd + rack + status +

(1|X) +

(1|popu)+

(1|gen),

data=dat.tf, family="poisson", control=glmerControl(optimizer="bobyqa"))

overdisp_fun(mpl1)

chisq ratio p logp

1.775361e+02 2.886766e-01 1.000000e+00 -3.755374e-73

Ratio now meets our criterion

Model selection

The same methods can be used with a glmm or lmm to choose between models with various random intercepts and/or random slopes and to choose fixed effects to keep in final model.

an

information theoretic approach

(e.g., AICc - Workshop 5),

a

frequentist approach

(where the significance of each term is evaluated using ''anova()'' and the likelihood ratio test; LRT)

Visualization the model parameters

A graphical representation of the model parameters can be obtained using the

coefplot2

funtion:

library(coefplot2)

# Variance terms

coefplot2(mpl1,ptype="vcov",intercept=TRUE)

# Fixed effects

coefplot2(mpl1,intercept=TRUE)

Visualizing the random effects

You can extract the random effect predictions using

ranef

and plot them using a

dotplot

Regional variability among populations:

Spanish populations (SP) larger values than Swedish (SW) or Dutch (NL)

Difference among genotypes largely driven by genotype 34

Note:

the

grid.arrange()

function from the {gridExtra} library was used to make this graphic (see script for details).

Model selection

Alternatively, we can use

drop1

and

dfun

functions to evaluate our fixed effects (dfun converts the AIC values returned by the drop1 into ΔAIC values)

(dd_LRT <- drop1(mpl1,test="Chisq"))

(dd_AIC <- dfun(drop1(mpl1)))

Df dAIC

<none> 0.000

rack 1 55.083

status 2 1.612

nutrient:amd 1 1.444

Strong

rack

effect (dAIC = 55.08 if we remove this variable)

Effects of

status

and

interaction

term are weak (dAIC < 2)

Start by

removing the non-significant interaction

term to test main effects of nutrient and clipping

Proportion data and GLM

Sometimes, proportion data is more similar to logistic regression than you think...

If we measure the number of occurrences and we know the total sample size, it is not count data!

Suppose we measure disease prevalence in ten deer populations on 10 deer individuals per population:

x infected deer

10 deer

}

always bound between 0 an 1!

Exercise 3

In R, we have to specify the number of times something happened and the number of times something did not happen:

> prop.reg <- glm(cbind(Galumna, totalabund - Galumna) ~ Topo + WatrCont, data = mites, family = binomial)

> summary(prop.reg)

Call:

glm(formula = cbind(Galumna, totalabund - Galumna) ~ WatrCont +

Topo, family = binomial, data = mites)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.4808 -0.9699 -0.6327 -0.1798 4.1688

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -3.288925 0.422109 -7.792 6.61e-15 ***

WatrCont -0.005886 0.001086 -5.420 5.97e-08 ***

TopoHummock 0.578332 0.274928 2.104 0.0354 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 140.702 on 69 degrees of freedom

Residual deviance: 85.905 on 67 degrees of freedom

AIC: 158.66

NB = negative binomial

QP = quasi-Poisson

loess = Locally weighted

regression smoothing

Note:

error bars are only shown for the fixed effects because glmer doesn't provide information on uncertainty of variance terms

Random effect variance

Fixed effect coefficient

library(binomTools)

We can also code the model directly with proportions :

> prop.reg2 <- glm(prop ~ Topo + WatrCont, data = mites, family = binomial, weights = totalabund)

Model selection

We first code potential models and compare them using AICc

*

:

mpl2 <- update(mpl1, . ~ . - rack)

# model without rack

mpl3 <- update(mpl1, . ~ . - status)

# model without status

mpl4 <- update(mpl1, . ~ . - amd:nutrient)

# without amd:nutrient interaction

ICtab(mpl1,mpl2,mpl3,mpl4, type = c("AICc"))

dAICc df

mpl1 0.0 10

mpl4 1.4 9

mpl3 1.5 8

mpl2 55.0 9

* NB:

We do not cover all possible models above, however, the interaction amd: nutrient can only be evaluated if both amd and nutrient (i.e., the main effects) are included in the model.

Common response variable in ecological datasets is the binary variable: we observe a phenomenon X or its “absence”

Mean of distribution

Variance of distribution

Probability

p

of observing an outcome

Variance decreases as p is close to 0 or 1

where

g

(μ) is the link function

This allows us to relax the normality assumption

The output indicates that both water content and topography are significant, but how do we interpret the slope coefficients?

Comparing deviance of your model (residual deviance) to the deviance of a null model (null deviance)

The

null model

is a model without explanatory variables

null.model <- glm(Response.variable ~ 1, family = binomial)

In R, we can extract the residual and null deviances directly from the glm object:

objects(logit.reg)

New statistic -

coefficient of discrimination (D)

evaluates the predictive power of logistic regression

Measure of how well logistic regression classifies an outcome as a success or a failure

In R these tests are available in the 'binomTools' package

pseudoR2 <- (logit.reg$null.deviance –

logit.reg$deviance) / logit.reg$null.deviance

pseudoR2

[1] 0.4655937

Hence, the model explains 46.6% of the variability in the data

fit <- Rsq(object = logit.reg)

fit

*A non significant value indicates an adequate fit!

Deals with overdisp. by adding

observation-level random effects

# 70 moss and mite samples

# 5 environmental measurements and abundance of the mite ”Galumna sp.”

Goal:

Model the abundance ($abund), occurrence ($pa), and proportion ($prop) of Galumna as a function of the 5 environmental variables.

μ = 3.44 + (-0.006 x 300) = 1.63

σ = 1.51

2

2

2

2

2

**(your data is ok; it's the model that's wrong)**

Model selection

mpl2 <- update(mpl1, . ~ . - and:nutrient)

# Use AIC

mpl3 <- update(mpl2, . ~ . - rack) # no rack or interaction

mpl4 <- update(mpl2, . ~ . - status) # no status or interaction

mpl5 <- update(mpl2, . ~ . - nutrient) # no nutrient or interaction

mpl6 <- update(mpl2, . ~ . - amd) # no clipping or interaction

ICtab(mpl2,mpl3,mpl4,mpl5,mpl6, type = c("AICc"))

# Or use drop1

(dd_LRT2 <- drop1(mpl2,test="Chisq"))

(dd_AIC2 <- dfun(drop1(mpl2)))

Both the main effects of

nutrient

and

clipping

are strong (large change in AIC of 135.6 (mpl5) and 10.2 (mpl6) if either nutrient or clipping are dropped, respectively).

Final model includes the fixed nutrient and clipping effects, rack, and observation-level random e ffects (1 | X) to account for over-dispersion

Up for a challenge?

Use the

inverts

dataset (larval development times (PLD) of 74 marine invertebrate and vertebrate species reared at different temperatures and time), answer the following questions:

What is the effect of feeding type and climate (fixed effects) on PLD?

Does this relationship vary among taxa (random effects)?

What is the best distribution family for this count data?

Finally, once you determined the best distribution family, re-evaluate your random and fixed effects.

Additional GLMM resources

Books

B. Bolker (2009)

Ecological Models and Data in R

. Princeton University Press.

A. Zuur et al. (2009) M

ixed Effects Models and Extensions in Ecology with R

. Springer.

Websites

GLMM for ecologists (

http://glmm.wikidot.com

)

A great website on GLMM with a Q&A section!