GLM with count data
Why be normal?
(your data is ok; it's the model that's wrong)
The Poisson distribution
Modeling count data
Poisson GLM behind the scenes
The Poisson distribution specifies the probability of a discrete random variable Y and is given by:
A Poisson GLM will model the value of µ as a function of different explanatory variables
Step 2. We specify the systematic part of the model just as in a linear model
Does elevation influence the abundance of F. occidentalis?
Import the 'faramea.csv' into R.
faramea <- read.csv(‘faramea.csv’, header = TRUE)
Step 1. We assume Yi follows a Poisson distribution with mean and variance µi
Step 3. The link between the mean of Yi and the systematic part is a logarithmic
Exploring relationships
Testing linearity
Exploring relationships
Limitations of linear (mixed) models
Exploring relationships
The number of trees of the species Faramea occidentalis was assessed in 43 quadrats in Barro Colorado Island (Panama). For each quadrat, environmental characteristics were also recorded such as elevation or precipitation.
Load dataset and fit a LM:
Fit linear models to test whether abund, pa, and/or prop varies as a function of water content.
A negative relationship between Galumna and water content?
Can we see any relationship(s) between Galumna and the 5 environmental variables?
- µ is the parameter of the Poisson distribution
- specifies the probably only for integer values
- probability for negative values is null (P(Y<0) = 0)
- mean = variance (allows for heterogeneity)
Count data are characterized by:
setwd("~/Desktop")
mites <- read.csv('mites.csv')
head(mites)
str(mites)
Let's investigate the histogram of the number of Faramea occidentalis
plot(mites)
the Poisson distribution seems to be the perfect choice to model this data, hence Poisson GLMs are usually the good way to start modeling count data.
- positive values: you do not count -7 individuals
- integer values: you do not count 7.56 individuals
- exhibits larger variance for large values
If µ is known we can compute the probability for different values of y
µi corresponds to the expected number of individuals
The dataset that you just loaded is a subset of the 'Oribatid mite dataset':
par(mfrow=c(1,3))
plot(Galumna ~ WatrCont, data = mites, xlab = 'Water content', ylab='Abundance')
boxplot(WatrCont ~ pa, data = mites, xlab='Presence/Absence', ylab = 'Water content')
plot(prop ~ WatrCont, data = mites, xlab = 'Water content', ylab='Proportion')
par(mfrow=c(1,1))
# 70 moss and mite samples
# 5 environmental measurements and abundance of the mite ”Galumna sp.”
Fitting a Poisson GLM in R
Goal: Model the abundance ($abund), occurrence ($pa), and proportion ($prop) of Galumna as a function of the 5 environmental variables.
The function glm() allows you to specify a Poisson GLM
glm.poisson = glm(Faramea.occidentalis~Elevation, data=faramea, family=poisson)
Model summary
Specify the distribution and link function
The deviance
Parameter estimates
As with lm() you can access the outputs of the model using the function summary()
Remember that to estimate the unknown parameters maximum likelihood estimation is used
In our model the unknown parameters are the intercept (alpha) and the slope of elevation (Beta)
Model assumptions
Testing linearity
summary(glm.poisson)
Model assumptions
The residual deviance is defined as twice the difference between the log likelihood of a model that provides a perfect fit and the log likelihood of our model
Even worse for other models:
Fit linear models to test whether abund, pa, and/or prop varies as a function of water content.
Call:
glm(formula = Faramea.occidentalis ~ Elevation, family = poisson,
data = faramea)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3319 -2.7509 -1.5451 0.1139 11.3995
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.7687001 0.1099136 16.092 < 2e-16 ***
Elevation -0.0027375 0.0006436 -4.253 2.11e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 414.81 on 42 degrees of freedom
Residual deviance: 388.12 on 41 degrees of freedom
AIC: 462.01
Significant relationship in all models!
Wait a minute...
Res dev = 2log(L(y;y)) - 2log(L(y;µ))
Common in Ecology that assumptions of homogeneity of variance and normality are not met.
- Main reason why we need GLMs!
Let's revisit the assumptions of lm...
Equation of lm:
y = β + β x + ε
where:
- yi = predicted value of response variable
- β0 = intercept
- β1 = slope
- xi = explanatory variable
- εi = model residuals drawn from a normal distribution with a varying mean but a constant variance**
Quebec Centre for Biodiversity Science
R Workshop Series
# Proportion
plot(prop ~ WatrCont, data = mites)
abline(lm.prop)
plot(lm.prop)
# Presence/Absence
plot(pa ~ WatrCont, data = mites)
abline(lm.pa)
plot(lm.pa)
In a Poisson GLM, the residual deviance should equal the residual degrees of freedoms
lm.abund <- lm(Galumna ~ WatrCont, data = mites)
summary(lm.abund)
lm.pa <- lm(pa ~ WatrCont, data = mites)
summary(lm.pa)
lm.prop <- lm(prop ~ WatrCont, data = mites)
summary(lm.prop)
** Key point! Residuals (the distance between each observation and the regression line) can be predicted by drawing random values from a normal distribution.
Model prediction
Biological data & distributions
Normally distributed residuals
Fitting a quasi-Poisson GLM in R
Overdispersion
Quasi-Poisson GLM
Dispersion parameter
Solution
Create a new GLM using the 'quasipoisson' family or update the previous one
We need regression coefficients (betas) and sigma:
Recall: Normal distributions have two parameters, μ (mean) and σ (variance):
When the residual deviance is higher than the residual degrees of freedom we say that the model is overdispersed
At x = 300, residuals should follow a normal distribution with μ = 1.63 and σ = 1.51.
At x = 400, we get μ = 1.02 and σ = 1.51, etc.
Graphically, this is our model:
The variance of the distribution will account for overdispersion:
coef(lm.abund)
summary(lm.abund)$sigma
glm.quasipoisson = glm(Faramea.occidentalis~Elevation, data=faramea, family=quasipoisson)
glm.quasipoisson = update(glm.poisson,family=quasipoisson)
Residual degrees of freedom
Statisticians have described a multitude of distributions that correspond to different types of data
A distribution provides the probability of observing each possible outcome of an experiment or survey (e.g. abund = 8 Galumna)
Distributions can be discrete (only includes integers
or continuous (includes fractions)
All distributions have parameters that dictate the shape of the distribution (e.g. μ and σ for the normal)
summary(glm.quasipoisson)
Another way to write the lm equation is:
yi ~ N(μ = β0 + β1xi, σ ),
Which literally means that yi is drawn from a normal distribution with parameters μ (which depends on xi) and σ (which has the same value for all Ys)
Lets predict Galumna abund as a function of water content using the lm we fitted earlier...
# Poisson GLM
glm.p = glm(Galumna~WatrCont+SubsDens, data=mites, family=poisson)
# quasi-Poisson GLM
glm.qp = update(glm.p,family=quasipoisson)
# model selection
drop1(glm.qp, test = "Chi")
# or
glm.qp2 = glm(Galumna~WatrCont, data=mites, family=quasipoisson)
anova(glm.qp2, glm.qp, test="Chisq")
the systematic part and the link function remain the same
What are the parameters of the normal distribution used to model y when water content = 300?
Occurs when the variance in the data is even higher than the mean, hence the Poisson distribution is not the best choice (many zeros, many very high values, missing covariates, etc)
The standard errors of the parameters are multiplied by
yi ~ N(μ = β0 + β1xi, σ )
Try also deviance analysis to test for the effect of Elevation
null.model <- glm(Faramea.occidentalis~1, data=faramea, family=quasipoisson)
anova(null.model, glm.quasipoisson, test="Chisq")
is the dispersion parameter. It will be estimated prior to estimate parameters. Correcting for overdispersion will not affect parameter estimates but will affect their significance. Indeed, the standard errors of the parameters are multiplied by
Model 1: Faramea.occidentalis ~ 1
Model 2: Faramea.occidentalis ~ Elevation
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 42 414.81
2 41 388.12 1 26.686 0.1961
Call:
glm(formula = Faramea.occidentalis ~ Elevation, family = quasipoisson,
data = faramea)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3319 -2.7509 -1.5451 0.1139 11.3995
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.768700 0.439233 4.027 0.000238 ***
Elevation -0.002738 0.002572 -1.064 0.293391
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 15.96936)
Null deviance: 414.81 on 42 degrees of freedom
Residual deviance: 388.12 on 41 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 10
μ = 3.44 + (-0.006 x 300) = 1.63
σ = 1.51
Some marginally significant p-values may no longer hold!
1. Correct for overdispersion using quasi-Poisson GLM
2. Choose another distribution: the negative binomial
Biological data & distributions
Galumna seems to follow a Poisson distribution with a low value of λ:
Presence-absence takes yet another form:
- only 0s and 1s
- Poisson distribution would not be appropriate to model this variable
Binomial distribution:
When there are multiple trials (each with a success/failure), the Bernoulli distribution expands into the binomial
- Additional parameter, n, for number of trials
- Predicts the probability of observing a given proportion of successes, p, out of a known total number of trials, n.
Galumna abund follows a discrete distribution (can only take integer values).
A useful distribution to model abundance data is the “Poisson” distribution:
- a discrete distribution with a single parameter, λ (lambda), which defines both the mean and the variance of the distribution:
hist(mites$Galumna)
mean(mites$Galumna)
hist(mites$pa)
“Bernoulli” distribution
- Only two possible outcomes in its range: success (1) or failure (0)
- One parameter, p, the probability of success
We can use the Bernouilli distribution to calculate the probability Galumna present (1) vs. absent (0)
Fitting a negative binomial in R
Negative binomial GLM
Plotting the final GLM model
Challenge 3
Tips
Negative binomial GLMs are favor when overdispersion is high
Plot the data and the use estimates of the parameters to draw model line
NB is not in the glm() function so you need to install and charge the MASS library
- It has two parameters µ and k. k controls for the dispersion parameter (smaller k indicates higher dispersion)
Use the mites dataset! Model the abundance of the species Galumna as a function of the substrate characteristics (water content [WatrCont] and density [SubsDens])
install.packages("MASS")
- Drop each term in turn and compare the full model with a nested model using the command:
Use summary() to get the parameters
glm.negbin = glm.nb(Faramea.occidentalis~Elevation, data=faramea)
- It corresponds to a combination of two distributions (Poisson and gamma)
summary(glm.negbin)$coefficients[1,1]
summary(glm.negbin)$coefficients[2,1]
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.369226 0.473841 5.00 5.73e-07 ***
Elevation -0.007038 0.002496 -2.82 0.00481 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.2593) family taken to be 1)
Null deviance: 41.974 on 42 degrees of freedom
Residual deviance: 36.343 on 41 degrees of freedom
AIC: 182.51
Number of Fisher Scoring iterations: 1
Theta: 0.2593
Std. Err.: 0.0755
2 x log-likelihood: -176.5090
drop1(MyGLM, test = "Chi")
- Do you need to account for overdispersion?
- Which covariates have a significant effect?
- Select the best model!
summary(glm.negbin)
Use the standard errors to build the confidence envelope
- It assumes that the Yi are Poisson distributed with the mean μ assumed to follow a gamma distribution!
Biological data & distributions
- Specify manually a nested model, call it for example MyGLM2, and use the command:
Biological data & distributions
mites <- read.csv("mites.csv", header = TRUE)
This is almost a Poisson GLM, which looks like this:
summary(glm.negbin)$coefficients[1,2]
summary(glm.negbin)$coefficients[2,2]
anova(MyGLM, MyGLM2, test = "Chi")
The effect of elevation on the number of F. occidentalis is significant using the NB GLM but the relationship is weak.
Binomial distribution:
Used to model data where the number of successes are integers and where the number of trials, n, is known.
Main difference with Poisson distribution:
- The binomial has an upper limit to its range, corresponding to n. Consequently, it is right-skewed at low p values but left-skewed at high p values
Getting back to our problem... can switch the distribution of error terms (εi) from normal to Poisson:
yi ~ Poisson(λ = β0 + β1xi)
Problems solved!
1. λ varies with x (water content)
- which means residual variance will also vary with x, which means that we just relaxed the homogeneity of variance assumption!
2. Predicted values will now be integers instead of fractions
3. The model will never predict negative values (Poisson is strictly positive)
Probabilities (in orange) are now integers, and both the variance and the mean of the distribution decline as λ decreases with increasing water content.
Workshop 7: Generalized linear (mixed) models
GLM with binary data
GLMMs
Probability distribution
Binary variables
Logistic regression
Binary variables
Expected values can be out of the [0,1] range with lm():
In R, binary variables are coded with 1 and 0:
Clearly not normally distributed!
Common response variable in ecological datasets is the binary variable: we observe a phenomenon X or its “absence”
The Bernoulli distribution is well suited for binary response variables
glm()
- Presence/Absence of a species
logit.reg <- glm(formula, data, family)
- Presence/Absence of a disease
- Success/Failure to observe behaviour
Site Presence
A 1
B 0
C 1
D 1
E 0
F 1
How to run a GLMM in R
Choosing error distribution
Generalized Linear Mixed Models
Review: Linear Mixed Models
Review: LMMs
Probability p of observing an outcome
Exploring variance
- Survival/Death of organisms
Import the Arabidopsis dataset 'Banta_TotalFruits.csv' into R.
- The response variable is count data which suggests to that a Poisson distribution should be used (i.e. the variance is equal to the mean)
Shrinkage estimates
- To illustrate heterogeneity in variance we will first create boxplots of the response variable versus different environmental factors
- Let's create new variables that represents every combination of nutrient x clipping x random factor
Wish to determine if P/A ~ Environment***
dat.tf <- read.csv("Banta_TotalFruits.csv")
str(dat.tf)
To move away from traditional linear models, need to specify two things:
1) probability distribution
AND
2) a link function
- Extension of GLMs to account for additional structure in dataset
- Follows similar steps introduced in LMM Workshop
1) LMMs incorporate random effects
2) GLMs handle non-normal data (letting errors take on different distribution families - e.g. Poisson or negative binomial)
Variance decreases as p is close to 0 or 1
*** Called a logistic regression or logit model
Review of LMM Workshop:
- Structure in the dataset or correlation among observations can result in lack of independence among observations sampled from same sites or time points
- Account for this by including random effect terms
Random effects:
- Parameter is a sample from the population, i.e. the subjects you happen to be working with
- Explains the variance of the response variable
Fixed effects:
- Parameter is reproducible, i.e. would be the same across studies
- Explain the mean of the response variable
# popu factor with a level for each population
# gen factor with a level for each genotype
# nutrient factor with levels for low (value = 1) or high (value = 8)
# amd factor with levels for no damage or simulated herbivory
# total.fruits integer indicating the number of fruits per plant
- Random effects are often called shrinkage estimates because they represent a weighted average of the data and the overall fit (fixed effect)
- The random effect shrinkage toward the overall fit (fixed effect) is more severe if the within-group variability is large compared to the among-group variability
dat.tf <- within(dat.tf,
{
# genotype x nutrient x clipping
gna <- interaction(gen,nutrient,amd)
gna <- reorder(gna, total.fruits, mean)
# population x nutrient x clipping
pna <- interaction(popu,nutrient,amd)
pna <- reorder(pna, total.fruits, mean)
})
The effect of nutrient availability and herbivory (fixed effects) on the fruit production of the mouse-ear cress (Arabidopsis thaliana) was evaluated by measuring 625 plants across 9 different populations, each comprised of 2 to 3 different genotypes (random effects)
- However, as we will soon see, the variance increases with the mean much more rapidly than expected under the Poisson distribution
Exercise 1
The Link Function
Other distributions
The Link Function
Build a logistic regression model using the mites data
Build a model of the presence of Galumna sp. as a function of water content and topography
For binary data, the link function is called the logit:
is only true for normally distributed data
1) Get the odds (μ / 1-μ)
2) log-transform them
Exploring variance
Choosing error distribution
Poisson-lognormal GLMM
For a simple linear model of a normally distributed continuous response variable, the equation for the expected values is:
Overdispersion check
Negative binomial GLMM
Poisson GLMM
(Poisson-gamma)
setwd(~/workshop6)
mites <- read.csv("mites.csv")
str(mites)
- Another option is the Poisson-lognormal distribution.
- This can be achieved simply by placing an observation-level random effect in the formula.
- We can check for overdispersion using the overdisp_fun function (Bolker et al. 2011) which divides the Pearson residuals by the residual degrees of freedom and tests whether ratio is greater than unity
# Boxplot of total fruits vs genotype x nutrient x clipping interaction
library(ggplot2)
ggplot(data = dat.tf, aes(factor(x = gna),y = log(total.fruits + 1))) +
geom_boxplot(colour = "skyblue2", outlier.shape = 21,
outlier.colour = "skyblue2") +
theme_bw() + theme(axis.text.x=element_text(angle=90)) +
stat_summary(fun.y=mean, geom="point", colour = "red")
logit.reg <- glm(pa ~ WatrCont + Topo, data=mites,
family = binomial(link = "logit"))
summary(logit.reg)
- Logit transformation of the data often used with lm(m) for percentages or proportions when the binomial distribution is not appropriate. When not selections from fixed quantities (e.g. percent cover, school grades, etc).
- Log-normal distribution in glm, avoids having to log-transform the data.
- Gamma distribution. Similar to log-normal, more versatile.
- Tweedie distribution. Versatile family of distributions. Useful for data with a mix of zeros and positive values (not necessarily counts).
- Zero-inflated Poisson or zero-inflated negative binomial. When the data comprise an excess number of zeros, that arise from a different process than the process that generates the counts.
The odds puts our expected values on a 0 to +Inf scale
If this is not the case, must use a transformation on the expected values μ
- As we just saw, there is a large amount of heterogeneity among group variances even when the response variable is transformed
- If we plot the group variances vs group means (example with genotype x nutrient x clipping grouping shown here), we can appreciate that the Poisson family is the least appropriate distribution (i.e. variances increase much faster than the means)
- Recall that the negative binomial (or Poisson-gamma) distribution meets the assumption that the variance is proportional to the square of the mean
summary(logit.reg)
Call:
glm(formula = pa ~ WatrCont + Topo, family = binomial, data = mites)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0387 -0.5589 -0.1594 0.4112 2.0252
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.464402 1.670622 2.672 0.007533 **
WatrCont -0.015813 0.004535 -3.487 0.000489 ***
TopoHummock 2.090757 0.735348 2.843 0.004466 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 91.246 on 69 degrees of freedom
Residual deviance: 48.762 on 67 degrees of freedom
AIC: 54.762
Number of Fisher Scoring iterations: 6
- Given the mean-variance relationship, we will most likely need a model with overdispersion
- but let's start with a Poisson model:
- To run a GLMM in R we simply need to use the glmer function of the lme4 package
# Download the glmm_funs.R code from the wiki page and source it to run the function in R
source("glmm_funs.R")
# Overdispersion?
overdisp_fun(mp1)
where μ is the expected value of the response variable
Deals with overdisp. by adding observation-level random effects
The log transformation puts our expected values on a -Inf to +Inf
X is the model matrix (i.e. your data)
str(mites)
'data.frame': 70 obs. of 9 variables:
$ Galumna : int 8 3 1 1 2 1 1 1 2 5 ...
$ pa : int 1 1 1 1 1 1 1 1 1 1 ...
$ totalabund: int 140 268 186 286 199 209 162 126 123 166 ...
$ prop : num 0.05714 0.01119 0.00538 0.0035 0.01005 ...
$ SubsDens : num 39.2 55 46.1 48.2 23.6 ...
$ WatrCont : num 350 435 372 360 204 ...
$ Substrate : Factor w/ 7 levels "Barepeat","Interface",..: 4 3 2 4 4 4 4 2 3 4 ...
$ Shrub : Factor w/ 3 levels "Few","Many","None": 1 1 1 1 1 1 1 2 2 2 ...
$ Topo : Factor w/ 2 levels "Blanket","Hummock": 2 2 2 2 2 2 2 1 1 2 ...
μ = expected values (probability that Y = 1)
mpl1 <- glmer(total.fruits ~ nutrient*amd + rack + status +
(1|X) +
(1|popu)+
(1|gen),
data=dat.tf, family="poisson", control=glmerControl(optimizer="bobyqa"))
overdisp_fun(mpl1)
mnb1 <- glmer.nb(total.fruits ~ nutrient*amd + rack + status +
(1|popu)+
(1|gen),
data=dat.tf, control=glmerControl(optimizer="bobyqa"))
# Control argument specifies the way we optimize the parameter values
library(lme4)
mp1 <- glmer(total.fruits ~ nutrient*amd + rack + status +
(1|popu)+
(1|gen),
data=dat.tf, family="poisson")
β is the vector of estimated parameters
(i.e. the intercept & slope)
where g(μ) is the link function
> overdisp_fun(mp1)
chisq ratio p logp
15755.86833 25.57771 0.00000 -6578.47027
NB = negative binomial
QP = quasi-Poisson
loess = Locally weighted
regression smoothing
Specify the random effects
chisq ratio p logp
1.775361e+02 2.886766e-01 1.000000e+00 -3.755374e-73
The expected values can now be linearly related to the linear predictor
1) Get the odds (μ / 1-μ)
2) log-transform them
# Overdispersion?
overdisp_fun(mnb1)
Ratio is now much closer to 1 although p-value is still less than 0.05
Xβ is called the linear predictor
factor(genotype x nutrient x clipping)
This allows us to relax the normality assumption
Ratio now meets our criterion
- Ratio is significantly >> 1
- As expected, we need to try a different distribution where the variance increase more rapidly than the mean
* Similarly, the boxplot of total fruits vs population x nutrient x clipping interaction shows a large amount of heterogeneity among populations.
Interpreting the output
Solution
Interpreting the output
Predictive Power and goodness of fit
Challenge 1
model.bact1 <- glm(y ~ trt * week,
data = bacteria, family = binomial)
Website: http://qcbs.ca/wiki/r/workshop7
On the odds scale for water content:
Let's go back to the summary of our 'logit.reg' model to see the coefficients:
Remember we used a logit transformation on the expected values!
exp(logit.reg$coefficient[2])
Get the pseudo-R , the analogue of the R for models fitted by maximum likelihood:
model.bact2 <- glm(y ~ trt + week,
data = bacteria, family = binomial)
Visualizing the random effects
Visualization the model parameters
Model selection
summary(logit.reg)$coefficients
Model selection
To properly interpret the regression parameters, we have to use a 'reverse' function:
- A graphical representation of the model parameters can be obtained using the coefplot2 funtion:
Using the 'bacteria' dataset, model the presence of H. influenzae as a function of treatment and week of test. Start with a full model and reduce it to the most parsimonious model.
- We first code potential models and compare them using AICc*:
model.bact3 <- glm(y ~ week, data = bacteria,
family = binomial)
- You can extract the random effect predictions using ranef and plot them using a dotplot
- Regional variability among populations:
- Spanish populations (SP) larger values than Swedish (SW) or Dutch (NL)
- Difference among genotypes largely driven by genotype 34
- Alternatively, we can use drop1 and dfun functions to evaluate our fixed effects (dfun converts the AIC values returned by the drop1 into ΔAIC values)
null deviance - residual deviance
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.46440 1.67062 2.67230 0.00753
WatrCont -0.01581 0.00454 -3.48673 0.00049
TopoHummock 2.09076 0.73535 2.84322 0.00447
library(coefplot2)
# Variance terms
coefplot2(mpl1,ptype="vcov",intercept=TRUE)
# Fixed effects
coefplot2(mpl1,intercept=TRUE)
The natural exponential function to obtain the odds: e
On the probability scale for water content:
Comparing deviance of your model (residual deviance) to the deviance of a null model (null deviance)
The null model is a model without explanatory variables
anova(model.bact1, model.bact2, model.bact3,
test = "LRT")
null.model <- glm(Response.variable ~ 1, family = binomial)
mpl2 <- update(mpl1, . ~ . - rack) # model without rack
mpl3 <- update(mpl1, . ~ . - status) # model without status
mpl4 <- update(mpl1, . ~ . - amd:nutrient) # without amd:nutrient interaction
ICtab(mpl1,mpl2,mpl3,mpl4, type = c("AICc"))
- The same methods can be used with a glmm or lmm to choose between models with various random intercepts and/or random slopes and to choose fixed effects to keep in final model.
- an information theoretic approach (e.g., AICc - Workshop 5),
- a frequentist approach (where the significance of each term is evaluated using ''anova()'' and the likelihood ratio test; LRT)
The inverse logit function to obtain the probabilities:
(dd_LRT <- drop1(mpl1,test="Chisq"))
(dd_AIC <- dfun(drop1(mpl1)))
1 / (1 + 1/exp(logit.reg$coefficient[2]))
mpl2 <- update(mpl1, . ~ . - and:nutrient)
# Use AIC
mpl3 <- update(mpl2, . ~ . - rack) # no rack or interaction
mpl4 <- update(mpl2, . ~ . - status) # no status or interaction
mpl5 <- update(mpl2, . ~ . - nutrient) # no nutrient or interaction
mpl6 <- update(mpl2, . ~ . - amd) # no clipping or interaction
ICtab(mpl2,mpl3,mpl4,mpl5,mpl6, type = c("AICc"))
# Or use drop1
(dd_LRT2 <- drop1(mpl2,test="Chisq"))
(dd_AIC2 <- dfun(drop1(mpl2)))
install.packages("MASS")
library(MASS)
data(bacteria)
str(bacteria)
?bacteria
Analysis of Deviance Table
Model 1: y ~ trt * week
Model 2: y ~ trt + week
Model 3: y ~ trt
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 214 203.12
2 216 203.81 -2 -0.6854 0.709836
3 217 210.72 -1 -6.9140 0.008552 **
The output indicates that both water content and topography are significant, but how do we interpret the slope coefficients?
Df dAIC
<none> 0.000
rack 1 55.083
status 2 1.612
nutrient:amd 1 1.444
dAICc df
mpl1 0.0 10
mpl4 1.4 9
mpl3 1.5 8
mpl2 55.0 9
variance explained by the model
Based on these results, we choose model #2 as the best candidate to model the data.
In R, we can extract the residual and null deviances directly from the glm object:
objects(logit.reg)
pseudoR2 <- (logit.reg$null.deviance –
logit.reg$deviance) / logit.reg$null.deviance
pseudoR2
Hence, the model explains 46.6% of the variability in the data
* NB: We do not cover all possible models above, however, the interaction amd: nutrient can only be evaluated if both amd and nutrient (i.e., the main effects) are included in the model.
- Both the main effects of nutrient and clipping are strong (large change in AIC of 135.6 (mpl5) and 10.2 (mpl6) if either nutrient or clipping are dropped, respectively).
- Final model includes the fixed nutrient and clipping effects, rack, and observation-level random e ffects (1 | X) to account for over-dispersion
- Strong rack effect (dAIC = 55.08 if we remove this variable)
- Effects of status and interaction term are weak (dAIC < 2)
- Start by removing the non-significant interaction term to test main effects of nutrient and clipping
Note: the grid.arrange() function from the {gridExtra} library was used to make this graphic (see script for details).
Note: error bars are only shown for the fixed effects because glmer doesn't provide information on uncertainty of variance terms
Proportion data and GLM
Exercise 2
Solution
Challenge 2
Predictive Power and goodness of fit
null.d <- model.bact2$null.deviance
In R these tests are available in the 'binomTools' package
In R, we have to specify the number of times something happened and the number of times something did not happen:
Sometimes, proportion data is more similar to logistic regression than you think...
resid.d <- model.bact2$deviance
Additional GLMM resources
library(binomTools)
Using the model created with bacteria dataset, assess goodness of fit and predictive power.
bact.pseudoR2 <- (null.d -resid.d) / null.d
If we measure the number of occurrences and we know the total sample size, it is not count data!
New statistic - coefficient of discrimination (D) evaluates the predictive power of logistic regression
- Measure of how well logistic regression classifies an outcome as a success or a failure
Up for a challenge?
Compute the coefficient of discrimination D
HLtest(Rsq(model.bact2)
> prop.reg <- glm(cbind(Galumna, totalabund - Galumna) ~ Topo + WatrCont, data = mites, family = binomial)
> summary(prop.reg)
We can also code the model directly with proportions :
> prop.reg2 <- glm(prop ~ Topo + WatrCont, data = mites, family = binomial, weights = totalabund)
Books
B. Bolker (2009) Ecological Models and Data in R. Princeton University Press.
A. Zuur et al. (2009) Mixed Effects Models and Extensions in Ecology with R. Springer.
Websites
GLMM for ecologists (http://glmm.wikidot.com)
A great website on GLMM with a Q&A section!
Call:
glm(formula = cbind(Galumna, totalabund - Galumna) ~ WatrCont +
Topo, family = binomial, data = mites)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4808 -0.9699 -0.6327 -0.1798 4.1688
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.288925 0.422109 -7.792 6.61e-15 ***
WatrCont -0.005886 0.001086 -5.420 5.97e-08 ***
TopoHummock 0.578332 0.274928 2.104 0.0354 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 140.702 on 69 degrees of freedom
Residual deviance: 85.905 on 67 degrees of freedom
AIC: 158.66
fit <- Rsq(object = logit.reg)
fit
Think how predictive power could be improved for this model.
Suppose we measure disease prevalence in ten deer populations on 10 deer individuals per population:
To assess goodness of fit, diagnostic plots are not useful, instead must use Hosmer-Lemeshow test:
- Compare observed and expected number of outcomes
- Similar to a Chi square test
Adding informative explanatory variables could increase the explanatory power of the model
- Use the inverts dataset (larval development times (PLD) of 74 marine invertebrate and vertebrate species reared at different temperatures and time), answer the following questions:
- What is the effect of feeding type and climate (fixed effects) on PLD?
- Does this relationship vary among taxa (random effects)?
- What is the best distribution family for this count data?
- Finally, once you determined the best distribution family, re-evaluate your random and fixed effects.
Perform a Hosmer-Lemeshow test*
always bound between 0 an 1!
HLtest(Rsq(logit.reg))
*A non significant value indicates an adequate fit!