Loading presentation...

Present Remotely

Send the link below via email or IM

Copy

Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.

DeleteCancel

Make your likes visible on Facebook?

Connect your Facebook account to Prezi and let your likes appear on your timeline.
You can change this under Settings & Account at any time.

No, thanks

QCBS R Workshop 7

No description
by

CSBQ QCBS

on 15 September 2016

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of QCBS R Workshop 7

Other distributions
Logit transformation of the data
often used with lm(m) for percentages or proportions when the binomial distribution is not appropriate. When not selections from fixed quantities (e.g. percent cover, school grades, etc).
Log-normal distribution in glm
, avoids having to log-transform the data.
Gamma distribution
. Similar to log-normal, more versatile.
Tweedie distribution
. Versatile family of distributions. Useful for data with a mix of zeros and positive values (not necessarily counts).
Zero-inflated Poisson or zero-inflated negative binomial
. When the data comprise an excess number of zeros, that arise from a different process than the process that generates the counts.
Quebec Centre for Biodiversity Science

R Workshop Series

Workshop 7: Generalized linear (mixed) models
Website
:
http://qcbs.ca/wiki/r/workshop7

Poisson GLM behind the scenes
Modeling count data
positive values: you do not count -7 individuals
integer values: you do not count 7.56 individuals
exhibits larger variance for large values
The Poisson distribution
What is count data?
Count data are characterized by:
The Poisson distribution specifies the probability of a discrete random variable Y and is given by:
If µ is known we can compute the probability for different values of y
Properties
µ
is the parameter of the Poisson distribution
specifies the probably only for integer values
probability for negative values is null (P(Y<0) = 0)
mean = variance (allows for heterogeneity)
A Poisson GLM will model the value of µ as a function of different explanatory variables
Step 1.

We assume Yi follows a Poisson distribution with mean and variance µi
Fitting a Poisson GLM in R
Three steps
µ
i corresponds to the expected number of individuals
Poisson GLM behind the scenes
Step 2.

We specify the systematic part of the model just as in a linear model
Step 3.

The link between the mean of Yi and the systematic part is a logarithmic
Or
glm.poisson = glm(Faramea.occidentalis~Elevation, data=faramea family=poisson)
Specify the distribution and link function
summary(glm.poisson)
The function
glm()
allows you to specify a Poisson GLM
As with
lm()
you can access the outputs of the model using the function summary()
Call:
glm(formula = Faramea.occidentalis ~ Elevation, family = poisson,
data = faramea)

Deviance Residuals:
Min 1Q Median 3Q Max
-3.3319 -2.7509 -1.5451 0.1139 11.3995

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.7687001 0.1099136 16.092 < 2e-16 ***
Elevation -0.0027375 0.0006436 -4.253 2.11e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 414.81 on 42 degrees of freedom
Residual deviance: 388.12 on 41 degrees of freedom
AIC: 462.01
Parameter estimates
The deviance
In our model the unknown parameters are the intercept (alpha) and the slope of elevation (Beta)
Remember that to estimate the unknown parameters maximum likelihood estimation is used
Model summary
The residual deviance is defined as twice the difference between the log likelihood of a model that provides a perfect fit and the log likelihood of our model
Res dev
= 2log(
L
(
y
;
y
)) - 2log(
L
(
y
;
µ
))
In a Poisson GLM, the residual deviance should equal the residual degrees of freedoms
388.12 >> 41
Overdispersion
When the residual deviance is higher than the residual degrees of freedom we say that the model is
overdispersed
Occurs when the variance in the data is even higher than the mean, hence the Poisson distribution is not the best choice (many zeros, many very high values, missing covariates, etc)
Solutions
2.
Choose another distribution:
the negative binomial
1.
Correct for overdispersion using
quasi-Poisson GLM
Quasi-Poisson GLM
The variance of the distribution will account for
overdispersion
:
is the dispersion parameter. It will be estimated prior to estimate parameters. Correcting for overdispersion will not affect parameter estimates but will affect their
significance
. Indeed, the standard errors of the parameters are multiplied by

the
systematic part
and the
link function
remain the same
Some marginally significant p-values may no longer hold!
Dispersion parameter
1
5
Poisson GLM
10
>15-20
Quasi-Poisson GLM
Negative binomial
Fitting a quasi-Poisson GLM in R
glm.quasipoisson = glm(Faramea.occidentalis~Elevation, data=faramea, family=quasipoisson)
Specify the new family
Call:
glm(formula = Faramea.occidentalis ~ Elevation, family = quasipoisson,
data = faramea)

Deviance Residuals:
Min 1Q Median 3Q Max
-3.3319 -2.7509 -1.5451 0.1139 11.3995

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.768700 0.439233 4.027 0.000238 ***
Elevation -0.002738 0.002572 -1.064 0.293391
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 15.96936)

Null deviance: 414.81 on 42 degrees of freedom
Residual deviance: 388.12 on 41 degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 10
summary(glm.quasipoisson)
The standard errors of the parameters are multiplied by
Same estimates but
AIC is not defined
Negative binomial GLM
Fitting a negative binomial in R
Negative binomial GLMs are favor when overdispersion is high
NB is not in the
glm()
function so you need to install and charge the MASS library
install.packages("MASS")
glm.negbin = glm.nb(Faramea.occidentalis~Elevation, data=faramea)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.369226 0.473841 5.00 5.73e-07 ***
Elevation -0.007038 0.002496 -2.82 0.00481 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.2593) family taken to be 1)

Null deviance: 41.974 on 42 degrees of freedom
Residual deviance: 36.343 on 41 degrees of freedom
AIC: 182.51

Number of Fisher Scoring iterations: 1


Theta: 0.2593
Std. Err.: 0.0755

2 x log-likelihood: -176.5090
Plotting the final GLM model
Step1
Plot the data and the use estimates of the parameters to draw model line
Use
summary()
to get the parameters
Step2
Use the standard errors to build the confidence envelope
It corresponds to a combination of two distributions (
Poisson
and
gamma
)
It assumes that the
Yi
are Poisson distributed with the mean μ assumed to follow a gamma distribution!
It has
two parameters

µ
and
k.

k
controls for the dispersion parameter (smaller
k
indicates higher dispersion)
GLM with count data
GLM with binary data
Binary variables
Presence/Absence of a species
Presence/Absence of a disease
Success/Failure to observe behaviour
Survival/Death of organisms
Wish to determine if P/A ~ Environment***
*** Called a logistic regression or logit model
Binary variables
In R, binary variables are coded with 1 and 0:
Site Presence
A 1
B 0
C 1
D 1
E 0
F 1
Species is present!
Species is absent!
GLMMs
Why be normal?
Logistic regression
The function!
glm()
logit.reg <- glm(formula, data, family)
Binary variables
Clearly not normally distributed!
Binary variables
Expected values can be out of the [0,1] range with lm():
To move away from traditional linear models, need to specify two things:
1) probability distribution
AND
2) a link function
Probability distribution
The Bernoulli distribution is well suited for binary response variables
E(
Y
) =
p
Var(
Y
) =
p
* (1-
p
)
The Link Function
For a
simple linear model
of a normally distributed continuous response variable, the equation for the expected values is:
μ = Xβ
where μ is the expected value of the response variable
X is the model matrix (
i.e.
your data)
β is the vector of estimated parameters
(
i.e.
the intercept & slope)
Xβ is called the
linear predictor
The Link Function
μ = Xβ
is only true for normally distributed data
If this is not the case, must use a transformation on the expected values μ
g
(μ) = Xβ
The Link Function
g
(μ) = log
For binary data, the link function is called the
logit:
1-μ
μ
1) Get the odds (μ / 1-μ)
2) log-transform them
The Link Function
g
(μ) = log
1-μ
μ
1) Get the odds (μ / 1-μ)
2) log-transform them
The odds puts our expected values on a 0 to +Inf scale
The log transformation puts our expected values on a -Inf to +Inf
The expected values can now be
linearly
related to the linear predictor
μ = expected values (probability that Y = 1)
Exercise 1
Build a logistic regression model using the mites data
setwd(~/workshop6)
mites <- read.csv("mites.csv")
str(mites)
str(mites)
'data.frame': 70 obs. of 9 variables:
$ Galumna : int 8 3 1 1 2 1 1 1 2 5 ...
$ pa : int 1 1 1 1 1 1 1 1 1 1 ...
$ totalabund: int 140 268 186 286 199 209 162 126 123 166 ...
$ prop : num 0.05714 0.01119 0.00538 0.0035 0.01005 ...
$ SubsDens : num 39.2 55 46.1 48.2 23.6 ...
$ WatrCont : num 350 435 372 360 204 ...
$ Substrate : Factor w/ 7 levels "Barepeat","Interface",..: 4 3 2 4 4 4 4 2 3 4 ...
$ Shrub : Factor w/ 3 levels "Few","Many","None": 1 1 1 1 1 1 1 2 2 2 ...
$ Topo : Factor w/ 2 levels "Blanket","Hummock": 2 2 2 2 2 2 2 1 1 2 ...
Exercise 1
Build a model of the presence of Galumna sp. as a function of water content and topography
logit.reg <- glm(pa ~ WatrCont + Topo, data=mites,
family = binomial(link = "logit"))

summary(logit.reg)
Call:
glm(formula = pa ~ WatrCont + Topo, family = binomial, data = mites)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.0387 -0.5589 -0.1594 0.4112 2.0252

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.464402 1.670622 2.672 0.007533 **
WatrCont -0.015813 0.004535 -3.487 0.000489 ***
TopoHummock 2.090757 0.735348 2.843 0.004466 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 91.246 on 69 degrees of freedom
Residual deviance: 48.762 on 67 degrees of freedom
AIC: 54.762

Number of Fisher Scoring iterations: 6
summary(logit.reg)
Challenge 1
Using the 'bacteria' dataset, model the presence of
H. influenzae
as a function of treatment and week of test. Start with a full model and reduce it to the most parsimonious model.
install.packages("MASS")
library(MASS)
data(bacteria)
str(bacteria)
?bacteria
Solution
model.bact1 <- glm(y ~ trt * week,
data = bacteria, family = binomial)
model.bact2 <- glm(y ~ trt + week,
data = bacteria, family = binomial)
anova(model.bact1, model.bact2, model.bact3,
test = "LRT")
model.bact3 <- glm(y ~ week, data = bacteria,
family = binomial)
Analysis of Deviance Table

Model 1: y ~ trt * week
Model 2: y ~ trt + week
Model 3: y ~ trt
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 214 203.12
2 216 203.81 -2 -0.6854 0.709836
3 217 210.72 -1 -6.9140 0.008552 **
Based on these results, we choose model #2 as the best candidate to model the data.
Interpreting the output
Let's go back to the summary of our 'logit.reg' model to see the coefficients:
summary(logit.reg)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.46440 1.67062 2.67230 0.00753
WatrCont -0.01581 0.00454 -3.48673 0.00049
TopoHummock 2.09076 0.73535 2.84322 0.00447
Interpreting the output
Remember we used a logit transformation on the expected values!
To properly interpret the regression parameters, we have to use a 'reverse' function:
The natural exponential function to obtain the odds: e
x
The inverse logit function to obtain the probabilities:
logit = 1 / (1 + 1/e )
-1
x
Interpreting the output
On the odds scale for water content:
exp(logit.reg$coefficient[2])
WatrCont
0.9843118
On the probability scale for water content:
1 / (1 + 1/exp(logit.reg$coefficient[2]))
WatrCont
0.4960469
Predictive Power and goodness of fit
Get the pseudo-R , the analogue of the R for models fitted by maximum likelihood:
null deviance - residual deviance
null deviance
pseudo-R =
2
2
2
variance explained by the model
Exercise 2
Compute the coefficient of discrimination D
Predictive Power and goodness of fit
To assess goodness of fit, diagnostic plots are not useful, instead must use
Hosmer-Lemeshow test:
Compare observed and expected number of outcomes
Similar to a Chi square test
Perform a Hosmer-Lemeshow test*
HLtest(Rsq(logit.reg))
Challenge 2
Using the model created with bacteria dataset, assess goodness of fit and predictive power.
Think how predictive power could be improved for this model.
1)
2)
Solution
null.d <- model.bact2$null.deviance
resid.d <- model.bact2$deviance
bact.pseudoR2 <- (null.d -resid.d) / null.d
1)
2)
Adding
informative
explanatory variables could increase the explanatory power of the model
HLtest(Rsq(model.bact2)
Modeling count data
What is count data?
faramea <- read.csv(‘faramea.csv’, header = TRUE)
Import the 'faramea.csv' into R.
Let's investigate the histogram of the number of
Faramea occidentalis

The number of trees of the species
Faramea occidentalis
was assessed in 43 quadrats in Barro Colorado Island (Panama). For each quadrat, environmental characteristics were also recorded such as elevation or precipitation.
Modeling count data
the
Poisson distribution
seems to be the perfect choice to model this data, hence
Poisson GLMs
are usually the good way to start modeling count data.
How to model count data?
Does elevation influence the abundance of
F. occidentalis
?
Slope of '
Elevation
'
One intercept
Solution
# Poisson GLM
glm.p = glm(Galumna~WatrCont+SubsDens, data=mites, family=poisson)

# quasi-Poisson GLM
glm.qp = update(glm.p,family=quasipoisson)

# model selection
drop1(glm.qp, test = "Chi")
# or
glm.qp2 = glm(Galumna~WatrCont, data=mites, family=quasipoisson)
anova(glm.qp2, glm.qp, test="Chisq")
Residual deviance
Residual degrees of freedom
glm.quasipoisson = update(glm.poisson,family=quasipoisson)
Create a new GLM using the '
quasipoisson
' family or update the previous one
= 4!
0.0006436*4
null.model <- glm(Faramea.occidentalis~1, data=faramea, family=quasipoisson)

anova(null.model, glm.quasipoisson, test="Chisq")
Try also deviance analysis to test for the effect of Elevation
Model 1: Faramea.occidentalis ~ 1
Model 2: Faramea.occidentalis ~ Elevation
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 42 414.81
2 41 388.12 1 26.686 0.1961
Fitting a negative binomial in R
summary(glm.negbin)
k
summary(glm.negbin)$coefficients[1,1]
summary(glm.negbin)$coefficients[2,1]
summary(glm.negbin)$coefficients[1,2]
summary(glm.negbin)$coefficients[2,2]
Plotting the final GLM model
The effect of elevation on the number of
F. occidentalis
is significant using the NB GLM but the relationship is weak.
Challenge 3
Use the
mites
dataset! Model the abundance of the species
Galumna
as a function of the substrate characteristics (water content [
WatrCont
] and density [
SubsDens
])
mites <- read.csv("mites.csv", header = TRUE)
N.S.
Upper limit
Lower limit
Do you need to account for overdispersion?
Which covariates have a significant effect?
Select the best model!

Drop each term in turn and compare the full model with a nested model using the command:
Tips
Model selection for GLMs

Specify manually a nested model, call it for example MyGLM2, and use the command:
drop1(MyGLM, test = "Chi")
anova(MyGLM, MyGLM2, test = "Chi")
setwd("~/Desktop")
mites <- read.csv('mites.csv')
head(mites)
str(mites)
Limitations of linear (mixed) models
Load dataset and fit a LM:
The dataset that you just loaded is a subset of the 'Oribatid mite dataset':
Exploring relationships
Can we see any relationship(s) between
Galumna
and the 5 environmental variables?
Exploring relationships
plot(mites)
Exploring relationships
Can we see any relationship(s) between
Galumna
and the 5 environmental variables?
A negative relationship between
Galumna
and water content?
par(mfrow=c(1,3))

plot(Galumna ~ WatrCont, data = mites, xlab = 'Water content', ylab='Abundance')

boxplot(WatrCont ~ pa, data = mites, xlab='Presence/Absence', ylab = 'Water content')

plot(prop ~ WatrCont, data = mites, xlab = 'Water content', ylab='Proportion')

par(mfrow=c(1,1))


Testing linearity
Fit linear models to test whether abund, pa, and/or prop varies as a function of water content.
Testing linearity
Fit linear models to test whether abund, pa, and/or prop varies as a function of water content.
lm.abund <- lm(Galumna ~ WatrCont, data = mites)
summary(lm.abund)
lm.pa <- lm(pa ~ WatrCont, data = mites)
summary(lm.pa)
lm.prop <- lm(prop ~ WatrCont, data = mites)
summary(lm.prop)
Testing linearity
Significant relationship in all models!

Wait a minute...
plot(Galumna ~ WatrCont, data = mites)
abline(lm.abund)
plot(lm.abund)
Testing linearity
Even worse for other models:
#Proportion
plot(prop ~ WatrCont, data = mites)
abline(lm.prop)
plot(lm.prop)

#Presence/Absence
plot(pa ~ WatrCont, data = mites)
abline(lm.pa)
plot(lm.pa)
Model assumptions
Common in Ecology that assumptions of homogeneity of variance and normality are not met.
Main reason why we need GLMs!



Let's revisit the assumptions of lm...
Model assumptions
Equation of lm:

y =
β
+
β
x +
ε


where:
yi = predicted value of response variable
β0 = intercept
β1 = slope
xi = explanatory variable
εi
= model residuals drawn from a normal distribution with a varying mean but a constant variance
**
** Key point!

Residuals (the distance between each observation and the regression line) can be predicted by drawing random values from a normal distribution.
Normally distributed residuals
Recall:
Normal distributions have two parameters, μ (mean) and σ (variance):
i
i
i
0
1
2
Normally distributed residuals
Another way to write the lm equation is:

yi ~ N(μ = β0 + β1xi, σ ),

Which literally means that yi is drawn from a normal distribution with parameters μ (which depends on xi) and σ (which has the same value for all Ys)


Lets predict
Galumna
abund as a function of water content using the lm we fitted earlier...
2
2
Model prediction
We need regression coefficients (betas) and sigma:


coef(lm.abund)
summary(lm.abund)$sigma
What are the parameters of the normal distribution used to model y when water content = 300?
yi ~ N(μ = β0 + β1xi, σ )
2
Model prediction
At y = 300, residuals should follow a normal distribution with μ = 1.63 and σ = 1.51.
At y = 400, we get μ = 1.02 and σ = 1.51, etc.

Graphically, this is our model:
Biological data & distributions
Problems:
σ is not homogeneous, yet lm forces a constant σ
Predicted values should be integers
Statisticians have described a multitude of distributions that correspond to different types of data

A distribution provides the probability of observing each possible outcome of an experiment or survey (e.g. abund = 8
Galumna
)

Distributions can be
discrete
(only includes integers
or
continuous
(includes fractions)

All distributions have
parameters
that dictate the shape of the distribution (e.g. μ and σ for the normal)
Biological data & distributions
Galumna
abund follows a discrete distribution (can only take integer values).

A useful distribution to model abundance data is the “Poisson” distribution:
a discrete distribution with a single parameter, λ (lambda), which defines both the mean and the variance of the distribution:
Biological data & distributions
Galumna
seems to follow a Poisson distribution with a low value of λ:
hist(mites$Galumna)
mean(mites$Galumna)
Biological data & distributions
Presence-absence takes yet another form:
only 0s and 1s
Poisson distribution would not be appropriate to model this variable
hist(mites$pa)
Biological data & distributions
“Bernoulli” distribution
Only two possible outcomes in its range: success (1) or failure (0)
One parameter, p, the probability of success












We can use the Bernouilli distribution to calculate the probability
Galumna
present (1) vs. absent (0)
Biological data & distributions
Binomial distribution:
When there are multiple trials (each with a success/failure), the Bernoulli distribution expands into the binomial
Additional parameter, n, for number of trials
Predicts the probability of observing a given proportion of successes, p, out of a known total number of trials, n.
Biological data & distributions
Binomial distribution:
Used to model data where the number of successes are integers and where the number of trials, n, is known.

Main difference with Poisson distribution:

The binomial has an upper limit to its range, corresponding to n. Consequently, it is right-skewed at low p values but left-skewed at high p values
vs.
Biological data & distributions
Getting back to our problem... can switch the distribution of error terms (εi) from normal to Poisson:


yi ~ Poisson(λ = β0 + β1xi)


Problems solved!
1. λ varies with x (water content)
which means residual variance will also vary with x, which means that we just relaxed the homogeneity of variance assumption!

2. Predicted values will now be integers instead of fractions

3. The model will never predict negative values (Poisson is strictly positive)
Biological data & distributions
This is
almost
a Poisson GLM, which looks like this:
Probabilities (in orange) are now integers, and both the variance and the mean of the distribution decline as λ decreases with increasing water content.
Review Workshop 5:
Structure in the dataset or correlation among observations can result in
lack of independence among observations
sampled from same sites or time points
Account for this by including
random effect terms


Random effects:
Parameter is a sample from the population, i.e. the subjects you happen to be working with
Explains the variance of the response variable

Fixed effects:
Parameter is reproducible, i.e. would be the same across studies
Explain the mean of the response variable

Review: Linear Mixed Models
Extension of GLMs to account for additional structure in dataset

Follows similar steps introduced in LMM Workshop 5
1) LMMs incorporate random effects
2) GLMs handle non-normal data (letting errors take on different distribution families - e.g. Poisson or negative binomial)

Generalized Linear Mixed Models
Shrinkage estimates
How to run a GLMM in R
dat.tf <- read.csv("Banta_TotalFruits.csv")
str(dat.tf)
Import the Arabidopsis dataset 'Banta_TotalFruits.csv' into R.
The effect of nutrient availability and herbivory (
fixed effects
)

on the fruit production of the mouse-ear cress (
Arabidopsis thaliana
) was evaluated by measuring 625 plants across 9 different populations, each comprised of 2 to 3 different genotypes (
random effects
)
# popu factor with a level for each population
# gen factor with a level for each genotype
# nutrient factor with levels for low (value = 1) or high (value = 8)
# amd factor with levels for no damage or simulated herbivory
# total.fruits integer indicating the number of fruits per plant
Exploring variance
dat.tf <- within(dat.tf,
{

# genotype x nutrient x clipping
gna <- interaction(gen,nutrient,amd)
gna <- reorder(gna, total.fruits, mean)

# population x nutrient x clipping
pna <- interaction(popu,nutrient,amd)
pna <- reorder(pna, total.fruits, mean)
})
To illustrate heterogeneity in variance we will first create boxplots of the response variable versus different environmental factors
Let's create new variables that represents every combination of
nutrient
x
clipping
x
random factor
# Boxplot of total fruits vs genotype x nutrient x clipping interaction
ggplot(data = dat.tf, aes(factor(x = gna),y = log(total.fruits + 1))) +
geom_boxplot(colour = "skyblue2", outlier.shape = 21,
outlier.colour = "skyblue2") +
theme_bw() + theme(axis.text.x=element_text(angle=90)) +
stat_summary(fun.y=mean, geom="point", colour = "red")
* Similarly, the boxplot of total fruits vs population x nutrient x clipping interaction shows a large amount of heterogeneity among populations.
Choosing error distribution
The response variable is count data which suggests to that a
Poisson distribution
should be used (i.e. the variance is equal to the mean)
However, as we will soon see, the variance increases with the mean much more rapidly than expected under the Poisson distribution
Choosing error distribution
As we just saw, there is a large amount of heterogeneity among group variances even when the response variable is transformed

If we plot the
group variances vs group means
(
example with genotype x nutrient x clipping grouping shown here
), we can appreciate that the Poisson family is the least appropriate distribution (i.e. variances increase much faster than the means)
Poisson GLMM
Given mean-variance relationship, we will most likely need a model with overdispersion
but let's start with a Poisson model:
To run a GLMM in R we simply need to use the
glmer
function of the lme4 package
mp1 <- glmer(total.fruits ~ nutrient*amd + rack + status +
(1|popu)+
(1|gen),
data=dat.tf, family="poisson")
Specify the family
Specify the random effects
We can check for overdispersion using the
overdisp_fun
function (Bolker et al. 2011) which divides the Pearson residuals by the residual degrees of freedom and tests whether ratio is greater than unity
Overdispersion check
# Overdispersion?
overdisp_fun(mp1)
Random effects are often called
shrinkage estimates
because they represent a weighted average of the data and the overall fit (fixed effect)

The random effect shrinkage toward the overall fit (fixed effect) is more severe if the within-group variability is large compared to the among-group variability
Review: LMMs
Exploring variance
factor(genotype x nutrient x clipping)
Exercise 1
> overdisp_fun(mp1)
chisq ratio p logp
15755.86833 25.57771 0.00000 -6578.47027
Ratio is significantly >> 1
As expected, we need to try a different distribution where the variance increase more rapidly than the mean
Negative binomial GLMM
(Poisson-gamma)
Recall that the negative binomial (or Poisson-gamma) distribution meets the assumption that the
variance is proportional to the square of the mean
mnb1 <- glmer.nb(total.fruits ~ nutrient*amd + rack + status +
(1|popu)+
(1|gen),
data=dat.tf, control=glmerControl(optimizer="bobyqa"))
# Control argument specifies the way we optimize the parameter values
# Overdispersion?
overdisp_fun(mnb1)
Ratio is now much closer to 1 although p-value is still less than 0.05
Exercise 2
Poisson-lognormal GLMM
Another option is the
Poisson-lognormal
distribution.
This can be achieved simply by placing an observation-level random effect in the formula.
mpl1 <- glmer(total.fruits ~ nutrient*amd + rack + status +
(1|X) +
(1|popu)+
(1|gen),
data=dat.tf, family="poisson", control=glmerControl(optimizer="bobyqa"))

overdisp_fun(mpl1)
chisq ratio p logp
1.775361e+02 2.886766e-01 1.000000e+00 -3.755374e-73
Ratio now meets our criterion
Model selection
The same methods can be used with a glmm or lmm to choose between models with various random intercepts and/or random slopes and to choose fixed effects to keep in final model.
an
information theoretic approach
(e.g., AICc - Workshop 5),
a
frequentist approach
(where the significance of each term is evaluated using ''anova()'' and the likelihood ratio test; LRT)*
Parameter plots
A graphical representation of the model parameters can be obtained using the
coefplot2
funtion:
# Variance terms
coefplot2(mpl1,ptype="vcov",intercept=TRUE)

# Fixed effects
coefplot2(mpl1,intercept=TRUE)
Random intercept plots
Extract the random effect predictions using
ranef
and plot using
dotplot

Regional variability among populations:
Spanish populations (SP) larger values than Swedish (SW) or Dutch (NL)
Difference among genotypes largely driven by genotype 34
Note:

grid.arrange
was used to make this graphic (see script for details).
Model selection
Alternatively, we can use
drop1
and
dfun
functions to evaluate our fixed effects (dfun converts the AIC values returned by the drop1 into ΔAIC values)
(dd_LRT <- drop1(mpl1,test="Chisq"))
(dd_AIC <- dfun(drop1(mpl1)))
Df dAIC
<none> 0.000
rack 1 55.083
status 2 1.612
nutrient:amd 1 1.444
Strong
rack
effect (dAIC = 55.08 if we remove this variable)
Effects of
status
and
interaction
term are weak (dAIC < 2)
Start by
removing the non-significant interaction
term to test main effects of nutrient and clipping
Proportion data and GLM
Sometimes, proportion data is more similar to logistic regression than you think...
If we measure the number of occurrences and we know the total sample size, it is not count data!
Suppose we measure disease prevalence in ten deer populations on 10 deer individuals per population:
x infected deer
10 deer
}
always bound between 0 an 1!
Exercise 3
In R, we have to specify the number of times something happened and the number of times something did not happen:
> prop.reg <- glm(cbind(Galumna, totalabund - Galumna) ~ Topo + WatrCont, data = mites, family = binomial)

> summary(prop.reg)
Call:
glm(formula = cbind(Galumna, totalabund - Galumna) ~ WatrCont +
Topo, family = binomial, data = mites)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.4808 -0.9699 -0.6327 -0.1798 4.1688

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.288925 0.422109 -7.792 6.61e-15 ***
WatrCont -0.005886 0.001086 -5.420 5.97e-08 ***
TopoHummock 0.578332 0.274928 2.104 0.0354 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 140.702 on 69 degrees of freedom
Residual deviance: 85.905 on 67 degrees of freedom
AIC: 158.66
NB = negative binomial


QP = quasi-Poisson



loess = Locally weighted
regression smoothing
Note:
error bars are only shown for the fixed effects because glmer doesn't provide information on uncertainty of variance terms
Random effect variance
Fixed effect coefficient
library(binomTools)
We can also code the model directly with proportions :
> prop.infected <- n.infected / n.total

> prop.reg2 <- glm(prop.infected ~ Topo + WatrCont, data = mites, family = binomial, weights = n.total)

Model selection
We first code potential models and compare them using AICc
*
:
mpl2 <- update(mpl1, . ~ . - rack)
# model without rack
mpl3 <- update(mpl1, . ~ . - status)
# model without status
mpl4 <- update(mpl1, . ~ . - amd:nutrient)
# without amd:nutrient interaction
ICtab(mpl1,mpl2,mpl3,mpl4, type = c("AICc"))
dAICc df
mpl1 0.0 10
mpl4 1.4 9
mpl3 1.5 8
mpl2 55.0 9
* NB:
We do not cover all possible models above, however, the interaction amd: nutrient can only be evaluated if both amd and nutrient (i.e., the main effects) are included in the model.
Common response variable in ecological datasets is the binary variable: we observe a phenomenon X or its “absence”
Mean of distribution
Variance of distribution
Probability
p
of observing an outcome
Variance decreases as p is close to 0 or 1
where
g
(μ) is the link function
This allows us to relax the normality assumption
The output indicates that both water content and topography are significant, but how do we interpret the slope coefficients?
Comparing deviance of your model (residual deviance) to the deviance of a null model (null deviance)

The
null model
is a model without explanatory variables



null.model <- glm(Response.variable ~ 1, family = binomial)
In R, we can extract the residual and null deviances directly from the glm object:



objects(logit.reg)
New statistic -
coefficient of discrimination (D)
evaluates the predictive power of logistic regression
Measure of how well logistic regression classifies an outcome as a success or a failure
In R these tests are available in the 'binomTools' package
pseudoR2 <- (logit.reg$null.deviance –
logit.reg$deviance) / logit.reg$null.deviance
pseudoR2
[1] 0.4655937
Hence, the model explains 46.6% of the variability in the data
fit <- Rsq(object = logit.reg)
fit
*A non significant value indicates an adequate fit!
Deals with overdisp. by adding
observation-level random effects
# 70 moss and mite samples
# 5 environmental measurements and abundance of the mite ”Galumna sp.”
Goal:
Model the abundance ($abund), occurrence ($pa), and proportion ($prop) of Galumna as a function of the 5 environmental variables.
μ = 3.44 + (-0.006 x 300) = 1.63
σ = 1.51
2
2
2
2
2
(your data is ok; it's the model that's wrong)
Full transcript