Introducing 

Prezi AI.

Your new presentation assistant.

Refine, enhance, and tailor your content, source relevant images, and edit visuals quicker than ever before.

Loading…
Transcript

GLM with count data

Why be normal?

(your data is ok; it's the model that's wrong)

The Poisson distribution

Modeling count data

Poisson GLM behind the scenes

The Poisson distribution specifies the probability of a discrete random variable Y and is given by:

How to model count data?

What is count data?

A Poisson GLM will model the value of µ as a function of different explanatory variables

Step 2. We specify the systematic part of the model just as in a linear model

Does elevation influence the abundance of F. occidentalis?

Import the 'faramea.csv' into R.

Three steps

faramea <- read.csv(‘faramea.csv’, header = TRUE)

One intercept

Slope of 'Elevation'

Step 1. We assume Yi follows a Poisson distribution with mean and variance µi

Step 3. The link between the mean of Yi and the systematic part is a logarithmic

Exploring relationships

Testing linearity

Exploring relationships

Limitations of linear (mixed) models

Exploring relationships

The number of trees of the species Faramea occidentalis was assessed in 43 quadrats in Barro Colorado Island (Panama). For each quadrat, environmental characteristics were also recorded such as elevation or precipitation.

Properties

Load dataset and fit a LM:

Fit linear models to test whether abund, pa, and/or prop varies as a function of water content.

A negative relationship between Galumna and water content?

Can we see any relationship(s) between Galumna and the 5 environmental variables?

  • µ is the parameter of the Poisson distribution
  • specifies the probably only for integer values
  • probability for negative values is null (P(Y<0) = 0)
  • mean = variance (allows for heterogeneity)

Or

Count data are characterized by:

setwd("~/Desktop")

mites <- read.csv('mites.csv')

head(mites)

str(mites)

Let's investigate the histogram of the number of Faramea occidentalis

plot(mites)

the Poisson distribution seems to be the perfect choice to model this data, hence Poisson GLMs are usually the good way to start modeling count data.

  • positive values: you do not count -7 individuals
  • integer values: you do not count 7.56 individuals
  • exhibits larger variance for large values

If µ is known we can compute the probability for different values of y

µi corresponds to the expected number of individuals

The dataset that you just loaded is a subset of the 'Oribatid mite dataset':

par(mfrow=c(1,3))

plot(Galumna ~ WatrCont, data = mites, xlab = 'Water content', ylab='Abundance')

boxplot(WatrCont ~ pa, data = mites, xlab='Presence/Absence', ylab = 'Water content')

plot(prop ~ WatrCont, data = mites, xlab = 'Water content', ylab='Proportion')

par(mfrow=c(1,1))

# 70 moss and mite samples

# 5 environmental measurements and abundance of the mite ”Galumna sp.”

Fitting a Poisson GLM in R

Goal: Model the abundance ($abund), occurrence ($pa), and proportion ($prop) of Galumna as a function of the 5 environmental variables.

The function glm() allows you to specify a Poisson GLM

glm.poisson = glm(Faramea.occidentalis~Elevation, data=faramea, family=poisson)

Model summary

Specify the distribution and link function

The deviance

Parameter estimates

As with lm() you can access the outputs of the model using the function summary()

Remember that to estimate the unknown parameters maximum likelihood estimation is used

In our model the unknown parameters are the intercept (alpha) and the slope of elevation (Beta)

Model assumptions

Testing linearity

summary(glm.poisson)

Model assumptions

The residual deviance is defined as twice the difference between the log likelihood of a model that provides a perfect fit and the log likelihood of our model

Even worse for other models:

Fit linear models to test whether abund, pa, and/or prop varies as a function of water content.

Call:

glm(formula = Faramea.occidentalis ~ Elevation, family = poisson,

data = faramea)

Deviance Residuals:

Min 1Q Median 3Q Max

-3.3319 -2.7509 -1.5451 0.1139 11.3995

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.7687001 0.1099136 16.092 < 2e-16 ***

Elevation -0.0027375 0.0006436 -4.253 2.11e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 414.81 on 42 degrees of freedom

Residual deviance: 388.12 on 41 degrees of freedom

AIC: 462.01

Significant relationship in all models!

Wait a minute...

1

i

0

i

Res dev = 2log(L(y;y)) - 2log(L(y;µ))

Common in Ecology that assumptions of homogeneity of variance and normality are not met.

  • Main reason why we need GLMs!

Let's revisit the assumptions of lm...

Equation of lm:

y = β + β x + ε

where:

  • yi = predicted value of response variable
  • β0 = intercept
  • β1 = slope
  • xi = explanatory variable
  • εi = model residuals drawn from a normal distribution with a varying mean but a constant variance**

Quebec Centre for Biodiversity Science

R Workshop Series

# Proportion

plot(prop ~ WatrCont, data = mites)

abline(lm.prop)

plot(lm.prop)

# Presence/Absence

plot(pa ~ WatrCont, data = mites)

abline(lm.pa)

plot(lm.pa)

In a Poisson GLM, the residual deviance should equal the residual degrees of freedoms

lm.abund <- lm(Galumna ~ WatrCont, data = mites)

summary(lm.abund)

lm.pa <- lm(pa ~ WatrCont, data = mites)

summary(lm.pa)

lm.prop <- lm(prop ~ WatrCont, data = mites)

summary(lm.prop)

388.12 >> 41

** Key point! Residuals (the distance between each observation and the regression line) can be predicted by drawing random values from a normal distribution.

Model prediction

Biological data & distributions

Normally distributed residuals

Fitting a quasi-Poisson GLM in R

Overdispersion

Quasi-Poisson GLM

Dispersion parameter

Solution

2

Create a new GLM using the 'quasipoisson' family or update the previous one

We need regression coefficients (betas) and sigma:

Recall: Normal distributions have two parameters, μ (mean) and σ (variance):

2

When the residual deviance is higher than the residual degrees of freedom we say that the model is overdispersed

At x = 300, residuals should follow a normal distribution with μ = 1.63 and σ = 1.51.

At x = 400, we get μ = 1.02 and σ = 1.51, etc.

Graphically, this is our model:

The variance of the distribution will account for overdispersion:

coef(lm.abund)

summary(lm.abund)$sigma

glm.quasipoisson = glm(Faramea.occidentalis~Elevation, data=faramea, family=quasipoisson)

2

Negative binomial

Residual deviance

glm.quasipoisson = update(glm.poisson,family=quasipoisson)

Residual degrees of freedom

Statisticians have described a multitude of distributions that correspond to different types of data

A distribution provides the probability of observing each possible outcome of an experiment or survey (e.g. abund = 8 Galumna)

Distributions can be discrete (only includes integers

or continuous (includes fractions)

All distributions have parameters that dictate the shape of the distribution (e.g. μ and σ for the normal)

summary(glm.quasipoisson)

2

>15-20

Another way to write the lm equation is:

yi ~ N(μ = β0 + β1xi, σ ),

Which literally means that yi is drawn from a normal distribution with parameters μ (which depends on xi) and σ (which has the same value for all Ys)

Lets predict Galumna abund as a function of water content using the lm we fitted earlier...

Specify the new family

# Poisson GLM

glm.p = glm(Galumna~WatrCont+SubsDens, data=mites, family=poisson)

# quasi-Poisson GLM

glm.qp = update(glm.p,family=quasipoisson)

# model selection

drop1(glm.qp, test = "Chi")

# or

glm.qp2 = glm(Galumna~WatrCont, data=mites, family=quasipoisson)

anova(glm.qp2, glm.qp, test="Chisq")

the systematic part and the link function remain the same

What are the parameters of the normal distribution used to model y when water content = 300?

Occurs when the variance in the data is even higher than the mean, hence the Poisson distribution is not the best choice (many zeros, many very high values, missing covariates, etc)

Same estimates but

The standard errors of the parameters are multiplied by

10

2

yi ~ N(μ = β0 + β1xi, σ )

Try also deviance analysis to test for the effect of Elevation

Quasi-Poisson GLM

Solutions

null.model <- glm(Faramea.occidentalis~1, data=faramea, family=quasipoisson)

anova(null.model, glm.quasipoisson, test="Chisq")

= 4!

is the dispersion parameter. It will be estimated prior to estimate parameters. Correcting for overdispersion will not affect parameter estimates but will affect their significance. Indeed, the standard errors of the parameters are multiplied by

Model 1: Faramea.occidentalis ~ 1

Model 2: Faramea.occidentalis ~ Elevation

Resid. Df Resid. Dev Df Deviance Pr(>Chi)

1 42 414.81

2 41 388.12 1 26.686 0.1961

N.S.

Call:

glm(formula = Faramea.occidentalis ~ Elevation, family = quasipoisson,

data = faramea)

Deviance Residuals:

Min 1Q Median 3Q Max

-3.3319 -2.7509 -1.5451 0.1139 11.3995

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.768700 0.439233 4.027 0.000238 ***

Elevation -0.002738 0.002572 -1.064 0.293391

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 15.96936)

Null deviance: 414.81 on 42 degrees of freedom

Residual deviance: 388.12 on 41 degrees of freedom

AIC: NA

Number of Fisher Scoring iterations: 10

μ = 3.44 + (-0.006 x 300) = 1.63

σ = 1.51

5

0.0006436*4 =

0.00257

Some marginally significant p-values may no longer hold!

1. Correct for overdispersion using quasi-Poisson GLM

2. Choose another distribution: the negative binomial

1

Poisson GLM

AIC is not defined

Biological data & distributions

Galumna seems to follow a Poisson distribution with a low value of λ:

Presence-absence takes yet another form:

  • only 0s and 1s
  • Poisson distribution would not be appropriate to model this variable

Binomial distribution:

When there are multiple trials (each with a success/failure), the Bernoulli distribution expands into the binomial

  • Additional parameter, n, for number of trials
  • Predicts the probability of observing a given proportion of successes, p, out of a known total number of trials, n.

Galumna abund follows a discrete distribution (can only take integer values).

A useful distribution to model abundance data is the “Poisson” distribution:

  • a discrete distribution with a single parameter, λ (lambda), which defines both the mean and the variance of the distribution:

hist(mites$Galumna)

mean(mites$Galumna)

hist(mites$pa)

“Bernoulli” distribution

  • Only two possible outcomes in its range: success (1) or failure (0)
  • One parameter, p, the probability of success

We can use the Bernouilli distribution to calculate the probability Galumna present (1) vs. absent (0)

Fitting a negative binomial in R

Negative binomial GLM

Plotting the final GLM model

Challenge 3

Tips

Step 1

Negative binomial GLMs are favor when overdispersion is high

Plot the data and the use estimates of the parameters to draw model line

NB is not in the glm() function so you need to install and charge the MASS library

Model selection for GLMs

  • It has two parameters µ and k. k controls for the dispersion parameter (smaller k indicates higher dispersion)

Use the mites dataset! Model the abundance of the species Galumna as a function of the substrate characteristics (water content [WatrCont] and density [SubsDens])

install.packages("MASS")

  • Drop each term in turn and compare the full model with a nested model using the command:

Use summary() to get the parameters

glm.negbin = glm.nb(Faramea.occidentalis~Elevation, data=faramea)

  • It corresponds to a combination of two distributions (Poisson and gamma)

summary(glm.negbin)$coefficients[1,1]

summary(glm.negbin)$coefficients[2,1]

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.369226 0.473841 5.00 5.73e-07 ***

Elevation -0.007038 0.002496 -2.82 0.00481 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.2593) family taken to be 1)

Null deviance: 41.974 on 42 degrees of freedom

Residual deviance: 36.343 on 41 degrees of freedom

AIC: 182.51

Number of Fisher Scoring iterations: 1

Theta: 0.2593

Std. Err.: 0.0755

2 x log-likelihood: -176.5090

drop1(MyGLM, test = "Chi")

  • Do you need to account for overdispersion?
  • Which covariates have a significant effect?
  • Select the best model!

Step 2

summary(glm.negbin)

Use the standard errors to build the confidence envelope

  • It assumes that the Yi are Poisson distributed with the mean μ assumed to follow a gamma distribution!

Biological data & distributions

  • Specify manually a nested model, call it for example MyGLM2, and use the command:

Biological data & distributions

mites <- read.csv("mites.csv", header = TRUE)

This is almost a Poisson GLM, which looks like this:

summary(glm.negbin)$coefficients[1,2]

summary(glm.negbin)$coefficients[2,2]

k

anova(MyGLM, MyGLM2, test = "Chi")

The effect of elevation on the number of F. occidentalis is significant using the NB GLM but the relationship is weak.

Upper limit

Lower limit

Binomial distribution:

Used to model data where the number of successes are integers and where the number of trials, n, is known.

Main difference with Poisson distribution:

  • The binomial has an upper limit to its range, corresponding to n. Consequently, it is right-skewed at low p values but left-skewed at high p values

Getting back to our problem... can switch the distribution of error terms (εi) from normal to Poisson:

yi ~ Poisson(λ = β0 + β1xi)

Problems solved!

1. λ varies with x (water content)

  • which means residual variance will also vary with x, which means that we just relaxed the homogeneity of variance assumption!

2. Predicted values will now be integers instead of fractions

3. The model will never predict negative values (Poisson is strictly positive)

vs.

Probabilities (in orange) are now integers, and both the variance and the mean of the distribution decline as λ decreases with increasing water content.

Workshop 7: Generalized linear (mixed) models

GLM with binary data

GLMMs

Probability distribution

Binary variables

Logistic regression

Binary variables

Expected values can be out of the [0,1] range with lm():

In R, binary variables are coded with 1 and 0:

Clearly not normally distributed!

Common response variable in ecological datasets is the binary variable: we observe a phenomenon X or its “absence”

The Bernoulli distribution is well suited for binary response variables

The function!

glm()

  • Presence/Absence of a species

logit.reg <- glm(formula, data, family)

Species is present!

Mean of distribution

  • Presence/Absence of a disease

E(Y) = p

  • Success/Failure to observe behaviour

Site Presence

A 1

B 0

C 1

D 1

E 0

F 1

How to run a GLMM in R

Choosing error distribution

Generalized Linear Mixed Models

Review: Linear Mixed Models

Review: LMMs

Probability p of observing an outcome

Exploring variance

  • Survival/Death of organisms

Import the Arabidopsis dataset 'Banta_TotalFruits.csv' into R.

  • The response variable is count data which suggests to that a Poisson distribution should be used (i.e. the variance is equal to the mean)

Shrinkage estimates

  • To illustrate heterogeneity in variance we will first create boxplots of the response variable versus different environmental factors
  • Let's create new variables that represents every combination of nutrient x clipping x random factor

Var(Y) = p * (1-p)

Variance of distribution

Wish to determine if P/A ~ Environment***

Species is absent!

dat.tf <- read.csv("Banta_TotalFruits.csv")

str(dat.tf)

To move away from traditional linear models, need to specify two things:

1) probability distribution

AND

2) a link function

  • Extension of GLMs to account for additional structure in dataset

  • Follows similar steps introduced in LMM Workshop

1) LMMs incorporate random effects

2) GLMs handle non-normal data (letting errors take on different distribution families - e.g. Poisson or negative binomial)

Variance decreases as p is close to 0 or 1

*** Called a logistic regression or logit model

Review of LMM Workshop:

  • Structure in the dataset or correlation among observations can result in lack of independence among observations sampled from same sites or time points
  • Account for this by including random effect terms

Random effects:

  • Parameter is a sample from the population, i.e. the subjects you happen to be working with
  • Explains the variance of the response variable

Fixed effects:

  • Parameter is reproducible, i.e. would be the same across studies
  • Explain the mean of the response variable

# popu factor with a level for each population

# gen factor with a level for each genotype

# nutrient factor with levels for low (value = 1) or high (value = 8)

# amd factor with levels for no damage or simulated herbivory

# total.fruits integer indicating the number of fruits per plant

  • Random effects are often called shrinkage estimates because they represent a weighted average of the data and the overall fit (fixed effect)

  • The random effect shrinkage toward the overall fit (fixed effect) is more severe if the within-group variability is large compared to the among-group variability

dat.tf <- within(dat.tf,

{

# genotype x nutrient x clipping

gna <- interaction(gen,nutrient,amd)

gna <- reorder(gna, total.fruits, mean)

# population x nutrient x clipping

pna <- interaction(popu,nutrient,amd)

pna <- reorder(pna, total.fruits, mean)

})

The effect of nutrient availability and herbivory (fixed effects) on the fruit production of the mouse-ear cress (Arabidopsis thaliana) was evaluated by measuring 625 plants across 9 different populations, each comprised of 2 to 3 different genotypes (random effects)

  • However, as we will soon see, the variance increases with the mean much more rapidly than expected under the Poisson distribution

Exercise 1

Exercise 1

The Link Function

Other distributions

The Link Function

Build a logistic regression model using the mites data

Build a model of the presence of Galumna sp. as a function of water content and topography

g(μ) = log

μ

For binary data, the link function is called the logit:

is only true for normally distributed data

μ = Xβ

1) Get the odds (μ / 1-μ)

2) log-transform them

Exploring variance

Choosing error distribution

Poisson-lognormal GLMM

For a simple linear model of a normally distributed continuous response variable, the equation for the expected values is:

Overdispersion check

Negative binomial GLMM

1-μ

Poisson GLMM

(Poisson-gamma)

setwd(~/workshop6)

mites <- read.csv("mites.csv")

str(mites)

  • Another option is the Poisson-lognormal distribution.
  • This can be achieved simply by placing an observation-level random effect in the formula.
  • We can check for overdispersion using the overdisp_fun function (Bolker et al. 2011) which divides the Pearson residuals by the residual degrees of freedom and tests whether ratio is greater than unity

μ

# Boxplot of total fruits vs genotype x nutrient x clipping interaction

library(ggplot2)

ggplot(data = dat.tf, aes(factor(x = gna),y = log(total.fruits + 1))) +

geom_boxplot(colour = "skyblue2", outlier.shape = 21,

outlier.colour = "skyblue2") +

theme_bw() + theme(axis.text.x=element_text(angle=90)) +

stat_summary(fun.y=mean, geom="point", colour = "red")

g(μ) = log

logit.reg <- glm(pa ~ WatrCont + Topo, data=mites,

family = binomial(link = "logit"))

summary(logit.reg)

μ = Xβ

  • Logit transformation of the data often used with lm(m) for percentages or proportions when the binomial distribution is not appropriate. When not selections from fixed quantities (e.g. percent cover, school grades, etc).
  • Log-normal distribution in glm, avoids having to log-transform the data.
  • Gamma distribution. Similar to log-normal, more versatile.
  • Tweedie distribution. Versatile family of distributions. Useful for data with a mix of zeros and positive values (not necessarily counts).
  • Zero-inflated Poisson or zero-inflated negative binomial. When the data comprise an excess number of zeros, that arise from a different process than the process that generates the counts.

The odds puts our expected values on a 0 to +Inf scale

If this is not the case, must use a transformation on the expected values μ

  • As we just saw, there is a large amount of heterogeneity among group variances even when the response variable is transformed

  • If we plot the group variances vs group means (example with genotype x nutrient x clipping grouping shown here), we can appreciate that the Poisson family is the least appropriate distribution (i.e. variances increase much faster than the means)
  • Recall that the negative binomial (or Poisson-gamma) distribution meets the assumption that the variance is proportional to the square of the mean

summary(logit.reg)

Call:

glm(formula = pa ~ WatrCont + Topo, family = binomial, data = mites)

Deviance Residuals:

Min 1Q Median 3Q Max

-2.0387 -0.5589 -0.1594 0.4112 2.0252

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 4.464402 1.670622 2.672 0.007533 **

WatrCont -0.015813 0.004535 -3.487 0.000489 ***

TopoHummock 2.090757 0.735348 2.843 0.004466 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 91.246 on 69 degrees of freedom

Residual deviance: 48.762 on 67 degrees of freedom

AIC: 54.762

Number of Fisher Scoring iterations: 6

  • Given the mean-variance relationship, we will most likely need a model with overdispersion
  • but let's start with a Poisson model:
  • To run a GLMM in R we simply need to use the glmer function of the lme4 package

1-μ

# Download the glmm_funs.R code from the wiki page and source it to run the function in R

source("glmm_funs.R")

# Overdispersion?

overdisp_fun(mp1)

where μ is the expected value of the response variable

g(μ) = Xβ

Deals with overdisp. by adding observation-level random effects

The log transformation puts our expected values on a -Inf to +Inf

X is the model matrix (i.e. your data)

str(mites)

'data.frame': 70 obs. of 9 variables:

$ Galumna : int 8 3 1 1 2 1 1 1 2 5 ...

$ pa : int 1 1 1 1 1 1 1 1 1 1 ...

$ totalabund: int 140 268 186 286 199 209 162 126 123 166 ...

$ prop : num 0.05714 0.01119 0.00538 0.0035 0.01005 ...

$ SubsDens : num 39.2 55 46.1 48.2 23.6 ...

$ WatrCont : num 350 435 372 360 204 ...

$ Substrate : Factor w/ 7 levels "Barepeat","Interface",..: 4 3 2 4 4 4 4 2 3 4 ...

$ Shrub : Factor w/ 3 levels "Few","Many","None": 1 1 1 1 1 1 1 2 2 2 ...

$ Topo : Factor w/ 2 levels "Blanket","Hummock": 2 2 2 2 2 2 2 1 1 2 ...

μ = expected values (probability that Y = 1)

mpl1 <- glmer(total.fruits ~ nutrient*amd + rack + status +

(1|X) +

(1|popu)+

(1|gen),

data=dat.tf, family="poisson", control=glmerControl(optimizer="bobyqa"))

overdisp_fun(mpl1)

mnb1 <- glmer.nb(total.fruits ~ nutrient*amd + rack + status +

(1|popu)+

(1|gen),

data=dat.tf, control=glmerControl(optimizer="bobyqa"))

# Control argument specifies the way we optimize the parameter values

library(lme4)

mp1 <- glmer(total.fruits ~ nutrient*amd + rack + status +

(1|popu)+

(1|gen),

data=dat.tf, family="poisson")

β is the vector of estimated parameters

(i.e. the intercept & slope)

where g(μ) is the link function

> overdisp_fun(mp1)

chisq ratio p logp

15755.86833 25.57771 0.00000 -6578.47027

NB = negative binomial

QP = quasi-Poisson

loess = Locally weighted

regression smoothing

Specify the random effects

chisq ratio p logp

1.775361e+02 2.886766e-01 1.000000e+00 -3.755374e-73

The expected values can now be linearly related to the linear predictor

1) Get the odds (μ / 1-μ)

2) log-transform them

# Overdispersion?

overdisp_fun(mnb1)

Specify the family

Ratio is now much closer to 1 although p-value is still less than 0.05

Xβ is called the linear predictor

factor(genotype x nutrient x clipping)

This allows us to relax the normality assumption

Ratio now meets our criterion

  • Ratio is significantly >> 1
  • As expected, we need to try a different distribution where the variance increase more rapidly than the mean

* Similarly, the boxplot of total fruits vs population x nutrient x clipping interaction shows a large amount of heterogeneity among populations.

Interpreting the output

Solution

Interpreting the output

Predictive Power and goodness of fit

Challenge 1

model.bact1 <- glm(y ~ trt * week,

data = bacteria, family = binomial)

Website: http://qcbs.ca/wiki/r/workshop7

On the odds scale for water content:

Let's go back to the summary of our 'logit.reg' model to see the coefficients:

Remember we used a logit transformation on the expected values!

2

exp(logit.reg$coefficient[2])

Get the pseudo-R , the analogue of the R for models fitted by maximum likelihood:

model.bact2 <- glm(y ~ trt + week,

data = bacteria, family = binomial)

Visualizing the random effects

Visualization the model parameters

Model selection

summary(logit.reg)$coefficients

Model selection

To properly interpret the regression parameters, we have to use a 'reverse' function:

  • A graphical representation of the model parameters can be obtained using the coefplot2 funtion:

WatrCont

0.9843118

Using the 'bacteria' dataset, model the presence of H. influenzae as a function of treatment and week of test. Start with a full model and reduce it to the most parsimonious model.

  • We first code potential models and compare them using AICc*:

model.bact3 <- glm(y ~ week, data = bacteria,

family = binomial)

2

  • You can extract the random effect predictions using ranef and plot them using a dotplot
  • Regional variability among populations:
  • Spanish populations (SP) larger values than Swedish (SW) or Dutch (NL)
  • Difference among genotypes largely driven by genotype 34
  • Alternatively, we can use drop1 and dfun functions to evaluate our fixed effects (dfun converts the AIC values returned by the drop1 into ΔAIC values)

pseudo-R =

null deviance - residual deviance

x

Estimate Std. Error z value Pr(>|z|)

(Intercept) 4.46440 1.67062 2.67230 0.00753

WatrCont -0.01581 0.00454 -3.48673 0.00049

TopoHummock 2.09076 0.73535 2.84322 0.00447

library(coefplot2)

# Variance terms

coefplot2(mpl1,ptype="vcov",intercept=TRUE)

# Fixed effects

coefplot2(mpl1,intercept=TRUE)

The natural exponential function to obtain the odds: e

On the probability scale for water content:

null deviance

Comparing deviance of your model (residual deviance) to the deviance of a null model (null deviance)

The null model is a model without explanatory variables

anova(model.bact1, model.bact2, model.bact3,

test = "LRT")

null.model <- glm(Response.variable ~ 1, family = binomial)

mpl2 <- update(mpl1, . ~ . - rack) # model without rack

mpl3 <- update(mpl1, . ~ . - status) # model without status

mpl4 <- update(mpl1, . ~ . - amd:nutrient) # without amd:nutrient interaction

ICtab(mpl1,mpl2,mpl3,mpl4, type = c("AICc"))

  • The same methods can be used with a glmm or lmm to choose between models with various random intercepts and/or random slopes and to choose fixed effects to keep in final model.
  • an information theoretic approach (e.g., AICc - Workshop 5),
  • a frequentist approach (where the significance of each term is evaluated using ''anova()'' and the likelihood ratio test; LRT)

The inverse logit function to obtain the probabilities:

(dd_LRT <- drop1(mpl1,test="Chisq"))

(dd_AIC <- dfun(drop1(mpl1)))

1 / (1 + 1/exp(logit.reg$coefficient[2]))

mpl2 <- update(mpl1, . ~ . - and:nutrient)

# Use AIC

mpl3 <- update(mpl2, . ~ . - rack) # no rack or interaction

mpl4 <- update(mpl2, . ~ . - status) # no status or interaction

mpl5 <- update(mpl2, . ~ . - nutrient) # no nutrient or interaction

mpl6 <- update(mpl2, . ~ . - amd) # no clipping or interaction

ICtab(mpl2,mpl3,mpl4,mpl5,mpl6, type = c("AICc"))

# Or use drop1

(dd_LRT2 <- drop1(mpl2,test="Chisq"))

(dd_AIC2 <- dfun(drop1(mpl2)))

Fixed effect coefficient

Random effect variance

-1

x

logit = 1 / (1 + 1/e )

install.packages("MASS")

library(MASS)

data(bacteria)

str(bacteria)

?bacteria

Analysis of Deviance Table

Model 1: y ~ trt * week

Model 2: y ~ trt + week

Model 3: y ~ trt

Resid. Df Resid. Dev Df Deviance Pr(>Chi)

1 214 203.12

2 216 203.81 -2 -0.6854 0.709836

3 217 210.72 -1 -6.9140 0.008552 **

WatrCont

0.4960469

The output indicates that both water content and topography are significant, but how do we interpret the slope coefficients?

Df dAIC

<none> 0.000

rack 1 55.083

status 2 1.612

nutrient:amd 1 1.444

dAICc df

mpl1 0.0 10

mpl4 1.4 9

mpl3 1.5 8

mpl2 55.0 9

variance explained by the model

Based on these results, we choose model #2 as the best candidate to model the data.

In R, we can extract the residual and null deviances directly from the glm object:

objects(logit.reg)

pseudoR2 <- (logit.reg$null.deviance –

logit.reg$deviance) / logit.reg$null.deviance

pseudoR2

[1] 0.4655937

Hence, the model explains 46.6% of the variability in the data

* NB: We do not cover all possible models above, however, the interaction amd: nutrient can only be evaluated if both amd and nutrient (i.e., the main effects) are included in the model.

  • Both the main effects of nutrient and clipping are strong (large change in AIC of 135.6 (mpl5) and 10.2 (mpl6) if either nutrient or clipping are dropped, respectively).
  • Final model includes the fixed nutrient and clipping effects, rack, and observation-level random e ffects (1 | X) to account for over-dispersion
  • Strong rack effect (dAIC = 55.08 if we remove this variable)
  • Effects of status and interaction term are weak (dAIC < 2)
  • Start by removing the non-significant interaction term to test main effects of nutrient and clipping

Note: the grid.arrange() function from the {gridExtra} library was used to make this graphic (see script for details).

Note: error bars are only shown for the fixed effects because glmer doesn't provide information on uncertainty of variance terms

Exercise 3

Proportion data and GLM

Exercise 2

Solution

Challenge 2

Predictive Power and goodness of fit

1)

null.d <- model.bact2$null.deviance

In R these tests are available in the 'binomTools' package

In R, we have to specify the number of times something happened and the number of times something did not happen:

Sometimes, proportion data is more similar to logistic regression than you think...

resid.d <- model.bact2$deviance

Additional GLMM resources

1)

library(binomTools)

Using the model created with bacteria dataset, assess goodness of fit and predictive power.

bact.pseudoR2 <- (null.d -resid.d) / null.d

If we measure the number of occurrences and we know the total sample size, it is not count data!

New statistic - coefficient of discrimination (D) evaluates the predictive power of logistic regression

  • Measure of how well logistic regression classifies an outcome as a success or a failure

Up for a challenge?

Compute the coefficient of discrimination D

HLtest(Rsq(model.bact2)

> prop.reg <- glm(cbind(Galumna, totalabund - Galumna) ~ Topo + WatrCont, data = mites, family = binomial)

> summary(prop.reg)

We can also code the model directly with proportions :

> prop.reg2 <- glm(prop ~ Topo + WatrCont, data = mites, family = binomial, weights = totalabund)

Books

B. Bolker (2009) Ecological Models and Data in R. Princeton University Press.

A. Zuur et al. (2009) Mixed Effects Models and Extensions in Ecology with R. Springer.

Websites

GLMM for ecologists (http://glmm.wikidot.com)

A great website on GLMM with a Q&A section!

Call:

glm(formula = cbind(Galumna, totalabund - Galumna) ~ WatrCont +

Topo, family = binomial, data = mites)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.4808 -0.9699 -0.6327 -0.1798 4.1688

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -3.288925 0.422109 -7.792 6.61e-15 ***

WatrCont -0.005886 0.001086 -5.420 5.97e-08 ***

TopoHummock 0.578332 0.274928 2.104 0.0354 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 140.702 on 69 degrees of freedom

Residual deviance: 85.905 on 67 degrees of freedom

AIC: 158.66

2)

fit <- Rsq(object = logit.reg)

fit

Think how predictive power could be improved for this model.

Suppose we measure disease prevalence in ten deer populations on 10 deer individuals per population:

2)

To assess goodness of fit, diagnostic plots are not useful, instead must use Hosmer-Lemeshow test:

  • Compare observed and expected number of outcomes
  • Similar to a Chi square test

Adding informative explanatory variables could increase the explanatory power of the model

  • Use the inverts dataset (larval development times (PLD) of 74 marine invertebrate and vertebrate species reared at different temperatures and time), answer the following questions:

  • What is the effect of feeding type and climate (fixed effects) on PLD?
  • Does this relationship vary among taxa (random effects)?
  • What is the best distribution family for this count data?
  • Finally, once you determined the best distribution family, re-evaluate your random and fixed effects.

Perform a Hosmer-Lemeshow test*

x infected deer

}

always bound between 0 an 1!

HLtest(Rsq(logit.reg))

10 deer

*A non significant value indicates an adequate fit!

2

Problems:

  • σ is not homogeneous, yet lm forces a constant σ
  • Predicted values should be integers

2

plot(Galumna ~ WatrCont, data = mites)

abline(lm.abund)

plot(lm.abund)

Learn more about creating dynamic, engaging presentations with Prezi