QCBS R Workshop 4

Linear models
by

CSBQ QCBS

on 19 October 2016


Transcript of QCBS R Workshop 4

Quebec Centre for Biodiversity Science

R Workshop Series

Workshop 4:
Linear models

Website:
http://qcbs.ca/wiki/r/workshop4

Learning objectives
Download the
birdsdiet
dataset:
setwd()
bird <- read.csv("birdsdiet.csv")
Type of explanatory variable
1 variable
1. Simple linear
regression

2 or more
5. Multiple linear
regression

1 with 2
levels

2. T test
3. ANOVA
Continuous
Categorical
Continuous AND categorical
4. ANCOVA
We first want to test if bird maximum abundance is a function of bird mass
Response variable:
Bird abundance
/ num: continuous
Explanatory variable:
Bird mass
/

num: continuous
In R, the function
lm()
is used to fit a linear model
lm1 <- lm(Y ~ X)
Response variable
New object containing the linear model we created
Running a regression in R
Assumptions
Review: Simple linear regression

Yi = α + βXi + εi

Linear relationship between response (Y) and explanatory (X) variable
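The review model above can be sketched end-to-end with simulated data (the variables here are made up for illustration, not taken from the birdsdiet dataset):

```r
# Simulate data that satisfy Yi = alpha + beta*Xi + ei, then recover the parameters
set.seed(42)
X <- runif(50, 0, 10)
Y <- 2 + 0.5 * X + rnorm(50, sd = 1)  # true alpha = 2, true beta = 0.5
m <- lm(Y ~ X)
coef(m)  # estimated intercept (alpha) and slope (beta)
```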
Step 1:
Run your linear model
Step 2:
Verify assumptions
Assumptions are met
Step 3:
Estimate parameters (regression coefficients), test significance, plot your model
Assumptions are not met
Can you transform your variables (and does it make sense to do so)?
Go back to
Step 1
with transformed variables
Yes
No
Try GLM that might better suit your data
Step 1:
Run your linear model
par()
sets the graphical parameters, for example, the
mfrow
argument sets the number of rows in the frame
plot()
is a generic function to plot graphics in R
opar <- par(mfrow=c(2,2))
plot(lm1)
The output will provide the four built-in diagnostic plots of the lm() function
Should show an even scatter of points with no pattern
Example of independence
(what we want!)
Examples of non-independence
(what we don't want)
Solution:
Transform your data or try a distribution other than the Gaussian, i.e. a generalized linear model (GLM): Poisson, binomial, negative binomial, etc.
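As a minimal sketch of the GLM alternative mentioned above (simulated count data, since a GLM wants the raw counts; the Poisson family and log link here are illustrative choices, not the workshop's prescription):

```r
# Simulated count response: a Poisson GLM with log link instead of a Gaussian lm
set.seed(1)
mass  <- runif(60, 1, 100)
abund <- rpois(60, lambda = exp(1 + 0.01 * mass))  # log(E[Y]) is linear in mass
glm1  <- glm(abund ~ mass, family = poisson(link = "log"))
coef(glm1)  # slope is on the log scale; should be near the true 0.01
```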
Diagnostic plot # 1 - Residuals vs Fitted
If points lie along the 1:1 line, the residuals follow a normal distribution
Diagnostic plot # 3 - Normal QQ
Diagnostic plot # 2 - Scale Location
Should show an even scatter of points with no pattern
No pattern in the residuals
Strong pattern in the residuals
Compares the distribution (quantiles) of the residuals of the current model to those of a normal distribution
Diagnostic plot # 4 - Residuals vs Leverage
Leverage vs influence
No influential values
High leverage point and reasonable influence
Here, point 32 has high leverage but its influence is acceptable (inside the 0.5 Cook's distance limits)
High leverage point and high influence
Points are outside the 0.5 limit of Cook's distance
These points have too much influence on the regression
The function
summary()
is used to obtain parameter estimates, significance, etc:
Step 3 :
Estimate parameters and test their significance
** Note: you should never remove outliers unless you have good reasons to do so (e.g. measurement error)
GLM
Group discussion
Challenge 1
T-test
Assumptions
Data follow a normal distribution
Equality of variance between groups

Running a T-test in R
t.test(Y ~ X2, data = data, alternative = "two.sided")
Use the function
t.test()
response variable
factor
(2 levels)
name of dataframe
Alternative hypothesis:
"two.sided" (default), "less", or "greater"
The t-test is still a linear model and a specific case of ANOVA with one factor with 2 levels
lm.t <- lm(Y ~ X2, data = data)
anova(lm.t)
mean of the two groups
Running a t-test in R
1 factor with 3 or more levels and/or many factors

Y = μ + Main effects of factors + Interactions between factors + Residuals

Review: ANOVA
Running a one-way ANOVA in R
Can use the function

lm()
to run an ANOVA
anov1 <- lm(logMaxAbund ~ Diet, data=bird)
anova(anov1)
Or use a specific function called

aov()
aov1 <- aov(logMaxAbund ~ Diet, data=bird)
summary(aov1)
Did we violate the model assumptions?
Bartlett test

- equality of variance among groups
bartlett.test(logMaxAbund ~ Diet, data=bird)
Shapiro-Wilk test

- Normality of residuals
shapiro.test(resid(anov1))
What if the assumptions were not met?
Transform the data
- might normalize the residuals and homogenize the variances, and convert a multiplicative effect into an additive one
data$logY <- log10(data$Y)
* re-run your model with the transformed variables and verify the assumptions once again
Kruskal-Wallis Test

- non-parametric equivalent to ANOVA if you cannot
(or do not wish to)
transform the data
kruskal.test(Y~X, data)
* see Workshop 1 for rules on data transformation
$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...
$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...
$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...
$ Mass : num 716 5.3 35.8 119.4 315.5 ...
$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...
$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...
$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...
Running an ANOVA in R
Output of our ANOVA model
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1539 0.1500 7.692 5.66e-10 ***
DietInsectVert -0.6434 0.4975 -1.293 0.2020
DietPlant 0.2410 0.4975 0.484 0.6303
DietPlantInsect 0.4223 0.2180 1.938 0.0585 .
DietVertebrate -0.3070 0.2450 -1.253 0.2161
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6709 on 49 degrees of freedom
Multiple R-squared: 0.188, Adjusted R-squared: 0.1217
F-statistic: 2.836 on 4 and 49 DF, p-value: 0.0341
Factor levels in alphabetical order and all levels are compared to the reference level ("Insect")
Significant difference between Diet groups, but we don't know which ones!
A posteriori test
TukeyHSD(aov(anov1),ordered=T)
$Diet
diff lwr upr p adj
Vertebrate-InsectVert 0.3364295 -1.11457613 1.787435 0.9645742
Insect-InsectVert 0.6434334 -0.76550517 2.052372 0.6965047
Plant-InsectVert 0.8844338 -1.01537856 2.784246 0.6812494
PlantInsect-InsectVert 1.0657336 -0.35030287 2.481770 0.2235587
Insect-Vertebrate 0.3070039 -0.38670951 1.000717 0.7204249
Plant-Vertebrate 0.5480043 -0.90300137 1.999010 0.8211024
PlantInsect-Vertebrate 0.7293041 0.02128588 1.437322 0.0405485
Plant-Insect 0.2410004 -1.16793813 1.649939 0.9884504
PlantInsect-Insect 0.4223003 -0.19493574 1.039536 0.3117612
PlantInsect-Plant 0.1812999 -1.23473664 1.597336 0.9961844

Only "Vertebrate" and "PlantInsect" differ
Two-way ANOVA
anova.with.lm <- lm(Y~X*Z, data)
The "*" symbol indicates that the main effects as well as their interaction will be included in the model
Analysis of Variance Table

Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
X 4 5.1059 1.27647 3.0378 0.02669 *
Z 1 0.3183 0.31834 0.7576 0.38870
X:Z 3 2.8250 0.94167 2.2410 0.10689
Residuals 45 18.9087 0.42019
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Example of a non-significant interaction
anova.with.lm <- lm(Y~X+Z, data)
If you use the "+" symbol, the main effects but not their interaction are included
According to the law of parsimony, select the model that explains the most variance with the fewest parameters


Thus, remove the interaction term if non-significant and re-evaluate
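Whether the interaction earns its keep can be checked with an F-test on the two nested models; a sketch with simulated data (Y, X and Z are placeholders, as in the slides):

```r
# Nested-model comparison: full model (with interaction) vs main effects only
set.seed(7)
Z <- factor(rep(c("a", "b"), each = 30))
X <- rnorm(60)
Y <- 1 + 2 * X + (Z == "b") * 0.5 + rnorm(60)  # no true interaction built in
full <- lm(Y ~ X * Z)  # main effects + interaction
main <- lm(Y ~ X + Z)  # main effects only
anova(main, full)      # a non-significant F suggests keeping the simpler model
```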

You will see in the assumptions that there should be NO interaction between covariates and factors
ancova.exemple <- lm(Y~X*Z, data=data)
summary(ancova.exemple)
2- Test whether the interaction is significant
ancova.exemple2 <- lm(Y~X+Z, data=data)
No interaction
1- Run an ANCOVA to test the effect of Diet and Mass (as well as their interaction) on Maximum abundance
If your covariate and factor are significant, you will have a scenario that looks like this:
summary(ancova.exemple2)
If you want to compare the mean values of each factor, conditional on the effect of the other, the adjusted means should be visualized

The
effect()
function uses the output of the ANCOVA model to estimate the means of each factor level, corrected by the effect of the covariate
ancova.exemple <- lm(Y ~ X*Z, data=data)
# X = quantitative; Z = categorical

library(effects)
adj.means.ex <- effect('Z', ancova.exemple)
plot(adj.means.ex)
If only your factor is significant, remove the covariate -> you will have a simple
ANOVA
If the interaction between your covariate and factor (*) is significant, you should explore which levels of the factor have different slopes from the others
If only your covariate is significant, remove your factor -> you will have a
simple linear regression
Verify assumptions!
Very similar to what we have done so far!
Yi = α + β1X1i + β2X2i + β3X3i + ... + βkXki + εi
Verify collinearity of all explanatory variables

Verify assumptions!
lm.mult <- lm(abund ~ clTma + NDVI + grass, data=Dickcissel)
summary(lm.mult)
plot(Dickcissel)
If you see a pattern between any two variables, they may be collinear!
They likely explain the same variability in the response variable, and the effect of one variable will be masked by the other



Keep only
one
of the collinear variables
Try multivariate analyses (see
Ordination Workshop
)
Try a pseudo-orthogonal analysis
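One quick numerical check of collinearity is a correlation matrix of the predictors; sketched here on simulated variables so it runs standalone (on the real data, something like cor(Dickcissel[, c("clTma", "NDVI", "grass")]) would play the same role):

```r
# Pairwise correlations flag collinear predictors before fitting
set.seed(3)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # nearly collinear with x1
x3 <- rnorm(100)                 # independent of both
round(cor(cbind(x1, x2, x3)), 2) # |r| near 1 between x1 and x2
```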

Verify diagnostic plots, as you did for the simple linear regression

par(mfrow=c(2,2))
plot(lm.mult)
Recall the principle of parsimony: explain the most variance using the fewest terms possible

Find the best-fit model
Using the Dickcissel dataset, test the effect of climate (clTma), productivity (NDVI) and grassland cover (grass) on bird abundance (abund)
Step 2:
Verify assumptions using diagnostic plots
lm1 <- lm(bird$MaxAbund ~ bird$Mass)
bird$logMaxAbund <- log10(bird$MaxAbund)
bird$logMass <- log10(bird$Mass)
Can we improve the model if we only use terrestrial birds?
lm3 <- lm(logMaxAbund~logMass, data=bird, subset=!bird$Aquatic)
# removes aquatic birds (= TRUE)

# or equivalently
lm3 <- lm(logMaxAbund~logMass, data=bird, subset=bird$Aquatic == 0)

# Examine the diagnostic plots
par(mfrow=c(2,2))
plot(lm3)
summary(lm3)

# Compare both models
par(mfrow=c(1,2))
plot(logMaxAbund~logMass, data=bird)
plot(logMaxAbund~logMass, data=bird, subset=!bird$Aquatic)
ttest1 <- t.test(logMass~Aquatic, var.equal=TRUE, data=bird)

# Or use lm()

ttest.lm1 <- lm(logMass ~ Aquatic, data=bird)
boxplot(logMaxAbund ~ Diet, data=bird)
par(mfrow=c(1,3))
plot(abund ~ clTma, data=Dickcissel)
plot(abund ~ NDVI, data=Dickcissel)
plot(abund ~ grass, data=Dickcissel)
Stepwise regression
Running a t-test in R
boxplot(logMass ~ Aquatic, data=bird,names=c("Non-Aquatic","Aquatic"))
First, let's visualize the data using the function
boxplot()
Graphical representation
Optional section
1. Interpreting contrasts
2. Unbalanced ANOVA
3. Polynomial regression
4. Variance partitioning
Hypothesis (for a bilateral t-test)
If p < 0.01 (or 0.05), the null hypothesis of no difference between the two groups (H0) can be rejected, with a 1% (or 5%) risk of being wrong in this conclusion.
Running a t-test in R
Next, test the assumption of equality of variance using the
var.test()
function
var.test(logMass~Aquatic,data=bird)
Group discussion
Are aquatic birds heavier than terrestrial birds?
# Unilateral t-test
uni.ttest1 <- t.test(logMass~Aquatic, var.equal=TRUE, data=bird, alternative="less")
uni.ttest1
Running a t-test in R
Two Sample t-test

data: logMass by Aquatic
t = -7.7707, df = 52, p-value = 2.936e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.6669697 -0.9827343
sample estimates:
mean in group 0 mean in group 1
1.583437 2.908289
There exists a difference in mass between the aquatic and terrestrial birds
What did you conclude?
Two Sample t-test

data: logMass by Aquatic
t = -7.7707, df = 52, p-value = 1.468e-10
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -1.039331
sample estimates:
mean in group 0 mean in group 1
1.583437 2.908289
Yes, aquatic birds are heavier than terrestrial birds
Group discussion
sd <- tapply(bird$logMaxAbund,list(bird$Diet),sd)
means <- tapply(bird$logMaxAbund,list(bird$Diet),mean)
n <- length(bird$logMaxAbund)
se <- 1.96*sd/sqrt(n)
bp <- barplot(means)
segments(bp, means - se, bp, means + se, lwd=2)
segments(bp - 0.1, means - se, bp + 0.1, means - se, lwd=2)
segments(bp - 0.1, means + se, bp + 0.1, means + se, lwd=2)
Both tests are non-significant; variances assumed to be equal and residuals assumed to be normally distributed
3. Polynomial regression
Optional
Plot Y ~ X with corresponding best-fit regression line, and a histogram of Y and X to explore their distributions
Let's try normalizing the data with a log10() transformation
Add the log-transformed variables to our dataframe
Can also use the
Shapiro-Wilk
and the
Skewness
tests to see if variables follow a normal distribution:
shapiro.test(bird$MaxAbund)
shapiro.test(bird$Mass)
library(e1071)  # skewness() is not in base R
skewness(bird$MaxAbund)
skewness(bird$Mass)
Shapiro-Wilk normality test
data: bird$MaxAbund
W = 0.5831, p-value = 3.872e-11

Shapiro-Wilk normality test
data: bird$Mass
W = 0.5467, p-value = 1.155e-11
[1] 3.167159
# Skewness of MaxAbund
[1] 3.146062
# Skewness of Mass
In both cases, distributions are significantly different from normal
The positive skewness values also indicate that the distributions are right skewed (long right tails)
Much improved!
plot(lm2)
Challenge 1 - Solution
Call:
lm(formula = abund ~ clDD + clFD + clTmi + clTma + clP + grass,
data = Dickcissel)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.457e+02 3.464e+01 -12.868 < 2e-16 ***
clDD 5.534e-02 8.795e-03 6.292 5.83e-10 ***
clFD 1.090e+00 1.690e-01 6.452 2.19e-10 ***
clTmi -6.717e+00 7.192e-01 -9.339 < 2e-16 ***
clTma 3.172e+00 1.288e+00 2.463 0.014030 *
clP 1.562e-01 4.707e-02 3.318 0.000959 ***
grass 1.066e+01 4.280e+00 2.490 0.013027 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.85 on 639 degrees of freedom
Multiple R-squared: 0.3207, Adjusted R-squared: 0.3144
F-statistic: 50.29 on 6 and 639 DF, p-value: < 2.2e-16
Variables selected by the step() function
(... if time permits)
ttest1$statistic^2
anova(ttest.lm1)$F
# answer: F=60.3845 in both cases
All 3 variables are significant

Model explains ~11% of the variance in Dickcissel abundance (R2-adj = 0.11)
Call:
lm(formula = abund ~ clTma + NDVI + grass, data = Dickcissel)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -83.60813 11.57745 -7.222 1.46e-12 ***
clTma 3.27299 0.40677 8.046 4.14e-15 ***
NDVI 0.13716 0.05486 2.500 0.0127 *
grass 10.41435 4.68962 2.221 0.0267 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.58 on 642 degrees of freedom
Multiple R-squared: 0.117, Adjusted R-squared: 0.1128
F-statistic: 28.35 on 3 and 642 DF, p-value: < 2.2e-16
When the assumption of equal variance is met, t² follows an F distribution
We can reorder factor levels according to group medians using the
tapply()
and
sort()
functions
med <- sort(tapply(bird$logMaxAbund, bird$Diet, median))
boxplot(logMaxAbund ~ factor(Diet, levels=names(med)), data=bird)
Remove the weakest variable

plot(lm1)
summary(lm2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6724 0.2472 6.767 1.17e-08 ***
logMass -0.2361 0.1170 -2.019 0.0487 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6959 on 52 degrees of freedom
Multiple R-squared: 0.07267, Adjusted R-squared: 0.05484
F-statistic: 4.075 on 1 and 52 DF, p-value: 0.04869
lm2$coef
str(summary(lm2))
summary(lm2)$coefficients
summary(lm2)$r.squared
We can also extract specific components of the model, for example:
Multiple linear regression in R
Dickcissel <- read.csv("Dickcissel.csv")
str(Dickcissel)
Examine the structure of the data using the
str()
command:
str(bird)
$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...
$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...
$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...
$ Mass : num 716 5.3 35.8 119.4 315.5 ...
$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...
$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...
$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...
Step 2:
Verify assumptions of
lm1
Transform the data
plot(logMaxAbund ~ logMass, data=bird)
abline(lm2)
hist(log10(bird$MaxAbund))
hist(log10(bird$Mass))
lm2 <- lm(bird$logMaxAbund ~ bird$logMass)
Step 2:
Verify assumptions of model
lm2
plot(bird$MaxAbund ~ bird$Mass)
abline(lm1)
hist(bird$MaxAbund)
hist(bird$Mass)
* The robustness of this test increases with sample size and is higher when groups have equal sizes
Thus the function
lm()
can also be used
Running an ANOVA in R
summary.lm(aov1)
summary(anov1)
If ANOVA detects a significant effect, a post-hoc test is required to determine which treatment(s) differ from the others. This can be done with the
TukeyHSD()
function

Two-way ANOVA
Challenge 2 - Solution
anov2 <- lm(logMaxAbund ~ Diet*Aquatic, data=bird)
summary(anov2)
anova(anov2)
Analysis of Variance Table

Response: logMaxAbund
Df Sum Sq Mean Sq F value Pr(>F)
Diet 4 5.1059 1.27647 3.0378 0.02669 *
Aquatic 1 0.3183 0.31834 0.7576 0.38870
Diet:Aquatic 3 2.8250 0.94167 2.2410 0.09644 .
Residuals 45 18.9087 0.42019
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANCOVA with one covariate and one factor
Running an ANCOVA in R
Challenge 3
ANCOVA with one covariate and one factor
Explanatory variable
Set working directory
2 x 2 picture on one plot
Assumptions not met - where did we go wrong?
Group discussion
** You can exclude observations using the logical negation operator "!" (e.g. subset=!bird$Aquatic)
F test to compare two variances
data: logMass by Aquatic
F = 1.0725, num df = 38, denom df = 14, p-value = 0.9305
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.3996428 2.3941032
sample estimates:
ratio of variances
1.072452
Indicates that homogeneity of variance was respected
(as we just tested)
uni.ttest1
Challenge 3 - Solution
ancov1 <- lm(logMaxAbund ~ logMass*Diet, data=bird)
anova(ancov1)
Analysis of Variance Table

Response: logMaxAbund
Df Sum Sq Mean Sq F value Pr(>F)
logMass 1 1.9736 1.97357 4.6054 0.03743 *
Diet 4 3.3477 0.83691 1.9530 0.11850
logMass:Diet 4 2.9811 0.74527 1.7391 0.15849
Residuals 44 18.8556 0.42854
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interaction between logMass and Diet is non-significant
ancov2 <- lm(logMaxAbund ~ logMass + Diet, data=bird)
anova(ancov2)
Remove the interaction term and re-evaluate the model (with only the main effects of Diet and logMass)
Challenge 3 - Solution
Analysis of Variance Table

Model 1: logMaxAbund ~ logMass
Model 2: logMaxAbund ~ logMass + Diet
Res.Df RSS Df Sum of Sq F Pr(>F)
1 52 25.184
2 48 21.837 4 3.3477 1.8396 0.1366
'data.frame': 646 obs. of 15 variables:
$ abund : num 5 0.2 0.4 0 0 0 0 0 0 0 ...
$ Present : Factor w/ 2 levels "Absent","Present": 1 1 1 2 2 2 2 2 2 2 ...
$ clDD : num 5543 5750 5395 5920 6152 ...
$ clFD : num 83.5 67.5 79.5 66.7 57.6 59.2 59.5 51.5 47.4 46.3 ...
$ clTmi : num 9 9.6 8.6 11.9 11.6 10.8 10.8 11.6 13.6 13.5 ...
$ clTma : num 32.1 31.4 30.9 31.9 32.4 32.1 32.3 33 33.5 33.4 ...
$ clTmn : num 15.2 15.7 14.8 16.2 16.8 ...
$ clP : num 140 147 148 143 141 ...
$ NDVI : int -56 -44 -36 -49 -42 -49 -48 -50 -64 -58 ...
$ broadleaf: num 0.3866 0.9516 0.9905 0.0506 0.2296 ...
$ conif : num 0.0128 0.0484 0 0.9146 0.7013 ...
$ grass : num 0 0 0 0 0 0 0 0 0 0 ...
$ crop : num 0.2716 0 0 0.0285 0.044 ...
$ urban : num 0.2396 0 0 0 0.0157 ...
$ wetland : num 0 0 0 0 0 0 0 0 0 0 ...
Multiple linear regression in R
Run the multiple linear regression in R with abund as a function of clTma + NDVI + grass
summary(lm.mult)
Stepwise regression
Least squares
- the most widely used estimation method and the default in R
Observed (measured) value of Yi at Xi
Predicted value of Yi at Xi
Mean of all Yi
VE:
Error variance (residuals)
VR:
Regression variance
VT:
Total variance
R2
= VR/VT
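The decomposition above can be verified by hand against summary()'s R² value; a sketch on simulated data:

```r
# Check R2 = VR/VT by hand for a fitted model
set.seed(1)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
m  <- lm(y ~ x)
VT <- sum((y - mean(y))^2)  # VT: total sum of squares
VE <- sum(resid(m)^2)       # VE: error (residual) sum of squares
VR <- VT - VE               # VR: regression sum of squares
VR / VT                     # equals summary(m)$r.squared
```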
Shapiro-Wilk normality test

data: resid(anov1)
W = 0.98, p-value = 0.4982
Bartlett test of homogeneity of variances

data: logMaxAbund by Diet
Bartlett's K-squared = 7.4728, df = 4, p-value = 0.1129
Continuous explanatory variable
Homogeneous and normally distributed error
Independent residuals
No outlier
Heteroscedastic
Nonlinear
Looks for influential values
Leverage points
: observations at extreme or outlying values of the explanatory variable. Because they lack neighboring observations, the regression line passes close to them.
They may OR may not have a high influence on the regression.
High leverage points with high influence can be identified with a
Cook's distance greater than 0.5
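Such points can be flagged numerically with cooks.distance(); a sketch on simulated data with one high-leverage, high-influence point planted deliberately:

```r
# Flag observations past the 0.5 Cook's distance contour
set.seed(2)
x <- c(rnorm(30), 10)                # one extreme-x (high leverage) point...
y <- c(2 * x[1:30] + rnorm(30), 50)  # ...that also falls far off the true line
m  <- lm(y ~ x)
cd <- cooks.distance(m)
which(cd > 0.5)    # index of the influential observation (here, point 31)
plot(m, which = 5) # the Residuals vs Leverage diagnostic with Cook's contours
```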
No leverage
Low influence
High leverage
No influence
High leverage
High influence
Can you write down the equation of the regression line for your model lm2?

Are the parameters significant?

What proportion of variance is explained by model lm2 ?
R2-adj changed from 0.05 to 0.25 when we dropped aquatic birds:
Examine the relationship between log(MaxAbund) and log(Mass) for passerine birds
Save the model object as lm4
HINT: Passerine is also coded 0 and 1 (look at str(bird))
Compare the variance explained by lm2, lm3 and lm4
# Run the model
lm4 <- lm(logMaxAbund ~ logMass, data=bird, subset=bird$Passerine == 1)

# Examine the diagnostic plots
par(mfrow=c(2,2))
plot(lm4)
summary(lm4)

# Compare variance explained by lm2, lm3 and lm4
str(summary(lm4))
# Recall: we want adj.r.squared
summary(lm4)$adj.r.squared
# R2-adj = -0.02
summary(lm2)$adj.r.squared
# R2-adj = 0.05
summary(lm3)$adj.r.squared
# R2-adj = 0.25

The best model among the three models is lm3
(only terrestrial birds)

Response variable:
Continuous

Explanatory variable:
Categorical with
2 levels
H0:
The mean of group 1 is equal to the mean of group 2
H1:
The mean of group 1 is not equal to the mean of group 2
If assumptions cannot be respected,

the
non-parametric
equivalent of t-test is the
Mann-Whitney test
If two groups
are not independent
(e.g. measurements on the same individual at 2 different years), you should use a
paired t-test
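Both alternatives above have base-R implementations; a sketch on simulated groups (with the birdsdiet data, a formula call like wilcox.test(logMass ~ Aquatic, data = bird) would mirror the workshop's t-test):

```r
# Non-parametric alternative: Mann-Whitney / Wilcoxon rank-sum test
set.seed(5)
g1 <- rnorm(20, mean = 0)
g2 <- rnorm(20, mean = 2)
wilcox.test(g1, g2)  # tests for a location shift between the two groups

# Paired version, for measurements that are not independent
before <- rnorm(15)
after  <- before + rnorm(15, mean = 1, sd = 0.5)
t.test(before, after, paired = TRUE)
```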
Violation of Assumptions
If variances between groups are not equal, can use corrections like the
Welch correction
(DEFAULT in R!)
Are aquatic birds heavier than non-aquatic birds?
Response variable:
Bird mass
/

num: Continuous
Explanatory variable:
Aquatic
/

2 levels: 1 or 0 (yes or no)
The ratio of variances is not statistically different from 1, therefore variances are equal
We may now proceed with the t-test!
You can verify that
t.test()
and

lm()

provide the same model:
Analysis of Variance (ANOVA)
Partitions the variation in the response variable into additive effects of one or several factors and the interactions between them
Y = μ + Main effects of factors + Interactions between factors + Residuals

The constant μ is the mean of the response variable
Generalization of the t-test to >2 groups and/or ≥ 2 factor levels
Types of ANOVA
One-Way ANOVA
One factor with >2 levels
Two-Way ANOVA
2 factors or more
Each factor can have multiple levels
Need to test for
interactions
among factors
Note that ANOVA can be used for repeated measures, but we won't cover this today. Mixed-effect linear models can also be used for this kind of data (see Workshop 5!)
Repeated measures?
Just like the t-test, the robustness of ANOVA increases with sample size and if groups have equal sizes
For each factor and each interaction term:
H0: The mean of the response variable is equal between groups
H1: At least two groups have a different mean
Compares variation within and between groups in order to determine if group means differ
When an interaction term is significant: the effect of a given factor varies according to the levels of another factor
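One quick way to see such an interaction is base R's interaction.plot(), sketched here with made-up factors (not the birdsdiet data): roughly parallel profiles suggest no interaction, crossing or diverging profiles suggest one.

```r
# Build a small two-factor dataset with an interaction planted in one cell
set.seed(4)
d <- expand.grid(A = c("low", "high"), B = c("ctl", "trt"), rep = 1:10)
d$y <- rnorm(nrow(d)) + ifelse(d$A == "high" & d$B == "trt", 2, 0)
with(d, interaction.plot(A, B, y))  # profiles diverge for B = "trt"
```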
Review: ANOVA
Assumptions
Normality of residuals
Equality of variance between groups
Complementary test
When ANOVA detects a significant difference between groups, it does not tell you which group differs from which other group
A commonly used post-hoc test to answer this question is the
Tukey's test
Response variable:
MaxAbund
/ num: continuous
First, visualize the data using
boxplot()
(alphabetically by default)
Is abundance a function of diet?
The ANOVA results can be graphically illustrated using the
barplot()
function
Challenge 2
Evaluate how log(MaxAbund) varies with Diet and Aquatic
Hint: Test the factors Diet, Aquatic and their interaction in a two-way ANOVA model

e.g. lm(Y ~ A*B)
where A is your first factor, B is your second and "*" is the command for interaction in R
Analysis of Covariance (ANCOVA)
Partitions the observed variation:
Y = μ + Main effects of factors + Main effects of covariates + Interactions between factors + Interactions between covariates and factors + Residuals

Combination of ANOVA and linear regression
Explanatory variables are a mix of continuous variable(s) (covariates) and categorical variable(s) (factors)
Assumptions
Variables are

fixed *
Residuals are normally distributed
Independence between residuals and fitted values
Equal variance among factor levels
Same range for all covariates
No interaction between factors and covariate(s)
No outlier
If not respected; try transforming your data
* A
fixed
variable is one that you are specifically interested in (bird mass) whereas a
random
variable is noise that you want to control for (e.g. the site a bird was sampled in).

See
Linear mixed-effect model (LMM) Workshop 5
next semester for more on this!
Types of ANCOVA
Frequently used ANCOVA
You can have any number of factors and/or covariates, but as their number increases, the interpretation of results gets more complex
1-
One covariate and one factor
2- One covariate and two factors
3- Two covariates and one factor
We will only see the first case today, but the two others are pretty similar!
ANCOVA with one covariate and one factor
1) Effect of factor and covariate on the response variable

2) Effect of factor on the response variable after removing effect
of covariate

3) Effect of covariate on response variable while controlling for the effect of factor
Different possible goals of this analysis:
You can only meet these goals if you have no interaction between your factor and covariate!
One level of the factor has a different slope
Many levels have different slopes
If the interaction is significant, outputs will look like these:
Is Max Abund a function of Diet and Mass?
Response variable:
MaxAbund
/

num : quantitative continuous
Explanatory variables:
Diet
/

factor with 5 levels
Mass
/ numeric continuous
Multiple linear regression
Explanatory variable
: 2 or more continuous variables
Response variable
: 1 continuous variable
Generalization of simple linear regression
Assumptions
Each of the explanatory variables must follow a normal distribution
Linear relationship between each explanatory variable and response variable
Explanatory variables are orthogonal (NO collinearity)
Residuals follow a normal distribution
Independence of residuals versus predicted values
Independence of the variance of residuals versus predicted values
No outliers

*same as simple linear regression!

Possible solutions:
Collinearity:
However, the response variable (abund) is not linearly related to the explanatory variables.
* See
Advanced section
on
polynomial regression
for solution!
Run a model with all variables except the "Present" (presence/absence) variable

Use
step()
to iteratively select the best model,
i.e. model with lowest Akaike's Information Criterion (AIC)
lm.full <- lm(abund ~ . - Present, data=Dickcissel)
summary(lm.full)
lm.step <- step(lm.full)
summary(lm.step)
?step
Interpreting contrasts
DEFAULT contrast compares each level of a factor to baseline level
The intercept is the baseline group and corresponds to the mean of the first (alphabetically) Diet level (Insect)
tapply(bird$logMaxAbund, bird$Diet, mean)
coef(anov1)
coef(anov1)[1] + coef(anov1)[2] # InsectVert
coef(anov1)[1] + coef(anov1)[3] # Plant
What did you notice?
Interpreting contrasts
bird$Diet2 <- relevel(bird$Diet, ref="Plant")
anov2 <- lm(logMaxAbund ~ Diet2, data=bird)
summary(anov2)
anova(anov2)
We may want to relevel the baseline:
Did you see change in the significance of each Diet level?
Interpreting contrasts
Note, the DEFAULT contrast ("contr.treatment") is NOT orthogonal
To be orthogonal,
1) coefficients must sum to 0
2) any two contrast columns must sum to 0

sum(contrasts(bird$Diet)[,1])
sum(contrasts(bird$Diet)[,1]*contrasts(bird$Diet)[,2])
table(bird$Aquatic)
Unbalanced ANOVA
The
birdsdiet
data is actually unbalanced (number of Aquatic and non-Aquatic is not equal)
As we noticed in the section on
multiple linear regression
, MaxAbund was non-linearly related to some variables.

To test for non-linear relationships, polynomial models of different degrees are compared.
Polynomial regression
A polynomial model looks like this:

y = β0 + β1x + β2x²

this polynomial has 3 terms
Polynomial regression
For a polynomial with one variable (x), the degree is the largest exponent of that variable; e.g. an x⁴ term makes the degree 4
Polynomial regression
When you know a degree, you can also give it a name:
lm.linear <- lm(abund ~ clDD, data=Dickcissel)

lm.quad <- lm(abund ~ clDD + I(clDD^2), data=Dickcissel)

lm.cubic <- lm(abund ~ clDD + I(clDD^2) + I(clDD^3), data=Dickcissel)
Using the Dickcissel dataset, test the non-linear relationship between max abundance and temperature by comparing three nested polynomial models (of degrees 1, 2, and 3):
Polynomial regression
Polynomial regression
Compare the polynomial models and determine which nested model we should keep

Run a summary of this model, report the regression formula, p-values and R2-adj
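One way to carry out that comparison is sequential F-tests with anova(), assuming the three models from the previous slide (lm.linear, lm.quad, lm.cubic) have been fitted; sketched here on simulated data so it runs standalone:

```r
# Sequential F-tests between nested polynomials of degree 1, 2 and 3
set.seed(6)
clDD  <- runif(100, 0, 10)
abund <- 1 + 0.5 * clDD - 0.05 * clDD^2 + rnorm(100)  # a truly quadratic response
lm.linear <- lm(abund ~ clDD)
lm.quad   <- lm(abund ~ clDD + I(clDD^2))
lm.cubic  <- lm(abund ~ clDD + I(clDD^2) + I(clDD^3))
anova(lm.linear, lm.quad, lm.cubic)  # keep the last model whose added term is significant
summary(lm.quad)$adj.r.squared       # then report its fit
```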

Variance Partitioning
Some of the selected explanatory variables in the
multiple linear regression
section were highly correlated

Collinearity between explanatory variables can be assessed using the variance inflation factor -
vif()
function of package ‘HH’
Variables with VIF > 5 are considered collinear
vif(clDD ~ clFD + clTmi + clTma + clP + grass, data=Dickcissel)
Variance Partitioning
Use
varpart()
to partition the variance in max abundance with all land cover variables in one set and all climate variables in the other set (leaving out NDVI)
Note
: Collinear variables do not have to be removed prior to partitioning
library(vegan)
part.lm = varpart(Dickcissel$abund,
Dickcissel[ ,c("clDD","clFD","clTmi","clTma","clP")],
Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban","wetland")])
part.lm

showvarparts(2)
plot(part.lm,digits=2,bg=rgb(48,225,210,80,maxColorValue=225),col="turquoise4")
Variance Partitioning
Proportion of variance explained by climate alone is 28.5% (given by X1|X2), by land cover alone is ~0% (X2|X1), and by both combined is 2.4%
# Climate set
out.1 = rda(Dickcissel$abund, Dickcissel[ ,c("clDD",
"clFD","clTmi","clTma","clP")], Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban","wetland")])
anova(out.1, step=1000, perm.max=1000)

# Land cover set
out.2 = rda(Dickcissel$abund,
Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban", "wetland")], Dickcissel[ ,c("clDD","clFD","clTmi", "clTma","clP")])
anova(out.2, step=1000, perm.max=1000)
Variance Partitioning
Add Intercept + coefficient estimates of each Diet level
1) Compare the Plant diet to all other diet levels
bird$Diet2 <- factor(bird$Diet, levels=names(med))
anov2 <- lm(logMaxAbund ~ Diet2, data=bird)
summary(anov2)
anova(anov2)
2) Reorder multiple levels according to median
options(contrasts=c("contr.helmert", "contr.poly"))
sum(contrasts(bird$Diet)[,1])
sum(contrasts(bird$Diet)[,1]*contrasts(bird$Diet)[,2])
anov3 <- lm(logMaxAbund ~ Diet, data=bird)
summary(anov3)
Change the contrast to make levels orthogonal (e.g. Helmert contrast will contrast the second level with the first, the third with the average of the first two, and so on)
This means that the order of the covariates changes the Sums of Squares values
unb.anov1 <- lm(logMaxAbund ~ Aquatic + Diet, data=bird)
unb.anov2 <- lm(logMaxAbund ~ Diet + Aquatic, data=bird)
anova(unb.anov1)
anova(unb.anov2)
library(car)
Anova(unb.anov1,type="III")
Anova(unb.anov2,type="III")

Now try type III ANOVA using the
Anova()
function. What do you notice when using Anova()?
Test significance of each fraction
Conclusion
: the land cover fraction is non-significant once climate data is accounted for

Review: Simple linear regression
Running a regression in R
Diagnostic plot # 1 - Residuals vs Fitted
The function
abline()
adds the best-fit line
The function
hist()
produces a histogram of the variable
Assumptions not met - where did we go wrong?
Step 1:

Re-run the analysis with log-transformed variables
Hypothesis
Explanatory variable:
Diet
/ factor with 5 levels
Try it and compare the outputs!
Review: ANCOVA
H0:
The mean of group 1 is greater than or equal to the mean of group 2
H1:
The mean of group 1 is smaller than the mean of group 2
Hypothesis (for a unilateral t-test)
The model now accounts for 31.44% of the Dickcissel abundance variability
Running a one-way ANOVA in R
Compare outputs:
anov1 <- lm(logMaxAbund ~ Diet, data=bird)
anova(anov1)
aov1 <- aov(logMaxAbund ~ Diet, data=bird)
summary(aov1)
Df Sum Sq Mean Sq F value Pr(>F)
Diet 4 5.106 1.276 2.836 0.0341 *
Residuals 49 22.052 0.450
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1