Learning objectives
Quebec Centre for Biodiversity Science
R Workshop Series
Type of explanatory variable
Continuous AND categorical
Workshop 4:
Linear models
GLM
4. ANCOVA
2. T test
ANCOVA with one covariate and one factor
Challenge 3
Challenge 3 - Solution
Running an ANCOVA in R
Types of ANCOVA
Review: ANCOVA
ANCOVA with one covariate and one factor
Analysis of Covariance (ANCOVA)
If the interaction is significant, outputs will look like these:
If your covariate and factor are significant, you will have a scenario that looks like this:
ancov1 <- lm(logMaxAbund ~ logMass*Diet, data=bird)
anova(ancov1)
Is Max Abund a function of Diet and Mass?
- Remove the interaction term and re-evaluate the model (with only the main effects of Diet and logMass)
- Combination of ANOVA and linear regression
Assumptions
- If you want to compare the mean values of each factor, conditional on the effect of the other, the ajusted means should be visualized
- The effect() function uses the output of the ANCOVA model to estimate the means of each factor level, corrected by the effect of the covariate
- Different possible goals of this analysis:
- If only your factor is significant, remove the covariate -> you will have a simple ANOVA
1- Run an ANCOVA to test the effect of Diet and Mass (as well as their interaction) on Maximum abundance
- You can have any number of factors and/or covariates, but as their number increases, the interpretation of results gets more complex
- Explanatory variables are a mix of continuous variable(s) (covariates) and categorial variable(s) (factors)
- Response variable: MaxAbund / num : quantitative continuous
If not respected; try transforming your data
ancova.exemple <- lm(Y~X*Z, data=data)
- If only your covariate is significant, remove your factor -> you will have a simple linear regression
ancov2 <- lm(logMaxAbund ~ logMass + Diet, data=bird)
1) Effect of factor and covariate on the response variable
2) Effect of factor on the response variable after removing effect
of covariate
3) Effect of covariate on response variable while controlling for the effect of factor
- Subsets the observed variation:
- Explanatory variables: Diet / factor with 5 levels
Mass / numeric continuous
summary(ancova.exemple)
- Variables are fixed *
- Residuals are normally distributed
- Independence btw residuals and fitted values
- Equal variance among factor levels
- Same range for all covariates
- No interaction between factors and covariate(s)
- No outlier
anova(ancov2)
ancova.exemple <- lm(Y ~ X*Z, data=data)
# X = quantitative; Z = categorical
library(effects)
adj.means.ex <- effect('Z', ancova.exemple)
plot(adj.means.ex)
Analysis of Variance Table
Response: logMaxAbund
Df Sum Sq Mean Sq F value Pr(>F)
logMass 1 1.9736 1.97357 4.6054 0.03743 *
Diet 4 3.3477 0.83691 1.9530 0.11850
logMass:Diet 4 2.9811 0.74527 1.7391 0.15849
Residuals 44 18.8556 0.42854
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- If the interaction btw your covariate and factor (*) is significant, you should explore which levels of the factor have different slopes from the others
1- One covariate and one factor
2- One covariate and two factors
3- Two covariates and one factor
2- Test whether the interaction is signifiant
+ Interactions between factors
+ Main effects of covariates
+ Interactions between covariates and factors
- You can only meet these goals if you have no interaction between your factor and covariate!
Analysis of Variance Table
Model 1: logMaxAbund ~ logMass
Model 2: logMaxAbund ~ logMass + Diet
Res.Df RSS Df Sum of Sq F Pr(>F)
1 52 25.184
2 48 21.837 4 3.3477 1.8396 0.1366
ancova.exemple2 <- lm(Y~X+Z, data=data)
Verify assumptions!
$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...
$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...
$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...
$ Mass : num 716 5.3 35.8 119.4 315.5 ...
$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...
$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...
$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...
- We will only see the first case today, but the two others are pretty similar!
* A fixed variable is one that you are specifically interested in (bird mass) whereas a random variable is noise that you want to control for (i.e. site a bird was sampled in).
See Linear mixed-effect model (LMM) Workshop 5 next semester for more on this!
summary(ancova.exemple2)
1. Simple linear
regression
- Very similar to what we have done so far!
Many levels have different slopes
One level of the factor has a different slope
Interaction btw logMass and Diet is non-significant
Running a t-test in R
Running a T test in R
Running a t-test in R
T-test
Group discussion
Violation of Assumptions
Running a T test in R
Running a T-test in R
- Response variable: Continuous
- Next, test the assumption of equality of variance using var.test() function
- If p<0.01 (or 0.05), the null hypothesis of no difference between the two groups (H0) can be rejected, with a risk of 0.01 (or 0.05) that we made a mistake in this conclusion.
- Are aquatic birds heavier than terrestrial birds?
- Explanatory variable: Categorical with 2 levels
- Use the function t.test()
- First, lets visualize the data using the function boxplot()
Are aquatic birds heavier than non-aquatic birds?
Yes, aquatic birds are heavier than terrestrial birds
- If variances between groups are not equal, can use corrections like the Welch correction (DEFAULT in R!)
var.test(logMass~Aquatic,data=bird)
uni.ttest1
ttest1 <- t.test(logMass~Aquatic, var.equal=TRUE, data=bird)
# Or use lm()
ttest.lm1 <- lm(logMass ~ Aquatic, data=bird)
Hypothesis (for a bilateral t-test)
t.test(Y~X2, data= data, alternative = "two.sided")
boxplot(logMass ~ Aquatic, data=bird,names=c("Non-Aquatic","Aquatic"))
- H0: The mean of group 1 is equal to the mean of group 2
- H1: The mean of group 1 is not equal to the mean of group 2
- Response variable: Bird mass / num: Continuous
# Unilateral t-test
uni.ttest1 <- t.test(logMass~Aquatic, var.equal=TRUE, data=bird, alternative="less")
uni.ttest1
- Explanatory variable: Aquatic / 2 levels: 1 or 0 (yes or no)
There exists a difference in mass between the aquatic and terrestrial birds
Indicates that homogeneity of variance was respected (as we just tested)
- If assumptions cannot be respected, the non-parametric equivalent of t-test is the Mann-Whitney test
Alternative hypothesis:
"two.sided" (default), "less", or "greater"
Hypothesis (for a unilateral t-test)
F test to compare two variances
data: logMass by Aquatic
F = 1.0725, num df = 38, denom df = 14, p-value = 0.9305
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.3996428 2.3941032
sample estimates:
ratio of variances
1.072452
Two Sample t-test
data: logMass by Aquatic
t = -7.7707, df = 52, p-value = 1.468e-10
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -1.039331
sample estimates:
mean in group 0 mean in group 1
1.583437 2.908289
Two Sample t-test
data: logMass by Aquatic
t = -7.7707, df = 52, p-value = 2.936e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.6669697 -0.9827343
sample estimates:
mean in group 0 mean in group 1
1.583437 2.908289
- H0: The mean of group 1 is not larger than the mean of group 2
- H1: The mean of group 1 is smaller than the mean of group 2
- You can verify that t.test() and lm() provide the same model:
- The t-test is still a linear model and a specific case of ANOVA with one factor with 2 levels
Assumptions
- If two groups are not independent (e.g. measurements on the same individual at 2 different years), you should use a paired t-test
- Thus the function lm() can also be used
ttest1$statistic^2
anova(ttest.lm1)$F
# answer: F=60.3845 in both cases
- Data follow a normal distribution
- Equality of variance between groups
lm.t <-lm(Y~X2, data = data)
- The ratio of variances is not statistically different from 1, therefore variances are equal
When the assumption of equal variance is met t^2 follows an F distribution
* The robustness of this test increases with sample size and is higher when groups have equal sizes
anova(lm.t)
- We may now proceed with the t-test!
Review: Simple linear regression
Running a regression in R
Linear relationship between response (Y) and explanatory (X) variable
Download the birdsdiet dataset:
- Response variable: Bird abundance / num: continuous
- Least squares - most used method and corresponds to the default function in R
Step 1: Run your linear model
- Explanatory variable: Bird mass / num: continuous
Yi = α + βXi + εi
Step 2: Verify assumptions
In R, the function lm() is used to fit a linear model
setwd()
bird<-read.csv("birdsdiet.csv")
Observed (measured) value of Yi at Xi
Can you transform your variables (and does it make sense to do so)?
Assumptions
$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...
$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...
$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...
$ Mass : num 716 5.3 35.8 119.4 315.5 ...
$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...
$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...
$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...
Visualize the data using the structure str() command:
lm1 <- lm(Y ~ X)
Predicted value of Yi at Xi
str(bird)
- Continuous explanatory variable
- Homogeneous and normally distributed error
- Independent residuals
- No outlier
New object containing the linear model we created
- We first want to test if bird maximum abundance is a function of bird mass
Try GLM that might better suit your data
$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...
$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...
$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...
$ Mass : num 716 5.3 35.8 119.4 315.5 ...
$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...
$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...
$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...
Go back to Step 1 with transformed variables
Step 3: Estimate parameters (regression coefficients), test significance, plot your model
VE: Error variance (residuals)
VR: Regression variance
VT: Total variance
R2 = VR/VT
lm1 <- lm(bird$MaxAbund ~ bird$Mass)
Leverage vs influence
Assumptions not met - where did we go wrong?
Step 2: Verify assumptions of lm1
Assumptions not met - where did we go wrong?
Step 2: Verify the assumptions of model lm2
Step 2: Verify assumptions of model lm2
Transform the data
Diagnostic plot # 1 - Residuals vs Fitted
Diagnostic plot # 2 - Scale Location
Diagnostic plot # 3 - Normal QQ
Diagnostic plot # 4 - Residuals vs Leverage
Running a regression in R
Diagnostic plot # 1 - Residuals vs Fitted
High leverage point and high influence
plot(lm1)
- Plot Y ~ X with corresponding best-fit regression line, and a histogram of Y and X to explore their distributions
- Can also use the Shapiro-Wilk and the Skewness tests to see if variables follow a normal distribuon:
- Lets try normalizing data with a log10() transformation
plot(lm2)
- Points are outside the 0.5 limit of Cook's distance
- These points have too much influence on the regression
- Compares the distribution (quantiles) of the residuals of the current model to those of a normal distribution
- Should show a scatter across and no pattern
Step 2: Verify assumptions using diagnostic plots
Examples of non-independence
(what we don't want)
plot(logMaxAbund ~ logMass, data=bird)
abline(lm2)
hist(log10(bird$MaxAbund))
hist(log10(bird$Mass))
- Add the log-transformed variables to our dataframe
- Looks for influential values
Example of independence (what we want!)
shapiro.test(bird$MaxAbund)
shapiro.test(bird$Mass)
plot(bird$MaxAbund ~ bird$Mass)
abline(lm1)
hist(bird$MaxAbund)
hist(bird$Mass)
bird$logMaxAbund <- log10(bird$MaxAbund)
bird$logMass <- log10(bird$Mass)
- If points lie linearly on the 1:1 line, data follow a normal distribution
In both cases, distributions are significantly different from normal
- Should show a scatter across and no pattern
2 x 2 picture on one plot
opar <- par(mfrow=c(2,2))
plot(lm1)
- The function abline() adds the best-fit line
- The function hist() produces a histogram of the variable
- Leverage points: observations at extreme/ outlying values of the explanatory variable. Because they lack neighboring observations, the regression model passes close to leverage points. They may OR may not have a high influence on the regression.
Shapiro-Wilk normality test
data: bird$MaxAbund
W = 0.5831, p-value = 3.872e-11
Shapiro-Wilk normality test
data: bird$Mass
W = 0.5467, p-value = 1.155e-11
- High leverage
- No influence
- par() sets the graphical parameters, for example, the mfrow argument sets the number of rows in the frame
- plot() is a generic function to plot graphics in R
** Note: you should never remove outliers if you don't have good reasons to do so (ex: error of measurement)
Step 1: Re-run the analysis with log-transformed variables
The positive skewness values also indicate that the distributions are left skewed
High leverage point and reasonable influence
lm2 <- lm(bird$logMaxAbund ~ bird$logMass)
- High leverage points with high influence can be identified with a Cook's distance greater than 0.5
skewness(bird$MaxAbund)
skewness(bird$Mass)
The output will provide the four built-in diagnostic plots of the lm() function
- Here, point 32 has high leverage but its influence is acceptable (inside the 0.5 Cook's distance limits)
- High leverage
- High influence
[1] 3.167159 # Skewness of MaxAbund
[1] 3.146062 # Skewness of Mass
- Solution: Transform your data or try another distribution than linear (Gaussian) (i.e., a generalized linear model (GLM): Poisson, binomial, negative binomial, etc.)
No pattern in the residuals
Strong pattern in the residuals
Group discussion
Step 3 : Estimate parameters and test their significance
Challenge 1 - Solution
- R2-adj changed from 0.05 to 0.25 when we dropped aquatic birds:
Challenge 1
Group discussion
- The function summary() is used to obtain parameter estimates, significance, etc:
- Can we improve the model if we only use terrestrial birds?
- The best model among the three models is lm3 (only terrestrial birds)
summary(lm2)
**You could exclude objects using "=!"
- Can you write down the equation of the regression line for your model lm2?
- Are the parameters significant?
- What proportion of variance is explained by model lm2 ?
- Examine the relationship between log(MaxAbund) and log(Mass) for passerine birds
- Save the model object as lm4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6724 0.2472 6.767 1.17e-08 ***
logMass -0.2361 0.1170 -2.019 0.0487 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6959 on 52 degrees of freedom
Multiple R-squared: 0.07267, Adjusted R-squared: 0.05484
F-statistic: 4.075 on 1 and 52 DF, p-value: 0.04869
HINT: Passerine is also coded 0 and 1 (look at str(bird))
# Run the model
lm4 <- lm(logMaxAbund ~ logMass, data=bird, subset=bird$Passerine == 1)
# Examine the diagnostic plots
par(mfrow=c(2,2))
plot(lm4)
summary(lm4)
# Compare variance explained by lm2, lm3 and lm4
str(summary(lm4)) # Recall: we want adj.r.squared
summary(lm4)$adj.r.squared # R2-adj = -0.02
summary(lm2)$adj.r.squared # R2-adj = 0.05
summary(lm3)$adj.r.squared # R2-adj = 0.25
lm3 <- lm(logMaxAbund~logMass, data=bird, subset=!bird$Aquatic) # removes aquatic birds (= TRUE)
# or equivalently
lm3 <- lm(logMaxAbund~logMass, data=bird, subset=bird$Aquatic == 0)
# Examine the diagnostic plots
par(mfrow=c(2,2))
plot(lm3)
summary(lm3)
# Compare both models
par(mfrow=c(1,2))
plot(logMaxAbund~logMass, data=bird)
plot(logMaxAbund~logMass, data=bird, subset=!bird$Aquatic)
- We can also call out specific parameters of the model, for example:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6724 0.2472 6.767 1.17e-08 ***
logMass -0.2361 0.1170 -2.019 0.0487 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6959 on 52 degrees of freedom
Multiple R-squared: 0.07267, Adjusted R-squared: 0.05484
F-statistic: 4.075 on 1 and 52 DF, p-value: 0.04869
- Compare the variance explained by lm2, lm3 and lm4
lm2$coef
str(summary(lm2))
summary(lm2)$coefficients
summary(lm2)$r.squared
Website: http://qcbs.ca/wiki/r/workshop4
1 with 3 levels or more
and/ or many factors
GLM
3. ANOVA
Did we violate the model assumptions?
Running an ANOVA in R
Graphical representation
Running a one-way ANOVA in R
What if the assumptions were not met?
A posteriori test
Output of our ANOVA model
Running a one-way ANOVA in R
Types of ANOVA
Analysis of Variance (ANOVA)
Review: ANOVA
Running an ANOVA in R
Review: ANOVA
- Bartlett test - equality of variance among groups
- First, visualize the data using boxplot() (alphabetically by default)
- The ANOVA results can be graphically illustrated using the barplot() function
- Can use the function lm() to run an ANOVA
summary.lm(aov1)
summary(anov1)
bartlett.test(logMaxAbund ~ Diet, data=bird)
- Generalization of the t-test to >2 groups and/or ≥ 2 factor levels
- Transform the data - might normalize the residuals and homogenize the variances, and convert a multiplicative effect into an additive one
boxplot(logMaxAbund ~ Diet, data=bird)
Compares variation within and between groups in order to determine if group means differ
- If ANOVA detects a significant effect, a post-hoc test is required to determine which treatment(s) differ from the others. This can be done with the TukeyHSD() function
Is abundance a function of diet?
Assumptions
anov1 <- lm(logMaxAbund ~ Diet, data=bird)
anova(anov1)
- One factor with >2 levels
- Response variable: MaxAbund / num: continuous
- Factor levels in alphabetical order and all levels are compared to the reference level ("Insect")
Bartlett test of homogeneity of variances
data: logMaxAbund by Diet
Bartlett's K-squared = 7.4728, df = 4, p-value = 0.1129
data$logY <- log10(data$Y)
- Normality of residuals
- Equality of variance between groups
- We can reorder factor levels according to group medians using the tapply() et sort() functions
- Explanatory variable: Diet / factor with 5 levels
Hypothesis
TukeyHSD(aov(anov1),ordered=T)
sd <- tapply(bird$logMaxAbund,list(bird$Diet),sd)
means <- tapply(bird$logMaxAbund,list(bird$Diet),mean)
n <- length(bird$logMaxAbund)
se <- 1.96*sd/sqrt(n)
bp <- barplot(means)
segments(bp, means - se, bp, means + se, lwd=2)
segments(bp - 0.1, means - se, bp + 0.1, means - se, lwd=2)
segments(bp - 0.1, means + se, bp + 0.1, means + se, lwd=2)
aov1 <- aov(logMaxAbund ~ Diet, data=bird)
summary(aov1)
Only "Vertebrate" and "PlantInsect" differ
* see Workshop 1 for rules on data transformation
- Or use a specific function called aov()
med <- sort(tapply(bird$logMaxAbund, bird$Diet, median))
boxplot(logMaxAbund ~ factor(Diet, levels=names(med)))
- For each factor and each interaction term:
- H0: The mean of the response variable is equal between groups
- H1: At least two groups have a different mean
* re-run your data with the transformed variables and verify the assumptions once again
- Shapiro-Wilk test - Normality of residuals
Complementary test
- 2 factors or more
- Each factor can have multiple levels
- Need to test for interactions among factors
aov1 <- aov(logMaxAbund ~ Diet, data=bird)
summary(aov1)
shapiro.test(resid(anov1))
Significant difference btw Diet groups, but we don't know which ones!
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1539 0.1500 7.692 5.66e-10 ***
DietInsectVert -0.6434 0.4975 -1.293 0.2020
DietPlant 0.2410 0.4975 0.484 0.6303
DietPlantInsect 0.4223 0.2180 1.938 0.0585 .
DietVertebrate -0.3070 0.2450 -1.253 0.2161
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6709 on 49 degrees of freedom
Multiple R-squared: 0.188, Adjusted R-squared: 0.1217
F-statistic: 2.836 on 4 and 49 DF, p-value: 0.0341
- When ANOVA detects a significant difference between groups, it does not tell you which group differs from which other group
$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...
$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...
$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...
$ Mass : num 716 5.3 35.8 119.4 315.5 ...
$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...
$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...
$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...
- When an interaction term is significant: the effect of a given factor varies according to the levels of another factor
+ Interactions Between Factors
Df Sum Sq Mean Sq F value Pr(>F)
Diet 4 5.106 1.276 2.836 0.0341 *
Residuals 49 22.052 0.450
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Shapiro-Wilk normality test
data: resid(anov1)
W = 0.98, p-value = 0.4982
- Kruskal-Wallis Test - non-parametric equivalent to ANOVA if you cannot (or do not wish to) transform the data
$Diet
diff lwr upr p adj
Vertebrate-InsectVert 0.3364295 -1.11457613 1.787435 0.9645742
Insect-InsectVert 0.6434334 -0.76550517 2.052372 0.6965047
Plant-InsectVert 0.8844338 -1.01537856 2.784246 0.6812494
PlantInsect-InsectVert 1.0657336 -0.35030287 2.481770 0.2235587
Insect-Vertebrate 0.3070039 -0.38670951 1.000717 0.7204249
Plant-Vertebrate 0.5480043 -0.90300137 1.999010 0.8211024
PlantInsect-Vertebrate 0.7293041 0.02128588 1.437322 0.0405485
Plant-Insect 0.2410004 -1.16793813 1.649939 0.9884504
PlantInsect-Insect 0.4223003 -0.19493574 1.039536 0.3117612
PlantInsect-Plant 0.1812999 -1.23473664 1.597336 0.9961844
Repeated measures?
- Try it and compare the outputs!
- A commonly used post-hoc test to answer this question is the Tukey's test
- Just like the t-test, the robustness of ANOVA increases with sample size and if groups have equal sizes
kruskal.test(Y~X, data)
- Note that ANOVA can be used for repeated measures, but we won't cover this today. Mixed-effect linear models can also be used for this kind of data (see Workshop 5!)
The constant is the mean of the values of the response variable
- Both tests are non-significant; variances assumed to be equal and residuals assumed to be normally distributed
Two-way ANOVA
Challenge 2
Challenge 2 - Solution
anova.with.lm <- lm(Y~X*Z, data)
anov2 <- lm(logMaxAbund ~ Diet*Aquatic, data=bird)
summary(anov2)
anova(anov2)
- According to law of parsimony, select model that explain the most variance with the least model parameters possible
- Thus, remove the interaction term if non-signficant and re-evaluate
The "*" symbol indicates that the main effects as well as their interaction will be included in the model
- Evaluate how log(MaxAbund) varies with Diet and Aquatic
anova.with.lm <- lm(Y~X+Z, data)
Analysis of Variance Table
Response: logMaxAbund
Df Sum Sq Mean Sq F value Pr(>F)
Diet 4 5.1059 1.27647 3.0378 0.02669 *
Aquatic 1 0.3183 0.31834 0.7576 0.38870
Diet:Aquatic 3 2.8250 0.94167 2.2410 0.09644 .
Residuals 45 18.9087 0.42019
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Hint: Test the factors Diet, Aquatic and their interaction in a two-way ANOVA model
e.g. lm(Y ~ A*B)
where A is your first factor, B is your second and "*" is the command for interaction in R
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
X 4 5.1059 1.27647 3.0378 0.02669 *
Z 1 0.3183 0.31834 0.7576 0.38870
X:Z 3 2.8250 0.94167 2.2410 0.10689
Residuals 45 18.9087 0.42019
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If use the "+" symbol, the main effects but not their interaction are included
Example of a non-significant interaction
5. Multiple linear
regression
Verify assumptions!
Multiple linear regression in R
- However, the response variable (abund) is not linearly related to the explanatory variables.
Find the best-fit model
Multiple linear regression
Multiple linear regression in R
- Explanatory variable: 2 or more continuous variables
- Response variable: 1 continuous variable
Collinearity:
- Run the multiple linear regression in R with abund as a function of clTma + NDVI + grass
par(mfrow=c(1,3))
plot(abund ~ clTma, data=Dickcissel)
plot(abund ~ NDVI, data=Dickcissel)
plot(abund ~ grass, data=Dickcissel)
- Recall the principle of parcimony: Most variance explained using the least number of terms possible
- Using the Dickcissel datast, test the effect of climate (clTma), productivity (NDVI) and grassland cover (grass) on bird abundance (abund)
- Verify collinearity of all explanatory variables
- Generalization of simple linear regression
plot(data=Dickcissel)
- Remove the weakest variable
lm.mult <- lm(abund ~ clTma + NDVI + grass, data=Dickcissel)
summary(lm.mult)
Yi = α + β1X1i + β2X2i + β3X3i + ...+ βkXki+ ε
summary(lm.mult)
Dickcissel <- read.csv("Dickcissel.csv")
str(Dickcissel)
Assumptions
- Verify diagnostic plots, as you did for the simple linear regression
- If you see a pattern between any two variables, they may be collinear!
- They are likely to explain the same variability of the response variable and the effect of one variable will be masked by the other
- Keep only one of the collinear variables
- Try multivariate analyses (see Ordination Workshop)
- Try a pseudo-orthogonal analysis
All 3 variables are significant
Model explains ~11% of the dickcissel variance (R2-adj = 0.11)
Possible solutions:
Call:
lm(formula = abund ~ clTma + NDVI + grass, data = Dickcissel)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -83.60813 11.57745 -7.222 1.46e-12 ***
clTma 3.27299 0.40677 8.046 4.14e-15 ***
NDVI 0.13716 0.05486 2.500 0.0127 *
grass 10.41435 4.68962 2.221 0.0267 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.58 on 642 degrees of freedom
Multiple R-squared: 0.117, Adjusted R-squared: 0.1128
F-statistic: 28.35 on 3 and 642 DF, p-value: < 2.2e-16
'data.frame': 646 obs. of 15 variables:
$ abund : num 5 0.2 0.4 0 0 0 0 0 0 0 ...
$ Present : Factor w/ 2 levels "Absent","Present": 1 1 1 2 2 2 2 2 2 2 ...
$ clDD : num 5543 5750 5395 5920 6152 ...
$ clFD : num 83.5 67.5 79.5 66.7 57.6 59.2 59.5 51.5 47.4 46.3 ...
$ clTmi : num 9 9.6 8.6 11.9 11.6 10.8 10.8 11.6 13.6 13.5 ...
$ clTma : num 32.1 31.4 30.9 31.9 32.4 32.1 32.3 33 33.5 33.4 ...
$ clTmn : num 15.2 15.7 14.8 16.2 16.8 ...
$ clP : num 140 147 148 143 141 ...
$ NDVI : int -56 -44 -36 -49 -42 -49 -48 -50 -64 -58 ...
$ broadleaf: num 0.3866 0.9516 0.9905 0.0506 0.2296 ...
$ conif : num 0.0128 0.0484 0 0.9146 0.7013 ...
$ grass : num 0 0 0 0 0 0 0 0 0 0 ...
$ crop : num 0.2716 0 0 0.0285 0.044 ...
$ urban : num 0.2396 0 0 0 0.0157 ...
$ wetland : num 0 0 0 0 0 0 0 0 0 0 ...
par(mfrow=c(2,2))
plot(lm.mult)
- Each of the explanatory variables must follow a normal distribution
- Linear relationship between each explanatory variable and response variable
- Explanatory variables are orthogonal (NO collinearity)
- Residuals follow a normal distribution
- Independence of residuals versus predicted values
- Independence of the variance of residuals versus predicted values
- No outliers
*same as simple linear regression!
* See Advanced section on polynomial regression for solution!
Variables selected by the step() function
Stepwise regression
- Run a model with everything in except the "Present/ absence" variable
- Use step() to iteratively select the best model,
- i.e. model with lowest Akaike's Information Criterion (AIC)
Call:
lm(formula = abund ~ clDD + clFD + clTmi + clTma + clP + grass,
data = Dickcissel)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.457e+02 3.464e+01 -12.868 < 2e-16 ***
clDD 5.534e-02 8.795e-03 6.292 5.83e-10 ***
clFD 1.090e+00 1.690e-01 6.452 2.19e-10 ***
clTmi -6.717e+00 7.192e-01 -9.339 < 2e-16 ***
clTma 3.172e+00 1.288e+00 2.463 0.014030 *
clP 1.562e-01 4.707e-02 3.318 0.000959 ***
grass 1.066e+01 4.280e+00 2.490 0.013027 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.85 on 639 degrees of freedom
Multiple R-squared: 0.3207, Adjusted R-squared: 0.3144
F-statistic: 50.29 on 6 and 639 DF, p-value: < 2.2e-16
lm.full <- lm(abund ~ . - Present, data=Dickcissel)
summary(lm.full)
lm.step <- step(lm.full)
summary(lm.step)
?step
The model now accounts for 31.44% of the Dickcissel abundance variability
Optional section
(... if time permits)
Optional
1. Interpreting contrasts
Interpreting contrasts
- We may want to relevel the baseline:
- DEFAULT contrast compares each level of a factor to baseline level
- The intercept is the baseline group and corresponds to the mean of the first (alphabetically) Diet level (Insect)
1) Compare the Plant diet to all other diet levels
- Note, the DEFAULT contrast ("contr.treatment") is NOT orthogonal
- To be orthogonal,
1) coefficients must sum to 0
2) any two contrast columns must sum to 0
bird$Diet2 <- relevel(bird$Diet, ref="Plant")
anov2 <- lm(logMaxAbund ~ Diet2, data=bird)
summary(anov2)
anova(anov2)
sum(contrasts(bird$Diet)[,1])
sum(contrasts(bird$Diet)[,1]*contrasts(bird$Diet)[,2])
- Add Intercept + coefficient estimates of each Diet level
2) Reorder multiple levels according to median
- Change the contrast to make levels orthogonal (e.g. Helmert contrast will contrast the second level with the first, the third with the average of the first two, and so on)
tapply(bird$logMaxAbund, bird$Diet, mean)
coef(anov1)
coef(anov1)[1] + coef(anov1)[2] # InsectVert
coef(anov1)[1] + coef(anov1)[3] # Plant
bird$Diet2 <- factor(bird$Diet, levels=names(med))
anov2 <- lm(logMaxAbund ~ Diet2, data=bird)
summary(anov2)
anova(anov2)
options(contrasts=c("contr.helmert", "contr.poly"))
sum(contrasts(bird$Diet)[,1])
sum(contrasts(bird$Diet)[,1]*contrasts(bird$Diet)[,2])
anov3 <- lm(logMaxAbund ~ Diet, data=bird)
summary(anov3)
- Did you see change in the significance of each Diet level?
Unbalanced ANOVA
- The birdsdiet data is actually unbalanced (number of Aquatic and non-Aquatic is not equal)
table(bird$Aquatic)
- Which means the order of the covariates changes the values of Sums of Squares
unb.anov1 <- lm(logMaxAbund ~ Aquatic + Diet, data=bird)
unb.anov2 <- lm(logMaxAbund ~ Diet + Aquatic, data=bird)
anova(unb.anov1)
anova(unb.anov2)
- Now try type III ANOVA using the Anova() function. What have you noticed when using Anova()?
library(car)
Anova(unb.anov1,type="III")
Anova(unb.anov2,type="III")
Polynomial regression
- When you know a degree, you can also give it a name:
- For a polynomial with one variable (x), the degree is the largest exponent of that variable
- As we noticed in the section on multiple linear regression, MaxAbund was non-linearly related to some variables.
- To test for non-linear relationships, polynomial models of different degrees are compared.
- A polynomial model looks like this:
this polynomial has 3 terms
Polynomial regression
- Using the Dickcissel dataset, test the non-linear relationship between max abundance and temperature by comparing three sets of nested polynomial models (of degrees 0, 1, and 3):
- Compare the polynomial models and determine which nested model we should keep
- Run a summary of this model, report the regression formula, p-values and R2-adj
lm.linear <- lm(abund ~ clDD, data=Dickcissel)
lm.quad <- lm(abund ~ clDD + I(clDD^2), data=Dickcissel)
lm.cubic <- lm(abund ~ clDD + I(clDD^2) + I(clDD^3), data=Dickcissel)
Variance Partitioning
Variance Partitioning
- Test significance of each fraction
- Proportion of variance explained by climate alone is 28.5% (given by X1|X2), by land cover alone is ~0% (X2|X1), and by both combined is 2.4%
- Use varpart() to partition the variance in max abundance with all land cover variables in one set and all climate variables in the other set (leaving out NDVI)
- Note: Collinear variables do not have to be removed prior to partitioning
- Some of the selected explanatory variables in the multiple linear regression section were highly correlated
- Collinearity between explanatory variables can be assessed using the variance inflation factor - vif() function of package ‘HH’
- Variable with VIF >> 5 are considered collinearity
# Climate set
out.1 = rda(Dickcissel$abund, Dickcissel[ ,c("clDD",
"clFD","clTmi","clTma","clP")], Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban","wetland")])
anova(out.1, step=1000, perm.max=1000)
# Land cover set
out.2 = rda(Dickcissel$abund,
Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban", "wetland")], Dickcissel[ ,c("clDD","clFD","clTmi", "clTma","clP")])
anova(out.2, step=1000, perm.max=1000)
library(vegan)
part.lm = varpart(Dickcissel$abund,
Dickcissel[ ,c("clDD","clFD","clTmi","clTma","clP")],
Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban","wetland")])
part.lm
showvarparts(2)
plot(part.lm,digits=2,bg=rgb(48,225,210,80,maxColorValue=225),col="turquoise4")
vif(clDD ~ clFD + clTmi + clTma + clP + grass, data=Dickcissel)
- Conclusion: the land cover fraction is non-significant once climate data is accounted for