**Quebec Centre for Biodiversity Science**

R Workshop Series


**Workshop 4:**

Linear models


**Website:**

http://qcbs.ca/wiki/r/workshop4


**Learning objectives**

Download the birdsdiet dataset:

setwd() # set your working directory

bird <- read.csv("birdsdiet.csv")

**Type of explanatory variable**

**1 variable**

**1. Simple linear regression**

**2 or more**

**5. Multiple linear regression**

**1 with 2 levels**

**2. T test**

**3. ANOVA**

**Continuous**

**Categorical**

**Continuous AND categorical**

**4. ANCOVA**

We first want to test if bird maximum abundance is a function of bird mass

Response variable: Bird abundance / num: continuous

Explanatory variable: Bird mass / num: continuous

In R, the function lm() is used to fit a linear model:

lm1 <- lm(Y ~ X)

lm1 is the new object containing the linear model we created; Y is the response variable and X the explanatory variable

Running a regression in R

Assumptions

Review: Simple linear regression

Yi = α + βXi + εi

Linear relationship between response (Y) and explanatory (X) variable
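As a minimal sketch of this equation (simulated data, since the bird dataset may not be loaded yet), lm() recovers α and β:

```r
# Simulate Yi = alpha + beta*Xi + error, then fit with lm()
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.5)  # true alpha = 2, beta = 0.5
sim.lm <- lm(y ~ x)
coef(sim.lm)  # estimates land close to 2 and 0.5
```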

Step 1:

Run your linear model

Step 2:

Verify assumptions

Assumptions are met

Step 3:

Estimate parameters (regression coefficients), test significance, plot your model

Assumptions are not met

Can you transform your variables (and does it make sense to do so)?

Go back to

Step 1

with transformed variables

Yes

No

Try GLM that might better suit your data

Step 1:

Run your linear model

par() sets the graphical parameters; for example, the mfrow argument sets the number of rows and columns of the plotting frame

plot() is a generic function to plot graphics in R

opar <- par(mfrow=c(2,2))

plot(lm1)

The output will provide the four built-in diagnostic plots of the lm() function

Should show a scatter across and no pattern

Example of independence

(what we want!)

Examples of non-independence

(what we don't want)

Solution:

Transform your data or try another distribution than linear (Gaussian) (i.e., a generalized linear model (GLM): Poisson, binomial, negative binomial, etc.)

Diagnostic plot # 1 - Residuals vs Fitted

Diagnostic plot # 2 - Scale Location

Should show a scatter across and no pattern

Diagnostic plot # 3 - Normal QQ

If points lie linearly on the 1:1 line, data follow a normal distribution

No pattern in the residuals

Strong pattern in the residuals

Compares the distribution (quantiles) of the residuals of the current model to those of a normal distribution

Diagnostic plot # 4 - Residuals vs Leverage

Leverage vs influence

No influential values

High leverage point and reasonable influence

Here, point 32 has high leverage but its influence is acceptable (inside the 0.5 Cook's distance limits)

High leverage point and high influence

Points are outside the 0.5 limit of Cook's distance

These points have too much influence on the regression

The function summary() is used to obtain parameter estimates, significance, etc.:

Step 3 :

Estimate parameters and test their significance

** Note: you should never remove outliers unless you have good reasons to do so (e.g. measurement error)

**GLM**


Group discussion

Challenge 1

T-test

Assumptions

Data follow a normal distribution

Equality of variance between groups

Running a T-test in R

t.test(Y~X2, data= data, alternative = "two.sided")

Use the function t.test(), where Y is the response variable, X2 a factor with 2 levels, and data the name of the dataframe

Alternative hypothesis:

"two.sided" (default), "less", or "greater"

The t-test is still a linear model and a specific case of ANOVA with one factor with 2 levels

lm.t <-lm(Y~X2, data = data)

anova(lm.t)

mean of the two groups

Running a T test in R


**1 with 3 levels or more** and/or many factors

Y = μ + Main effects of factors + Interactions between factors + Residuals

Review: ANOVA

Running a one-way ANOVA in R

The function lm() can be used to run an ANOVA:

anov1 <- lm(logMaxAbund ~ Diet, data=bird)

anova(anov1)

Or use the dedicated function aov():

aov1 <- aov(logMaxAbund ~ Diet, data=bird)

summary(aov1)

Did we violate the model assumptions?

Bartlett test - equality of variance among groups

bartlett.test(logMaxAbund ~ Diet, data=bird)

Shapiro-Wilk test - normality of residuals

shapiro.test(resid(anov1))

What if the assumptions were not met?

Transform the data - this might normalize the residuals, homogenize the variances, and convert a multiplicative effect into an additive one

data$logY <- log10(data$Y)

* Re-run the model with the transformed variables and verify the assumptions once again

Kruskal-Wallis Test - non-parametric equivalent of ANOVA if you cannot (or do not wish to) transform the data

kruskal.test(Y~X, data)

* see Workshop 1 for rules on data transformation
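A brief sketch of the non-parametric route, on simulated right-skewed data (group labels are made up):

```r
# Kruskal-Wallis test: compares ranks rather than means,
# so normality of the data is not required
set.seed(1)
df <- data.frame(
  Y = c(rlnorm(20), rlnorm(20, meanlog = 0.8), rlnorm(20)),
  X = rep(c("A", "B", "C"), each = 20)
)
kruskal.test(Y ~ X, data = df)  # chi-squared approximation, df = k - 1 = 2
```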

$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...

$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...

$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...

$ Mass : num 716 5.3 35.8 119.4 315.5 ...

$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...

$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...

$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...

Running an ANOVA in R

Output of our ANOVA model

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.1539 0.1500 7.692 5.66e-10 ***

DietInsectVert -0.6434 0.4975 -1.293 0.2020

DietPlant 0.2410 0.4975 0.484 0.6303

DietPlantInsect 0.4223 0.2180 1.938 0.0585 .

DietVertebrate -0.3070 0.2450 -1.253 0.2161

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6709 on 49 degrees of freedom

Multiple R-squared: 0.188, Adjusted R-squared: 0.1217

F-statistic: 2.836 on 4 and 49 DF, p-value: 0.0341

Factor levels in alphabetical order and all levels are compared to the reference level ("Insect")

Significant difference between Diet groups, but we don't know which ones!

A posteriori test

TukeyHSD(aov(anov1),ordered=T)

$Diet

diff lwr upr p adj

Vertebrate-InsectVert 0.3364295 -1.11457613 1.787435 0.9645742

Insect-InsectVert 0.6434334 -0.76550517 2.052372 0.6965047

Plant-InsectVert 0.8844338 -1.01537856 2.784246 0.6812494

PlantInsect-InsectVert 1.0657336 -0.35030287 2.481770 0.2235587

Insect-Vertebrate 0.3070039 -0.38670951 1.000717 0.7204249

Plant-Vertebrate 0.5480043 -0.90300137 1.999010 0.8211024

PlantInsect-Vertebrate 0.7293041 0.02128588 1.437322 0.0405485

Plant-Insect 0.2410004 -1.16793813 1.649939 0.9884504

PlantInsect-Insect 0.4223003 -0.19493574 1.039536 0.3117612

PlantInsect-Plant 0.1812999 -1.23473664 1.597336 0.9961844

Only "Vertebrate" and "PlantInsect" differ

Two-way ANOVA

anova.with.lm <- lm(Y~X*Z, data)

The "*" symbol indicates that the main effects as well as their interaction will be included in the model

Analysis of Variance Table

Response: Y

Df Sum Sq Mean Sq F value Pr(>F)

X 4 5.1059 1.27647 3.0378 0.02669 *

Z 1 0.3183 0.31834 0.7576 0.38870

X:Z 3 2.8250 0.94167 2.2410 0.10689

Residuals 45 18.9087 0.42019

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Example of a non-significant interaction

anova.with.lm <- lm(Y~X+Z, data)

If the "+" symbol is used, the main effects but not their interaction are included

According to the law of parsimony, select the model that explains the most variance with the fewest parameters possible

Thus, remove the interaction term if it is non-significant and re-evaluate
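Dropping the interaction can be formalized as an F-test between the two nested models; a sketch with simulated data (no true interaction built in):

```r
# Fit the model with and without the interaction, then compare with anova()
set.seed(7)
d <- data.frame(X = rnorm(60), Z = gl(2, 30, labels = c("a", "b")))
d$Y <- 1 + 0.8 * d$X + rnorm(60)        # Y depends on X only
full    <- lm(Y ~ X * Z, data = d)      # main effects + interaction
reduced <- lm(Y ~ X + Z, data = d)      # main effects only
anova(reduced, full)  # F-test on the 1 df used by the interaction
```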

You will see in the assumptions that there should be NO interaction between covariates and factors

ancova.exemple <- lm(Y~X*Z, data=data)

summary(ancova.exemple)

2- Test whether the interaction is significant

ancova.exemple2 <- lm(Y~X+Z, data=data)

No interaction

1- Run an ANCOVA to test the effect of Diet and Mass (as well as their interaction) on Maximum abundance

If your covariate and factor are significant, you will have a scenario that looks like this:

summary(ancova.exemple2)

If you want to compare the mean values of each factor, conditional on the effect of the other, the adjusted means should be visualized

The effect() function uses the output of the ANCOVA model to estimate the means of each factor level, corrected by the effect of the covariate

ancova.exemple <- lm(Y ~ X*Z, data=data)

# X = quantitative; Z = categorical

library(effects)

adj.means.ex <- effect('Z', ancova.exemple)

plot(adj.means.ex)

If only your factor is significant, remove the covariate -> you will have a simple

ANOVA

If the interaction between your covariate and factor (*) is significant, you should explore which levels of the factor have different slopes from the others

If only your covariate is significant, remove your factor -> you will have a

simple linear regression

Verify assumptions!

Very similar to what we have done so far!


Yi = α + β1X1i + β2X2i + β3X3i + ...+ βkXki+ ε

Verify collinearity of all explanatory variables

Verify assumptions!

lm.mult <- lm(abund ~ clTma + NDVI + grass, data=Dickcissel)

summary(lm.mult)

plot(Dickcissel) # pairwise scatterplots of all variables

If you see a pattern between any two variables, they may be collinear!

They are likely to explain the same variability of the response variable and the effect of one variable will be masked by the other
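A quick way to screen for this, sketched on simulated predictors (x2 is deliberately built as a near-copy of x1):

```r
# A correlation matrix flags pairs of collinear explanatory variables
set.seed(3)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # nearly duplicates x1 -> collinear
x3 <- rnorm(100)                 # independent of the others
round(cor(cbind(x1, x2, x3)), 2) # cor(x1, x2) is close to 1
```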

Keep only one of the collinear variables

Try multivariate analyses (see Ordination Workshop)

Try a pseudo-orthogonal analysis

Verify diagnostic plots, as you did for the simple linear regression

par(mfrow=c(2,2))

plot(lm.mult)

Recall the principle of parsimony: explain the most variance using the least number of terms possible

Find the best-fit model

Using the Dickcissel dataset, test the effect of climate (clTma), productivity (NDVI) and grassland cover (grass) on bird abundance (abund)

Step 2:

Verify assumptions using diagnostic plots

lm1 <- lm(bird$MaxAbund ~ bird$Mass)

bird$logMaxAbund <- log10(bird$MaxAbund)

bird$logMass <- log10(bird$Mass)

Can we improve the model if we only use terrestrial birds?

lm3 <- lm(logMaxAbund~logMass, data=bird, subset=!bird$Aquatic)

# removes aquatic birds (i.e. keeps rows where Aquatic is FALSE)

# or equivalently

lm3 <- lm(logMaxAbund~logMass, data=bird, subset=bird$Aquatic == 0)

# Examine the diagnostic plots

par(mfrow=c(2,2))

plot(lm3)

summary(lm3)

# Compare both models

par(mfrow=c(1,2))

plot(logMaxAbund~logMass, data=bird)

plot(logMaxAbund~logMass, data=bird, subset=!bird$Aquatic)

ttest1 <- t.test(logMass~Aquatic, var.equal=TRUE, data=bird)

# Or use lm()

ttest.lm1 <- lm(logMass ~ Aquatic, data=bird)

boxplot(logMaxAbund ~ Diet, data=bird)

par(mfrow=c(1,3))

plot(abund ~ clTma, data=Dickcissel)

plot(abund ~ NDVI, data=Dickcissel)

plot(abund ~ grass, data=Dickcissel)

Stepwise regression

Running a T test in R

boxplot(logMass ~ Aquatic, data=bird,names=c("Non-Aquatic","Aquatic"))

First, let's visualize the data using the function boxplot()

Graphical representation

**Optional section**

1. Interpreting contrasts

2. Unbalanced ANOVA

4. Variance partitioning

Hypothesis (for a bilateral t-test)

If p<0.01 (or 0.05), the null hypothesis of no difference between the two groups (H0) can be rejected, with a risk of 0.01 (or 0.05) that we made a mistake in this conclusion.
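The decision rule can be sketched with simulated groups whose true means differ:

```r
# Two-sided t-test: reject H0 when the p-value falls below the chosen threshold
set.seed(5)
group1 <- rnorm(30, mean = 10)
group2 <- rnorm(30, mean = 12)
res <- t.test(group1, group2, alternative = "two.sided")
res$p.value < 0.05  # TRUE here: the simulated means genuinely differ
```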

Running a t-test in R

Next, test the assumption of equality of variance using the var.test() function

var.test(logMass~Aquatic,data=bird)

Group discussion

Are aquatic birds heavier than terrestrial birds?

# Unilateral t-test

uni.ttest1 <- t.test(logMass~Aquatic, var.equal=TRUE, data=bird, alternative="less")

uni.ttest1

Running a t-test in R

Two Sample t-test

data: logMass by Aquatic

t = -7.7707, df = 52, p-value = 2.936e-10

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-1.6669697 -0.9827343

sample estimates:

mean in group 0 mean in group 1

1.583437 2.908289

There exists a difference in mass between the aquatic and terrestrial birds

What did you conclude?

Two Sample t-test

data: logMass by Aquatic

t = -7.7707, df = 52, p-value = 1.468e-10

alternative hypothesis: true difference in means is less than 0

95 percent confidence interval:

-Inf -1.039331

sample estimates:

mean in group 0 mean in group 1

1.583437 2.908289

Yes, aquatic birds are heavier than terrestrial birds

Group discussion

sd <- tapply(bird$logMaxAbund, list(bird$Diet), sd)

means <- tapply(bird$logMaxAbund, list(bird$Diet), mean)

n <- tapply(bird$logMaxAbund, list(bird$Diet), length) # sample size per group

se <- 1.96*sd/sqrt(n) # approximate 95% CI half-width for each group mean

bp <- barplot(means)

segments(bp, means - se, bp, means + se, lwd=2)

segments(bp - 0.1, means - se, bp + 0.1, means - se, lwd=2)

segments(bp - 0.1, means + se, bp + 0.1, means + se, lwd=2)

Both tests are non-significant; variances assumed to be equal and residuals assumed to be normally distributed

3. Polynomial regression

Optional

Plot Y ~ X with corresponding best-fit regression line, and a histogram of Y and X to explore their distributions

Let's try normalizing the data with a log10() transformation and add the log-transformed variables to our dataframe

We can also use the Shapiro-Wilk and skewness tests to see if the variables follow a normal distribution:

shapiro.test(bird$MaxAbund)

shapiro.test(bird$Mass)

library(e1071) # provides skewness()

skewness(bird$MaxAbund)

skewness(bird$Mass)

Shapiro-Wilk normality test

data: bird$MaxAbund

W = 0.5831, p-value = 3.872e-11

Shapiro-Wilk normality test

data: bird$Mass

W = 0.5467, p-value = 1.155e-11

[1] 3.167159 # Skewness of MaxAbund

[1] 3.146062 # Skewness of Mass

In both cases, distributions are significantly different from normal

The positive skewness values also indicate that the distributions are right skewed (long right tail)
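A sketch of how a log10 transform pulls in a right tail, on simulated lognormal data (skewness computed by hand here to avoid assuming a package is installed):

```r
# Sample skewness: mean cubed z-score; positive => long right tail
skew <- function(v) mean(((v - mean(v)) / sd(v))^3)
set.seed(9)
raw <- rlnorm(500)     # strongly right-skewed
skew(raw)              # large positive value
skew(log10(raw))       # near 0: the log10 transform symmetrizes it
```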

Much improved!

plot(lm2)

Challenge 1 - Solution

Call:

lm(formula = abund ~ clDD + clFD + clTmi + clTma + clP + grass,

data = Dickcissel)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -4.457e+02 3.464e+01 -12.868 < 2e-16 ***

clDD 5.534e-02 8.795e-03 6.292 5.83e-10 ***

clFD 1.090e+00 1.690e-01 6.452 2.19e-10 ***

clTmi -6.717e+00 7.192e-01 -9.339 < 2e-16 ***

clTma 3.172e+00 1.288e+00 2.463 0.014030 *

clP 1.562e-01 4.707e-02 3.318 0.000959 ***

grass 1.066e+01 4.280e+00 2.490 0.013027 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.85 on 639 degrees of freedom

Multiple R-squared: 0.3207, Adjusted R-squared: 0.3144

F-statistic: 50.29 on 6 and 639 DF, p-value: < 2.2e-16

Variables selected by the step() function

(... if time permits)

ttest1$statistic^2

anova(ttest.lm1)$F

# answer: F=60.3845 in both cases

All 3 variables are significant

Model explains ~11% of the Dickcissel abundance variance (R2-adj = 0.11)

Call:

lm(formula = abund ~ clTma + NDVI + grass, data = Dickcissel)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -83.60813 11.57745 -7.222 1.46e-12 ***

clTma 3.27299 0.40677 8.046 4.14e-15 ***

NDVI 0.13716 0.05486 2.500 0.0127 *

grass 10.41435 4.68962 2.221 0.0267 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.58 on 642 degrees of freedom

Multiple R-squared: 0.117, Adjusted R-squared: 0.1128

F-statistic: 28.35 on 3 and 642 DF, p-value: < 2.2e-16

When the assumption of equal variance is met t^2 follows an F distribution
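This equivalence can be demonstrated on simulated data:

```r
# The squared pooled-variance t statistic equals the one-way ANOVA F statistic
set.seed(11)
d <- data.frame(Y = rnorm(40), G = gl(2, 20))
tt <- t.test(Y ~ G, var.equal = TRUE, data = d)
fv <- anova(lm(Y ~ G, data = d))$`F value`[1]
all.equal(unname(tt$statistic)^2, fv)  # TRUE
```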

We can reorder factor levels according to group medians using the tapply() and sort() functions

med <- sort(tapply(bird$logMaxAbund, bird$Diet, median))

boxplot(logMaxAbund ~ factor(Diet, levels=names(med)), data=bird)

Remove the weakest variable


plot(lm1)

summary(lm2)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.6724 0.2472 6.767 1.17e-08 ***

logMass -0.2361 0.1170 -2.019 0.0487 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6959 on 52 degrees of freedom

Multiple R-squared: 0.07267, Adjusted R-squared: 0.05484

F-statistic: 4.075 on 1 and 52 DF, p-value: 0.04869

We can also call out specific parameters of the model, for example:

lm2$coef

str(summary(lm2))

summary(lm2)$coefficients

summary(lm2)$r.squared

Multiple linear regression in R

Dickcissel <- read.csv("Dickcissel.csv")

str(Dickcissel)

Examine the data structure using the str() command:

str(bird)

$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...

$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...

$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...

$ Mass : num 716 5.3 35.8 119.4 315.5 ...

$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...

$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...

$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...

Step 2: Verify assumptions of lm1

Transform the data

plot(logMaxAbund ~ logMass, data=bird)

abline(lm2)

hist(log10(bird$MaxAbund))

hist(log10(bird$Mass))

lm2 <- lm(bird$logMaxAbund ~ bird$logMass)

Step 2: Verify the assumptions of model lm2

plot(bird$MaxAbund ~ bird$Mass)

abline(lm1)

hist(bird$MaxAbund)

hist(bird$Mass)

* The robustness of this test increases with sample size and is higher when groups have equal sizes

Thus the function lm() can also be used

Running an ANOVA in R

summary.lm(aov1)

summary(anov1)

If ANOVA detects a significant effect, a post-hoc test is required to determine which treatment(s) differ from the others. This can be done with the TukeyHSD() function

Two-way ANOVA

Challenge 2 - Solution

anov2 <- lm(logMaxAbund ~ Diet*Aquatic, data=bird)

summary(anov2)

anova(anov2)

Analysis of Variance Table

Response: logMaxAbund

Df Sum Sq Mean Sq F value Pr(>F)

Diet 4 5.1059 1.27647 3.0378 0.02669 *

Aquatic 1 0.3183 0.31834 0.7576 0.38870

Diet:Aquatic 3 2.8250 0.94167 2.2410 0.09644 .

Residuals 45 18.9087 0.42019

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

ANCOVA with one covariate and one factor

Running an ANCOVA in R

Challenge 3


Assumptions not met - where did we go wrong?

Group discussion

**You can exclude observations using the logical NOT operator "!"

F test to compare two variances

data: logMass by Aquatic

F = 1.0725, num df = 38, denom df = 14, p-value = 0.9305

alternative hypothesis: true ratio of variances is not equal to 1

95 percent confidence interval:

0.3996428 2.3941032

sample estimates:

ratio of variances

1.072452

Indicates that homogeneity of variance was respected (as we just tested)

uni.ttest1

Challenge 3 - Solution

ancov1 <- lm(logMaxAbund ~ logMass*Diet, data=bird)

anova(ancov1)

Analysis of Variance Table

Response: logMaxAbund

Df Sum Sq Mean Sq F value Pr(>F)

logMass 1 1.9736 1.97357 4.6054 0.03743 *

Diet 4 3.3477 0.83691 1.9530 0.11850

logMass:Diet 4 2.9811 0.74527 1.7391 0.15849

Residuals 44 18.8556 0.42854

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Interaction between logMass and Diet is non-significant

ancov2 <- lm(logMaxAbund ~ logMass + Diet, data=bird)

anova(ancov2)

Remove the interaction term and re-evaluate the model (with only the main effects of Diet and logMass)

Challenge 3 - Solution

Analysis of Variance Table

Model 1: logMaxAbund ~ logMass

Model 2: logMaxAbund ~ logMass + Diet

Res.Df RSS Df Sum of Sq F Pr(>F)

1 52 25.184

2 48 21.837 4 3.3477 1.8396 0.1366

'data.frame': 646 obs. of 15 variables:

$ abund : num 5 0.2 0.4 0 0 0 0 0 0 0 ...

$ Present : Factor w/ 2 levels "Absent","Present": 1 1 1 2 2 2 2 2 2 2 ...

$ clDD : num 5543 5750 5395 5920 6152 ...

$ clFD : num 83.5 67.5 79.5 66.7 57.6 59.2 59.5 51.5 47.4 46.3 ...

$ clTmi : num 9 9.6 8.6 11.9 11.6 10.8 10.8 11.6 13.6 13.5 ...

$ clTma : num 32.1 31.4 30.9 31.9 32.4 32.1 32.3 33 33.5 33.4 ...

$ clTmn : num 15.2 15.7 14.8 16.2 16.8 ...

$ clP : num 140 147 148 143 141 ...

$ NDVI : int -56 -44 -36 -49 -42 -49 -48 -50 -64 -58 ...

$ broadleaf: num 0.3866 0.9516 0.9905 0.0506 0.2296 ...

$ conif : num 0.0128 0.0484 0 0.9146 0.7013 ...

$ grass : num 0 0 0 0 0 0 0 0 0 0 ...

$ crop : num 0.2716 0 0 0.0285 0.044 ...

$ urban : num 0.2396 0 0 0 0.0157 ...

$ wetland : num 0 0 0 0 0 0 0 0 0 0 ...

Multiple linear regression in R

Run the multiple linear regression in R with abund as a function of clTma + NDVI + grass

summary(lm.mult)

Stepwise regression

Least squares - the most commonly used method and the default in R

Yi: observed (measured) value of Y at Xi; Ŷi: predicted value of Y at Xi; Ȳ: mean of all Yi

VE: Error variance (residuals); VR: Regression variance; VT: Total variance

R² = VR/VT
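This identity can be checked directly on any fitted lm (simulated data here):

```r
# R^2 = VR / VT: regression sum of squares over total sum of squares
set.seed(2)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
m  <- lm(y ~ x)
VT <- sum((y - mean(y))^2)           # total variation around the mean
VR <- sum((fitted(m) - mean(y))^2)   # variation captured by the regression
all.equal(VR / VT, summary(m)$r.squared)  # TRUE
```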

Shapiro-Wilk normality test

data: resid(anov1)

W = 0.98, p-value = 0.4982

Bartlett test of homogeneity of variances

data: logMaxAbund by Diet

Bartlett's K-squared = 7.4728, df = 4, p-value = 0.1129

Continuous explanatory variable

Homogeneous and normally distributed error

Independent residuals

No outlier

Heteroscedastic

Nonlinear

Looks for influential values

Leverage points: observations at extreme/outlying values of the explanatory variable. Because they lack neighbouring observations, the regression line passes close to leverage points.

They may or may not have a high influence on the regression.

High leverage points with high influence can be identified by a Cook's distance greater than 0.5
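cooks.distance() returns these values directly; a sketch with one deliberately influential point placed in simulated data:

```r
# Cook's distance flags observations that pull the regression line
set.seed(4)
x <- rnorm(30)
y <- x + rnorm(30)
x[30] <- 8; y[30] <- -8          # extreme x AND large residual: influential
m  <- lm(y ~ x)
cd <- cooks.distance(m)
which.max(cd)                    # observation 30 dominates
```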

No leverage

Low influence

High leverage

No influence

High leverage

High influence

Can you write down the equation of the regression line for your model lm2?

Are the parameters significant?

What proportion of variance is explained by model lm2 ?

R2-adj changed from 0.05 to 0.25 when we dropped aquatic birds:

Examine the relationship between log(MaxAbund) and log(Mass) for passerine birds

Save the model object as lm4

HINT: Passerine is also coded 0 and 1 (look at str(bird))

Compare the variance explained by lm2, lm3 and lm4

# Run the model

lm4 <- lm(logMaxAbund ~ logMass, data=bird, subset=bird$Passerine == 1)

# Examine the diagnostic plots

par(mfrow=c(2,2))

plot(lm4)

summary(lm4)

# Compare variance explained by lm2, lm3 and lm4

str(summary(lm4))

# Recall: we want adj.r.squared

summary(lm4)$adj.r.squared

# R2-adj = -0.02

summary(lm2)$adj.r.squared

# R2-adj = 0.05

summary(lm3)$adj.r.squared

# R2-adj = 0.25

The best model among the three models is lm3

(only terrestrial birds)

Response variable:

Continuous

Explanatory variable:

Categorical with

2 levels

H0:

The mean of group 1 is equal to the mean of group 2

H1:

The mean of group 1 is not equal to the mean of group 2

If the assumptions cannot be respected, the non-parametric equivalent of the t-test is the Mann-Whitney test

If the two groups are not independent (e.g. measurements on the same individuals in 2 different years), you should use a paired t-test

Violation of Assumptions

If variances between groups are not equal, you can use corrections like the Welch correction (the DEFAULT in R!)
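A sketch contrasting the two on simulated groups with unequal spread:

```r
# Welch (default) vs pooled-variance t-test: the Welch df are fractional
set.seed(6)
g1 <- rnorm(25, sd = 1)
g2 <- rnorm(25, sd = 3)
welch  <- t.test(g1, g2)                    # Welch correction by default
pooled <- t.test(g1, g2, var.equal = TRUE)  # classic df = n1 + n2 - 2 = 48
c(welch = unname(welch$parameter), pooled = unname(pooled$parameter))
```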

Are aquatic birds heavier than non-aquatic birds?

Response variable: Bird mass / num: continuous

Explanatory variable: Aquatic / 2 levels: 1 or 0 (yes or no)

The ratio of variances is not statistically different from 1, therefore variances are equal

We may now proceed with the t-test!

You can verify that t.test() and lm() provide the same model:

Analysis of Variance (ANOVA)

Subsets variation in the response variable into additive effects of one or several factors and the interactions between them

Y = μ + Main effects of factors + Interactions between factors + Residuals

The constant μ is the mean of the values of the response variable

Generalization of the t-test to >2 groups and/or ≥ 2 factor levels

Types of ANOVA

One-Way ANOVA

One factor with >2 levels

Two-Way ANOVA

2 factors or more

Each factor can have multiple levels

Need to test for interactions among factors

Note that ANOVA can be used for repeated measures, but we won't cover this today. Mixed-effect linear models can also be used for this kind of data (see Workshop 5!)

Repeated measures?

Just like the t-test, the robustness of ANOVA increases with sample size and if groups have equal sizes

For each factor and each interaction term:

H0: The mean of the response variable is equal between groups

H1: At least two groups have a different mean

Compares variation within and between groups in order to determine if group means differ

When an interaction term is significant: the effect of a given factor varies according to the levels of another factor

Review: ANOVA

Assumptions

Normality of residuals

Equality of variance between groups

Complementary test

When ANOVA detects a significant difference between groups, it does not tell you which group differs from which other group. A commonly used post-hoc test to answer this question is Tukey's test

Response variable:

MaxAbund

/ num: continuous

First, visualize the data using boxplot() (levels ordered alphabetically by default)

Is abundance a function of diet?

The ANOVA results can be graphically illustrated using the barplot() function

Challenge 2

Evaluate how log(MaxAbund) varies with Diet and Aquatic

Hint: Test the factors Diet, Aquatic and their interaction in a two-way ANOVA model

e.g. lm(Y ~ A*B), where A is your first factor, B your second, and "*" is the R operator for interaction

Analysis of Covariance (ANCOVA)

Subsets the observed variation:

Y = μ + Main effects of factors + Main effects of covariates + Interactions between factors + Interactions between covariates and factors + Residuals

Combination of ANOVA and linear regression

Explanatory variables are a mix of continuous variable(s) (covariates) and categorical variable(s) (factors)

Assumptions

Variables are

fixed *

Residuals are normally distributed

Independence between residuals and fitted values

Equal variance among factor levels

Same range for all covariates

No interaction between factors and covariate(s)

No outliers

If not respected, try transforming your data

* A fixed variable is one that you are specifically interested in (e.g. bird mass), whereas a random variable is noise that you want to control for (e.g. the site a bird was sampled in). See the Linear mixed-effect model (LMM) Workshop 5 next semester for more on this!

Types of ANCOVA

Frequently used ANCOVA

You can have any number of factors and/or covariates, but as their number increases, the interpretation of results gets more complex

1- One covariate and one factor

2- One covariate and two factors

3- Two covariates and one factor

We will only see the first case today, but the two others are pretty similar!

ANCOVA with one covariate and one factor

1) Effect of factor and covariate on the response variable

2) Effect of factor on the response variable after removing the effect of the covariate

3) Effect of covariate on response variable while controlling for the effect of factor

Different possible goals of this analysis:

You can only meet these goals if you have no interaction between your factor and covariate!

One level of the factor has a different slope

Many levels have different slopes

If the interaction is significant, outputs will look like these:

Is Max Abund a function of Diet and Mass?

Response variable: MaxAbund / num: quantitative continuous

Explanatory variables: Diet / factor with 5 levels; Mass / num: continuous

Multiple linear regression

Explanatory variables: 2 or more continuous variables

Response variable: 1 continuous variable

Generalization of simple linear regression

Assumptions

Each of the explanatory variables must follow a normal distribution

Linear relationship between each explanatory variable and response variable

Explanatory variables are orthogonal (NO collinearity)

Residuals follow a normal distribution

Independence of residuals versus predicted values

Independence of the variance of residuals versus predicted values

No outliers

*same as simple linear regression!

Possible solutions:

Collinearity:

However, the response variable (abund) is not linearly related to the explanatory variables.

* See the Advanced section on polynomial regression for the solution!

Run a model with every variable included except the "Present" (presence/absence) variable

Use step() to iteratively select the best model, i.e. the model with the lowest Akaike Information Criterion (AIC)
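A sketch of the procedure on simulated data (x3 is pure noise, so it is the variable step() tends to eliminate):

```r
# step() iteratively drops terms while the AIC keeps decreasing
set.seed(8)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + d$x2 + rnorm(n)  # y depends on x1 and x2 only
full <- lm(y ~ x1 + x2 + x3, data = d)
best <- step(full, trace = 0)   # trace = 0 hides the per-step AIC table
formula(best)                   # x3 is usually dropped
```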

lm.full <- lm(abund ~ . - Present, data=Dickcissel)

summary(lm.full)

lm.step <- step(lm.full)

summary(lm.step)

?step

Interpreting contrasts

DEFAULT contrast compares each level of a factor to baseline level

The intercept is the baseline group and corresponds to the mean of the first (alphabetically) Diet level (Insect)

tapply(bird$logMaxAbund, bird$Diet, mean)

coef(anov1)

coef(anov1)[1] + coef(anov1)[2] # InsectVert

coef(anov1)[1] + coef(anov1)[3] # Plant

What did you notice?

Interpreting contrasts

bird$Diet2 <- relevel(bird$Diet, ref="Plant")

anov2 <- lm(logMaxAbund ~ Diet2, data=bird)

summary(anov2)

anova(anov2)

We may want to relevel the baseline:

Did you see change in the significance of each Diet level?

Interpreting contrasts

Note, the DEFAULT contrast ("contr.treatment") is NOT orthogonal

To be orthogonal,

1) coefficients must sum to 0

2) any two contrast columns must sum to 0

sum(contrasts(bird$Diet)[,1])

sum(contrasts(bird$Diet)[,1]*contrasts(bird$Diet)[,2])

table(bird$Aquatic)

Unbalanced ANOVA

The birdsdiet data is actually unbalanced (the numbers of Aquatic and non-Aquatic birds are not equal)

As we noticed in the section on multiple linear regression, MaxAbund was non-linearly related to some variables.

To test for non-linear relationships, polynomial models of different degrees are compared.

Polynomial regression

A polynomial model looks like this:

y = β0 + β1x + β2x² (this polynomial has 3 terms)

Polynomial regression

For a polynomial with one variable (x), the degree is the largest exponent of that variable (e.g. a polynomial whose highest power is x⁴ has degree 4)

Polynomial regression

When you know a degree, you can also give it a name:

lm.linear <- lm(abund ~ clDD, data=Dickcissel)

lm.quad <- lm(abund ~ clDD + I(clDD^2), data=Dickcissel)

lm.cubic <- lm(abund ~ clDD + I(clDD^2) + I(clDD^3), data=Dickcissel)

Using the Dickcissel dataset, test the non-linear relationship between max abundance and temperature by comparing three nested polynomial models (of degrees 0, 1, and 3):

Polynomial regression

Polynomial regression

Compare the polynomial models and determine which nested model we should keep

Run a summary of this model, report the regression formula, p-values and R2-adj

Variance Partitioning

Some of the explanatory variables selected in the multiple linear regression section were highly correlated

Collinearity between explanatory variables can be assessed using the variance inflation factor - the vif() function of package 'HH'

Variables with VIF > 5 are considered collinear

vif(clDD ~ clFD + clTmi + clTma + clP + grass, data=Dickcissel)

Variance Partitioning

Use varpart() to partition the variance in max abundance with all land-cover variables in one set and all climate variables in the other (leaving out NDVI)

Note: collinear variables do not have to be removed prior to partitioning

library(vegan)

part.lm = varpart(Dickcissel$abund,

Dickcissel[ ,c("clDD","clFD","clTmi","clTma","clP")],

Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban","wetland")])

part.lm

showvarparts(2)

plot(part.lm, digits=2, bg=rgb(48,225,210,80,maxColorValue=255), col="turquoise4")

Variance Partitioning

Proportion of variance explained by climate alone is 28.5% (given by X1|X2), by land cover alone is ~0% (X2|X1), and by both combined is 2.4%

# Climate set

out.1 <- rda(Dickcissel$abund,
  Dickcissel[, c("clDD","clFD","clTmi","clTma","clP")],
  Dickcissel[, c("broadleaf","conif","grass","crop","urban","wetland")])

anova(out.1, step=1000, perm.max=1000)

# Land cover set

out.2 <- rda(Dickcissel$abund,
  Dickcissel[, c("broadleaf","conif","grass","crop","urban","wetland")],
  Dickcissel[, c("clDD","clFD","clTmi","clTma","clP")])

anova(out.2, step=1000, perm.max=1000)

Variance Partitioning

Add Intercept + coefficient estimates of each Diet level

1) Compare the Plant diet to all other diet levels

bird$Diet2 <- factor(bird$Diet, levels=names(med))

anov2 <- lm(logMaxAbund ~ Diet2, data=bird)

summary(anov2)

anova(anov2)

2) Reorder multiple levels according to median

options(contrasts=c("contr.helmert", "contr.poly"))

sum(contrasts(bird$Diet)[,1])

sum(contrasts(bird$Diet)[,1]*contrasts(bird$Diet)[,2])

anov3 <- lm(logMaxAbund ~ Diet, data=bird)

summary(anov3)

Change the contrast to make levels orthogonal (e.g. Helmert contrast will contrast the second level with the first, the third with the average of the first two, and so on)

In an unbalanced design, the order of the covariates changes the values of the Sums of Squares

unb.anov1 <- lm(logMaxAbund ~ Aquatic + Diet, data=bird)

unb.anov2 <- lm(logMaxAbund ~ Diet + Aquatic, data=bird)

anova(unb.anov1)

anova(unb.anov2)

library(car)

Anova(unb.anov1,type="III")

Anova(unb.anov2,type="III")

Now try a type III ANOVA using the Anova() function. What do you notice when using Anova()?

Test significance of each fraction

Conclusion: the land-cover fraction is non-significant once the climate data is accounted for

Running a regression in R

The function abline() adds the best-fit line

The function hist() produces a histogram of the variable

Assumptions not met - where did we go wrong?

Step 1:

Re-run the analysis with log-transformed variables

Hypothesis

Explanatory variable:

Diet

/ factor with 5 levels

Try it and compare the outputs!

Review: ANCOVA

H0: The mean of group 1 is greater than or equal to the mean of group 2

H1: The mean of group 1 is smaller than the mean of group 2

Hypothesis (for a unilateral t-test)

The model now accounts for 31.44% of the Dickcissel abundance variability


Running a one-way ANOVA in R

Compare outputs:

anov1 <- lm(logMaxAbund ~ Diet, data=bird)

anova(anov1)

aov1 <- aov(logMaxAbund ~ Diet, data=bird)

summary(aov1)

Df Sum Sq Mean Sq F value Pr(>F)

Diet 4 5.106 1.276 2.836 0.0341 *

Residuals 49 22.052 0.450

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1