Introducing

Prezi AI.

Your new presentation assistant.

Refine, enhance, and tailor your content, source relevant images, and edit visuals quicker than ever before.

QCBS R Workshop 4

CSBQ QCBS

Updated Oct. 19, 2016

Transcript

Learning objectives

Quebec Centre for Biodiversity Science

R Workshop Series

Type of explanatory variable

Continuous

Categorical

Continuous AND categorical

1 with 2

levels

Workshop 4:

Linear models

1 variable

GLM

4. ANCOVA

2. T test

ANCOVA with one covariate and one factor

Challenge 3

Challenge 3 - Solution

Running an ANCOVA in R

Types of ANCOVA

Review: ANCOVA

ANCOVA with one covariate and one factor

Analysis of Covariance (ANCOVA)

If the interaction is significant, outputs will look like these:

If your covariate and factor are significant, you will have a scenario that looks like this:

ancov1 <- lm(logMaxAbund ~ logMass*Diet, data=bird)

anova(ancov1)

Is Max Abund a function of Diet and Mass?

Remove the interaction term and re-evaluate the model (with only the main effects of Diet and logMass)

Combination of ANOVA and linear regression

Assumptions

If you want to compare the mean values of each factor, conditional on the effect of the other, the ajusted means should be visualized

The effect() function uses the output of the ANCOVA model to estimate the means of each factor level, corrected by the effect of the covariate

Different possible goals of this analysis:

If only your factor is significant, remove the covariate -> you will have a simple ANOVA

1- Run an ANCOVA to test the effect of Diet and Mass (as well as their interaction) on Maximum abundance

You can have any number of factors and/or covariates, but as their number increases, the interpretation of results gets more complex

Explanatory variables are a mix of continuous variable(s) (covariates) and categorial variable(s) (factors)

Response variable: MaxAbund / num : quantitative continuous

If not respected; try transforming your data

ancova.exemple <- lm(Y~X*Z, data=data)

If only your covariate is significant, remove your factor -> you will have a simple linear regression

ancov2 <- lm(logMaxAbund ~ logMass + Diet, data=bird)

1) Effect of factor and covariate on the response variable

2) Effect of factor on the response variable after removing effect

of covariate

3) Effect of covariate on response variable while controlling for the effect of factor

Frequently used ANCOVA

Subsets the observed variation:

Explanatory variables: Diet / factor with 5 levels

Mass / numeric continuous

summary(ancova.exemple)

Variables are fixed *
Residuals are normally distributed
Independence btw residuals and fitted values
Equal variance among factor levels
Same range for all covariates
No interaction between factors and covariate(s)
No outlier

anova(ancov2)

ancova.exemple <- lm(Y ~ X*Z, data=data)

# X = quantitative; Z = categorical

library(effects)

adj.means.ex <- effect('Z', ancova.exemple)

plot(adj.means.ex)

Analysis of Variance Table

Response: logMaxAbund

Df Sum Sq Mean Sq F value Pr(>F)

logMass 1 1.9736 1.97357 4.6054 0.03743 *

Diet 4 3.3477 0.83691 1.9530 0.11850

logMass:Diet 4 2.9811 0.74527 1.7391 0.15849

Residuals 44 18.8556 0.42854

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

If the interaction btw your covariate and factor (*) is significant, you should explore which levels of the factor have different slopes from the others

1- One covariate and one factor

2- One covariate and two factors

3- Two covariates and one factor

2- Test whether the interaction is signifiant

Y = μ +

+ Interactions between factors

Main Effects

of Factors

+ Main effects of covariates

+ Residuals

+ Interactions between covariates and factors

You can only meet these goals if you have no interaction between your factor and covariate!

Analysis of Variance Table

Model 1: logMaxAbund ~ logMass

Model 2: logMaxAbund ~ logMass + Diet

Res.Df RSS Df Sum of Sq F Pr(>F)

1 52 25.184

2 48 21.837 4 3.3477 1.8396 0.1366

ancova.exemple2 <- lm(Y~X+Z, data=data)

Verify assumptions!

$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...

$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...

$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...

$ Mass : num 716 5.3 35.8 119.4 315.5 ...

$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...

$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...

$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...

No interaction

We will only see the first case today, but the two others are pretty similar!

* A fixed variable is one that you are specifically interested in (bird mass) whereas a random variable is noise that you want to control for (i.e. site a bird was sampled in).

See Linear mixed-effect model (LMM) Workshop 5 next semester for more on this!

summary(ancova.exemple2)

1. Simple linear

regression

Very similar to what we have done so far!

Many levels have different slopes

One level of the factor has a different slope

Interaction btw logMass and Diet is non-significant

Running a t-test in R

Running a T test in R

Running a t-test in R

T-test

Group discussion

Violation of Assumptions

Running a T test in R

Running a T-test in R

Response variable: Continuous

Next, test the assumption of equality of variance using var.test() function

If p<0.01 (or 0.05), the null hypothesis of no difference between the two groups (H0) can be rejected, with a risk of 0.01 (or 0.05) that we made a mistake in this conclusion.

Are aquatic birds heavier than terrestrial birds?

Explanatory variable: Categorical with 2 levels

Use the function t.test()

First, lets visualize the data using the function boxplot()

Are aquatic birds heavier than non-aquatic birds?

Yes, aquatic birds are heavier than terrestrial birds

If variances between groups are not equal, can use corrections like the Welch correction (DEFAULT in R!)

var.test(logMass~Aquatic,data=bird)

uni.ttest1

ttest1 <- t.test(logMass~Aquatic, var.equal=TRUE, data=bird)

# Or use lm()

ttest.lm1 <- lm(logMass ~ Aquatic, data=bird)

Hypothesis (for a bilateral t-test)

t.test(Y~X2, data= data, alternative = "two.sided")

boxplot(logMass ~ Aquatic, data=bird,names=c("Non-Aquatic","Aquatic"))

H0: The mean of group 1 is equal to the mean of group 2
H1: The mean of group 1 is not equal to the mean of group 2

Response variable: Bird mass / num: Continuous

# Unilateral t-test

uni.ttest1 <- t.test(logMass~Aquatic, var.equal=TRUE, data=bird, alternative="less")

uni.ttest1

response variable

factor

(2 levels)

name of dataframe

Explanatory variable: Aquatic / 2 levels: 1 or 0 (yes or no)

There exists a difference in mass between the aquatic and terrestrial birds

Indicates that homogeneity of variance was respected (as we just tested)

If assumptions cannot be respected, the non-parametric equivalent of t-test is the Mann-Whitney test

Alternative hypothesis:

"two.sided" (default), "less", or "greater"

Hypothesis (for a unilateral t-test)

F test to compare two variances

data: logMass by Aquatic

F = 1.0725, num df = 38, denom df = 14, p-value = 0.9305

alternative hypothesis: true ratio of variances is not equal to 1

95 percent confidence interval:

0.3996428 2.3941032

sample estimates:

ratio of variances

1.072452

Two Sample t-test

data: logMass by Aquatic

t = -7.7707, df = 52, p-value = 1.468e-10

alternative hypothesis: true difference in means is less than 0

95 percent confidence interval:

-Inf -1.039331

sample estimates:

mean in group 0 mean in group 1

1.583437 2.908289

Two Sample t-test

data: logMass by Aquatic

t = -7.7707, df = 52, p-value = 2.936e-10

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-1.6669697 -0.9827343

sample estimates:

mean in group 0 mean in group 1

1.583437 2.908289

H0: The mean of group 1 is not larger than the mean of group 2
H1: The mean of group 1 is smaller than the mean of group 2

You can verify that t.test() and lm() provide the same model:

The t-test is still a linear model and a specific case of ANOVA with one factor with 2 levels

What did you conclude?

Assumptions

If two groups are not independent (e.g. measurements on the same individual at 2 different years), you should use a paired t-test

Thus the function lm() can also be used

ttest1$statistic^2

anova(ttest.lm1)$F

# answer: F=60.3845 in both cases

Data follow a normal distribution
Equality of variance between groups

lm.t <-lm(Y~X2, data = data)

The ratio of variances is not statistically different from 1, therefore variances are equal

mean of the two groups

When the assumption of equal variance is met t^2 follows an F distribution

* The robustness of this test increases with sample size and is higher when groups have equal sizes

anova(lm.t)

We may now proceed with the t-test!

Review: Simple linear regression

Running a regression in R

Linear relationship between response (Y) and explanatory (X) variable

Download the birdsdiet dataset:

Response variable: Bird abundance / num: continuous

Least squares - most used method and corresponds to the default function in R

Step 1: Run your linear model

Explanatory variable: Bird mass / num: continuous

Set working directory

Yi = α + βXi + εi

Step 2: Verify assumptions

In R, the function lm() is used to fit a linear model

setwd()

bird<-read.csv("birdsdiet.csv")

Assumptions are not met

Observed (measured) value of Yi at Xi

Can you transform your variables (and does it make sense to do so)?

Assumptions

$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...

$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...

$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...

$ Mass : num 716 5.3 35.8 119.4 315.5 ...

$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...

$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...

$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...

Visualize the data using the structure str() command:

Assumptions are met

lm1 <- lm(Y ~ X)

Explanatory variable

Predicted value of Yi at Xi

Yes

str(bird)

Continuous explanatory variable
Homogeneous and normally distributed error
Independent residuals
No outlier

Mean of all Yi

Response variable

New object containing the linear model we created

We first want to test if bird maximum abundance is a function of bird mass

Try GLM that might better suit your data

$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...

$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...

$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...

$ Mass : num 716 5.3 35.8 119.4 315.5 ...

$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...

$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...

$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...

Go back to Step 1 with transformed variables

Step 3: Estimate parameters (regression coefficients), test significance, plot your model

VE: Error variance (residuals)

VR: Regression variance

VT: Total variance

R2 = VR/VT

lm1 <- lm(bird$MaxAbund ~ bird$Mass)

Leverage vs influence

Assumptions not met - where did we go wrong?

Step 2: Verify assumptions of lm1

Assumptions not met - where did we go wrong?

Step 2: Verify the assumptions of model lm2

Step 2: Verify assumptions of model lm2

Transform the data

Diagnostic plot # 1 - Residuals vs Fitted

Diagnostic plot # 2 - Scale Location

Diagnostic plot # 3 - Normal QQ

Diagnostic plot # 4 - Residuals vs Leverage

Running a regression in R

Diagnostic plot # 1 - Residuals vs Fitted

High leverage point and high influence

plot(lm1)

Plot Y ~ X with corresponding best-fit regression line, and a histogram of Y and X to explore their distributions

Can also use the Shapiro-Wilk and the Skewness tests to see if variables follow a normal distribuon:

No influential values

Lets try normalizing data with a log10() transformation

No leverage
Low influence

plot(lm2)

Points are outside the 0.5 limit of Cook's distance
These points have too much influence on the regression

Compares the distribution (quantiles) of the residuals of the current model to those of a normal distribution

Should show a scatter across and no pattern

Step 2: Verify assumptions using diagnostic plots

Examples of non-independence

(what we don't want)

plot(logMaxAbund ~ logMass, data=bird)

abline(lm2)

hist(log10(bird$MaxAbund))

hist(log10(bird$Mass))

Add the log-transformed variables to our dataframe

Looks for influential values

Example of independence (what we want!)

shapiro.test(bird$MaxAbund)

shapiro.test(bird$Mass)

Much improved!

plot(bird$MaxAbund ~ bird$Mass)

abline(lm1)

hist(bird$MaxAbund)

hist(bird$Mass)

bird$logMaxAbund <- log10(bird$MaxAbund)

bird$logMass <- log10(bird$Mass)

If points lie linearly on the 1:1 line, data follow a normal distribution

Heteroscedastic

Nonlinear

In both cases, distributions are significantly different from normal

Should show a scatter across and no pattern

2 x 2 picture on one plot

opar <- par(mfrow=c(2,2))

plot(lm1)

The function abline() adds the best-fit line
The function hist() produces a histogram of the variable

Leverage points: observations at extreme/ outlying values of the explanatory variable. Because they lack neighboring observations, the regression model passes close to leverage points. They may OR may not have a high influence on the regression.

Shapiro-Wilk normality test

data: bird$MaxAbund

W = 0.5831, p-value = 3.872e-11

Shapiro-Wilk normality test

data: bird$Mass

W = 0.5467, p-value = 1.155e-11

High leverage
No influence

par() sets the graphical parameters, for example, the mfrow argument sets the number of rows in the frame
plot() is a generic function to plot graphics in R

** Note: you should never remove outliers if you don't have good reasons to do so (ex: error of measurement)

Step 1: Re-run the analysis with log-transformed variables

The positive skewness values also indicate that the distributions are left skewed

High leverage point and reasonable influence

lm2 <- lm(bird$logMaxAbund ~ bird$logMass)

High leverage points with high influence can be identified with a Cook's distance greater than 0.5

skewness(bird$MaxAbund)

skewness(bird$Mass)

The output will provide the four built-in diagnostic plots of the lm() function

Here, point 32 has high leverage but its influence is acceptable (inside the 0.5 Cook's distance limits)

High leverage
High influence

[1] 3.167159 # Skewness of MaxAbund

[1] 3.146062 # Skewness of Mass

Solution: Transform your data or try another distribution than linear (Gaussian) (i.e., a generalized linear model (GLM): Poisson, binomial, negative binomial, etc.)

No pattern in the residuals

Strong pattern in the residuals

Group discussion

Step 3 : Estimate parameters and test their significance

Challenge 1 - Solution

R2-adj changed from 0.05 to 0.25 when we dropped aquatic birds:

Challenge 1

Group discussion

The function summary() is used to obtain parameter estimates, significance, etc:

Can we improve the model if we only use terrestrial birds?

The best model among the three models is lm3 (only terrestrial birds)

summary(lm2)

**You could exclude objects using "=!"

Can you write down the equation of the regression line for your model lm2?

Are the parameters significant?

What proportion of variance is explained by model lm2 ?

Examine the relationship between log(MaxAbund) and log(Mass) for passerine birds
Save the model object as lm4

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.6724 0.2472 6.767 1.17e-08 ***

logMass -0.2361 0.1170 -2.019 0.0487 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6959 on 52 degrees of freedom

Multiple R-squared: 0.07267, Adjusted R-squared: 0.05484

F-statistic: 4.075 on 1 and 52 DF, p-value: 0.04869

HINT: Passerine is also coded 0 and 1 (look at str(bird))

# Run the model

lm4 <- lm(logMaxAbund ~ logMass, data=bird, subset=bird$Passerine == 1)

# Examine the diagnostic plots

par(mfrow=c(2,2))

plot(lm4)

summary(lm4)

# Compare variance explained by lm2, lm3 and lm4

str(summary(lm4)) # Recall: we want adj.r.squared

summary(lm4)$adj.r.squared # R2-adj = -0.02

summary(lm2)$adj.r.squared # R2-adj = 0.05

summary(lm3)$adj.r.squared # R2-adj = 0.25

lm3 <- lm(logMaxAbund~logMass, data=bird, subset=!bird$Aquatic) # removes aquatic birds (= TRUE)

# or equivalently

lm3 <- lm(logMaxAbund~logMass, data=bird, subset=bird$Aquatic == 0)

# Examine the diagnostic plots

par(mfrow=c(2,2))

plot(lm3)

summary(lm3)

# Compare both models

par(mfrow=c(1,2))

plot(logMaxAbund~logMass, data=bird)

plot(logMaxAbund~logMass, data=bird, subset=!bird$Aquatic)

We can also call out specific parameters of the model, for example:

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.6724 0.2472 6.767 1.17e-08 ***

logMass -0.2361 0.1170 -2.019 0.0487 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6959 on 52 degrees of freedom

Multiple R-squared: 0.07267, Adjusted R-squared: 0.05484

F-statistic: 4.075 on 1 and 52 DF, p-value: 0.04869

Compare the variance explained by lm2, lm3 and lm4

lm2$coef

str(summary(lm2))

summary(lm2)$coefficients

summary(lm2)$r.squared

Website: http://qcbs.ca/wiki/r/workshop4

1 with 3 levels or more

and/ or many factors

2 or more

GLM

3. ANOVA

Did we violate the model assumptions?

Running an ANOVA in R

Graphical representation

Running a one-way ANOVA in R

What if the assumptions were not met?

A posteriori test

Output of our ANOVA model

Running a one-way ANOVA in R

Types of ANOVA

Analysis of Variance (ANOVA)

Review: ANOVA

Running an ANOVA in R

Review: ANOVA

Bartlett test - equality of variance among groups

First, visualize the data using boxplot() (alphabetically by default)

Compare outputs:

The ANOVA results can be graphically illustrated using the barplot() function

Can use the function lm() to run an ANOVA

summary.lm(aov1)

summary(anov1)

bartlett.test(logMaxAbund ~ Diet, data=bird)

Generalization of the t-test to >2 groups and/or ≥ 2 factor levels

Transform the data - might normalize the residuals and homogenize the variances, and convert a multiplicative effect into an additive one

boxplot(logMaxAbund ~ Diet, data=bird)

One-Way ANOVA

Compares variation within and between groups in order to determine if group means differ

If ANOVA detects a significant effect, a post-hoc test is required to determine which treatment(s) differ from the others. This can be done with the TukeyHSD() function

Is abundance a function of diet?

Assumptions

anov1 <- lm(logMaxAbund ~ Diet, data=bird)

anova(anov1)

One factor with >2 levels

Response variable: MaxAbund / num: continuous

Factor levels in alphabetical order and all levels are compared to the reference level ("Insect")

Bartlett test of homogeneity of variances

data: logMaxAbund by Diet

Bartlett's K-squared = 7.4728, df = 4, p-value = 0.1129

data$logY <- log10(data$Y)

Normality of residuals
Equality of variance between groups

We can reorder factor levels according to group medians using the tapply() et sort() functions

Explanatory variable: Diet / factor with 5 levels

Hypothesis

TukeyHSD(aov(anov1),ordered=T)

sd <- tapply(bird$logMaxAbund,list(bird$Diet),sd)

means <- tapply(bird$logMaxAbund,list(bird$Diet),mean)

n <- length(bird$logMaxAbund)

se <- 1.96*sd/sqrt(n)

bp <- barplot(means)

segments(bp, means - se, bp, means + se, lwd=2)

segments(bp - 0.1, means - se, bp + 0.1, means - se, lwd=2)

segments(bp - 0.1, means + se, bp + 0.1, means + se, lwd=2)

aov1 <- aov(logMaxAbund ~ Diet, data=bird)

summary(aov1)

Only "Vertebrate" and "PlantInsect" differ

* see Workshop 1 for rules on data transformation

Two-Way ANOVA

Or use a specific function called aov()

med <- sort(tapply(bird$logMaxAbund, bird$Diet, median))

boxplot(logMaxAbund ~ factor(Diet, levels=names(med)))

For each factor and each interaction term:
H0: The mean of the response variable is equal between groups
H1: At least two groups have a different mean

* re-run your data with the transformed variables and verify the assumptions once again

Shapiro-Wilk test - Normality of residuals

Complementary test

2 factors or more
Each factor can have multiple levels
Need to test for interactions among factors

aov1 <- aov(logMaxAbund ~ Diet, data=bird)

summary(aov1)

shapiro.test(resid(anov1))

Significant difference btw Diet groups, but we don't know which ones!

+ Residuals

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.1539 0.1500 7.692 5.66e-10 ***

DietInsectVert -0.6434 0.4975 -1.293 0.2020

DietPlant 0.2410 0.4975 0.484 0.6303

DietPlantInsect 0.4223 0.2180 1.938 0.0585 .

DietVertebrate -0.3070 0.2450 -1.253 0.2161

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6709 on 49 degrees of freedom

Multiple R-squared: 0.188, Adjusted R-squared: 0.1217

F-statistic: 2.836 on 4 and 49 DF, p-value: 0.0341

When ANOVA detects a significant difference between groups, it does not tell you which group differs from which other group

Main Effects

of Factors

$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...

$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...

$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...

$ Mass : num 716 5.3 35.8 119.4 315.5 ...

$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...

$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...

$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...

When an interaction term is significant: the effect of a given factor varies according to the levels of another factor

+ Interactions Between Factors

Df Sum Sq Mean Sq F value Pr(>F)

Diet 4 5.106 1.276 2.836 0.0341 *

Residuals 49 22.052 0.450

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Shapiro-Wilk normality test

data: resid(anov1)

W = 0.98, p-value = 0.4982

Kruskal-Wallis Test - non-parametric equivalent to ANOVA if you cannot (or do not wish to) transform the data

$Diet

diff lwr upr p adj

Vertebrate-InsectVert 0.3364295 -1.11457613 1.787435 0.9645742

Insect-InsectVert 0.6434334 -0.76550517 2.052372 0.6965047

Plant-InsectVert 0.8844338 -1.01537856 2.784246 0.6812494

PlantInsect-InsectVert 1.0657336 -0.35030287 2.481770 0.2235587

Insect-Vertebrate 0.3070039 -0.38670951 1.000717 0.7204249

Plant-Vertebrate 0.5480043 -0.90300137 1.999010 0.8211024

PlantInsect-Vertebrate 0.7293041 0.02128588 1.437322 0.0405485

Plant-Insect 0.2410004 -1.16793813 1.649939 0.9884504

PlantInsect-Insect 0.4223003 -0.19493574 1.039536 0.3117612

PlantInsect-Plant 0.1812999 -1.23473664 1.597336 0.9961844

Repeated measures?

Try it and compare the outputs!

A commonly used post-hoc test to answer this question is the Tukey's test

Just like the t-test, the robustness of ANOVA increases with sample size and if groups have equal sizes

kruskal.test(Y~X, data)

Note that ANOVA can be used for repeated measures, but we won't cover this today. Mixed-effect linear models can also be used for this kind of data (see Workshop 5!)

The constant is the mean of the values of the response variable

Both tests are non-significant; variances assumed to be equal and residuals assumed to be normally distributed

Two-way ANOVA

Challenge 2

Challenge 2 - Solution

anova.with.lm <- lm(Y~X*Z, data)

anov2 <- lm(logMaxAbund ~ Diet*Aquatic, data=bird)

summary(anov2)

anova(anov2)

According to law of parsimony, select model that explain the most variance with the least model parameters possible

Thus, remove the interaction term if non-signficant and re-evaluate

The "*" symbol indicates that the main effects as well as their interaction will be included in the model

Evaluate how log(MaxAbund) varies with Diet and Aquatic

anova.with.lm <- lm(Y~X+Z, data)

Analysis of Variance Table

Response: logMaxAbund

Df Sum Sq Mean Sq F value Pr(>F)

Diet 4 5.1059 1.27647 3.0378 0.02669 *

Aquatic 1 0.3183 0.31834 0.7576 0.38870

Diet:Aquatic 3 2.8250 0.94167 2.2410 0.09644 .

Residuals 45 18.9087 0.42019

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Hint: Test the factors Diet, Aquatic and their interaction in a two-way ANOVA model

e.g. lm(Y ~ A*B)

where A is your first factor, B is your second and "*" is the command for interaction in R

Analysis of Variance Table

Response: Y

Df Sum Sq Mean Sq F value Pr(>F)

X 4 5.1059 1.27647 3.0378 0.02669 *

Z 1 0.3183 0.31834 0.7576 0.38870

X:Z 3 2.8250 0.94167 2.2410 0.10689

Residuals 45 18.9087 0.42019

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

If use the "+" symbol, the main effects but not their interaction are included

Example of a non-significant interaction

5. Multiple linear

regression

Verify assumptions!

Multiple linear regression in R

However, the response variable (abund) is not linearly related to the explanatory variables.

Find the best-fit model

Multiple linear regression

Multiple linear regression in R

Explanatory variable: 2 or more continuous variables
Response variable: 1 continuous variable

Collinearity:

Run the multiple linear regression in R with abund as a function of clTma + NDVI + grass

par(mfrow=c(1,3))

plot(abund ~ clTma, data=Dickcissel)

plot(abund ~ NDVI, data=Dickcissel)

plot(abund ~ grass, data=Dickcissel)

Recall the principle of parcimony: Most variance explained using the least number of terms possible

Using the Dickcissel datast, test the effect of climate (clTma), productivity (NDVI) and grassland cover (grass) on bird abundance (abund)

Verify collinearity of all explanatory variables

Generalization of simple linear regression

plot(data=Dickcissel)

Remove the weakest variable

lm.mult <- lm(abund ~ clTma + NDVI + grass, data=Dickcissel)

summary(lm.mult)

Yi = α + β1X1i + β2X2i + β3X3i + ...+ βkXki+ ε

summary(lm.mult)

Dickcissel <- read.csv("Dickcissel.csv")

str(Dickcissel)

Assumptions

Verify diagnostic plots, as you did for the simple linear regression

If you see a pattern between any two variables, they may be collinear!
They are likely to explain the same variability of the response variable and the effect of one variable will be masked by the other

Keep only one of the collinear variables
Try multivariate analyses (see Ordination Workshop)
Try a pseudo-orthogonal analysis

All 3 variables are significant

Model explains ~11% of the dickcissel variance (R2-adj = 0.11)

Possible solutions:

Call:

lm(formula = abund ~ clTma + NDVI + grass, data = Dickcissel)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -83.60813 11.57745 -7.222 1.46e-12 ***

clTma 3.27299 0.40677 8.046 4.14e-15 ***

NDVI 0.13716 0.05486 2.500 0.0127 *

grass 10.41435 4.68962 2.221 0.0267 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.58 on 642 degrees of freedom

Multiple R-squared: 0.117, Adjusted R-squared: 0.1128

F-statistic: 28.35 on 3 and 642 DF, p-value: < 2.2e-16

'data.frame': 646 obs. of 15 variables:

$ abund : num 5 0.2 0.4 0 0 0 0 0 0 0 ...

$ Present : Factor w/ 2 levels "Absent","Present": 1 1 1 2 2 2 2 2 2 2 ...

$ clDD : num 5543 5750 5395 5920 6152 ...

$ clFD : num 83.5 67.5 79.5 66.7 57.6 59.2 59.5 51.5 47.4 46.3 ...

$ clTmi : num 9 9.6 8.6 11.9 11.6 10.8 10.8 11.6 13.6 13.5 ...

$ clTma : num 32.1 31.4 30.9 31.9 32.4 32.1 32.3 33 33.5 33.4 ...

$ clTmn : num 15.2 15.7 14.8 16.2 16.8 ...

$ clP : num 140 147 148 143 141 ...

$ NDVI : int -56 -44 -36 -49 -42 -49 -48 -50 -64 -58 ...

$ broadleaf: num 0.3866 0.9516 0.9905 0.0506 0.2296 ...

$ conif : num 0.0128 0.0484 0 0.9146 0.7013 ...

$ grass : num 0 0 0 0 0 0 0 0 0 0 ...

$ crop : num 0.2716 0 0 0.0285 0.044 ...

$ urban : num 0.2396 0 0 0 0.0157 ...

$ wetland : num 0 0 0 0 0 0 0 0 0 0 ...

par(mfrow=c(2,2))

plot(lm.mult)

Each of the explanatory variables must follow a normal distribution
Linear relationship between each explanatory variable and response variable
Explanatory variables are orthogonal (NO collinearity)
Residuals follow a normal distribution
Independence of residuals versus predicted values
Independence of the variance of residuals versus predicted values
No outliers

*same as simple linear regression!

* See Advanced section on polynomial regression for solution!

Variables selected by the step() function

Stepwise regression

Run a model with everything in except the "Present/ absence" variable

Use step() to iteratively select the best model,
i.e. model with lowest Akaike's Information Criterion (AIC)

Call:

lm(formula = abund ~ clDD + clFD + clTmi + clTma + clP + grass,

data = Dickcissel)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -4.457e+02 3.464e+01 -12.868 < 2e-16 ***

clDD 5.534e-02 8.795e-03 6.292 5.83e-10 ***

clFD 1.090e+00 1.690e-01 6.452 2.19e-10 ***

clTmi -6.717e+00 7.192e-01 -9.339 < 2e-16 ***

clTma 3.172e+00 1.288e+00 2.463 0.014030 *

clP 1.562e-01 4.707e-02 3.318 0.000959 ***

grass 1.066e+01 4.280e+00 2.490 0.013027 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.85 on 639 degrees of freedom

Multiple R-squared: 0.3207, Adjusted R-squared: 0.3144

F-statistic: 50.29 on 6 and 639 DF, p-value: < 2.2e-16

lm.full <- lm(abund ~ . - Present, data=Dickcissel)

summary(lm.full)

lm.step <- step(lm.full)

summary(lm.step)

?step

The model now accounts for 31.44% of the Dickcissel abundance variability

Optional section

(... if time permits)

Optional

1. Interpreting contrasts

Interpreting contrasts

We may want to relevel the baseline:

DEFAULT contrast compares each level of a factor to baseline level
The intercept is the baseline group and corresponds to the mean of the first (alphabetically) Diet level (Insect)

1) Compare the Plant diet to all other diet levels

Note, the DEFAULT contrast ("contr.treatment") is NOT orthogonal
To be orthogonal,

1) coefficients must sum to 0

2) any two contrast columns must sum to 0

bird$Diet2 <- relevel(bird$Diet, ref="Plant")

anov2 <- lm(logMaxAbund ~ Diet2, data=bird)

summary(anov2)

anova(anov2)

sum(contrasts(bird$Diet)[,1])

sum(contrasts(bird$Diet)[,1]*contrasts(bird$Diet)[,2])

Add Intercept + coefficient estimates of each Diet level

2) Reorder multiple levels according to median

Change the contrast to make levels orthogonal (e.g. Helmert contrast will contrast the second level with the first, the third with the average of the first two, and so on)

tapply(bird$logMaxAbund, bird$Diet, mean)

coef(anov1)

coef(anov1)[1] + coef(anov1)[2] # InsectVert

coef(anov1)[1] + coef(anov1)[3] # Plant

bird$Diet2 <- factor(bird$Diet, levels=names(med))

anov2 <- lm(logMaxAbund ~ Diet2, data=bird)

summary(anov2)

anova(anov2)

options(contrasts=c("contr.helmert", "contr.poly"))

sum(contrasts(bird$Diet)[,1])

sum(contrasts(bird$Diet)[,1]*contrasts(bird$Diet)[,2])

anov3 <- lm(logMaxAbund ~ Diet, data=bird)

summary(anov3)

What did you notice?

Did you see change in the significance of each Diet level?

2. Unbalanced ANOVA

Unbalanced ANOVA

The birdsdiet data is actually unbalanced (number of Aquatic and non-Aquatic is not equal)

table(bird$Aquatic)

Which means the order of the covariates changes the values of Sums of Squares

unb.anov1 <- lm(logMaxAbund ~ Aquatic + Diet, data=bird)

unb.anov2 <- lm(logMaxAbund ~ Diet + Aquatic, data=bird)

anova(unb.anov1)

anova(unb.anov2)

Now try type III ANOVA using the Anova() function. What have you noticed when using Anova()?

library(car)

Anova(unb.anov1,type="III")

Anova(unb.anov2,type="III")

3. Polynomial regression

Polynomial regression

When you know a degree, you can also give it a name:

For a polynomial with one variable (x), the degree is the largest exponent of that variable

As we noticed in the section on multiple linear regression, MaxAbund was non-linearly related to some variables.

To test for non-linear relationships, polynomial models of different degrees are compared.

A polynomial model looks like this:

this makes the degree 4

terms

this polynomial has 3 terms

Polynomial regression

Using the Dickcissel dataset, test the non-linear relationship between max abundance and temperature by comparing three sets of nested polynomial models (of degrees 0, 1, and 3):

Compare the polynomial models and determine which nested model we should keep

Run a summary of this model, report the regression formula, p-values and R2-adj

lm.linear <- lm(abund ~ clDD, data=Dickcissel)

lm.quad <- lm(abund ~ clDD + I(clDD^2), data=Dickcissel)

lm.cubic <- lm(abund ~ clDD + I(clDD^2) + I(clDD^3), data=Dickcissel)

4. Variance partitioning

Variance Partitioning

Test significance of each fraction

Proportion of variance explained by climate alone is 28.5% (given by X1|X2), by land cover alone is ~0% (X2|X1), and by both combined is 2.4%

Use varpart() to partition the variance in max abundance with all land cover variables in one set and all climate variables in the other set (leaving out NDVI)
Note: Collinear variables do not have to be removed prior to partitioning

Some of the selected explanatory variables in the multiple linear regression section were highly correlated

Collinearity between explanatory variables can be assessed using the variance inflation factor - vif() function of package ‘HH’
Variable with VIF >> 5 are considered collinearity

# Climate set

out.1 = rda(Dickcissel$abund, Dickcissel[ ,c("clDD",

"clFD","clTmi","clTma","clP")], Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban","wetland")])

anova(out.1, step=1000, perm.max=1000)

# Land cover set

out.2 = rda(Dickcissel$abund,

Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban", "wetland")], Dickcissel[ ,c("clDD","clFD","clTmi", "clTma","clP")])

anova(out.2, step=1000, perm.max=1000)

library(vegan)

part.lm = varpart(Dickcissel$abund,

Dickcissel[ ,c("clDD","clFD","clTmi","clTma","clP")],

Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban","wetland")])

part.lm

showvarparts(2)

plot(part.lm,digits=2,bg=rgb(48,225,210,80,maxColorValue=225),col="turquoise4")

vif(clDD ~ clFD + clTmi + clTma + clP + grass, data=Dickcissel)

Conclusion: the land cover fraction is non-significant once climate data is accounted for

Y = μ +

+ Residuals

+ Interactions between factors

You will see in the assumptions that there should be NO interaction between covariates and factors

Subsets variation in the response variable into additive effects of one or several factors and the interactions between them

Y = u +

Choose a template

Constellations (AI Assisted)

Illuminate your ideas with our captivating Constellations Prezi AI-assisted presentation template, merging celestial elegance with professional design to elevate your content and guide your audience through a stellar visual experience.

Modular - Dark (AI Assisted)

Revolutionize your presentations with our Modular Prezi AI-assisted presentation template, a versatile and customizable solution that adapts to your unique content, providing a visually stunning and cohesive framework for professionals, educators, and creatives.

Data Analysis (AI Assisted)

Transform data into insights with our Data Analysis Prezi AI-assisted presentation template, strategically designed to visually convey complex information, enabling impactful presentations that drive informed decision-making.

See more templates →

Presentations from around the world

Demystifying Loan Repayment & Forgiveness

Jessica Le Duc

El reportaje

Melina Rosario

Creative Report

Gabriele Roncoroni

See staff picks →

Learn more about creating dynamic, engaging presentations with Prezi

Why Prezi is better