Introducing 

Prezi AI.

Your new presentation assistant.

Refine, enhance, and tailor your content, source relevant images, and edit visuals quicker than ever before.

Loading…
Transcript

Learning objectives

Quebec Centre for Biodiversity Science

R Workshop Series

Type of explanatory variable

Continuous

Categorical

Continuous AND categorical

1 with 2

levels

Workshop 4:

Linear models

1 variable

GLM

4. ANCOVA

2. T test

ANCOVA with one covariate and one factor

Challenge 3

Challenge 3 - Solution

Running an ANCOVA in R

Types of ANCOVA

Review: ANCOVA

ANCOVA with one covariate and one factor

Analysis of Covariance (ANCOVA)

If the interaction is significant, outputs will look like these:

If your covariate and factor are significant, you will have a scenario that looks like this:

ancov1 <- lm(logMaxAbund ~ logMass*Diet, data=bird)

anova(ancov1)

Is Max Abund a function of Diet and Mass?

  • Remove the interaction term and re-evaluate the model (with only the main effects of Diet and logMass)
  • Combination of ANOVA and linear regression

Assumptions

  • If you want to compare the mean values of each factor, conditional on the effect of the other, the ajusted means should be visualized

  • The effect() function uses the output of the ANCOVA model to estimate the means of each factor level, corrected by the effect of the covariate
  • Different possible goals of this analysis:
  • If only your factor is significant, remove the covariate -> you will have a simple ANOVA

1- Run an ANCOVA to test the effect of Diet and Mass (as well as their interaction) on Maximum abundance

  • You can have any number of factors and/or covariates, but as their number increases, the interpretation of results gets more complex
  • Explanatory variables are a mix of continuous variable(s) (covariates) and categorial variable(s) (factors)
  • Response variable: MaxAbund / num : quantitative continuous

If not respected; try transforming your data

ancova.exemple <- lm(Y~X*Z, data=data)

  • If only your covariate is significant, remove your factor -> you will have a simple linear regression

ancov2 <- lm(logMaxAbund ~ logMass + Diet, data=bird)

1) Effect of factor and covariate on the response variable

2) Effect of factor on the response variable after removing effect

of covariate

3) Effect of covariate on response variable while controlling for the effect of factor

Frequently used ANCOVA

  • Subsets the observed variation:
  • Explanatory variables: Diet / factor with 5 levels

Mass / numeric continuous

summary(ancova.exemple)

  • Variables are fixed *
  • Residuals are normally distributed
  • Independence btw residuals and fitted values
  • Equal variance among factor levels
  • Same range for all covariates
  • No interaction between factors and covariate(s)
  • No outlier

anova(ancov2)

ancova.exemple <- lm(Y ~ X*Z, data=data)

# X = quantitative; Z = categorical

library(effects)

adj.means.ex <- effect('Z', ancova.exemple)

plot(adj.means.ex)

Analysis of Variance Table

Response: logMaxAbund

Df Sum Sq Mean Sq F value Pr(>F)

logMass 1 1.9736 1.97357 4.6054 0.03743 *

Diet 4 3.3477 0.83691 1.9530 0.11850

logMass:Diet 4 2.9811 0.74527 1.7391 0.15849

Residuals 44 18.8556 0.42854

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  • If the interaction btw your covariate and factor (*) is significant, you should explore which levels of the factor have different slopes from the others

1- One covariate and one factor

2- One covariate and two factors

3- Two covariates and one factor

2- Test whether the interaction is signifiant

Y = μ +

+ Interactions between factors

Main Effects

of Factors

+ Main effects of covariates

+ Residuals

+ Interactions between covariates and factors

  • You can only meet these goals if you have no interaction between your factor and covariate!

Analysis of Variance Table

Model 1: logMaxAbund ~ logMass

Model 2: logMaxAbund ~ logMass + Diet

Res.Df RSS Df Sum of Sq F Pr(>F)

1 52 25.184

2 48 21.837 4 3.3477 1.8396 0.1366

ancova.exemple2 <- lm(Y~X+Z, data=data)

Verify assumptions!

$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...

$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...

$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...

$ Mass : num 716 5.3 35.8 119.4 315.5 ...

$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...

$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...

$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...

No interaction

  • We will only see the first case today, but the two others are pretty similar!

* A fixed variable is one that you are specifically interested in (bird mass) whereas a random variable is noise that you want to control for (i.e. site a bird was sampled in).

See Linear mixed-effect model (LMM) Workshop 5 next semester for more on this!

summary(ancova.exemple2)

1. Simple linear

regression

  • Very similar to what we have done so far!

Many levels have different slopes

One level of the factor has a different slope

Interaction btw logMass and Diet is non-significant

Running a t-test in R

Running a T test in R

Running a t-test in R

T-test

Group discussion

Violation of Assumptions

Running a T test in R

Running a T-test in R

  • Response variable: Continuous
  • Next, test the assumption of equality of variance using var.test() function
  • If p<0.01 (or 0.05), the null hypothesis of no difference between the two groups (H0) can be rejected, with a risk of 0.01 (or 0.05) that we made a mistake in this conclusion.
  • Are aquatic birds heavier than terrestrial birds?
  • Explanatory variable: Categorical with 2 levels
  • Use the function t.test()
  • First, lets visualize the data using the function boxplot()

Are aquatic birds heavier than non-aquatic birds?

Yes, aquatic birds are heavier than terrestrial birds

  • If variances between groups are not equal, can use corrections like the Welch correction (DEFAULT in R!)

var.test(logMass~Aquatic,data=bird)

uni.ttest1

ttest1 <- t.test(logMass~Aquatic, var.equal=TRUE, data=bird)

# Or use lm()

ttest.lm1 <- lm(logMass ~ Aquatic, data=bird)

Hypothesis (for a bilateral t-test)

t.test(Y~X2, data= data, alternative = "two.sided")

boxplot(logMass ~ Aquatic, data=bird,names=c("Non-Aquatic","Aquatic"))

  • H0: The mean of group 1 is equal to the mean of group 2
  • H1: The mean of group 1 is not equal to the mean of group 2
  • Response variable: Bird mass / num: Continuous

# Unilateral t-test

uni.ttest1 <- t.test(logMass~Aquatic, var.equal=TRUE, data=bird, alternative="less")

uni.ttest1

response variable

factor

(2 levels)

name of dataframe

  • Explanatory variable: Aquatic / 2 levels: 1 or 0 (yes or no)

There exists a difference in mass between the aquatic and terrestrial birds

Indicates that homogeneity of variance was respected (as we just tested)

  • If assumptions cannot be respected, the non-parametric equivalent of t-test is the Mann-Whitney test

Alternative hypothesis:

"two.sided" (default), "less", or "greater"

Hypothesis (for a unilateral t-test)

F test to compare two variances

data: logMass by Aquatic

F = 1.0725, num df = 38, denom df = 14, p-value = 0.9305

alternative hypothesis: true ratio of variances is not equal to 1

95 percent confidence interval:

0.3996428 2.3941032

sample estimates:

ratio of variances

1.072452

Two Sample t-test

data: logMass by Aquatic

t = -7.7707, df = 52, p-value = 1.468e-10

alternative hypothesis: true difference in means is less than 0

95 percent confidence interval:

-Inf -1.039331

sample estimates:

mean in group 0 mean in group 1

1.583437 2.908289

Two Sample t-test

data: logMass by Aquatic

t = -7.7707, df = 52, p-value = 2.936e-10

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-1.6669697 -0.9827343

sample estimates:

mean in group 0 mean in group 1

1.583437 2.908289

  • H0: The mean of group 1 is not larger than the mean of group 2
  • H1: The mean of group 1 is smaller than the mean of group 2
  • You can verify that t.test() and lm() provide the same model:
  • The t-test is still a linear model and a specific case of ANOVA with one factor with 2 levels
  • What did you conclude?

Assumptions

  • If two groups are not independent (e.g. measurements on the same individual at 2 different years), you should use a paired t-test
  • Thus the function lm() can also be used

ttest1$statistic^2

anova(ttest.lm1)$F

# answer: F=60.3845 in both cases

  • Data follow a normal distribution
  • Equality of variance between groups

lm.t <-lm(Y~X2, data = data)

  • The ratio of variances is not statistically different from 1, therefore variances are equal

mean of the two groups

When the assumption of equal variance is met t^2 follows an F distribution

* The robustness of this test increases with sample size and is higher when groups have equal sizes

anova(lm.t)

  • We may now proceed with the t-test!

Review: Simple linear regression

Running a regression in R

Linear relationship between response (Y) and explanatory (X) variable

Download the birdsdiet dataset:

  • Response variable: Bird abundance / num: continuous
  • Least squares - most used method and corresponds to the default function in R

Step 1: Run your linear model

  • Explanatory variable: Bird mass / num: continuous

Set working directory

Yi = α + βXi + εi

Step 2: Verify assumptions

In R, the function lm() is used to fit a linear model

setwd()

bird<-read.csv("birdsdiet.csv")

Assumptions are not met

Observed (measured) value of Yi at Xi

Can you transform your variables (and does it make sense to do so)?

Assumptions

$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...

$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...

$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...

$ Mass : num 716 5.3 35.8 119.4 315.5 ...

$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...

$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...

$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...

Visualize the data using the structure str() command:

Assumptions are met

lm1 <- lm(Y ~ X)

Explanatory variable

Predicted value of Yi at Xi

Yes

No

str(bird)

  • Continuous explanatory variable
  • Homogeneous and normally distributed error
  • Independent residuals
  • No outlier

Mean of all Yi

Response variable

New object containing the linear model we created

  • We first want to test if bird maximum abundance is a function of bird mass

Try GLM that might better suit your data

$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...

$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...

$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...

$ Mass : num 716 5.3 35.8 119.4 315.5 ...

$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...

$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...

$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...

Go back to Step 1 with transformed variables

Step 3: Estimate parameters (regression coefficients), test significance, plot your model

VE: Error variance (residuals)

VR: Regression variance

VT: Total variance

R2 = VR/VT

lm1 <- lm(bird$MaxAbund ~ bird$Mass)

Leverage vs influence

Assumptions not met - where did we go wrong?

Step 2: Verify assumptions of lm1

Assumptions not met - where did we go wrong?

Step 2: Verify the assumptions of model lm2

Step 2: Verify assumptions of model lm2

Transform the data

Diagnostic plot # 1 - Residuals vs Fitted

Diagnostic plot # 2 - Scale Location

Diagnostic plot # 3 - Normal QQ

Diagnostic plot # 4 - Residuals vs Leverage

Running a regression in R

Diagnostic plot # 1 - Residuals vs Fitted

High leverage point and high influence

plot(lm1)

  • Plot Y ~ X with corresponding best-fit regression line, and a histogram of Y and X to explore their distributions
  • Can also use the Shapiro-Wilk and the Skewness tests to see if variables follow a normal distribuon:

No influential values

  • Lets try normalizing data with a log10() transformation
  • No leverage
  • Low influence

plot(lm2)

  • Points are outside the 0.5 limit of Cook's distance
  • These points have too much influence on the regression
  • Compares the distribution (quantiles) of the residuals of the current model to those of a normal distribution
  • Should show a scatter across and no pattern

Step 2: Verify assumptions using diagnostic plots

Examples of non-independence

(what we don't want)

plot(logMaxAbund ~ logMass, data=bird)

abline(lm2)

hist(log10(bird$MaxAbund))

hist(log10(bird$Mass))

  • Add the log-transformed variables to our dataframe
  • Looks for influential values

Example of independence (what we want!)

shapiro.test(bird$MaxAbund)

shapiro.test(bird$Mass)

  • Much improved!

plot(bird$MaxAbund ~ bird$Mass)

abline(lm1)

hist(bird$MaxAbund)

hist(bird$Mass)

bird$logMaxAbund <- log10(bird$MaxAbund)

bird$logMass <- log10(bird$Mass)

  • If points lie linearly on the 1:1 line, data follow a normal distribution

Heteroscedastic

Nonlinear

In both cases, distributions are significantly different from normal

  • Should show a scatter across and no pattern

2 x 2 picture on one plot

opar <- par(mfrow=c(2,2))

plot(lm1)

  • The function abline() adds the best-fit line
  • The function hist() produces a histogram of the variable
  • Leverage points: observations at extreme/ outlying values of the explanatory variable. Because they lack neighboring observations, the regression model passes close to leverage points. They may OR may not have a high influence on the regression.

Shapiro-Wilk normality test

data: bird$MaxAbund

W = 0.5831, p-value = 3.872e-11

Shapiro-Wilk normality test

data: bird$Mass

W = 0.5467, p-value = 1.155e-11

  • High leverage
  • No influence
  • par() sets the graphical parameters, for example, the mfrow argument sets the number of rows in the frame
  • plot() is a generic function to plot graphics in R

** Note: you should never remove outliers if you don't have good reasons to do so (ex: error of measurement)

Step 1: Re-run the analysis with log-transformed variables

The positive skewness values also indicate that the distributions are left skewed

High leverage point and reasonable influence

lm2 <- lm(bird$logMaxAbund ~ bird$logMass)

  • High leverage points with high influence can be identified with a Cook's distance greater than 0.5

skewness(bird$MaxAbund)

skewness(bird$Mass)

The output will provide the four built-in diagnostic plots of the lm() function

  • Here, point 32 has high leverage but its influence is acceptable (inside the 0.5 Cook's distance limits)
  • High leverage
  • High influence

[1] 3.167159 # Skewness of MaxAbund

[1] 3.146062 # Skewness of Mass

  • Solution: Transform your data or try another distribution than linear (Gaussian) (i.e., a generalized linear model (GLM): Poisson, binomial, negative binomial, etc.)

No pattern in the residuals

Strong pattern in the residuals

Group discussion

Step 3 : Estimate parameters and test their significance

Challenge 1 - Solution

  • R2-adj changed from 0.05 to 0.25 when we dropped aquatic birds:

Challenge 1

Group discussion

  • The function summary() is used to obtain parameter estimates, significance, etc:
  • Can we improve the model if we only use terrestrial birds?
  • The best model among the three models is lm3 (only terrestrial birds)

summary(lm2)

**You could exclude objects using "=!"

  • Can you write down the equation of the regression line for your model lm2?

  • Are the parameters significant?

  • What proportion of variance is explained by model lm2 ?
  • Examine the relationship between log(MaxAbund) and log(Mass) for passerine birds
  • Save the model object as lm4

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.6724 0.2472 6.767 1.17e-08 ***

logMass -0.2361 0.1170 -2.019 0.0487 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6959 on 52 degrees of freedom

Multiple R-squared: 0.07267, Adjusted R-squared: 0.05484

F-statistic: 4.075 on 1 and 52 DF, p-value: 0.04869

HINT: Passerine is also coded 0 and 1 (look at str(bird))

# Run the model

lm4 <- lm(logMaxAbund ~ logMass, data=bird, subset=bird$Passerine == 1)

# Examine the diagnostic plots

par(mfrow=c(2,2))

plot(lm4)

summary(lm4)

# Compare variance explained by lm2, lm3 and lm4

str(summary(lm4)) # Recall: we want adj.r.squared

summary(lm4)$adj.r.squared # R2-adj = -0.02

summary(lm2)$adj.r.squared # R2-adj = 0.05

summary(lm3)$adj.r.squared # R2-adj = 0.25

lm3 <- lm(logMaxAbund~logMass, data=bird, subset=!bird$Aquatic) # removes aquatic birds (= TRUE)

# or equivalently

lm3 <- lm(logMaxAbund~logMass, data=bird, subset=bird$Aquatic == 0)

# Examine the diagnostic plots

par(mfrow=c(2,2))

plot(lm3)

summary(lm3)

# Compare both models

par(mfrow=c(1,2))

plot(logMaxAbund~logMass, data=bird)

plot(logMaxAbund~logMass, data=bird, subset=!bird$Aquatic)

  • We can also call out specific parameters of the model, for example:

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.6724 0.2472 6.767 1.17e-08 ***

logMass -0.2361 0.1170 -2.019 0.0487 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6959 on 52 degrees of freedom

Multiple R-squared: 0.07267, Adjusted R-squared: 0.05484

F-statistic: 4.075 on 1 and 52 DF, p-value: 0.04869

  • Compare the variance explained by lm2, lm3 and lm4

lm2$coef

str(summary(lm2))

summary(lm2)$coefficients

summary(lm2)$r.squared

Website: http://qcbs.ca/wiki/r/workshop4

1 with 3 levels or more

and/ or many factors

2 or more

GLM

3. ANOVA

Did we violate the model assumptions?

Running an ANOVA in R

Graphical representation

Running a one-way ANOVA in R

What if the assumptions were not met?

A posteriori test

Output of our ANOVA model

Running a one-way ANOVA in R

Types of ANOVA

Analysis of Variance (ANOVA)

Review: ANOVA

Running an ANOVA in R

Review: ANOVA

  • Bartlett test - equality of variance among groups
  • First, visualize the data using boxplot() (alphabetically by default)
  • Compare outputs:
  • The ANOVA results can be graphically illustrated using the barplot() function
  • Can use the function lm() to run an ANOVA

summary.lm(aov1)

summary(anov1)

bartlett.test(logMaxAbund ~ Diet, data=bird)

  • Generalization of the t-test to >2 groups and/or ≥ 2 factor levels
  • Transform the data - might normalize the residuals and homogenize the variances, and convert a multiplicative effect into an additive one

boxplot(logMaxAbund ~ Diet, data=bird)

One-Way ANOVA

Compares variation within and between groups in order to determine if group means differ

  • If ANOVA detects a significant effect, a post-hoc test is required to determine which treatment(s) differ from the others. This can be done with the TukeyHSD() function

Is abundance a function of diet?

Assumptions

anov1 <- lm(logMaxAbund ~ Diet, data=bird)

anova(anov1)

  • One factor with >2 levels
  • Response variable: MaxAbund / num: continuous
  • Factor levels in alphabetical order and all levels are compared to the reference level ("Insect")

Bartlett test of homogeneity of variances

data: logMaxAbund by Diet

Bartlett's K-squared = 7.4728, df = 4, p-value = 0.1129

data$logY <- log10(data$Y)

  • Normality of residuals
  • Equality of variance between groups
  • We can reorder factor levels according to group medians using the tapply() et sort() functions
  • Explanatory variable: Diet / factor with 5 levels

Hypothesis

TukeyHSD(aov(anov1),ordered=T)

sd <- tapply(bird$logMaxAbund,list(bird$Diet),sd)

means <- tapply(bird$logMaxAbund,list(bird$Diet),mean)

n <- length(bird$logMaxAbund)

se <- 1.96*sd/sqrt(n)

bp <- barplot(means)

segments(bp, means - se, bp, means + se, lwd=2)

segments(bp - 0.1, means - se, bp + 0.1, means - se, lwd=2)

segments(bp - 0.1, means + se, bp + 0.1, means + se, lwd=2)

aov1 <- aov(logMaxAbund ~ Diet, data=bird)

summary(aov1)

Only "Vertebrate" and "PlantInsect" differ

* see Workshop 1 for rules on data transformation

Two-Way ANOVA

  • Or use a specific function called aov()

med <- sort(tapply(bird$logMaxAbund, bird$Diet, median))

boxplot(logMaxAbund ~ factor(Diet, levels=names(med)))

  • For each factor and each interaction term:
  • H0: The mean of the response variable is equal between groups
  • H1: At least two groups have a different mean

* re-run your data with the transformed variables and verify the assumptions once again

  • Shapiro-Wilk test - Normality of residuals

Complementary test

  • 2 factors or more
  • Each factor can have multiple levels
  • Need to test for interactions among factors

aov1 <- aov(logMaxAbund ~ Diet, data=bird)

summary(aov1)

shapiro.test(resid(anov1))

Significant difference btw Diet groups, but we don't know which ones!

+ Residuals

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.1539 0.1500 7.692 5.66e-10 ***

DietInsectVert -0.6434 0.4975 -1.293 0.2020

DietPlant 0.2410 0.4975 0.484 0.6303

DietPlantInsect 0.4223 0.2180 1.938 0.0585 .

DietVertebrate -0.3070 0.2450 -1.253 0.2161

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6709 on 49 degrees of freedom

Multiple R-squared: 0.188, Adjusted R-squared: 0.1217

F-statistic: 2.836 on 4 and 49 DF, p-value: 0.0341

  • When ANOVA detects a significant difference between groups, it does not tell you which group differs from which other group

Main Effects

of Factors

$ Family : Factor w/ 53 levels "Anhingas","Auks& Puffins",..: 18 25 23 21 2 10 1 44 24 19 ...

$ MaxAbund : num 2.99 37.8 241.4 4.4 4.53 ...

$ AvgAbund : num 0.674 4.04 23.105 0.595 2.963 ...

$ Mass : num 716 5.3 35.8 119.4 315.5 ...

$ Diet : Factor w/ 5 levels "Insect","InsectVert",..: 5 1 4 5 2 4 5 1 1 5 ...

$ Passerine: int 0 1 1 0 0 0 0 0 0 0 ...

$ Aquatic : int 0 0 0 0 1 1 1 0 1 1 ...

  • When an interaction term is significant: the effect of a given factor varies according to the levels of another factor

+ Interactions Between Factors

Df Sum Sq Mean Sq F value Pr(>F)

Diet 4 5.106 1.276 2.836 0.0341 *

Residuals 49 22.052 0.450

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Shapiro-Wilk normality test

data: resid(anov1)

W = 0.98, p-value = 0.4982

  • Kruskal-Wallis Test - non-parametric equivalent to ANOVA if you cannot (or do not wish to) transform the data

$Diet

diff lwr upr p adj

Vertebrate-InsectVert 0.3364295 -1.11457613 1.787435 0.9645742

Insect-InsectVert 0.6434334 -0.76550517 2.052372 0.6965047

Plant-InsectVert 0.8844338 -1.01537856 2.784246 0.6812494

PlantInsect-InsectVert 1.0657336 -0.35030287 2.481770 0.2235587

Insect-Vertebrate 0.3070039 -0.38670951 1.000717 0.7204249

Plant-Vertebrate 0.5480043 -0.90300137 1.999010 0.8211024

PlantInsect-Vertebrate 0.7293041 0.02128588 1.437322 0.0405485

Plant-Insect 0.2410004 -1.16793813 1.649939 0.9884504

PlantInsect-Insect 0.4223003 -0.19493574 1.039536 0.3117612

PlantInsect-Plant 0.1812999 -1.23473664 1.597336 0.9961844

Repeated measures?

  • Try it and compare the outputs!
  • A commonly used post-hoc test to answer this question is the Tukey's test
  • Just like the t-test, the robustness of ANOVA increases with sample size and if groups have equal sizes

kruskal.test(Y~X, data)

  • Note that ANOVA can be used for repeated measures, but we won't cover this today. Mixed-effect linear models can also be used for this kind of data (see Workshop 5!)

The constant is the mean of the values of the response variable

  • Both tests are non-significant; variances assumed to be equal and residuals assumed to be normally distributed

Two-way ANOVA

Challenge 2

Challenge 2 - Solution

anova.with.lm <- lm(Y~X*Z, data)

anov2 <- lm(logMaxAbund ~ Diet*Aquatic, data=bird)

summary(anov2)

anova(anov2)

  • According to law of parsimony, select model that explain the most variance with the least model parameters possible

  • Thus, remove the interaction term if non-signficant and re-evaluate

The "*" symbol indicates that the main effects as well as their interaction will be included in the model

  • Evaluate how log(MaxAbund) varies with Diet and Aquatic

anova.with.lm <- lm(Y~X+Z, data)

Analysis of Variance Table

Response: logMaxAbund

Df Sum Sq Mean Sq F value Pr(>F)

Diet 4 5.1059 1.27647 3.0378 0.02669 *

Aquatic 1 0.3183 0.31834 0.7576 0.38870

Diet:Aquatic 3 2.8250 0.94167 2.2410 0.09644 .

Residuals 45 18.9087 0.42019

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Hint: Test the factors Diet, Aquatic and their interaction in a two-way ANOVA model

e.g. lm(Y ~ A*B)

where A is your first factor, B is your second and "*" is the command for interaction in R

Analysis of Variance Table

Response: Y

Df Sum Sq Mean Sq F value Pr(>F)

X 4 5.1059 1.27647 3.0378 0.02669 *

Z 1 0.3183 0.31834 0.7576 0.38870

X:Z 3 2.8250 0.94167 2.2410 0.10689

Residuals 45 18.9087 0.42019

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

If use the "+" symbol, the main effects but not their interaction are included

Example of a non-significant interaction

5. Multiple linear

regression

Verify assumptions!

Multiple linear regression in R

  • However, the response variable (abund) is not linearly related to the explanatory variables.

Find the best-fit model

Multiple linear regression

Multiple linear regression in R

  • Explanatory variable: 2 or more continuous variables
  • Response variable: 1 continuous variable

Collinearity:

  • Run the multiple linear regression in R with abund as a function of clTma + NDVI + grass

par(mfrow=c(1,3))

plot(abund ~ clTma, data=Dickcissel)

plot(abund ~ NDVI, data=Dickcissel)

plot(abund ~ grass, data=Dickcissel)

  • Recall the principle of parcimony: Most variance explained using the least number of terms possible

  • Using the Dickcissel datast, test the effect of climate (clTma), productivity (NDVI) and grassland cover (grass) on bird abundance (abund)
  • Verify collinearity of all explanatory variables

  • Generalization of simple linear regression

plot(data=Dickcissel)

  • Remove the weakest variable

lm.mult <- lm(abund ~ clTma + NDVI + grass, data=Dickcissel)

summary(lm.mult)

Yi = α + β1X1i + β2X2i + β3X3i + ...+ βkXki+ ε

summary(lm.mult)

Dickcissel <- read.csv("Dickcissel.csv")

str(Dickcissel)

Assumptions

  • Verify diagnostic plots, as you did for the simple linear regression

  • If you see a pattern between any two variables, they may be collinear!
  • They are likely to explain the same variability of the response variable and the effect of one variable will be masked by the other

  • Keep only one of the collinear variables
  • Try multivariate analyses (see Ordination Workshop)
  • Try a pseudo-orthogonal analysis

All 3 variables are significant

Model explains ~11% of the dickcissel variance (R2-adj = 0.11)

Possible solutions:

Call:

lm(formula = abund ~ clTma + NDVI + grass, data = Dickcissel)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -83.60813 11.57745 -7.222 1.46e-12 ***

clTma 3.27299 0.40677 8.046 4.14e-15 ***

NDVI 0.13716 0.05486 2.500 0.0127 *

grass 10.41435 4.68962 2.221 0.0267 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.58 on 642 degrees of freedom

Multiple R-squared: 0.117, Adjusted R-squared: 0.1128

F-statistic: 28.35 on 3 and 642 DF, p-value: < 2.2e-16

'data.frame': 646 obs. of 15 variables:

$ abund : num 5 0.2 0.4 0 0 0 0 0 0 0 ...

$ Present : Factor w/ 2 levels "Absent","Present": 1 1 1 2 2 2 2 2 2 2 ...

$ clDD : num 5543 5750 5395 5920 6152 ...

$ clFD : num 83.5 67.5 79.5 66.7 57.6 59.2 59.5 51.5 47.4 46.3 ...

$ clTmi : num 9 9.6 8.6 11.9 11.6 10.8 10.8 11.6 13.6 13.5 ...

$ clTma : num 32.1 31.4 30.9 31.9 32.4 32.1 32.3 33 33.5 33.4 ...

$ clTmn : num 15.2 15.7 14.8 16.2 16.8 ...

$ clP : num 140 147 148 143 141 ...

$ NDVI : int -56 -44 -36 -49 -42 -49 -48 -50 -64 -58 ...

$ broadleaf: num 0.3866 0.9516 0.9905 0.0506 0.2296 ...

$ conif : num 0.0128 0.0484 0 0.9146 0.7013 ...

$ grass : num 0 0 0 0 0 0 0 0 0 0 ...

$ crop : num 0.2716 0 0 0.0285 0.044 ...

$ urban : num 0.2396 0 0 0 0.0157 ...

$ wetland : num 0 0 0 0 0 0 0 0 0 0 ...

par(mfrow=c(2,2))

plot(lm.mult)

  • Each of the explanatory variables must follow a normal distribution
  • Linear relationship between each explanatory variable and response variable
  • Explanatory variables are orthogonal (NO collinearity)
  • Residuals follow a normal distribution
  • Independence of residuals versus predicted values
  • Independence of the variance of residuals versus predicted values
  • No outliers

*same as simple linear regression!

* See Advanced section on polynomial regression for solution!

Variables selected by the step() function

Stepwise regression

  • Run a model with everything in except the "Present/ absence" variable

  • Use step() to iteratively select the best model,
  • i.e. model with lowest Akaike's Information Criterion (AIC)

Call:

lm(formula = abund ~ clDD + clFD + clTmi + clTma + clP + grass,

data = Dickcissel)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -4.457e+02 3.464e+01 -12.868 < 2e-16 ***

clDD 5.534e-02 8.795e-03 6.292 5.83e-10 ***

clFD 1.090e+00 1.690e-01 6.452 2.19e-10 ***

clTmi -6.717e+00 7.192e-01 -9.339 < 2e-16 ***

clTma 3.172e+00 1.288e+00 2.463 0.014030 *

clP 1.562e-01 4.707e-02 3.318 0.000959 ***

grass 1.066e+01 4.280e+00 2.490 0.013027 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.85 on 639 degrees of freedom

Multiple R-squared: 0.3207, Adjusted R-squared: 0.3144

F-statistic: 50.29 on 6 and 639 DF, p-value: < 2.2e-16

lm.full <- lm(abund ~ . - Present, data=Dickcissel)

summary(lm.full)

lm.step <- step(lm.full)

summary(lm.step)

?step

The model now accounts for 31.44% of the Dickcissel abundance variability

Optional section

(... if time permits)

Optional

1. Interpreting contrasts

Interpreting contrasts

  • We may want to relevel the baseline:
  • DEFAULT contrast compares each level of a factor to baseline level
  • The intercept is the baseline group and corresponds to the mean of the first (alphabetically) Diet level (Insect)

1) Compare the Plant diet to all other diet levels

  • Note, the DEFAULT contrast ("contr.treatment") is NOT orthogonal
  • To be orthogonal,

1) coefficients must sum to 0

2) any two contrast columns must sum to 0

bird$Diet2 <- relevel(bird$Diet, ref="Plant")

anov2 <- lm(logMaxAbund ~ Diet2, data=bird)

summary(anov2)

anova(anov2)

sum(contrasts(bird$Diet)[,1])

sum(contrasts(bird$Diet)[,1]*contrasts(bird$Diet)[,2])

  • Add Intercept + coefficient estimates of each Diet level

2) Reorder multiple levels according to median

  • Change the contrast to make levels orthogonal (e.g. Helmert contrast will contrast the second level with the first, the third with the average of the first two, and so on)

tapply(bird$logMaxAbund, bird$Diet, mean)

coef(anov1)

coef(anov1)[1] + coef(anov1)[2] # InsectVert

coef(anov1)[1] + coef(anov1)[3] # Plant

bird$Diet2 <- factor(bird$Diet, levels=names(med))

anov2 <- lm(logMaxAbund ~ Diet2, data=bird)

summary(anov2)

anova(anov2)

options(contrasts=c("contr.helmert", "contr.poly"))

sum(contrasts(bird$Diet)[,1])

sum(contrasts(bird$Diet)[,1]*contrasts(bird$Diet)[,2])

anov3 <- lm(logMaxAbund ~ Diet, data=bird)

summary(anov3)

  • What did you notice?
  • Did you see change in the significance of each Diet level?

2. Unbalanced ANOVA

Unbalanced ANOVA

  • The birdsdiet data is actually unbalanced (number of Aquatic and non-Aquatic is not equal)

table(bird$Aquatic)

  • Which means the order of the covariates changes the values of Sums of Squares

unb.anov1 <- lm(logMaxAbund ~ Aquatic + Diet, data=bird)

unb.anov2 <- lm(logMaxAbund ~ Diet + Aquatic, data=bird)

anova(unb.anov1)

anova(unb.anov2)

  • Now try type III ANOVA using the Anova() function. What have you noticed when using Anova()?

library(car)

Anova(unb.anov1,type="III")

Anova(unb.anov2,type="III")

3. Polynomial regression

Polynomial regression

  • When you know a degree, you can also give it a name:
  • For a polynomial with one variable (x), the degree is the largest exponent of that variable
  • As we noticed in the section on multiple linear regression, MaxAbund was non-linearly related to some variables.

  • To test for non-linear relationships, polynomial models of different degrees are compared.
  • A polynomial model looks like this:

this makes the degree 4

terms

this polynomial has 3 terms

Polynomial regression

  • Using the Dickcissel dataset, test the non-linear relationship between max abundance and temperature by comparing three sets of nested polynomial models (of degrees 0, 1, and 3):
  • Compare the polynomial models and determine which nested model we should keep

  • Run a summary of this model, report the regression formula, p-values and R2-adj

lm.linear <- lm(abund ~ clDD, data=Dickcissel)

lm.quad <- lm(abund ~ clDD + I(clDD^2), data=Dickcissel)

lm.cubic <- lm(abund ~ clDD + I(clDD^2) + I(clDD^3), data=Dickcissel)

4. Variance partitioning

Variance Partitioning

Variance Partitioning

  • Test significance of each fraction
  • Proportion of variance explained by climate alone is 28.5% (given by X1|X2), by land cover alone is ~0% (X2|X1), and by both combined is 2.4%
  • Use varpart() to partition the variance in max abundance with all land cover variables in one set and all climate variables in the other set (leaving out NDVI)
  • Note: Collinear variables do not have to be removed prior to partitioning
  • Some of the selected explanatory variables in the multiple linear regression section were highly correlated

  • Collinearity between explanatory variables can be assessed using the variance inflation factor - vif() function of package ‘HH’
  • Variable with VIF >> 5 are considered collinearity

# Climate set

out.1 = rda(Dickcissel$abund, Dickcissel[ ,c("clDD",

"clFD","clTmi","clTma","clP")], Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban","wetland")])

anova(out.1, step=1000, perm.max=1000)

# Land cover set

out.2 = rda(Dickcissel$abund,

Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban", "wetland")], Dickcissel[ ,c("clDD","clFD","clTmi", "clTma","clP")])

anova(out.2, step=1000, perm.max=1000)

library(vegan)

part.lm = varpart(Dickcissel$abund,

Dickcissel[ ,c("clDD","clFD","clTmi","clTma","clP")],

Dickcissel[ ,c("broadleaf","conif","grass","crop", "urban","wetland")])

part.lm

showvarparts(2)

plot(part.lm,digits=2,bg=rgb(48,225,210,80,maxColorValue=225),col="turquoise4")

vif(clDD ~ clFD + clTmi + clTma + clP + grass, data=Dickcissel)

  • Conclusion: the land cover fraction is non-significant once climate data is accounted for

Y = μ +

+ Residuals

+ Interactions between factors

You will see in the assumptions that there should be NO interaction between covariates and factors

  • Subsets variation in the response variable into additive effects of one or several factors and the interactions between them

Y = u +

Learn more about creating dynamic, engaging presentations with Prezi