Introducing 

Prezi AI.

Your new presentation assistant.

Refine, enhance, and tailor your content, source relevant images, and edit visuals quicker than ever before.

Loading…
Transcript

0. Intro

1. Exploring and Preparing Data

2. Canonical analyses

2.2. Partial redundancy analysis

2.3. Variation partitioning

3. Multivariate regression tree

4. Linear discriminant analysis

Overview

Follow along

Required packages

Intro

> install.packages("vegan")

> install.packages("labdsv")

> install.packages("plyr")

> install.packages("MASS")

> install.packages("mvpart")

> install.packages("MVPARTwrap")

> install.packages("rdaTest")

So you know how to ordinate which is great for data exploration

But what if we want to test hypotheses or make predictions?

  • This workshop is an extension of Workshop 9 but now with more:
  • Environmental data!
  • Fun!
  • Matrices!
  • Hypothesis testing!

Today we will be going over constrained ordination methods that allow us to integrate environmental variables to explain and predict variation

Data and R code available at:

qcbs.ca/wiki/r/r_workshop10

Navigate to the wiki page and download:

  • Rscript
  • DoubsEnv.csv
  • DoubsSpe.csv
  • Test data
  • mvpart package
  • MVPARTwrap package
  • rdaTest package

Recommendations:

  • create your own new script
  • refer to provided code only if needed
  • avoid copy pasting or running the code directly from script
  • remember to change the working directory to where your files are stored

After installing then load packages

> install.packages("packagename")

Doubs River Fish Dataset

Data Transformation

Doubs River Fish Dataset

Explore Doubs Dataset

Species Frequencies

decostand()

Doubs Environmental Data

Species Frequencies

Explore the content of the environmental data:

> ?decostand

Load the Doubs River species data (DoubsSpe.csv):

Take a look at the distribution of species frequencies:

How many zeros?

Explore the content of the fish community dataset:

> sum(spe==0)

> spe <- read.csv(file.choose(), row.names=1)

> spe <- spe[-8,] #remove site with no data

As with many data sets we have an abundance of zeros

However we don't want common absences to artificially increase similarity between sites

So we can apply a transformation, here we will use the Hellinger transformation using the decostand() function in the vegan package

> (ab=table(unlist(spe)))

> barplot(ab, las=1, xlab=”Abundance class”,

ylab=”Frequency”, col=grey(5:0/5))

Standardizing environmental variables is crucial as you cannot compare the effects of variables with different units:

> names(env) #names of objects

> dim(env) #dimensions

> str(env) #structure of objects

> head(env) #first 5 rows

> summary(env) #summary statistics

408!

MUST BE STANDARDIZED !

> env.z <- decostand(env, method="standardize")

Load the Doubs River environmental data (DoubsEnv.csv):

> names(spe) #names of objects

> dim(spe) #dimensions

> str(spe) #structure of objects

> head(spe) #first 5 rows

> summary(spe) #summary statistics

What proportion of zeros?

Explore collinearity by visualizing correlations between variables:

> env <- read.csv(file.choose(), row.names=1)

> env <- env[-8,] #remove site with no data

> spe.hel <- decostand(spe, method="hellinger")

Verneaux (1973) dataset:

  • characterization of fish communities
  • 27 different species
  • 30 different sites
  • 11 environmental variables

This centers and scales the variables to make your downstream analysis more appropriate:

> sum(spe==0)/(nrow(spe)*ncol(spe))

> pairs(env, main="Bivariate Plots of the Environmental Data" )

# PROCEED WITH CAUTION ONLY EXECUTE ONCE

>50% !

> apply(env.z, 2, mean)

> apply(env.z, 2, sd)

2.1. Redundancy analysis

Canonical Analyses

What does it all mean?

RDA plots

RDA in R

Redundancy analysis (RDA)

Plot aesthetics

Significance testing

Selecting variables

Scaling type 1

RDA output

Scaling type 2

Use anova.cca to test the significance of axes

You can make quick RDA biplots with the base plot() function but sometimes that gives you a plot that crowds your points or makes objects too small. You can change the aesthetics of the points, text and arrows.

Use the ordiR2step() function to select significant explanatory variables

In type 1 scaling the sites (here in blue) that are more similar to one another are closer to each other

  • Prepare the data
  • Run the RDA
  • Extract the results

> anova.cca(spe.rda.signif, step=1000)

> anova.cca(spe.rda.signif, step=1000)

Remember that there are two scaling options:

  • Type=1: distance among objects approximates their similarities
  • Type=2: angles between variables reflect their correlation

>points(scores(spe.rda.signif, display="sites", choices=c(1,2), scaling=1),

pch=21, col="black", bg="steelblue", cex=1.2)

  • Canonical analysis is a generic term for analyses that
  • identify relationship between species composition and environmental variables
  • are used to test hypotheses
  • make predictions

> ordiR2step(rda(spe.hel~1, data=env.z),scope=formula(spe.rda),direction="forward,R2scope=TRUE,pstep=1000)

> env.signif<-subset(env.z, select=c("alt","oxy","dbo"))

  • Direct extension of multiple regression
  • Models effect of explanatory matrix on a response matrix
  • Variables can be quantitative, qualitative or binary (1 or 0)
  • As with other ordinations the data should be transformed or standardized is units are disparate

> env.z<-subset(env.z, select=-das)

> spe.rda<-rda(spe.hel~.,data=env.z)

> summary(spe.rda, display=NULL)

Quebec Centre for Biodiversity Science

R Workshop Series

>arrows(0,0,scores(spe.rda.signif, display="species", choices=c(1), scaling=1),scores(spe.rda.signif, display="species", choices=c(2), scaling=1), col="black",length=0)

One of the most powerful aspects of RDA is the simultaneous visualization of your environmental species and site data

Now plot your RDA

In type 2 scaling the arrows and their angles are really informative. Variables that have longer arrows are stronger drivers of the variation. Variables that are opposite one another (oxy-dbo) have a negative relationship, same direction (oxy-alt) have a positive relationship

How many variables are retained by the forward selection? What is the R2 of the RDA with the retained variables?

> R2adj<-RsquaredAdj(spe.rda.signif)$adj.r.squared

Constrained proportion is the variance of Y explained by X (73.41%) and the unexplained variance of Y (26.59%)

Challenge 1 (Solution)

Challenge 1

Now run an RDA on mite species abundances using the environmental data as constraining variables

Test partial RDA significance

Plot results

Partial RDA

Why is it interesting?

Solution #2

Challenge #2

How does it work?

Partial RDA

Partial RDA on Doubs data

  • Compute RDA:

> mite.spe.subs <- rda(mite.spe.hel ~ Shrub + Topo

+ Condition(SubsDens + WaterCont + Substrate), data=mite.env)

  • Assess the effect of water chemistry on fish species abundances partialling out the effect of physiography

#Run the partial RDA

> spechem.physio<-rda(spe.hel, envchem, envtopo)

> summary(spechem.physio, display=NULL)

#or

> spechem.physio2<-rda(spe.hel ~ pH + dur + pho + nit + amm

+ oxy + dbo + Condition(alt + pen + deb), data=env)

> summary(spechem.physio2, display=NULL)

Run a partial RDA on the mite data:

  • control for the substrate variables SubsDens, Watercont and Substrate

  • A special case of the RDA
  • Involves an additional set of explanatory variables
  • The linear effects of X variables on the Y variables are adjusted for the effects of covariables W

#R2adj

> R2adj<- RsquareAdj(spechem.physio)$adj.r.squared

> R2adj

#Significance of axes

> anova.cca(spechem.physio, step=1000)

> anova.cca(spechem.physio, step=1000, by="axis")

In a nutshell...

  • RDA of Y on W
  • The residuals are extracted (new matrix Yres|W)
  • containing Y response variables without the effect of W
  • RDA of Yres|W on X

> plot(spechem.physio, scaling=2, main="Triplot partial RDA - scaling 2",

type="none", xlab=c("RDA1"), ylab=c("RDA2"), xlim=c(-1,1), ylim=c(-1,1))

> points(scores(spechem.physio, display="sites", choices=c(1,2),

scaling=2),pch=21, col="black", bg="steelblue", cex=1.2)

> text(scores(spechem.physio, display="species", choices=c(1),scaling=2),

scores(spechem.physio, display="species", choices=c(2), scaling=2),

labels=rownames(scores(spechem.physio, display="species", scaling=2)),

col="grey", cex=0.8)

> arrows(0,0,scores(spechem.physio, display="bp", choices=c(1),scaling=2),

scores(spechem.physio, display="bp", choices=c(2), scaling=2),col="red")

> text(scores(spechem.physio, display="bp", choices=c(1), scaling=2)+0.05,

scores(spechem.physio, display="bp", choices=c(2), scaling=2)+0.05,

labels=rownames(scores(spechem.physio, display="bp", choices=c(2),

scaling=2)),col="red", cex=1)

  • Assess the effects of environmental variables on species composition while taking into account the variation due to variables that are not the focus of the study
  • Control for well-known linear effects
  • Isolate the effect of one or more descriptor(s)
  • Analyse related samples

Workshop 10: Advanced multivariate analyses

> summary(mite.spe.subs, display="NULL")

> (R2adj <- RsquareAdj(mite.spe.subs)$adj.r.squared)

> anova.cca(mite.spe.subs, step=1000)

> anova.cca(mite.spe.subs, step=1000, by="axis")

  • Is the model significant
  • Which are the significant axes
  • What is the variance explained by the substrate variables?

> envtopo<-env.z[,c(1:3)] # physiography

> envchem<-env.z[,c(4:10)] # water quality

RDA is significant, but only the 1st canonical axis is significant. Shrub and Topo explained 9.81% of the variation in mite species composition, and substrate covariables explained 42.84 of the variation. R2adj of this RDA is 8.33%

X

W

~

Y

Variation partitioning

Going one step further...

Solution #3

Challenge #3

Solution #3

Variation partitioning

Venn diagram

Solution #3

  • We can retain only significant variables in each subsets and test the significance of each fraction

> plot(spe.part.all, digits=2)

To divide the variation of a response variable among 2, 3, or 4 explanatory data sets

Perform variation partitioning of the mite species data with significant substrate variables and significant (other) variables:

Useful when one wants to relate species abundance in a community to various types of environmental variables

  • large-scale vs small-scale
  • abiotic vs biotic

> #Varpart

> spe.part <- varpart(spe.hel, envchem.pars, envtopo.pars)

> plot(spe.part, digits=2)

> #Tests of significance

> anova.cca(rda(spe.hel, envchem.pars), step=1000) # [a+b]

> anova.cca(rda(spe.hel, envtopo.pars), step=1000) # [b+c]

> env.pars <- cbind(envchem.pars, envtopo.pars)

> anova.cca(rda(spe.hel, env.pars), step=1000) # [a+b+c]

> anova.cca(rda(spe.hel, envchem.pars, envtopo.pars),

step=1000) # [a]

> anova.cca(rda(spe.hel, envtopo.pars, envchem.pars),

step=1000) # [c]

envchem

envtopo

> #Varpart

> mite.spe.part <- varpart(mite.spe.hel, mite.subs.pars, mite.other.pars)

> plot(mite.spe.part, digits=2)

> #Tests of significance

> anova.cca(rda(mite.spe.hel, mite.subs.pars), step=1000) # [a+b]

> anova.cca(rda(mite.spe.hel, mite.other.pars), step=1000) # [b+c]

> mite.env.pars <- cbind(mite.subs.pars, mite.other.pars)

> anova.cca(rda(mite.spe.hel, mite.env.pars), step=1000) # [a+b+c]

> anova.cca(rda(mite.spe.hel, mite.subs.pars, mite.other.pars),

step=1000) # [a]

> anova.cca(rda(spe.hel, mite.other.pars, mite.subs.pars),

step=1000) # [c]

  • What proportion of the variation is explained by each data set?
  • What are the significant fractions?

> mite.subs=mite.env[,c(1,2,3)] # first set of variables

> mite.other=mite.env[,c(4,5)]

> #RDA of mite.subs

> rda.mite.subs <- rda(mite.spe.hel~., data=mite.subs)

> R2a.all.chem <- RsquareAdj(rda.mite.subs)$adj.r.squared

> ordiR2step(rda(mite.spe.hel~1, data=mite.subs),

scope= formula(rda.mite.subs),direction= "forward",

R2scope=TRUE, pstep=1000)

> names(mite.subs)

> mite.subs.pars <- mite.subs[, c( 2,3 )]

> #RDA for mite.other

> rda.mite.other <- rda(mite.spe.hel~., data=mite.other)

> R2a.all.other <- RsquareAdj(rda.mite.other)$adj.r.squared

> ordiR2step(rda(mite.spe.hel~1, data=mite.other),

scope=formula(rda.mite.other), direction= "forward",

R2scope=TRUE, pstep=1000)

> names(mite.other)

> mite.other.pars <- mite.other[, c(1,2)]

> #RDA of chemistry variables

> spe.chem <- rda(spe.hel~., data=envchem)

> #Select significant chemistry variables

> R2a.all.chem <- RsquareAdj(spe.chem)$adj.r.squared

> ordiR2step(rda(spe.hel~1, data=envchem), scope= formula(spe.chem),

direction= "forward", R2scope=TRUE, pstep=1000)

> names(envchem)

> envchem.pars <- envchem[, c( 4, 6, 7 )]

> #RDA with other environmental variables

> spe.topo <- rda(spe.hel~., data=envtopo)

> R2a.all.topo <- RsquareAdj(spe.topo)$adj.r.squared

> ordiR2step(rda(spe.hel~1, data=envtopo), scope= formula(spe.topo),

direction= "forward", R2scope=TRUE, pstep=1000)

> names(envtopo)

> envtopo.pars <- envtopo[, c(1,2)]

  • All the fractions are significant (p<0.001)

> library(vegan)

> ?varpart

> spe.part.all <- varpart(spe.hel, envchem,envtopo)

> spe.part.all

Legendre and Legendre 2012

All fractions are significant.

Website: http://qcbs.ca/wiki/r_workshop10

Cool Venn diagram

> install.packages("venneuler")

> library(venneuler)

> v <- venneuler(c(A=0.241,"A&B"= 0.233, B=0.112))

> v$labels <- c("Chem\n24%","Topo \n11%")

> v$colors <- c(.78,.6)

> plot(v)

MRT on the Doubs data

Multivariate regression tree (MRT)

Plot the tree

MRT terminology

Discriminant species

Compare tree sizes

Summary output

Cross-validation plot

MRT vs RDA

threshold

> summary(doubs.mrt)

We will use the function mvpart() from the... mvpart{} package

> # Using the CVRE criterion

> doubs.mrt.cvre<- mvpart((data.matrix(spe.hel)~., env, legend=FALSE,

margin=0.01, cp=0, xv="pick", xval=nrow(spe.hel),xvmult=100,which=4)

> # Choosing to get a tree composed of 4 groups (leaves)

> doubs.mrt.cvre<- mvpart((data.matrix(spe.hel)~., env, legend=FALSE,

margin=0.01, cp=0, xv="pick", xval=nrow(spe.hel),xvmult=100,which=4)

Are there species contributing highly to the explained variance at each split?

What are the sites contained in each leaf?

Let's generate an nicer and more informative output with MRT().

This function is available in MVPARTwrap library, and also calls the rdaTest library

  • A constrained clustering technique that:
  • partitions a quantitative response matrix by a matrix of explanatory variables constraining where to divide the response data
  • involves two procedures running simultaneously:
  • the constrained partitioning of the data
  • cross-validations
  • the end result in a tree
  • Leaf: terminal group of sites
  • Node: point where the data splits into two groups
  • characterized by a threshold value of a descriptor
  • Branch: each group formed by a split

> ?mvpart

> # Remove "distance from source" from raw environmental data set

> env <- subset(env, select = -das)

> # Create the tree

> doubs.mrt<- mvpart((data.matrix(spe.hel)~., env, legend=FALSE,

margin=0.01, cp=0, xv="pick", xval=nrow(spe.hel),xvmult=100,which=4)

  • barplot showing species abundances in sites grouped in this leaf
  • relative error
  • number of sites

Advantages of the MRT:

  • does not assume a linear relationship between Y and X matrices
  • robust in dealing with missing values
  • robust in dealing with colinearity among descriptors
  • works well with raw descriptor value
  • generates an output that is easy to interpret

Note that the code does not change at all...

The selection of the tree size must be made via the cross-validation graph

> doubs.mrt.wrap<-MRT(doubs.mrt, percent=10,

species=colnames(spe.hel))

> summary(doubs.mvpart.wrap)

xv= "pick" # so that we choose where to cut the tree

xval=now(spe.hel) # number of times cross-validation occurs

xvmult=100 # number of permutations in the cross-validation

  • The main discriminant species are TRU, VAI and ABL.
  • ABL is the most indicative species of the sites at lower altitude

CP stands for "complexity parameter". In other words, the variance explained at each node.

reciprocal R2: 43.7%

Solution #4

Challenge #4

Solution #4

Create a multivariate regression tree for the mite data.

> mite.mrt<-mvpart(data.matrix(mite.spe.hel)~.,

mite.env,legend=FALSE, margin=0.01, cp=0,

xv="pick", xval=nrow(mite.spe.hel), xvmult=100,which=4)

> summary(mite.mrt)

> mite.mrt.wrap<-MRT(mite.mrt, percent=10,

species=colnames(mite.spe.hel))

> summary(mite.mrt.wrap)

  • Select the minimum size of tree within 1 SE of the CVRE
  • What is the proportion of variance explained by this tree?
  • How many leaves does it contain?
  • What are the discriminant species?
  • 25.6% of the variation in the mite species assemblage across sites is explained by the partition of the sites based on water content of the substrate at a threshold value of 385.1 mg/l.
  • LCIL is a discriminant species of sites with higher water content

Challenge 5 solution

LDA output

Run the LDA

A priori grouping

Challenge 5 solution

Challenge 5

LDA

Making predictions

Predicted groupings

We will use the grouping factor in spa.group in column 4

The LDA output can then be used to determine which group each object (site) was classified in based on the LDA and tell you how accurate each grouping was

Now that we have the relationship between the environmental variables and the groupings we can get new sites and classify them based on this relationship

Use the LDA result to predict the grouping of the ClassifyMe data

Run an LDA for the mite env data (only first two vars) based on four latitudinal groups you create from the mite.xy data set

#read in new sites

>classify.me<-read.csv("classifyme.csv", header = T)

#predict grouping of new data

>predict.group<-predict(LDA, newdata=classify.me)

#give classification for each new site

>group.new<-predict.group$class

  • For our example we want to first make the a priori grouping for the Doubs fish data
  • We will chose latitude since we have the x, y coordinates for each site
  • We will divide the latitude into three groups based on the range of the y coordinates

>LDA.mite<-lda(mite.env[,1:2],mite.xy.group[,4])

>mite.class <- predict(LDA.mite)$class

>mite.post <- predict(LDA.mite)$posterior

>mite.table <- table(mite.xy.group[,4], mite.class)

>diag(prop.table(mite.table, 1))

#run LDA

>LDA<-lda(env,spa.group[,4])>

#classification of the objects based on LDA

>spe.class <- predict(LDA)$class

#posterior probabilities of the objects to belong to the groups

>spe.post <- predict(LDA)$posterior

>mite.xy$site<-seq(1:70)

(max(mite.xy[,2])-min(mite.xy[,2]))/4

>mite.xy.group<-ddply(.data=mite.xy, .variables=.(x, y, site), .fun= summarise, group = if(y <= 2.5) 1 else if (y <= 4.9) 2 else if (y <= 7.3) 3 else 4)

>mite.xy.group<-mite.xy.group[with(mite.xy.group, order(site)), ]

  • Linear discriminant analysis (LDA) is a cool technique that allows you to determine how much variation is explained by an a priori grouping (like latitude or ecoytype)

  • You can also use it to make predictions about what group a new entry belongs to

  • An example could be classifying whether a fish comes from a lake or ocean population based on variables you collected from the different populations

The syntax for the LDA is lda(variables you want to use to explain the grouping, grouping variable)

Are ordered in the same order as they appear in the data frame so site 31 is in group 2, site 32 in group 1 and so on

#add site numbers

>numbers<-(1:30)

>numbers<-numbers[!numbers%in%8]

>spa$site<-numbers

#make groups based on lattitude y<82=group1, 82<y<156=group2, y>156=group3

>spa.group<-ddply(.data=spa, .variables=.(x, y, site), .fun= summarise, group = if(y <= 82) 1 else if (y <= 156) 2 else 3)

#order by site

>spa.group<-spa.group[with(spa.group, order(site)), ]

Notice the abundance of 0s

Learn more about creating dynamic, engaging presentations with Prezi