0. Intro
1. Exploring and Preparing Data
2. Canonical analyses
2.2. Partial redundancy analysis
2.3. Variation partitioning
3. Multivariate regression tree
4. Linear discriminant analysis
Overview
Follow along
Required packages
Intro
> install.packages("vegan")
> install.packages("labdsv")
> install.packages("plyr")
> install.packages("MASS")
> install.packages("mvpart")
> install.packages("MVPARTwrap")
> install.packages("rdaTest")
So you know how to ordinate which is great for data exploration
But what if we want to test hypotheses or make predictions?
- This workshop is an extension of Workshop 9 but now with more:
- Environmental data!
- Fun!
- Matrices!
- Hypothesis testing!
Today we will be going over constrained ordination methods that allow us to integrate environmental variables to explain and predict variation
Data and R code available at:
qcbs.ca/wiki/r/r_workshop10
Navigate to the wiki page and download:
- Rscript
- DoubsEnv.csv
- DoubsSpe.csv
- Test data
- mvpart package
- MVPARTwrap package
- rdaTest package
Recommendations:
- create your own new script
- refer to provided code only if needed
- avoid copy pasting or running the code directly from script
- remember to change the working directory to where your files are stored
After installing then load packages
> install.packages("packagename")
Doubs River Fish Dataset
Data Transformation
Doubs River Fish Dataset
Explore Doubs Dataset
Species Frequencies
decostand()
Doubs Environmental Data
Species Frequencies
Explore the content of the environmental data:
> ?decostand
Load the Doubs River species data (DoubsSpe.csv):
Take a look at the distribution of species frequencies:
Explore the content of the fish community dataset:
> sum(spe==0)
> spe <- read.csv(file.choose(), row.names=1)
> spe <- spe[-8,] #remove site with no data
As with many data sets we have an abundance of zeros
However we don't want common absences to artificially increase similarity between sites
So we can apply a transformation, here we will use the Hellinger transformation using the decostand() function in the vegan package
> (ab=table(unlist(spe)))
> barplot(ab, las=1, xlab=”Abundance class”,
ylab=”Frequency”, col=grey(5:0/5))
Standardizing environmental variables is crucial as you cannot compare the effects of variables with different units:
> names(env) #names of objects
> dim(env) #dimensions
> str(env) #structure of objects
> head(env) #first 5 rows
> summary(env) #summary statistics
> env.z <- decostand(env, method="standardize")
Load the Doubs River environmental data (DoubsEnv.csv):
> names(spe) #names of objects
> dim(spe) #dimensions
> str(spe) #structure of objects
> head(spe) #first 5 rows
> summary(spe) #summary statistics
What proportion of zeros?
Explore collinearity by visualizing correlations between variables:
> env <- read.csv(file.choose(), row.names=1)
> env <- env[-8,] #remove site with no data
> spe.hel <- decostand(spe, method="hellinger")
Verneaux (1973) dataset:
- characterization of fish communities
- 27 different species
- 30 different sites
- 11 environmental variables
This centers and scales the variables to make your downstream analysis more appropriate:
> sum(spe==0)/(nrow(spe)*ncol(spe))
> pairs(env, main="Bivariate Plots of the Environmental Data" )
# PROCEED WITH CAUTION ONLY EXECUTE ONCE
> apply(env.z, 2, mean)
> apply(env.z, 2, sd)
2.1. Redundancy analysis
Canonical Analyses
What does it all mean?
RDA plots
RDA in R
Redundancy analysis (RDA)
Plot aesthetics
Significance testing
Selecting variables
Scaling type 1
RDA output
Scaling type 2
Use anova.cca to test the significance of axes
You can make quick RDA biplots with the base plot() function but sometimes that gives you a plot that crowds your points or makes objects too small. You can change the aesthetics of the points, text and arrows.
Use the ordiR2step() function to select significant explanatory variables
In type 1 scaling the sites (here in blue) that are more similar to one another are closer to each other
- Prepare the data
- Run the RDA
- Extract the results
> anova.cca(spe.rda.signif, step=1000)
> anova.cca(spe.rda.signif, step=1000)
Remember that there are two scaling options:
- Type=1: distance among objects approximates their similarities
- Type=2: angles between variables reflect their correlation
>points(scores(spe.rda.signif, display="sites", choices=c(1,2), scaling=1),
pch=21, col="black", bg="steelblue", cex=1.2)
- Canonical analysis is a generic term for analyses that
- identify relationship between species composition and environmental variables
- are used to test hypotheses
- make predictions
> ordiR2step(rda(spe.hel~1, data=env.z),scope=formula(spe.rda),direction="forward,R2scope=TRUE,pstep=1000)
> env.signif<-subset(env.z, select=c("alt","oxy","dbo"))
- Direct extension of multiple regression
- Models effect of explanatory matrix on a response matrix
- Variables can be quantitative, qualitative or binary (1 or 0)
- As with other ordinations the data should be transformed or standardized is units are disparate
> env.z<-subset(env.z, select=-das)
> spe.rda<-rda(spe.hel~.,data=env.z)
> summary(spe.rda, display=NULL)
Quebec Centre for Biodiversity Science
R Workshop Series
>arrows(0,0,scores(spe.rda.signif, display="species", choices=c(1), scaling=1),scores(spe.rda.signif, display="species", choices=c(2), scaling=1), col="black",length=0)
One of the most powerful aspects of RDA is the simultaneous visualization of your environmental species and site data
Now plot your RDA
In type 2 scaling the arrows and their angles are really informative. Variables that have longer arrows are stronger drivers of the variation. Variables that are opposite one another (oxy-dbo) have a negative relationship, same direction (oxy-alt) have a positive relationship
How many variables are retained by the forward selection? What is the R2 of the RDA with the retained variables?
> R2adj<-RsquaredAdj(spe.rda.signif)$adj.r.squared
Constrained proportion is the variance of Y explained by X (73.41%) and the unexplained variance of Y (26.59%)
Challenge 1 (Solution)
Challenge 1
Now run an RDA on mite species abundances using the environmental data as constraining variables
Test partial RDA significance
Plot results
Partial RDA
Why is it interesting?
Solution #2
Challenge #2
How does it work?
Partial RDA
Partial RDA on Doubs data
> mite.spe.subs <- rda(mite.spe.hel ~ Shrub + Topo
+ Condition(SubsDens + WaterCont + Substrate), data=mite.env)
- Assess the effect of water chemistry on fish species abundances partialling out the effect of physiography
#Run the partial RDA
> spechem.physio<-rda(spe.hel, envchem, envtopo)
> summary(spechem.physio, display=NULL)
#or
> spechem.physio2<-rda(spe.hel ~ pH + dur + pho + nit + amm
+ oxy + dbo + Condition(alt + pen + deb), data=env)
> summary(spechem.physio2, display=NULL)
Run a partial RDA on the mite data:
- control for the substrate variables SubsDens, Watercont and Substrate
- A special case of the RDA
- Involves an additional set of explanatory variables
- The linear effects of X variables on the Y variables are adjusted for the effects of covariables W
#R2adj
> R2adj<- RsquareAdj(spechem.physio)$adj.r.squared
> R2adj
#Significance of axes
> anova.cca(spechem.physio, step=1000)
> anova.cca(spechem.physio, step=1000, by="axis")
In a nutshell...
- RDA of Y on W
- The residuals are extracted (new matrix Yres|W)
- containing Y response variables without the effect of W
- RDA of Yres|W on X
> plot(spechem.physio, scaling=2, main="Triplot partial RDA - scaling 2",
type="none", xlab=c("RDA1"), ylab=c("RDA2"), xlim=c(-1,1), ylim=c(-1,1))
> points(scores(spechem.physio, display="sites", choices=c(1,2),
scaling=2),pch=21, col="black", bg="steelblue", cex=1.2)
> text(scores(spechem.physio, display="species", choices=c(1),scaling=2),
scores(spechem.physio, display="species", choices=c(2), scaling=2),
labels=rownames(scores(spechem.physio, display="species", scaling=2)),
col="grey", cex=0.8)
> arrows(0,0,scores(spechem.physio, display="bp", choices=c(1),scaling=2),
scores(spechem.physio, display="bp", choices=c(2), scaling=2),col="red")
> text(scores(spechem.physio, display="bp", choices=c(1), scaling=2)+0.05,
scores(spechem.physio, display="bp", choices=c(2), scaling=2)+0.05,
labels=rownames(scores(spechem.physio, display="bp", choices=c(2),
scaling=2)),col="red", cex=1)
- Assess the effects of environmental variables on species composition while taking into account the variation due to variables that are not the focus of the study
- Control for well-known linear effects
- Isolate the effect of one or more descriptor(s)
- Analyse related samples
Workshop 10: Advanced multivariate analyses
> summary(mite.spe.subs, display="NULL")
> (R2adj <- RsquareAdj(mite.spe.subs)$adj.r.squared)
> anova.cca(mite.spe.subs, step=1000)
> anova.cca(mite.spe.subs, step=1000, by="axis")
- Is the model significant
- Which are the significant axes
- What is the variance explained by the substrate variables?
> envtopo<-env.z[,c(1:3)] # physiography
> envchem<-env.z[,c(4:10)] # water quality
RDA is significant, but only the 1st canonical axis is significant. Shrub and Topo explained 9.81% of the variation in mite species composition, and substrate covariables explained 42.84 of the variation. R2adj of this RDA is 8.33%
Variation partitioning
Going one step further...
Solution #3
Challenge #3
Solution #3
Variation partitioning
Venn diagram
Solution #3
- We can retain only significant variables in each subsets and test the significance of each fraction
> plot(spe.part.all, digits=2)
To divide the variation of a response variable among 2, 3, or 4 explanatory data sets
Perform variation partitioning of the mite species data with significant substrate variables and significant (other) variables:
Useful when one wants to relate species abundance in a community to various types of environmental variables
- large-scale vs small-scale
- abiotic vs biotic
> #Varpart
> spe.part <- varpart(spe.hel, envchem.pars, envtopo.pars)
> plot(spe.part, digits=2)
> #Tests of significance
> anova.cca(rda(spe.hel, envchem.pars), step=1000) # [a+b]
> anova.cca(rda(spe.hel, envtopo.pars), step=1000) # [b+c]
> env.pars <- cbind(envchem.pars, envtopo.pars)
> anova.cca(rda(spe.hel, env.pars), step=1000) # [a+b+c]
> anova.cca(rda(spe.hel, envchem.pars, envtopo.pars),
step=1000) # [a]
> anova.cca(rda(spe.hel, envtopo.pars, envchem.pars),
step=1000) # [c]
> #Varpart
> mite.spe.part <- varpart(mite.spe.hel, mite.subs.pars, mite.other.pars)
> plot(mite.spe.part, digits=2)
> #Tests of significance
> anova.cca(rda(mite.spe.hel, mite.subs.pars), step=1000) # [a+b]
> anova.cca(rda(mite.spe.hel, mite.other.pars), step=1000) # [b+c]
> mite.env.pars <- cbind(mite.subs.pars, mite.other.pars)
> anova.cca(rda(mite.spe.hel, mite.env.pars), step=1000) # [a+b+c]
> anova.cca(rda(mite.spe.hel, mite.subs.pars, mite.other.pars),
step=1000) # [a]
> anova.cca(rda(spe.hel, mite.other.pars, mite.subs.pars),
step=1000) # [c]
- What proportion of the variation is explained by each data set?
- What are the significant fractions?
> mite.subs=mite.env[,c(1,2,3)] # first set of variables
> mite.other=mite.env[,c(4,5)]
> #RDA of mite.subs
> rda.mite.subs <- rda(mite.spe.hel~., data=mite.subs)
> R2a.all.chem <- RsquareAdj(rda.mite.subs)$adj.r.squared
> ordiR2step(rda(mite.spe.hel~1, data=mite.subs),
scope= formula(rda.mite.subs),direction= "forward",
R2scope=TRUE, pstep=1000)
> names(mite.subs)
> mite.subs.pars <- mite.subs[, c( 2,3 )]
> #RDA for mite.other
> rda.mite.other <- rda(mite.spe.hel~., data=mite.other)
> R2a.all.other <- RsquareAdj(rda.mite.other)$adj.r.squared
> ordiR2step(rda(mite.spe.hel~1, data=mite.other),
scope=formula(rda.mite.other), direction= "forward",
R2scope=TRUE, pstep=1000)
> names(mite.other)
> mite.other.pars <- mite.other[, c(1,2)]
> #RDA of chemistry variables
> spe.chem <- rda(spe.hel~., data=envchem)
> #Select significant chemistry variables
> R2a.all.chem <- RsquareAdj(spe.chem)$adj.r.squared
> ordiR2step(rda(spe.hel~1, data=envchem), scope= formula(spe.chem),
direction= "forward", R2scope=TRUE, pstep=1000)
> names(envchem)
> envchem.pars <- envchem[, c( 4, 6, 7 )]
> #RDA with other environmental variables
> spe.topo <- rda(spe.hel~., data=envtopo)
> R2a.all.topo <- RsquareAdj(spe.topo)$adj.r.squared
> ordiR2step(rda(spe.hel~1, data=envtopo), scope= formula(spe.topo),
direction= "forward", R2scope=TRUE, pstep=1000)
> names(envtopo)
> envtopo.pars <- envtopo[, c(1,2)]
- All the fractions are significant (p<0.001)
> library(vegan)
> ?varpart
> spe.part.all <- varpart(spe.hel, envchem,envtopo)
> spe.part.all
Legendre and Legendre 2012
All fractions are significant.
Website: http://qcbs.ca/wiki/r_workshop10
Cool Venn diagram
> install.packages("venneuler")
> library(venneuler)
> v <- venneuler(c(A=0.241,"A&B"= 0.233, B=0.112))
> v$labels <- c("Chem\n24%","Topo \n11%")
> v$colors <- c(.78,.6)
> plot(v)
MRT on the Doubs data
Multivariate regression tree (MRT)
Plot the tree
MRT terminology
Discriminant species
Compare tree sizes
Summary output
Cross-validation plot
MRT vs RDA
> summary(doubs.mrt)
We will use the function mvpart() from the... mvpart{} package
> # Using the CVRE criterion
> doubs.mrt.cvre<- mvpart((data.matrix(spe.hel)~., env, legend=FALSE,
margin=0.01, cp=0, xv="pick", xval=nrow(spe.hel),xvmult=100,which=4)
> # Choosing to get a tree composed of 4 groups (leaves)
> doubs.mrt.cvre<- mvpart((data.matrix(spe.hel)~., env, legend=FALSE,
margin=0.01, cp=0, xv="pick", xval=nrow(spe.hel),xvmult=100,which=4)
Are there species contributing highly to the explained variance at each split?
What are the sites contained in each leaf?
Let's generate an nicer and more informative output with MRT().
This function is available in MVPARTwrap library, and also calls the rdaTest library
- A constrained clustering technique that:
- partitions a quantitative response matrix by a matrix of explanatory variables constraining where to divide the response data
- involves two procedures running simultaneously:
- the constrained partitioning of the data
- cross-validations
- the end result in a tree
- Leaf: terminal group of sites
- Node: point where the data splits into two groups
- characterized by a threshold value of a descriptor
- Branch: each group formed by a split
> ?mvpart
> # Remove "distance from source" from raw environmental data set
> env <- subset(env, select = -das)
> # Create the tree
> doubs.mrt<- mvpart((data.matrix(spe.hel)~., env, legend=FALSE,
margin=0.01, cp=0, xv="pick", xval=nrow(spe.hel),xvmult=100,which=4)
- barplot showing species abundances in sites grouped in this leaf
- relative error
- number of sites
Advantages of the MRT:
- does not assume a linear relationship between Y and X matrices
- robust in dealing with missing values
- robust in dealing with colinearity among descriptors
- works well with raw descriptor value
- generates an output that is easy to interpret
Note that the code does not change at all...
The selection of the tree size must be made via the cross-validation graph
> doubs.mrt.wrap<-MRT(doubs.mrt, percent=10,
species=colnames(spe.hel))
> summary(doubs.mvpart.wrap)
xv= "pick" # so that we choose where to cut the tree
xval=now(spe.hel) # number of times cross-validation occurs
xvmult=100 # number of permutations in the cross-validation
- The main discriminant species are TRU, VAI and ABL.
- ABL is the most indicative species of the sites at lower altitude
CP stands for "complexity parameter". In other words, the variance explained at each node.
Solution #4
Challenge #4
Solution #4
Create a multivariate regression tree for the mite data.
> mite.mrt<-mvpart(data.matrix(mite.spe.hel)~.,
mite.env,legend=FALSE, margin=0.01, cp=0,
xv="pick", xval=nrow(mite.spe.hel), xvmult=100,which=4)
> summary(mite.mrt)
> mite.mrt.wrap<-MRT(mite.mrt, percent=10,
species=colnames(mite.spe.hel))
> summary(mite.mrt.wrap)
- Select the minimum size of tree within 1 SE of the CVRE
- What is the proportion of variance explained by this tree?
- How many leaves does it contain?
- What are the discriminant species?
- 25.6% of the variation in the mite species assemblage across sites is explained by the partition of the sites based on water content of the substrate at a threshold value of 385.1 mg/l.
- LCIL is a discriminant species of sites with higher water content
Challenge 5 solution
LDA output
Run the LDA
A priori grouping
Challenge 5 solution
Challenge 5
LDA
Making predictions
Predicted groupings
We will use the grouping factor in spa.group in column 4
The LDA output can then be used to determine which group each object (site) was classified in based on the LDA and tell you how accurate each grouping was
Now that we have the relationship between the environmental variables and the groupings we can get new sites and classify them based on this relationship
Use the LDA result to predict the grouping of the ClassifyMe data
Run an LDA for the mite env data (only first two vars) based on four latitudinal groups you create from the mite.xy data set
#read in new sites
>classify.me<-read.csv("classifyme.csv", header = T)
#predict grouping of new data
>predict.group<-predict(LDA, newdata=classify.me)
#give classification for each new site
>group.new<-predict.group$class
- For our example we want to first make the a priori grouping for the Doubs fish data
- We will chose latitude since we have the x, y coordinates for each site
- We will divide the latitude into three groups based on the range of the y coordinates
>LDA.mite<-lda(mite.env[,1:2],mite.xy.group[,4])
>mite.class <- predict(LDA.mite)$class
>mite.post <- predict(LDA.mite)$posterior
>mite.table <- table(mite.xy.group[,4], mite.class)
>diag(prop.table(mite.table, 1))
#run LDA
>LDA<-lda(env,spa.group[,4])>
#classification of the objects based on LDA
>spe.class <- predict(LDA)$class
#posterior probabilities of the objects to belong to the groups
>spe.post <- predict(LDA)$posterior
>mite.xy$site<-seq(1:70)
(max(mite.xy[,2])-min(mite.xy[,2]))/4
>mite.xy.group<-ddply(.data=mite.xy, .variables=.(x, y, site), .fun= summarise, group = if(y <= 2.5) 1 else if (y <= 4.9) 2 else if (y <= 7.3) 3 else 4)
>mite.xy.group<-mite.xy.group[with(mite.xy.group, order(site)), ]
- Linear discriminant analysis (LDA) is a cool technique that allows you to determine how much variation is explained by an a priori grouping (like latitude or ecoytype)
- You can also use it to make predictions about what group a new entry belongs to
- An example could be classifying whether a fish comes from a lake or ocean population based on variables you collected from the different populations
The syntax for the LDA is lda(variables you want to use to explain the grouping, grouping variable)
Are ordered in the same order as they appear in the data frame so site 31 is in group 2, site 32 in group 1 and so on
#add site numbers
>numbers<-(1:30)
>numbers<-numbers[!numbers%in%8]
>spa$site<-numbers
#make groups based on lattitude y<82=group1, 82<y<156=group2, y>156=group3
>spa.group<-ddply(.data=spa, .variables=.(x, y, site), .fun= summarise, group = if(y <= 82) 1 else if (y <= 156) 2 else 3)
#order by site
>spa.group<-spa.group[with(spa.group, order(site)), ]