Loading presentation...

Present Remotely

Send the link below via email or IM


Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.


Make your likes visible on Facebook?

Connect your Facebook account to Prezi and let your likes appear on your timeline.
You can change this under Settings & Account at any time.

No, thanks

QCBS R workshop 10

No description


on 19 April 2017

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of QCBS R workshop 10

This workshop is an extension of Workshop 9 but now with more:
Environmental data!
Hypothesis testing!
Quebec Centre for Biodiversity Science

R Workshop Series

Workshop 10: Advanced multivariate analyses

1. Exploring and Preparing Data
2. Canonical analyses
3. Multivariate regression tree
0. Intro
Data and R code available at:

Navigate to the wiki page and download:
Test data
mvpart package
MVPARTwrap package
rdaTest package

Follow along
> install.packages("vegan")
> install.packages("labdsv")
> install.packages("plyr")
> install.packages("MASS")
> install.packages("mvpart")
> install.packages("MVPARTwrap")
> install.packages("rdaTest")
Required packages
So you know how to ordinate which is great for data exploration

But what if we want to test hypotheses or make predictions?
Follow along
create your own new script
refer to provided code only if needed
avoid copy pasting or running the code directly from script
remember to change the working directory to where your files are stored

Doubs River Fish Dataset
Verneaux (1973) dataset:
characterization of fish communities
27 different species
30 different sites
11 environmental variables

Doubs River Fish Dataset
> spe <- read.csv(file.choose(), row.names=1)
> spe <- spe[-8,]
#remove site with no data
Load the Doubs River species data (DoubsSpe.csv):
Load the Doubs River environmental data (DoubsEnv.csv):
> env <- read.csv(file.choose(), row.names=1)
> env <- env[-8,]
#remove site with no data
Explore Doubs Dataset
> names(spe)
#names of objects

> dim(spe)

> str(spe)
#structure of objects
> head(spe)
#first 5 rows
> summary(spe)
#summary statistics
Explore the content of the fish community dataset:
Species Frequencies
> (ab=table(unlist(spe)))
> barplot(ab, las=1, xlab=”Abundance class”,
ylab=”Frequency”, col=grey(5:0/5))
Take a look at the distribution of species frequencies:
Notice the abundance of 0s
> sum(spe==0)
How many zeros?
What proportion of zeros?
> sum(spe==0)/(nrow(spe)*ncol(spe))
>50% !
Data Transformation
> spe.hel <- decostand(spe, method="hellinger")
As with many data sets we have an abundance of zeros

However we don't want common absences to artificially increase similarity between sites

So we can apply a transformation, here we will use the Hellinger transformation using the decostand() function in the vegan package
Doubs Environmental Data
> names(env)
#names of objects

> dim(env)

> str(env)
#structure of objects
> head(env)
#first 5 rows
> summary(env)
#summary statistics
Explore the content of the environmental data:
> pairs(env, main="Bivariate Plots of the Environmental Data" )
Explore collinearity by visualizing correlations between variables:
Species Frequencies
Doubs Environmental Data
MRT terminology
Leaf: terminal group of sites
Node: point where the data splits into two groups
characterized by a threshold value of a descriptor
Branch: each group formed by a split
MRT on the Doubs data
Plot the tree
Compare tree sizes
Cross-validation plot
Multivariate regression tree (MRT)
A constrained clustering technique that:
partitions a quantitative response matrix by a matrix of explanatory variables
where to divide the response data
involves two procedures running simultaneously:
the constrained partitioning of the data
the end result in a tree
> env.z <- decostand(env, method="standardize")
Standardizing environmental variables is crucial as you cannot compare the effects of variables with different units:
This centers and scales the variables to make your downstream analysis more appropriate:
> apply(env.z, 2, mean)
> apply(env.z, 2, sd)
> ?decostand
Challenge #4
Create a multivariate regression tree for the mite data.

Solution #4
Solution #4
RDA output
RDA plots
Selecting variables
Use the ordiR2step() function to select significant explanatory variables
Scaling type 1
> ordiR2step(rda(spe.hel~1, data=env.z),scope=formula(spe.rda),direction="forward,R2scope=TRUE,pstep=1000)
> env.signif<-subset(env.z, select=c("alt","oxy","dbo"))
Significance testing
Use anova.cca to test the significance of axes
> anova.cca(spe.rda.signif, step=1000)
> anova.cca(spe.rda.signif, step=1000)
Scaling type 2
Challenge 1
Variation partitioning
Variation partitioning
Useful when one wants to relate species abundance in a community to various types of environmental variables
large-scale vs small-scale
abiotic vs biotic
> library(vegan)
> ?varpart

> spe.part.all <- varpart(spe.hel, envchem,envtopo)
> spe.part.all
Venn diagram
Going one step further...
We can retain only significant variables in each subsets and test the significance of each fraction
Going one step further...
> #Varpart
> spe.part <- varpart(spe.hel, envchem.pars, envtopo.pars)
> plot(spe.part, digits=2)

> #Tests of significance
> anova.cca(rda(spe.hel, envchem.pars), step=1000) # [a+b]
> anova.cca(rda(spe.hel, envtopo.pars), step=1000) # [b+c]

> env.pars <- cbind(envchem.pars, envtopo.pars)
> anova.cca(rda(spe.hel, env.pars), step=1000) # [a+b+c]
> anova.cca(rda(spe.hel, envchem.pars, envtopo.pars),
step=1000) # [a]
> anova.cca(rda(spe.hel, envtopo.pars, envchem.pars),
step=1000) # [c]
Variation partitioning
Challenge #3
Perform variation partitioning of the mite species data with significant substrate variables and significant (other) variables:

What proportion of the variation is explained by each data set?
What are the significant fractions?
Solution #3
Solution #3
Solution #3
Canonical Analyses
Canonical analysis is a generic term for analyses that
identify relationship between species composition and environmental variables
are used to test hypotheses
make predictions
Redundancy analysis (RDA)
Direct extension of multiple regression
Models effect of explanatory matrix on a response matrix
Variables can be quantitative, qualitative or binary (1 or 0)
As with other ordinations the data should be transformed or standardized is units are disparate
2.1. Redundancy analysis
2.2. Partial redundancy analysis
2.3. Variation partitioning
4. Linear discriminant analysis
A special case of the RDA
Involves an additional set of explanatory variables
The linear effects of X variables on the Y variables are adjusted for the effects of covariables W
In a nutshell...
RDA of Y on W
The residuals are extracted (new matrix Yres|W)
containing Y response variables without the effect of W
RDA of Yres|W on X
Assess the effects of environmental variables on species composition while taking into account the variation due to variables that are not the focus of the study
Control for well-known linear effects
Isolate the effect of one or more descriptor(s)
Analyse related samples
Assess the effect of water chemistry on fish species abundances partialling out the effect of physiography
> envtopo<-env.z[,c(1:3)] # physiography
> envchem<-env.z[,c(4:10)] # water quality
Partial RDA
Plot results
Challenge #2
Run a partial RDA on the mite data:
control for the substrate variables SubsDens, Watercont and Substrate

Is the model significant
Which are the significant axes
What is the variance explained by the substrate variables?

Solution #2
Compute RDA:
> mite.spe.subs <- rda(mite.spe.hel ~ Shrub + Topo
+ Condition(SubsDens + WaterCont + Substrate), data=mite.env)
> summary(mite.spe.subs, display="NULL")
> (R2adj <- RsquareAdj(mite.spe.subs)$adj.r.squared)
> anova.cca(mite.spe.subs, step=1000)
> anova.cca(mite.spe.subs, step=1000, by="axis")
Linear discriminant analysis (LDA) is a cool technique that allows you to determine how much variation is explained by an a priori grouping (like latitude or ecoytype)

You can also use it to make predictions about what group a new entry belongs to

An example could be classifying whether a fish comes from a lake or ocean population based on variables you collected from the different populations
A priori
For our example we want to first make the
a priori
grouping for the Doubs fish data
We will chose latitude since we have the x, y coordinates for each site
We will divide the latitude into three groups based on the range of the y coordinates
Run the LDA
#run LDA
Making predictions
LDA output
Predicted groupings
#read in new sites
>classify.me<-read.csv("classifyme.csv", header = T)

#predict grouping of new data
>predict.group<-predict(LDA, newdata=classify.me)

#give classification for each new site
Challenge 5
Run an LDA for the mite env data (only first two vars) based on four latitudinal groups you create from the mite.xy data set
After installing then load packages
> install.packages("packagename")
Today we will be going over constrained ordination methods that allow us to integrate environmental variables to explain and predict variation

RDA in R
Prepare the data
Run the RDA
Extract the results
> env.z<-subset(env.z, select=-das)
> spe.rda<-rda(spe.hel~.,data=env.z)
> summary(spe.rda, display=NULL)
Constrained proportion is the variance of Y explained by X (73.41%) and the unexplained variance of Y (26.59%)

How many variables are retained by the forward selection? What is the R2 of the RDA with the retained variables?

> R2adj<-RsquaredAdj(spe.rda.signif)$adj.r.squared
One of the most powerful aspects of RDA is the simultaneous visualization of your environmental species and site data

Now plot your RDA
Remember that there are two scaling options:
Type=1: distance among objects approximates their similarities
Type=2: angles between variables reflect their correlation
Now run an RDA on mite species abundances using the environmental data as constraining variables
Challenge 1 (Solution)
Partial RDA
How does it work?
Why is it interesting?
Partial RDA on Doubs data
#Run the partial RDA
> spechem.physio<-rda(spe.hel, envchem, envtopo)
> summary(spechem.physio, display=NULL)

> spechem.physio2<-rda(spe.hel ~ pH + dur + pho + nit + amm
+ oxy + dbo + Condition(alt + pen + deb), data=env)
> summary(spechem.physio2, display=NULL)
Test partial RDA significance
> R2adj<- RsquareAdj(spechem.physio)$adj.r.squared
> R2adj

#Significance of axes
> anova.cca(spechem.physio, step=1000)
> anova.cca(spechem.physio, step=1000, by="axis")
> plot(spechem.physio, scaling=2, main="Triplot partial RDA - scaling 2",
type="none", xlab=c("RDA1"), ylab=c("RDA2"), xlim=c(-1,1), ylim=c(-1,1))

> points(scores(spechem.physio, display="sites", choices=c(1,2),
scaling=2),pch=21, col="black", bg="steelblue", cex=1.2)

> text(scores(spechem.physio, display="species", choices=c(1),scaling=2),
scores(spechem.physio, display="species", choices=c(2), scaling=2),
labels=rownames(scores(spechem.physio, display="species", scaling=2)),
col="grey", cex=0.8)

> arrows(0,0,scores(spechem.physio, display="bp", choices=c(1),scaling=2),
scores(spechem.physio, display="bp", choices=c(2), scaling=2),col="red")

> text(scores(spechem.physio, display="bp", choices=c(1), scaling=2)+0.05,
scores(spechem.physio, display="bp", choices=c(2), scaling=2)+0.05,
labels=rownames(scores(spechem.physio, display="bp", choices=c(2),
scaling=2)),col="red", cex=1)
RDA is significant, but only the 1st canonical axis is significant. Shrub and Topo explained 9.81% of the variation in mite species composition, and substrate covariables explained 42.84 of the variation. R2adj of this RDA is 8.33%
To divide the variation of a response variable among 2, 3, or 4 explanatory data sets
Legendre and Legendre 2012
> plot(spe.part.all, digits=2)
> #RDA of chemistry variables
> spe.chem <- rda(spe.hel~., data=envchem)

> #Select significant chemistry variables
> R2a.all.chem <- RsquareAdj(spe.chem)$adj.r.squared
> ordiR2step(rda(spe.hel~1, data=envchem), scope= formula(spe.chem),
direction= "forward", R2scope=TRUE, pstep=1000)
> names(envchem)
> envchem.pars <- envchem[, c( 4, 6, 7 )]

> #RDA with other environmental variables
> spe.topo <- rda(spe.hel~., data=envtopo)
> R2a.all.topo <- RsquareAdj(spe.topo)$adj.r.squared
> ordiR2step(rda(spe.hel~1, data=envtopo), scope= formula(spe.topo),
direction= "forward", R2scope=TRUE, pstep=1000)
> names(envtopo)
> envtopo.pars <- envtopo[, c(1,2)]
All the fractions are significant (p<0.001)
> mite.subs=mite.env[,c(1,2,3)] # first set of variables
> mite.other=mite.env[,c(4,5)]

> #RDA of mite.subs
> rda.mite.subs <- rda(mite.spe.hel~., data=mite.subs)
> R2a.all.chem <- RsquareAdj(rda.mite.subs)$adj.r.squared
> ordiR2step(rda(mite.spe.hel~1, data=mite.subs),
scope= formula(rda.mite.subs),direction= "forward",
R2scope=TRUE, pstep=1000)
> names(mite.subs)
> mite.subs.pars <- mite.subs[, c( 2,3 )]

> #RDA for mite.other
> rda.mite.other <- rda(mite.spe.hel~., data=mite.other)
> R2a.all.other <- RsquareAdj(rda.mite.other)$adj.r.squared
> ordiR2step(rda(mite.spe.hel~1, data=mite.other),
scope=formula(rda.mite.other), direction= "forward",
R2scope=TRUE, pstep=1000)
> names(mite.other)
> mite.other.pars <- mite.other[, c(1,2)]
> #Varpart
> mite.spe.part <- varpart(mite.spe.hel, mite.subs.pars, mite.other.pars)
> plot(mite.spe.part, digits=2)

> #Tests of significance
> anova.cca(rda(mite.spe.hel, mite.subs.pars), step=1000) # [a+b]
> anova.cca(rda(mite.spe.hel, mite.other.pars), step=1000) # [b+c]

> mite.env.pars <- cbind(mite.subs.pars, mite.other.pars)
> anova.cca(rda(mite.spe.hel, mite.env.pars), step=1000) # [a+b+c]
> anova.cca(rda(mite.spe.hel, mite.subs.pars, mite.other.pars),
step=1000) # [a]
> anova.cca(rda(spe.hel, mite.other.pars, mite.subs.pars),
step=1000) # [c]
All fractions are significant.
Advantages of the MRT:

does not assume a linear relationship between Y and X matrices
robust in dealing with missing values
robust in dealing with colinearity among descriptors
works well with raw descriptor value
generates an output that is easy to interpret
> ?mvpart

> # Remove "distance from source" from raw environmental data set
> env <- subset(env, select = -das)

> # Create the tree
> doubs.mrt<- mvpart((data.matrix(spe.hel)~., env, legend=FALSE,
margin=0.01, cp=0, xv="pick", xval=nrow(spe.hel),xvmult=100,which=4)
We will use the function mvpart() from the... mvpart{} package
xv= "pick" # so that we choose where to cut the tree
xval=now(spe.hel) # number of times cross-validation occurs
xvmult=100 # number of permutations in the cross-validation

barplot showing species abundances in sites grouped in this leaf
relative error
number of sites
reciprocal R2: 43.7%

> # Using the CVRE criterion
> doubs.mrt.cvre<- mvpart((data.matrix(spe.hel)~., env, legend=FALSE,
margin=0.01, cp=0, xv="pick", xval=nrow(spe.hel),xvmult=100,which=4)

> # Choosing to get a tree composed of 4 groups (leaves)
> doubs.mrt.cvre<- mvpart((data.matrix(spe.hel)~., env, legend=FALSE,
margin=0.01, cp=0, xv="pick", xval=nrow(spe.hel),xvmult=100,which=4)

Note that the code does not change at all...
The selection of the tree size must be made via the cross-validation graph
Summary output
> summary(doubs.mrt)
CP stands for "complexity parameter". In other words, the variance explained at each node.
Discriminant species
> doubs.mrt.wrap<-MRT(doubs.mrt, percent=10,
> summary(doubs.mvpart.wrap)
Are there species contributing highly to the explained variance at each split?
What are the sites contained in each leaf?

Let's generate an nicer and more informative output with MRT().
This function is available in MVPARTwrap library, and also calls the rdaTest library
The main discriminant species are TRU, VAI and ABL.
ABL is the most indicative species of the sites at lower altitude
Select the minimum size of tree within 1 SE of the CVRE
What is the proportion of variance explained by this tree?
How many leaves does it contain?
What are the discriminant species?
> mite.mrt<-mvpart(data.matrix(mite.spe.hel)~.,
mite.env,legend=FALSE, margin=0.01, cp=0,
xv="pick", xval=nrow(mite.spe.hel), xvmult=100,which=4)
> summary(mite.mrt)

> mite.mrt.wrap<-MRT(mite.mrt, percent=10,
> summary(mite.mrt.wrap)
25.6% of the variation in the mite species assemblage across sites is explained by the partition of the sites based on water content of the substrate at a threshold value of 385.1 mg/l.
LCIL is a discriminant species of sites with higher water content
Plot aesthetics
What does it all mean?
In type 1 scaling the sites (here in blue) that are more similar to one another are closer to each other
In type 2 scaling the arrows and their angles are really informative. Variables that have longer arrows are stronger drivers of the variation. Variables that are opposite one another (oxy-dbo) have a negative relationship, same direction (oxy-alt) have a positive relationship
You can make quick RDA biplots with the base plot() function but sometimes that gives you a plot that crowds your points or makes objects too small. You can change the aesthetics of the points, text and arrows.
>points(scores(spe.rda.signif, display="sites", choices=c(1,2), scaling=1),
pch=21, col="black", bg="steelblue", cex=1.2)
>arrows(0,0,scores(spe.rda.signif, display="species", choices=c(1), scaling=1),scores(spe.rda.signif, display="species", choices=c(2), scaling=1), col="black",length=0)
#add site numbers

#make groups based on lattitude y<82=group1, 82<y<156=group2, y>156=group3
>spa.group<-ddply(.data=spa, .variables=.(x, y, site), .fun= summarise, group = if(y <= 82) 1 else if (y <= 156) 2 else 3)

#order by site
>spa.group<-spa.group[with(spa.group, order(site)), ]
We will use the grouping factor in spa.group in column 4
The syntax for the LDA is lda(variables you want to use to explain the grouping, grouping variable)
The LDA output can then be used to determine which group each object (site) was classified in based on the LDA and tell you how accurate each grouping was
#classification of the objects based on LDA
>spe.class <- predict(LDA)$class

#posterior probabilities of the objects to belong to the groups
>spe.post <- predict(LDA)$posterior

Now that we have the relationship between the environmental variables and the groupings we can get new sites and classify them based on this relationship

Use the LDA result to predict the grouping of the ClassifyMe data
Are ordered in the same order as they appear in the data frame so site 31 is in group 2, site 32 in group 1 and so on
Challenge 5 solution

>mite.xy.group<-ddply(.data=mite.xy, .variables=.(x, y, site), .fun= summarise, group = if(y <= 2.5) 1 else if (y <= 4.9) 2 else if (y <= 7.3) 3 else 4)

>mite.xy.group<-mite.xy.group[with(mite.xy.group, order(site)), ]

Challenge 5 solution
>mite.class <- predict(LDA.mite)$class
>mite.post <- predict(LDA.mite)$posterior
>mite.table <- table(mite.xy.group[,4], mite.class)
>diag(prop.table(mite.table, 1))

Cool Venn diagram
> install.packages("venneuler")
> library(venneuler)
> v <- venneuler(c(A=0.241,"A&B"= 0.233, B=0.112))
> v$labels <- c("Chem\n24%","Topo \n11%")
> v$colors <- c(.78,.6)
> plot(v)
Full transcript