Environmental data!

Fun!

Matrices!

Hypothesis testing!

**Quebec Centre for Biodiversity Science**

R Workshop Series

R Workshop Series

**Workshop 10: Advanced multivariate analyses**

**Website**

:

http://qcbs.ca/wiki/r_workshop10

:

http://qcbs.ca/wiki/r_workshop10

**1. Exploring and Preparing Data**

**2. Canonical analyses**

**3. Multivariate regression tree**

**0. Intro**

Intro

Data and R code available at:

qcbs.ca/wiki/r/r_workshop10

Navigate to the wiki page and download:

Rscript

DoubsEnv.csv

DoubsSpe.csv

Test data

mvpart package

MVPARTwrap package

rdaTest package

Follow along

> install.packages("vegan")

> install.packages("labdsv")

> install.packages("plyr")

> install.packages("MASS")

> install.packages("mvpart")

> install.packages("MVPARTwrap")

> install.packages("rdaTest")

Required packages

So you know how to ordinate which is great for data exploration

But what if we want to test hypotheses or make predictions?

**Overview**

Follow along

Recommendations:

create your own new script

refer to provided code only if needed

avoid copy pasting or running the code directly from script

remember to change the working directory to where your files are stored

Doubs River Fish Dataset

Verneaux (1973) dataset:

characterization of fish communities

27 different species

30 different sites

11 environmental variables

Doubs River Fish Dataset

> spe <- read.csv(file.choose(), row.names=1)

> spe <- spe[-8,]

#remove site with no data

Load the Doubs River species data (DoubsSpe.csv):

Load the Doubs River environmental data (DoubsEnv.csv):

> env <- read.csv(file.choose(), row.names=1)

> env <- env[-8,]

#remove site with no data

Explore Doubs Dataset

> names(spe)

#names of objects

> dim(spe)

#dimensions

> str(spe)

#structure of objects

> head(spe)

#first 5 rows

> summary(spe)

#summary statistics

Explore the content of the fish community dataset:

Species Frequencies

> (ab=table(unlist(spe)))

> barplot(ab, las=1, xlab=”Abundance class”,

ylab=”Frequency”, col=grey(5:0/5))

Take a look at the distribution of species frequencies:

Notice the abundance of 0s

> sum(spe==0)

How many zeros?

What proportion of zeros?

> sum(spe==0)/(nrow(spe)*ncol(spe))

408!

>50% !

Data Transformation

> spe.hel <- decostand(spe, method="hellinger")

As with many data sets we have an abundance of zeros

However we don't want common absences to artificially increase similarity between sites

So we can apply a transformation, here we will use the Hellinger transformation using the decostand() function in the vegan package

Doubs Environmental Data

> names(env)

#names of objects

> dim(env)

#dimensions

> str(env)

#structure of objects

> head(env)

#first 5 rows

> summary(env)

#summary statistics

Explore the content of the environmental data:

> pairs(env, main="Bivariate Plots of the Environmental Data" )

Explore collinearity by visualizing correlations between variables:

decostand()

MUST BE STANDARDIZED !

MRT vs RDA

Species Frequencies

Doubs Environmental Data

MRT terminology

Leaf: terminal group of sites

Node: point where the data splits into two groups

characterized by a threshold value of a descriptor

Branch: each group formed by a split

MRT on the Doubs data

Plot the tree

Compare tree sizes

Cross-validation plot

# PROCEED WITH CAUTION ONLY EXECUTE ONCE

Multivariate regression tree (MRT)

A constrained clustering technique that:

partitions a quantitative response matrix by a matrix of explanatory variables

constraining

where to divide the response data

involves two procedures running simultaneously:

the constrained partitioning of the data

cross-validations

the end result in a tree

> env.z <- decostand(env, method="standardize")

Standardizing environmental variables is crucial as you cannot compare the effects of variables with different units:

This centers and scales the variables to make your downstream analysis more appropriate:

> apply(env.z, 2, mean)

> apply(env.z, 2, sd)

> ?decostand

Challenge #4

Create a multivariate regression tree for the mite data.

Solution #4

Solution #4

RDA output

RDA plots

Selecting variables

Use the ordiR2step() function to select significant explanatory variables

Scaling type 1

> ordiR2step(rda(spe.hel~1, data=env.z),scope=formula(spe.rda),direction="forward,R2scope=TRUE,pstep=1000)

> env.signif<-subset(env.z, select=c("alt","oxy","dbo"))

Significance testing

Use anova.cca to test the significance of axes

> anova.cca(spe.rda.signif, step=1000)

> anova.cca(spe.rda.signif, step=1000)

Scaling type 2

Challenge 1

Variation partitioning

Variation partitioning

Useful when one wants to relate species abundance in a community to various types of environmental variables

large-scale vs small-scale

abiotic vs biotic

> library(vegan)

> ?varpart

> spe.part.all <- varpart(spe.hel, envchem,envtopo)

> spe.part.all

Venn diagram

Going one step further...

We can retain only significant variables in each subsets and test the significance of each fraction

Going one step further...

> #Varpart

> spe.part <- varpart(spe.hel, envchem.pars, envtopo.pars)

> plot(spe.part, digits=2)

> #Tests of significance

> anova.cca(rda(spe.hel, envchem.pars), step=1000) # [a+b]

> anova.cca(rda(spe.hel, envtopo.pars), step=1000) # [b+c]

> env.pars <- cbind(envchem.pars, envtopo.pars)

> anova.cca(rda(spe.hel, env.pars), step=1000) # [a+b+c]

> anova.cca(rda(spe.hel, envchem.pars, envtopo.pars),

step=1000) # [a]

> anova.cca(rda(spe.hel, envtopo.pars, envchem.pars),

step=1000) # [c]

Variation partitioning

Challenge #3

Perform variation partitioning of the mite species data with significant substrate variables and significant (other) variables:

What proportion of the variation is explained by each data set?

What are the significant fractions?

Solution #3

Solution #3

Solution #3

Canonical Analyses

Canonical analysis is a generic term for analyses that

identify relationship between species composition and environmental variables

are used to test hypotheses

make predictions

Redundancy analysis (RDA)

Direct extension of multiple regression

Models effect of explanatory matrix on a response matrix

Variables can be quantitative, qualitative or binary (1 or 0)

As with other ordinations the data should be transformed or standardized is units are disparate

**2.1. Redundancy analysis**

**2.2. Partial redundancy analysis**

**2.3. Variation partitioning**

**4. Linear discriminant analysis**

A special case of the RDA

Involves an additional set of explanatory variables

The linear effects of X variables on the Y variables are adjusted for the effects of covariables W

In a nutshell...

RDA of Y on W

The residuals are extracted (new matrix Yres|W)

containing Y response variables without the effect of W

RDA of Yres|W on X

Assess the effects of environmental variables on species composition while taking into account the variation due to variables that are not the focus of the study

Control for well-known linear effects

Isolate the effect of one or more descriptor(s)

Analyse related samples

Assess the effect of water chemistry on fish species abundances partialling out the effect of physiography

> envtopo<-env.z[,c(1:3)] # physiography

> envchem<-env.z[,c(4:10)] # water quality

Partial RDA

Plot results

Challenge #2

Run a partial RDA on the mite data:

control for the substrate variables SubsDens, Watercont and Substrate

Is the model significant

Which are the significant axes

What is the variance explained by the substrate variables?

Solution #2

Compute RDA:

> mite.spe.subs <- rda(mite.spe.hel ~ Shrub + Topo

+ Condition(SubsDens + WaterCont + Substrate), data=mite.env)

> summary(mite.spe.subs, display="NULL")

> (R2adj <- RsquareAdj(mite.spe.subs)$adj.r.squared)

> anova.cca(mite.spe.subs, step=1000)

> anova.cca(mite.spe.subs, step=1000, by="axis")

LDA

Linear discriminant analysis (LDA) is a cool technique that allows you to determine how much variation is explained by an a priori grouping (like latitude or ecoytype)

You can also use it to make predictions about what group a new entry belongs to

An example could be classifying whether a fish comes from a lake or ocean population based on variables you collected from the different populations

A priori

grouping

For our example we want to first make the

a priori

grouping for the Doubs fish data

We will chose latitude since we have the x, y coordinates for each site

We will divide the latitude into three groups based on the range of the y coordinates

Run the LDA

#run LDA

>LDA<-lda(env,spa.group[,4])>

Making predictions

LDA output

Predicted groupings

#read in new sites

>classify.me<-read.csv("classifyme.csv", header = T)

#predict grouping of new data

>predict.group<-predict(LDA, newdata=classify.me)

#give classification for each new site

>group.new<-predict.group$class

Challenge 5

Run an LDA for the mite env data (only first two vars) based on four latitudinal groups you create from the mite.xy data set

After installing then load packages

> install.packages("packagename")

Today we will be going over constrained ordination methods that allow us integrate environmental variables to explain and predict variation

RDA in R

Prepare the data

Run the RDA

Extract the results

> env.z<-subset(env.z, select=-das)

> spe.rda<-rda(spe.hel~.,data=env.z)

> summary(spe.rda, display=NULL)

Constrained proportion is the variance of Y explained by X (73.41%) and the unexplained variance of Y (26.59%)

How many variables are retained by the forward selection? What is the R2 of the RDA with the retained variables?

> R2adj<-RsquaredAdj(spe.rda.signif)$adj.r.squared

One of the most powerful aspects of RDA is the simultaneous visualization of your environmental species and site data

Now plot your RDA

Remember that there are two scaling options:

Type=1: distance among objects approximates their similarities

Type=2: angles between variables reflect their correlation

Now run an RDA on mite species abundances using the environmental data as constraining variables

Challenge 1 (Solution)

Y

X

W

~

Partial RDA

How does it work?

Why is it interesting?

Partial RDA on Doubs data

#Run the partial RDA

> spechem.physio<-rda(spe.hel, envchem, envtopo)

> summary(spechem.physio, display=NULL)

#or

> spechem.physio2<-rda(spe.hel ~ pH + dur + pho + nit + amm

+ oxy + dbo + Condition(alt + pen + deb), data=env)

> summary(spechem.physio2, display=NULL)

Test partial RDA significance

#R2adj

> R2adj<- RsquareAdj(spechem.physio)$adj.r.squared

> R2adj

#Significance of axes

> anova.cca(spechem.physio, step=1000)

> anova.cca(spechem.physio, step=1000, by="axis")

> plot(spechem.physio, scaling=2, main="Triplot partial RDA - scaling 2",

type="none", xlab=c("RDA1"), ylab=c("RDA2"), xlim=c(-1,1), ylim=c(-1,1))

> points(scores(spechem.physio, display="sites", choices=c(1,2),

scaling=2),pch=21, col="black", bg="steelblue", cex=1.2)

> text(scores(spechem.physio, display="species", choices=c(1),scaling=2),

scores(spechem.physio, display="species", choices=c(2), scaling=2),

labels=rownames(scores(spechem.physio, display="species", scaling=2)),

col="grey", cex=0.8)

> arrows(0,0,scores(spechem.physio, display="bp", choices=c(1),scaling=2),

scores(spechem.physio, display="bp", choices=c(2), scaling=2),col="red")

> text(scores(spechem.physio, display="bp", choices=c(1), scaling=2)+0.05,

scores(spechem.physio, display="bp", choices=c(2), scaling=2)+0.05,

labels=rownames(scores(spechem.physio, display="bp", choices=c(2),

scaling=2)),col="red", cex=1)

RDA is significant, but only the 1st canonical axis is significant. Shrub and Topo explained 9.81% of the variation in mite species composition, and substrate covariables explained 42.84 of the variation. R2adj of this RDA is 8.33%

To divide the variation of a response variable among 2, 3, or 4 explanatory data sets

Legendre and Legendre 2012

> plot(spe.part.all, digits=2)

envchem

envtopo

> #RDA of chemistry variables

> spe.chem <- rda(spe.hel~., data=envchem)

> #Select significant chemistry variables

> R2a.all.chem <- RsquareAdj(spe.chem)$adj.r.squared

> ordiR2step(rda(spe.hel~1, data=envchem), scope= formula(spe.chem),

direction= "forward", R2scope=TRUE, pstep=1000)

> names(envchem)

> envchem.pars <- envchem[, c( 4, 6, 7 )]

> #RDA with other environmental variables

> spe.topo <- rda(spe.hel~., data=envtopo)

> R2a.all.topo <- RsquareAdj(spe.topo)$adj.r.squared

> ordiR2step(rda(spe.hel~1, data=envtopo), scope= formula(spe.topo),

direction= "forward", R2scope=TRUE, pstep=1000)

> names(envtopo)

> envtopo.pars <- envtopo[, c(1,2)]

All the fractions are significant (p<0.001)

> mite.subs=mite.env[,c(1,2,3)] # first set of variables

> mite.other=mite.env[,c(4,5)]

> #RDA of mite.subs

> rda.mite.subs <- rda(mite.spe.hel~., data=mite.subs)

> R2a.all.chem <- RsquareAdj(rda.mite.subs)$adj.r.squared

> ordiR2step(rda(mite.spe.hel~1, data=mite.subs),

scope= formula(rda.mite.subs),direction= "forward",

R2scope=TRUE, pstep=1000)

> names(mite.subs)

> mite.subs.pars <- mite.subs[, c( 2,3 )]

> #RDA for mite.other

> rda.mite.other <- rda(mite.spe.hel~., data=mite.other)

> R2a.all.other <- RsquareAdj(rda.mite.other)$adj.r.squared

> ordiR2step(rda(mite.spe.hel~1, data=mite.other),

scope=formula(rda.mite.other), direction= "forward",

R2scope=TRUE, pstep=1000)

> names(mite.other)

> mite.other.pars <- mite.other[, c(1,2)]

> #Varpart

> mite.spe.part <- varpart(mite.spe.hel, mite.subs.pars, mite.other.pars)

> plot(mite.spe.part, digits=2)

> #Tests of significance

> anova.cca(rda(mite.spe.hel, mite.subs.pars), step=1000) # [a+b]

> anova.cca(rda(mite.spe.hel, mite.other.pars), step=1000) # [b+c]

> mite.env.pars <- cbind(mite.subs.pars, mite.other.pars)

> anova.cca(rda(mite.spe.hel, mite.env.pars), step=1000) # [a+b+c]

> anova.cca(rda(mite.spe.hel, mite.subs.pars, mite.other.pars),

step=1000) # [a]

> anova.cca(rda(spe.hel, mite.other.pars, mite.subs.pars),

step=1000) # [c]

All fractions are significant.

Advantages of the MRT:

does not assume a linear relationship between Y and X matrices

robust in dealing with missing values

robust in dealing with colinearity among descriptors

works well with raw descriptor value

generates an output that is easy to interpret

> ?mvpart

> # Remove "distance from source" from raw environmental data set

> env <- subset(env, select = -das)

> # Create the tree

> doubs.mrt<- mvpart((data.matrix(spe.hel)~., env, legend=FALSE,

margin=0.01, cp=0, xv="pick", xval=nrow(spe.hel),xvmult=100,which=4)

We will use the function mvpart() from the... mvpart{} package

xv= "pick" # so that we choose where to cut the tree

xval=now(spe.hel) # number of times cross-validation occurs

xvmult=100 # number of permutations in the cross-validation

threshold

barplot showing species abundances in sites grouped in this leaf

relative error

number of sites

reciprocal R2: 43.7%

> # Using the CVRE criterion

> doubs.mrt.cvre<- mvpart((data.matrix(spe.hel)~., env, legend=FALSE,

margin=0.01, cp=0, xv="pick", xval=nrow(spe.hel),xvmult=100,which=4)

> # Choosing to get a tree composed of 4 groups (leaves)

> doubs.mrt.cvre<- mvpart((data.matrix(spe.hel)~., env, legend=FALSE,

margin=0.01, cp=0, xv="pick", xval=nrow(spe.hel),xvmult=100,which=4)

Note that the code does not change at all...

The selection of the tree size must be made via the cross-validation graph

Summary output

> summary(doubs.mrt)

CP stands for "complexity parameter". In other words, the variance explained at each node.

Discriminant species

> doubs.mrt.wrap<-MRT(doubs.mrt, percent=10,

species=colnames(spe.hel))

> summary(doubs.mvpart.wrap)

Are there species contributing highly to the explained variance at each split?

What are the sites contained in each leaf?

Let's generate an nicer and more informative output with MRT().

This function is available in MVPARTwrap library, and also calls the rdaTest library

The main discriminant species are TRU, VAI and ABL.

ABL is the most indicative species of the sites at lower altitude

Select the minimum size of tree within 1 SE of the CVRE

What is the proportion of variance explained by this tree?

How many leaves does it contain?

What are the discriminant species?

> mite.mrt<-mvpart(data.matrix(mite.spe.hel)~.,

mite.env,legend=FALSE, margin=0.01, cp=0,

xv="pick", xval=nrow(mite.spe.hel), xvmult=100,which=4)

> summary(mite.mrt)

> mite.mrt.wrap<-MRT(mite.mrt, percent=10,

species=colnames(mite.spe.hel))

> summary(mite.mrt.wrap)

25.6% of the variation in the mite species assemblage across sites is explained by the partition of the sites based on water content of the substrate at a threshold value of 385.1 mg/l.

LCIL is a discriminant species of sites with higher water content

Plot aesthetics

What does it all mean?

In type 1 scaling the sites (here in blue) that are more similar to one another are closer to each other

In type 2 scaling the arrows and their angles are really informative. Variables that have longer arrows are stronger drivers of the variation. Variables that are opposite one another (oxy-dbo) have a negative relationship, same direction (oxy-alt) have a positive relationship

You can make quick RDA biplots with the base plot() function but sometimes that gives you a plot that crowds your points or makes objects too small. You can change the aesthetics of the points, text and arrows.

>points(scores(spe.rda.signif, display="sites", choices=c(1,2), scaling=1),

pch=21, col="black", bg="steelblue", cex=1.2)

>arrows(0,0,scores(spe.rda.signif, display="species", choices=c(1), scaling=1),scores(spe.rda.signif, display="species", choices=c(2), scaling=1), col="black",length=0)

#add site numbers

>numbers<-(1:30)

>numbers<-numbers[!numbers%in%8]

>spa$site<-numbers

#make groups based on lattitude y<82=group1, 82<y<156=group2, y>156=group3

>spa.group<-ddply(.data=spa, .variables=.(x, y, site), .fun= summarise, group = if(y <= 82) 1 else if (y <= 156) 2 else 3)

#order by site

>spa.group<-spa.group[with(spa.group, order(site)), ]

We will use the grouping factor in spa.group in column 4

The syntax for the LDA is lda(variables you want to use to explain the grouping, grouping variable)

The LDA output can then be used to determine which group each object (site) was classified in based on the LDA and tell you how accurate each grouping was

#classification of the objects based on LDA

>spe.class <- predict(LDA)$class

#posterior probabilities of the objects to belong to the groups

>spe.post <- predict(LDA)$posterior

Now that we have the relationship between the environmental variables and the groupings we can get new sites and classify them based on this relationship

Use the LDA result to predict the grouping of the ClassifyMe data

Are ordered in the same order as they appear in the data frame so site 31 is in group 2, site 32 in group 1 and so on

Challenge 5 solution

>mite.xy$site<-seq(1:70)

(max(mite.xy[,2])-min(mite.xy[,2]))/4

>mite.xy.group<-ddply(.data=mite.xy, .variables=.(x, y, site), .fun= summarise, group = if(y <= 2.5) 1 else if (y <= 4.9) 2 else if (y <= 7.3) 3 else 4)

>mite.xy.group<-mite.xy.group[with(mite.xy.group, order(site)), ]

Challenge 5 solution

>LDA.mite<-lda(mite.env[,1:2],mite.xy.group[,4])

>mite.class <- predict(LDA.mite)$class

>mite.post <- predict(LDA.mite)$posterior

>mite.table <- table(mite.xy.group[,4], mite.class)

>diag(prop.table(mite.table, 1))

Cool Venn diagram

> install.packages("venneuler")

> library(venneuler)

> v <- venneuler(c(A=0.241,"A&B"= 0.233, B=0.112))

> v$labels <- c("Chem\n24%","Topo \n11%")

> v$colors <- c(.78,.6)

> plot(v)