QCBS R Workshop 3

ggplot2, tidyr and dplyr
by

CSBQ QCBS

on 14 November 2016


Transcript of QCBS R Workshop 3

Quebec Centre for Biodiversity Science

R Workshop Series

Workshop 3:
ggplot2, tidyr, dplyr

Website: qcbs.ca/wiki/r/r_workshop3

Hadleyverse
Intro - ggplot2
Why use R for plotting?
You !
Outline - ggplot2
Basic scatter plot
Required packages
> install.packages("ggplot2")
> library(ggplot2)
Plotting function: "qplot" (quick plot)
Divided workflow
Intro - ggplot2
Why use R for plotting?
Reproducible workflow
Beautiful and flexible graphics!
Have you created plots?
What kind of plot?
Which software?
Have you plotted in R?
base R, lattice?
ggplot2?
Code and HTML available at:
qcbs.ca/wiki/r/workshop3

Recommendation:
create your own new script
refer to provided code only if needed
avoid copy pasting or running the code directly from script

ggplot2 is also hosted on github:
https://github.com/hadley/ggplot2

Follow along
1. Your first R plot:
Basic scatter plot
Challenge 1
2. Grammar of graphics
More advanced plots
Available plot elements and when to use them
Challenge 2
3. Saving a plot
4. Fine tuning your plot
Colours
Themes
5. Miscellaneous cool stuff

> ?qplot
arguments:
data
x
y
...
Basic scatter plot
Look at pre-loaded "iris" dataset:
> ?iris
> head(iris)
> str(iris)
> names(iris)
> qplot(data = iris,
        x = Sepal.Length,
        y = Sepal.Width)
Basic scatter plot (categorical)
> qplot(data = iris,
x = Species,
y = Sepal.Width)
Less basic scatter plot
> ?qplot
arguments:
data
x
y
...
xlab
ylab
main
Less basic scatter plot
> qplot(data = iris,
        x = Sepal.Length,
        xlab = "Sepal Length (cm)",
        y = Sepal.Width,
        ylab = "Sepal Width (cm)",
        main = "Sepal dimensions")
Challenge # 1
Produce a basic plot with built in data

> ?CO2
> data(CO2)
> ?BOD
> data(BOD)
Grammar of graphics (gg)
A graphic is made of elements (layers)
data:
aesthetics (aes)
transformation
geoms (geometric objects)
axis (coordinate system)
scales


Grammar of graphics (gg)
Aesthetics (aes) make data visible:
x,y : position along the x and y axis
colour: the colour of the point
group: what group a point belongs to
shape: the figure used to plot a point
linetype: the type of line used (solid, dashed, etc)
size: the size of the point or line
alpha: the transparency of the point
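The aesthetics listed above can be sketched on the iris data as follows. This is a minimal illustration, not part of the original slides; the object name aes.plot is hypothetical. Aesthetics placed inside aes() are mapped to a variable, while those given directly to the geom are fixed for every point.

```r
library(ggplot2)

# Mapped aesthetics (inside aes) vary with the data:
# colour and shape both follow Species.
# Fixed settings (outside aes) apply to every point:
# size and alpha are constants here.
aes.plot <- ggplot(data = iris,
                   aes(x = Sepal.Length, y = Sepal.Width,
                       colour = Species, shape = Species)) +
  geom_point(size = 3, alpha = 0.5)
aes.plot
```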


Grammar of graphics (gg)
Geometric objects (geoms)
point: scatterplot
line: line plot, where lines connect points by increasing x value
path: line plot, where lines connect points in sequence of appearance
boxplot: box-and-whisker plots, for categorical x data
bar: barplots
histogram: histograms (for 1-dimensional data)
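The difference between the line and path geoms is easiest to see on data that are not sorted by x. The sketch below uses a small made-up data frame (zigzag, a hypothetical name) and is only meant as an illustration.

```r
library(ggplot2)

# Three points listed out of x order (made-up data for illustration)
zigzag <- data.frame(x = c(3, 1, 2), y = c(1, 2, 3))

# geom_line() connects the points by increasing x value
line.plot <- ggplot(zigzag, aes(x = x, y = y)) + geom_line()

# geom_path() connects the points in the order they appear in the data
path.plot <- ggplot(zigzag, aes(x = x, y = y)) + geom_path()

line.plot
path.plot
```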


Grammar of graphics (gg)
Edit any single element to produce a new graph
e.g. changing the coordinate system
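As a hedged sketch of this idea, the same bar plot can be redrawn under different coordinate systems without touching the data or the geom. The object names (bar.plot, flipped.plot, polar.plot) are illustrative and do not come from the workshop script.

```r
library(ggplot2)

# One bar per species (each species has 50 rows in iris)
bar.plot <- ggplot(data = iris, aes(x = Species, fill = Species)) +
  geom_bar()

# Same data and geom, Cartesian axes swapped
flipped.plot <- bar.plot + coord_flip()

# Same data and geom, drawn in polar coordinates
polar.plot <- bar.plot + coord_polar()

bar.plot
flipped.plot
polar.plot
```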


How it works
1. Create a simple plot object
> plot.object <- qplot() OR ggplot()
2. Add graphical layers/complexity
> plot.object <- plot.object + layer()
3. Repeat step 2 until satisfied, then unleash the beast!
> plot.object
qplot() vs ggplot()
> qplot(data = iris,
        x = Sepal.Length,
        xlab = "Sepal Length (cm)",
        y = Sepal.Width,
        ylab = "Sepal Width (cm)",
        main = "Sepal dimensions")
> ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point() +
    xlab("Sepal Length (cm)") +
    ylab("Sepal Width (cm)") +
    ggtitle("Sepal dimensions")
Scatter plot as R object
> basic.plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point() +
    xlab("Sepal Length (cm)") +
    ylab("Sepal Width (cm)") +
    ggtitle("Sepal dimensions")
> basic.plot
Plot with colour and shape
Add aesthetics:
> basic.plot <- basic.plot +
    aes(colour = Species, shape = Species)
> basic.plot
Plot linear regressions
Add a geom (e.g. linear smoothing):
> linear.smooth.plot <- basic.plot +
    geom_smooth(method = "lm", se = FALSE)
> linear.smooth.plot
Challenge # 2
Produce a colourful plot with linear regression (or other smoothing) from built in data

> ?CO2
> data(CO2)
> ?msleep
> data(msleep)
Using facets and groups
> CO2.plot <- ggplot(data = CO2, aes(x = conc, y = uptake, colour = Treatment)) +
geom_point() +
xlab("CO2 Concentration (mL/L)") +
ylab("CO2 Uptake (umol/m^2 sec)") +
ggtitle("CO2 uptake in grass plants")
> CO2.plot
Facets
Basic syntax:
> plot.object <- plot.object +
    facet_grid(rows ~ columns)
Example --> CO2 plot by Type:
> CO2.plot <- CO2.plot +
    facet_grid(. ~ Type)
> CO2.plot
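The rows ~ columns syntax can also take a variable on each side. The following is a small sketch, assuming we want Treatment in rows and Type in columns; grid.plot is a hypothetical name.

```r
library(ggplot2)

# Treatment defines the rows, Type defines the columns,
# giving a 2 x 2 grid of panels for the CO2 data.
grid.plot <- ggplot(data = CO2, aes(x = conc, y = uptake)) +
  geom_point() +
  facet_grid(Treatment ~ Type)
grid.plot
```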
Groups
Add a geom line:
> CO2.plot +
    geom_line()
Specify groups:
> CO2.plot <- CO2.plot + geom_line(aes(group = Plant))
> CO2.plot
Available elements
Credits: Software and Programmer Efficiency Group
Additional resources
> help(package = ggplot2)
http://docs.ggplot2.org
Challenge # 3
Explore a new geom and other plot elements with your own data or built in data
> ?msleep
> data(msleep)
> ?OrchardSprays
> data(OrchardSprays)
Saving plots in RStudio
Saving plots with code
ggsave() writes directly to your working directory in one line of code, and lets you specify the name of the file and the dimensions of the plot:
> ggsave("CO2 plot.pdf",
         CO2.plot,
         height = 8.5,
         width = 11,
         units = "in")
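A self-contained variant of the call above is sketched here. It writes to a temporary directory instead of the working directory, which is an assumption made only so the example leaves no files behind; out.file is a hypothetical name.

```r
library(ggplot2)

# Rebuild a simple version of the CO2 plot so the example stands alone
CO2.plot <- ggplot(data = CO2, aes(x = conc, y = uptake, colour = Treatment)) +
  geom_point()

# Save to a temporary location rather than the working directory
out.file <- file.path(tempdir(), "CO2_plot.pdf")
ggsave(out.file, plot = CO2.plot, height = 8.5, width = 11, units = "in")

# Confirm that the file was written
file.exists(out.file)
```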
Fine tuning: colours
> CO2.plot +
scale_colour_manual(values = c("nonchilled" = "red",
"chilled" = "blue"))
Fine tuning: scales
> CO2.plot +
scale_y_continuous(name = "CO2 uptake rate",
breaks = seq(5, 50, by = 10),
labels = seq(5, 50, by = 10),
trans = "log10")
Fine tuning: themes
Set theme:
> plot.object + theme()
Themes:
theme_grey(), theme_bw(), etc...
Edit your own theme:
> mytheme <- theme_bw() +
    theme(plot.title = element_text(colour = "red")) +
    theme(legend.position = c(0.9, 0.9))
> CO2.plot + mytheme
Fine tuning: RColorBrewer
> install.packages("RColorBrewer")
> library(RColorBrewer)
> basic.plot + scale_color_brewer(palette = "Dark2")
BONUS!
Fine tuning: à la Wes Anderson
Any Wes Anderson fans?
> devtools::install_github("karthik/wesanderson")
> library(wesanderson)
> basic.plot +
    scale_color_manual(values = wes_palette("GrandBudapest1", 3))
Fine tuning: ggthemes
https://github.com/jrnold/ggthemes
Themes:
Tufte
The Economist
Five Thirty Eight
> install.packages("ggthemes")
> library(ggthemes)
> ggplot(data = OrchardSprays, aes(x = treatment, y = decrease)) +
    geom_tufteboxplot() + theme_tufte()
https://github.com/karthik/wesanderson
Base R plotting
> plot(iris)
> ?plot
Similar to qplot syntax
Many powerful defaults
> iris.lm <- lm(Sepal.Length ~ Petal.Width,
                data = iris)
> plot(iris.lm)

BONUS TIP!
ggplot2 online GUI: http://rweb.stat.ucla.edu/ggplot2/
BONUS TIP!
Any ecologists? Any vegan users?
ggvegan can be used to build biplots in ggplot2

Note: great looking ordination biplots can be made with regular ggplot2 --> extract scores and build plots layer by layer!
Solution # 1
Produce a basic plot with the CO2 dataset

> qplot(data = CO2,
x = conc,
xlab = "CO2 Concentration (mL/L)",
y = uptake,
ylab = "CO2 Uptake (umol/m^2 sec)",
main = "CO2 uptake in grass plants")
Solution # 2
A colourful plot with loess smoothing from built in data
> CO2.plot <- ggplot(data = CO2, aes(x = conc, y = uptake, colour = Treatment)) +
    geom_point() +
    xlab("CO2 Concentration (mL/L)") +
    ylab("CO2 Uptake (umol/m^2 sec)") +
    ggtitle("CO2 uptake in grass plants") +
    geom_smooth(method = "loess")
> CO2.plot

Solution # 3
Explore a new geom and other plot elements with your own data or built in data:
> box.plot <- ggplot(data = OrchardSprays,
                     aes(x = treatment, y = decrease)) +
    geom_boxplot()
> box.plot
Intro - Data Formats
Use of long vs. wide data

WIDE-FORMAT
ID variable (e.g. site) | Variable 1     | Variable 2     | Variable 3
ID 1                    | Measured value | Measured value | Measured value
ID 2                    | Measured value | Measured value | Measured value
ID 3                    | Measured value | Measured value | Measured value

LONG-FORMAT
ID variable (e.g. site) | Factor     | Measured value
ID 1                    | Variable 1 | #
ID 1                    | Variable 2 | #
ID 1                    | Variable 3 | #
ID 2                    | Variable 1 | #
ID 2                    | Variable 2 | #
ID 2                    | Variable 3 | #
ID 3                    | Variable 1 | #
ID 3                    | Variable 2 | #
ID 3                    | Variable 3 | #

Long-format data has one column listing the variable names and another column holding the values of those variables.

Wide-format data has a separate column for each variable or factor in your study.

Wide-format data can be used for some basic plotting in ggplot2, but more complex plots require long-format (example to come!).

dplyr, lm(), glm(), gam() all require long-format data.
Exercises - ggplot and tidyr
grid.arrange() in the package gridExtra can be used to put the four boxplots (ozone.box, solar.box, temp.box, wind.box) into one figure.
> library(gridExtra)
> combo.box <- grid.arrange(ozone.box, solar.box, temp.box, wind.box, nrow = 2)
Note that the scales on the individual y-axes are not the same.
Exercises - ggplot and tidyr
You can continue using the wide-format of airquality to make individual plots of each variable showing day measurements for each month.
> ozone.plot <- ggplot(airquality, aes(x = Day, y = Ozone)) + geom_point() +
geom_smooth() + facet_wrap(~ Month, nrow = 2)

> solar.plot <- ggplot(airquality, aes(x = Day, y = Solar.R)) + geom_point() +
geom_smooth() + facet_wrap(~ Month, nrow = 2)

> wind.plot <- ggplot(airquality, aes(x = Day, y = Wind)) + geom_point() +
geom_smooth() + facet_wrap(~ Month, nrow = 2)

> temp.plot <- ggplot(airquality, aes(x = Day, y = Temp)) + geom_point() +
geom_smooth() + facet_wrap(~ Month, nrow = 2)
Can use grid.arrange() again (Ugly!)
> combo.facets <- grid.arrange(ozone.plot, solar.plot, wind.plot, temp.plot, nrow = 4)
Intro - dplyr
Required packages
> install.packages("dplyr")
> library(dplyr)
Data manipulation
Ever used any of these functions:
by()
sapply()
lapply()
do.call()
split()
merge()
rbind()
subset()
apply()
(... and cried a little?)
Intro - dplyr
The dplyr mission:
distill data manipulation tasks into a few easy verbs
make it fun to play with in RStudio
make it super wicked fast (C++)
make it scalable (plug into SQL)
Basic dplyr functions
These 4 functions tackle all the common manipulations we require when working with data frames:
select(): select columns from a data frame
filter(): filter rows according to defined criteria
arrange(): re-order data based on criteria
mutate(): create or transform values in a column
Data manipulation
What have you used in the past?
select()
selects variables (columns) of interest
> ?select
arguments:
data
column1
column2
...
select()
Example: suppose we are only interested in Ozone through time within "airquality"
> ozone <- select(airquality, Ozone, Month, Day)
> head(ozone)
filter()
filter rows that meet logical criteria
> ?filter
arguments:
data
logical statement 1
logical statement 2
...
filter()
Example: suppose we are only interested in running an analysis on the month of August during high temperature events
> august <- filter(airquality,
Month == 8,
Temp >= 90)
> head(august)
arrange()
re-orders data based on criteria, by default in ascending order (see desc() for descending)
> ?arrange
arguments:
data
1st column to be sorted
2nd column to be sorted
...
arrange()
Example: suppose we just imported a messy dataset
> air_mess <- sample_frac(airquality, 1)
> head(air_mess)
arrange()
Example: now let's sort it so that our dataset is in its original chronological order
> air_chron <- arrange(air_mess, Month, Day)
> head(air_chron)
mutate()
Create new variables or transform existing ones in a column
> ?mutate
arguments:
data
expression 1
expression 2
...
mutate()
Example: suppose we want to convert degrees F into degrees C, and add these to a new column
> airquality_C <- mutate(airquality,
Temp_C = (Temp-32)*(5/9))
> head(airquality_C)
magrittr
Usually data manipulation requires multiple steps.
The magrittr package offers a pipe operator (%>%) which allows us to link multiple operations.
magrittr
Required packages
> install.packages("magrittr")
> library(magrittr)
magrittr
Example: Suppose we are only interested in the month of June, and that our analysis requires degrees Celsius. Let's create the required data frame by combining 2 dplyr verbs we've learned:
> june_C <- mutate(
    filter(airquality, Month == 6),
    Temp_C = (Temp-32)*(5/9)
  )
As you can see, wrapping the functions one inside the other is awkward to read and write, as the order of operations starts on the inside and works its way out. Going step by step instead would be redundant and would write many intermediate objects to the workspace.
magrittr
Alternatively, we can use magrittr's pipe operator to link these successive operations:
> june_C <- airquality %>%
    filter(Month == 6) %>%
    mutate(Temp_C = (Temp-32)*(5/9))
Here the output on the LHS of the pipe becomes the input (first argument) of the function on the RHS of the pipe. This reads and writes in the same order as the operations.
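To check that piping only changes how the code reads, the nested and piped forms can be compared directly. This is a minimal sketch; the names nested and piped are illustrative.

```r
library(dplyr)  # dplyr re-exports the %>% pipe from magrittr

# Nested form: read from the inside out
nested <- mutate(filter(airquality, Month == 6),
                 Temp_C = (Temp - 32) * (5 / 9))

# Piped form: read from top to bottom.
# x %>% f(y) is evaluated as f(x, y).
piped <- airquality %>%
  filter(Month == 6) %>%
  mutate(Temp_C = (Temp - 32) * (5 / 9))

# Both forms produce the same data frame
identical(nested, piped)
```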
magrittr
Even for 2 steps, which is most elegant in your opinion?
> june_C <- mutate(
    filter(airquality, Month == 6),
    Temp_C = (Temp-32)*(5/9)
  )
OR
> june_C <- airquality %>%
    filter(Month == 6) %>%
    mutate(Temp_C = (Temp-32)*(5/9))
dplyr - groups and summaries
The true jedi nature of the dplyr package is revealed when multiple verbs are combined. dplyr also allows us to perform these operations on groups within the data and/or to aggregate information within groups:
group_by(): group data frame by a factor for downstream operations (usually summarise)
summarise(): summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. min(), max(), mean(), etc.)
dplyr: Split-Apply-Combine
The group_by() function is key here, in that it allows us to perform Split-Apply-Combine operations easily.
group_by() & summarise()
Example: suppose we are interested in the mean temperature and standard deviation within each month in the "airquality" dataset.
> month_sum <- airquality %>%
    group_by(Month) %>%
    summarise(mean_temp = mean(Temp),
              sd_temp = sd(Temp))
> head(month_sum)
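To make the Split-Apply-Combine idea concrete, the sketch below computes the monthly means by hand with base R and compares them with the group_by() and summarise() result. The object names are illustrative and not part of the workshop script.

```r
library(dplyr)

# Split: one vector of temperatures per month
temp.by.month <- split(airquality$Temp, airquality$Month)

# Apply and combine: take the mean of each piece, collect into one vector
manual.means <- sapply(temp.by.month, mean)

# The same three steps expressed with dplyr verbs
dplyr.means <- airquality %>%
  group_by(Month) %>%
  summarise(mean_temp = mean(Temp))

# Both approaches give the same monthly means
all.equal(unname(manual.means), dplyr.means$mean_temp)
```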
Challenge # 5
Using the ChickWeight dataset, create a summary table which displays the difference in weight between the maximum and minimum weight of each chick in the study. Employ dplyr verbs and the %>% operator.
Solution # 5
1. First we use group_by() to divide the dataset by "Chick"

2. Using summarise(), we can then calculate the weight gain within each group

> weight_gain <- ChickWeight %>%
group_by(Chick) %>%
summarise(weight_gain = max(weight) - min(weight))
> weight_gain
dplyr jedi ninja
Challenge # 6
Using the ChickWeight dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study. Employ dplyr verbs and the %>% operator.
HINT 1: first() and last() functions may be useful here
HINT 2: Within group_by(), the multiple groups create a layered onion, and each subsequent use of summarise() peels off the outer layer of the onion (the last variable listed within group_by()).
dplyr ninja jedi Solution
1. First we use group_by() to divide the dataset by "Diet" AND "Chick"
2. Using summarise(), we can then calculate the weight gain within each group (as before)
3. Finally, we use a second summarise (by "Diet", the remaining group) to calculate a mean

> diet_summ <- ChickWeight %>%
group_by(Diet, Chick) %>%
summarise(weight_gain = max(weight) - min(weight)) %>%
summarise(mean_gain = mean(weight_gain))
> diet_summ
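The first hint suggests first() and last(), while the solution above uses max() and min(). The two can disagree when a chick loses weight. The sketch below is one possible variant, not the workshop's official solution; the names chicks, gain_by_chick, range_gain, true_gain and diet_gain are hypothetical, and sorting by Time before grouping is an added safeguard.

```r
library(dplyr)

# Work on a plain data frame copy of the built-in dataset
chicks <- as.data.frame(ChickWeight)

gain_by_chick <- chicks %>%
  arrange(Chick, Time) %>%             # make sure rows are in time order
  group_by(Diet, Chick) %>%
  summarise(range_gain = max(weight) - min(weight),    # as in the solution
            true_gain  = last(weight) - first(weight), # end minus beginning
            .groups = "drop_last")     # keep the Diet grouping only

# Chick 18 is a case where the two measures differ
filter(gain_by_chick, Chick == "18")

# Mean end-minus-beginning gain for each diet
diet_gain <- gain_by_chick %>%
  summarise(mean_gain = mean(true_gain))
diet_gain
```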
Your dplyr journey
Today, we just scratched the surface of what you can accomplish in dplyr. Here is some more:

Joining tables: left_join(), inner_join(), etc...
plugging into SQL dbases
Check out the wiki for some more teaching resources to discover new ways of manipulating your data
Other methods:
> ?pdf
> ?jpeg
# Note that vector formats (e.g. pdf, svg, etc.) are often preferable to raster formats (jpeg, png, etc.).

tidyr - tidying your data
Tidying allows you to manipulate the structure of your data while preserving all original information.
"gather" data: convert from wide to long-format
"spread" data: convert from long to wide-format
tidyr - Installation
Required packages:
> install.packages("tidyr")
> library(tidyr)
gather() columns into rows
Function arguments:
dataset
name of new column containing variable names (A)
name of new column containing variable values (B)
names of columns we wish to gather (C)
[Figure: a wide-format table converted to long format, with A, B and C marking the columns described above]
Example: tree dimensions
Let's pretend you send out your field assistant to measure the diameter at breast height (DBH) and height of three tree species for you. The result is this messy ("wide") data set:
> messy <- data.frame(Species = c("Oak", "Elm", "Ash"),
                      DBH = c(12, 20, 13),
                      Height = c(56, 85, 55))
gather() - example
Transform the "messy" dataset into long format:
> messy.long <- gather(messy, Measurement, cm, DBH, Height)
spread() rows into columns
Function arguments:
dataset
name of column containing variable names (A)
name of column containing variable values (B)
[Figure: a long-format table converted to wide format, with A and B marking the columns described above]
spread() - example
Transform the "messy.long" dataset into wide format:
> messy.wide <- spread(messy.long, Measurement, cm)
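In tidyr 1.0.0 and later, gather() and spread() still work but are superseded by pivot_longer() and pivot_wider(). The sketch below repeats the tree example with the newer functions; it is an optional alternative rather than part of the original workshop, and the names messy.long2 and messy.wide2 are hypothetical.

```r
library(tidyr)

# Same wide-format tree data as in the example above
messy <- data.frame(Species = c("Oak", "Elm", "Ash"),
                    DBH = c(12, 20, 13),
                    Height = c(56, 85, 55))

# Wide to long: counterpart of gather(messy, Measurement, cm, DBH, Height)
messy.long2 <- pivot_longer(messy,
                            cols = c(DBH, Height),
                            names_to = "Measurement",
                            values_to = "cm")

# Long to wide: counterpart of spread(messy.long, Measurement, cm)
messy.wide2 <- pivot_wider(messy.long2,
                           names_from = Measurement,
                           values_from = cm)

messy.long2
messy.wide2
```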
Challenge # 4
Using the "airquality" dataset, gather() all the columns (except Month and Day) into rows. Then spread() the resulting dataset to return the same data format as the original data:

> ?airquality
> data(airquality)
Solution # 4
> air.long <- gather(airquality, variable, value, -Month, -Day)
> head(air.long)
# Note that the syntax used here indicates that we wish to gather ALL the columns except "Month" and "Day"; it is equivalent to writing: "Ozone, Solar.R, Temp, Wind"
> air.wide <- spread(air.long, variable, value)
> head(air.wide)
separate() columns
Function arguments:
data: dataset
col: name of column we wish to separate (A)
into: names of new columns (B, C, D)
sep: character which indicates where to separate
[Figure: column A split into the new columns B, C and D]
separate() - example
Create a fictional dataset about fish and plankton:
> set.seed(8)
> really.messy <- data.frame(id = 1:4,
trt = sample(rep(c('control', 'farm'), each = 2)),
zooplankton.T1 = runif(4),
fish.T1 = runif(4),
zooplankton.T2 = runif(4),
fish.T2 = runif(4))
separate() - example
First we need to convert the "really.messy" dataset into long format:
> really.messy.long <- gather(really.messy, taxa, count, -id, -trt)
> head(really.messy.long)
separate() - example
Then, if we are interested in the time variable (T1 and T2), we can separate it into its own column:
> really.messy.long.sep <- separate(really.messy.long, taxa, into = c("species", "time"), sep = "\\.")
> head(really.messy.long.sep)
# Note that we are using "\\." instead of "." because the period is a wild card in regular expressions, which is how sep is interpreted here
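The effect of escaping the period can be checked with base R's strsplit(), which also treats its split argument as a regular expression by default. This is a small sketch using one of the column names from the fictional dataset; the object names are illustrative.

```r
# An unescaped "." matches any character, so every character is a split point
unescaped <- strsplit("zooplankton.T1", ".")[[1]]
unescaped

# Escaping the period makes it match a literal "."
escaped <- strsplit("zooplankton.T1", "\\.")[[1]]
escaped

# fixed = TRUE turns off regular expressions altogether
literal <- strsplit("zooplankton.T1", ".", fixed = TRUE)[[1]]
literal
```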
Exercises - ggplot and tidyr
head(airquality)
Inspection of "airquality" reveals it is in wide format: the variables Ozone, Solar.R, Wind and Temp each have their own column.
Let's use ggplot2 to create plots for each variable:
fMonth <- factor(airquality$Month) # Convert the "Month" variable to a factor.

ozone.box <- ggplot(airquality, aes(x = fMonth, y = Ozone)) + geom_boxplot()
solar.box <- ggplot(airquality, aes(x = fMonth, y = Solar.R)) + geom_boxplot()
temp.box <- ggplot(airquality, aes(x = fMonth, y = Temp)) + geom_boxplot()
wind.box <- ggplot(airquality, aes(x = fMonth, y = Wind)) + geom_boxplot()
Exercises - ggplot and tidyr
Convert airquality to long-format to use facet_wrap() for the different variables as opposed to by month.
> air.long <- gather(airquality, variable, value, -Month, -Day)
> air.wide <- spread(air.long, variable, value)
Use the long-format data to make some plots in ggplot2.
fMonth.long <- factor(air.long$Month)
weather <- ggplot(air.long, aes(x = fMonth.long, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, nrow = 2)
Compare the "weather" plot with "combo.box".

Exercises - ggplot and tidyr
The variables in the weather plot are on the same scale, which makes it difficult to see the variability in "Wind" or "Temp". We can free the y axis in each panel as follows:
weather <- weather + facet_wrap(~ variable, nrow = 2, scales = "free")
Exercises - ggplot and tidyr
We can also use the long format data (air.long) to create a plot with all the variables included on a single plot:
meteo2 <- ggplot(air.long, aes(x = Day, y = value, colour = variable)) +
geom_point() +
facet_wrap(~ Month, nrow = 1)
# Swap Month and Day in the code above to observe the changes
summarise()
Create summaries of data using aggregating functions

# Note that the difference between the max and the min weight doesn't necessarily correspond to the weight gain between the beginning and the end of the experiment. Inspect Chick #18 in the original data!!!!