QCBS R Workshop 2

by CSBQ QCBS on 16 September 2016

Transcript of QCBS R Workshop 2

You can simply type the path of the working directory using setwd('path'). For example:






Alternatively:


Quebec Centre for Biodiversity Science

R Workshop Series

Workshop 2:
Loading and manipulating data

Website: http://qcbs.ca/wiki/r/workshop2

The Script (s)
Harder Challenge
This is probably what your data or the data you downloaded looks like
You can fix the data frame in R (or not…)
Please give it a try before looking at the script provided
Work with your neighbors and have FUN!

It is possible to do all your data preparation work within R
Saves time for large datasets
Keeps original data intact
Can switch between long and wide easily (more on this in workshop 4)
Useful page:
https://www.zoology.ubc.ca/~schluter/R/data/
Preparing data for R
Importing Data
Notice that RStudio now provides information on the CO2 data in your workspace
Working Directory
Commands & comments
annotating someone’s script is a good way to learn
remember what you did
tell collaborators what you did
good step towards reproducible science
Create an R script
R Scripts

You can download the data & script from:

http://qcbs.ca/wiki/r/workshop2

Save the files somewhere convenient on your computer and remember the path to the directory that contains the files.


Download today’s data
1. Writing a script
Learning Objectives
CO2 <- read.csv("CO2_broken.csv",
sep = "",
skip = 2,
na.strings = c("NA","na","cannot_read_notes"))

str(CO2)
?read.csv
head()
str()
class()
unique()
levels()
which()
droplevels()
Fixing CO2_broken
Try to load, explore, plot, and save your own data in R*
If it doesn’t load properly, try to make the appropriate changes
Remember to clear your workspace first!
When you are finished, try opening your exported data in Excel
Use your data
Column values match their intended use
No text in numeric columns
do not include spaces!
NA (not available) is allowed. Blank entries in numeric columns are automatically read as NA, so there is no need to type NA yourself.
Avoid numeric values for data that does not have numeric meaning
Subject, Replicate, Treatment: 1,2,3 -> A,B,C or S1,S2,S3 or …
Preparing data for R
Short informative column headings
starting with a letter
no spaces
Preparing data for R
You can make a section heading using four # signs
Section Headings
Housekeeping
The first command at the top of all scripts should be: rm(list = ls()). This:
Broken data
head(CO2)
Broken data
.csv

Comma-separated files (.csv) in the Data folder
can be created from almost all applications (Excel, LibreOffice, GoogleDocs)
file -> save as csv…
Preparing data for R
write.csv(CO2, file = "CO2_new.csv")
CO2: object (name); "CO2_new.csv": file to write (name)
Exporting data
CO2 <- read.csv("CO2_good.csv", header = FALSE)
Looking at Data
data(), head(), str(), names(), attributes(), summary(), ncol(), nrow()
Looking at Data
CO2 # look at the whole dataframe
head(CO2) # look at the first few rows
names(CO2) # names of the columns in the dataframe
attributes(CO2) # attributes of the dataframe
ncol(CO2) # number of columns
nrow(CO2) # number of rows
summary(CO2) # summary statistics
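The inspection functions listed here can be tried without any file on disk. The sketch below assumes that the CO2 data set shipped with R (in the datasets package) has the same layout as the workshop's CO2_good.csv; the object name CO2_builtin is only an illustration and is not part of the workshop script.

```r
# Assumption: the built-in CO2 data set stands in for CO2_good.csv
CO2_builtin <- as.data.frame(datasets::CO2)

head(CO2_builtin)            # first few rows
names(CO2_builtin)           # column names
n_cols <- ncol(CO2_builtin)  # number of columns
n_rows <- nrow(CO2_builtin)  # number of rows
str(CO2_builtin)             # structure: type of each column
summary(CO2_builtin)         # summary statistics per column
```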
Housekeeping
# Clear R workspace
rm(list = ls())
?rm
?ls
CO2<-read.csv("CO2_broken.csv", sep = "", skip = 2)
head(CO2)
no notes!
no additional headings!
no merged cells!
Preparing data for R
Saving your Workspace
Save
# Reload your data
load("CO2_project_Data.RData")
head(CO2) # looking good!
(Run object name to see output)
conc_mean <- mean(CO2$conc)
conc_mean: new object (name); mean(): function; CO2: data frame (name); conc: column (name)
Try with standard deviation as well:
Data exploration
New object in R
File name within quotation marks ('file' or "file")
Importing Data
?read.csv
CO2 <- read.csv("CO2_good.csv", header = TRUE)
What is this?
A text file that contains all of the commands that you will use

Once written and saved, your script file allows you to make changes and re-run analyses with minimal effort!
Just highlight text and click "run" or press command+enter (Mac) or ctrl+enter (PC)
Create an R script
Clearing the workspace
Clears R memory
Prevents errors such as use of older data
Demo – add some test data to R and then see how rm(list=ls()) removes it
commenting/documenting
The '#' symbol in a script tells R to ignore anything remaining on this line of the script when running commands
# This is a comment not a command
Allows you to move quickly between sections and hide sections
#### Heading name ####
Tells R where your scripts and data are. You need to set the right working directory to load a data file.
Type in the console to see your working directory:
getwd()
When you launch RStudio by opening a script file, the working directory is usually set to the folder containing that script; check with getwd() rather than assuming.
A "/" separates folders and file names in a path.
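A minimal sketch of checking and changing the working directory. The folder name QCBS_R_demo is hypothetical and is created inside R's temporary directory so that nothing on your computer is touched; file.path() builds the path with "/" separators.

```r
old_wd <- getwd()                                  # remember where we started

# Hypothetical demo folder inside the session's temporary directory
demo_dir <- file.path(tempdir(), "QCBS_R_demo")
dir.create(demo_dir, showWarnings = FALSE)

setwd(demo_dir)                                    # move to the demo folder
getwd()                                            # confirm the change

setwd(old_wd)                                      # go back to the original folder
```

Restoring the original directory at the end keeps the rest of the session unaffected.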
Import data into R using read.csv:
Recall:
to find out what arguments the function requires, use help “?”
Load the data with:
Calculate and save the mean of one of your columns
conc_mean
sd()
# Saving an R workspace file
save.image(file="CO2_project_Data.RData")
# Clear your memory
rm(list = ls())
Clear
Reload
.csv
*If you don’t have your own data, work with your neighbour
Read in the file "CO2_broken.csv"
HINT:
There are 4 problems!
Some useful functions:
look at some of the options for how to load a .csv
ERROR 1
ERROR 2
The data appears to be lumped into one column
> head(CO2)
NOTE..It.rain.a.lot.in.Quebec.during.sampling due.to.excessive X X.1 X.2 X.3
1 falling on my notebook numerous values can't be read rain NA NA NA NA NA
2 Plant\tType\tTreatment\tconc\tuptake NA NA NA NA NA
3 Qn1\tQuebec\tnonchilled\t95\t16 NA NA NA NA NA
4 Qn1\tQuebec\tnonchilled\t175\t30.4 NA NA NA NA NA
5 Qn1\tQuebec\tnonchilled\t250\tcannot_read_notes NA NA NA NA NA
6 Qn1\tQuebec\tnonchilled\t350\t37.2 NA NA NA NA NA
The data does not start until the third line of the file, so you end up with notes on the file as the headings.
"conc" and "uptake" variables are considered factors instead of numbers, because there are comments in the numeric columns
'data.frame': 84 obs. of 5 variables:
$ Plant : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
$ Type : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Treatment: Factor w/ 4 levels "chiled","chilled",..: 4 4 4 4 4 4 4 4 4 3 ...
$ conc : Factor w/ 8 levels "1000","175","250",..: 7 2 3 4 5 8 1 7 2 3 ...
$ uptake : Factor w/ 76 levels "10.5","10.6",..: 15 39 76 54 50 61 63 9 32 53 ...
str(CO2)
Due to missing values entered as "cannot_read_notes" and "na"
Recall that R only recognizes "NA" (capital)
Broken data
There are only two treatments (chilled and nonchilled) but there are spelling errors causing it to look like 4 different treatments.
$ Treatment: Factor w/ 4 levels "chiled","chilled",..: 4 4 4 4 4 4 ...
str(CO2)
levels(CO2$Treatment)
unique(CO2$Treatment)
[1] nonchilled nnchilled chilled chiled
Levels: chiled chilled nnchilled nonchilled
CO2$Treatment[which(CO2$Treatment == "nnchilled")] <- "nonchilled"
Broken data
Identify all rows that contain "nnchilled" and replace with "nonchilled"
Identify all rows that contain "chiled" and replace with "chilled"
CO2$Treatment[which(CO2$Treatment == "chiled")] <- "chilled"
Header
It is recommended that you start your script with a header using comments:
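One possible layout for such a header, as a sketch only. The field names and bracketed values are placeholders to replace with your own, and the section names are hypothetical; the workshop's own script may use a different layout.

```r
## QCBS R Workshop 2: loading and manipulating data
## Author: (your name)
## Date: (date the script was written)
## Description: (one line on what the script does)

#### Housekeeping ####
# Commands that prepare the session go in this section

#### Loading data ####
# Every line starting with '#' is ignored when the script runs
header_demo <- "only this line is executed"
```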
A<-"Test"
A
rm(list=ls())
A
Remember
R is ready for commands when you see the chevron '>' in the console. If you don't see it, press ESC
R is case sensitive!
Setting working directory
setwd('/Users/vincentfugere/Desktop/QCBS_R')
setwd(choose.dir())
2. Loading, exploring and saving data
3. Fixing a broken data frame

Important: if your operating system or CSV editor (e.g. Excel) is set to French, use read.csv2, which expects ";" as the separator and "," as the decimal mark
?read.csv2
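A small sketch of what read.csv2 handles. The file fr_demo.csv and its three lines are invented for illustration; they mimic a CSV saved by a French-locale editor, with ";" between values and "," as the decimal mark.

```r
# Hypothetical file written the way a French-locale editor would save it
fr_file <- file.path(tempdir(), "fr_demo.csv")
writeLines(c("Plant;conc;uptake",
             "Qn1;95;16,0",
             "Qn1;175;30,4"),
           fr_file)

fr_data <- read.csv2(fr_file)  # sep = ";" and dec = "," are the defaults
str(fr_data)                   # uptake should be numeric, not text
```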
Tells R that first line contains column names not data
Looking at Data
str(CO2)
structure of the dataframe

Data exploration
Plot of all variable combinations (very useful!)
plot(CO2)
Want to see if one of your variables is normally distributed? Try hist()
hist(CO2$uptake)
*Recall from workshop 1 that the dollar sign is used to extract a column from a data frame.
Check data types with str() again. What is wrong here? (Don't forget to re-load data with header = T afterward)
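To see the effect of the header argument in isolation, here is a sketch on an invented two-column file (header_demo.csv is not part of the workshop data). With header = FALSE the column names are read as the first row of data, so the numeric columns are no longer numeric.

```r
# Hypothetical miniature file with a header line and two data rows
demo_file <- file.path(tempdir(), "header_demo.csv")
writeLines(c("conc,uptake",
             "95,16.0",
             "175,30.4"),
           demo_file)

no_header   <- read.csv(demo_file, header = FALSE)  # names treated as data
with_header <- read.csv(demo_file, header = TRUE)   # names treated as names

str(no_header)    # columns V1 and V2 hold text, including "conc" and "uptake"
str(with_header)  # columns conc and uptake hold numbers
```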
Very useful to check data type (mode) of all columns to make sure R loaded data properly. Common problems:

Factors loaded as text (character) or vice versa
Factor includes too many levels because of typo
Data (integer or numeric) is loaded as character because of typo (e.g. a space or a "," instead of a "." for decimal numbers)
?apply
Data exploration
Use apply() to calculate the means of the last two columns of the data frame (i.e. the columns that contain continuous data)
?apply
Data exploration
apply(CO2[,4:5], MARGIN = 2, FUN = mean)
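The MARGIN argument decides whether the function is applied to columns (2) or rows (1). The sketch below assumes the CO2 data set built into R matches the workshop file; CO2_builtin, col_means and row_means are illustrative names.

```r
# Assumption: the built-in CO2 data set stands in for CO2_good.csv
CO2_builtin <- as.data.frame(datasets::CO2)

# MARGIN = 2: one mean per column (conc and uptake)
col_means <- apply(CO2_builtin[, 4:5], MARGIN = 2, FUN = mean)
col_means

# MARGIN = 1: one mean per row, shown here for the first three rows only
row_means <- apply(CO2_builtin[1:3, 4:5], MARGIN = 1, FUN = mean)
row_means
```

Row means of conc and uptake have no biological meaning; they are computed only to show what MARGIN = 1 does.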
Note: the CO2 dataset includes repeated measurements of CO2 uptake from six plants from Quebec and six plants from Mississippi at several levels of ambient CO2 concentration. Half the plants of each type were chilled overnight before the experiment was conducted.
CO2 <- read.csv("CO2_broken.csv",sep = "")
Broken data
ERROR 1 - Solution
Re-import the data, but specify the separation among entries
The sep argument tells R what character separates the values on each line of the file
Here, a tab was used as the separator instead of ","; sep = "" tells R to split on any whitespace, including tabs
Header
Solution
Skip two lines when loading the file using the "skip" argument:
ERROR 3
Broken data
?read.csv
Broken data
ERROR 3 - Solution
Tell R that "NA", "na", and "cannot_read_notes" should all be treated as NA. Because every other value in those columns is a number, $conc and $uptake will then be loaded as numeric/integer.
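The three read.csv arguments (sep, skip, na.strings) can be tried together on a miniature file. The file mini_broken.csv and its contents are invented for this sketch; they only imitate the problems described for CO2_broken.csv (two note lines, tab-separated values, and text in numeric columns).

```r
# Hypothetical miniature of a broken file: 2 note lines, then tab-separated data
broken_lines <- c(
  "NOTE: it rained a lot during sampling",
  "some values could not be read",
  "Plant\tType\tTreatment\tconc\tuptake",
  "Qn1\tQuebec\tnonchilled\t95\t16",
  "Qn1\tQuebec\tnonchilled\t175\tcannot_read_notes",
  "Qn1\tQuebec\tnonchilled\tna\t34.8"
)
broken_file <- file.path(tempdir(), "mini_broken.csv")
writeLines(broken_lines, broken_file)

mini <- read.csv(broken_file,
                 sep = "",     # split on whitespace (tabs here)
                 skip = 2,     # ignore the two note lines
                 na.strings = c("NA", "na", "cannot_read_notes"))
str(mini)  # conc and uptake should now be numbers with NA where values were unreadable
```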
ERROR 4
ERROR 4 - Solution
Drop unused levels from factor
CO2 <- droplevels(CO2); str(CO2)
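Why droplevels() is needed can be seen on a toy factor. This sketch uses an invented five-value Treatment column rather than the workshop file: after the misspelled entries are replaced, the misspelled levels are still registered in the factor until they are dropped.

```r
# Hypothetical toy data with the same spelling errors as the broken file
toy <- data.frame(
  Treatment = factor(c("nonchilled", "nnchilled", "chilled", "chiled", "chilled"))
)

# Replace the misspelled entries with the correct spelling
toy$Treatment[which(toy$Treatment == "nnchilled")] <- "nonchilled"
toy$Treatment[which(toy$Treatment == "chiled")] <- "chilled"

levels_before <- levels(toy$Treatment)  # still lists the misspelled levels
toy <- droplevels(toy)                  # remove levels that are no longer used
levels_after <- levels(toy$Treatment)

levels_before
levels_after
```

If the column was loaded as text (character) rather than as a factor, the replacement step works the same way and there are no levels to drop.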
Fixed!
head(CO2)
Use apply() to calculate the means of the last two columns of data frame (i.e. the columns that contain data)
Note that the file will be created in your working directory.
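A sketch of exporting and then re-reading a data frame to confirm the export worked. It assumes the built-in CO2 data set as a stand-in for the workshop data, and it writes to R's temporary directory instead of the working directory so that no file is left behind; row.names = FALSE is an optional choice that avoids an extra column of row numbers.

```r
# Assumption: the built-in CO2 data set stands in for the workshop data
CO2_builtin <- as.data.frame(datasets::CO2)

# Write to a temporary location (the slides write to the working directory)
out_file <- file.path(tempdir(), "CO2_new.csv")
write.csv(CO2_builtin, file = out_file, row.names = FALSE)

# Read the exported file back in and check its dimensions
CO2_check <- read.csv(out_file)
dim(CO2_check)
```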
CO2 <- read.csv("CO2_broken.csv")
head(CO2)
# looks messy...
CO2
# indeed!
The workspace refers to all the objects that you create during an R session.
Mean concentration is 435 and standard deviation is 295.92!
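These two values can be reproduced as follows, assuming the CO2 data set built into R holds the same conc values as CO2_good.csv (CO2_builtin is an illustrative name):

```r
# Assumption: the built-in CO2 data set stands in for CO2_good.csv
CO2_builtin <- as.data.frame(datasets::CO2)

conc_mean <- mean(CO2_builtin$conc)  # mean of the conc column
conc_sd   <- sd(CO2_builtin$conc)    # sample standard deviation

conc_mean
round(conc_sd, 2)
```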
setwd('C:/Users/Johanna/Documents/PhD/R_Workshop2')
This opens a pop-up window to navigate to the appropriate directory (choose.dir() is only available on Windows, so it will not work on a Mac).
You can also set your working directory from the RStudio menu: Session > Set Working Directory > Choose Directory
Thank you for attending, and please provide us with feedback to help us improve the workshop series!
Feedback:
http://tinyurl.com/QCBS-R-Comment

Display contents of the directory
You can display contents of the working directory using dir().
dir()
[1] "co2_broken.csv" "co2_good.csv" "script_workshop2.r"
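The same listing can be checked in code before loading a file. This sketch creates an empty file called co2_good.csv in a hypothetical demo folder and then tests two spellings against the output of dir(); matching is case sensitive.

```r
# Hypothetical demo folder containing one empty file
demo_dir <- file.path(tempdir(), "dir_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "co2_good.csv"))

files <- dir(demo_dir)        # list the contents of the demo folder

"co2_good.csv" %in% files     # exact spelling is found
"CO2_good.csv" %in% files     # different capitalisation is not found
```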
It helps to:
Check that the file you plan to open is present in the folder that R is currently working in
Check for correct spelling (e.g. "CO2_good.csv" instead of "co2_good.csv")
Reminder from Workshop 1: Accessing data
A data frame called mydata
> mydata[1,] # Extracts the first row
> mydata[,1] # Extracts the first column
> mydata[2,3] # Extracts the content of row 2 / column 3
> mydata$Variable1 # Also extracts the first column
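A runnable version of this reminder, using an invented three-row data frame. The column names Variable2 and Variable3 and all values are hypothetical; only mydata and Variable1 come from the slide.

```r
# Hypothetical data frame for the indexing reminder
mydata <- data.frame(
  Variable1 = c(10, 20, 30),
  Variable2 = c("a", "b", "c"),
  Variable3 = c(1.5, 2.5, 3.5)
)

first_row <- mydata[1, ]        # the first row (still a data frame)
first_col <- mydata[, 1]        # the first column, by position
cell_2_3  <- mydata[2, 3]       # the value in row 2, column 3
same_col  <- mydata$Variable1   # the first column, by name
```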