Basic Data Analysis using R
C. Tobin Magle, PhD
02-08-2017
10:00-11:00 a.m.
Morgan Library Computer Classroom 175
Based on http://www.datacarpentry.org/R-ecology-lesson/
Outline
• Intro to R and RStudio
• Operators and functions
• Data Frames
• Factors
What is R? RStudio?
• R – a programming language + software that interprets it
• RStudio – popular software to write R scripts and interact with the R software
• Need both
Why learn R
• Research Reproducibility
• Widely used, 10000+ “packages”
• Works on many data types
• Produces high-quality graphics
• Free, open source, cross platform
RStudio Interface
Set up a working directory
• Start RStudio
• File > New project > New directory > Empty project
• Enter a name for this new folder and choose a convenient location for it (working directory)
• Click on “Create project”
• Create a data folder in your working directory
• Create a new R script (File > New File > R script) and save it in your working directory
Organize your working directory
Script vs console
• Both accept commands
• Console: runs the commands
  • Doesn’t save*
• Script: commands you want to save for later
  • These commands need to be sent to the console to be run
  • Ctrl-enter to send from script to console
The assignment operator
• The command 5+5 yields the answer 10
  • Prints 10 to the console
  • But does not save the number 10 anywhere
• The assignment operator saves values into objects
  • <object> <- <value>
  • weight_kg <- 55
• Keyboard shortcut for the assignment operator: Alt + - (Alt and the minus key)
You can do math on variables
• Example: conversion from kg to lb
• 2.2* weight_kg
• weight_lb <- 2.2*weight_kg
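The assignment and conversion steps above can be sketched as a short script. The second assignment to weight_kg is an added illustration (not from the slides) showing that objects computed earlier are not updated automatically.

```r
# Assign a value to an object; nothing is printed on assignment
weight_kg <- 55

# Use the object in a calculation and store the result
weight_lb <- 2.2 * weight_kg
print(weight_lb)   # 121

# Reassigning weight_kg does not change weight_lb;
# the conversion would have to be run again
weight_kg <- 100
print(weight_lb)   # still 121
```

Typing an object's name on its own line in the console prints its value, which is a quick way to check what an assignment stored.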
Functions and arguments
• Functions are canned scripts
  • Predefined, from packages, or “home-made”
• Accept arguments (input)
• Return a value (output)
• Examples: sqrt, round
  • args(round) shows the arguments a function accepts
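A minimal sketch of calling the two example functions, with and without an optional argument. The input values are arbitrary numbers chosen for illustration.

```r
# One argument in, one value out
sqrt(16)                     # 4

# round() uses its default of zero decimal places
round(3.14159)               # 3

# Naming the optional argument makes the call easier to read
round(3.14159, digits = 2)   # 3.14

# Arguments can also be matched by position
round(3.14159, 2)            # 3.14

# List the arguments that round() accepts
args(round)
```

Naming arguments is optional when they are given in the order shown by args(), but named arguments can be supplied in any order.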
Working with data
• Store tables in a type of object called a “data frame”
  • Rows = observations
  • Cols = variables
• Can download using download.file
  • download.file("https://ndownloader.figshare.com/files/2292169", "data/portal_data_joined.csv")
• Read data using read.csv function
  • surveys <- read.csv('data/portal_data_joined.csv')
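download.file typically fails if the destination folder does not exist yet, so the data folder has to be created first. The sketch below shows the same write-then-read pattern without needing a network connection. It uses a temporary folder and a made-up three-row table (toy, toy_surveys.csv, and all values are hypothetical stand-ins, not the real survey data).

```r
# Hypothetical toy table standing in for the real survey file
toy <- data.frame(
  record_id  = 1:3,
  species_id = c("NL", "DM", "PF"),
  weight     = c(NA, 40, 7)
)

# Create the destination folder before writing or downloading into it
# (a temporary folder here; "data" in the workshop project)
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)

# Write the table to a CSV file, then read it back as a data frame
csv_path <- file.path(data_dir, "toy_surveys.csv")
write.csv(toy, csv_path, row.names = FALSE)

toy_read <- read.csv(csv_path)
print(toy_read)
```

In the workshop project the equivalent step would be creating the data folder inside the working directory before calling download.file.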
Inspecting data frames
• head(surveys) = look at first 6 rows (all columns)
• str(surveys) = structure: number of rows and columns, data types
• nrow(surveys) = number of rows
• ncol(surveys) = number of columns
• names(surveys) = column names
• summary(surveys) = summary statistics for each column
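The inspection functions above can be tried on a small data frame built by hand. The table below is a made-up stand-in (the name toy and all values are hypothetical); the same calls work on surveys once it has been read in.

```r
# Hypothetical four-row table; column names mimic the survey data
toy <- data.frame(
  record_id  = 1:4,
  species_id = c("NL", "DM", "DM", "PF"),
  sex        = c("M", "F", "", "M"),
  weight     = c(NA, 40, 48, 7),
  stringsAsFactors = TRUE
)

head(toy, 2)    # first 2 rows (head() shows 6 by default)
str(toy)        # dimensions and the type of each column
nrow(toy)       # 4
ncol(toy)       # 4
names(toy)      # column names
summary(toy)    # per-column summary statistics
```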
Subsetting (Using brackets)
• Row, column format: surveys[row, column]
  • surveys[1,2] = first row, second column
• Leave it blank = surveys[,column]
  • surveys[1,] = first row, all columns
  • surveys[,1] = first column, all rows
• Ranges = surveys[range, column]
  • surveys[1:3, 7] = rows 1-3, 7th column
By column name
• surveys["species_id"] # Result is a data.frame
• surveys[, "species_id"] # Result is a vector
• surveys[["species_id"]] # Result is a vector
• surveys$species_id # Result is a vector
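A short sketch of the bracket and name-based subsetting forms, run on a made-up table (toy and its values are hypothetical) so the results are small enough to check by eye.

```r
# Hypothetical four-row table with three columns
toy <- data.frame(
  record_id  = 1:4,
  species_id = c("NL", "DM", "DM", "PF"),
  weight     = c(NA, 40, 48, 7),
  stringsAsFactors = FALSE
)

toy[1, 2]      # "NL": first row, second column
toy[1, ]       # first row, all columns (a one-row data frame)
toy[, 1]       # 1 2 3 4: first column, all rows (a vector)
toy[1:3, 3]    # NA 40 48: rows 1-3 of the third column

# Subsetting by column name: one form keeps the data frame, three return a vector
class(toy["species_id"])      # "data.frame"
class(toy[, "species_id"])    # "character"
identical(toy[["species_id"]], toy$species_id)   # TRUE
```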
Factors
• Represent categorical data
• Can be ordered or unordered
• Critical for stats and plotting
• Stored as integers with text labels – be careful!
• By default, levels are sorted alphabetically by their text labels
• Functions
  • sex <- factor(c("male", "female", "female", "male"))
  • levels(sex)
  • nlevels(sex)
  • sex <- factor(sex, levels = c("male", "female")) # set the level order explicitly
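The effect of level ordering is easiest to see by looking at the integer codes behind the labels. The as.integer calls are an added illustration, not part of the slides.

```r
sex <- factor(c("male", "female", "female", "male"))

levels(sex)       # "female" "male"  (alphabetical by default)
nlevels(sex)      # 2
as.integer(sex)   # 2 1 1 2  (the codes stored behind the labels)

# Re-create the factor with an explicit level order
sex <- factor(sex, levels = c("male", "female"))

levels(sex)       # "male" "female"
as.integer(sex)   # 1 2 2 1  (same data, different codes)
```

The labels are unchanged by reordering; only the underlying codes and the order used in plots and summaries differ.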
Converting factors
• With characters
  • as.character(sex)
• With numbers
  • f <- factor(c(1990, 1983, 1977, 1998, 1990))
  • as.numeric(f) # wrong! Returns the integer codes, and there is no warning...
  • as.numeric(as.character(f)) # works...
  • as.numeric(levels(f))[f] # The recommended way.
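Running the three conversions side by side shows why the first one is wrong: it returns each value's position among the sorted levels rather than the year itself.

```r
f <- factor(c(1990, 1983, 1977, 1998, 1990))

levels(f)                      # "1977" "1983" "1990" "1998"

# Returns the integer codes, not the years
as.numeric(f)                  # 3 2 1 4 3

# Convert to text first, then to numbers
as.numeric(as.character(f))    # 1990 1983 1977 1998 1990

# Convert the levels once, then index them by the codes
as.numeric(levels(f))[f]       # 1990 1983 1977 1998 1990
```

The last form converts only the unique levels rather than every element, which is why it is usually preferred for long vectors.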
Example: plotting factors
• plot(surveys$sex)
• What’s with the unlabeled bar?
Renaming levels
• Label missing values
  • sex <- surveys$sex # subset the column
  • head(sex) # look at first 6 records
  • levels(sex) # look at the factor levels
  • levels(sex)[1] <- "missing" # change the first label to “missing”
  • levels(sex) # look at factor levels again
  • head(sex) # see where missing values were
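The same relabeling can be sketched on a small hand-made factor in which missing entries are stored as empty strings. The values below are made up for illustration, and treating empty strings as the missing marker is an assumption about how the survey column is encoded.

```r
# Hypothetical sex column; "" marks a record with no value recorded
sex <- factor(c("M", "F", "", "M", ""))

levels(sex)                    # "" "F" "M"  (the empty label sorts first)

# Rename the first level; every element with that level is relabeled
levels(sex)[1] <- "missing"

levels(sex)                    # "missing" "F" "M"
as.character(sex)              # "M" "F" "missing" "M" "missing"
table(sex)                     # counts per level, as plot(sex) would draw them
```

Because the empty label sorts first, it is level 1, which is why indexing levels(sex)[1] targets it.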
What if you don’t want to use factors?
• Argument: stringsAsFactors = FALSE (this is the default in R 4.0.0 and later)
## Compare the difference between when the data are being read as
## `factor`, and when they are being read as `character`.
surveys <- read.csv("data/portal_data_joined.csv", stringsAsFactors = TRUE)
str(surveys)
surveys <- read.csv("data/portal_data_joined.csv", stringsAsFactors = FALSE)
str(surveys)
## Convert the column "plot_type" into a factor
surveys$plot_type <- factor(surveys$plot_type)
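The comparison can also be run without the downloaded file by passing CSV text directly to read.csv through its text argument. The three rows below are made-up toy values (csv_text, as_factor, and as_char are hypothetical names), so this is only a sketch of the behavior.

```r
# Hypothetical CSV content with one numeric and one text column
csv_text <- "plot_id,plot_type\n1,Control\n2,Rodent Exclosure\n3,Control"

as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$plot_type)    # "factor"
class(as_char$plot_type)      # "character"

# Convert a single column to a factor after reading
as_char$plot_type <- factor(as_char$plot_type)
levels(as_char$plot_type)     # "Control" "Rodent Exclosure"
```

Reading text columns as character and converting only the columns that are truly categorical keeps control over which columns become factors.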
Need help?
• Email: [email protected]
• Data Management Services website: http://lib.colostate.edu/services/data-management
• Data Carpentry: http://www.datacarpentry.org/
• R Ecology Lesson: http://www.datacarpentry.org/R-ecology-lesson/