Introduction to using R
DESCRIPTION
R is the lingua franca of statistical computing. One of its attractions for users of statistics is that it encompasses an enormous range of modern statistical methods developed by world-leading statistics researchers. In order to exploit its capabilities for data analysis and statistics, a basic understanding of the core functions is required. In this session we will cover all of the preliminaries that are common to all uses of R, with particular focus on the topics of functions and data objects. Statistical methods are not formally covered, although some basic functions will be demonstrated.
TRANSCRIPT
An Introduction to R
Graeme L. Hickey
31st October 2014
Graeme L. Hickey An Introduction to R 31st October 2014 1 / 125
Getting ready to use R
Logistics
Owing to a room change, there will unfortunately be less hands-on experience (possibly none)
I will email the slides to all who registered – no need to make notes
We will take a short break during
What is R?
Derives from the S language, best known through the proprietary software package S-PLUS
“R is a free software programming language and software environment for statistical computing and graphics” Wikipedia (2014)
The “lingua franca of data analysts” The New York Times (2009)
Used worldwide by bioinformaticians, data scientists, high-level statisticians, app developers, ...
Why use R?
Keep whole analysis together (data processing, analysis, publication figures, reports)
Reproducible research
State-of-the-art statistical methods are wrapped up in ‘R packages’
It’s free
It’s cross-platform compatible (Windows, OS X, Linux)
It will be extensively used in EPH / IGH statistical training
Objectives
The primary objective is for you to be able to apply statistical functions available in R to your own data
To achieve this we should:
Understand the core concepts of R and its syntax
Be able to read and write data files
Be able to interrogate a dataset
Be able to use functions and options
Be able to write a simple function
How to install R?
Download and install from: http://www.r-project.org
Can use in isolation, but combining with an IDE front-end makes life easier when starting out
Recommend using R Studio: http://www.rstudio.com
Once both programs are installed, you only ever need to run R Studio
R Studio
R Console vs. Script Editor
Would you consider writing your thesis using a typewriter?
Don’t just use the console – not reproducible!
Always write analysis as an R Script
File -> New File -> R Script
Highlight code and press Ctrl + Enter to execute
R as a calculator
Simple maths
1 + 2
## [1] 3
13 * 17
## [1] 221
((6 * 7) + 4 - 7) / 2^8
## [1] 0.1523438
Routine mathematical functions and constants
exp(3)
## [1] 20.08554
sin(2 * pi) - 1
## [1] -1
atan(1) ^ 2
## [1] 0.6168503
All of these examples use base R functions – we’ll revisit these later
Other ‘numbers’ to look out for
1 / 0
## [1] Inf
-1 / 0
## [1] -Inf
0 / 0
## [1] NaN
If you see these, you have probably done something wrong!
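One way to catch these special values in practice is to test for them explicitly. The following is a minimal sketch using the base R functions is.finite() and is.nan(); the vector name x is just an illustrative choice.

```r
# A vector containing the special values shown above plus one ordinary number
x <- c(1 / 0, -1 / 0, 0 / 0, 5)

# TRUE only for ordinary (finite) numbers
is.finite(x)

# TRUE only for NaN ("not a number")
is.nan(x)
```

Checks like these are handy before passing data to a statistical function.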
Data objects
Assignment operator
We can tell R to remember things so that we can call them later
We do this using either of the assignment operators = or <-
x <- 5
x
## [1] 5
x = 5
x
## [1] 5
A warning!
R is case SENSITIVE
This is a common error and can lead to a great deal of pain in tracing bugs!
x # Lowercase
## [1] 5
X # Uppercase
## Error in eval(expr, envir, enclos): object 'X' not found
Another warning!
Names cannot begin with numbers or include spaces
gra.eme_30 <- 5 # OK
30graeme <- 5 # Not allowed!
gra eme <- 5 # Not allowed either!
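If you are unsure whether a name is allowed, the base R function make.names() shows what R would turn it into. This is only a sketch for checking names; the variables candidates and valid are illustrative.

```r
# The three names from the slide, stored as character strings
candidates <- c("gra.eme_30", "30graeme", "gra eme")

# make.names() converts each string into a syntactically valid name
valid <- make.names(candidates)
valid

# A name was already valid if make.names() left it unchanged
candidates == valid
```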
Task
What do you think the following lines of R code will output at the end?
x <- 2
y <- pi
z <- (sin(y) + x)^2
z
Try it!
Solution
x <- 2
y <- pi
z <- (sin(y) + x)^2
z
## [1] 4
Task
What do you think the following lines of R code will output at the end?
x <- 2
x <- x + 5
x
Try it!
Solution
x <- 2
x <- x + 5
x
## [1] 7
Vectors
We often have more than a single number, which we combine into a vector using the c() function, e.g.
c(184, 162, 145, 200, 178, 154, 172, 142)
## [1] 184 162 145 200 178 154 172 142
heights <- c(184, 162, 145, 200, 178, 154, 172, 142)
heights
## [1] 184 162 145 200 178 154 172 142
Task
What do you think the following lines of R code will output at the end?
heights/10 + 1
Try it!
Solution
heights/10 + 1
## [1] 19.4 17.2 15.5 21.0 18.8 16.4 18.2 15.2
Selection
We might want to select the 5th value from a vector
We use square brackets for this, e.g.
heights[5]
## [1] 178
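Square brackets also accept ranges and negative numbers. The sketch below re-creates the heights vector so that it runs on its own; the names middle and no_first are illustrative.

```r
heights <- c(184, 162, 145, 200, 178, 154, 172, 142)

# A range of positions: the 2nd to the 4th value
middle <- heights[2:4]
middle

# A negative index drops that position instead of selecting it
no_first <- heights[-1]
no_first
```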
Task
What do you think the following lines of R code will output at the end?
heights[c(1, 3, 5)]
Try it!
Solution
heights[c(1, 3, 5)]
## [1] 184 145 178
Logic
Boils down to something being TRUE or FALSE
x > y asks: is x greater than y?
x < y asks: is x less than y?
x == y asks: is x equal to y?
x >= y asks: is x greater than or equal to y?
x <= y asks: is x less than or equal to y?
Basic examples
5 < 10
## [1] TRUE
3 > 5
## [1] FALSE
sin(pi) == cos(pi) + 1
## [1] FALSE
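The last comparison is FALSE only because of floating-point rounding: mathematically both sides are zero. A common workaround, sketched here with base R's all.equal(), is to compare numbers up to a small tolerance; the variable names are illustrative.

```r
# Exact comparison fails because sin(pi) is a tiny non-zero number
exact <- sin(pi) == cos(pi) + 1
exact

# all.equal() compares up to a small tolerance; wrap it in isTRUE()
# because it returns a description rather than FALSE when values differ
approx <- isTRUE(all.equal(sin(pi), cos(pi) + 1))
approx

# The same issue arises with simple decimals
isTRUE(all.equal(0.1 + 0.2, 0.3))
```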
Logic
We can combine logical statements, for example
(5 < 10) & (3 < 5)
## [1] TRUE
(1 > 2) | (3 > 4)
## [1] FALSE
Logic & selection
We can use a logical vector to pick out elements of a vector so long as the logical vector and the vector of data are the same length
logic.vec <- c(TRUE, FALSE, TRUE, FALSE, TRUE,
               FALSE, TRUE, FALSE)
heights[logic.vec]
## [1] 184 145 178 172
Task: How do you extract heights greater than 160cm?
Solution
How to extract heights greater than 160cm?
i <- (heights > 160)
i
## [1] TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
heights[i]
## [1] 184 162 200 178 172
And once you understand, simply. . .
heights[heights > 160]
## [1] 184 162 200 178 172
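Logical vectors are also useful for counting. As a small sketch (re-creating heights so it runs alone), which() gives the positions that are TRUE and sum() counts them, since TRUE is treated as 1 and FALSE as 0.

```r
heights <- c(184, 162, 145, 200, 178, 154, 172, 142)

# Positions of the heights greater than 160cm
tall_positions <- which(heights > 160)
tall_positions

# Number of heights greater than 160cm
n_tall <- sum(heights > 160)
n_tall

# Proportion of heights greater than 160cm
prop_tall <- mean(heights > 160)
prop_tall
```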
Character data
Vectors don’t just store numbers
They can store items of class: integer, numeric, character, date, factor, etc.
For character data, just put things inside quotation marks
subjects <- c("Bob", "Amy", "Amy", "Bob", "Amy",
              "Bob", "Bob", "Amy", "Amy")
subjects
## [1] "Bob" "Amy" "Amy" "Bob" "Amy" "Bob" "Bob" "Amy" "Amy"
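To check what kind of data a vector holds, and to summarise character data, the base functions class(), unique() and table() are a reasonable starting point. A short sketch using the same subjects vector:

```r
subjects <- c("Bob", "Amy", "Amy", "Bob", "Amy",
              "Bob", "Bob", "Amy", "Amy")

# What type of data is stored?
class(subjects)

# The distinct values, in order of first appearance
unique(subjects)

# How often each value occurs
counts <- table(subjects)
counts
```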
Matrices
These are generalizations of vectors: instead of being one vector, we have multiple columns of vectors:
matrix(heights, nrow = 2)
##      [,1] [,2] [,3] [,4]
## [1,]  184  145  178  172
## [2,]  162  200  154  142
matrix(subjects, nrow = 3)
##      [,1]  [,2]  [,3]
## [1,] "Bob" "Bob" "Bob"
## [2,] "Amy" "Amy" "Amy"
## [3,] "Amy" "Bob" "Amy"
Matrices
We can apply the same arithmetic as per vectors
myMat <- matrix(heights, nrow = 2)
myMat
##      [,1] [,2] [,3] [,4]
## [1,]  184  145  178  172
## [2,]  162  200  154  142
0.5*myMat + 3
##      [,1]  [,2] [,3] [,4]
## [1,]   95  75.5   92   89
## [2,]   84 103.0   80   74
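Individual entries, rows and columns of a matrix can be picked out with square brackets, in the order row then column. A minimal sketch, rebuilding myMat so the block runs by itself:

```r
heights <- c(184, 162, 145, 200, 178, 154, 172, 142)
myMat <- matrix(heights, nrow = 2)

# Entry in row 2, column 3
myMat[2, 3]

# The whole first row
first_row <- myMat[1, ]
first_row

# t() transposes the matrix, swapping rows and columns
dim(t(myMat))
```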
Matrices
All elements of a matrix have to be of the same type (e.g. numeric, character, logical) – you can’t mix-and-match
Most useful when doing linear algebra (e.g. PCA, solving systems of equations)
R often coerces into matrix form when required by functions
If you want different data types, you need to use objects called data.frames
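To see why mixing types does not work, note that R silently converts everything to the most general type. The sketch below (with the illustrative name mixed) shows a number and a logical value becoming character strings.

```r
# Try to mix a number, a string and a logical value
mixed <- c(184, "Bob", TRUE)

# Everything has been converted to character
mixed
class(mixed)
```

A data frame avoids this because each column keeps its own type.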
Data frames
Think of these like Microsoft Excel spreadsheets
Columns represent different variables, e.g. age, sex, number of cells, ...
Rows represent samples, e.g. patients, tests
Like matrices, they are a generalization of vectors, but can store different types of data
Data frames
R has some pre-installed data frames, including the famous iris dataset of Sir Ronald Fisher [1], to allow us to practice
iris
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
...
[1] R. A. Fisher (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (2): 179–188.
Selection in data frames
Earlier, we learnt how to select individual elements from a vector
For a data frame the same principles apply, except there are now 2 dimensions: rows and columns (note the order!)
Selection in data frames
There are 3 primary methods of selecting data from data frames
1 Square brackets
2 Using the dollar ($) operator (for columns only)
3 Using the subset function (we won’t discuss this today)
They all do the same thing (sort of), and you can combine these methods
Selection using square brackets
One method of selection is the square brackets:
dat[i, ] would select the i-th row (which is a one-row data frame)
dat[, j] would select the j-th column (which is a vector)
dat[i, j] would select the value from the i-th row and j-th column
iris[1, 1]
## [1] 5.1
iris[ , 1]
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8
##  [14] 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0
##  [27] 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4
##  [40] 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4
##  [53] 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6
##  [66] 6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7
##  [79] 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5 5.5
##  [92] 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3
## [105] 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5
## [118] 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2
## [131] 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8
## [144] 6.8 6.7 6.7 6.3 6.5 6.2 5.9
Selection using square brackets
i and j don’t have to be single numbers, they can be:
vectors of numbers
logical vectors (which need to be the same length as the rows or columns)
iris[c(1, 3) , ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
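Columns can also be chosen by name, and rows by a logical vector. The sketch below uses the built-in iris data; the object names first_rows and wide, and the 4cm cut-off, are illustrative choices.

```r
# Rows 1 to 3, keeping only two named columns
first_rows <- iris[1:3, c("Sepal.Length", "Species")]
first_rows

# A logical vector with one entry per row selects the rows that are TRUE
wide <- iris[iris$Sepal.Width > 4, ]
nrow(wide)
```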
Selection using the dollar operator
Each column in a data frame should have a name
We use dat$foo1 to extract the column called foo1 from a data frame called dat, e.g.
iris$Petal.Width
##   [1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1
##  [14] 0.1 0.2 0.4 0.4 0.3 0.3 0.3 0.2 0.4 0.2 0.5 0.2 0.2
##  [27] 0.4 0.2 0.2 0.2 0.2 0.4 0.1 0.2 0.2 0.2 0.2 0.1 0.2
##  [40] 0.2 0.3 0.3 0.2 0.6 0.4 0.3 0.2 0.2 0.2 0.2 1.4 1.5
##  [53] 1.5 1.3 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5 1.0 1.4 1.3
##  [66] 1.4 1.5 1.0 1.5 1.1 1.8 1.3 1.5 1.2 1.3 1.4 1.4 1.7
##  [79] 1.5 1.0 1.1 1.0 1.2 1.6 1.5 1.6 1.5 1.3 1.3 1.3 1.2
##  [92] 1.4 1.2 1.0 1.3 1.2 1.3 1.3 1.1 1.3 2.5 1.9 2.1 1.8
## [105] 2.2 2.1 1.7 1.8 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3 1.8
## [118] 2.2 2.3 1.5 2.3 2.0 2.0 1.8 2.1 1.8 1.8 1.8 2.1 1.6
## [131] 1.9 2.0 2.2 1.5 1.4 2.3 2.4 1.8 1.8 2.1 2.4 2.3 1.9
## [144] 2.3 2.5 2.3 1.9 2.0 2.3 1.8
Tasks
1 Select all rows of the iris data where the sepal length is >7.6cm
2 Extract the sepal lengths of iris flowers sp. virginica with petal widths >2.4cm
N.B. there are multiple ways of solving these problems
Solution (1)
iris[iris$Sepal.Length > 7.6, ]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 118          7.7         3.8          6.7         2.2
## 119          7.7         2.6          6.9         2.3
## 123          7.7         2.8          6.7         2.0
## 132          7.9         3.8          6.4         2.0
## 136          7.7         3.0          6.1         2.3
##       Species
## 118 virginica
## 119 virginica
## 123 virginica
## 132 virginica
## 136 virginica
Solution (2)
I’ll break this one into pieces to make it clearer...
lvec1 <- (iris$Petal.Width > 2.4)
lvec2 <- (iris$Species == "virginica")
iris2 <- iris[lvec1 & lvec2, ]
iris2
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 101          6.3         3.3          6.0         2.5
## 110          7.2         3.6          6.1         2.5
## 145          6.7         3.3          5.7         2.5
##       Species
## 101 virginica
## 110 virginica
## 145 virginica
iris2$Sepal.Length
## [1] 6.3 7.2 6.7
I could have combined all of this into a single line. . .
iris[(iris$Petal.Width > 2.4) &
     (iris$Species == "virginica"), ]$Sepal.Length
## [1] 6.3 7.2 6.7
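For comparison, the same selection can be sketched with the base R subset() function, which lets us refer to columns by name without repeating iris$. The object name sel is illustrative.

```r
# Rows of iris that are virginica with petal width above 2.4cm
sel <- subset(iris, Petal.Width > 2.4 & Species == "virginica")

# Extract the sepal lengths of the selected rows
sel$Sepal.Length
```

This is convenient for interactive work; square brackets remain the more general tool.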
Factors
An important class of data in R is the factor
Factors are categorical variables, e.g. gender, country
They are similar to character data, except that R is “aware” of them, which allows us to do lots of clever things with our data
iris$Species
##   [1] setosa     setosa     setosa     setosa     setosa
##   [6] setosa     setosa     setosa     setosa     setosa
##  [11] setosa     setosa     setosa     setosa     setosa
##  [16] setosa     setosa     setosa     setosa     setosa
##  [21] setosa     setosa     setosa     setosa     setosa
##  [26] setosa     setosa     setosa     setosa     setosa
##  [31] setosa     setosa     setosa     setosa     setosa
##  [36] setosa     setosa     setosa     setosa     setosa
##  [41] setosa     setosa     setosa     setosa     setosa
##  [46] setosa     setosa     setosa     setosa     setosa
##  [51] versicolor versicolor versicolor versicolor versicolor
##  [56] versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor
##  [66] versicolor versicolor versicolor versicolor versicolor
##  [71] versicolor versicolor versicolor versicolor versicolor
##  [76] versicolor versicolor versicolor versicolor versicolor
##  [81] versicolor versicolor versicolor versicolor versicolor
##  [86] versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor
##  [96] versicolor versicolor versicolor versicolor versicolor
## [101] virginica  virginica  virginica  virginica  virginica
## [106] virginica  virginica  virginica  virginica  virginica
## [111] virginica  virginica  virginica  virginica  virginica
## [116] virginica  virginica  virginica  virginica  virginica
## [121] virginica  virginica  virginica  virginica  virginica
## [126] virginica  virginica  virginica  virginica  virginica
## [131] virginica  virginica  virginica  virginica  virginica
## [136] virginica  virginica  virginica  virginica  virginica
## [141] virginica  virginica  virginica  virginica  virginica
## [146] virginica  virginica  virginica  virginica  virginica
## Levels: setosa versicolor virginica
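A few base functions show what R "knows" about a factor. The sketch below inspects iris$Species and then builds a small factor by hand; the sizes example and its level order are assumptions made purely for illustration.

```r
# The distinct categories (levels) and how many there are
levels(iris$Species)
nlevels(iris$Species)

# Number of observations in each category
species_counts <- table(iris$Species)
species_counts

# Creating a factor ourselves, with an explicit level order
sizes <- factor(c("small", "large", "small"), levels = c("small", "large"))

# Internally, a factor stores integer codes that point at the levels
as.integer(sizes)
```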
Matrices and data frames too limited?
What if you need something more than a flat matrix or data.frame?
E.g. recording 100 measurements for 70 subjects at 25 time points
See ?array and ?list
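As a rough sketch of what these two objects look like: an array is a matrix with more than two dimensions, and a list can hold items of different types and lengths. The names measurements and record, and the values stored, are made up for illustration.

```r
# A 3-dimensional array: 100 measurements x 70 subjects x 25 time points
measurements <- array(0, dim = c(100, 70, 25))
dim(measurements)

# Store one value: measurement 3, subject 10, time point 5
measurements[3, 10, 5] <- 1.7
measurements[3, 10, 5]

# A list can mix types and lengths; elements are accessed with $
record <- list(id = 10, heights = c(184, 162), species = "setosa")
record$heights
```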
Functions
What are functions?
In short, you put something in and get something out
In order to do interesting things with our data and apply the wealth of statistical methods available, we need to understand functions first
Recognising a function
Functions must have an assigned name
Functions are applied using round brackets
Functions generally take arguments (either required or optional)
Arguments can be anything, depending on the function, but often one of them will be some data
e.g. myFunc(x)
Base R functions
R has lots of built in functions, which you can apply to most data objects
summary
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length
##  Min.   :4.300   Min.   :2.000   Min.   :1.000
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600
##  Median :5.800   Median :3.000   Median :4.350
##  Mean   :5.843   Mean   :3.057   Mean   :3.758
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100
##  Max.   :7.900   Max.   :4.400   Max.   :6.900
##   Petal.Width          Species
##  Min.   :0.100   setosa    :50
##  1st Qu.:0.300   versicolor:50
##  Median :1.300   virginica :50
##  Mean   :1.199
##  3rd Qu.:1.800
##  Max.   :2.500
head & tail
If you want to inspect a data frame, you don’t want to look at the whole thing
We use either the head() or tail() functions
Or if using R Studio, click the Environment tab
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
ncol, nrow, dim
ncol(iris)
## [1] 5
nrow(iris)
## [1] 150
dim(iris)
## [1] 150 5
length
dim() returns NULL for vectors, so we have to use length()
length(heights)
## [1] 8
names
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length"
## [4] "Petal.Width"  "Species"
Mathematical functions
We saw some of these before, e.g.
sin(pi) # Not zero as R is using numerical approximation
## [1] 1.224647e-16
exp(pi*4)
## [1] 286751.3
Statistical functions
mean(heights)
## [1] 167.125
sd(heights)
## [1] 20.09575
range(heights)
## [1] 142 200
Warnings & Errors
Sometimes we get messages listed as Warning and Error
Warnings mean that the function was able to do something, but what it returns may not be what you were expecting
Errors mean that the function aborted as something did not make sense
Don’t ignore either unless you are 100% confident why it happened!
mean(iris)
## Warning in mean.default(iris): argument is not numeric or
## logical: returning NA
## [1] NA
sin(Pi)
## Error in eval(expr, envir, enclos): object 'Pi' not found
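When writing longer scripts it can help to catch warnings and errors rather than let them scroll past. The following is a sketch using base R's tryCatch(); the triggering calls (as.numeric("abc") and stop()) are illustrative stand-ins, not the calls from the slide.

```r
# as.numeric("abc") cannot convert the text, so R warns and returns NA.
# tryCatch() lets us intercept the warning and decide what to return.
result <- tryCatch(
  as.numeric("abc"),
  warning = function(w) {
    message("Caught a warning: ", conditionMessage(w))
    NA_real_
  }
)
result

# Errors can be intercepted in the same way; stop() raises one on purpose
outcome <- tryCatch(
  stop("something did not make sense"),
  error = function(e) conditionMessage(e)
)
outcome
```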
Tasks
1 How many iris samples have petal widths >2cm?
2 Of these, what are the mean and SD of their petal lengths?
Solutions
x <- iris[iris$Petal.Width > 2, ]
nrow(x)
## [1] 23
mean(x$Petal.Length)
## [1] 5.76087
sd(x$Petal.Length)
## [1] 0.4793358
seq
Some functions take multiple arguments and it is often best to formally declare them, e.g.
seq(from = 1, to = 10, by = 2)
## [1] 1 3 5 7 9
But if confident of the order, we could just apply
seq(1, 10, 2)
## [1] 1 3 5 7 9
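Naming arguments also makes it easy to use the optional ones. As a small sketch, seq() has a length.out argument that fixes the number of values instead of the step size, and by can be negative; the object names are illustrative.

```r
# Five equally spaced values between 0 and 1
grid <- seq(from = 0, to = 1, length.out = 5)
grid

# A decreasing sequence, using a negative step
countdown <- seq(from = 10, to = 1, by = -3)
countdown
```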
Shorthand trick
We can replace the function seq(x, y, by = 1) with x:y, e.g.
seq(1, 10, 1)
## [1] 1 2 3 4 5 6 7 8 9 10
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
Making a data frame from vectors
If we have vectors v1, v2, v3, then we can make our own data frame using the data.frame function
ID <- seq(1, 8, 1)
heights.m <- heights / 100 # Heights in metres (heights are in cm)
data.frame(ID, heights, heights.m)
##   ID heights heights.m
## 1  1     184      1.84
## 2  2     162      1.62
## 3  3     145      1.45
## 4  4     200      2.00
## 5  5     178      1.78
## 6  6     154      1.54
## 7  7     172      1.72
## 8  8     142      1.42
Coercion
We can coerce one data type into another using the as.* functions, e.g.
as.data.frame()
as.matrix()
as.vector()
as.numeric()
Don’t worry about these for now, but handy for your own studies one day
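For when that day comes, here is a brief sketch of the as.* functions in action, including one well-known trap when converting factors to numbers. The example values are made up.

```r
# Text to number, and number to text
as.numeric("3.14")
as.character(1:3)

# A factor whose labels look like numbers
f <- factor(c("10", "20", "10"))

# Trap: this returns the internal integer codes, not the labels
codes <- as.numeric(f)
codes

# Converting to character first recovers the intended numbers
values <- as.numeric(as.character(f))
values
```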
Merging 2 (or more) data frames
If you have 2 data frames that share a common field, e.g. subject IDs, we can merge them together using merge()
This is particularly useful for longitudinal datasets
Let’s make another data set
Species <- c("setosa", "versicolor", "virginica")
Colours <- c("red", "blue", "violet")
flowerCols <- data.frame(Species, Colours)
flowerCols
##      Species Colours
## 1     setosa     red
## 2 versicolor    blue
## 3  virginica  violet
Now let’s merge them
irisMerge <- merge(iris, flowerCols)
head(irisMerge, 5)
##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1  setosa          5.1         3.5          1.4         0.2
## 2  setosa          4.9         3.0          1.4         0.2
## 3  setosa          4.7         3.2          1.3         0.2
## 4  setosa          4.6         3.1          1.5         0.2
## 5  setosa          5.0         3.6          1.4         0.2
##   Colours
## 1     red
## 2     red
## 3     red
## 4     red
## 5     red
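By default merge() joins on the columns the two data frames have in common and keeps only rows that match in both. The sketch below makes the key explicit with by and keeps unmatched rows with all.x; the patients and visits data are invented for illustration.

```r
# Hypothetical longitudinal data: one row per patient, several visits each
patients <- data.frame(ID = c(1, 2, 3), age = c(54, 61, 47))
visits <- data.frame(ID = c(1, 1, 2), sbp = c(130, 125, 142))

# Keep only patients that have at least one visit
inner <- merge(patients, visits, by = "ID")
nrow(inner)

# Keep every patient; those without visits get NA for sbp
left <- merge(patients, visits, by = "ID", all.x = TRUE)
left
```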
Writing our own function
We can write our own functions when needed
We use the function() function
We must remember to assign it to a name, otherwise we can’t use it
myFun <- function(arguments) {
  # do something
}
E.g. $f(x) = e^x + x^2 + 1$
fx <- function(x) {
  exp(x) + x^2 + 1
}
fx(5)
## [1] 174.4132
fx(seq(3, 12, 3))
## [1] 30.08554 440.42879 8185.08393 162899.79142
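Functions can also have optional arguments with default values. As a sketch, the hypothetical function standardise() below centres and scales a vector, using the mean and SD of the data unless told otherwise.

```r
heights <- c(184, 162, 145, 200, 178, 154, 172, 142)

# x is required; centre and scale are optional and default to the
# mean and standard deviation of x
standardise <- function(x, centre = mean(x), scale = sd(x)) {
  (x - centre) / scale
}

# Using the defaults
z <- standardise(heights)
z

# Overriding the defaults: differences from 180cm, unscaled
standardise(heights, centre = 180, scale = 1)
```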
Comments
Notice that anything written after a hash-symbol is ignored by R
Use this to annotate your R scripts to remember what you are doing
# Graeme thinks R is great!
# R will ignore all of this
Help with functions
There are thousands of functions in R
Some are loaded on launch of R (e.g. mean, seq, dim)
Others require packages to be loaded first
If you know the name of a function, you can use the ? operator to access the help file, e.g.
?sd
Help with functions
You can also use the search bar in the R Studio software
If you don’t know the name of the function, try the help.search function for a list of possible candidates
help.search("sequences")
When all else fails: Google it!
Conditional statements and loops
Introduction
Conditional statements and loops are inherent to all programming languages
We require them when we need to make complex rules
It would require a more advanced tutorial to fully appreciate their power
If interested to learn more, see references at end
Conditional statements
The if statement, which is technically a function, only does something if the condition is TRUE, e.g.
if(3 < 4) {3 + 3}
## [1] 6
if(3 > 4) {2 + 2}
Also, look up while and else using help.search()
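To give a flavour of else and while, here is a minimal sketch. The variable names and the numbers used are arbitrary.

```r
# if ... else: choose between two actions
x <- 7
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}
parity

# while: keep adding 1, 2, 3, ... until the total reaches at least 20
n <- 0
total <- 0
while (total < 20) {
  n <- n + 1
  total <- total + n
}
n
total
```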
Loops
We might want to sequentially do something, conditional on something else
For example, let $Y_i = Y_{i-1} + i/10$ with $Y_1 = 0$
Calculate $Y = \sum_{i=1}^{20} Y_i$
E.g. $0 + (0 + 2/10) + (2/10 + 3/10) + \dots + (2/10 + 3/10 + \dots + 20/10)$
Yi <- 0 # Start with Y_1
Y <- Yi # Running total, initially just Y_1
for(i in 2:20) {
  Yi <- Yi + i/10 # Calculate Y_i from Y_{i-1}
  Y <- Y + Yi # Cumulative sum
}
Y # Solution
## [1] 152
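As a cross-check of the recurrence as defined (Y_1 = 0 and Y_i = Y_{i-1} + i/10), the individual terms can be sketched without a loop using the base function cumsum(), which returns running totals. The names i and Yi are illustrative.

```r
# Increments i/10 for i = 2, ..., 20
i <- 2:20

# Y_1 = 0, then each Y_i is the running total of the increments
Yi <- c(0, cumsum(i / 10))
Yi

# Sum of Y_1, ..., Y_20
sum(Yi)
```

Summing the twenty terms this way gives 152 (up to floating-point rounding).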
Task
For each value in our heights vector earlier, how can we calculate the difference between it and the previous one, i.e. calculate $heights_i - heights_{i-1}$?
Solution
d <- 0
for(i in 2:length(heights)) {
  d[i] <- heights[i] - heights[i-1]
}
d
## [1] 0 -22 -17 55 -22 -24 18 -30
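Many loops of this kind have a ready-made base R function. As a sketch, diff() returns the differences between consecutive values; we put a 0 in front to match the loop's convention for the first element.

```r
heights <- c(184, 162, 145, 200, 178, 154, 172, 142)

# diff() gives heights[i] - heights[i-1] for i = 2, ..., 8
d <- c(0, diff(heights))
d
```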
Reading and writing files
Reading
You want to get your data into R
Data comes in lots of different formats, luckily R can handle almost all of them!
Most use packages – we’ll explore these later
read.csv
The simplest way is to convert your data to a comma-separated values (*.csv) file and use
my.data <- read.csv(file.choose())
Instead of writing file.choose() we could have specified the file location
Look at the help file for more customization settings
read.xlsx
Converting our data to CSV format is a pain!
We can use the xlsx package instead
library("xlsx")
my.data <- read.xlsx(file.choose(), sheetIndex = 1)
foreign
What if our data is in a Stata, SPSS, SAS, etc. file?
We can use the foreign package instead, e.g.
library("foreign")
my.data <- read.spss(file.choose())
Data on the web
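What if our data is in the cloud?
We can use the utils package function download.file to save a copy locally and then read it in, e.g.
# utils is loaded by default, so library("utils") is optional
download.file("http://www.liv.ac.uk/dat.csv", destfile = "dat.csv")
my.data <- read.csv("dat.csv")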
Useful if you share your data on a public Dropbox folder
Other formats
If data exists, R can read it in
Just need to find the right package
Writing
Usually as simple as changing the read. to write.
Need to specify:
1 What object we want to save
2 A name for the file we will save
write.csv(iris, "IrisData.csv")
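To convince ourselves that reading and writing work together, we can sketch a round trip: write iris to a CSV file and read it straight back. The temporary folder from tempdir() is used here only so the example does not clutter your working directory; row.names = FALSE and stringsAsFactors = TRUE are optional settings chosen for this sketch.

```r
# Location of the file inside R's temporary folder
csv_path <- file.path(tempdir(), "IrisData.csv")

# Write the data frame; row.names = FALSE avoids an extra column of row numbers
write.csv(iris, csv_path, row.names = FALSE)

# Read it back; stringsAsFactors = TRUE turns Species into a factor again
iris_back <- read.csv(csv_path, stringsAsFactors = TRUE)

# Same shape and column names as the original
dim(iris_back)
names(iris_back)
```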
Graphics
Introduction
R has 3 primary graphics packages [2]:
1 Base R - those built into R
2 lattice - a functional extension of the base graphics (requires a package)
3 ggplot2 - built on the grammar of graphics (requires a package)
All are called using functions, and typically have lots of optional arguments for customization of figures
[2] I will discuss packages shortly.
Plot
The plot function can be applied to most data objects
Alternatively, one can give it two arguments:
x - x-axis coordinates
y - y-axis coordinates
Can also specify arguments to: label the axes; colour the points; etc. See: ?plot
plot(iris)
[Figure: scatterplot matrix of the iris data, with panels for Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species]
Task
How can I plot the sepal length against the petal length of the iris data, and colour the points by species?
Hint: ?as.numeric + ?plot
Solution
plot(x = iris$Sepal.Length, y = iris$Petal.Length,
     col = as.numeric(iris$Species))
[Figure: scatterplot of iris$Petal.Length (y-axis) against iris$Sepal.Length (x-axis), with points coloured by species]
Histograms
hist(iris$Petal.Length,
     col = "grey", xlab = "Petal length (cm)")
[Figure: histogram of iris$Petal.Length; x-axis: Petal length (cm), y-axis: Frequency]
Boxplots
boxplot(Petal.Length ~ Species, data = iris)
[Figure: boxplots of Petal.Length for each species (setosa, versicolor, virginica)]
ggplot2
Flexible publication quality graphics
library("ggplot2")
ggplot(aes(x = Petal.Length, y = Petal.Width, colour = Sepal.Length),
       data = iris) +
  geom_point() + geom_smooth() +
  facet_wrap(~ Species, scales = "free")
[Figure: scatterplots of Petal.Width against Petal.Length with smoothed trend lines, one panel per species (setosa, versicolor, virginica), with points coloured by Sepal.Length]
Saving figures
In R Studio, click Export then select file type
In R, right click + Save
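For a reproducible script, figures can also be saved with code rather than by clicking. A minimal sketch using the base pdf() graphics device follows; the file name and the figure size are arbitrary choices, and tempdir() is used only to keep the example tidy.

```r
# Where to save the figure
fig_path <- file.path(tempdir(), "petal_hist.pdf")

# Open a PDF device, draw the plot, then close the device to write the file
pdf(fig_path, width = 6, height = 4)
hist(iris$Petal.Length, col = "grey", xlab = "Petal length (cm)")
dev.off()

# Check that the file was created
file.exists(fig_path)
```

Remember to call dev.off(), otherwise the file is not finalised.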
Packages
Introduction
Packages are like apps
Many state-of-the-art statistical methods are published in journals with R packages
A large number of books are also released with specific R packages, e.g. time series, prognostic modelling, survival regression, ...
Packages are published:
On CRAN
GitHub
Bioconductor
Code files on personal websites (not really packages per se)
Where to find relevant packages
There are thousands of R packages!
Books, journal articles (look at what is reported in the Statistical Analysis section), word of mouth, your local friendly statistician
www.rseek.org
What packages are already installed?
R comes with a number of packages pre-installed
We can see all of them, plus any we have installed ourselves
library() # No arguments!
or click the Packages tab if working from R Studio
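The same information is available as data that can be searched from code. A small sketch using installed.packages(); checking for the stats package is just an illustrative choice:

pkgs <- installed.packages()              # character matrix, one row per installed package
head(pkgs[, c("Package", "Version")])     # first few package names and versions
"stats" %in% rownames(pkgs)               # is a given package installed?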
Installing a package from CRAN
Once you have identified the package you want, run:
install.packages("ggplot2")
Loading a package
R does not automatically load all installed packages, as doing so would slow your computer down and can also cause clashes
When you know what package(s) you want to use, run:
library("ggplot2") # Load ggplot2 package
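Installing and loading can be combined so that a script installs a package only when it is missing. A sketch, not a recommended standard: load_or_install is a hypothetical helper name, and splines is used only because it ships with R, so nothing is downloaded here:

# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)                 # only runs when the package is absent
  }
  library(pkg, character.only = TRUE)     # pkg holds the name as a string
}

load_or_install("splines")
"package:splines" %in% search()           # confirm the package is now attached

# A single function can also be used without loading its package, via '::'
stats::median(iris$Petal.Length)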
Statistics
Introduction
In R, "statistical method" is just another term for "function"
Beware of applying statistical methods to your data without first understanding what they do: G-I-G-O (garbage in, garbage out)!
Student’s t-test
t.test(x = heights, mu = 180)
##
##  One Sample t-test
##
## data:  heights
## t = -1.8121, df = 7, p-value = 0.1129
## alternative hypothesis: true mean is not equal to 180
## 95 percent confidence interval:
##  150.3245 183.9255
## sample estimates:
## mean of x
##   167.125
Notice that we applied a function to two arguments:
1 Some data
2 A number
And it gave us back some statistics and a P-value
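The result of t.test() is an object, so the individual statistics can be pulled out by name. A sketch only: heights_demo is made-up data for illustration and is not the heights vector used in the slides:

heights_demo <- c(150, 160, 165, 170, 172, 175, 180, 185)  # hypothetical values (cm)

res <- t.test(x = heights_demo, mu = 180)

res$statistic   # the t statistic
res$parameter   # degrees of freedom
res$p.value     # the P-value
res$conf.int    # 95% confidence interval for the mean
res$estimate    # the sample mean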
Linear regression
The first argument of many statistical functions is a formula
A formula is written as:
The outcome variable on the LHS (one of the variables in your dataset)
A tilde symbol (~), which means "model onto"
The explanatory variables on the RHS (separated by a + if more than one)
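The three parts of a formula can be seen by building one and inspecting it. A short sketch; the choice of variables from iris is only for illustration:

# Outcome on the left, tilde, explanatory variables on the right joined by '+'
f <- Petal.Width ~ Sepal.Length + Petal.Length

class(f)      # a formula is its own type of object
all.vars(f)   # the variable names it refers to, outcome first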
fit <- lm(Petal.Width ~ Sepal.Length, data = iris)
fit
##
## Call:
## lm(formula = Petal.Width ~ Sepal.Length, data = iris)
##
## Coefficients:
##  (Intercept)  Sepal.Length
##      -3.2002        0.7529
Recall that we can apply the summary function to different things?
We have just assigned our linear model to the name fit
summary(fit)
##
## Call:
## lm(formula = Petal.Width ~ Sepal.Length, data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.96671 -0.35936 -0.01787  0.28388  1.23329
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -3.20022    0.25689  -12.46   <2e-16 ***
## Sepal.Length  0.75292    0.04353   17.30   <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.44 on 148 degrees of freedom
## Multiple R-squared: 0.669, Adjusted R-squared: 0.6668
## F-statistic: 299.2 on 1 and 148 DF, p-value: < 2.2e-16
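The numbers printed by summary() can also be extracted for further use. A sketch that refits the same model so it runs on its own; new_flowers is a hypothetical data frame introduced only to show predict():

fit <- lm(Petal.Width ~ Sepal.Length, data = iris)
s <- summary(fit)

coef(s)        # coefficient table (estimates, standard errors, t and P values)
s$r.squared    # multiple R-squared
confint(fit)   # 95% confidence intervals for the coefficients

# Predicted petal widths for some hypothetical sepal lengths
new_flowers <- data.frame(Sepal.Length = c(5, 6, 7))
predict(fit, newdata = new_flowers)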
We can apply other functions to fit also, e.g.
plot(fit)
[Figure: four diagnostic plots for fit: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Cook's distance]
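One way to draw four diagnostic panels like these in a single window is sketched below. The 2 x 2 layout via par(mfrow) and the which = 1:4 selection are assumptions about how the figure was produced, not something the slides state:

fit <- lm(Petal.Width ~ Sepal.Length, data = iris)

old_par <- par(mfrow = c(2, 2))   # split the plotting window into a 2 x 2 grid
plot(fit, which = 1:4)            # residuals vs fitted, Q-Q, scale-location, Cook's distance
par(old_par)                      # restore the previous layout

# The Cook's distances behind the fourth panel, largest first
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)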
Probability
Most routine statistical distributions are built-in, e.g. Gaussian (norm), Binomial (binom), Poisson (pois), . . .
We use a prefix:
d to get the density
p to get the cumulative probability distribution
q to get the quantile function
r to sample random values from the distribution
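The four prefixes combine with a distribution name to give a function. A brief sketch; the particular values and the seed are arbitrary choices for illustration:

dnorm(0)                           # density of a standard normal at 0
pnorm(1.96)                        # P(Z <= 1.96) for a standard normal
qnorm(0.975)                       # the value with 97.5% of the distribution below it

set.seed(1)                        # arbitrary seed, so the random draws can be repeated
x <- rnorm(1000, mean = 2, sd = 3) # 1000 random values from N(2, 9)
mean(x)                            # should be close to 2

dbinom(3, size = 10, prob = 0.5)   # P(exactly 3 heads in 10 fair coin tosses)
ppois(2, lambda = 4)               # P(X <= 2) for a Poisson with mean 4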
E.g.
# Sample 5 values from a N(2, 9) distribution
rnorm(5, mean = 2, sd = 3)
## [1] -2.370804 3.118035 -2.090598 2.051935 3.879390
# Probability of tossing 10 heads in a row, i.e. exactly 10 heads in 10 tosses
dbinom(10, size = 10, prob = 0.5)
## [1] 0.0009765625
Wrapping up
Saving your workspace
You can save everything you have done using File -> Save (R Studio will prompt you when exiting)
I never save my workspace
Why? Because I save the R Script (copy & paste)
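If only a few objects are worth keeping, they can be saved and restored from code. A sketch using save() and load(); the object names, values and file location are all hypothetical:

heights_demo <- c(150, 160, 165, 170)                   # hypothetical data
fit <- lm(Petal.Width ~ Sepal.Length, data = iris)

ws_file <- file.path(tempdir(), "my_objects.RData")     # hypothetical file location
save(heights_demo, fit, file = ws_file)                 # save only the named objects

restored <- new.env()                                   # load into a separate environment
load(ws_file, envir = restored)                         # so nothing is overwritten
ls(restored)                                            # names of the restored objects

save.image() works the same way but writes every object in the workspace.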
Where to learn more
Venables WN, Smith DM (2014). An Introduction to R: http://cran.r-project.org/doc/manuals/R-intro.pdf
Shahbaba B (2012). Biostatistics with R. Springer, NY.
Data Camp online course: https://www.datacamp.com/courses/
Ask me, Peter, Elisabeth or Helen!
Further IGH Statistical Seminars
Other sessions in this series will make use of R
Introduction to Time Series (5th Dec 2014)
Regression Modelling (Jan 2015)
Time-to-Event Analysis (Feb 2015)
Geostatistical Methods for Disease Prevalence Mapping (March 2015)
Statistical Power + Sample Size Calculations (April 2015)
Quantile Regression (May 2015)