Source: Boston University, people.bu.edu/aimcinto/720/lec3_2015.pdf


Lecture 3: Basics of R Programming

This lecture introduces you to how to do more things with R beyond simple commands.

Outline:
1. R as a programming language
2. Grouping, loops and conditional execution
3. Creating your own functions

Objectives

By the end of this session students will be able to:
1. Perform simple data manipulations: vectors (numeric, logical, character), missing values, index vectors, many-to-one and one-to-many merging
2. Use grouped expressions and if-else statements
3. Write their own functions

Trivia: In the U.S., what is the record snowfall in a 24-hour period? http://docs.lib.noaa.gov/rescue/mwr/081/mwr-081-02-0038.pdf

A couple of odds and ends that may be of use to some of you:

a) Reading in huge datasets: http://simplystatistics.org/2011/10/07/r-workshop-reading-in-large-data-frames/

b) Web scraping: http://thebiobucket.blogspot.com/2011/10/little-webscraping-exercise.html

c) Sorting: use the command sort(), e.g.

> sort( c(1,55,-2,11) )
[1] -2  1 11 55

d) Generating a random permutation of a set of data:

> sample(c("First","Second","Third","Fourth"), replace=F)
[1] "Fourth" "First"  "Third"  "Second"
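As a small sketch of the two commands in c) and d): sort() also accepts decreasing=TRUE, and calling set.seed() before sample() makes a random permutation repeatable. The seed value 720 and the object names are arbitrary choices for illustration.

```r
# Sort in decreasing order instead of the default increasing order
x <- c(1, 55, -2, 11)
sorted_desc <- sort(x, decreasing = TRUE)
print(sorted_desc)

# Fix the random seed so the same permutation can be reproduced (720 is arbitrary)
ranks <- c("First", "Second", "Third", "Fourth")
set.seed(720)
perm1 <- sample(ranks, replace = FALSE)
set.seed(720)
perm2 <- sample(ranks, replace = FALSE)
print(perm1)
print(identical(perm1, perm2))  # same seed, same permutation
```

Without set.seed(), each call to sample() will usually give a different ordering.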

3.1 A Quick Review of Matrices and Data frames

We discussed vectors, matrices and data structures in Lectures 1 and 2. Let us recall how to create a matrix of data from some given measurements, say heights and weights of 15 students. Suppose we have the data below:

height  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
weight 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

We can read this into R using the following commands:

> height = c(58,59,60,61,62,63,64,65,66,67,68,69,70,71,72)
> weight = c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)

The above gives us two vectors with the heights and weights. But it may be useful to have this as a matrix, so that each person’s height and weight appear together. To do this, we can use the command:

> htwtmatrix = matrix(c(height,weight),15,2) # what do 15 and 2 refer to?
> htwtmatrix
      [,1] [,2]
 [1,]   58  115
 [2,]   59  117
 [3,]   60  120
 [4,]   61  123
 [5,]   62  126
 [6,]   63  129
 [7,]   64  132
 [8,]   65  135
 [9,]   66  139
[10,]   67  142
[11,]   68  146
[12,]   69  150
[13,]   70  154
[14,]   71  159
[15,]   72  164

What do you notice about how R creates a matrix from a vector? It constructs matrices column-wise by default, so if you want to create a matrix row-by-row, you need to give it the additional argument “byrow=T”.

Exercise 1. How would you create a matrix that has height and weight as the two rows instead of columns? Look up the help on the “matrix” function if necessary.

Now we have each person’s height and weight together. However, for future reference, instead of storing the data as a matrix, it might be helpful to have the column names together with the data. Recall from Lecture 1 that a convenient way to attach column names is to first convert htwtmatrix to a data frame (a matrix can also carry column names, via colnames()). A data frame has a unique set of row and column names. To convert htwtmatrix to a data frame, we use the command:


> htwtdata = data.frame(htwtmatrix)
> htwtdata
   X1  X2
1  58 115
2  59 117
3  60 120
4  61 123
5  62 126
6  63 129
7  64 132
8  65 135
9  66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164

(as.data.frame() works as well.) Notice that now the columns are named “X1” and “X2”. We can now assign names to the columns by means of the “names()” command:

> names(htwtdata) = c("height","weight")

We can find the column names of a data frame, without opening up the whole data set, by typing in

> names(htwtdata)
[1] "height" "weight"

Quick aside: Let’s say we have a very large dataset and we don’t want to search through the column names to find the number of the column we’re interested in. We can do this:

> which(names(dataset.of.interest)=="column name")

For example, with our two-column data frame htwtdata, to find the column number of weight, we can type

> which(names(htwtdata)=="weight")
[1] 2

This is telling us that weight is the second column in the data frame.
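As a sketch of two shortcuts (not the route taken in this lecture), the column names can be supplied while the objects are being built. The object names htwtdata2 and htwtmatrix2 are made up here so that nothing defined earlier is overwritten.

```r
height <- c(58,59,60,61,62,63,64,65,66,67,68,69,70,71,72)
weight <- c(115,117,120,123,126,129,132,135,139,142,146,150,154,159,164)

# Option 1: name the columns while building the data frame
htwtdata2 <- data.frame(height = height, weight = weight)
print(names(htwtdata2))

# Option 2: keep a matrix; cbind() uses the vector names as column names
htwtmatrix2 <- cbind(height, weight)
print(colnames(htwtmatrix2))

# Find a column's position by name, then use it as an index
weight_col <- which(names(htwtdata2) == "weight")
print(htwtdata2[1:3, weight_col])
```

Either way, the data can then be referred to by name rather than by column position.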


Let us recall how R operates on matrices, and how that compares to data frames. Recall that R evaluates functions over entire vectors (and matrices), avoiding the need for loops. For example, what do the following commands do?

> htwtmatrix*2
> htwtmatrix[,1]/12 # convert height in inches to feet
> mean(htwtmatrix[,2])

To get the dimensions or number of rows or columns of a data frame, it is often useful to use one of the following commands:

> dim(htwtdata)
> nrow(htwtdata)
> ncol(htwtdata)

Exercise 2. What does the following R command do?

> htwtdata[,2]*703/htwtdata[,1]^2

Exercise 3. How would you get R to give you the height and weight of the 8th student in the data set? The 8th and 10th students?


3.2 Programming: loops, if-then/for/while statements

So far we have mainly used R for performing one-line commands on vectors or matrices of data. One of the most powerful features of R is in being able to do programming, without a lot of the low-level bookkeeping issues that one needs to keep track of in other computer programming languages like C, Java, Perl, etc. Today we will look at some simple yet powerful programming tools in R, such as loops, if-then and while statements.

R is an expression language in the sense that its only command type is a function or expression which returns a result. Even an assignment is an expression whose result is the value assigned, and it may be used wherever any expression may be used; in particular multiple assignments are possible. Commands may be grouped together in braces, {expr_1; ...; expr_m}, in which case the value of the group is the result of the last expression in the group evaluated.

That is a bit abstract, so let’s get our hands dirty.
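Here is a minimal sketch of those two ideas, multiple assignment and the value of a braced group. All object names are made up for illustration.

```r
# An assignment is itself an expression, so assignments can be chained
a <- b <- 5

# The value of a braced group is the value of its last expression
total <- {
  first <- 2
  second <- 3
  first + second
}

print(c(a, b))   # both a and b hold 5
print(total)     # the value of first + second
```

The same rule is what lets a function body in braces return the value of its last line.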

If/else statements

In R, one can write a conditional statement as follows:

ifelse(condition on data, value returned if true, value returned if false)

The above expression reads: if the condition on the data is true, return the “true” value; otherwise return the “false” value. Recall the logical operations from class 1.

> ifelse(3 > 4, x <- 5, x <- 6)
> x
[1] 6

The operators && and || are often used to denote multiple conditions in an if statement. Whereas & (and) and | (or) apply element-wise to vectors, && and || apply to vectors of length one, and only evaluate their second argument in the sequence if necessary. Thus it is important to remember which logical operator to use in which situation.
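A small sketch of the difference, using a made-up vector x: the single-character operators return one answer per element, while the doubled operators work on single values and skip the second argument when the first already settles the result.

```r
x <- c(1, 5, 10)

# & works element by element and returns a vector
elementwise <- x > 2 & x < 8
print(elementwise)

# && stops as soon as the answer is known: the left side is FALSE,
# so the stop() call on the right is never evaluated
short_circuit <- FALSE && stop("this is never reached")
print(short_circuit)

# A typical use: guard a condition inside if()
if (length(x) > 0 && x[1] == 1) {
  msg <- "first element is 1"
} else {
  msg <- "first element is not 1"
}
print(msg)
```

In short, use & and | when building a logical vector (for example inside ifelse()), and && and || when a single TRUE or FALSE is needed (for example inside if()).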

> hmean = mean(htwtdata$height)
> wmean = mean(htwtdata$weight)
> ifelse( hmean > 61 && wmean > 120, x <- 5, x <- 6)
> x
[1] 5
> htwt_cat<-ifelse (height>67 | weight>150, "high", "low")
> htwt_cat
 [1] "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low"  "high"
[12] "high" "high" "high" "high"
> htwt_cat<-ifelse (height>67 || weight>150, "high", "low")
> htwt_cat
[1] "low"

(Notice that in the second ifelse statement only the first element in the series was computed. This is the behavior of older versions of R; from R 4.3.0 onward, giving || a vector of length greater than one is an error.)

This can also be extended to include multiple conditions. Suppose we have the following data:

final_score<- c(39, 51, 60, 65, 72, 78, 79, 83, 85, 85, 87, 89, 91, 95, 96, 97, 100, 100)
passfail<-ifelse(final_score>=60, "pass", "fail")

Suppose we want to create a variable called grade that is assigned as follows:

“F” if final_score < 60
“D” if 60 ≤ final_score < 70
“C” if 70 ≤ final_score < 80
“B” if 80 ≤ final_score < 90
“A” if 90 ≤ final_score

We can use a “nested” ifelse command as follows:

grade <- ifelse(final_score<60,"F",
         ifelse(final_score<70,"D",
         ifelse(final_score<80,"C",
         ifelse(final_score<90,"B", "A"))))

If you have missing values in your vector (NA), it’s not a problem. However (!), if you have some odd coding for missing values (-99 is common), what happens to grade for that value?
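To see what happens with an odd missing-value code, here is a small sketch with made-up scores. The vector scores and the use of -99 as the missing code are assumptions for illustration.

```r
# Made-up scores: one true missing value (NA) and one coded as -99
scores <- c(55, 72, NA, -99, 91)

# Grading the raw scores: NA stays NA, but -99 counts as a score below 60
grade_raw <- ifelse(scores < 60, "F",
             ifelse(scores < 70, "D",
             ifelse(scores < 80, "C",
             ifelse(scores < 90, "B", "A"))))
print(grade_raw)

# Recode -99 to NA first, then grade again
scores_clean <- ifelse(scores == -99, NA, scores)
grade_clean <- ifelse(scores_clean < 60, "F",
               ifelse(scores_clean < 70, "D",
               ifelse(scores_clean < 80, "C",
               ifelse(scores_clean < 90, "B", "A"))))
print(grade_clean)
```

The point is that R only knows a value is missing if it is stored as NA; any other code is treated as a real number.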


This nested ifelse method is really useful for putting different colors in graphs for different conditions. Recall the Beijing air quality plot. One plot I did read as follows:

[Figure: Beijing air quality plot, not reproduced here]

The code for the color section of the graph in the plot() command reads:

col = ifelse(pm25<=50,"green",
      ifelse(pm25<101,"yellow",
      ifelse(pm25<151,"orange",
      ifelse(pm25<201,"red",
      ifelse(pm25<301,"purple",
             "firebrick")))))

Repetitive execution: for loops, repeat and while

All of these examples are analogous to MACROS in SAS. We want to do something many times, without doing it by hand each time; we are automating the process. These methods can be greatly expanded to do dynamic, efficient, extremely complicated operations. The idea of this section is really just to get you familiar with how these programs do complicated operations.

There is a for loop construction in R which has form

> for (name in expr_1) expr_2

That means: for each element (name) of some set of values (expr_1), perform the operation expr_2.

Here is the simplest loop there is:

> for(i in 1:12){print(i)}
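A loop usually does more than print. As a minimal sketch (object names are made up), the loop below stores a running total of the numbers 1 to 12, and the last line shows that the built-in cumsum() gives the same answer without a loop.

```r
# Running total of the numbers 1 to 12, stored one step at a time
n <- 12
running_total <- numeric(n)   # create the result vector before the loop
total <- 0
for (i in 1:n) {
  total <- total + i
  running_total[i] <- total
}
print(running_total)

# The same result without a loop
print(cumsum(1:n))
```

Creating the result vector before the loop, as with numeric(n) here, is the same pattern used with goodair in this section.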


Suppose, based on the ozone measurements in the airquality data set, we want to figure out which days were good air quality days (1) or bad air quality (0), based on a cutoff of ozone levels above 60ppb. Let us create a new vector called “goodair”, which stores the information on good and bad air-quality days. We can do this using a for loop.

> numdays = nrow(airquality)
> numdays
[1] 153
> goodair = numeric(numdays) # creates an object which will store the vector
> for(i in 1:numdays) if (airquality$Ozone[i] > 60) goodair[i] = 0 else goodair[i] = 1
## (Notice that we have an if statement here within a for loop.)

Does the command above work? Why/why not? Let us check the Ozone variable. What do you notice below?

> airquality$Ozone
  [1]  41  36  12  18  NA  28  23  19   8  NA   7  16  11  14  18  14  34   6
 [19]  30  11   1  11   4  32  NA  NA  NA  23  45 115  37  NA  NA  NA  NA  NA
 [37]  NA  29  NA  71  39  NA  NA  23  NA  NA  21  37  20  12  13  NA  NA  NA
 [55]  NA  NA  NA  NA  NA  NA  NA 135  49  32  NA  64  40  77  97  97  85  NA
 [73]  10  27  NA   7  48  35  61  79  63  16  NA  NA  80 108  20  52  82  50
 [91]  64  59  39   9  16  78  35  66 122  89 110  NA  NA  44  28  65  NA  22
[109]  59  23  31  44  21   9  NA  45 168  73  NA  76 118  84  85  96  78  73
[127]  91  47  32  20  23  21  24  44  21  28   9  13  46  18  13  24  16  13
[145]  23  36   7  14  30  NA  14  18  20

When there are missing values, many operations in R fail. One way to get around this is to create a new data frame that deletes all the rows corresponding to observations with missing values. This can be done by means of the command “na.omit”:

> airqualfull = na.omit(airquality)
> dim(airqualfull)
[1] 111 6
> dim(airquality)
[1] 153 6
# How many cases were deleted because of missing data?

Sometimes deleting all cases with missing values is useful, and sometimes it is a horrible idea…
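The sketch below shows, on a short made-up vector, why a missing value trips up if() and one way a loop can check for it first. The vector ozone and the choice to record NA for missing days are assumptions for illustration.

```r
ozone <- c(41, NA, 115, 23)   # made-up values with one missing reading

# A comparison involving NA is itself NA, and if() cannot act on NA
print(NA > 60)
result <- tryCatch(
  { if (ozone[2] > 60) "bad" else "good" },
  error = function(e) "error: if() needs TRUE or FALSE, not NA"
)
print(result)

# Checking is.na() first lets the loop run over all the data
goodair <- numeric(length(ozone))
for (i in seq_along(ozone)) {
  if (is.na(ozone[i])) {
    goodair[i] <- NA          # keep the day, but mark it as unknown
  } else if (ozone[i] > 60) {
    goodair[i] <- 0
  } else {
    goodair[i] <- 1
  }
}
print(goodair)
```

This keeps every observation in place, at the cost of having NA values in the result that later steps must handle.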


We could get around this without deleting missing cases with an ifelse statement within the for loop. Now let us try doing this again with the data with the complete cases.

> numdays = nrow(airqualfull)
> numdays
[1] 111
> goodair = numeric(numdays) # initialize the vector
> for(i in 1:numdays) if (airqualfull$Ozone[i] >60) goodair[i] = 0 else goodair[i] = 1
> goodair
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 0
[38] 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 1
[75] 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

At this point we might be interested in which days were the ones with good air quality. The “which” command returns a set of indices corresponding to the condition specified. We can then use the indices to find the day of the month this corresponds to.

> which(goodair == 1) ## notice the double “=” signs!
> goodindices <- which(goodair == 1)
> airqualfull[goodindices,]

If we had wanted to keep the entire dataset airquality intact and not remove the observations with missing values, we could put a condition in the loop that deals with NA values.

Exercise 4. Suppose we want to define a day with good quality air (1) as one with ozone levels below 60ppb, and temperatures less than 80 degrees F. Write an R loop to do this, and output the resulting subset (only values that meet these criteria) of the data to a file called goodquality.txt. (Hint: use an ifelse() statement inside the for loop.)


Other looping options: WHILE & REPEAT

Similar to a for loop, the while statement can be used to perform an operation while a given condition is true. For example:

z <- 0
while (z<5){
  z<-z+2
  print(z)
}
[1] 2
[1] 4
[1] 6

In the above while statement we initialize z to have a value of 0. We then state that as long as z is less than 5 we will continue to perform the following loop operation z<-z+2. Thus we have

z <- 0+2 ## Initially z is 0, but after the first iteration of the loop the value of z is 2
z <- 2+2 ## After the second iteration the value of z is 4
z <- 4+2 ## After the third iteration the value of z is 6

The while statement stops here because z is now bigger than 5.

Another option for looping is the repeat statement. An example follows:

> i<-1
> repeat{
+ print(i)
+ if( i == 15) break
+ i<-i+1
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15

(I won’t test you on any of this: I just want to give you a flavor for the options available.)
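As one more sketch of the same two constructs, the code below counts how many times 1 must be doubled before it exceeds 1000, first with while and then with repeat and break. The problem and the object names are made up for illustration.

```r
# while: the condition is checked before each pass through the loop
value <- 1
steps <- 0
while (value <= 1000) {
  value <- value * 2
  steps <- steps + 1
}
print(c(value, steps))

# repeat: loops forever unless break is reached
value2 <- 1
steps2 <- 0
repeat {
  if (value2 > 1000) break
  value2 <- value2 * 2
  steps2 <- steps2 + 1
}
print(c(value2, steps2))
```

Both versions stop at the same point; while is usually easier to read when the stopping rule is known up front, and repeat is handy when the exit test sits in the middle of the loop body.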


Example of a useful LOOP (and a not-so-useful one):

I’m not sure if I can share this dataset with you or not, so I’m just going to show you this code.

> expression.data <- read.table("/Users/Avery/Documents/classes/Graduate/830 Microarray Data/data/data.cn.adh.dcis.clean.raw.txt", header=T, sep="\t")
> dim(expression.data) #big data
[1] 22283 110

There are 36 participants who have had genetic measurements (a microarray luminosity score) on 22k RNA expression levels. Note that rows and columns here are reversed from what we’re used to: subjects are on the columns, not rows. The data of interest are at every third column, so I subset the data as follows:

> col.signal <- seq(2,107,by=3)
> data.signal <- data.frame(expression.data[, col.signal])
> dim(data.signal)
[1] 22283 36
> data.signal[10:15, 1:5]
        Person 1  Person 2  Person 3  Person 4  Person 5 …
Gene1   188.9910  139.5375  120.3423  166.5347  295.7254 …
Gene2   322.2060  273.0769  232.9695  259.2409  460.1872 …
Gene3   222.3313  163.3700  128.9506  177.7558  297.9407 …
Gene4   478.7550  418.3608  366.6336  448.8449  780.4992 …
Gene5   415.2705  329.6245  341.5033  338.3168  557.4103 …
…

This is a made-up example, but say we want to see if any of our participants were in fact just a replicate of Person 1. If any column was a replicate of column (person) 1, the expression levels for each gene would be very, very close (about a 45 degree line on a plot). So we want to make a plot of every person’s RNA expression level compared to Person 1. (Again: this is a made-up example, but it serves the purposes of this lecture.) We can run the following code to automate this:

setwd("/Users/Avery/Desktop/plots")
for( i in 2:ncol(data.signal)){
  png( paste("myplot_", i, ".png", sep="") )   # one file per person: myplot_2.png, myplot_3.png, ...
  plot(data.signal[,1],data.signal[,i], xlab="Person 1", ylab=paste("Person", i, sep=" "),col="blue")
  dev.off()                                    # close the file before starting the next plot
}


Don’t pay attention to the specifics of the data. This is not a genetics class. The aim is just to have you see an automated script for generating multiple plots.

(non-useful LOOP): Simulating # of ties in card game War (script not included in lecture)

Conditional Indexing

Data frames and matrices allow for conditional indexing in R, which is often very useful. Instead of creating the goodair vector using a loop, we could directly extract the good air quality days using conditional indexing, using the single command below:

> airqualfull[airqualfull$Ozone < 60,]

It is worthwhile to keep in mind that many things in R can be done avoiding loops, and using conditional indexing can save a lot of time and effort! However, there are other times when using loops may be the best (or only) way to do things.

Conditional indexing is also useful for extracting groups of data from data frames, or splitting data into groups according to some criterion. For example, to get sets of ozone measurements for days with temperatures higher and lower than 80 degrees F, we can use:

> split(airqualfull$Ozone, airqualfull$Temp < 80)
$`FALSE`
 [1]  45  29  71  39  23 135  49  32  64  40  77  97  97  85  27   7  48  35
[19]  61  79  63  80 108  20  52  82  50  64  59  39   9  16 122  89 110  44
[37]  28  65 168  73  76 118  84  85  96  78  73  91  47  32  20  44  16  36

$`TRUE`
 [1]  41  36  12  18  23  19   8  16  11  14  18  14  34   6  30  11   1  11
[19]   4  32  23 115  37  21  37  20  12  13  10  16  22  59  23  31  44  21
[37]   9  45  23  21  24  21  28   9  13  46  18  13  24  13  23   7  14  30
[55]  14  18  20
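To make the mechanics of conditional indexing concrete, here is a sketch on a tiny made-up data frame. The data frame days and its values are invented for illustration and are not taken from airquality.

```r
# Small made-up data frame (values are for illustration only)
days <- data.frame(
  ozone = c(41, 36, 115, 64, 23),
  temp  = c(67, 72, 79, 83, 61)
)

# A condition on a column gives one TRUE/FALSE per row
is_low <- days$ozone < 60
print(is_low)

# Used as a row index, it keeps the TRUE rows; the empty column index keeps all columns
low_days <- days[is_low, ]
print(low_days)

# which() turns the logical vector into row numbers
print(which(is_low))

# Conditions can be combined element-wise with a single &
cool_and_high <- days[days$ozone >= 60 & days$temp < 80, ]
print(cool_and_high)
```

The comma inside the square brackets matters: the part before it selects rows, and the part after it selects columns.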

Exercise 5. Using conditional indexing, write an R command to replicate the results of exercise 4. (Hint: use single &, not double &&.)


3.3 Merging and Sorting Dataframes

This topic has a lot to it, so I only cover the basics. There are plenty of online resources to do this on your own. There won’t be any homework questions on this section.

Say you have two datasets and you want to merge them based on an ID number. A simple example of merging these by variable ID follows:

> dataset.1<-matrix(c(1,13,12,1, 2,12,10,2, 3,13,9,1, 4,9,8,2, 5,3,7,3, 6,5,6,1, 7,6,5,2, 8,5,5,3), ncol=4, byrow=T)
> dataset.1<-data.frame(dataset.1)
> names(dataset.1)<-c("ID","read 1","read 2","read 3")
> dataset.1
  ID read 1 read 2 read 3
1  1     13     12      1
2  2     12     10      2
3  3     13      9      1
4  4      9      8      2
5  5      3      7      3
6  6      5      6      1
7  7      6      5      2
8  8      5      5      3
> dataset.2<-matrix(c(1,12, 2,13, 3,3, 4,15, 5,31, 6,15, 7,4, 8,6, 9,22), ncol=2, byrow=T)
> dataset.2<-data.frame(dataset.2)
> names(dataset.2)<-c("ID","read 4")
> dataset.2
  ID read 4
1  1     12
2  2     13
3  3      3
4  4     15
5  5     31
6  6     15
7  7      4
8  8      6
9  9     22
> dataset.merged <- merge(dataset.1,dataset.2,by="ID", all.y=TRUE)
> dataset.merged
  ID read 1 read 2 read 3 read 4
1  1     13     12      1     12
2  2     12     10      2     13
3  3     13      9      1      3
4  4      9      8      2     15
5  5      3      7      3     31
6  6      5      6      1     15
7  7      6      5      2      4
8  8      5      5      3      6
9  9     NA     NA     NA     22

Now, a nice simple example of many-to-one merging:

> library(reshape) # the reshape() function used below is part of base R, so this line is not strictly needed for it
> my.test.2<-matrix(c(1,13,12,1, 1,12,10,2, 2,13,9,1, 2,9,8,2, 2,3,7,3, 3,5,6,1, 3,6,5,2, 3,5,5,3), ncol=4, byrow=T) #create sample data
> my.test.2<-as.data.frame(my.test.2) #convert to a dataframe, a more robust format
#add column names
> names(my.test.2)<-c("ID","read A","read B","visit")
#print it to get a look. This is in "long" format
> my.test.2
  ID read A read B visit
1  1     13     12     1
2  1     12     10     2
3  2     13      9     1
4  2      9      8     2
5  2      3      7     3
6  3      5      6     1
7  3      6      5     2
8  3      5      5     3
#above is a matrix of repeated measures on the same individual, at different visits.

The function below coerces the data frame into a 'wide' format, with one row per individual, renaming the variables by visit and filling in NA where a visit is missing.

> wide_mytest <- reshape(my.test.2, direction="wide",idvar="ID",timevar="visit")
> wide_mytest
  ID read A.1 read B.1 read A.2 read B.2 read A.3 read B.3
   1       13       12       12       10       NA       NA
   2       13        9        9        8        3        7
   3        5        6        6        5        5        5
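merge() also handles the case where one table has a single row per ID and the other has several. A minimal sketch with made-up tables (subjects and visits are invented names and values):

```r
# One row per subject (made-up values)
subjects <- data.frame(ID = c(1, 2, 3), group = c("a", "b", "a"))

# Several rows per subject
visits <- data.frame(
  ID    = c(1, 1, 2, 3, 3, 3),
  visit = c(1, 2, 1, 1, 2, 3),
  read  = c(13, 12, 13, 5, 6, 5)
)

# Each subject row is repeated once for every matching visit row
merged <- merge(subjects, visits, by = "ID")
print(merged)
```

The result has one row per visit, with the subject-level column group copied onto each of that subject's rows.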


Finally, say you have a matrix you want to sort from smallest to largest by a particular column, but you want to keep each row intact. Do the following:

> setwd("/Users/Avery/Desktop/classes/720 spring 2015/classes/wk3/sorting matrices")
> ZZ<-read.table("ZZ")
> ZZ
          p result x1         x2
1 0.8417549      1  2  0.4213440
2 0.9136235      1  3 -0.6412975
3 0.8361460      0  2  0.3798271
4 0.9850423      1  4 -0.5625415
5 0.3114491      1  0  1.4566465
#note each row here starts with an assigned number; this is NOT an actual column of (1,2,3,…), it’s just a row designation.

#now sort it via column named "p"
> ZZ[ order(ZZ[,1]) , ]
#or:
> attach(ZZ)
> ZZ[order(p),]
          p result x1         x2
5 0.3114491      1  0  1.4566465
3 0.8361460      0  2  0.3798271
1 0.8417549      1  2  0.4213440
2 0.9136235      1  3 -0.6412975
4 0.9850423      1  4 -0.5625415

Now the data frame is sorted by column “p” while each row is kept intact.

There won’t be any homework questions on this section! I just wanted to show you all some of these techniques as a reference for later on.
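Two common variations on order(), sketched on a made-up data frame (the name scores and its values are invented): sorting from largest to smallest, and sorting by one column and then by a second within it.

```r
# Made-up data frame for illustration
scores <- data.frame(
  name  = c("Ann", "Bob", "Cy", "Dee"),
  group = c(2, 1, 2, 1),
  p     = c(0.84, 0.91, 0.31, 0.45)
)

# Largest p first
by_p_desc <- scores[order(scores$p, decreasing = TRUE), ]
print(by_p_desc)

# Sort by group, then by p within each group
by_group_p <- scores[order(scores$group, scores$p), ]
print(by_group_p)
```

Writing scores$p rather than attaching the data frame avoids any confusion with other objects that happen to be called p.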


3.4 Writing simple functions in R: Why and How

The R language allows the user to create objects of mode function. These are true R functions that are stored in a special internal form and may be used in further expressions and so on. In the process, the language gains enormously in power, convenience and elegance, and learning to write useful functions is one of the main ways to make your use of R comfortable and productive.

It should be emphasized that many of the functions supplied as part of the R system, such as mean() and var(), are themselves written in R and thus do not differ materially from user-written functions.

A function is defined by an assignment of the form

> name <- function(arg_1, arg_2, ...) expression

The expression is an R expression, (usually a grouped expression), that uses the arguments, arg_i, to calculate some value. The value of the expression is the value returned by the function.

A call to the function then usually takes the form name(expr_1, expr_2, ...) and may occur anywhere a function call is legitimate.
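A minimal sketch of that pattern, with a default value for the second argument. The function name to_feet and the argument names are made up for this illustration.

```r
# Convert inches to feet; per_foot has a default value, so it can be left out
to_feet <- function(inches, per_foot = 12) {
  inches / per_foot
}

print(to_feet(60))                         # uses the default per_foot = 12
print(to_feet(c(60, 72)))                  # works on a whole vector at once
print(to_feet(inches = 66, per_foot = 12)) # arguments can also be given by name
```

Because the body is a grouped expression, the function returns the value of its last (here, only) line.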

Simple functions

As a first example, consider a function to calculate a one-sample t-statistic to test the null hypothesis that in the height and weight data set, the mean population weight is “x” lb, where “x” can be specified by the user. This is an artificial example, of course, since there are other, simpler ways of achieving the same end (we’ll do this next class).

> onesam <- function(y1, x) {
    n1 <- length(y1)                ## sample size
    yb1 <- mean(y1)                 ## mean of y1
    s1 <- var(y1)                   ## variance of y1
    tstat <- (yb1 - x)/sqrt(s1/n1)  ## computing t-statistic = (mean-x)/SE
    tstat
  }

With this function defined, you could perform one-sample t-tests using a call such as

> t.statistic <- onesam(htwtdata$weight, 130); t.statistic

To check whether this function works, compare it to running the actual t-test function built into R:

> t.test(htwtdata$weight-130)

A function can be called within a loop, or can be applied to elements of a vector or matrix at once, making R very powerful. We will continue looking at similar examples throughout the course.

Another example:

min.max.range <- function(x){
  minimum <- min(x)
  r <- max(x) - min(x)
  maximum <- max(x)
  print(minimum)
  print(maximum)
  print(r)
}
vec.1<- c(10, 20, 50)
min.max.range(vec.1)
[1] 10
[1] 50
[1] 40

Exercise 6. Write a function called summarystat, which returns the mean, median, and standard deviation of a set of numbers.
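The min.max.range example prints its three numbers, which means they cannot easily be reused afterwards. One possible variant, sketched here under the made-up name min_max_range, returns them as a single named vector instead.

```r
# Variant that returns its results instead of printing them
min_max_range <- function(x) {
  c(minimum = min(x), maximum = max(x), range = max(x) - min(x))
}

out <- min_max_range(c(10, 20, 50))
print(out)
print(out["range"])   # pick out one result by name
```

Returning a value (rather than printing it) lets the result be stored, indexed, or passed on to another function.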


Recap:

• Review of matrix/dataframe operations
• if statements, ifelse(), loops
• Conditional indexing (the most useful topic of this class)
• Merging & sorting dataframes (not on homework, but very useful in the real world)
• Creating functions

Reading:

• VS. Chapter 8.1, 8.2, 9 and 10

Assignment:

• Homework 2 due, Homework 3 assigned. (With extra credit for those interested in simulation.)