
Original coding by Eric Lecoutre

Initial coding: Sean Lorenz

Introduction to R

Anders Stockmarr

DTU Compute
Section for Statistics and Data Analysis

Technical University of Denmark
[email protected]

DTU Management Engineering
May 22, 2017

(DTU) R intro May 22, 2017 1 / 93

Outline

Coworkers

Elisabeth Wreford Andersen, The Danish Cancer Society
Kasper Kristensen, DTU AQUA
Andes Nielsen, DTU AQUA

Outline of Talk

Introduction to R
Data management
Graphics
Linear Models
ggplot2

Outline

1 Introduction to R

2 Importing Data to R

3 Description of Data

4 Modifying Data

5 Graphics
Histogram
Box plot
Scatter Plot
Line plot

Introduction to R


R is a programming language and a programming environment.
It is free! Developed by users under a GNU license.
Runs on a variety of platforms including Windows, Unix and MacOS.
You can even get it for Android.
Allows for fast implementation of new methods by user demand through packages.
R has state-of-the-art graphics capabilities.

Advantages of R

Frank Harrell in 2009 (my highlighting):

"One point that hasn’t been made very explicitly is one of the greatest advantages of R:

Getting your work done better and in less time.

Hundreds of companies hire a multitude of SAS programmers to write code in an archaic language, the SAS macro language. I believe there is a real cost savings from R because of its value as a data analysis, data manipulation, and graphics environment. Instead of programming using an indirect syntax manipulation environment (SAS macros), in R you can program in a dynamic data-sensitive framework".

That was 8 years ago. Things have progressed since...

Base R

Base R and most R packages are available for download at the Comprehensive R Archive Network (CRAN).
http://www.cran.r-project.org
Base R includes basic data management, analysis and graphics tools.
For non-specialized tasks, Base R is all you need.
Specialized tasks may be handled by packages.
We will download, install and use packages.
Not all packages are well documented (it depends on the contributor).
Want to be sure about what your program does?

Use well-established packages only;
or write your own code.

RStudio

You can work directly in R.
Many prefer another front end (GUI, Graphical User Interface).
We will use RStudio.
Download from http://www.rstudio.com/

The GUI RStudio has 4 windows:
One for writing the commands (the "script"). Use a script for reproducibility.
One for results and interactive use.
One for plots, help and packages.
One showing which objects are resident in the R memory.

R as a calculator

2+2

[1] 4

(2*5)+(12/3)-(2^3)

[1] 6

exp(log(1))

[1] 1

sqrt(25)

[1] 5

log(2*2)

[1] 1.3863

log(2)+log(2)

[1] 1.3863

Writing commands in R

Commands are separated by either a new line or ;
R is case sensitive: id is a different name than ID.
The character # marks the start of a comment: the rest of the line is not executed.
Help can be found on the internet, from colleagues, or in R by writing ? followed by the name of the function you want help with:

?plot

or, in RStudio, highlight the expression and press F1.

Objects in R

Both data and output from analyses are stored as objects (if stored).
Sometimes output is just displayed on the screen, and you need to assign the object to an identifier to keep it.
In fact, everything in the R memory is stored in objects.
An object could be a vector, a matrix or a data frame.
Values are assigned to objects using the assignment operator <-.
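A minimal sketch of assignment with <-. The identifiers x and v are chosen here for illustration only.

```r
# Assign a value to the identifier x; nothing is printed
x <- 2 + 2

# Typing the identifier prints the stored object
x

# Without assignment the result is displayed but not kept
sqrt(25)

# Objects can hold whole vectors as well
v <- 0:10
length(v)
```

Because R is case sensitive, x and X would be two different objects.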

Generating a sequence

Specify the first and last values separated by a colon.
Otherwise use seq()

0:10

[1] 0 1 2 3 4 5 6 7 8 9 10

15:5

[1] 15 14 13 12 11 10 9 8 7 6 5

seq(from = 0, to = 1.2, by = 0.1)

[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2

x


Generating repeats using rep()

rep(8, 5)

[1] 8 8 8 8 8

rep(1:4, each = 2)

[1] 1 1 2 2 3 3 4 4

rep(1:4, each = 2, times = 3)

[1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4


Functions in R

We assign a simple function to the identifier f:

>fff
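A minimal sketch of defining and calling a function. The body of f and the argument name power are hypothetical, picked only to show a default value at work.

```r
# A hypothetical function: x raised to a power, with a default power of 2
f <- function(x, power = 2) {
  x^power
}

f(3)             # uses the default power = 2
f(3, power = 3)  # overrides the default
f                # the name without () prints the definition
```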

We have already used many functions with and without default values:

"+"(2,2)
sqrt(25)
log(2)
ls()
":"(0,10)
seq(from=0.1,to=1.2,by=0.1)
rep(1:4,each=2,times=3)

Many applications in R are built up as functions. You can see default arguments in the help files. Example: log.

Data structures in R: Singles

Logical, e.g.:
> TRUE
[1] TRUE
> 1==2
[1] FALSE

Single numbers, e.g.:
> 1
[1] 1
> 1.2
[1] 1.2

Character, e.g.:
> "5"
[1] "5"
> "abc"
[1] "abc"

Data structures in R: Vectors

Constructed via the concatenate function c().

Vector of numbers, e.g:

> c(1,1.2,pi,exp(1))
[1] 1.000000 1.200000 3.141593 2.718282

We can have vectors of other things too, e.g.:

> c(TRUE,1==2)
[1] TRUE FALSE
> c("a","ab","abc")
[1] "a" "ab" "abc"

But not combinations, e.g.:
> c("a",5,1==2)
[1] "a" "5" "FALSE"

Note that R just turned everything into characters!

Data structures in R: Matrices

Columns of same type and same length:

> matrix(c(1,2,3,4,5,6)+pi,nrow=2)
         [,1]     [,2]     [,3]
[1,] 4.141593 6.141593 8.141593
[2,] 5.141593 7.141593 9.141593

> matrix(c(1,2,3,4,5,6)+pi,nrow=2)

Data structures in R: Data frames

Same length of columns but different types; spreadsheet data.
Created from reading in data from external files;
or by using the function data.frame() on a set of vectors.

> data.frame(treatment=c("active","active","placebo"),
+            bp=c(80,85,90))
  treatment bp
1    active 80
2    active 85
3   placebo 90

Compare to a matrix created with the cbind() command:
> cbind(treatment=c("active","active","placebo"),bp=c(80,85,90))
     treatment bp
[1,] "active"  "80"
[2,] "active"  "85"
[3,] "placebo" "90"

Data structures in R: Lists

Different length of columns and different types.
Most general object type.
> list(a=1,b="abc",c=c(1,2,3),d=list(e=matrix(1:4,2), f=function(x){x^2}))
$a
[1] 1

$b
[1] "abc"

$c
[1] 1 2 3

$d
$d$e
     [,1] [,2]
[1,]    1    3
[2,]    2    4

$d$f
function (x){x^2}

The objects returned from many of the built-in functions in R are fairly complicated lists.
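Since so many R functions return lists, it helps to know how to take one apart. A minimal sketch, using a list of the same shape as the one printed above (the identifier l is an assumption):

```r
# A list mixing types and lengths
l <- list(a = 1, b = "abc", c = c(1, 2, 3),
          d = list(e = matrix(1:4, 2), f = function(x) x^2))

names(l)               # component names
l$c                    # extract a component by name
l[["b"]]               # the same idea, using double brackets
l$d$e[2, 1]            # reach into a nested list, then index the matrix
l$d$f(3)               # components can even be functions
str(l, max.level = 1)  # compact overview of the structure
```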

Importing Data to R


Data can be imported directly from SAS, SPSS, Excel, STATA etc.
The easiest is to use data saved as text files.
Usually values in text files are separated, or delimited, by tabs or commas.
First tell R where to find your data using the command setwd().
Check that all went to plan with getwd().

setwd("C:/users/anst/Foredrag/DTU Management Engineering 22052017")
getwd()

[1] "C:/users/anst/Foredrag/DTU Management Engineering 22052017"

The function read.table() can be used to read data saved as text.
Wrappers: read.csv(), read.csv2() and read.delim().
Notice the option sep, which sets the field separator.
We assign the loaded data to objects.
If you have an Excel sheet, save it as text first.

Births.tab
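A minimal sketch of reading a tab-separated text file. The file, its contents and the object names demo.tab and demo.delim are made up for illustration; the file is written to a temporary location first so that the example is self-contained.

```r
# Write a tiny tab-separated file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("id\tbweight\tsexalph",
             "1\t2974\tfemale",
             "2\t3270\tmale",
             "3\t\tfemale"),      # empty field: missing birth weight
           path)

# header = TRUE: the first line holds the variable names
# sep = "\t":    fields are separated by tabs
demo.tab <- read.table(path, header = TRUE, sep = "\t")
demo.tab

# read.delim() is a wrapper with these two settings as defaults
demo.delim <- read.delim(path)
```

The blank field in the numeric column is read as NA.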

Importing Data using RStudio

In the Objects Window, click "Import Dataset"

Importing Data From Other Programs

We can read data from a series of other statistical software packages using the package foreign.

# INSTALL AN EXTRA PACKAGE
install.packages("foreign")

# ACTIVATE THE PACKAGE
library("foreign")

SPSS_Data


Looking At Your Data

There are several ways to look at the data (or parts of the data).

# FIRST FEW OBSERVATIONS
head(Births.tab)

  id bweight lowbw gestwks preterm matage hyp sex sexalph
1  1    2974     0   38.52       0     34   0   2  female
2  2    3270     0      NA      NA     30   0   1    male
3  3    2620     0   38.15       0     35   0   2  female
4  4    3751     0   39.80       0     31   0   1    male
5  5    3200     0   38.89       0     33   1   1    male
6  6    3673     0   40.97       0     33   0   2  female

# LAST FEW OBSERVATIONS
tail(Births.tab)

     id bweight lowbw gestwks preterm matage hyp sex sexalph
495 495    2968     0   41.01       0     34   0   1    male
496 496    2852     0   38.45       0     28   0   2  female
497 497    3187     0   38.03       0     38   1   1    male
498 498    3054     0   38.50       0     26   0   2  female
499 499    3178     0   39.92       0     31   0   2  female
500 500    2918     0   37.97       0     31   0   1    male

# VARIABLE NAMES
names(Births.tab)

[1] "id"      "bweight" "lowbw"   "gestwks" "preterm" "matage"  "hyp"
[8] "sex"     "sexalph"

# VIEW THE DATA IN A NEW WINDOW
View(Births.tab)

Missing values

In R, missing values are coded as NA (not available).
In your Excel file leave missing values blank; do not set them to 99 or 999.

  id bweight lowbw gestwks preterm matage hyp sex sexalph
1  1    2974     0   38.52       0     34   0   2  female
2  2    3270     0      NA      NA     30   0   1    male

Accessing Observations

Data are (usually) stored in a data frame object.
Observations are the rows.
Variables, either numerical or categorical, are the columns.
We can access individual rows, columns and cells in the data frame.
For this, we use the bracket operator: object[row, column].

# A SINGLE CELL
Births.tab[345, 4]

[1] 38.55

# LEAVING OUT A COLUMN NUMBER INDICATES THAT ALL COLUMNS
# ARE CHOSEN. HERE ALL COLUMNS IN ROW 224
Births.tab[224 , ]

     id bweight lowbw gestwks preterm matage hyp sex sexalph
224 224    3216     0   39.94       0     38   1   1    male

# LEAVING OUT A ROW NUMBER INDICATES THAT ALL ROWS ARE CHOSEN
# HERE ALL ROWS IN COLUMN 5
Births.tab[ ,5]

[1] 0 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
[24] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
[47] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
[70] 0 0 0 1 1 0 0 0 0 0 NA 0 0 0 0 0 0 0 0 0 0 NA 0
[93] 1 0 1 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 0 0 0 0 0 1
[116] 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
[139] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[162] 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
[185] 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
[208] 0 1 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0
[231] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
[254] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1
[277] 0 0 0 0 1 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[300] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0
[323] 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0
[346] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 NA 0 1 0 1 0 0
[369] 0 1 0 0 0 0 0 0 1 0 0 0 NA 0 0 0 0 0 0 0 0 0 0
[392] 0 0 0 1 NA 0 0 NA NA 0 0 0 0 0 0 0 0 0 0 1 0 0 1
[415] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0
[438] 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
[461] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
[484] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

# USE RANGES, ROWS 15 TO 18 COLUMNS 1 TO 4
Births.tab[15:18, 1:4]

   id bweight lowbw gestwks
15 15    3662     0   39.23
16 16    3035     0   38.96
17 17    3351     0   39.35
18 18    3804     0   38.99

Variables can be accessed directly using the $ operator (object$variable), the variable name (object[ ,"variable"]), or the column number (object[ ,k]).

# GET THE BIRTH WEIGHT FOR CHILD 26 TO 36
Births.tab$bweight[26:36]

[1] 3585 3798 3164 3739 1780 4022 3942 2887 2391 3911 3509

Births.tab[26:36, "bweight"]

[1] 3585 3798 3164 3739 1780 4022 3942 2887 2391 3911 3509

Births.tab[26:36,2]

[1] 3585 3798 3164 3739 1780 4022 3942 2887 2391 3911 3509


Subsetting using the c() function

The concatenate function c() can be used to access non-sequential rows and columns from a data frame.

# GET COLUMNS 2, 5, 7, 8, 9 FOR ROW 33
Births.tab[33, c(2, 5, 7:9)]

   bweight preterm hyp sex sexalph
33    2887       0   0   1    male

# GET bweight, preterm and sexalph FOR ROW 71
Births.tab[71, c("bweight", "preterm", "sexalph")]

   bweight preterm sexalph
71    3189       0    male

Variable Names

If we want to change the variable names we can use names().

# NEW VARIABLE NAMES
names(Births.tab)
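A minimal sketch of renaming variables by assigning to names(). The data frame d and all the new names are hypothetical.

```r
# A small stand-in data frame (values are illustrative)
d <- data.frame(id = 1:3, bweight = c(2974, 3270, 2620), sex = c(2, 1, 2))

# Rename a single variable by position
names(d)[2] <- "birthweight"

# Rename by matching the old name, which is safer if the column order changes
names(d)[names(d) == "sex"] <- "gender"

# Replace all names at once; the vector must have one name per column
names(d) <- c("ID", "BirthWeight", "Gender")
names(d)
```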

Saving/Exporting data

We can save the data to a text file, using either write.table() for a tab-separated file, or write.csv()/write.csv2() for a comma/semicolon-separated file (with "." and "," as decimal mark, respectively).

write.table(Births.tab, file = "Birth_new.txt",
            sep = "\t", na = ".", row.names = FALSE)

write.csv2(Births.tab, file = "Birth_new.csv")

Description of Data


We are still looking at the data set with birth weights for 500 children.
Using the function str() we can see a description of what our data frame contains (the structure).

str(Births.tab)

'data.frame': 500 obs. of 9 variables:
 $ id     : int 1 2 3 4 5 6 7 8 9 10 ...
 $ bweight: int 2974 3270 2620 3751 3200 3673 3628 3773 3960 3405 ...
 $ lowbw  : int 0 0 0 0 0 0 0 0 0 0 ...
 $ gestwks: num 38.5 NA 38.2 39.8 38.9 ...
 $ preterm: int 0 NA 0 0 0 0 0 0 0 0 ...
 $ matage : int 34 30 35 31 33 33 29 37 36 39 ...
 $ hyp    : int 0 0 0 0 1 0 0 0 0 0 ...
 $ sex    : int 2 1 2 1 1 2 2 1 2 1 ...
 $ sexalph: Factor w/ 2 levels "female","male": 1 2 1 2 2 1 1 2 1 2 ...

Description of Data: Birth weights

The Births.tab dataset is a data frame with 500 observations and 9 variables.
Most variables are integers, but "gestwks" is numeric.
The variable "sexalph" is a factor. This is a categorical variable (either numeric or string) with a finite number of levels, here "female" and "male".
"sexalph" and "sex" contain the same information, but "sexalph" is a factor while "sex" is not.
We can convert "sex" to a factor using as.factor().

# TELL R THAT sex IS A FACTOR
Births.tab$sex
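A minimal sketch of the conversion, on a stand-in vector rather than the Births data. The coding 1 = male, 2 = female follows the rows shown earlier; the names sex.f and sex.lab are assumptions.

```r
# Stand-in for the sex variable: 1 = male, 2 = female
sex <- c(2, 1, 2, 1, 1, 2)

# as.factor() keeps the codes as level names
sex.f <- as.factor(sex)
levels(sex.f)

# factor() with labels attaches readable names to the codes
sex.lab <- factor(sex, levels = c(1, 2), labels = c("male", "female"))
table(sex.lab)

# summary() now counts the levels instead of computing quartiles
summary(sex.f)
```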

Descriptive Statistics

There are many simple extractor functions for summary statistics in R.
Common functions are mean(), sd(), median(), max() and min().

mean(Births.tab$bweight)

[1] 3136.9

sd(Births.tab$bweight)

[1] 637.45

median(Births.tab$bweight)

[1] 3188.5

max(Births.tab$bweight)

[1] 4553

min(Births.tab[ , 2])

[1] 628


The Summary Function

The function summary() can be used with many objects in R.
When used on a data frame we get all the main summary statistics.

# SUMMARY OF THE DATA FRAME
summary(Births.tab)

       id         bweight         lowbw         gestwks
 Min.   :  1   Min.   : 628   Min.   :0.00   Min.   :24.7
 1st Qu.:126   1st Qu.:2862   1st Qu.:0.00   1st Qu.:37.9
 Median :250   Median :3188   Median :0.00   Median :39.1
 Mean   :250   Mean   :3137   Mean   :0.12   Mean   :38.7
 3rd Qu.:375   3rd Qu.:3551   3rd Qu.:0.00   3rd Qu.:40.1
 Max.   :500   Max.   :4553   Max.   :1.00   Max.   :43.2
                                             NA's   :10

    preterm          matage        hyp         sex       sexalph
 Min.   :0.000   Min.   :23   Min.   :0.000   1:264   female:236
 1st Qu.:0.000   1st Qu.:31   1st Qu.:0.000   2:236   male  :264
 Median :0.000   Median :34   Median :0.000
 Mean   :0.129   Mean   :34   Mean   :0.144
 3rd Qu.:0.000   3rd Qu.:37   3rd Qu.:0.000
 Max.   :1.000   Max.   :43   Max.   :1.000
 NA's   :10

Summaries

We may only want summaries for some of the data, e.g. babies with birth weight < 2900 g.
We subset the data and then summarize as before:

summary(Births.tab[Births.tab$bweight < 2900, ])

Group Summaries

We can work on data separated by groups.
Suppose that we want to calculate the mean birth weight for boys and girls (there are many ways to do this).
We will use the tapply() function to apply the mean function to the two levels of "sexalph".
The call has the form tapply(X, INDEX, FUN): X is the variable, INDEX the grouping factor and FUN the function to apply.

# MEAN BIRTH WEIGHT FOR BOYS AND GIRLS
tapply(Births.tab$bweight, Births.tab$sexalph, mean)

  female     male
3032.831 3229.902

Histogram

Often it is easier to get an impression of a distribution using plots.
Histograms are typically used for continuous variables.

hist(Births.tab$bweight, main = "Title", xlab = "Birth weight (g)")

[Figure: histogram with heading "Title"; x-axis "Birth weight (g)", y-axis "Frequency".]

The same histogram, here with a box on:

hist(Births.tab$bweight, main = "Title", xlab = "Birth weight (g)")
box()

[Figure: the same histogram with a surrounding box.]

Boxplot

Boxplots show the median, the upper and lower quartiles, and potentially extreme values.

boxplot(Births.tab$bweight, xlab = "Birth weight (g)")

[Figure: box plot of birth weight; axis label "Birth weight (g)".]

Modifying Data


We will concentrate on how to modify and rearrange our data.
Data can be sorted with the order() function.
order() can sort the Births.tab data by "sex", and then by "bweight".
The order() function returns a vector of row indices in sorted order, which we apply to the rows of the unsorted data frame to get a sorted version.

Birth_sort
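A minimal sketch of sorting with order(), on a small stand-in data frame. The names d, idx and d_sort and all the values are illustrative.

```r
# Stand-in data
d <- data.frame(id = 1:6,
                sex = c(2, 1, 2, 1, 1, 2),
                bweight = c(2974, 3270, 2620, 3751, 3200, 3673))

# order() returns the row numbers in sorted order:
# first by sex, ties broken by bweight
idx <- order(d$sex, d$bweight)
idx

# Use the indices to rearrange the rows
d_sort <- d[idx, ]
d_sort

# decreasing = TRUE reverses the ordering
d[order(d$bweight, decreasing = TRUE), ]
```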

Creating new variables and deleting old

New variables can be added to a data frame.

# ADD A VARIABLE TO DATA FRAME
Births.tab$log_bweight
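A minimal sketch of adding and deleting variables, on a stand-in data frame d. The variable bweight_kg is hypothetical and only there to be deleted again.

```r
d <- data.frame(id = 1:3, bweight = c(2974, 3270, 2620))

# Add a variable: assign to a column name that does not exist yet
d$log_bweight <- log(d$bweight)
d$bweight_kg  <- d$bweight / 1000

# Delete a variable: assign NULL to it
d$bweight_kg <- NULL
names(d)
```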

Grouping the values of a variable using cut

To group a continuous variable, e.g. mother’s age (matage), into the groups ]20-30], ]30-35], ]35-40], ]40-45]:

Births.tab$agegrp
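A minimal sketch of cut() with the interval limits from the text. The ages are made up; R prints the interval ]20-30] as (20,30].

```r
matage <- c(23, 30, 31, 35, 38, 43)

# breaks are the interval limits; by default the intervals are closed on
# the right, so 30 falls in (20,30] and 35 in (30,35]
agegrp <- cut(matage, breaks = c(20, 30, 35, 40, 45))
agegrp
table(agegrp)
```

The result is a factor, so it can be used directly as a grouping variable.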

Creating new variables: rowSums

Often we want to form new variables from other variables.
For example we might want to calculate a total score from some subscores.
We can sum variables using rowSums(). Related functions are rowMeans(), colSums() and colMeans().
Notice the effect of the option na.rm:
na.rm = FALSE: if one of the values in a row is missing, the row sum is set to missing.
na.rm = TRUE: missing values are ignored and the sum of the non-missing values is calculated.
For a data frame x with numeric columns, colMeans(x) gives the same result as sapply(x, mean), and rowSums(x) the same as apply(x, 1, sum). sapply() and apply() can be used with many other functions.

# WANT TO MAKE A NEW VARIABLE SUMMING PRETERM, LOWBW AND HYP
Births.tab$score
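A minimal sketch of the effect of na.rm, on a stand-in data frame with one missing value. The column name score_narm is hypothetical.

```r
d <- data.frame(preterm = c(0, NA, 1),
                lowbw   = c(0, 0, 1),
                hyp     = c(1, 0, 0))

# na.rm = FALSE (the default): a missing value makes the whole sum missing
d$score <- rowSums(d[, c("preterm", "lowbw", "hyp")])

# na.rm = TRUE: sum over the non-missing values only
d$score_narm <- rowSums(d[, c("preterm", "lowbw", "hyp")], na.rm = TRUE)
d
```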

Split Data: Subset

Sometimes we may need to split our data.
In the Births data we may need to split the data into boys and girls.
We can use the subset() function and assign the new data sets to separate R objects.
Notice == (logical expression). We are not assigning a value to "sex", but asking whether "sex" is equal to 1.

Births.Male
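A minimal sketch of splitting rows with subset(), on a stand-in data frame. The object names and the 3250 g cut-off are chosen for illustration.

```r
d <- data.frame(id = 1:6,
                sex = c(2, 1, 2, 1, 1, 2),
                bweight = c(2974, 3270, 2620, 3751, 3200, 3673))

# Keep the rows where the logical expression is TRUE
d.male   <- subset(d, sex == 1)
d.female <- subset(d, sex == 2)

# Conditions can be combined with & (and) and | (or)
d.heavy.male <- subset(d, sex == 1 & bweight > 3250)
d.heavy.male
```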

Subset

Often data sets come with a lot of variables and we only want to use a few.
The function subset() can also be used to select the variables we want.
Notice the select option. This is needed to say that we want a subset of columns (on the previous slide it was rows).
Notice that we do not need quotes in select.

# SELECT 3 VARIABLES
Births.new
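A minimal sketch of the select option, on a stand-in data frame. Which three variables the slide selected is not known, so the choice here is an assumption.

```r
d <- data.frame(id = 1:3,
                bweight = c(2974, 3270, 2620),
                matage = c(34, 30, 35),
                hyp = c(0, 0, 1))

# Keep three variables; no quotes needed around the names
d.new <- subset(d, select = c(id, bweight, matage))

# A minus sign drops a variable instead
d.nohyp <- subset(d, select = -hyp)

# Rows and columns can be chosen in the same call
d.both <- subset(d, hyp == 0, select = c(id, bweight))
d.both
```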

Aggregating data

Sometimes we want to make a new data frame as a summary of the original data frame on the basis of factor levels.
Below we want to make a new data frame with the mean birth weight for combinations of preterm and sex.

PreSex
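One way to build such a summary is aggregate() with a formula. This is a sketch on made-up data and may differ from the call used for PreSex on the slide.

```r
d <- data.frame(preterm = c(0, 0, 1, 0, 1, 0),
                sex     = c(1, 1, 1, 2, 2, 2),
                bweight = c(3000, 3200, 2000, 3100, 2100, 2900))

# Mean of bweight within each combination of preterm and sex;
# the result is a data frame with one row per combination
PreSex.demo <- aggregate(bweight ~ preterm + sex, data = d, FUN = mean)
PreSex.demo
```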

Add rows: rbind

Suppose that data are collected for subgroups of subjects and saved in separate objects.
The separate objects are appended (stacked) to create a single object.
This will give an error message if the number of columns differs.

# APPEND
Births.Both
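A minimal sketch of stacking two data frames with rbind(). The objects boys and girls are stand-ins; the failing call is wrapped in try() so that the script keeps running.

```r
boys  <- data.frame(id = 1:2, bweight = c(3270, 3751))
girls <- data.frame(id = 3:4, bweight = c(2974, 2620))

# Stack the rows; the columns are matched by name
both <- rbind(boys, girls)
both

# Different numbers of columns give an error
res <- try(rbind(boys, girls[, "id", drop = FALSE]), silent = TRUE)
inherits(res, "try-error")
```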

Add variables: merge

Often you have data in several data sets and want to combine the data sets by merging, using one or more variables as key variables. This adds variables to a master data set.

Person data: Id, age, sex, race
Answers to questionnaire: Id, q1,…,q10
Merged data (person data and answers): Id, age, sex, race, q1,…,q10

Merge

We have two data sets with a key variable "id": one with background information and one with blood pressure measurements.

agesex

4 Different Merges

In the merge function we will look at 4 of the options.
We have merge(x, y, by = "key variable", ...) with one of all = FALSE (the default), all = TRUE, all.x = TRUE or all.y = TRUE.
Here x and y are data frames.

Merging all=FALSE

merge_small

Merging all=TRUE

merge_large

Merging all.x=TRUE

merge_x

Merging all.y=TRUE

merge_y
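A minimal sketch of the four merges. The two data frames are hypothetical stand-ins for the background and blood pressure data, so the row counts differ from those on the slides.

```r
# Background information, one row per person
agesex <- data.frame(id  = c(1, 2, 3, 4),
                     age = c(34, 30, 35, 31),
                     sex = c("F", "M", "F", "M"))

# Blood pressure, possibly several visits per person
bp <- data.frame(id    = c(1, 2, 2, 5),
                 visit = c(1, 1, 2, 1),
                 bp    = c(120, 135, 130, 140))

merge_small <- merge(agesex, bp, by = "id")                # ids found in both
merge_large <- merge(agesex, bp, by = "id", all = TRUE)    # ids from either
merge_x     <- merge(agesex, bp, by = "id", all.x = TRUE)  # every id in agesex
merge_y     <- merge(agesex, bp, by = "id", all.y = TRUE)  # every id in bp

merge_y
```

Rows without a match in the other data set are filled with NA, as for id 5 in merge_y.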

Counting the Missing Observations: The is.na() and sum() functions

Suppose that we want to count the number of missing observations.
The function is.na() returns a logical vector that is TRUE when a value is missing and FALSE otherwise.

is.na(merge_y$sex)

[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE

# COUNT MISSING FOR ONE VARIABLE
sum(is.na(merge_y$sex))

[1] 1

# COUNT FOR DATA FRAME
colSums(is.na(merge_y))

   id   age   sex visit    bp
    0     2     1     0     0

Saving your work

Saving your script
Saving your workspace

Always save your script - do it often if you work in RStudio.

Reasons for saving your workspace:
Extensive data creations will be there next time you open your workspace.
Objects created ’on the fly’ (not in your script) will be there.

Reasons for not saving your workspace:
With a well-written script, you can recreate your analysis in seconds, unless you work with huge amounts of data.
Data that were edited and saved, where the edits have since been forgotten, may wreak havoc on your results.
Left-over objects created for various purposes may enter your calculations unintentionally due to the structure of R’s search path.

How to save your work:

Script: Click on the script and press ’save’ in RStudio or the plain R GUI.
Workspace: Click on the command prompt and press ’save’. Alternatively, use the save.image() function.
Both: Accept when asked after terminating RStudio or the plain R GUI.

Graphics


Visualizing Data

Whenever we want to analyze data, the first thing we do is to have a look at it.
How are the observations spread out? What are the most common values? Are there any unusual observations? Are there any relationships between variables? Etc.

The graphics section will not tell you all about graphics in R, but it will get you going.

R Graphics Systems

base: The original/default graphics system in R.
Example:

demo(graphics)

Highly customizable, but complex plots require much code.

lattice: Shorter syntax for complex (e.g. multipanel) plots. Less customizable than base.
Example:

library(lattice)
demo(lattice)

ggplot2: By Hadley Wickham; builds on the same ideas as lattice.
gg = "grammar of graphics"
Example:

library(ggplot2)
example(qplot)

Graphics Histogram

A Basic Histogram

A common way to examine the distribution of a continuous variable.
The range of the variable is by default divided into equal-width intervals (bins). The histogram plots the number of observations in each bin (unless you specify otherwise).

hist(Births$bweight)

[Figure: histogram with heading "Histogram of Births$bweight"; x-axis "Births$bweight", y-axis "Frequency".]

Note that R automatically creates axis labels and a heading.

Histogram with a few options

To modify axis labels we set the options xlab and ylab.
The heading is set in the option main.

hist(Births$bweight, xlab = "Birth weight (g)",
     main = "Histogram of Birth Weight")

[Figure: histogram with heading "Histogram of birth weight"; x-axis "Birth weight (g)", y-axis "Frequency".]

Histogram with more options

We could type ?hist to find more options to customize the histogram.
The available colours are coded as numbers, or one can write col = "red".
If we want shading we can try the density option.
The orientation of the numbers on the axes is set by the option las.

hist(Births$bweight, las = 1, main = "Histogram of birth weight",
     col = 2, density = 7)

[Figure: shaded histogram with heading "Histogram of birth weight"; x-axis "Births$bweight", y-axis "Frequency".]

How to get your plot from RStudio

Graphics Box plot

A Basic Box Plot

Box plots show a measure of the location (the median line),
the spread of the distribution (the length of the box and whiskers),
and skewness, as asymmetry in the upper and lower parts of the box and in the whisker lengths.
We use the function boxplot(variable). Adding labels to the axes and colours is done as for hist.

Histograms and a Box Plot

[Figure: histograms shown together with a box plot.]

When describing data we can even add the observations to the plot.
Notice that the function rug() shows the observations.

boxplot(Births$bweight, xlab = "Birth weight (g)", horizontal = TRUE,
        col = 6)

rug(Births$bweight)

[Figure: horizontal box plot with a rug of observations; x-axis "Birth weight (g)".]

Box Plot for Groups

A very useful feature is that we can make box plots for different groups next to each other for comparison. Notice the option data = Births.

# BOX PLOT FOR BOYS AND GIRLS
boxplot(bweight ~ sexalph, data = Births, las = 1,
        ylab = "Birth weight (g)", col = 2:3)

[Figure: box plots of birth weight for the groups "female" and "male"; y-axis "Birth weight (g)".]

Set our own axis. Notice xaxt = "n".

# BOX PLOT WHERE WE WANT TO MAKE OUR OWN AXIS
boxplot(bweight ~ sexalph, data = Births, las = 1,
        ylab = "Birth weight (g)", col = c("red", "blue"), xaxt = "n")
axis(1, at = c(1,2), labels = c('Girl', 'Boy'))

[Figure: the same box plots with x-axis labels "Girl" and "Boy"; y-axis "Birth weight (g)".]

Graphics Scatter Plot

The Basic Scatter Plot

The scatter plot is the standard graph for examining the relationship between two continuous variables.
The plot(x, y) function is used to create scatter plots, where (x, y) are the points we want to plot.
We will look at the relationship between car weight (lbs/1000) and miles per gallon for 32 cars.

plot(mtcars$wt, mtcars$mpg)
lines(sort(mtcars$wt), 37.285-5.344*sort(mtcars$wt), type="l")

[Figure: scatter plot with a straight line; x-axis "mtcars$wt", y-axis "mtcars$mpg".]


The Scatter Plot

We can customize the scatter plot as before.
The function abline() adds a straight line to the plot.
When we write abline(lm(mpg ~ wt)) we get the best fitting line.

plot(mtcars$wt, mtcars$mpg, xlab = "Car weight (lbs/1000)",
     ylab = "Miles per gallon", las = 1, pch = 19)

abline(lm(mtcars$mpg ~ mtcars$wt), lty = 1, col = 3)

[Figure: scatter plot with the fitted regression line; x-axis "Car weight (lbs/1000)", y-axis "Miles per gallon".]
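The line drawn by abline(lm(...)) is the least-squares fit, and its coefficients are the constants 37.285 and -5.344 used with lines() earlier. A minimal sketch; the plot is sent to a null PDF device (an assumption made here so that the code also runs without a display), and the identifier fit is chosen for illustration.

```r
# Fit the least-squares line; mtcars ships with R
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)   # intercept and slope

pdf(NULL)   # null device: nothing is written to disk
plot(mtcars$wt, mtcars$mpg)
abline(fit, col = 3)                                 # line from the model object
abline(a = coef(fit)[1], b = coef(fit)[2], lty = 2)  # same line from a and b
dev.off()
```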


abline

The function abline() can also add reference lines to a plot.
Horizontal lines, e.g. at 25 and 30: abline(h = c(25, 30))
Vertical lines, e.g. at 2 and 5: abline(v = c(2, 5))

plot(mtcars$wt, mtcars$mpg, xlab = "Car weight (lbs/1000)",
     ylab = "Miles per gallon", las = 1, pch = 19)

abline(h = c(25, 30), col = c("red", "magenta"), lty = 2)
abline(v = c(2, 5), col = 4:5, lty = 3:4)

[Figure: scatter plot with horizontal and vertical reference lines; y-axis "Miles per gallon".]


Add a smoothed line

Perhaps we do not think the association is linear and try a nonparametric smoothed line.

plot(mtcars$wt, mtcars$mpg, xlab = "Car weight (lbs/1000)",
     ylab = "Miles per gallon", las = 1)

abline(lm(mtcars$mpg ~ mtcars$wt), lty = 2, col = 4)
lines(lowess(mtcars$wt, mtcars$mpg), lty = 1, col = 2)

[Figure: scatter plot with the linear fit (dashed) and the lowess curve (solid); y-axis "Miles per gallon".]


Enhanced graph procedures: Scatter plot example from the "car" package

scatterplot(mpg ~ wt | cyl, data = mtcars, ylim = c(0,40),
            xlab = "Car weight (lbs/1000)",
            ylab = "Miles per gallon", las = 1,
            legend.plot = TRUE,
            legend.coords = "topright",
            id.method = "identify",
            labels = row.names(mtcars),
            boxplots = "xy")

Here we want to plot miles per gallon versus weight for cars that have 4, 6 or 8 cylinders.
We write this as mpg ~ wt | cyl.
By default we get different colours for groups and both a linear and a smoothed line.
A legend is included in the top right corner of the plot.
The option id.method = "identify" means that points can be identified by mouse clicks.
Box plots of miles per gallon and weight are included ("xy" option for both axes).
More possibilities: ?scatterplot.

[Figure: the resulting scatter plot.]

Graphics Line plot

A Line Plot

Connecting points in a scatter plot from left to right. Here the growth of a tree. Notice the option type = "b", meaning points joined by lines.

plot(TreeA$age, TreeA$circumference, type = "b", xlab = "Age (days)",
     ylab = "Circumference (mm)", las = 1)

[Figure: line plot of circumference against age; x-axis "Age (days)", y-axis "Circumference (mm)".]

Difference between plot() and lines() functions

We have seen both the plot() and the lines() functions.
The plot() function creates a new graph. It is a high-level plotting function.
The lines() function adds information to an existing graph but it cannot produce its own graph. It is a low-level plotting function.
A high-level plotting function can (often) be made to add to an existing graph with the option add = TRUE.
Usually lines() will be used after a high-level plotting function (such as plot()) has produced a graph.

A line plot and a legend

plot(TreeA$age, TreeA$circumference, type = "b", lty = 1,
     xlab = "Age (days)",
     ylab = "Circumference (mm)", las = 1, col = 2)

lines(TreeB$age, TreeB$circumference, type = "b", col = 3, lty = 2)
legend(locator(1), # we will place it with a mouse click
       legend = c("A","B"), title = "Tree",
       lty = 1:2, col = 2:3)

Layout of several plots on one graph

Several plots on one graph:

Use par(mfrow = c(2, 2)) for a 2 x 2 grid of plots, and go back to one plot with par(mfrow = c(1, 1)). For other options, check the layout() function.
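A minimal sketch of a 2 x 2 layout using the built-in mtcars data. The null PDF device and the choice of the four plots are assumptions made for illustration.

```r
pdf(NULL)                     # null device: nothing is written to disk
old <- par(mfrow = c(2, 2))   # 2 x 2 grid, filled row by row
hist(mtcars$mpg, main = "Histogram")
boxplot(mtcars$mpg, main = "Box plot")
plot(mtcars$wt, mtcars$mpg, main = "Scatter plot")
plot(sort(mtcars$wt), type = "b", main = "Line plot")
par(old)                      # restore the previous layout
dev.off()
```

par() returns the previous settings, so saving them in old and calling par(old) is an alternative to typing par(mfrow = c(1, 1)).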

Linear Models

Linear models

Statistical models of a linear relationship between variables:

$$Y_i = \alpha + \beta X_i + \varepsilon_i, \quad i = 1, \dots, n.$$

• Dependent variable: Y.

• Independent variable: X.

• Stochastic term/error term: ε.

The $\varepsilon_i$'s should be a) stochastically independent, and b) identically normally distributed, with mean 0 and variance $\sigma^2 > 0$.

• Model parameters: $\alpha$, $\beta$ and $\sigma^2$.
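A minimal sketch of simulating data from this model. The sample size, the parameter values and the seed are illustrative assumptions, not values taken from the slides.

```r
set.seed(1)                               # arbitrary seed, for reproducibility only

n     <- 100                              # number of observations (assumed)
alpha <- 1                                # intercept (assumed)
beta  <- 0.5                              # slope (assumed)
sigma <- 0.7                              # error standard deviation (assumed)

X   <- runif(n, min = 0, max = 5)         # independent variable
eps <- rnorm(n, mean = 0, sd = sigma)     # independent N(0, sigma^2) error terms
Y   <- alpha + beta * X + eps             # dependent variable
```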

Linear models: Example

Y = 1 + 0.5X + ε

> plot(X,Y,xlab="X",ylab="Y")

> lines(sort(X),1+0.5*sort(X),lwd=3,col="red")

[Figure: scatter plot of Y against X with the line Y = 1 + 0.5X in red.]

Linear models: Example

Model residuals: The random/stochastic term.

> residuals.Y 0.

f

Fitting linear models: The lm() function

Y = α+ βX + ε

In R, linear models can be fitted to data with the lm() function:

> analysis <- lm(Y ~ X)
> analysis

Call:

lm(formula = Y ~ X)

Coefficients:

(Intercept) X

0.9702 0.5155

The estimated intercept α̂ is 0.97, while β̂, the estimated coefficient of X, is 0.52.

Model formulas

The argument to lm() is a formula object.

• A linear model is specified by a formula object, which may, for example, look like this:

> my.formula fit fit fit

The lm object: Model diagnostics

> analysis

The lm object: Contents

• An lm object is a list, and contains a lot of information. See the contents with the names() function:

> analysis <- lm(Y ~ X)
> names(analysis)

[1] "coefficients" "residuals" "effects"

[4] "rank" "fitted.values" "assign"

[7] "qr" "df.residual" "xlevels"

[10] "call" "terms" "model"

• Access the contents with the $ operator, e.g.

> analysis$coef

(Intercept) X

0.9701906 0.5154684

• Some of the 12 components of the list are lists themselves. Find more information by applying str().

The lm object: Summaries

The summary() function may be applied to lm objects as well:

> analysis <- lm(Y ~ X)
> summary(analysis)

Call:
lm(formula = Y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max
-1.61297 -0.40132  0.07808  0.55124  1.32380

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.97019    0.09182  10.566  < 2e-16 ***
X            0.51547    0.05410   9.527 1.29e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6861 on 98 degrees of freedom
Multiple R-squared: 0.4808, Adjusted R-squared: 0.4755
F-statistic: 90.77 on 1 and 98 DF, p-value: 1.286e-15

The lm object: Summaries

The summary is itself an R list object, with sub-elements that can be accessed:

> analysis <- lm(Y ~ X)
> names(summary(analysis))

[1] "call" "terms" "residuals"

[4] "coefficients" "aliased" "sigma"

[7] "df" "r.squared" "adj.r.squared"

[10] "fstatistic" "cov.unscaled"

We can find the estimate $\hat{\sigma}^2$ of $\sigma^2$ as

> summary(analysis)$sigma^2

[1] 0.4707802
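As a quick check of what summary(analysis)$sigma^2 contains, the sketch below refits the model on simulated data (an assumption, since the slides' X and Y are not available; seed and parameter values are arbitrary) and compares the estimate with the residual sum of squares divided by the residual degrees of freedom.

```r
set.seed(1)                                   # arbitrary seed
X <- runif(100, 0, 5)                         # simulated stand-in for the slides' X
Y <- 1 + 0.5 * X + rnorm(100, sd = 0.7)       # simulated stand-in for the slides' Y

analysis <- lm(Y ~ X)

sigma2_hat <- summary(analysis)$sigma^2       # estimate reported by summary()
rss        <- sum(residuals(analysis)^2)      # residual sum of squares
by_hand    <- rss / df.residual(analysis)     # RSS / (n - number of coefficients)

all.equal(sigma2_hat, by_hand)                # the two agree
```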

Modeling nonlinear relations with lm()

> plot(X,Y)

> lines(sort(X),predict(lm(Y~X))[order(X)])

[Figure: scatter plot of Y against X with the fitted straight line from lm(Y ~ X).]

The relationship between Y and X is not linear. How to proceed with lm()?

Modeling nonlinear relations with lm()

• The I-operator in formulas:

> analysis plot(X,Y)

> lines(sort(X),predict(analysis)[order(X)],type="l" )

[Figure: scatter plot of Y against X with the fitted curve from the model using I().]

’Linear’ in lm() is relative to the ’right’ independent variables.
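A sketch of the idea under stated assumptions: the data are simulated with a quartic relation chosen purely for illustration (the transformation used on the slides is not recoverable here), and I() protects the power from the symbolic formula interpretation.

```r
set.seed(2)                                  # arbitrary seed
X <- runif(100, 0, 5)
Y <- 2 * X^4 + rnorm(100, sd = 20)           # assumed, made-up nonlinear relation

# The model is linear in the transformed variable X^4
analysis <- lm(Y ~ I(X^4))

plot(X, Y)
lines(sort(X), predict(analysis)[order(X)], type = "l")
coef(analysis)
```

The model is still fitted by ordinary least squares; only the independent variable has been transformed.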

Extraction functions

• Some important extraction functions for obtaining information:

coef()          Estimated model parameters
confint()       Confidence intervals for estimated model parameters
residuals()     Raw residuals
rstandard()     Standardized residuals
model.matrix()  The design matrix
predict()       Predictions from model
vcov()          Covariance matrix for estimated model parameters
anova()         Anova test table for model reduction
drop1()         Test for dropping one term from model
summary()       A summary printout, and access to summary statistics

• Statistical tests: drop1() is usually the function to use.
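The extraction functions can be tried on any fitted model. The sketch below uses the built-in cars data (an assumption made for the sake of a self-contained example; it is not the data from the slides).

```r
fit <- lm(dist ~ speed, data = cars)     # built-in data: stopping distance vs speed

coef(fit)                                # estimated intercept and slope
confint(fit, level = 0.95)               # 95% confidence intervals
head(residuals(fit))                     # raw residuals
head(rstandard(fit))                     # standardized residuals
dim(model.matrix(fit))                   # design matrix: one row per observation
vcov(fit)                                # covariance matrix of the estimates
drop1(fit, test = "F")                   # F-test for dropping the speed term
predict(fit, newdata = data.frame(speed = 10))   # prediction at speed = 10
```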

Factors and interactions

A dataset on Sex, Age, and a response Y:

> summary(my.data)

     Sex          Age              Y
 Female:50   Min.   :18.70   Min.   : 3.091
 Male  :50   1st Qu.:36.38   1st Qu.: 9.274
             Median :51.12   Median :12.430
             Mean   :49.99   Mean   :12.711
             3rd Qu.:63.22   3rd Qu.:15.972
             Max.   :77.31   Max.   :21.800

plot(my.data$Age, my.data$Y, xlab="", ylab="Y", col=my.data$Sex)
legend(20, 20, c("Female","Male"), col=1:2, pch=1)

[Figure: scatter plot of Y against Age, coloured by Sex (legend: Female, Male).]


Model: Interaction between Sex and Age. Testing the interaction term with drop1():

> analysis <- lm(Y ~ Age + Sex + Age:Sex, data = my.data)
> drop1(analysis,test="F")

Single term deletions

Model:
Y ~ Age + Sex + Age:Sex
        Df Sum of Sq    RSS    AIC F value    Pr(>F)
<none>               102.77 10.737
Age:Sex  1    33.131 135.91 36.679  30.947 2.391e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction is highly significant and cannot be removed from the model.


> summary(analysis)

Call:
lm(formula = Y ~ Age + Sex + Age:Sex, data = my.data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.60300 -0.53551  0.00317  0.59830  2.43544

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.696993   0.452879   3.747 0.000305 ***
Age          0.187921   0.009045  20.777  < 2e-16 ***
SexMale     -0.525623   0.677086  -0.776 0.439479
Age:SexMale  0.071599   0.012871   5.563 2.39e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.035 on 96 degrees of freedom
Multiple R-squared: 0.945, Adjusted R-squared: 0.9433
F-statistic: 550.2 on 3 and 96 DF, p-value: < 2.2e-16

• R uses the first level of the Sex variable (Female) as the reference level; the SexMale and Age:SexMale coefficients are differences relative to it.


> my.data2 with(my.data2,{
+ plot(Age,Y,xlab="Age",ylab="Y",col=Sex)
+ lines(Age[Sex=="Female"],predicted[Sex=="Female"],col=1,type="l")
+ lines(Age[Sex=="Male"],predicted[Sex=="Male"],col=2,type="l")
+ legend(20,20,c("Female","Male"),col=1:2,pch=1)
+ })

[Figure: Y against Age, coloured by Sex, with the fitted line for each sex (legend: Female, Male).]

More on formula objects

• Model formulae are symbolic. We have seen the use of '+' and ':', and adding a 0 or -1.

• The product '*' crosses variables; it expands to main effects and interactions:

y ~ x*z

corresponds to

y ~ x+z+x:z

• Powers expand effects to the specified order:

y~(x+z+w)^2

corresponds to

y~x+z+w+x:z+x:w+z:w

• The subtraction operator '-' removes terms if possible:

y~(x+z+w)^2-x:z-a:b

corresponds to

y~x+z+w+x:w+z:w
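The expansions above can be inspected without fitting anything, by asking R for the term labels of a formula. The helper name term_labels is hypothetical, introduced here only for illustration.

```r
# Hypothetical helper: list the terms a formula expands to
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(y ~ x * z)                     # main effects x, z and interaction x:z
term_labels(y ~ (x + z + w)^2)             # all main effects and two-way interactions
term_labels(y ~ (x + z + w)^2 - x:z)       # as above, with x:z removed
```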

More on formula objects

The I() function overrides the symbolic interpretation, and invokes the usual arithmetic instead.

Observe that

y~(x*z)^2 = y~(x+z)^2

But

y~I((x*z)^2) and y~I((x+z)^2)

are two different model formulas; regressing y on $x^2 z^2$ and $x^2 + z^2 + 2xz$, respectively.
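A small sketch that makes the difference visible by listing the terms each formula produces. The helper name term_labels is hypothetical.

```r
# Hypothetical helper: list the terms a formula expands to
term_labels <- function(f) attr(terms(f), "term.labels")

symbolic_product <- term_labels(y ~ (x * z)^2)   # symbolic: main effects and interaction
symbolic_sum     <- term_labels(y ~ (x + z)^2)   # symbolic: the same set of terms
identical(symbolic_product, symbolic_sum)

# With I(), each formula has a single arithmetic term
arithmetic_product <- term_labels(y ~ I((x * z)^2))
arithmetic_sum     <- term_labels(y ~ I((x + z)^2))
length(arithmetic_product)
length(arithmetic_sum)
```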

Formulas when transforming data into normality

• Sometimes it is possible to transform data, such that it matches a linear model.

• For instance if the variance is increasing with the mean.

[Figure: residuals against predictions in two panels, "Raw data" and "Log transformed".]

• A log transformation is often appropriate in this case.

• This may be done directly in a formula object. For example:

log(y)~log(x)+log(z)
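A sketch of such a fit on simulated data with multiplicative errors, which is the situation where the variance grows with the mean. All numbers (seed, exponents, error spread) are illustrative assumptions.

```r
set.seed(3)                                        # arbitrary seed
x <- runif(200, 1, 10)
z <- runif(200, 1, 10)
y <- 2 * x^1.5 * z^0.5 * exp(rnorm(200, sd = 0.2)) # multiplicative error (assumed)

# Transformations are written directly in the formula
fit_log <- lm(log(y) ~ log(x) + log(z))
coef(fit_log)

# Residuals against predictions on the log scale
plot(fitted(fit_log), residuals(fit_log),
     xlab = "Prediction", ylab = "Residual", main = "Log transformed")
```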

Generalized linear models - the glm() function

• Some types of observations can never be transformed into normality.

• Example: binary data; ones and zeroes.

• For a wide class of distributions, the so-called exponential families, we can use generalized linear models:

• Formulate linear models for a transformation of the mean value.

• No transformation of observations, thereby preserving their distributional properties.

• Allows easy modeling in R with the glm() function, nearly identical to lm().

• Standard example: Logistic regression.

GLM vs GLM

General linear models       Generalized linear models
Normal distribution         Exponential dispersion family
Mean value linear           Function of mean value linear
Independent observations    Independent observations
Same variance               Variance function of mean
lm() easy to apply          glm() almost as easy to apply

Types of response variables

i Count data ($y_1 = 57, \dots, y_n = 59$ accidents) - Poisson distribution.

ii Binary response variables ($y_1 = 0, y_2 = 1, \dots, y_n = 0$), or frequencies of counts ($y_1 = 15/297, \dots, y_n = 144/285$) - Binomial distribution.

iii Count data, waiting times - Negative Binomial distribution.

iv Multiple ordered categories "Unsatisfied", "Neutral", "Satisfied" - Multinomial distribution.

v Count data, multiple categories - Multinomial distribution.

vi Continuous responses, constant variance ($y_1 = 2.567, \dots, y_n = 2.422$) - Normal distribution.

vii Continuous positive responses with constant coefficient of variation - Gamma distribution.

Logistic regression example

In a study of developmental toxicity of a chemical compound, a specified amount of an ether was dosed daily to pregnant mice, and after 10 days all fetuses were examined. The size of each litter and the number of stillborns were recorded:

Index   Number of        Number of       Fraction          Concentration
        stillborn, z_i   fetuses, n_i    stillborn, y_i    [mg/kg/day], x_i
1       15               297             0.0505              0.0
2       17               242             0.0702             62.5
3       22               312             0.0705            125.0
4       38               299             0.1271            250.0
5       144              285             0.5053            500.0

Table: Results of a dose-response experiment on pregnant mice. Number of stillborn fetuses found for various dose levels of a toxic agent.

Reported in Price et al. (1987).


Let $Z_i$ denote the number of stillborns at dose concentration $x_i$.

We shall assume $Z_i \sim B(n_i, p_i)$, that is, a binomial distribution corresponding to $n_i$ independent trials (fetuses), with the probability of stillbirth, $p_i$, being the same for all $n_i$ fetuses.

We want to model $Y_i = Z_i/n_i$. In particular, we will look for a model for $E[Y_i] = p_i$.


• A natural quantity to consider is the odds, $p/(1-p)$, which varies on $(0, \infty)$; this scale is more natural than $(0, 1)$, where $p$ varies.

• Since effects on the odds are often multiplicative, we take the log to convert the effects to additive form.

• We arrive at the logit function:

$$\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right).$$

For this model, the logit function is our link function. We will formulate a linear model for the mean values transformed with the link function:

$$\eta_i = \operatorname{logit}(p_i), \quad i = 1, \dots, 5.$$

The linear model is

$$\eta_i = \alpha + \beta x_i, \quad i = 1, \dots, 5.$$

• The inverse transformation, which gives the probabilities $p_i$ of stillbirth, is the so-called logistic function:

$$p_i = \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}, \quad i = 1, \dots, 5.$$
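The link function and its inverse are easy to write down directly. The sketch below defines them by hand (the names logit and logistic are chosen here for illustration) and compares them with base R's qlogis() and plogis(), which compute the same quantities.

```r
logit    <- function(p)   log(p / (1 - p))            # link: probability -> log-odds
logistic <- function(eta) exp(eta) / (1 + exp(eta))   # inverse link: log-odds -> probability

p <- c(0.05, 0.5, 0.9)                # example probabilities

all.equal(logistic(logit(p)), p)      # the two functions are inverses of each other
all.equal(logit(p), qlogis(p))        # base R's logit
all.equal(logistic(0.3), plogis(0.3)) # base R's logistic function
```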


> mice mice$resp mice.glm


> summary(mice.glm)

Call:

glm(formula = resp ~ conc, family = binomial(link = logit), data = mice)

Deviance Residuals:

1 2 3 4 5

1.1317 1.0174 -0.5968 -1.6464 0.6284

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -3.2479337 0.1576602 -20.6

Logistic regression example

The linear predictor, $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$:

Figure: Logit-transformed observations and corresponding linear predictions for the dose-response assay (x-axis: Concentration; y-axis: logit(stillborn fraction)).


Predicted stillborn fractions, $\hat{p}_i = \exp(\hat{y}_i)/(1 + \exp(\hat{y}_i))$:

Figure: Observed stillborn fractions and corresponding fitted values under logistic regression for the dose-response assay (x-axis: Concentration; y-axis: Stillborn fraction).

Specification of a generalized linear model in glm()

> mice.glm
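A sketch of how the model whose summary is shown earlier could be specified, using the counts from the dose-response table. The column names stillb and total and the construction of resp as a two-column matrix of (stillborn, live) counts are assumptions; only the formula, family and data arguments are taken from the printed call.

```r
# Data from the dose-response table; column names are assumed for illustration
mice <- data.frame(
  conc   = c(0, 62.5, 125, 250, 500),     # concentration [mg/kg/day]
  stillb = c(15, 17, 22, 38, 144),        # number of stillborn
  total  = c(297, 242, 312, 299, 285)     # number of fetuses
)

# Binomial response as a two-column matrix: (successes, failures)
mice$resp <- cbind(mice$stillb, mice$total - mice$stillb)

mice.glm <- glm(resp ~ conc, family = binomial(link = logit), data = mice)

coef(mice.glm)      # estimates of alpha and beta on the logit scale
fitted(mice.glm)    # fitted stillborn probabilities at the five concentrations
```

The family argument selects the distribution and the link function, which is the main difference from a call to lm().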

ggplot2

• Basic plotting function: ggplot(). Used for advanced plots.

• Wrapper for 'quick' plots: qplot(). Its syntax resembles that of plot() from the basic graphics system.

• Grammar of graphics:
  - All plots are objects. You build them incrementally. Use the operator + to add to an existing plot.
  - Layer: Aesthetics (aes): Defines how the data are mapped.
  - Layer: Geometric objects (geom): Points, lines, polygons, etc.
  - Layer: Coordinate system objects (coord).

Example: Diamond data

• Load the ggplot2 package and take a look at the diamond data:

> library(ggplot2)

> head(diamonds)

carat cut color clarity depth table price x y z

1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43

2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31

3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31

4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63

5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75

6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

• Quick plot

> qplot(carat, price, data=diamonds)

[Figure: scatter plot of price against carat.]

Modifying the quick plot

• With qplot(), it is easy to work with:

color   Color each point according to a variable in the dataset, and add a corresponding legend.
log     Log-transform one or both axes.
facets  Split into a multi-panel plot according to a group variable.
main    Add a title.

• Let's try modifying the previous plot by adding
  - color = cut
  - log = "xy"
  - facets = ~ clarity
  - main = "Diamonds"
one by one.

Modified plot

> qplot(carat,

+ price,

+ data=diamonds,

+ color = cut,

+ log="xy",

+ facets=~clarity,

+ main="Diamonds")

[Figure: "Diamonds": price against carat on log-log axes, one panel per clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF), coloured by cut (Fair, Good, Very Good, Premium, Ideal).]

Incremental plot construction

• qplot is good for a start. However, in order to take full advantage of ggplot2, we must know what the plot is built of and how to modify the parts.

• The quick plot qplot(carat, price, data=diamonds) can be built incrementally by

� Define an empty plot object:> p p p p

• We can use the '+' operator to modify the plot 'p'. Let's see some examples in the following:
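One possible way to build the quick plot step by step is sketched below. The exact steps on the original slide are not recoverable here, so this sequence is an assumption that follows the layers listed under the grammar of graphics.

```r
library(ggplot2)

p <- ggplot(diamonds)            # plot object bound to the data, no layers yet
p <- p + aes(carat, price)       # aesthetics: map carat to x and price to y
p <- p + geom_point()            # geom: draw the observations as points
p                                # printing the object draws the plot
```

Because p is an ordinary R object, further layers such as p + coord_flip() can be added without changing p itself.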

Change the plot type (geom)

• Get an overview of possible geoms at http://docs.ggplot2.org.

• You can also look at the examples in the documentation:

> example(geom_boxplot)

> example(geom_polygon)

> example(geom_raster)

• Example: add a 2D density on top:

> p + geom_density2d()

[Figure: price against carat with 2D density contours on top.]

Change the coordinate transformations

> p + coord_flip()

[Figure: the same plot with flipped axes (carat against price).]

> p + coord_polar()

[Figure: the same plot in polar coordinates.]

Change to multipanel display

• Add a facet grid to split the plot into multiple panels.

• A facet grid takes a formula as input.

• Example:

> p + facet_grid(. ~ cut)

[Figure: price against carat in five side-by-side panels, one per cut (Fair, Good, Very Good, Premium, Ideal).]

Change to multipanel display - other facets

> p + facet_grid(cut ~ .)

[Figure: price against carat in five stacked panels, one per cut (Fair, Good, Very Good, Premium, Ideal).]

Change to multipanel display - other facets

> p + facet_grid(cut ~ color)

[Figure: price against carat in a grid of panels, with cut in rows and color (D to J) in columns.]

Density plot and alpha blending

• Aesthetic attributes (colour, shape, fill, linetype, etc.) automatically become grouping variables.

• Note the specification of transparency through the alpha argument: alpha blending.

> ggplot(diamonds) +

+ aes(price, fill=cut) +

+ geom_density(alpha=.3)

[Figure: density of price for each cut (Fair, Good, Very Good, Premium, Ideal), drawn with semi-transparent fills.]

Application: Maps

• The ggmap package: Interfacing ggplot2 and RGoogleMaps.

• Two steps in making a map with ggmap:
  1. download raster data for the map;
  2. create the map with ggmap(), and overlay it with layers of geoms etc.

• Downloading raster data: Specify a) the location of the center; b) the zoom factor.

• Location specification for map downloads: Two ways.
  1. location/address:

> myLocation myLocation

A first map: The London Olympic Stadium

• Download data with the get_map() function, plot with ggmap():

> mapData1 ggmap(mapData1,extent = "panel",ylab = "Latitude",xlab = "Longitude")

[Figure: map of the area around the London Olympic Stadium (axes: lon, lat).]

The London Olympic Stadium - same but different

• Different map type:

> mapData ggmap(mapData,extent = "panel",ylab = "Latitude",xlab = "Longitude")

[Figure: the same area drawn with a different map type (axes: lon, lat).]

The London Olympic Stadium - same but hybrid

• Different map type:

> mapData ggmap(mapData,extent = "panel",ylab = "Latitude",xlab = "Longitude")

[Figure: the same area drawn as a hybrid map (axes: lon, lat).]

Overlaying maps

• Geographic coordinates obtained with the geocode() function:

> geocode("University of Washington")

        lon      lat
1 -106.4407 31.76788

• A map of the USA: Let's overlay this map with data.

> usa_center USA USA

[Figure: map of the USA (axes: lon, lat).]

Fatal vehicle accidents in the USA 2012

• mv_collisions data:

> head(mv_collisions)

        state collisions
1     Alabama         78
2     Arizona        145
3    Arkansas         46
4  California        722
5    Colorado         77
6 Connecticut         40

• Getting the geocoordinates with geocode():

> for (i in 1:nrow(mv_collisions)) {
+   latlon = geocode(mv_collisions$state[i])
+   mv_collisions$lon[i] = as.numeric(latlon[1])
+   mv_collisions$lat[i] = as.numeric(latlon[2])
+ }

• Getting the map:

> usa_center = geocode("United States")> USA

Fatal vehicle accidents in the USA 2012

• Overlaying the data:

> circle_scale USA + geom_point(aes(x=lon, y=lat), data=mv_collisions, col="red",
+ alpha=0.4, size=mv_collisions$collisions*circle_scale)

[Figure: map of the USA with a semi-transparent red circle per state, sized by the number of collisions (axes: lon, lat).]

Credits

• Original coding of the fatal motor vehicle collision example: Sean Lorenz.

• ggmap: D. Kahle and H. Wickham (2013): ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL: http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

Where to go?

R possibilities are endless:

• R shiny – web applications
• dplyr – data management
• RODBC – Reading from SQL databases etc.
• TwitteR – text analytics of tweets
• GoogleAnalyticsR – Google search analytics
• Data Science with R on the Edx platform – Online course by yours truly…
• Or just practice… and check t.test()…
