
  • Original coding by Eric Lecoutre

  • Initial coding: Sean Lorenz

  • Introduction to R

    Anders Stockmarr

    DTU Compute
    Section for Statistics and Data Analysis
    Technical University of Denmark

    DTU Management Engineering
    May 22, 2017

    (DTU) R intro May 22, 2017 1 / 93

  • Outline

    Coworkers

    Elisabeth Wreford Andersen, The Danish Cancer Society
    Kasper Kristensen, DTU AQUA
    Anders Nielsen, DTU AQUA

  • Outline

    Outline of Talk

    Introduction to R
    Data management
    Graphics
    Linear Models
    ggplot2

  • Outline

    Outline

    1 Introduction to R

    2 Importing Data to R

    3 Description of Data

    4 Modifying Data

    5 Graphics: Histogram, Box plot, Scatter Plot, Line plot


  • Introduction to R

    Introduction to R

    R is a programming language and a programming environment.
    It is free! Developed by users under a GNU license.
    Runs on a variety of platforms including Windows, Unix and MacOS.
    You can even get it for Android.
    Allows for fast implementation of new methods by user demand through packages.
    R has state-of-the-art graphics capabilities.

  • Introduction to R

    Advantages of R

    Frank Harrell in 2009 (my highlighting):

    "One point that hasn’t been made very explicitly is one of the greatest advantages of R:

    Getting your work done better and in less time.

    Hundreds of companies hire a multitude of SAS programmers to write code in an archaic language, the SAS macro language. I believe there is a real cost savings from R because of its value as a data analysis, data manipulation, and graphics environment. Instead of programming using an indirect syntax manipulation environment (SAS macros), in R you can program in a dynamic data-sensitive framework."

    That was 8 years ago. Things have progressed since...

  • Introduction to R

    Base R

    Base R and most R packages are available for download at the Comprehensive R Archive Network (CRAN).
    http://www.cran.r-project.org
    Base R includes basic data management, analysis and graphics tools.
    For non-specialized tasks, Base R is all you need.
    Specialized tasks may be handled by packages.
    We will download, install and use packages.
    Packages are not all very well-documented (depends on the contributor).
    Want to be sure about what your program does?

    Use well-established packages only; or write your own code.

  • Introduction to R

    RStudio

    You can work directly in R.
    Many prefer another front end (GUI, Graphical User Interface).
    We will use RStudio.
    Download from http://www.rstudio.com/

  • Introduction to R

    RStudio

    The GUI RStudio has 4 windows.
    One for writing the commands (the "script").

    Use script for reproducibility.

    One for results and interactive use.
    One for plots, help and packages.
    One showing which objects are resident in the R memory.

  • Introduction to R

    R as a calculator

    2+2

    [1] 4

    (2*5)+(12/3)-(2^3)

    [1] 6

    exp(log(1))

    [1] 1

    sqrt(25)

    [1] 5

    log(2*2)

    [1] 1.3863

    log(2)+log(2)

    [1] 1.3863

  • Introduction to R

    Writing commands in R

    Commands are separated by either a new line or ;
    R is case sensitive: id is a different name than ID.
    The character # starts a comment: the rest of the line is not executed.
    Help can be found on the internet; from colleagues; or in R by writing ? followed by the function you want help about:

    ?plot

    or, in RStudio, highlight the expression and press F1.

  • Introduction to R

    Objects in R

    Both data and output from analyses are stored as objects (if stored).
    Sometimes output is just displayed on the screen, and you need to assign the object to an identifier to keep it.
    In fact, everything in the R memory is stored in objects.
    An object could be a vector, a matrix or a data frame.
    Values are assigned to objects using the assignment operator <-.
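
    The difference between displaying a result and keeping it can be sketched as follows. The object names x and m are made up for illustration and are not taken from the slides.

```r
# Assign a vector to the identifier x with the assignment operator <-
x <- c(2, 4, 6)

# Without assignment the result is only printed, not stored
mean(x)

# With assignment the result is kept as an object for later use
m <- mean(x)
m

# ls() lists the objects currently held in the R memory
ls()
```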

  • Introduction to R

    Generating a sequence

    Specify the first and last values separated by a colon.
    Otherwise use seq()

    0:10

    [1] 0 1 2 3 4 5 6 7 8 9 10

    15:5

    [1] 15 14 13 12 11 10 9 8 7 6 5

    seq(from = 0, to = 1.2, by = 0.1)

    [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2

    x

  • Introduction to R

    Generating repeats using rep()

    rep(8, 5)

    [1] 8 8 8 8 8

    rep(1:4, each = 2)

    [1] 1 1 2 2 3 3 4 4

    rep(1:4, each = 2, times = 3)

    [1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4


  • Introduction to R

    Functions in R

    We assign a simple function to the identifier f:

    >fff
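
    A minimal sketch of what assigning a simple function to the identifier f could look like. The body of f and the argument power with its default value are assumptions made for illustration, not the example from the slide.

```r
# Hypothetical example: f raises x to a power, with a default power of 2
f <- function(x, power = 2) {
  x^power
}

f(3)             # uses the default power = 2
f(3, power = 3)  # overrides the default
f                # typing the name alone prints the definition
```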

  • Introduction to R

    Functions in R

    We have already used many functions with and without default values:

    "+"(2,2)
    sqrt(25)
    log(2)
    ls()
    ":"(0,10)
    seq(from=0.1,to=1.2,by=0.1)
    rep(1:4,each=2,times=3)

    Many applications in R are built up as functions. You can see default arguments in the help files. Example: log.
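
    Besides the help file, the default arguments of a function can be inspected from the console. A small sketch using log, whose second argument base defaults to exp(1):

```r
# args() shows a function's arguments together with their default values
args(log)

# base is not given, so the default exp(1) is used: natural logarithm
log(8)

# base given explicitly
log(8, base = 2)
```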


  • Introduction to R

    Data structures in R: Singles

    Logical, e.g:

    > TRUE
    [1] TRUE
    > 1==2
    [1] FALSE

    Single numbers, e.g:

    > 1
    [1] 1
    > 1.2
    [1] 1.2

    Character, e.g:

    > "5"
    [1] "5"
    > "abc"
    [1] "abc"

  • Introduction to R

    Data structures in R: Vectors

    Constructed via the concatenate function c().

    Vector of numbers, e.g:

    > c(1,1.2,pi,exp(1))
    [1] 1.000000 1.200000 3.141593 2.718282

    We can have vectors of other things too, e.g:

    > c(TRUE,1==2)
    [1] TRUE FALSE
    > c("a","ab","abc")
    [1] "a" "ab" "abc"

    But not combinations, e.g:

    > c("a",5,1==2)
    [1] "a" "5" "FALSE"

    Note that R just turned everything into characters!

  • Introduction to R

    Data structures in R: Matrices

    Columns of same type and same length:

    > matrix(c(1,2,3,4,5,6)+pi,nrow=2)
             [,1]     [,2]     [,3]
    [1,] 4.141593 6.141593 8.141593
    [2,] 5.141593 7.141593 9.141593

    > matrix(c(1,2,3,4,5,6)+pi,nrow=2)

  • Introduction to R

    Data structures in R: Data frames

    Same length of columns but different types; spreadsheet data.
    Created from reading in data from external files;
    or by using the function data.frame() on a set of vectors.

    > data.frame(treatment=c("active","active","placebo"),
    + bp=c(80,85,90))
      treatment bp
    1    active 80
    2    active 85
    3   placebo 90

    Compare to a matrix created with the cbind() command:

    > cbind(treatment=c("active","active","placebo"),bp=c(80,85,90))
         treatment bp
    [1,] "active"  "80"
    [2,] "active"  "85"
    [3,] "placebo" "90"

  • Introduction to R

    Data structures in R: Lists

    Different length of columns and different types.
    Most general object type.

    > list(a=1,b="abc",c=c(1,2,3),d=list(e=matrix(1:4,2), f=function(x){x^2}))
    $a
    [1] 1

    $b
    [1] "abc"

    $c
    [1] 1 2 3

    $d
    $d$e
         [,1] [,2]
    [1,]    1    3
    [2,]    2    4

    $d$f
    function (x){
    x^2}

    The objects returned from many of the built-in functions in R are fairly complicated lists.


  • Importing Data to R

    Importing Data to R

    Importing can be done directly from SAS, SPSS, Excel, STATA etc.
    The easiest is to use data saved as text files.
    Usually values in text files are separated, or delimited, by tabs or commas.
    First tell R where you want to find your data using the command setwd().
    Check that all went to plan with getwd().

    setwd("C:/users/anst/Foredrag/DTU Management Engineering 22052017")
    getwd()

    [1] "C:/users/anst/Foredrag/DTU Management Engineering 22052017"

  • Importing Data to R

    Importing Data to R

    The function read.table() can be used to read data saved as text.
    Wrappers: read.csv(), read.csv2() and read.delim().
    Notice the option sep, which sets the field separator.
    We are assigning the loaded data to objects.
    If you have an Excel sheet, then save as text.

    Births.tab
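
    A minimal sketch of reading a tab-separated text file into an object. The course data file is not reproduced here, so the sketch first writes a tiny stand-in file; the file contents and the names births_file, births_demo and births_demo2 are assumptions made for illustration.

```r
# Write a small tab-separated file to a temporary location
# (hypothetical stand-in for the course data file)
births_file <- tempfile(fileext = ".txt")
writeLines(c("id\tbweight\tsexalph",
             "1\t2974\tfemale",
             "2\t3270\tmale"),
           births_file)

# read.table() with a header row and an explicit separator
births_demo <- read.table(births_file, header = TRUE, sep = "\t")

# read.delim() is a wrapper with header = TRUE and sep = "\t" as defaults
births_demo2 <- read.delim(births_file)

births_demo
```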

  • Importing Data to R

    Importing Data using RStudio

    In the Objects Window, click "Import Dataset"


  • Importing Data to R

    Importing Data From Other Programs

    We can read data from a series of other statistical software packages using the package foreign.

    # INSTALL AN EXTRA PACKAGE
    install.packages("foreign")

    # ACTIVATE THE PACKAGE
    library("foreign")

    SPSS_Data

  • Importing Data to R

    Looking At Your Data

    There are several ways to look at the data (or parts of the data).

    # FIRST FEW OBSERVATIONS
    head(Births.tab)

      id bweight lowbw gestwks preterm matage hyp sex sexalph
    1  1    2974     0   38.52       0     34   0   2  female
    2  2    3270     0      NA      NA     30   0   1    male
    3  3    2620     0   38.15       0     35   0   2  female
    4  4    3751     0   39.80       0     31   0   1    male
    5  5    3200     0   38.89       0     33   1   1    male
    6  6    3673     0   40.97       0     33   0   2  female

  • Importing Data to R

    Looking At Your Data

    # LAST FEW OBSERVATIONS
    tail(Births.tab)

         id bweight lowbw gestwks preterm matage hyp sex sexalph
    495 495    2968     0   41.01       0     34   0   1    male
    496 496    2852     0   38.45       0     28   0   2  female
    497 497    3187     0   38.03       0     38   1   1    male
    498 498    3054     0   38.50       0     26   0   2  female
    499 499    3178     0   39.92       0     31   0   2  female
    500 500    2918     0   37.97       0     31   0   1    male

    # VARIABLE NAMES
    names(Births.tab)

    [1] "id"      "bweight" "lowbw"   "gestwks" "preterm" "matage"  "hyp"
    [8] "sex"     "sexalph"

    # VIEW THE DATA IN A NEW WINDOW
    View(Births.tab)

  • Importing Data to R

    Missing values

    In R, missing values are coded as NA (not available).
    In your Excel file leave missing values blank, do not set them to 99 or 999.

      id bweight lowbw gestwks preterm matage hyp sex sexalph
    1  1    2974     0   38.52       0     34   0   2  female
    2  2    3270     0      NA      NA     30   0   1    male

  • Importing Data to R

    Accessing Observations

    Data are (usually) stored in a data frame object.
    Observations are the rows.
    Variables, either numerical or categorical, are the columns.
    We can access individual rows, columns and cells in the data frame.
    For this, we use the bracket operator: object[row, column].

  • Importing Data to R

    Accessing Observations

    # A SINGLE CELL
    Births.tab[345, 4]

    [1] 38.55

    # LEAVING OUT A COLUMN NUMBER INDICATES THAT ALL COLUMNS
    # ARE CHOSEN. HERE ALL COLUMNS IN ROW 224
    Births.tab[224 , ]

         id bweight lowbw gestwks preterm matage hyp sex sexalph
    224 224    3216     0   39.94       0     38   1   1    male

  • Importing Data to R

    Accessing Observations

    # LEAVING OUT A ROW NUMBER INDICATES THAT ALL ROWS ARE CHOSEN
    # HERE ALL ROWS IN COLUMN 5
    Births.tab[ ,5]

    [1] 0 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
    [24] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
    [47] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
    [70] 0 0 0 1 1 0 0 0 0 0 NA 0 0 0 0 0 0 0 0 0 0 NA 0
    [93] 1 0 1 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 0 0 0 0 0 1
    [116] 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
    [139] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    [162] 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
    [185] 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
    [208] 0 1 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0
    [231] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
    [254] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1
    [277] 0 0 0 0 1 NA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    [300] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0
    [323] 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0
    [346] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 NA 0 1 0 1 0 0
    [369] 0 1 0 0 0 0 0 0 1 0 0 0 NA 0 0 0 0 0 0 0 0 0 0
    [392] 0 0 0 1 NA 0 0 NA NA 0 0 0 0 0 0 0 0 0 0 1 0 0 1
    [415] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0
    [438] 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
    [461] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
    [484] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

  • Importing Data to R

    Accessing Observations

    # USE RANGES, ROWS 15 TO 18 COLUMNS 1 TO 4
    Births.tab[15:18, 1:4]

       id bweight lowbw gestwks
    15 15    3662     0   39.23
    16 16    3035     0   38.96
    17 17    3351     0   39.35
    18 18    3804     0   38.99

  • Importing Data to R

    Accessing Observations

    Variables can be accessed with the $ operator (object$variable), by name (object[ ,"variable"]), or by column number (object[ ,k]).

    # GET THE BIRTH WEIGHT FOR CHILD 26 TO 36
    Births.tab$bweight[26:36]

    [1] 3585 3798 3164 3739 1780 4022 3942 2887 2391 3911 3509

    Births.tab[26:36, "bweight"]

    [1] 3585 3798 3164 3739 1780 4022 3942 2887 2391 3911 3509

    Births.tab[26:36,2]

    [1] 3585 3798 3164 3739 1780 4022 3942 2887 2391 3911 3509


  • Importing Data to R

    Subsetting using the c() function

    The concatenate function c() can be used to access non-sequential rows and columns from a data frame.

    # GET COLUMNS 2, 5, 7, 8, 9 FOR ROW 33
    Births.tab[33, c(2, 5, 7:9)]

       bweight preterm hyp sex sexalph
    33    2887       0   0   1    male

    # GET bweight, preterm and sexalph FOR ROW 71
    Births.tab[71, c("bweight", "preterm", "sexalph")]

       bweight preterm sexalph
    71    3189       0    male

  • Importing Data to R

    Variable Names

    If we want to change the variable names we can use names().

    # NEW VARIABLE NAMES
    names(Births.tab)
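
    A minimal sketch of renaming variables by assigning to names(). The data frame df and the new names are made up for illustration and are not the names used in the course.

```r
# Small stand-in data frame (the course data are not reproduced here)
df <- data.frame(id = 1:3, bweight = c(2974, 3270, 2620), sex = c(2, 1, 2))

# Replace all variable names at once
names(df) <- c("id", "birth_weight", "sex")

# Or change a single name by position
names(df)[3] <- "sex_code"

names(df)
```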

  • Importing Data to R

    Saving/Exporting data

    We can save the data to a text file, using either write.table() for a tab-separated file, or write.csv()/write.csv2() for a comma/semicolon-separated file (with "." and "," as the decimal mark, respectively).

    write.table(Births.tab, file = "Birth_new.txt",
                sep = "\t", na = ".", row.names = FALSE)

    write.csv2(Births.tab, file = "Birth_new.csv")


  • Description of Data

    Description of Data

    We are still looking at the data set with birth weights for 500 children.
    Using the function str() we can see a description of what our data frame contains (the structure).

    str(Births.tab)

    'data.frame': 500 obs. of 9 variables:
     $ id     : int 1 2 3 4 5 6 7 8 9 10 ...
     $ bweight: int 2974 3270 2620 3751 3200 3673 3628 3773 3960 3405 ...
     $ lowbw  : int 0 0 0 0 0 0 0 0 0 0 ...
     $ gestwks: num 38.5 NA 38.2 39.8 38.9 ...
     $ preterm: int 0 NA 0 0 0 0 0 0 0 0 ...
     $ matage : int 34 30 35 31 33 33 29 37 36 39 ...
     $ hyp    : int 0 0 0 0 1 0 0 0 0 0 ...
     $ sex    : int 2 1 2 1 1 2 2 1 2 1 ...
     $ sexalph: Factor w/ 2 levels "female","male": 1 2 1 2 2 1 1 2 1 2 ...

  • Description of Data

    Description of Data: Birth weights

    The Births.tab dataset is a data frame with 500 observations and 9 variables.
    Some are integers but "gestwks" is numeric.
    The variable "sexalph" is a factor. This is a categorical variable (either numeric or string) with a finite number of levels, here "female" and "male".
    "sexalph" and "sex" contain the same information, but "sexalph" is a factor while "sex" is not.
    We can convert "sex" to a factor using as.factor().

  • Description of Data

    Description of Data: Birth weights

    # TELL R THAT sex IS A FACTOR
    Births.tab$sex
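
    A minimal sketch of the conversion on a stand-in vector. The coding 1 = male, 2 = female follows the rows printed earlier; the names sex_f and sex_lab and the use of factor() with labels are additions for illustration.

```r
# Stand-in for the integer-coded sex variable (1 = male, 2 = female)
sex <- c(2, 1, 2, 1, 1, 2)

# as.factor() keeps the codes as levels
sex_f <- as.factor(sex)
levels(sex_f)

# factor() additionally lets us attach readable labels to the codes
sex_lab <- factor(sex, levels = c(1, 2), labels = c("male", "female"))
table(sex_lab)
```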

  • Description of Data

    Descriptive Statistics

    There are many simple extractor functions for summary statistics in R.
    Common functions are mean(), sd(), median(), max() and min().

    mean(Births.tab$bweight)

    [1] 3136.9

    sd(Births.tab$bweight)

    [1] 637.45

    median(Births.tab$bweight)

    [1] 3188.5

    max(Births.tab$bweight)

    [1] 4553

    min(Births.tab[ , 2])

    [1] 628


  • Description of Data

    The Summary Function

    The function summary() can be used with many objects in R.
    When used on a data frame we get all the main summary statistics.

    # SUMMARY OF THE DATA FRAME
    summary(Births.tab)

           id         bweight         lowbw         gestwks
     Min.   :  1   Min.   : 628   Min.   :0.00   Min.   :24.7
     1st Qu.:126   1st Qu.:2862   1st Qu.:0.00   1st Qu.:37.9
     Median :250   Median :3188   Median :0.00   Median :39.1
     Mean   :250   Mean   :3137   Mean   :0.12   Mean   :38.7
     3rd Qu.:375   3rd Qu.:3551   3rd Qu.:0.00   3rd Qu.:40.1
     Max.   :500   Max.   :4553   Max.   :1.00   Max.   :43.2
                                                 NA's   :10
        preterm          matage        hyp         sex       sexalph
     Min.   :0.000   Min.   :23   Min.   :0.000   1:264   female:236
     1st Qu.:0.000   1st Qu.:31   1st Qu.:0.000   2:236   male  :264
     Median :0.000   Median :34   Median :0.000
     Mean   :0.129   Mean   :34   Mean   :0.144
     3rd Qu.:0.000   3rd Qu.:37   3rd Qu.:0.000
     Max.   :1.000   Max.   :43   Max.   :1.000
     NA's   :10

  • Description of Data

    Summaries

    We may only want summaries for some of the data, e.g. babies with birth weight < 2900g.
    We subset the data and then summarize as before:

    summary(Births.tab[Births.tab$bweight
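
    A minimal sketch of subsetting on a condition before summarizing, using a small stand-in data frame. The values are the first six rows shown earlier; the names births_demo and light are assumptions made for illustration.

```r
# Stand-in data (first six rows of bweight and matage shown earlier)
births_demo <- data.frame(bweight = c(2974, 3270, 2620, 3751, 3200, 3673),
                          matage  = c(34, 30, 35, 31, 33, 33))

# Logical condition: TRUE for rows with birth weight below 2900 g
light <- births_demo$bweight < 2900

# Keep only those rows (all columns), then summarize as before
summary(births_demo[light, ])
```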

  • Description of Data

    Group Summaries

    We can work on data separated by groups.
    Suppose that we want to calculate the mean birth weight for boys and girls (many ways to do this).
    We will use the tapply() function to apply the mean function to the two levels of "sexalph".
    The general form is tapply(variable, grouping factor, function).

    # MEAN BIRTH WEIGHT FOR BOYS AND GIRLS
    tapply(Births.tab$bweight, Births.tab$sexalph, mean)

      female     male
    3032.831 3229.902

  • Description of Data

    Histogram

    Often it is easier to get an impression of a distribution using plots.
    Histograms are typically used for continuous variables.

    hist(Births.tab$bweight, main = "Title", xlab = "Birth weight (g)")

    [Figure: histogram with main title "Title"; x-axis "Birth weight (g)", y-axis "Frequency"]

  • Description of Data

    Histogram

    Often it is easier to get an impression of a distribution using plots.
    Histograms are typically used for continuous variables. Here with a box drawn around the plot.

    hist(Births.tab$bweight, main = "Title", xlab = "Birth weight (g)")
    box()

    [Figure: histogram with main title "Title" and a surrounding box; x-axis "Birth weight (g)", y-axis "Frequency"]

  • Description of Data

    Boxplot

    Boxplots show the median, the upper and lower quartiles, and potentially extreme values.

    boxplot(Births.tab$bweight, xlab = "Birth weight (g)")

    [Figure: box plot of birth weight; axis label "Birth weight (g)"]


  • Modifying Data

    Modifying Data

    We will concentrate on how to modify and rearrange our data.
    Data can be sorted with the order function.
    order can sort the Births.tab data by "sex", and then by "bweight".
    The order function returns a vector of sorted indices, which we apply to the rows of the unsorted data frame to get a sorted version.

    Birth_sort
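
    A minimal sketch of sorting by two variables with order(), using a small stand-in data frame. The names df, idx and df_sorted are made up for illustration.

```r
df <- data.frame(sex     = c(2, 1, 2, 1),
                 bweight = c(2974, 3270, 2620, 3751))

# order() returns the row indices that sort by sex, then by bweight within sex
idx <- order(df$sex, df$bweight)
idx

# Applying the indices to the rows gives the sorted data frame
df_sorted <- df[idx, ]
df_sorted
```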

  • Modifying Data

    Creating new variables and deleting old

    New variables can be added to a data frame.

    # ADD A VARIABLE TO DATA FRAME
    Births.tab$log_bweight
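
    A minimal sketch of adding and deleting a variable on a stand-in data frame. That log_bweight holds the logarithm of bweight is an assumption based on the variable name.

```r
df <- data.frame(id = 1:3, bweight = c(2974, 3270, 2620))

# Add a new variable: assign to a column name that does not exist yet
df$log_bweight <- log(df$bweight)

# Delete a variable by assigning NULL to it
df$id <- NULL

names(df)
```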

  • Modifying Data

    Grouping the values of a variable using cut

    To group a continuous variable, e.g. mother’s age (matage), into the groups ]20-30], ]30-35], ]35-40], ]40-45], use cut().

    Births.tab$agegrp
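
    A minimal sketch of grouping ages with cut(). The ages are made up; the breaks follow the groups named on the slide.

```r
matage <- c(23, 30, 31, 35, 38, 43)

# breaks gives the interval limits; by default intervals are right-closed,
# i.e. (20,30], (30,35], (35,40], (40,45]
agegrp <- cut(matage, breaks = c(20, 30, 35, 40, 45))

table(agegrp)
```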

  • Modifying Data

    Creating new variables: RowSums

    Often we want to form new variables from other variables.
    For example we might want to calculate a total score from some subscores.
    We can sum variables using rowSums. Related functions are: rowMeans, colSums, colMeans.
    Notice the effect of the option na.rm:
    na.rm = FALSE: If we take a row sum where one of the values is missing then the row sum is set to missing.
    na.rm = TRUE: Missing values are ignored and the sum of the non-missing values is calculated.
    rowSums, rowMeans, colSums and colMeans give the same results as sapply or apply used with sum or mean, e.g. for a data frame x with numeric columns, colMeans(x) gives the same values as sapply(x, mean). sapply can be used with many other functions.

  • Modifying Data

    Creating new variables: RowSums

    # WANT TO MAKE A NEW VARIABLE SUMMING PRETERM, LOWBW AND HYP
    Births.tab$score
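
    A minimal sketch of building a score with rowSums and of the effect of na.rm, on a small stand-in data frame. The values and the name score_nonmissing are made up for illustration.

```r
df <- data.frame(preterm = c(0, NA, 1),
                 lowbw   = c(0, 0, 1),
                 hyp     = c(1, 0, 1))

# na.rm = FALSE (the default): a missing value makes the whole row sum NA
df$score <- rowSums(df[, c("preterm", "lowbw", "hyp")])

# na.rm = TRUE: missing values are skipped
df$score_nonmissing <- rowSums(df[, c("preterm", "lowbw", "hyp")], na.rm = TRUE)

df
```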

  • Modifying Data

    Split Data: Subset

    Sometimes we may need to split our data.
    In the Births data we may need to split the data into boys and girls.
    We can use the subset() function and assign the new data sets to separate R objects.
    Notice == (logical expression). We are not assigning a value to "sex", but asking whether "sex" is equal to 1.

    Births.Male
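
    A minimal sketch of splitting rows with subset(), on a stand-in data frame coded 1 = male, 2 = female as in the rows printed earlier. The names df, boys and girls are made up for illustration.

```r
df <- data.frame(id = 1:4, sex = c(2, 1, 2, 1),
                 bweight = c(2974, 3270, 2620, 3751))

# == tests for equality; rows where the condition is TRUE are kept
boys  <- subset(df, sex == 1)
girls <- subset(df, sex == 2)

boys
```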

  • Modifying Data

    Subset

    Often data sets come with a lot of variables and we only want to use a few.
    The function subset() can also be used to select the variables we want.
    Notice the select option. This is needed to say that we want a subset of columns (on the previous slide it was rows).
    Notice that we do not need quotes in select.

    # SELECT 3 VARIABLES
    Births.new
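
    A minimal sketch of selecting columns with the select option. Which three variables the slide selects is not recoverable, so the choice below is an assumption.

```r
df <- data.frame(id = 1:3, bweight = c(2974, 3270, 2620),
                 matage = c(34, 30, 35), hyp = c(0, 0, 1))

# select takes unquoted variable names
df_new <- subset(df, select = c(id, bweight, matage))

names(df_new)
```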

  • Modifying Data

    Aggregating data

    Sometimes we want to make a new data frame as a summary of the original data frame on the basis of factor levels.
    Below we want to make a new data frame with the mean birth weight for combinations of preterm and sex.

    PreSex
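
    One way to compute such group means is aggregate() with a formula; this is a sketch on made-up data, and whether the slide uses the same call is not known.

```r
df <- data.frame(preterm = c(0, 0, 1, 1, 0),
                 sex     = c(1, 2, 1, 2, 1),
                 bweight = c(3400, 3200, 2300, 2100, 3600))

# One row per combination of preterm and sex, with the group mean of bweight
pre_sex <- aggregate(bweight ~ preterm + sex, data = df, FUN = mean)

pre_sex
```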

  • Modifying Data

    Add rows: rbind

    Suppose that data are collected for subgroups of subjects and saved in separate objects.
    The separate objects are appended (stacked) to create a single object.
    This will give an error message if the number of columns differs. For data frames the column names must also match.

    # APPEND
    Births.Both
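
    A minimal sketch of stacking two data frames with rbind(). The objects boys, girls and both are made up for illustration.

```r
boys  <- data.frame(id = c(2, 4), bweight = c(3270, 3751))
girls <- data.frame(id = c(1, 3), bweight = c(2974, 2620))

# Stack the rows; both data frames must have the same columns
both <- rbind(boys, girls)

both
```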

  • Modifying Data

    Add variables: merge

    Often you have data in several data sets and want to combine them by merging, using one or more variables as key variables. This adds variables to a master data set.

    Person data: Id, age, sex, race
    Answers to questionnaire: Id, q1,…,q10
    Merged data (person data and answers): Id, age, sex, race, q1,…,q10

  • Modifying Data

    Merge

    We have two data sets with a key variable "id". One with background information and one set with blood pressure measurements.

    agesex

  • Modifying Data

    4 Different Merges

    In the merge function we will look at 4 of the options.
    We have merge(x, y, by = "key variable", ...) combined with one of all = FALSE, all = TRUE, all.x = TRUE or all.y = TRUE.
    Here x and y are data frames.
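
    The four variants can be sketched on two small stand-in data frames. The contents of agesex and bp below are made up; only the object names follow the slides.

```r
# Hypothetical stand-ins for the background data and the blood pressure data
agesex <- data.frame(id = c(1, 2, 3), age = c(34, 51, 47), sex = c("F", "M", "F"))
bp     <- data.frame(id = c(2, 3, 4), bp = c(120, 135, 128))

merge_small <- merge(agesex, bp, by = "id", all = FALSE)   # ids in both: 2, 3
merge_large <- merge(agesex, bp, by = "id", all = TRUE)    # ids in either: 1, 2, 3, 4
merge_x     <- merge(agesex, bp, by = "id", all.x = TRUE)  # all ids from agesex: 1, 2, 3
merge_y     <- merge(agesex, bp, by = "id", all.y = TRUE)  # all ids from bp: 2, 3, 4

# Rows without a match get NA in the columns from the other data frame
merge_large
```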


  • Modifying Data

    Merging all=FALSE

    merge_small

  • Modifying Data

    Merging all=TRUE

    merge_large

  • Modifying Data

    Merging all.x=TRUE

    merge_x

  • Modifying Data

    Merging all.y=TRUE

    merge_y

  • Modifying Data

    Counting the Missing Observations: The is.na() and sum()functions

    Suppose that we want to count the number of missing observations.
    The function is.na returns a logical vector that is TRUE when a value is missing and FALSE otherwise.

    is.na(merge_y$sex)

    [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE

    # COUNT MISSING FOR ONE VARIABLE
    sum(is.na(merge_y$sex))

    [1] 1

    # COUNT FOR DATA FRAME
    colSums(is.na(merge_y))

       id   age   sex visit    bp
        0     2     1     0     0

  • Modifying Data

    Saving your work

    Saving your script
    Saving your workspace

    Always save your script - do it often if you work in RStudio.

    Reasons for saving your workspace:
    Extensive data creations will be there next time you open your workspace.
    Objects created ’on the fly’ (not in your script) will be there.

    Reasons for not saving your workspace:
    With a well-written script, you can recreate your analysis in seconds, unless you work with huge amounts of data.
    Edited and saved data where the edits have been forgotten may cause havoc on your results.
    Left-over objects created for various purposes may enter your calculations unintentionally due to the structure of R’s search path.

  • Modifying Data

    Saving your work

    How to save your work:

    Script: Click on the script and press ’save’ in RStudio and the plain R GUI.
    Workspace: Click on the command prompt and press ’save’. Alternatively, use the save.image() function.
    Both: Accept when asked after terminating RStudio or the plain R GUI.


  • Graphics

    Visualizing Data

    Whenever we want to analyze data, the first thing we do is to have a look at it.
    How are the observations spread out? What are the most common values? Are there any unusual observations? Are there any relationships between variables? Etc.

    The graphics section will not tell you all about graphics in R, but it will get you going.

  • Graphics

    R Graphics Systems

    base: The original/default graphics system in R.
    Example:

    demo(graphics)

    Highly customizable; but complex plots require much code.

    lattice: Shorter syntax for complex (e.g. multipanel) plots. Less customizable than base.
    Example:

    library(lattice)
    demo(lattice)

    ggplot2: By Hadley Wickham; builds on the same ideas as lattice. gg = "grammar of graphics".
    Example:

    library(ggplot2)
    example(qplot)

  • Graphics Histogram

    A Basic Histogram

    Common way to examine the distribution of a continuous variable.
    The range of the variable is by default divided into equal-width intervals (bins). Plots the number of observations in each bin (unless you specify otherwise).

    hist(Births$bweight)

    [Figure: histogram titled "Histogram of Births$bweight"; x-axis "Births$bweight", y-axis "Frequency"]

    Note that R automatically creates axis labels and a heading.

  • Graphics Histogram

    Histogram with a few options

    To modify axis labels we set the options xlab and ylab.
    The heading is set in the option main.

    hist(Births$bweight, xlab = "Birth weight (g)",
         main = "Histogram of Birth Weight")

    [Figure: histogram titled "Histogram of birth weight"; x-axis "Birth weight (g)", y-axis "Frequency"]

  • Graphics Histogram

    Histogram with more options

    We could type ?hist to find more options to customize the histogram.
    The available colours are coded as numbers, or one can write col = "red".
    If we want shading we can try the density option.
    The orientation of the numbers on the axes is set by the option las.

    hist(Births$bweight, las = 1, main = "Histogram of birth weight",
         col = 2, density = 7)

    [Figure: shaded histogram titled "Histogram of birth weight"; x-axis "Births$bweight", y-axis "Frequency"]

  • Graphics Histogram

    How to get your plot from RStudio


  • Graphics Box plot

    A Basic Box Plot

    Box plots show a measure of the location (the median line).
    The spread of the distribution (the length of the box and whiskers).
    Skewness as asymmetry in the upper and lower parts of the box and whisker length.
    We use the function boxplot(variable). Adding labels to the axes and colours is done as for hist.

  • Graphics Box plot

    Histograms and a Box Plot


  • Graphics Box plot

    A Basic Box Plot

    When describing data we can even add the observations to the plot.
    Notice the function rug shows the observations.

    boxplot(Births$bweight, xlab = "Birth weight (g)", horizontal = TRUE,
            col = 6)

    rug(Births$bweight)

    [Figure: horizontal box plot with a rug of observations; x-axis "Birth weight (g)"]

  • Graphics Box plot

    Box Plot for Groups

    A very useful feature is that we can make box plots for different groups next to each other for comparison. Notice the option data = Births.

    # BOX PLOT FOR BOYS AND GIRLS
    boxplot(bweight ~ sexalph, data = Births, las = 1,
            ylab = "Birth weight (g)", col = 2:3)

    [Figure: box plots of birth weight for the groups "female" and "male"; y-axis "Birth weight (g)"]

  • Graphics Box plot

    Box Plot for Groups

    Set our own axis. Notice xaxt = "n".

    # BOX PLOT WHERE WE WANT TO MAKE OUR OWN AXIS
    boxplot(bweight ~ sexalph, data = Births, las = 1,
            ylab = "Birth weight (g)", col = c("red", "blue"), xaxt = "n")
    axis(1, at = c(1,2), labels = c('Girl', 'Boy'))

    [Figure: box plots of birth weight with x-axis labels "Girl" and "Boy"; y-axis "Birth weight (g)"]

  • Graphics Scatter Plot

    The Basic Scatter Plot

    The scatter plot is the standard graph for examining the relationship between two continuous variables.
    The plot(x,y) function is used to create scatter plots, where (x,y) are the points we want to plot.
    We will look at the relationship between car weight (lbs/1000) and miles per gallon for 32 cars.

    plot(mtcars$wt, mtcars$mpg)
    lines(sort(mtcars$wt), 37.285-5.344*sort(mtcars$wt), type="l")

  • Graphics Scatter Plot

    The Basic Scatter Plot

    [Figure: scatter plot; x-axis "mtcars$wt", y-axis "mtcars$mpg"]

  • Graphics Scatter Plot

    The Scatter Plot

    We can customize the scatter plot similar to before.
    The function abline adds a straight line to the plot.
    When we write abline(lm(mpg ~ wt)) we get the best fitting line.

    plot(mtcars$wt, mtcars$mpg, xlab = "Car weight (lbs/1000)",
         ylab = "Miles per gallon", las = 1, pch = 19)

    abline(lm(mtcars$mpg ~ mtcars$wt), lty = 1, col = 3)

    [Figure: scatter plot with fitted line; x-axis "Car weight (lbs/1000)", y-axis "Miles per gallon"]

  • Graphics Scatter Plot

    abline

    The function abline can also add reference lines to a plot.
    A horizontal line, e.g. at 25 and 30: abline(h = c(25, 30))
    A vertical line, e.g. at 2 and 5: abline(v = c(2, 5))

    plot(mtcars$wt, mtcars$mpg, xlab = "Car weight (lbs/1000)",
         ylab = "Miles per gallon", las = 1, pch = 19)

    abline(h = c(25, 30), col = c("red", "magenta"), lty = 2)
    abline(v = c(2, 5), col = 4:5, lty = 3:4)

    [Figure: scatter plot with horizontal and vertical reference lines; x-axis "Car weight (lbs/1000)", y-axis "Miles per gallon"]

  • Graphics Scatter Plot

    Add a smoothed line

    Perhaps we do not think the association is linear and try a nonparametricsmoothed line.

    plot(mtcars$wt, mtcars$mpg, xlab = "Car weight (lbs/1000)",
         ylab = "Miles per gallon", las = 1)
    abline(lm(mtcars$mpg ~ mtcars$wt), lty = 2, col = 4)
    lines(lowess(mtcars$wt, mtcars$mpg), lty = 1, col = 2)


  • Graphics Scatter Plot

    Add a smoothed line

    [Figure: scatter plot of miles per gallon against car weight (lbs/1000), with the fitted regression line (dashed) and the lowess smoothed line (solid).]

  • Graphics Scatter Plot

    Enhanced graph procedures: Scatter plot example from the "car" package

    library(car)
    scatterplot(mpg ~ wt | cyl, data = mtcars, ylim = c(0, 40),
                xlab = "Car weight (lbs/1000)",
                ylab = "Miles per gallon", las = 1,
                legend.plot = TRUE,
                legend.coords = "topright",
                id.method = "identify",
                labels = row.names(mtcars),
                boxplots = "xy")

    Here we want to plot miles per gallon versus weight for cars that have 4, 6 or 8 cylinders. We write this as mpg ~ wt | cyl.

    By default we get different colours for groups and both a linear and a smoothed line.

    A legend is included in the top right corner of the plot.

    The option id.method = "identify" means that points can be identified by mouse clicks.

    Box plots of miles per gallon and weight are included (the "xy" option gives box plots for both axes).

    More possibilities: ?scatterplot.


  • Graphics Scatter Plot

    The resulting scatter plot


  • Graphics Line plot

    A Line Plot

    A line plot connects the points in a scatter plot from left to right. Here we plot the growth of a tree. Notice the option type = "b", meaning points joined by lines.

    plot(TreeA$age, TreeA$circumference, type = "b", xlab = "Age (days)",
         ylab = "Circumference (mm)", las = 1)

    [Figure: line plot of circumference (mm) against age (days) for one tree, with points joined by lines.]

  • Graphics Line plot

    Difference between plot() and lines() functions

    We have seen both the plot and the lines functions.

    The plot function creates a new graph. It is a high-level plotting function.

    The lines function adds information to an existing graph, but it cannot produce its own graph. It is a low-level plotting function.

    A high-level plotting function can (often) be converted to a low-level plotting function with the option add = TRUE.

    Usually lines will be used after a high-level plotting function (such as plot) has produced a graph.
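    As a minimal sketch of the add = TRUE option, the example below uses curve(), a high-level function in base R graphics that accepts add = TRUE. The choice of curve(), the object name cv, and the plotting range are illustrative assumptions; the line 37.285 - 5.344 * x is the regression line used on the earlier scatter plot slide.

```r
# High-level call: opens a new graph.
plot(mtcars$wt, mtcars$mpg, xlab = "Car weight (lbs/1000)",
     ylab = "Miles per gallon", las = 1)

# curve() is normally high-level, but with add = TRUE it draws
# on the existing graph instead of starting a new one.
cv <- curve(37.285 - 5.344 * x, from = 1.5, to = 5.5, add = TRUE, col = 2)

# Low-level call: lines() can only add to the graph that plot() produced.
lines(lowess(mtcars$wt, mtcars$mpg), lty = 2, col = 4)
```

    Calling lines() before any high-level call would fail, because no graph exists to draw on.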


  • Graphics Line plot

    A line plot and a legend

    plot(TreeA$age, TreeA$circumference, type = "b", lty = 1,
         xlab = "Age (days)",
         ylab = "Circumference (mm)", las = 1, col = 2)
    lines(TreeB$age, TreeB$circumference, type = "b", col = 3, lty = 2)
    legend(locator(1), # we will place it with a mouse click
           legend = c("A", "B"), title = "Tree",
           lty = 1:2, col = 2:3)


  • Graphics Line plot

    Layout of several plots on one graph

    Several plots on one graph:

    Use the option par(mfrow = c(2, 2)) to get a 2 x 2 grid of plots, and go back to one plot per graph with par(mfrow = c(1, 1)). For other options, check the layout() function.
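    A minimal sketch of a 2 x 2 layout, using the mtcars data from the earlier slides. The four plots chosen and the name old_par are assumptions made for illustration.

```r
# Switch to a 2 x 2 grid; par() returns the previous settings invisibly.
old_par <- par(mfrow = c(2, 2))

hist(mtcars$mpg, main = "Histogram", xlab = "Miles per gallon")
boxplot(mtcars$mpg, main = "Box plot", ylab = "Miles per gallon")
plot(mtcars$wt, mtcars$mpg, main = "Scatter plot",
     xlab = "Car weight (lbs/1000)", ylab = "Miles per gallon")
plot(sort(mtcars$wt), type = "b", main = "Line plot",
     xlab = "Rank", ylab = "Car weight (lbs/1000)")

# Restore the previous layout.
par(old_par)
```

    Saving and restoring the old settings avoids having to remember what the layout was before.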


  • Linear Models

  • Linear models

    Statistical models of a linear relationship between variables:

    $Y_i = \alpha + \beta X_i + \varepsilon_i, \quad i = 1, \dots, n.$

    • Dependent variable: $Y$.

    • Independent variable: $X$.

    • Stochastic term/error term: $\varepsilon$.

    The $\varepsilon_i$'s should be a) stochastically independent, and b) identically normally distributed, with mean 0 and variance $\sigma^2$ for some positive number $\sigma^2 > 0$.

    • Model parameters: $\alpha$, $\beta$ and $\sigma^2$.

  • Linear models: Example

    Y = 1 + 0.5X + ε

    > plot(X,Y,xlab="X",ylab="Y")
    > lines(sort(X),1+0.5*sort(X),lwd=3,col="red")

    [Figure: scatter plot of Y against X with the line 1 + 0.5X in red.]
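    The slides do not show how X and Y were generated. A minimal sketch of one way to simulate data from this model follows. The sample size, the uniform distribution of X, the error standard deviation and the seed are all assumptions chosen for illustration.

```r
set.seed(1)                          # assumed seed, for reproducibility only
n <- 100                             # assumed sample size
X <- runif(n, min = 0, max = 5)      # assumed design: X uniform on (0, 5)
eps <- rnorm(n, mean = 0, sd = 0.7)  # independent N(0, sigma^2) errors, sigma assumed
Y <- 1 + 0.5 * X + eps               # the model Y = 1 + 0.5 X + epsilon

plot(X, Y, xlab = "X", ylab = "Y")
lines(sort(X), 1 + 0.5 * sort(X), lwd = 3, col = "red")  # the true regression line
```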

  • Linear models: Example

    Model residuals: The random/stochastic term.

    > residuals.Y 0.

    f

  • Fitting linear models: The lm() function

    Y = α+ βX + ε

    In R, linear models can be fitted to data with the lm() function:

    > analysis <- lm(Y ~ X)
    > analysis

    Call:
    lm(formula = Y ~ X)

    Coefficients:
    (Intercept)            X
         0.9702       0.5155

    The estimated intercept, $\hat{\alpha}$, is 0.97, while $\hat{\beta}$, the estimated coefficient of X, is 0.52.

  • Model formulas

    The argument to lm() is a formula object.

    • A linear model is specified by a formula object, which may, for example, look like this:

    > my.formula fit fit fit
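    The right-hand sides of the assignments above were lost, so the following is only a sketch of how a formula object could be created and passed to lm(). The formula Y ~ X and the simulated data are assumptions; only the names my.formula and fit come from the slide.

```r
set.seed(2)                              # assumed seed
X <- runif(50, 0, 5)                     # simulated data, for illustration only
Y <- 1 + 0.5 * X + rnorm(50, sd = 0.7)

my.formula <- Y ~ X                      # a formula is an ordinary R object
print(class(my.formula))
print(all.vars(my.formula))              # variables referenced by the formula

fit <- lm(my.formula)                    # same model as lm(Y ~ X)
print(coef(fit))
```

    Storing the formula in a variable makes it easy to reuse the same model specification with different data or fitting functions.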

  • The lm object: Model diagnostics

    > analysis
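    The code on this slide was truncated. One common way to run model diagnostics on an lm object is its plot() method, which draws standard residual plots. The sketch below assumes this is the kind of diagnostic intended and uses simulated data, since X and Y are not given.

```r
set.seed(3)                               # assumed seed
X <- runif(100, 0, 5)                     # simulated data, for illustration only
Y <- 1 + 0.5 * X + rnorm(100, sd = 0.7)
analysis <- lm(Y ~ X)

old_par <- par(mfrow = c(2, 2))           # four diagnostic plots on one graph
plot(analysis)                            # residuals vs fitted, normal Q-Q, scale-location, leverage
par(old_par)
```

    The residual plots should show no systematic pattern, and the Q-Q plot should be close to a straight line, if the model assumptions hold.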

  • The lm object: Contents

    • An lm object is a list, and contains a lot of information. See the contents with the names() function:

    > analysis <- lm(Y ~ X)
    > names(analysis)
     [1] "coefficients"  "residuals"     "effects"
     [4] "rank"          "fitted.values" "assign"
     [7] "qr"            "df.residual"   "xlevels"
    [10] "call"          "terms"         "model"

    • Access the contents with the $ operator, e.g.

    > analysis$coef
    (Intercept)           X
      0.9701906   0.5154684

    • Some of the 12 components of the list are lists themselves. Find more information by applying str().

  • The lm object: Summaries

    The summary() function may be applied to lm objects as well:

    > analysis <- lm(Y ~ X)
    > summary(analysis)

    Call:
    lm(formula = Y ~ X)

    Residuals:
         Min       1Q   Median       3Q      Max
    -1.61297 -0.40132  0.07808  0.55124  1.32380

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.97019    0.09182  10.566  < 2e-16 ***
    X            0.51547    0.05410   9.527 1.29e-15 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.6861 on 98 degrees of freedom
    Multiple R-squared:  0.4808,  Adjusted R-squared:  0.4755
    F-statistic: 90.77 on 1 and 98 DF,  p-value: 1.286e-15

  • The lm object: Summaries

    The summary is itself an R list object, with sub-elements that can be accessed:

    > analysis <- lm(Y ~ X)
    > names(summary(analysis))
     [1] "call"          "terms"         "residuals"
     [4] "coefficients"  "aliased"       "sigma"
     [7] "df"            "r.squared"     "adj.r.squared"
    [10] "fstatistic"    "cov.unscaled"

    We can find the estimate $\hat{\sigma}^2$ of $\sigma^2$ as

    > summary(analysis)$sigma^2
    [1] 0.4707802
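    The variance estimate is the residual sum of squares divided by the residual degrees of freedom. A minimal sketch checking this on simulated data (the data, the seed and the names sigma2_summary and sigma2_manual are assumptions for illustration):

```r
set.seed(4)                               # assumed seed
X <- runif(100, 0, 5)                     # simulated data, for illustration only
Y <- 1 + 0.5 * X + rnorm(100, sd = 0.7)
analysis <- lm(Y ~ X)

sigma2_summary <- summary(analysis)$sigma^2
# Residual sum of squares divided by residual degrees of freedom (n - 2 here).
sigma2_manual <- sum(residuals(analysis)^2) / df.residual(analysis)

print(c(sigma2_summary, sigma2_manual))
```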

  • Modeling nonlinear relations with lm()

    > plot(X,Y)

    > lines(sort(X),predict(lm(Y~X))[order(X)])

    [Figure: scatter plot of Y against X with the fitted straight line.]

    The relationship between Y and X is not linear. How to proceed with lm()?

  • Modeling nonlinear relations with lm()

    • The I-operator in formulas:

    > analysis
    > plot(X,Y)
    > lines(sort(X),predict(analysis)[order(X)],type="l" )

    [Figure: scatter plot of Y against X with the fitted curve.]

    ’Linear’ in lm() is relative to the ’right’ independent variables.
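    The model fitted on the slide was lost in extraction. As a sketch of the idea, the example below assumes a quadratic relationship and regresses Y on X^2 by wrapping the power in I(). The simulated data, the seed and the choice of exponent are all assumptions.

```r
set.seed(5)                               # assumed seed
X <- runif(100, 0, 5)                     # simulated data, for illustration only
Y <- 2 + 4 * X^2 + rnorm(100, sd = 5)     # assumed quadratic relationship

analysis <- lm(Y ~ I(X^2))                # linear in the transformed variable X^2
print(coef(analysis))

plot(X, Y)
lines(sort(X), predict(analysis)[order(X)], type = "l")
```

    Without I(), the formula Y ~ X^2 would be read symbolically and reduce to Y ~ X.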

  • Extraction functions

    • Some important extraction functions for obtaining information:

    coef()          Estimated model parameters
    confint()       Confidence intervals for estimated model parameters
    residuals()     Raw residuals
    rstandard()     Standardized residuals
    model.matrix()  The design matrix
    predict()       Predictions from model
    vcov()          Covariance matrix for estimated model parameters
    anova()         Anova test table for model reduction
    drop1()         Test for dropping one term from model
    summary()       A summary printout, and access to summary statistics

    • Statistical tests: drop1() is usually the function to use.
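    A short sketch of these extraction functions in use. It fits mpg on wt in the built-in mtcars data from the graphics slides; the object name fit and the new weight value 3 are assumptions for illustration.

```r
fit <- lm(mpg ~ wt, data = mtcars)

print(coef(fit))                    # estimated intercept and slope
print(confint(fit, level = 0.95))   # confidence intervals for the parameters
print(head(residuals(fit), 3))      # raw residuals
print(head(rstandard(fit), 3))      # standardized residuals
print(dim(model.matrix(fit)))       # design matrix: one row per car, columns (Intercept) and wt
print(vcov(fit))                    # covariance matrix of the estimates
pred <- predict(fit, newdata = data.frame(wt = 3))  # prediction at wt = 3 (3000 lbs)
print(pred)
print(drop1(fit, test = "F"))       # F-test for dropping wt from the model
```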

  • Factors and interactions

    A dataset on Sex, Age, and a response Y:

    > summary(my.data)
         Sex          Age              Y
     Female:50   Min.   :18.70   Min.   : 3.091
     Male  :50   1st Qu.:36.38   1st Qu.: 9.274
                 Median :51.12   Median :12.430
                 Mean   :49.99   Mean   :12.711
                 3rd Qu.:63.22   3rd Qu.:15.972
                 Max.   :77.31   Max.   :21.800

    plot(my.data$Age, my.data$Y, xlab = "", ylab = "Y", col = my.data$Sex)
    legend(20, 20, c("Female", "Male"), col = 1:2, pch = 1)

    [Figure: scatter plot of Y against Age, with points coloured by Sex (legend: Female, Male).]

  • Factors and interactions

    Model: Interaction between Sex and Age. Testing the interaction term with drop1():

    > analysis <- lm(Y ~ Age + Sex + Age:Sex, data = my.data)
    > drop1(analysis,test="F")
    Single term deletions

    Model:
    Y ~ Age + Sex + Age:Sex
            Df Sum of Sq    RSS    AIC F value    Pr(>F)
    <none>               102.77 10.737
    Age:Sex  1    33.131 135.91 36.679  30.947 2.391e-07 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    The interaction is statistically significant and cannot be removed from the model.

  • Factors and interactions

    > summary(analysis)

    Call:
    lm(formula = Y ~ Age + Sex + Age:Sex, data = my.data)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.60300 -0.53551  0.00317  0.59830  2.43544

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  1.696993   0.452879   3.747 0.000305 ***
    Age          0.187921   0.009045  20.777  < 2e-16 ***
    SexMale     -0.525623   0.677086  -0.776 0.439479
    Age:SexMale  0.071599   0.012871   5.563 2.39e-07 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.035 on 96 degrees of freedom
    Multiple R-squared:  0.945,  Adjusted R-squared:  0.9433
    F-statistic: 550.2 on 3 and 96 DF,  p-value: < 2.2e-16

    • R uses the first level of the Sex variable (Female) as the reference level; similarly for the interaction term.

  • Factors and interactions

    > my.data2
    > with(my.data2, {
    +   plot(Age, Y, xlab = "Age", ylab = "Y", col = Sex)
    +   lines(Age[Sex == "Female"], predicted[Sex == "Female"], col = 1, type = "l")
    +   lines(Age[Sex == "Male"], predicted[Sex == "Male"], col = 2, type = "l")
    +   legend(20, 20, c("Female", "Male"), col = 1:2, pch = 1)
    + })

    [Figure: Y against Age with points coloured by Sex and the fitted line for each sex (legend: Female, Male).]
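    The data set and the construction of my.data2 are not shown on the slides. The sketch below simulates comparable data and shows one plausible way to build my.data2: add the model predictions as a column named predicted and sort by Age so that lines() draws from left to right. The simulated coefficients, the seed and this construction are assumptions.

```r
set.seed(6)                                   # assumed seed
n <- 100
my.data <- data.frame(
  Sex = factor(rep(c("Female", "Male"), each = n / 2)),
  Age = runif(n, 18, 78)
)
# Assumed true model: a steeper Age slope for males.
my.data$Y <- with(my.data,
                  1.7 + 0.19 * Age + 0.07 * Age * (Sex == "Male") + rnorm(n))

analysis <- lm(Y ~ Age + Sex + Age:Sex, data = my.data)
d <- drop1(analysis, test = "F")              # tests the interaction term only
print(d)

# Add predictions and sort by Age so that lines() draws from left to right.
my.data2 <- my.data
my.data2$predicted <- predict(analysis)
my.data2 <- my.data2[order(my.data2$Age), ]

with(my.data2, {
  plot(Age, Y, xlab = "Age", ylab = "Y", col = Sex)
  lines(Age[Sex == "Female"], predicted[Sex == "Female"], col = 1)
  lines(Age[Sex == "Male"], predicted[Sex == "Male"], col = 2)
  legend("topleft", c("Female", "Male"), col = 1:2, pch = 1)
})
```

    drop1() respects marginality, so with an interaction in the model it offers only the interaction term for deletion.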

  • More on formula objects

    • Model formulae are symbolic. We have seen the use of '+' and ':', and adding a 0 or -1.

    • The product '*' crosses variables; it expands to main effects and interactions:

    y ~ x*z

    corresponds to

    y ~ x + z + x:z

    • Powers expand effects to the specified order:

    y ~ (x+z+w)^2

    corresponds to

    y ~ x + z + w + x:z + x:w + z:w

    • The subtraction operator '-' removes terms if possible:

    y ~ (x+z+w)^2 - x:z - a:b

    corresponds to

    y ~ x + z + w + x:w + z:w
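    These expansions can be checked without any data by asking terms() for the term labels of a formula. A minimal sketch (the helper name expand is an assumption introduced here):

```r
# term.labels lists the terms a formula expands to; no data are needed.
expand <- function(f) attr(terms(f), "term.labels")

crossed <- expand(y ~ x * z)                       # main effects and interaction
squared <- expand(y ~ (x + z + w)^2)               # main effects and all two-way interactions
reduced <- expand(y ~ (x + z + w)^2 - x:z - a:b)   # x:z removed; a:b was never present

print(crossed)
print(squared)
print(reduced)
```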

  • More on formula objects

    The I() function overrides the symbolic interpretation, and invokes the usualarithmetic instead.

    Observe that

    y ~ (x*z)^2 and y ~ (x+z)^2

    specify the same model, but

    y ~ I((x*z)^2) and y ~ I((x+z)^2)

    are two different model formulas, regressing y on $x^2 z^2$ and $x^2 + z^2 + 2xz$, respectively.

  • Formulas when transforming data into normality

    • Sometimes it is possible to transform data such that they match a linear model.

    • For instance, if the variance is increasing with the mean:

    [Figure: two panels of residuals against predictions, titled "Raw data" and "Log transformed".]

    • A log transformation is often appropriate in this case.

    • This may be done directly in a formula object. For example:

    log(y)~log(x)+log(z)

  • Generalized linear models - the glm() function

    • Some types of observations can never be transformed into normality.

    • Example: binary data; ones and zeroes.

    • For a wide class of distributions, the so-called exponential families, we can use generalized linear models:

    • Formulate linear models for a transformation of the mean value.

    • No transformation of observations, thereby preserving their distributional properties.

    • Allows easy modeling in R with the glm() function, nearly identical to lm().

    • Standard example: Logistic regression.

  • GLM vs GLM

    General linear models        Generalized linear models
    Normal distribution          Exponential dispersion family
    Mean value linear            Function of mean value linear
    Independent observations     Independent observations
    Same variance                Variance function of mean
    lm() easy to apply           glm() almost as easy to apply

  • Types of response variables

    i Count data ($y_1 = 57, \dots, y_n = 59$ accidents) - Poisson distribution.

    ii Binary response variables ($y_1 = 0, y_2 = 1, \dots, y_n = 0$), or frequencies of counts ($y_1 = 15/297, \dots, y_n = 144/285$) - Binomial distribution.

    iii Count data, waiting times - Negative Binomial distribution.

    iv Multiple ordered categories "Unsatisfied", "Neutral", "Satisfied" - Multinomial distribution.

    v Count data, multiple categories - Multinomial distribution.

    vi Continuous responses, constant variance ($y_1 = 2.567, \dots, y_n = 2.422$) - Normal distribution.

    vii Continuous positive responses with constant coefficient of variation - Gamma distribution.

  • Logistic regression example

    In a study of developmental toxicity of a chemical compound, a specified amount of an ether was dosed daily to pregnant mice, and after 10 days all fetuses were examined. The size of each litter and the number of stillborns were recorded:

    Index   Number of        Number of       Fraction          Concentration
            stillborn, z_i   fetuses, n_i    stillborn, y_i    [mg/kg/day], x_i
    1        15              297             0.0505              0.0
    2        17              242             0.0702             62.5
    3        22              312             0.0705            125.0
    4        38              299             0.1271            250.0
    5       144              285             0.5053            500.0

    Table: Results of a dose-response experiment on pregnant mice. Number of stillborn fetuses found for various dose levels of a toxic agent.

    Reported in Price et al. (1987).

  • Logistic regression example

    Let $Z_i$ denote the number of stillborns at dose concentration $x_i$.

    We shall assume $Z_i \sim B(n_i, p_i)$, that is, a binomial distribution corresponding to $n_i$ independent trials (fetuses), with the probability of stillbirth, $p_i$, being the same for all $n_i$ fetuses.

    We want to model $Y_i = Z_i/n_i$. In particular, we will look for a model for $E[Y_i] = p_i$.

  • Logistic regression example

    • A natural quantity to consider is the odds, $p/(1-p)$, which varies on $(0, \infty)$; this is a more natural scale than $(0, 1)$, where $p$ varies.

    • Since effects on the odds are often multiplicative, we take the log to convert the effects to additive form.

    • We arrive at the logit function:

    $\mathrm{logit}(p) = \log\left(\frac{p}{1-p}\right).$

    For this model, the logit function is our link function. We will formulate a linear model for the mean values transformed with the link function:

    $\eta_i = \mathrm{logit}(p_i), \quad i = 1, \dots, 5.$

    The linear model is

    $\eta_i = \alpha + \beta x_i, \quad i = 1, \dots, 5.$

    • The inverse transformation, which gives the probabilities of stillbirth, $p_i$, is the so-called logistic function:

    $p_i = \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}, \quad i = 1, \dots, 5.$

  • Logistic regression example

    > mice mice$resp mice.glm
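    The assignments on this slide were truncated in extraction. The sketch below shows how the data from the dose-response table could be entered and the model fitted. The column names stillb and total are hypothetical, and resp is assumed to be the usual two-column matrix of (stillborn, not stillborn); only the glm() call itself is taken from the summary output that follows.

```r
mice <- data.frame(
  conc   = c(0, 62.5, 125, 250, 500),    # concentration [mg/kg/day], x_i
  stillb = c(15, 17, 22, 38, 144),       # number of stillborn, z_i (hypothetical column name)
  total  = c(297, 242, 312, 299, 285)    # number of fetuses, n_i (hypothetical column name)
)
# Binomial response as a two-column matrix: (stillborn, not stillborn).
mice$resp <- cbind(mice$stillb, mice$total - mice$stillb)

mice.glm <- glm(formula = resp ~ conc, family = binomial(link = logit), data = mice)
print(coef(mice.glm))
```

    With grouped binomial data, the two-column response lets glm() weight each dose level by its number of fetuses.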

  • Logistic regression example

    > summary(mice.glm)

    Call:

    glm(formula = resp ~ conc, family = binomial(link = logit), data = mice)

    Deviance Residuals:

    1 2 3 4 5

    1.1317 1.0174 -0.5968 -1.6464 0.6284

    Coefficients:

    Estimate Std. Error z value Pr(>|z|)

    (Intercept) -3.2479337 0.1576602 -20.6

  • Logistic regression example

    The linear predictor, $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$:

    Figure: Logit transformed observations and corresponding linear predictions for dose response assay. (Horizontal axis: Concentration; vertical axis: logit(still born fraction).)

  • Logistic regression example

    Predicted stillborn fractions, $\hat{p}_i = \exp(\hat{y}_i)/(1 + \exp(\hat{y}_i))$:

    Figure: Observed stillborn fractions and corresponding fitted values under logistic regression for dose response assay. (Horizontal axis: Concentration; vertical axis: Stillborn fraction.)
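    A minimal sketch of how the linear predictor and the predicted fractions could be computed from a fitted model. The data are those of the dose-response table; the column names stillb and total and the object names eta_hat and p_hat are assumptions, and the model is written with the response matrix inline.

```r
mice <- data.frame(conc   = c(0, 62.5, 125, 250, 500),
                   stillb = c(15, 17, 22, 38, 144),
                   total  = c(297, 242, 312, 299, 285))
mice.glm <- glm(cbind(stillb, total - stillb) ~ conc,
                family = binomial(link = logit), data = mice)

eta_hat <- predict(mice.glm, type = "link")    # linear predictor on the logit scale
p_hat <- exp(eta_hat) / (1 + exp(eta_hat))     # inverse logit (logistic function)

print(cbind(observed = mice$stillb / mice$total, fitted = p_hat))

plot(mice$conc, mice$stillb / mice$total, ylim = c(0, 0.6),
     xlab = "Concentration", ylab = "Stillborn fraction")
lines(mice$conc, p_hat)
```

    The same fitted probabilities are returned directly by predict(mice.glm, type = "response"), which applies the inverse link internally.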

  • Specification of a generalized linear model in glm()

    > mice.glm

  • ggplot2

    • Basic plotting function: ggplot(). Used for advanced plots.

    • Wrapper that resembles plot() from the basic graphics system: qplot(). Used for 'quick' plots. Syntax resembles that of plot().

    • Grammar of graphics:
        • All plots are objects. You build them incrementally. Use the operator + to add to an existing plot.
        • Layer: Aesthetics (aes): Defines how the data are mapped.
        • Layer: Geometric objects (geom): Points, lines, polygons, etc.
        • Layer: Coordinate system objects (coord).

  • Example: Diamond data

    • Load the ggplot2 package and take a look at the diamond data:

    > library(ggplot2)
    > head(diamonds)
      carat       cut color clarity depth table price    x    y    z
    1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
    2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
    3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
    4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
    5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
    6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

    • Quick plot

    > qplot(carat, price, data=diamonds)

    [Figure: scatter plot of price against carat.]

  • Modifying the quick plot

    • With qplot(), it is easy to work with:

    color   Color each point according to a variable in the dataset, and add a corresponding legend.
    log     Log-transform one or both axes.
    facets  Split in a multi-panel plot according to a group variable.
    main    Add a title.

    • Let's try modifying the previous plot by adding the following, one by one:
        • color = cut
        • log = "xy"
        • facets = ~ clarity
        • main = "Diamonds"

  • Modified plot

    > qplot(carat,

    + price,

    + data=diamonds,

    + color = cut,

    + log="xy",

    + facets=~clarity,

    + main="Diamonds")

    [Figure: price against carat on log-log axes, titled "Diamonds", in panels by clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF), with points coloured by cut (Fair, Good, Very Good, Premium, Ideal).]

  • Incremental plot construction

    • qplot is good for a start. However, in order to take full advantage of ggplot2, we must know what the plot is built of and how to modify the parts.

    • The quick plot qplot(carat, price, data=diamonds) can be built incrementally:

    • Define an empty plot object:

    > p p p p

    • We can use the '+' operator to modify the plot 'p'. Let's see some examples in the following:
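    The individual steps on this slide were lost in extraction. A minimal sketch of one way to build the quick plot incrementally follows; the exact sequence of calls is an assumption, using only functions described in the grammar-of-graphics overview (ggplot(), aes(), a geom).

```r
library(ggplot2)

p <- ggplot(diamonds)            # empty plot object bound to the data
p <- p + aes(carat, price)       # aesthetics: map carat to x and price to y
p <- p + geom_point()            # geometric object: a layer of points
print(p)                         # draw the plot
```

    Because p is an ordinary object, further layers, coordinate systems or facets can be added to it with + without rebuilding the plot from scratch.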

  • Change the plot type (geom)

    • Get an overview of possible geoms at http://docs.ggplot2.org.

    • You can also look at the examples in the documentation:

    > example(geom_boxplot)
    > example(geom_polygon)
    > example(geom_raster)

    • Example: add a 2D density on top:

    > p + geom_density2d()

    [Figure: scatter plot of price against carat with 2D density contours on top.]

  • Change the coordinate transformations

    > p + coord_flip()

    [Figure: the same plot with flipped axes: carat on the vertical axis, price on the horizontal axis.]

    > p + coord_polar()

    [Figure: the same plot in polar coordinates.]

  • Change to multipanel display

    • Add a facet grid to split the plot in multiple panels.

    • A facet grid takes a formula as input.

    • Example:

    > p + facet_grid(. ~ cut)

    [Figure: price against carat in five side-by-side panels, one per cut (Fair, Good, Very Good, Premium, Ideal).]

  • Change to multipanel display - other facets

    > p + facet_grid(cut ~ .)

    [Figure: price against carat in five stacked panels, one per cut (Fair, Good, Very Good, Premium, Ideal).]

  • Change to multipanel display - other facets

    > p + facet_grid(cut ~ color)

    [Figure: price against carat in a grid of panels, with one row per cut (Fair, Good, Very Good, Premium, Ideal) and one column per color (D, E, F, G, H, I, J).]

  • Density plot and alpha blending

    • Attributes (colour, shape, fill, linetype etc.) have automatically become grouping variables.

    • Note the specification of transparency through the alpha argument: alpha blending.

    > ggplot(diamonds) +
    +   aes(price, fill=cut) +
    +   geom_density(alpha=.3)

    [Figure: overlapping semi-transparent density curves of price, one per cut (Fair, Good, Very Good, Premium, Ideal).]

  • Application: Maps

    • The ggmap package: Interfacing ggplot2 and RGoogleMaps.

    • Two steps in making a map with ggmap:
        1. Download raster data for the map;
        2. Create the map with ggmap(), and overlay it with layers of geoms etc.

    • Downloading raster data: Specify a) the location of the center; b) the zoom factor.

    • Location specification for map downloads: Two ways.
        1. location/address:

    > myLocation myLocation

  • A first map: The London Olympic Stadium

    • Download data with the get_map() function, plot with ggmap():

    > mapData1
    > ggmap(mapData1, extent = "panel", ylab = "Latitude", xlab = "Longitude")

    [Figure: map of the area around the London Olympic Stadium; axes lon and lat.]

  • The London Olympic Stadium - same but different

    • Different map type:

    > mapData
    > ggmap(mapData, extent = "panel", ylab = "Latitude", xlab = "Longitude")

    [Figure: the same area shown with a different map type; axes lon and lat.]

  • The London Olympic Stadium - same but hybrid

    • Different map type:

    > mapData
    > ggmap(mapData, extent = "panel", ylab = "Latitude", xlab = "Longitude")

    [Figure: the same area shown as a hybrid map; axes lon and lat.]

  • Overlaying maps

    • Geographic coordinates obtained with the geocode() function:

    > geocode("University of Washington")
            lon      lat
    1 -106.4407 31.76788

    • A map of the USA: Let's overlay this map with data.

    > usa_center USA USA

    [Figure: map of the USA; axes lon and lat.]

  • Fatal vehicle accidents in the USA 2012

    • mv_collisions data:

    > head(mv_collisions)
            state collisions
    1     Alabama         78
    2     Arizona        145
    3    Arkansas         46
    4  California        722
    5    Colorado         77
    6 Connecticut         40

    • Getting the geocoordinates with geocode():

    > for (i in 1:nrow(mv_collisions)) {
    +   latlon = geocode(mv_collisions$state[i])
    +   mv_collisions$lon[i] = as.numeric(latlon[1])
    +   mv_collisions$lat[i] = as.numeric(latlon[2])
    + }

    • Getting the map:

    > usa_center = geocode("United States")
    > USA

  • Fatal vehicle accidents in the USA 2012

    • Overlaying the data:

    > circle_scale
    > USA + geom_point(aes(x=lon, y=lat), data=mv_collisions, col="red",
    +   alpha=0.4, size=mv_collisions$collisions*circle_scale)

    [Figure: map of the USA with a semi-transparent red circle per state, sized by the number of collisions; axes lon and lat.]

  • Credits

    • Original coding of the fatal motor vehicle collision example: Sean Lorenz.

    • ggmap: D. Kahle and H. Wickham (2013): ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL: http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

  • Where to go?

    R possibilities are endless:

    • R shiny – web applications
    • dplyr – data management
    • RODBC – Reading from SQL databases etc.
    • twitteR – text analytics of tweets
    • googleAnalyticsR – Google search analytics
    • Data Science with R on the edX platform – Online course by yours truly…
    • Or just practice… and check t.test()…
