understanding r for epidemiologists

60
Understanding R for Epidemiologists Tom´ as J. Arag´ on, MD, DrPH Faculty, Division of Epidemiology UC Berkeley School of Public Health Health Officer, City & County of San Francisco Director, Population Health Division (PHD) San Francisco Department of Public Health Blog: http://www.medepi.com Email: [email protected] September 8, 2014 Tom´ as Arag´on, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 1 / 60

Upload: tomas-j-aragon

Post on 19-Jun-2015

1.047 views

Category:

Health & Medicine


9 download

DESCRIPTION

Understanding R for Epidemiologists

TRANSCRIPT

Page 1: Understanding R for Epidemiologists

Understanding R for Epidemiologists

Tomas J. Aragon, MD, DrPH

Faculty, Division of EpidemiologyUC Berkeley School of Public Health

Health Officer, City & County of San FranciscoDirector, Population Health Division (PHD)San Francisco Department of Public Health

Blog: http://www.medepi.comEmail: [email protected]

September 8, 2014

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 1 / 60

Page 2: Understanding R for Epidemiologists

Outline

1 BackgroundCostQualityCommunity

2 Getting started with RFull-function calculator/spreadsheetExtensible statistical packagesHigh quality graphics toolMulti-use programming language

3 Working with R data objectsAtomic vs. recursive data objectsWorking with vectors, matrices, & arraysWorking with lists, data frames, and functions

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 2 / 60

Page 3: Understanding R for Epidemiologists

Background

Background: Major issues

Cost

Quality

Community

Functionality

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 3 / 60

Page 4: Understanding R for Epidemiologists

Background Cost

Cost: Open Source vs. Proprietary Software

Costs of software

Costs of multi-platforms

Costs of education and training

Costs of adding solutions (e.g., packages)

Costs of solving problems and sharing solutions

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 4 / 60

Page 5: Understanding R for Epidemiologists

Background Quality

Quality: Open Source vs. Proprietary Software

Core Development Team

Large pool of users/testers

Quality control process for packages

Bug fixes based on need/demand, not profits

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 5 / 60

Page 6: Understanding R for Epidemiologists

Background Community

Community: Open Source vs. Proprietary Software

Large community of users

Transparent development process

Growing number of books and trainings

Growing number of free tutorials and manuals

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 6 / 60

Page 7: Understanding R for Epidemiologists

Background Community

Current R contributors

Douglas BatesJohn ChambersPeter DalgaardSeth FalconRobert GentlemanKurt HornikStefano IacusRoss IhakaFriedrich LeischUwe Ligges

Thomas LumleyMartin MaechlerDuncan MurdochPaul MurrellMartyn PlummerBrian RipleyDeepayan SarkarDuncan Temple LangLuke TierneySimon Urbanek

Source: http://www.r-project.org/contributors.html

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 7 / 60

Page 8: Understanding R for Epidemiologists

Getting started with R

What is R?

Full-function calculator/spreadsheet

Extensible statistical packages

High-quality graphics tool

Multi-use programming language

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 8 / 60

Page 9: Understanding R for Epidemiologists

Getting started with R Full-function calculator/spreadsheet

Full-function calculator: Selected math operators

Operator Description Try these examples

+ addition 5+4

− subtraction 5-4

∗ multiplication 5*4

/ division 5/4

ˆ exponentiation 5^4

− unary minus (change currentsign)

-5

abs absolute value abs(-23)

exp exponentiation (e to a power) exp(8)

log logarithm (default is natural log) log(exp(8))

sqrt square root sqrt(64)

%/% integer divide 10%/%3

%% modulus 10%%3

%*% matrix multiplication xx <- matrix(1:4, 2, 2)

xx%*%c(1, 1)

c(1, 1)%*%xx

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 9 / 60

Page 10: Understanding R for Epidemiologists

Getting started with R Extensible statistical packages

Extensible statistical packages

Generalized Linear Models (Base)

Linear regressionLogistic regressionPoisson regression

Cox Proportional Hazard models (Survival)

Cox PH regressionConditional logistic regression (matched case-control studies)

Meta-analysis (meta)

Complex survey analysis (survey)

Epidemiology packages

epitoolsepicalcepibasixepiR

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 10 / 60

Page 11: Understanding R for Epidemiologists

Getting started with R High quality graphics tool

Graphics display of sample size curves

− Z1−α 2 µ0 Z1−α 2 µ1

α 2β

Power(1 − β)

Null distributionH0

Alternative distributionH1

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 11 / 60

Page 12: Understanding R for Epidemiologists

Getting started with R High quality graphics tool

Graphics display of P value function

0.2 0.5 1.0 2.0 5.0 10.0 20.02.9

00.05

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

100

95

90

80

70

60

50

40

30

20

10

0

Con

fiden

ce le

vel (

%)

Rate Ratio

P−

valu

e

Nul

l hyp

othe

sis

Med

ian

unbi

ased

est

imat

e

95%

Low

er C

onfid

ence

Lim

it =

0.7

4

95%

Upp

er C

onfid

ence

Lim

it =

21.

0

95% Confidence Interval

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 12 / 60

Page 13: Understanding R for Epidemiologists

Getting started with R High quality graphics tool

Graphical display of multiple linear regression

0 10 20 30 40 501020

3040

5060

7080

90

010

2030

4050

x1

x2

y

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 13 / 60

Page 14: Understanding R for Epidemiologists

Getting started with R High quality graphics tool

Epidemic curve using Color Brewer colors

UnknownWNFWNND

020

4060

80

West Nile Virus Human Cases Reported in California by Disease Week as of December 14, 2004

Cas

es

52 03 06 09 12 15 18 21 24 27 30 33 36 39 42 45 48 51Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Disease Week & Calendar Month, 2004

+ Bird2/24

+ Mosquito4/14

+ Chicken5/17

+ Horse6/20

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 14 / 60

Page 15: Understanding R for Epidemiologists

Getting started with R Multi-use programming language

Multi-use programming language

Vectorized computations

Functional programming language

Object-oriented programming

Text processing (e.g., using regular expressions)

Links to C, Fortran, etc.

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 15 / 60

Page 16: Understanding R for Epidemiologists

Working with R data objects Atomic vs. recursive data objects

Data objects in R

Object types

Vector

Matrix

Array

List

Data frame

Function

Operations

Create

Name

Index

Replace

Manipulate

Do computations

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 16 / 60

Page 17: Understanding R for Epidemiologists

Working with R data objects Atomic vs. recursive data objects

Summary of types of data objects in R

Data object Possible modea Default class

Atomic

vector character, numeric, logical NULL

matrix character, numeric, logical NULL

array character, numeric, logical NULL

Recursive

list list NULL

data frame list data frame

function function NULL

a We are ignoring complex numbers

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 17 / 60

Page 18: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding vectors

A vector is a collection of like elements without dimensions1. The vectorelements are all of the same mode (either character, numeric, or logical).

> y <- c("Pedro", "Paulo", "Maria")

> y

[1] "Pedro" "Paulo" "Maria"

> x <- c(1, 2, 3, 4, 5)

> x

[1] 1 2 3 4 5

> x < 3

[1] TRUE TRUE FALSE FALSE FALSE

1In other programming languages, vectors are either row vectors or column vectors.R does not make this distinction until it is necessary.

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 18 / 60

Page 19: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding vectors: Indexing

Indexing by Try these examples

Position x <- c(chol=234, sbp=148, dbp=78, age=54)

x[2] #positions to include

x[c(2, 3)]

x[-c(1, 3, 4)] #positions to exclude

x[-c(1, 4)]

Name x["sbp"]

x[c("sbp", "dbp")]

Logical x < 100

x[x < 100]

(x < 150) & (x > 70)

bp <- (x < 150) & (x > 70)

x[bp]

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 19 / 60

Page 20: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding vectors: Replacement

Replacing by Try these examples

Position x <- c(chol=234, sbp=148, dbp=78, age=54)

x[1]

x[1] <- 250

x

Name x["sbp"]

x["sbp"] <- 150

x

Logical x[x<100]

x[x<100] <- NA

x

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 20 / 60

Page 21: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding vectors: Replacement

> x <- c(chol = 234, sbp = 148, dbp = 78, age = 54)

> x[1] <- 250 #by position

> x

chol sbp dbp age

250 148 78 54

> x["sbp"] <- 150 #by name

> x

chol sbp dbp age

250 150 78 54

> x[x<100]

dbp age

78 54

> x[x<100] <- NA #by logical

> x

chol sbp dbp age

250 150 NA NA

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 21 / 60

Page 22: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding matrices

A matrix is a collection of like elements organized into a 2-dimensional(tabular) data object. Matrix elements can be either numeric, character,or logical. We can think of a matrix as a vector with a 2-dimensionalstructure. Contingency tables in epidemiology are represented in R asnumeric matrices or arrays. An array is the generalization of matrices to 3or more dimensions (commonly known as stratified tables). We coverarrays later, for now we will focus on 2-dimensional tables.

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 22 / 60

Page 23: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding matrices

When R returns a matrix the [n,] indicates the nth row and [,m]

indicates the mth column.

> x <- c("a", "b", "c", "d")

> y <- matrix(x, 2, 2)

> y

[,1] [,2]

[1,] "a" "c"

[2,] "b" "d"

> y[1,]

[1] "a" "c"

> y[,2]

[1] "c" "d"

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 23 / 60

Page 24: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding matrices

> x <- c(30, 21, 170, 180) # creating

> y <- matrix(x, 2, 2, byrow = TRUE) # creating

> y

[,1] [,2]

[1,] 30 21

[2,] 170 180

> rownames(y) <- c("Deaths", "Survivors") # naming

> colnames(y) <- c("Tolbutamide", "Placebo") # naming

> y[2, 1] <- 174 # replace by position

> y["Survivors", "Placebo"] <- 184 # replace by name

> y

Tolbutamide Placebo

Deaths 30 21

Survivors 174 184

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 24 / 60

Page 25: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding matrices

Consider the 2× 2 table of crude data in Table. In this randomized clinicaltrial (RCT), diabetic subjects were randomly assigned to receive eithertolbutamide, an oral hypoglycemic drug, or placebo. Because this was aprospective study we can calculate risks, odds, a risk ratio, and an oddsratio. We will do this using R as a calculator.

Table : Deaths among subjects who received tolbutamide and placebo in theUnversity Group Diabetes Program (1970)

Tolbutamide Placebo

Deaths 30 21

Survivors 174 184

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 25 / 60

Page 26: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding matrices

> dat <- matrix(c(30, 174, 21, 184), 2, 2)

> rownames(dat) <- c("Deaths", "Survivors")

> colnames(dat) <- c("Tolbutamide", "Placebo")

> coltot <- apply(dat, 2, sum) #column totals

> risks <- dat["Deaths",]/coltot

> risk.ratio <- risks/risks[2] #risk ratio

> odds <- risks/(1-risks)

> odds.ratio <- odds/odds[2] #odds ratio

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 26 / 60

Page 27: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding matrices

> # display results

> dat

Tolbutamide Placebo

Deaths 30 21

Survivors 174 184

> rbind(risks, risk.ratio, odds, odds.ratio)

Tolbutamide Placebo

risks 0.1470588 0.1024390

risk.ratio 1.4355742 1.0000000

odds 0.1724138 0.1141304

odds.ratio 1.5106732 1.0000000

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 27 / 60

Page 28: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding arrays

An array is a collection of like elements organized into a n-dimensionaldata object. When R returns an array the [n,,] indicates the nth rowand [,m,] indicates the mth column, and so on.

> x <- 1:8

> y <- array(x, dim=c(2, 2, 2))

> y

, , 1

[,1] [,2]

[1,] 1 3

[2,] 2 4

, , 2

[,1] [,2]

[1,] 5 7

[2,] 6 8

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 28 / 60

Page 29: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding arrays

While a matrix is a 2-dimensional table of like elements, an array is thegeneralization of matrices to n-dimensions. Stratified contingency tables inepidemiology are represented as array data objects in R. For example, theRCT previously shown comparing the number deaths among diabeticsubjects that received tolbutamide vs. placebo is now also stratified by agegroup:

Table : Deaths among subjects who received tolbutamide and placebo in theUnversity Group Diabetes Program (1970), stratifying by age

Age<55 Age≥55 Combined

Tolb Plac Tolb Plac Tolb Plac

Deaths 8 5 22 16 30 21

Survivors 98 115 76 69 174 184

Total 106 120 98 85 204 205

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 29 / 60

Page 30: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding arrays

> tdat <- c(8, 98, 5, 115, 22, 76, 16, 69)

> tdat <- array(tdat, c(2, 2, 2))

> dimnames(tdat) <- list(Outcome=c("Deaths", "Survivors"),

+ Treatment=c("Tolbutamide", "Placebo"),

+ "Age group"=c("Age<55", "Age>=55"))

> tdat

, , Age group = Age<55

Treatment

Outcome Tolbutamide Placebo

Deaths 8 5

Survivors 98 115

, , Age group = Age>=55

Treatment

Outcome Tolbutamide Placebo

Deaths 22 16

Survivors 76 69

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 30 / 60

Page 31: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Table : Example of 4-dimensional array: Year 2000 population estimates by age,ethnicity, sex, and county

Ethnicity

County/Sex Age White AfrAmer AsianPI Latino Multirace AmerInd

Alameda

Female <=19 58,160 31,765 40,653 49,738 10,120 839

20–44 112,326 44,437 72,923 58,553 7,658 1,401

45–64 82,205 24,948 33,236 18,534 2,922 822

65+ 49,762 12,834 16,004 7,548 1,014 246

Male <=19 61,446 32,277 42,922 53,097 10,102 828

20–44 115,745 36,976 69,053 69,233 6,795 1,263

45–64 81,332 20,737 29,841 17,402 2,506 687

65+ 33,994 8,087 11,855 5,416 711 156

San Francisco

Female <=19 14,355 6,986 23,265 13,251 2,940 173

20–44 85,766 10,284 52,479 23,458 3,656 526

45–64 35,617 6,890 31,478 9,184 1,144 282

65+ 27,215 5,172 23,044 5,773 554 121

Male <=19 14,881 6,959 24,541 14,480 2,851 165

20–44 105,798 11,111 48,379 31,605 3,766 782

45–64 43,694 7,352 26,404 8,674 1,220 354

65+ 20,072 3,329 17,190 3,428 450 76

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 31 / 60

Page 32: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding arrays

Figure : Schematic representation of a 4-dimensional array: Year 2000 populationestimates by age (1), race (2), sex (3), and county (4)

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 32 / 60

Page 33: Understanding R for Epidemiologists

Working with R data objects Working with vectors, matrices, & arrays

Understanding arrays

Figure : Schematic of a theoretical 5-D array (e.g., data by age (1), race (2), sex(3), party affiliation (4), and state (5)). We can see that the field “state” has 3levels, and the field “party affiliation” has 2 levels; however, it is not apparent thenumber of age, race, and sex levels. Although not displayed, age levels would berepresented by row names (along 1st dimension), race levels by column names(along 2nd dimension), and sex levels by depth names (along 3rd dimension).

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 33 / 60

Page 34: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding lists

Up to now, we have been working with atomic data objects (vector, matrix,array). In contrast, lists, data frames, and functions are recursive dataobjects. Recursive data objects have more flexibility in combining diversedata objects into one object. A list provides the most flexibility. Think of alist object as a collection of “bins” that can contain any R object. Listsare very useful for collecting results of an analysis or a function into onedata object where all its contents are readily accessible by indexing.

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 34 / 60

Page 35: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding lists

A list is a collection of data objects without any restrictions:

> x <- c(11, 22, 34)

> y <- c("Male", "Female", "Male")

> z <- matrix(c(67, 34, 56,22), 2, 2)

> mylist <- list(x, y, z)

> mylist

[[1]]

[1] 11 22 34

[[2]]

[1] "Male" "Female" "Male"

[[3]]

[,1] [,2]

[1,] 67 56

[2,] 34 22

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 35 / 60

Page 36: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding lists

Names can be assigned to each bin of a list.

> names(mylist) <- c("Age", "Sex", "Data")

> mylist

$Age

[1] 11 22 34

$Sex

[1] "Male" "Female" "Male"

$Data

[,1] [,2]

[1,] 67 56

[2,] 34 22

> mylist$Sex

[1] "Male" "Female" "Male"

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 36 / 60

Page 37: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding lists

Figure : Schematic representation of a list of length four. The first bin [1]

contains a smiling face [[1]], the second bin [2] contains a flower [[2]], thethird bin [3] contains a lightning bolt [[3]], and the fourth bin [[4]] containsa heart [[4]]. When indexing a list object, single brackets [·] indexes the bin,and double brackets [[·]] indexes the bin contents. If the bin has a name, then$name also indexes the contents.

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 37 / 60

Page 38: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding lists

For example, using the UGDP clinical trial data, suppose we performFisher’s exact test for testing the null hypothesis of independence of rowsand columns in a contingency table with fixed marginals.

> udat <- read.csv("http://www.medepi.net/data/ugdp.txt")

> tab <- xtabs(~ Status + Treatment, data = udat)[,2:1]

> tab

Treatment

Status Tolbutamide Placebo

Death 30 21

Survivor 174 184

> ftab <- fisher.test(tab)

> ftab

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 38 / 60

Page 39: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding lists

> ftab

Fisher’s Exact Test for Count Data

data: tab

p-value = 0.1813

alternative hypothesis: true odds ratio is not equal to 1

95 percent confidence interval:

0.8013768 2.8872863

sample estimates:

odds ratio

1.509142

The default display only shows partial results. The total results are storedin the object ftab. Let’s evaluate the structure of ftab and extract someresults:

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 39 / 60

Page 40: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding lists

> str(ftab)

List of 7

$ p.value : num 0.181

$ conf.int : atomic [1:2] 0.801 2.887

..- attr(*, "conf.level")= num 0.95

$ estimate : Named num 1.51

..- attr(*, "names")= chr "odds ratio"

$ null.value : Named num 1

..- attr(*, "names")= chr "odds ratio"

$ alternative: chr "two.sided"

$ method : chr "Fisher’s Exact Test for Count Data"

$ data.name : chr "tab"

- attr(*, "class")= chr "htest"

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 40 / 60

Page 41: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding lists

Let’s index some of the bins from ftab.

> ftab$estimate

odds ratio

1.5091

> ftab$conf.int

[1] 0.80138 2.88729

> ftab$conf.int[2]

[1] 2.887286

attr(,"conf.level")

[1] 0.95

> ftab$p.value

[1] 0.18126

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 41 / 60

Page 42: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding data frames

A data frame is a list with a 2-dimensional (tabular) structure.Epidemiologists are very experienced working with data frames where eachrow usually represents data collected on individual subjects (also calledrecords or observations) and columns represent fields for each type of datacollected (also called variables).

> subjno <- c(1, 2, 3, 4)

> age <- c(34, 56, 45, 23)

> sex <- c("Male", "Male", "Female", "Male")

> case <- c("Yes", "No", "No", "Yes")

> mydat <- data.frame(subjno, age, sex, case)

> mydat

subjno age sex case

1 1 34 Male Yes

2 2 56 Male No

3 3 45 Female No

4 4 23 Male Yes

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 42 / 60

Page 43: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding data frames

Epidemiologists are familiar with tabular data sets where each row is arecord and each column is a field. A record can be data collected onindividuals or groups. We usually refer to the field name as a variable(e.g., age, gender, ethnicity). Fields can contain numeric or characterdata. In R, these types of data sets are handled by data frames. Eachcolumn of a data frame is usually either a factor or numeric vector,although it can have complex, character, or logical vectors. Data frameshave the functionality of matrices and lists. For example, here is the first10 rows of the infert data set, a matched case-control study published in1976 that evaluated whether infertility was associated with priorspontaneous or induced abortions.

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 43 / 60

Page 44: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding data frames

> data(infert)

> str(infert)

‘data.frame’: 248 obs. of 8 variables:

$ education : Factor w/ 3 levels "0-5yrs",..: 1 1 ...

$ age : num NA 45 NA 23 35 36 23 32 21 28 ...

$ parity : num 6 1 6 4 3 4 1 2 1 2 ...

$ induced : num 1 1 2 2 1 2 0 0 0 0 ...

$ case : num 1 1 1 1 1 1 1 1 1 1 ...

$ spontaneous : num 2 0 0 0 1 1 0 0 1 0 ...

$ stratum : int 1 2 3 4 5 6 7 8 9 10 ...

$ pooled.stratum: num 3 1 4 2 32 36 6 22 5 19 ...

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 44 / 60

Page 45: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding data frames

> infert[1:10, 1:6]

education age parity induced case spontaneous

1 0-5yrs NA 6 1 1 2

2 0-5yrs 45 1 1 1 0

3 0-5yrs NA 6 2 1 0

4 0-5yrs 23 4 2 1 0

5 6-11yrs 35 3 1 1 1

6 6-11yrs 36 4 2 1 1

7 6-11yrs 23 1 0 1 0

8 6-11yrs 32 2 0 1 0

9 6-11yrs 21 1 0 1 1

10 6-11yrs 28 2 0 1 0

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 45 / 60

Page 46: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding data frames

The fields are obviously vectors. Let’s explore a few of these vectors to seewhat we can learn about their structure in R.

> #age variable

> infert$age

[1] 26 42 39 34 35 36 23 32 21 28 29 37 31 29 31 27 30 26

...

[235] 25 32 25 31 38 26 31 31 25 31 34 35 29 23

> mode(infert$age)

[1] "numeric"

> class(infert$age)

[1] "numeric"

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 46 / 60

Page 47: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding data frames

> # education variable

> infert$education

[1] 0-5yrs 0-5yrs 0-5yrs 0-5yrs 6-11yrs 6-11yrs

...

[247] 12+ yrs 12+ yrs

Levels: 0-5yrs 6-11yrs 12+ yrs

> mode(infert$education)

[1] "numeric"

> class(infert$education)

[1] "factor"

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 47 / 60

Page 48: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding data frames and factors

A factor is R’s representation of categorical fields and keeps track of allpossible category levels.

> sex <- sample(c("Male", "Female"), 100, replace = TRUE)

> mode(sex); class(sex)

[1] "character"

[1] "character"

> table(sex)

sex

Female Male

51 49

> sexf <- factor(sex, levels = c("Male", "Female", "Transgender"))

> table(sexf)

sexf

Male Female Transgender

49 51 0

> mode(sexf); class(sexf)

[1] "numeric"

[1] "factor"

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 48 / 60

Page 49: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding data frames and lists

Infert data is a matched case-control study evaluating the association ofhistory of abortions and infertility. Use conditional logistic regression.

> mod3 <- clogit(case ~ spontaneous + induced +

+ strata(stratum), data = infert)

> mod3

Call:

clogit(case ~ spontaneous + induced + strata(stratum), data =

coef exp(coef) se(coef) z p

spontaneous 1.99 7.29 0.352 5.63 1.8e-08

induced 1.41 4.09 0.361 3.91 9.4e-05

> summod3 <- summary(mod3)

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 49 / 60

Page 50: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding data frames and lists

> summod3

n= 248

coef exp(coef) se(coef) z Pr(>|z|)

spontaneous 1.9859 7.2854 0.3524 5.635 1.75e-08 ***

induced 1.4090 4.0919 0.3607 3.906 9.38e-05 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

exp(coef) exp(-coef) lower .95 upper .95

spontaneous 7.285 0.1373 3.651 14.536

induced 4.092 0.2444 2.018 8.298

Rsquare= 0.193 (max possible= 0.519 )

Likelihood ratio test= 53.15 on 2 df, p=2.869e-12

Wald test = 31.84 on 2 df, p=1.221e-07

Score (logrank) test = 48.44 on 2 df, p=3.032e-11

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 50 / 60

Page 51: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding data frames and lists

> str(summod3)

List of 12

$ call : language coxph(formula = Surv(rep(1, 248L), case) ~ spontaneous

$ fail : NULL

$ na.action : NULL

$ n : int 248

$ loglik : num [1:2] -90.8 -64.2

$ coefficients: num [1:2, 1:5] 1.986 1.409 7.285 4.092 0.352 ...

..- attr(*, "dimnames")=List of 2

.. ..$ : chr [1:2] "spontaneous" "induced"

.. ..$ : chr [1:5] "coef" "exp(coef)" "se(coef)" "z" ...

$ conf.int : num [1:2, 1:4] 7.285 4.092 0.137 0.244 3.651 ...

..- attr(*, "dimnames")=List of 2

.. ..$ : chr [1:2] "spontaneous" "induced"

.. ..$ : chr [1:4] "exp(coef)" "exp(-coef)" "lower .95" "upper .95"

$ logtest : Named num [1:3] 5.32e+01 2.00 2.87e-12

... [output truncated]

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 51 / 60

Page 52: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding data frame and lists

> summod3$coef

coef exp(coef) se(coef) z Pr(>|z|)

spontaneous 1.985876 7.285423 0.3524435 5.634592 1.754734e-08

induced 1.409012 4.091909 0.3607124 3.906191 9.376245e-05

> summod3$coef[1, ]

coef exp(coef) se(coef) z Pr(>|z|)

1.985876e+00 7.285423e+00 3.524435e-01 5.634592e+00 1.754734e-08

> summod3$coef[ ,2]

spontaneous induced

7.285423 4.091909

> summod3$coef[1,2]

[1] 7.285423

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 52 / 60

Page 53: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding functions

Risk Ratio confidence interval from baby Rothman, p. 135

rr.wald <- function(x, conf.level = 0.95){

## prepare input

x1 <- x[1,1]; n1 <- sum(x[1,])

x0 <- x[2,1]; n0 <- sum(x[2,])

## do calculations

p1 <- x1/n1 ##risk among exposed

p0 <- x0/n0 ##risk among unexposed

RR <- p1/p0;

logRR <- log(RR)

SElogRR <- sqrt(1/x1 - 1/n1 + 1/x0 - 1/n0)

Z <- qnorm(0.5*(1 + conf.level))

LCL <- exp(logRR - Z*SElogRR)

UCL <- exp(logRR + Z*SElogRR)

##collect output

list(x = x, risks = c(p1 = p1, p0 = p0), risk.ratio = RR,

conf.int = c(LCL, UCL), conf.level = conf.level)

}

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 53 / 60

Page 54: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Understanding functions

Run rr.wald function on UGDP RCT data (results displayed in 2columns).

> tab

Treatment

Status Tolbutamide Placebo

Death 30 21

Survivor 174 184

> rr.wald(tab)

$x

Treatment

Status Tolbutamide Placebo

Death 30 21

Survivor 174 184

$risks

p1 p0

0.5882353 0.4860335

$risk.ratio

[1] 1.210277

$conf.int

[1] 0.9396227 1.5588927

$conf.level

[1] 0.95

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 54 / 60

Page 55: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

The epitools package

The following epidemiologists, directly or indirectly, contributed to’epitools’:

Tomas Aragon, MD, DrPH, , UC Berkeley

Michael P. Fay, PhD, Mathematical Statistician National Institute ofAllergy and Infectious Diseases

Wayne Enanoria, PhD, MPH, UC Berkeley

Travis Porco, PhD, MPH, UC San Francisco

Michael Samuel, DrPH, California Department of Public Health

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 55 / 60

Page 56: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Using epitools for outbreak investigations

Using the epitab function (only arguments are displayed);

epitab(x, y = NULL,

method = c("oddsratio", "riskratio", "rateratio"),

conf.level = 0.95,

rev = c("neither", "rows", "columns", "both"),

oddsratio = c("wald", "fisher", "midp", "small"),

riskratio = c("wald", "boot", "small"),

rateratio = c("wald", "midp"),

pvalue = c("fisher.exact", "midp.exact", "chi2"),

correction = FALSE,

verbose = FALSE)

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 56 / 60

Page 57: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Hypothesis testing using Oswego: Passing 2 vectors

> library(epitools) #load ’epitools’ package

> data(oswego) #load Oswego dataset

> attach(oswego) #attach dataset

> round(epitab(jello, ill, method = "riskratio")$tab, 2)

Outcome

Predictor N p0 Y p1 riskratio lower upper p.value

N 22 0.42 30 0.58 1.00 NA NA NA

Y 7 0.30 16 0.70 1.21 0.84 1.72 0.44

> round(epitab(jello, ill, method = "oddsratio")$tab, 2)

Outcome

Predictor N p0 Y p1 oddsratio lower upper p.value

N 22 0.76 30 0.65 1.00 NA NA NA

Y 7 0.24 16 0.35 1.68 0.59 4.76 0.44

> detach(oswego) #detach dataset

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 57 / 60

Page 58: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Hypothesis testing using Oswego: Passing a table

> jello.tab1

ill

jello N Y

N 22 30

Y 7 16

> round(epitab(jello.tab1)$tab, 2)

ill

jello N p0 Y p1 oddsratio lower upper p.value

N 22 0.76 30 0.65 1.00 NA NA NA

Y 7 0.24 16 0.35 1.68 0.59 4.76 0.44

> round(epitab(jello.tab1, method = "risk")$tab, 2)

ill

jello N p0 Y p1 riskratio lower upper p.value

N 22 0.42 30 0.58 1.00 NA NA NA

Y 7 0.30 16 0.70 1.21 0.84 1.72 0.44

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 58 / 60

Page 59: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Hypothesis testing using Oswego: Passing one vector

> round(epitab(c(22, 30, 7, 16))$tab, 2)

Outcome

Predictor Disease1 p0 Disease2 p1 oddsratio lower upper p.value

Exposed1 22 0.76 30 0.65 1.00 NA NA NA

Exposed2 7 0.24 16 0.35 1.68 0.59 4.76 0.44

> round(epitab(c(22, 30, 7, 16), method = "risk")$tab, 2)

Outcome

Predictor Disease1 p0 Disease2 p1 riskratio lower upper p.value

Exposed1 22 0.42 30 0.58 1.00 NA NA NA

Exposed2 7 0.30 16 0.70 1.21 0.84 1.72 0.44

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 59 / 60

Page 60: Understanding R for Epidemiologists

Working with R data objects Working with lists, data frames, and functions

Summary

1 BackgroundCostQualityCommunity

2 Getting started with RFull-function calculator/spreadsheetExtensible statistical packagesHigh quality graphics toolMulti-use programming language

3 Working with R data objectsAtomic vs. recursive data objectsWorking with vectors, matrices, & arraysWorking with lists, data frames, and functions

Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 60 / 60