Understanding R for Epidemiologists
Tomas J. Aragon, MD, DrPH
Faculty, Division of Epidemiology, UC Berkeley School of Public Health
Health Officer, City & County of San Francisco
Director, Population Health Division (PHD), San Francisco Department of Public Health
Blog: http://www.medepi.com
Email: [email protected]
September 8, 2014
Tomas Aragon, MD, DrPH (medepi.com) Understanding R for Epidemiologists September 8, 2014 1 / 60
Outline
1 Background
  Cost
  Quality
  Community
2 Getting started with R
  Full-function calculator/spreadsheet
  Extensible statistical packages
  High quality graphics tool
  Multi-use programming language
3 Working with R data objects
  Atomic vs. recursive data objects
  Working with vectors, matrices, & arrays
  Working with lists, data frames, and functions
Background
Background: Major issues
Cost
Quality
Community
Functionality
Background Cost
Cost: Open Source vs. Proprietary Software
Costs of software
Costs of multi-platforms
Costs of education and training
Costs of adding solutions (e.g., packages)
Costs of solving problems and sharing solutions
Background Quality
Quality: Open Source vs. Proprietary Software
Core Development Team
Large pool of users/testers
Quality control process for packages
Bug fixes based on need/demand, not profits
Background Community
Community: Open Source vs. Proprietary Software
Large community of users
Transparent development process
Growing number of books and trainings
Growing number of free tutorials and manuals
Current R contributors
Douglas Bates, John Chambers, Peter Dalgaard, Seth Falcon, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Uwe Ligges,
Thomas Lumley, Martin Maechler, Duncan Murdoch, Paul Murrell, Martyn Plummer, Brian Ripley, Deepayan Sarkar, Duncan Temple Lang, Luke Tierney, Simon Urbanek
Source: http://www.r-project.org/contributors.html
Getting started with R
What is R?
Full-function calculator/spreadsheet
Extensible statistical packages
High-quality graphics tool
Multi-use programming language
Getting started with R Full-function calculator/spreadsheet
Full-function calculator: Selected math operators
Operator  Description                          Try these examples
+         addition                             5+4
-         subtraction                          5-4
*         multiplication                       5*4
/         division                             5/4
^         exponentiation                       5^4
-         unary minus (change current sign)    -5
abs       absolute value                       abs(-23)
exp       exponentiation (e to a power)        exp(8)
log       logarithm (default is natural log)   log(exp(8))
sqrt      square root                          sqrt(64)
%/%       integer divide                       10%/%3
%%        modulus                              10%%3
%*%       matrix multiplication                xx <- matrix(1:4, 2, 2)
                                               xx%*%c(1, 1)
                                               c(1, 1)%*%xx
Getting started with R Extensible statistical packages
Extensible statistical packages
Generalized Linear Models (base R)
  Linear regression
  Logistic regression
  Poisson regression
Cox Proportional Hazard models (survival)
  Cox PH regression
  Conditional logistic regression (matched case-control studies)
Meta-analysis (meta)
Complex survey analysis (survey)
Epidemiology packages
  epitools
  epicalc
  epibasix
  epiR
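As a minimal sketch of the base R modeling interface listed above, the following fits an ordinary (unconditional) logistic regression with glm() on the infert data set that ships with R. The choice of predictors is only an illustration, and this model ignores the matching in the study design; the matched analysis appears later in the deck.

```r
# Logistic regression with glm(), using the infert data that ship with R.
# Illustration only: this unconditional model ignores the matched design.
data(infert)
fit <- glm(case ~ spontaneous + induced, family = binomial, data = infert)

# Odds ratios are the exponentiated coefficients.
or <- exp(coef(fit))
or

# Poisson regression uses the same interface with family = poisson;
# linear regression uses lm() or glm() with family = gaussian.
```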
Getting started with R High quality graphics tool
Graphics display of sample size curves
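[Figure: null distribution (H0, centered at µ0) and alternative distribution (H1, centered at µ1), with critical values −Z(1−α/2) and Z(1−α/2) and regions labeled α/2, β, and Power (1 − β).]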
Graphics display of P value function
[Figure: P value function. x-axis: Rate Ratio (ticks from 0.2 to 20.0); y-axes: P-value (0 to 1) and Confidence level (%) (100 to 0). Annotations: Null hypothesis; Median unbiased estimate; 95% Lower Confidence Limit = 0.74; 95% Upper Confidence Limit = 21.0; 95% Confidence Interval.]
Graphical display of multiple linear regression
[Figure: three-dimensional plot of y against x1 and x2.]
Epidemic curve using Color Brewer colors
[Figure: West Nile Virus Human Cases Reported in California by Disease Week as of December 14, 2004. x-axis: Disease Week & Calendar Month, 2004; y-axis: Cases; legend: Unknown, WNF, WNND; annotations: + Bird 2/24, + Mosquito 4/14, + Chicken 5/17, + Horse 6/20.]
Getting started with R Multi-use programming language
Multi-use programming language
Vectorized computations
Functional programming language
Object-oriented programming
Text processing (e.g., using regular expressions)
Links to C, Fortran, etc.
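A few of the features listed above can be sketched in a handful of lines. The counts, populations, and ID strings below are made-up values used only for illustration.

```r
# Vectorized computation: one expression operates on every element.
cases <- c(12, 30, 45)               # hypothetical case counts
population <- c(1000, 2500, 3000)    # hypothetical populations
rate.per.1000 <- 1000 * cases / population

# Functional programming: functions are objects that can be passed around.
square <- function(v) v^2
squares <- sapply(1:3, square)

# Text processing with regular expressions.
ids <- c("SF-001", "ALA-002", "SF-003")   # hypothetical record IDs
is.sf <- grepl("^SF-", ids)               # which IDs start with "SF-"
numbers <- sub("^[A-Z]+-", "", ids)       # strip the alphabetic prefix
```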
Working with R data objects Atomic vs. recursive data objects
Data objects in R
Object types
Vector
Matrix
Array
List
Data frame
Function
Operations
Create
Name
Index
Replace
Manipulate
Do computations
Summary of types of data objects in R
Data object   Possible mode (a)              Default class
Atomic
  vector      character, numeric, logical    NULL
  matrix      character, numeric, logical    NULL
  array       character, numeric, logical    NULL
Recursive
  list        list                           NULL
  data frame  list                           data frame
  function    function                       NULL
(a) We are ignoring complex numbers
Working with R data objects Working with vectors, matrices, & arrays
Understanding vectors
A vector is a collection of like elements without dimensions (1). The vector elements are all of the same mode (either character, numeric, or logical).
> y <- c("Pedro", "Paulo", "Maria")
> y
[1] "Pedro" "Paulo" "Maria"
> x <- c(1, 2, 3, 4, 5)
> x
[1] 1 2 3 4 5
> x < 3
[1] TRUE TRUE FALSE FALSE FALSE
(1) In other programming languages, vectors are either row vectors or column vectors. R does not make this distinction until it is necessary.
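The "same mode" and "without dimensions" properties can be checked directly. This is a small sketch using made-up values:

```r
# Mixing modes in c() coerces every element to a single common mode.
v <- c(1, "a", TRUE)
mode(v)    # "character"

# A plain vector has no dim attribute; it is neither a row nor a column.
x <- c(1, 2, 3, 4, 5)
dim(x)     # NULL
length(x)  # 5
```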
Understanding vectors: Indexing
Indexing by   Try these examples
Position      x <- c(chol=234, sbp=148, dbp=78, age=54)
              x[2]             #positions to include
              x[c(2, 3)]
              x[-c(1, 3, 4)]   #positions to exclude
              x[-c(1, 4)]
Name          x["sbp"]
              x[c("sbp", "dbp")]
Logical       x < 100
              x[x < 100]
              (x < 150) & (x > 70)
              bp <- (x < 150) & (x > 70)
              x[bp]
Understanding vectors: Replacement
Replacing by  Try these examples
Position      x <- c(chol=234, sbp=148, dbp=78, age=54)
              x[1]
              x[1] <- 250
              x
Name          x["sbp"]
              x["sbp"] <- 150
              x
Logical       x[x<100]
              x[x<100] <- NA
              x
Understanding vectors: Replacement
> x <- c(chol = 234, sbp = 148, dbp = 78, age = 54)
> x[1] <- 250 #by position
> x
chol sbp dbp age
250 148 78 54
> x["sbp"] <- 150 #by name
> x
chol sbp dbp age
250 150 78 54
> x[x<100]
dbp age
78 54
> x[x<100] <- NA #by logical
> x
chol sbp dbp age
250 150 NA NA
Understanding matrices
A matrix is a collection of like elements organized into a 2-dimensional (tabular) data object. Matrix elements can be numeric, character, or logical. We can think of a matrix as a vector with a 2-dimensional structure. Contingency tables in epidemiology are represented in R as numeric matrices or arrays. An array is the generalization of a matrix to 3 or more dimensions (stratified tables are a common example). We cover arrays later; for now we focus on 2-dimensional tables.
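The idea that a matrix is a vector with a 2-dimensional structure can be sketched by attaching a dim attribute to a plain vector. The counts are the UGDP trial counts used elsewhere in this deck.

```r
# A matrix is a vector with a dim attribute.
v <- c(30, 174, 21, 184)
dim(v) <- c(2, 2)                        # elements fill column by column

# The same object built with matrix().
m <- matrix(c(30, 174, 21, 184), 2, 2)

identical(v, m)
is.matrix(v)
```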
Understanding matrices
When R returns a matrix, [n,] indicates the nth row and [,m] indicates the mth column.
> x <- c("a", "b", "c", "d")
> y <- matrix(x, 2, 2)
> y
[,1] [,2]
[1,] "a" "c"
[2,] "b" "d"
> y[1,]
[1] "a" "c"
> y[,2]
[1] "c" "d"
Understanding matrices
> x <- c(30, 21, 170, 180) # creating
> y <- matrix(x, 2, 2, byrow = TRUE) # creating
> y
[,1] [,2]
[1,] 30 21
[2,] 170 180
> rownames(y) <- c("Deaths", "Survivors") # naming
> colnames(y) <- c("Tolbutamide", "Placebo") # naming
> y[2, 1] <- 174 # replace by position
> y["Survivors", "Placebo"] <- 184 # replace by name
> y
Tolbutamide Placebo
Deaths 30 21
Survivors 174 184
Understanding matrices
Consider the 2 × 2 table of crude data in the table below. In this randomized clinical trial (RCT), diabetic subjects were randomly assigned to receive either tolbutamide, an oral hypoglycemic drug, or placebo. Because this was a prospective study, we can calculate risks, odds, a risk ratio, and an odds ratio. We will do this using R as a calculator.
Table: Deaths among subjects who received tolbutamide and placebo in the University Group Diabetes Program (1970)
Tolbutamide Placebo
Deaths 30 21
Survivors 174 184
Understanding matrices
> dat <- matrix(c(30, 174, 21, 184), 2, 2)
> rownames(dat) <- c("Deaths", "Survivors")
> colnames(dat) <- c("Tolbutamide", "Placebo")
> coltot <- apply(dat, 2, sum) #column totals
> risks <- dat["Deaths",]/coltot
> risk.ratio <- risks/risks[2] #risk ratio
> odds <- risks/(1-risks)
> odds.ratio <- odds/odds[2] #odds ratio
Understanding matrices
> # display results
> dat
Tolbutamide Placebo
Deaths 30 21
Survivors 174 184
> rbind(risks, risk.ratio, odds, odds.ratio)
Tolbutamide Placebo
risks 0.1470588 0.1024390
risk.ratio 1.4355742 1.0000000
odds 0.1724138 0.1141304
odds.ratio 1.5106732 1.0000000
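The same quantities can also be obtained with prop.table(), which computes column proportions in one step. This is an alternative sketch, not the approach shown on the slides; indexing the reference group by name ("Placebo") rather than by position is a choice made here for readability.

```r
dat <- matrix(c(30, 174, 21, 184), 2, 2,
              dimnames = list(c("Deaths", "Survivors"),
                              c("Tolbutamide", "Placebo")))

# Column proportions: the "Deaths" row holds the risk in each arm.
risks <- prop.table(dat, margin = 2)["Deaths", ]

# Ratio of each arm to the placebo (reference) arm.
risk.ratio <- risks / risks["Placebo"]
odds <- risks / (1 - risks)
odds.ratio <- odds / odds["Placebo"]

round(rbind(risks, risk.ratio, odds, odds.ratio), 3)
```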
Understanding arrays
An array is a collection of like elements organized into an n-dimensional data object. When R returns an array, [n,,] indicates the nth row, [,m,] indicates the mth column, and so on.
> x <- 1:8
> y <- array(x, dim=c(2, 2, 2))
> y
, , 1
[,1] [,2]
[1,] 1 3
[2,] 2 4
, , 2
[,1] [,2]
[1,] 5 7
[2,] 6 8
Understanding arrays
While a matrix is a 2-dimensional table of like elements, an array is the generalization of matrices to n dimensions. Stratified contingency tables in epidemiology are represented as array data objects in R. For example, the RCT previously shown comparing the number of deaths among diabetic subjects who received tolbutamide vs. placebo is now also stratified by age group:
Table: Deaths among subjects who received tolbutamide and placebo in the University Group Diabetes Program (1970), stratifying by age
Age<55 Age≥55 Combined
Tolb Plac Tolb Plac Tolb Plac
Deaths 8 5 22 16 30 21
Survivors 98 115 76 69 174 184
Total 106 120 98 85 204 205
Understanding arrays
> tdat <- c(8, 98, 5, 115, 22, 76, 16, 69)
> tdat <- array(tdat, c(2, 2, 2))
> dimnames(tdat) <- list(Outcome=c("Deaths", "Survivors"),
+ Treatment=c("Tolbutamide", "Placebo"),
+ "Age group"=c("Age<55", "Age>=55"))
> tdat
, , Age group = Age<55
Treatment
Outcome Tolbutamide Placebo
Deaths 8 5
Survivors 98 115
, , Age group = Age>=55
Treatment
Outcome Tolbutamide Placebo
Deaths 22 16
Survivors 76 69
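Once the stratified table is an array, strata can be indexed by name and margins collapsed with apply(). The following is a sketch of that idea using the same counts; the object names young, combined, and risks are introduced here for illustration.

```r
tdat <- array(c(8, 98, 5, 115, 22, 76, 16, 69), c(2, 2, 2),
              dimnames = list(Outcome = c("Deaths", "Survivors"),
                              Treatment = c("Tolbutamide", "Placebo"),
                              "Age group" = c("Age<55", "Age>=55")))

# Index one stratum (third dimension) by name.
young <- tdat[, , "Age<55"]

# Sum over the age dimension to recover the combined (crude) 2 x 2 table.
combined <- apply(tdat, c(1, 2), sum)

# Stratum-specific risk of death: deaths divided by treatment-by-age totals.
risks <- tdat["Deaths", , ] / apply(tdat, c(2, 3), sum)
round(risks, 3)
```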
Table: Example of 4-dimensional array: Year 2000 population estimates by age, ethnicity, sex, and county
Ethnicity
County/Sex Age White AfrAmer AsianPI Latino Multirace AmerInd
Alameda
Female <=19 58,160 31,765 40,653 49,738 10,120 839
20–44 112,326 44,437 72,923 58,553 7,658 1,401
45–64 82,205 24,948 33,236 18,534 2,922 822
65+ 49,762 12,834 16,004 7,548 1,014 246
Male <=19 61,446 32,277 42,922 53,097 10,102 828
20–44 115,745 36,976 69,053 69,233 6,795 1,263
45–64 81,332 20,737 29,841 17,402 2,506 687
65+ 33,994 8,087 11,855 5,416 711 156
San Francisco
Female <=19 14,355 6,986 23,265 13,251 2,940 173
20–44 85,766 10,284 52,479 23,458 3,656 526
45–64 35,617 6,890 31,478 9,184 1,144 282
65+ 27,215 5,172 23,044 5,773 554 121
Male <=19 14,881 6,959 24,541 14,480 2,851 165
20–44 105,798 11,111 48,379 31,605 3,766 782
45–64 43,694 7,352 26,404 8,674 1,220 354
65+ 20,072 3,329 17,190 3,428 450 76
Understanding arrays
Figure: Schematic representation of a 4-dimensional array: Year 2000 population estimates by age (1), race (2), sex (3), and county (4)
Understanding arrays
Figure: Schematic of a theoretical 5-D array (e.g., data by age (1), race (2), sex (3), party affiliation (4), and state (5)). We can see that the field “state” has 3 levels and the field “party affiliation” has 2 levels; however, the numbers of age, race, and sex levels are not apparent. Although not displayed, age levels would be represented by row names (along the 1st dimension), race levels by column names (along the 2nd dimension), and sex levels by depth names (along the 3rd dimension).
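A 5-dimensional array of this shape can be sketched as follows. All level names and counts are placeholders invented for illustration (the figure does not give them); only the 2 party levels and 3 state levels come from the caption.

```r
# Hypothetical level names; the counts are placeholders, not real data.
dn <- list(age = c("<45", "45+"),
           race = c("A", "B"),
           sex = c("Female", "Male"),
           party = c("P1", "P2"),
           state = c("S1", "S2", "S3"))
a5 <- array(1:48, dim = c(2, 2, 2, 2, 3), dimnames = dn)
dim(a5)

# One 2 x 2 (age by race) table for a given sex, party, and state.
a5[, , "Female", "P1", "S2"]

# Collapse over age, race, and sex, keeping party by state.
party.state <- apply(a5, c(4, 5), sum)
party.state
```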
Working with R data objects Working with lists, data frames, and functions
Understanding lists
Up to now, we have been working with atomic data objects (vector, matrix, array). In contrast, lists, data frames, and functions are recursive data objects. Recursive data objects have more flexibility in combining diverse data objects into one object. A list provides the most flexibility. Think of a list object as a collection of “bins” that can contain any R object. Lists are very useful for collecting results of an analysis or a function into one data object where all its contents are readily accessible by indexing.
Understanding lists
A list is a collection of data objects without any restrictions:
> x <- c(11, 22, 34)
> y <- c("Male", "Female", "Male")
> z <- matrix(c(67, 34, 56,22), 2, 2)
> mylist <- list(x, y, z)
> mylist
[[1]]
[1] 11 22 34
[[2]]
[1] "Male" "Female" "Male"
[[3]]
[,1] [,2]
[1,] 67 56
[2,] 34 22
Understanding lists
Names can be assigned to each bin of a list.
> names(mylist) <- c("Age", "Sex", "Data")
> mylist
$Age
[1] 11 22 34
$Sex
[1] "Male" "Female" "Male"
$Data
[,1] [,2]
[1,] 67 56
[2,] 34 22
> mylist$Sex
[1] "Male" "Female" "Male"
Understanding lists
Figure: Schematic representation of a list of length four. The first bin [1] contains a smiling face [[1]], the second bin [2] contains a flower [[2]], the third bin [3] contains a lightning bolt [[3]], and the fourth bin [4] contains a heart [[4]]. When indexing a list object, single brackets [·] index the bin, and double brackets [[·]] index the bin contents. If the bin has a name, then $name also indexes the contents.
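The difference between indexing a bin and indexing its contents can be seen by checking the class of each result. This is a small sketch using a two-bin version of the list from the earlier slides:

```r
mylist <- list(Age = c(11, 22, 34),
               Sex = c("Male", "Female", "Male"))

bin <- mylist[1]          # single brackets: still a list, of length 1
contents <- mylist[[1]]   # double brackets: the numeric vector itself

class(bin)        # "list"
class(contents)   # "numeric"

# For a named bin, $name and [["name"]] return the same contents.
identical(mylist$Age, mylist[["Age"]])
```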
Understanding lists
For example, using the UGDP clinical trial data, suppose we perform Fisher’s exact test for testing the null hypothesis of independence of rows and columns in a contingency table with fixed marginals.
> udat <- read.csv("http://www.medepi.net/data/ugdp.txt")
> tab <- xtabs(~ Status + Treatment, data = udat)[,2:1]
> tab
Treatment
Status Tolbutamide Placebo
Death 30 21
Survivor 174 184
> ftab <- fisher.test(tab)
> ftab
Understanding lists
> ftab
Fisher’s Exact Test for Count Data
data: tab
p-value = 0.1813
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.8013768 2.8872863
sample estimates:
odds ratio
1.509142
The default display only shows partial results. The total results are stored in the object ftab. Let’s evaluate the structure of ftab and extract some results:
Understanding lists
> str(ftab)
List of 7
$ p.value : num 0.181
$ conf.int : atomic [1:2] 0.801 2.887
..- attr(*, "conf.level")= num 0.95
$ estimate : Named num 1.51
..- attr(*, "names")= chr "odds ratio"
$ null.value : Named num 1
..- attr(*, "names")= chr "odds ratio"
$ alternative: chr "two.sided"
$ method : chr "Fisher’s Exact Test for Count Data"
$ data.name : chr "tab"
- attr(*, "class")= chr "htest"
Understanding lists
Let’s index some of the bins from ftab.
> ftab$estimate
odds ratio
1.5091
> ftab$conf.int
[1] 0.80138 2.88729
attr(,"conf.level")
[1] 0.95
> ftab$conf.int[2]
[1] 2.887286
> ftab$p.value
[1] 0.18126
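The same extraction can be tried without downloading the data file by entering the 2 × 2 counts directly. This sketch assumes the table orientation printed earlier (Status in rows, Treatment in columns).

```r
tab <- matrix(c(30, 174, 21, 184), 2, 2,
              dimnames = list(Status = c("Death", "Survivor"),
                              Treatment = c("Tolbutamide", "Placebo")))
ftab <- fisher.test(tab)

names(ftab)                          # the bins available for indexing
ftab$p.value
as.numeric(ftab$conf.int)            # plain numbers, attribute dropped
attr(ftab$conf.int, "conf.level")    # the attribute stored with the interval
```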
Understanding data frames
A data frame is a list with a 2-dimensional (tabular) structure. Epidemiologists are very experienced working with data frames where each row usually represents data collected on individual subjects (also called records or observations) and columns represent fields for each type of data collected (also called variables).
> subjno <- c(1, 2, 3, 4)
> age <- c(34, 56, 45, 23)
> sex <- c("Male", "Male", "Female", "Male")
> case <- c("Yes", "No", "No", "Yes")
> mydat <- data.frame(subjno, age, sex, case)
> mydat
subjno age sex case
1 1 34 Male Yes
2 2 56 Male No
3 3 45 Female No
4 4 23 Male Yes
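Because a data frame is both tabular and a list, it can be indexed either way. The sketch below rebuilds the same small data frame; stringsAsFactors = FALSE is set explicitly here (an addition, not on the slide) so the result does not depend on the R version.

```r
# stringsAsFactors = FALSE keeps sex and case as character vectors
# regardless of R version (the default changed in R 4.0.0).
mydat <- data.frame(subjno = c(1, 2, 3, 4),
                    age = c(34, 56, 45, 23),
                    sex = c("Male", "Male", "Female", "Male"),
                    case = c("Yes", "No", "No", "Yes"),
                    stringsAsFactors = FALSE)

mydat[2, "age"]                         # matrix-style: row 2, column "age"
mydat$age                               # list-style: the whole column
cases <- mydat[mydat$case == "Yes", ]   # logical indexing: rows for cases
mean(cases$age)
```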
Understanding data frames
Epidemiologists are familiar with tabular data sets where each row is a record and each column is a field. A record can be data collected on individuals or groups. We usually refer to the field name as a variable (e.g., age, gender, ethnicity). Fields can contain numeric or character data. In R, these types of data sets are handled by data frames. Each column of a data frame is usually either a factor or numeric vector, although it can have complex, character, or logical vectors. Data frames have the functionality of matrices and lists. For example, here are the first 10 rows of the infert data set, a matched case-control study published in 1976 that evaluated whether infertility was associated with prior spontaneous or induced abortions.
Understanding data frames
> data(infert)
> str(infert)
‘data.frame’: 248 obs. of 8 variables:
$ education : Factor w/ 3 levels "0-5yrs",..: 1 1 ...
$ age : num 26 42 39 34 35 36 23 32 21 28 ...
$ parity : num 6 1 6 4 3 4 1 2 1 2 ...
$ induced : num 1 1 2 2 1 2 0 0 0 0 ...
$ case : num 1 1 1 1 1 1 1 1 1 1 ...
$ spontaneous : num 2 0 0 0 1 1 0 0 1 0 ...
$ stratum : int 1 2 3 4 5 6 7 8 9 10 ...
$ pooled.stratum: num 3 1 4 2 32 36 6 22 5 19 ...
Understanding data frames
> infert[1:10, 1:6]
education age parity induced case spontaneous
1 0-5yrs 26 6 1 1 2
2 0-5yrs 42 1 1 1 0
3 0-5yrs 39 6 2 1 0
4 0-5yrs 34 4 2 1 0
5 6-11yrs 35 3 1 1 1
6 6-11yrs 36 4 2 1 1
7 6-11yrs 23 1 0 1 0
8 6-11yrs 32 2 0 1 0
9 6-11yrs 21 1 0 1 1
10 6-11yrs 28 2 0 1 0
Understanding data frames
The fields are obviously vectors. Let’s explore a few of these vectors to see what we can learn about their structure in R.
> #age variable
> infert$age
[1] 26 42 39 34 35 36 23 32 21 28 29 37 31 29 31 27 30 26
...
[235] 25 32 25 31 38 26 31 31 25 31 34 35 29 23
> mode(infert$age)
[1] "numeric"
> class(infert$age)
[1] "numeric"
Understanding data frames
> # education variable
> infert$education
[1] 0-5yrs 0-5yrs 0-5yrs 0-5yrs 6-11yrs 6-11yrs
...
[247] 12+ yrs 12+ yrs
Levels: 0-5yrs 6-11yrs 12+ yrs
> mode(infert$education)
[1] "numeric"
> class(infert$education)
[1] "factor"
Understanding data frames and factors
A factor is R’s representation of categorical fields and keeps track of all possible category levels.
> sex <- sample(c("Male", "Female"), 100, replace = TRUE)
> mode(sex); class(sex)
[1] "character"
[1] "character"
> table(sex)
sex
Female Male
51 49
> sexf <- factor(sex, levels = c("Male", "Female", "Transgender"))
> table(sexf)
sexf
Male Female Transgender
49 51 0
> mode(sexf); class(sexf)
[1] "numeric"
[1] "factor"
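The reason mode(sexf) is "numeric" is that a factor stores integer codes plus a set of level labels. A short sketch with a fixed (non-random) vector makes this visible:

```r
sex <- c("Male", "Female", "Male", "Male")
sexf <- factor(sex, levels = c("Male", "Female", "Transgender"))

levels(sexf)              # all declared levels, including unused ones
as.integer(sexf)          # the integer codes stored internally: 1 2 1 1
table(sexf)               # unused levels are tabulated with a count of 0
table(droplevels(sexf))   # droplevels() removes levels that never occur
```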
Understanding data frames and lists
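The infert data come from a matched case-control study evaluating the association between history of abortions and infertility. Because the data are matched, we use conditional logistic regression (clogit, from the survival package).
> library(survival) #provides clogit and strata
> mod3 <- clogit(case ~ spontaneous + induced +
+ strata(stratum), data = infert)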
> mod3
Call:
clogit(case ~ spontaneous + induced + strata(stratum), data =
coef exp(coef) se(coef) z p
spontaneous 1.99 7.29 0.352 5.63 1.8e-08
induced 1.41 4.09 0.361 3.91 9.4e-05
> summod3 <- summary(mod3)
Understanding data frames and lists
> summod3
n= 248
coef exp(coef) se(coef) z Pr(>|z|)
spontaneous 1.9859 7.2854 0.3524 5.635 1.75e-08 ***
induced 1.4090 4.0919 0.3607 3.906 9.38e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
spontaneous 7.285 0.1373 3.651 14.536
induced 4.092 0.2444 2.018 8.298
Rsquare= 0.193 (max possible= 0.519 )
Likelihood ratio test= 53.15 on 2 df, p=2.869e-12
Wald test = 31.84 on 2 df, p=1.221e-07
Score (logrank) test = 48.44 on 2 df, p=3.032e-11
Understanding data frames and lists
> str(summod3)
List of 12
$ call : language coxph(formula = Surv(rep(1, 248L), case) ~ spontaneous
$ fail : NULL
$ na.action : NULL
$ n : int 248
$ loglik : num [1:2] -90.8 -64.2
$ coefficients: num [1:2, 1:5] 1.986 1.409 7.285 4.092 0.352 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "spontaneous" "induced"
.. ..$ : chr [1:5] "coef" "exp(coef)" "se(coef)" "z" ...
$ conf.int : num [1:2, 1:4] 7.285 4.092 0.137 0.244 3.651 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "spontaneous" "induced"
.. ..$ : chr [1:4] "exp(coef)" "exp(-coef)" "lower .95" "upper .95"
$ logtest : Named num [1:3] 5.32e+01 2.00 2.87e-12
... [output truncated]
Understanding data frames and lists
> summod3$coef
coef exp(coef) se(coef) z Pr(>|z|)
spontaneous 1.985876 7.285423 0.3524435 5.634592 1.754734e-08
induced 1.409012 4.091909 0.3607124 3.906191 9.376245e-05
> summod3$coef[1, ]
coef exp(coef) se(coef) z Pr(>|z|)
1.985876e+00 7.285423e+00 3.524435e-01 5.634592e+00 1.754734e-08
> summod3$coef[ ,2]
spontaneous induced
7.285423 4.091909
> summod3$coef[1,2]
[1] 7.285423
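Extracted coefficients can feed further calculations. As a sketch, the Wald 95% confidence limits reported in the summary can be reproduced from the coefficient and standard error columns; the numbers below are copied from the summod3$coef output above rather than recomputed from a fitted model.

```r
# Values copied from the summod3$coef output.
coefs <- c(spontaneous = 1.985876, induced = 1.409012)
se <- c(spontaneous = 0.3524435, induced = 0.3607124)

# Wald interval on the log scale, then exponentiate to the odds ratio scale.
z <- qnorm(0.975)
ci <- cbind(OR = exp(coefs),
            lower = exp(coefs - z * se),
            upper = exp(coefs + z * se))
round(ci, 3)
```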
Understanding functions
Risk Ratio confidence interval from baby Rothman, p. 135
rr.wald <- function(x, conf.level = 0.95){
## prepare input
x1 <- x[1,1]; n1 <- sum(x[1,])
x0 <- x[2,1]; n0 <- sum(x[2,])
## do calculations
p1 <- x1/n1 ##risk among exposed
p0 <- x0/n0 ##risk among unexposed
RR <- p1/p0;
logRR <- log(RR)
SElogRR <- sqrt(1/x1 - 1/n1 + 1/x0 - 1/n0)
Z <- qnorm(0.5*(1 + conf.level))
LCL <- exp(logRR - Z*SElogRR)
UCL <- exp(logRR + Z*SElogRR)
##collect output
list(x = x, risks = c(p1 = p1, p0 = p0), risk.ratio = RR,
conf.int = c(LCL, UCL), conf.level = conf.level)
}
Understanding functions
Run the rr.wald function on the UGDP RCT data. rr.wald expects exposure groups in rows and outcomes in columns, so the table is transposed with t() first.
> tab
Treatment
Status Tolbutamide Placebo
Death 30 21
Survivor 174 184
> rr.wald(t(tab))
$x
Status
Treatment Death Survivor
Tolbutamide 30 174
Placebo 21 184
$risks
p1 p0
0.1470588 0.1024390
$risk.ratio
[1] 1.435574
$conf.int
[1] 0.8510221 2.4216450
$conf.level
[1] 0.95
The epitools package
The following epidemiologists, directly or indirectly, contributed to 'epitools':
Tomas Aragon, MD, DrPH, UC Berkeley
Michael P. Fay, PhD, Mathematical Statistician, National Institute of Allergy and Infectious Diseases
Wayne Enanoria, PhD, MPH, UC Berkeley
Travis Porco, PhD, MPH, UC San Francisco
Michael Samuel, DrPH, California Department of Public Health
Using epitools for outbreak investigations
Using the epitab function (only arguments are displayed):
epitab(x, y = NULL,
method = c("oddsratio", "riskratio", "rateratio"),
conf.level = 0.95,
rev = c("neither", "rows", "columns", "both"),
oddsratio = c("wald", "fisher", "midp", "small"),
riskratio = c("wald", "boot", "small"),
rateratio = c("wald", "midp"),
pvalue = c("fisher.exact", "midp.exact", "chi2"),
correction = FALSE,
verbose = FALSE)
Hypothesis testing using Oswego: Passing 2 vectors
> library(epitools) #load 'epitools' package
> data(oswego) #load Oswego dataset
> attach(oswego) #attach dataset
> round(epitab(jello, ill, method = "riskratio")$tab, 2)
Outcome
Predictor N p0 Y p1 riskratio lower upper p.value
N 22 0.42 30 0.58 1.00 NA NA NA
Y 7 0.30 16 0.70 1.21 0.84 1.72 0.44
> round(epitab(jello, ill, method = "oddsratio")$tab, 2)
Outcome
Predictor N p0 Y p1 oddsratio lower upper p.value
N 22 0.76 30 0.65 1.00 NA NA NA
Y 7 0.24 16 0.35 1.68 0.59 4.76 0.44
> detach(oswego) #detach dataset
Hypothesis testing using Oswego: Passing a table
> jello.tab1 <- xtabs(~ jello + ill, data = oswego) #assumed construction; not shown on the slide
> jello.tab1
ill
jello N Y
N 22 30
Y 7 16
> round(epitab(jello.tab1)$tab, 2)
ill
jello N p0 Y p1 oddsratio lower upper p.value
N 22 0.76 30 0.65 1.00 NA NA NA
Y 7 0.24 16 0.35 1.68 0.59 4.76 0.44
> round(epitab(jello.tab1, method = "risk")$tab, 2)
ill
jello N p0 Y p1 riskratio lower upper p.value
N 22 0.42 30 0.58 1.00 NA NA NA
Y 7 0.30 16 0.70 1.21 0.84 1.72 0.44
Hypothesis testing using Oswego: Passing one vector
> round(epitab(c(22, 30, 7, 16))$tab, 2)
Outcome
Predictor Disease1 p0 Disease2 p1 oddsratio lower upper p.value
Exposed1 22 0.76 30 0.65 1.00 NA NA NA
Exposed2 7 0.24 16 0.35 1.68 0.59 4.76 0.44
> round(epitab(c(22, 30, 7, 16), method = "risk")$tab, 2)
Outcome
Predictor Disease1 p0 Disease2 p1 riskratio lower upper p.value
Exposed1 22 0.42 30 0.58 1.00 NA NA NA
Exposed2 7 0.30 16 0.70 1.21 0.84 1.72 0.44
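The Wald odds ratio row and the risk ratio point estimate reported by epitab can be checked with base R alone. This is a hand calculation under the usual Wald (log-scale) formula, not a description of epitab's internals; the counts are taken from the jello-by-illness table above.

```r
# Counts from the jello-by-illness table: rows are exposure (N, Y),
# columns are illness (N, Y).
tab <- matrix(c(22, 7, 30, 16), 2, 2,
              dimnames = list(jello = c("N", "Y"), ill = c("N", "Y")))

# Odds ratio as the cross-product ratio.
or <- (tab["Y", "Y"] * tab["N", "N"]) / (tab["Y", "N"] * tab["N", "Y"])

# Wald interval on the log scale.
se.log.or <- sqrt(sum(1 / tab))
z <- qnorm(0.975)
ci <- exp(log(or) + c(-1, 1) * z * se.log.or)

# Risk ratio: risk of illness among exposed over risk among unexposed.
risks <- tab[, "Y"] / rowSums(tab)
rr <- unname(risks["Y"] / risks["N"])

round(c(or = or, lower = ci[1], upper = ci[2], rr = rr), 2)
```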
Summary
1 Background
  Cost
  Quality
  Community
2 Getting started with R
  Full-function calculator/spreadsheet
  Extensible statistical packages
  High quality graphics tool
  Multi-use programming language
3 Working with R data objects
  Atomic vs. recursive data objects
  Working with vectors, matrices, & arrays
  Working with lists, data frames, and functions