introduction to data analysis in r
DESCRIPTION
Introduction to Data Analysis in R. Department of Statistical Sciences and Operations Research Computation Seminar Series Speaker: Edward Boone Email: [email protected]. What is R?. - PowerPoint PPT PresentationTRANSCRIPT
Introduction to Data Analysis in R
Department of Statistical Sciences and Operations ResearchComputation Seminar SeriesSpeaker: Edward BooneEmail: [email protected]
What is R?
The R statistical programming language is a free open source package based on the S language developed by Bell Labs.
The language is very powerful for writing programs. Many statistical functions are already built in. Contributed packages expand the functionality to
cutting edge research. Since it is a programming language, generating
computer code to complete tasks is required.
Getting Started
Where to get R? Go to www.r-project.org Downloads: CRAN Set your Mirror: Anyone in the USA is fine. Select Windows 95 or later. Select base. Select R-2.4.1-win32.exe
The others are if you are a developer and wish to change the source code.
Getting Started
The R GUI?
Getting Started
Opening a script. This gives you a script window.
Getting Started
Submitting a program:
Use button
Right mouse click and run selection.
Submit Selection
Getting Started
Basic assignment and operations. Arithmetic Operations:
+, -, *, /, ^ are the standard arithmetic operators. Matrix Arithmetic.
* is element wise multiplication %*% is matrix multiplication
Assignment To assign a value to a variable use “<-”
Getting Started
How to use help in R? R has a very good help system built in. If you know which function you want help with
simply use ?_______ with the function in the blank.
Ex: ?hist. If you don’t know which function to use, then use
help.search(“_______”). Ex: help.search(“histogram”).
Importing Data
How do we get data into R? Remember we have no point and click… First make sure your data is in an easy to
read format such as CSV (Comma Separated Values).
Use code: D <- read.table(“path”,sep=“,”,header=TRUE)
Working with data.
Accessing columns. D has our data in it…. But you can’t see it
directly. To select a column use D$column.
Working with data.
Subsetting data. Use a logical operator to do this.
==, >, <, <=, >=, <> are all logical operators. Note that the “equals” logical operator is two = signs.
Example: D[D$Gender == “M”,] This will return the rows of D where Gender is “M”. Remember R is case sensitive! This code does nothing to the original dataset. D.M <- D[D$Gender == “M”,] gives a dataset with the
appropriate rows.
Summary Statistics
Mean mean(D$wg)
[1] NA It doesn’t know what to do with the missing
values. mean(D$wg, na.rm=TRUE)
[1] 16.62016
Summary Statistics
Medianmedian(D$wg,na.rm=TRUE)
[1] 15
Quantiles quantile(D$wg,c(.25,.5,.75),na.rm=TRUE)
25% 50% 75%
8 15 20
Summarysummary(D$wg)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 8.00 15.00 16.62 20.00 70.00 132.00
Summary Statistics
Rangerange(D$wg,na.rm=TRUE)
IQRIQR(D$wg,na.rm=TRUE)
Standard Deviationsd(D$wg,na.rm=TRUE)
Basic Testing
One Sample T-Testt.test(D$wg, alternative="greater", mu=10,
conf.level=0.95)
Output One Sample t-test
data: D$wg t = 8.7429, df = 257, p-value < 2.2e-16alternative hypothesis: true mean is greater than 10 95 percent confidence interval: 15.37016 Inf sample estimates:mean of x 16.62016
Basic Testing
Comparing Men to Women on weight gain. Two Sample T-Test
wg.M <- D[D$Gender =="M",]wg.F <- D[D$Gender =="F",]t.test(wg.M$wg, wg.F$wg, alternative="two.sided", mu=0, conf.level
= 0.95) Output
Welch Two Sample t-test
data: wg.M$wg and wg.F$wg t = 1.2645, df = 142.21, p-value = 0.2081alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.132955 5.155447 sample estimates:mean of x mean of y 18.08571 16.07447
Basic Testing
Paired T-TestPt <- read.csv("H:\\Pairedt.csv",header=TRUE)t.test(Pt$A,Pt$B, alternative="two.sided", mu=0, paired = TRUE, conf.level = 0.95)
Output
Paired t-test
data: Pt$A and Pt$B t = -0.1064, df = 9, p-value = 0.9176alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.558007 1.418007 sample estimates:mean of the differences -0.07
The lm Function
R organizes many linear models into the lm function. Regression ANOVA
Regression
Consider the following dataset.cherry <- read.csv("H:\\cherry.csv",header=TRUE)
plot(cherry$Height,cherry$Volume)
Regression
Fit a regression model to the data.lm(Volume ~ Height, data = cherry)
Output:Call:
lm(formula = Volume ~ Height, data = cherry)
Coefficients:
(Intercept) Height
-87.124 1.543
Regression
Maybe there is more…cherry.lm <- lm(Volume ~ Height,data = cherry)summary(cherry.lm)
Output:Call:lm(formula = Volume ~ Height, data = cherry)
Residuals: Min 1Q Median 3Q Max -21.274 -9.894 -2.894 12.067 29.852
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -87.1236 29.2731 -2.976 0.005835 ** Height 1.5433 0.3839 4.021 0.000378 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.4 on 29 degrees of freedomMultiple R-Squared: 0.3579, Adjusted R-squared: 0.3358 F-statistic: 16.16 on 1 and 29 DF, p-value: 0.0003784
Regression
And more…par(mfrow=c(2,2))
plot(cherry.lm)
Regression
What all is there?names(cherry.lm)
Output[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
Regression
What all is there? names(summary(cherry.lm))
Output[1] "call" "terms" "residuals" "coefficients"
[5] "aliased" "sigma" "df" "r.squared"
[9] "adj.r.squared" "fstatistic" "cov.unscaled"
And even more…
Regression
Plot the line
par(mfrow=c(1,1))plot(cherry$Height, cherry$Volume)abline(cherry.lm)
Regression
Fit confidence intervals to line.predict.lm(cherry.lm, interval="confidence")
Prediction intervals to a new data set can also be generated.
Influence measures can be generated as well. Influence.measures() function does lots of
these.
Multiple Linear Regression
Matrix Plotpairs(cherry)
Multiple Linear Regression To fit a multiple linear regression:
cherry.lm <- lm(Volume ~ Height + Diam, data=cherry)summary(cherry.lm)
Call:lm(formula = Volume ~ Height + Diam, data = cherry)
Residuals: Min 1Q Median 3Q Max -6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***Height 0.3393 0.1302 2.607 0.0145 * Diam 4.7082 0.2643 17.816 < 2e-16 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedomMultiple R-Squared: 0.948, Adjusted R-squared: 0.9442 F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Regression
ANOVAanova(cherry.lm)
Analysis of Variance Table
Response: Volume
Df Sum Sq Mean Sq F value Pr(>F)
Height 1 2901.2 2901.2 192.53 4.503e-14 ***
Diam 1 4783.0 4783.0 317.41 < 2.2e-16 ***
Residuals 28 421.9 15.1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What type of Sums of Squares are these?
Regression
ANOVAcherry.lm <- lm(Volume ~ Diam + Height, data=cherry)
anova(cherry.lm)
Analysis of Variance Table
Response: Volume
Df Sum Sq Mean Sq F value Pr(>F)
Diam 1 7581.8 7581.8 503.1503 < 2e-16 ***
Height 1 102.4 102.4 6.7943 0.01449 *
Residuals 28 421.9 15.1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Regression
Confidence Intervalsconfint(cherry.lm, level = 0.97)
1.5 % 98.5 %(Intercept) -77.73792775 -38.2373901
Diam 4.10395112 5.3123699
Height 0.04167615 0.6368263
Model Search
AICextractAIC(cherry.lm)
[1] 3.00000 86.93578
The first value is the effective degrees of freedom and the second value is the AIC
Stepwise.step(cherry.lm)
This is based on AIC however can be specified to use BIC. Default is AIC
Can do forward, backward and stepwise selection. Default is stepwise.
ANOVA
ANOVAdo2 <- read.csv("H:\\do2.csv",header=TRUE)
do2.aov <- aov(DO2 ~ Stream, data=do2)
summary(do2.aov)
Output Df Sum Sq Mean Sq F value Pr(>F)
Stream 2 37.056 18.528 4.5595 0.01898 *
Residuals 29 117.844 4.064
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA
You can also use the lm frameworkdo2.lm <- lm(DO2 ~ Stream, data=do2)anova(do2.lm)
Output Analysis of Variance Table
Response: DO2 Df Sum Sq Mean Sq F value Pr(>F) Stream 2 37.056 18.528 4.5595 0.01898 *Residuals 29 117.844 4.064 ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multiple Comparisons
Tukey’s HSD ProcedureTukeyHSD(do2.aov)
Output
Tukey multiple comparisons of means 95% family-wise confidence level
Fit: aov(formula = DO2 ~ Stream, data = do2)
$Stream diff lwr upr p adjPiney-Cedar 0.4966667 -1.634953 2.62828674 0.8342003South Run-Cedar -2.0533333 -4.184953 0.07828674 0.0607655South Run-Piney -2.5500000 -4.776405 -0.32359544 0.0221846
Multiple Comparisons
Tukey’s HSD Procedure using the plot command.
do2.HSD <- TukeyHSD(do2.aov)plot(do2.HSD)
Multiple Comparisons
Tukey’s HSD Proceduredo2.HSD <- TukeyHSD(do2.aov, conf.level = 0.97)
do2.HSD
Output Tukey multiple comparisons of means 97% family-wise confidence level
Fit: aov(formula = DO2 ~ Stream, data = do2)
$Stream
diff lwr upr p adj
Piney-Cedar 0.4966667 -1.832410 2.8257437 0.8342003
South Run-Cedar -2.0533333 -4.382410 0.2757437 0.0607655
South Run-Piney -2.5500000 -4.982642 -0.1173584 0.0221846
Multiple Comparisons
Plot it.
Kruskal Wallis Test
Tukey’s HSD Procedurekruskal.test(DO2 ~ Stream, data=do2)
Output Kruskal-Wallis rank sum test
data: DO2 by Stream
Kruskal-Wallis chi-squared = 7.5097, df = 2, p-value = 0.02340
Loess Regression
Regression from a non-parametric frameworkcherry.lo <- loess(Volume ~ Diam, data=cherry)
cherry.lo$fitted
Output [1] 9.358813 10.471276 11.213760 17.518128 18.258269 18.629063 19.368892 [8] 19.368892 19.742982 20.117965 20.503236 20.890123 20.890123 21.950202
[15] 22.930251 26.020681 26.020681 27.657015 29.364601 29.799478 30.701120
[22] 31.658354 33.187952 41.983894 43.878359 50.561902 51.968665 54.854374
[29] 55.590857 55.590857 76.749972
Loess Regression
cherry.plo <- predict(cherry.lo, cherry$Diam)plot(cherry$Diam,cherry$Volume)lines(cherry$Diam,cherry.plo)
Summary
R is programming environment with many standard statistical procedures already included.
Easy to control the procedure you are interested in.
No support. Allows user to modify many existing
procedures.
Summary
All of the R code and files can be found at:
www.people.vcu.edu/~elboone2/CSS.htm