introduction to data analysis in r

Introduction to Data Analysis in R

Department of Statistical Sciences and Operations ResearchComputation Seminar SeriesSpeaker: Edward BooneEmail: [email protected]

What is R?

The R statistical programming language is a free open source package based on the S language developed by Bell Labs.

The language is very powerful for writing programs. Many statistical functions are already built in. Contributed packages expand the functionality to

cutting edge research. Since it is a programming language, generating

computer code to complete tasks is required.

Getting Started

Where to get R? Go to www.r-project.org Downloads: CRAN Set your Mirror: Anyone in the USA is fine. Select Windows 95 or later. Select base. Select R-2.4.1-win32.exe

The others are if you are a developer and wish to change the source code.

Getting Started

The R GUI?

Getting Started

Opening a script. This gives you a script window.

Getting Started

Submitting a program:

Use button

Right mouse click and run selection.

Submit Selection

Getting Started

Basic assignment and operations. Arithmetic Operations:

+, -, *, /, ^ are the standard arithmetic operators. Matrix Arithmetic.

* is element wise multiplication %*% is matrix multiplication

Assignment To assign a value to a variable use “<-”

Getting Started

How to use help in R? R has a very good help system built in. If you know which function you want help with

simply use ?_______ with the function in the blank.

Ex: ?hist. If you don’t know which function to use, then use

help.search(“_______”). Ex: help.search(“histogram”).

Importing Data

How do we get data into R? Remember we have no point and click… First make sure your data is in an easy to

read format such as CSV (Comma Separated Values).

Use code: D <- read.table(“path”,sep=“,”,header=TRUE)

Working with data.

Accessing columns. D has our data in it…. But you can’t see it

directly. To select a column use D$column.

Working with data.

Subsetting data. Use a logical operator to do this.

==, >, <, <=, >=, <> are all logical operators. Note that the “equals” logical operator is two = signs.

Example: D[D$Gender == “M”,] This will return the rows of D where Gender is “M”. Remember R is case sensitive! This code does nothing to the original dataset. D.M <- D[D$Gender == “M”,] gives a dataset with the

appropriate rows.

Summary Statistics

Mean mean(D$wg)

[1] NA It doesn’t know what to do with the missing

values. mean(D$wg, na.rm=TRUE)

[1] 16.62016

Summary Statistics

Medianmedian(D$wg,na.rm=TRUE)

[1] 15

Quantiles quantile(D$wg,c(.25,.5,.75),na.rm=TRUE)

25% 50% 75%

8 15 20

Summarysummary(D$wg)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

0.00 8.00 15.00 16.62 20.00 70.00 132.00

Summary Statistics

Rangerange(D$wg,na.rm=TRUE)

IQRIQR(D$wg,na.rm=TRUE)

Standard Deviationsd(D$wg,na.rm=TRUE)

Basic Testing

One Sample T-Testt.test(D$wg, alternative="greater", mu=10,

conf.level=0.95)

Output One Sample t-test

data: D$wg t = 8.7429, df = 257, p-value < 2.2e-16alternative hypothesis: true mean is greater than 10 95 percent confidence interval: 15.37016 Inf sample estimates:mean of x 16.62016

Basic Testing

Comparing Men to Women on weight gain. Two Sample T-Test

wg.M <- D[D$Gender =="M",]wg.F <- D[D$Gender =="F",]t.test(wg.M$wg, wg.F$wg, alternative="two.sided", mu=0, conf.level

= 0.95) Output

Welch Two Sample t-test

data: wg.M$wg and wg.F$wg t = 1.2645, df = 142.21, p-value = 0.2081alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.132955 5.155447 sample estimates:mean of x mean of y 18.08571 16.07447

Basic Testing

Paired T-TestPt <- read.csv("H:\\Pairedt.csv",header=TRUE)t.test(Pt$A,Pt$B, alternative="two.sided", mu=0, paired = TRUE, conf.level = 0.95)

Output

Paired t-test

data: Pt$A and Pt$B t = -0.1064, df = 9, p-value = 0.9176alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.558007 1.418007 sample estimates:mean of the differences -0.07

The lm Function

R organizes many linear models into the lm function. Regression ANOVA

Regression

Consider the following dataset.cherry <- read.csv("H:\\cherry.csv",header=TRUE)

plot(cherry$Height,cherry$Volume)

Regression

Fit a regression model to the data.lm(Volume ~ Height, data = cherry)

Output:Call:

lm(formula = Volume ~ Height, data = cherry)

Coefficients:

(Intercept) Height

-87.124 1.543

Regression

Maybe there is more…cherry.lm <- lm(Volume ~ Height,data = cherry)summary(cherry.lm)

Output:Call:lm(formula = Volume ~ Height, data = cherry)

Residuals: Min 1Q Median 3Q Max -21.274 -9.894 -2.894 12.067 29.852

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -87.1236 29.2731 -2.976 0.005835 ** Height 1.5433 0.3839 4.021 0.000378 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.4 on 29 degrees of freedomMultiple R-Squared: 0.3579, Adjusted R-squared: 0.3358 F-statistic: 16.16 on 1 and 29 DF, p-value: 0.0003784

Regression

And more…par(mfrow=c(2,2))

plot(cherry.lm)

Regression

What all is there?names(cherry.lm)

Output[1] "coefficients" "residuals" "effects" "rank"

[5] "fitted.values" "assign" "qr" "df.residual"

[9] "xlevels" "call" "terms" "model"

Regression

What all is there? names(summary(cherry.lm))

Output[1] "call" "terms" "residuals" "coefficients"

[5] "aliased" "sigma" "df" "r.squared"

[9] "adj.r.squared" "fstatistic" "cov.unscaled"

And even more…

Regression

Plot the line

par(mfrow=c(1,1))plot(cherry$Height, cherry$Volume)abline(cherry.lm)

Regression

Fit confidence intervals to line.predict.lm(cherry.lm, interval="confidence")

Prediction intervals to a new data set can also be generated.

Influence measures can be generated as well. Influence.measures() function does lots of

these.

Multiple Linear Regression

Matrix Plotpairs(cherry)

Multiple Linear Regression To fit a multiple linear regression:

cherry.lm <- lm(Volume ~ Height + Diam, data=cherry)summary(cherry.lm)

Call:lm(formula = Volume ~ Height + Diam, data = cherry)

Residuals: Min 1Q Median 3Q Max -6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***Height 0.3393 0.1302 2.607 0.0145 * Diam 4.7082 0.2643 17.816 < 2e-16 ***---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedomMultiple R-Squared: 0.948, Adjusted R-squared: 0.9442 F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16

Regression

ANOVAanova(cherry.lm)

Analysis of Variance Table

Response: Volume

Df Sum Sq Mean Sq F value Pr(>F)

Height 1 2901.2 2901.2 192.53 4.503e-14 ***

Diam 1 4783.0 4783.0 317.41 < 2.2e-16 ***

Residuals 28 421.9 15.1

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What type of Sums of Squares are these?

Regression

ANOVAcherry.lm <- lm(Volume ~ Diam + Height, data=cherry)

anova(cherry.lm)

Analysis of Variance Table

Response: Volume

Df Sum Sq Mean Sq F value Pr(>F)

Diam 1 7581.8 7581.8 503.1503 < 2e-16 ***

Height 1 102.4 102.4 6.7943 0.01449 *

Residuals 28 421.9 15.1

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regression

Confidence Intervalsconfint(cherry.lm, level = 0.97)

1.5 % 98.5 %(Intercept) -77.73792775 -38.2373901

Diam 4.10395112 5.3123699

Height 0.04167615 0.6368263

Model Search

AICextractAIC(cherry.lm)

[1] 3.00000 86.93578

The first value is the effective degrees of freedom and the second value is the AIC

Stepwise.step(cherry.lm)

This is based on AIC however can be specified to use BIC. Default is AIC

Can do forward, backward and stepwise selection. Default is stepwise.

ANOVA

ANOVAdo2 <- read.csv("H:\\do2.csv",header=TRUE)

do2.aov <- aov(DO2 ~ Stream, data=do2)

summary(do2.aov)

Output Df Sum Sq Mean Sq F value Pr(>F)

Stream 2 37.056 18.528 4.5595 0.01898 *

Residuals 29 117.844 4.064

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA

You can also use the lm frameworkdo2.lm <- lm(DO2 ~ Stream, data=do2)anova(do2.lm)

Output Analysis of Variance Table

Response: DO2 Df Sum Sq Mean Sq F value Pr(>F) Stream 2 37.056 18.528 4.5595 0.01898 *Residuals 29 117.844 4.064 ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple Comparisons

Tukey’s HSD ProcedureTukeyHSD(do2.aov)

Output

Tukey multiple comparisons of means 95% family-wise confidence level

Fit: aov(formula = DO2 ~ Stream, data = do2)

$Stream diff lwr upr p adjPiney-Cedar 0.4966667 -1.634953 2.62828674 0.8342003South Run-Cedar -2.0533333 -4.184953 0.07828674 0.0607655South Run-Piney -2.5500000 -4.776405 -0.32359544 0.0221846


Tukey’s HSD Procedure using the plot command.

do2.HSD <- TukeyHSD(do2.aov)plot(do2.HSD)


Tukey’s HSD Proceduredo2.HSD <- TukeyHSD(do2.aov, conf.level = 0.97)

do2.HSD

Output Tukey multiple comparisons of means 97% family-wise confidence level

Fit: aov(formula = DO2 ~ Stream, data = do2)

$Stream

diff lwr upr p adj

Piney-Cedar 0.4966667 -1.832410 2.8257437 0.8342003

South Run-Cedar -2.0533333 -4.382410 0.2757437 0.0607655

South Run-Piney -2.5500000 -4.982642 -0.1173584 0.0221846


Plot it.

Kruskal Wallis Test

Tukey’s HSD Procedurekruskal.test(DO2 ~ Stream, data=do2)

Output Kruskal-Wallis rank sum test

data: DO2 by Stream

Kruskal-Wallis chi-squared = 7.5097, df = 2, p-value = 0.02340

Loess Regression

Regression from a non-parametric frameworkcherry.lo <- loess(Volume ~ Diam, data=cherry)

cherry.lo$fitted

Output [1] 9.358813 10.471276 11.213760 17.518128 18.258269 18.629063 19.368892 [8] 19.368892 19.742982 20.117965 20.503236 20.890123 20.890123 21.950202

[15] 22.930251 26.020681 26.020681 27.657015 29.364601 29.799478 30.701120

[22] 31.658354 33.187952 41.983894 43.878359 50.561902 51.968665 54.854374

[29] 55.590857 55.590857 76.749972

Loess Regression

cherry.plo <- predict(cherry.lo, cherry$Diam)plot(cherry$Diam,cherry$Volume)lines(cherry$Diam,cherry.plo)

Summary

R is programming environment with many standard statistical procedures already included.

Easy to control the procedure you are interested in.

No support. Allows user to modify many existing

procedures.

Summary

All of the R code and files can be found at:

www.people.vcu.edu/~elboone2/CSS.htm

introduction to data analysis in r

Documents