Introduction to Data Analysis in R Department of Statistical Sciences and Operations Research Computation Seminar Series Speaker: Edward Boone Email: [email protected]


Post on 31-Jan-2016


Page 1: Introduction to Data Analysis in R

Introduction to Data Analysis in R

Department of Statistical Sciences and Operations Research
Computation Seminar Series
Speaker: Edward Boone
Email: [email protected]

Page 2: Introduction to Data Analysis in R

What is R?

R is a free, open-source statistical programming language based on the S language developed at Bell Labs.

The language is very powerful for writing programs.
Many statistical functions are already built in.
Contributed packages expand the functionality to cutting-edge research.
Since it is a programming language, you must write code to complete tasks.

Page 3: Introduction to Data Analysis in R

Getting Started

Where to get R?
Go to www.r-project.org
Downloads: CRAN
Set your mirror: any one in the USA is fine.
Select Windows 95 or later.
Select base.
Select R-2.4.1-win32.exe
The other downloads are for developers who wish to change the source code.

Page 4: Introduction to Data Analysis in R

Getting Started

The R GUI?

Page 5: Introduction to Data Analysis in R

Getting Started

Opening a script. This gives you a script window.

Page 6: Introduction to Data Analysis in R

Getting Started

Submitting a program:

Select the code in the script window, then either:
Use the toolbar button to submit the selection, or
Right mouse click and run the selection.

Page 7: Introduction to Data Analysis in R

Getting Started

Basic assignment and operations.

Arithmetic operations: +, -, *, /, ^ are the standard arithmetic operators.

Matrix arithmetic:
* is element-wise multiplication.
%*% is matrix multiplication.

Assignment: to assign a value to a variable, use "<-".
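A minimal sketch of the operators above. The variable names and values are made up for illustration.

```r
# Assignment uses "<-"
x <- 3
y <- x^2 + 1                  # standard arithmetic: y is 10

A <- matrix(1:4, nrow = 2)    # filled column by column: rows are (1, 3) and (2, 4)
B <- diag(2)                  # 2 x 2 identity matrix

elementwise <- A * B          # multiplies matching entries
matprod     <- A %*% B        # matrix product; multiplying by the identity returns A

print(elementwise)
print(matprod)
```

The two results differ: the element-wise product keeps only the diagonal of A, while the matrix product reproduces A.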

Page 8: Introduction to Data Analysis in R

Getting Started

How to use help in R?
R has a very good help system built in.
If you know which function you want help with, simply use ?_______ with the function name in the blank. Ex: ?hist
If you don't know which function to use, then use help.search("_______"). Ex: help.search("histogram")

Page 9: Introduction to Data Analysis in R

Importing Data

How do we get data into R? Remember we have no point and click…
First make sure your data is in an easy-to-read format such as CSV (Comma Separated Values).
Use code: D <- read.table("path", sep=",", header=TRUE)
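The import step can be tried without any external file. The sketch below writes a tiny CSV to a temporary file and reads it back. The file contents are invented for illustration, with column names borrowed from the later slides.

```r
# Write a small CSV to a temporary file so the example is self-contained.
# The values are made up; they are not the seminar data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Gender,wg",
             "M,12",
             "F,8",
             "M,NA",
             "F,20"), csv_path)

# read.table() with sep = "," and header = TRUE, as on the slide
D1 <- read.table(csv_path, sep = ",", header = TRUE)

# read.csv() is a wrapper around read.table() with CSV-friendly defaults
D2 <- read.csv(csv_path)

str(D1)    # check the column types right after importing
head(D1)   # look at the first few rows
```

The literal text NA in the file is read as a missing value, which matters for the summary statistics later on.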

Page 10: Introduction to Data Analysis in R

Working with data.

Accessing columns.
D has our data in it… but you can't see it directly.
To select a column use D$column.

Page 11: Introduction to Data Analysis in R

Working with data.

Subsetting data. Use a logical operator to do this.
==, !=, >, <, <=, >= are all logical operators. Note that the "equals" logical operator is two = signs, and "not equal" is != (R has no <> operator).
Example: D[D$Gender == "M",]
This will return the rows of D where Gender is "M". Remember R is case sensitive!
This code does nothing to the original dataset.
D.M <- D[D$Gender == "M",] gives a dataset with the appropriate rows.
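A small sketch of row subsetting on a toy data frame. The column names follow the slides; the values, including the deliberately missing Gender, are made up. It also shows one pitfall: a comparison involving NA is itself NA, and indexing with NA returns a row of NAs.

```r
# Toy data; the values are invented for illustration
D <- data.frame(Gender = c("M", "F", "M", "F", NA),
                wg     = c(12, 8, 25, 20, 15))

D.M    <- D[D$Gender == "M", ]   # rows where Gender is "M" (plus an NA row)
D.notM <- D[D$Gender != "M", ]   # "not equal" is !=

# which() and subset() both drop rows where the condition is NA
D.M.clean <- D[which(D$Gender == "M"), ]
D.M.sub   <- subset(D, Gender == "M")

nrow(D.M)         # 3: two matches plus one all-NA row
nrow(D.M.clean)   # 2
nrow(D)           # 5: the original data frame is unchanged
```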

Page 12: Introduction to Data Analysis in R

Summary Statistics

Mean
mean(D$wg)
[1] NA
It doesn't know what to do with the missing values.
mean(D$wg, na.rm=TRUE)
[1] 16.62016

Page 13: Introduction to Data Analysis in R

Summary Statistics

Median
median(D$wg, na.rm=TRUE)
[1] 15

Quantiles
quantile(D$wg, c(.25,.5,.75), na.rm=TRUE)
25% 50% 75%
  8  15  20

Summary
summary(D$wg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00    8.00   15.00   16.62   20.00   70.00  132.00

Page 14: Introduction to Data Analysis in R

Summary Statistics

Range
range(D$wg, na.rm=TRUE)

IQR
IQR(D$wg, na.rm=TRUE)

Standard Deviation
sd(D$wg, na.rm=TRUE)
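The summary functions can be tried on a short vector with missing values. The numbers below are invented and are not the seminar's weight gain data.

```r
# Hypothetical vector with two missing values
wg <- c(0, 8, 15, 20, 70, NA, NA)

n_missing <- sum(is.na(wg))        # count the NAs before summarizing

m_raw <- mean(wg)                  # NA: missing values propagate by default
m     <- mean(wg, na.rm = TRUE)    # drop the NAs first
med   <- median(wg, na.rm = TRUE)
q     <- quantile(wg, c(.25, .5, .75), na.rm = TRUE)
rng   <- range(wg, na.rm = TRUE)   # minimum and maximum
iqr   <- IQR(wg, na.rm = TRUE)     # 75th minus 25th percentile
s     <- sd(wg, na.rm = TRUE)

summary(wg)                        # reports the NA count alongside the statistics
```

Note that summary() does not need na.rm: it removes missing values itself and reports how many there were.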

Page 15: Introduction to Data Analysis in R

Basic Testing

One Sample T-Test
t.test(D$wg, alternative="greater", mu=10, conf.level=0.95)

Output

        One Sample t-test

data:  D$wg
t = 8.7429, df = 257, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 10
95 percent confidence interval:
 15.37016      Inf
sample estimates:
mean of x
 16.62016
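The printed output is only a view of the object that t.test() returns. The sketch below uses simulated data (an assumption, not the seminar data), stores the result, pulls out individual pieces, and checks the t statistic against the textbook formula.

```r
set.seed(1)                          # reproducible simulated sample
x <- rnorm(30, mean = 12, sd = 5)    # hypothetical data

tt <- t.test(x, alternative = "greater", mu = 10, conf.level = 0.95)

tt$statistic   # t statistic
tt$parameter   # degrees of freedom (n - 1)
tt$p.value
tt$conf.int    # one-sided interval, so the upper limit is Inf
tt$estimate    # sample mean

# t = (sample mean - mu) / (sd / sqrt(n))
t_by_hand <- (mean(x) - 10) / (sd(x) / sqrt(length(x)))
all.equal(unname(tt$statistic), t_by_hand)
```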

Page 16: Introduction to Data Analysis in R

Basic Testing

Comparing Men to Women on weight gain.
Two Sample T-Test
wg.M <- D[D$Gender == "M",]
wg.F <- D[D$Gender == "F",]
t.test(wg.M$wg, wg.F$wg, alternative="two.sided", mu=0, conf.level=0.95)

Output

        Welch Two Sample t-test

data:  wg.M$wg and wg.F$wg
t = 1.2645, df = 142.21, p-value = 0.2081
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.132955  5.155447
sample estimates:
mean of x mean of y
 18.08571  16.07447
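The same comparison can be written with the formula interface, which avoids splitting the data by hand. The sketch below uses simulated data (an assumption) and also shows the pooled-variance test, since t.test() runs the Welch version unless var.equal = TRUE is given.

```r
set.seed(2)
# Simulated stand-in for the weight gain data: 20 men and 20 women
D <- data.frame(Gender = rep(c("M", "F"), each = 20),
                wg     = c(rnorm(20, mean = 18, sd = 8),
                           rnorm(20, mean = 16, sd = 8)))

# Splitting by hand, as on the slide (M first, F second)
welch_split <- t.test(D$wg[D$Gender == "M"], D$wg[D$Gender == "F"])

# Formula interface: groups are ordered alphabetically (F first, M second),
# so the sign of t flips but the p-value is the same
welch_formula <- t.test(wg ~ Gender, data = D)

# Classical pooled-variance test: df = n1 + n2 - 2
pooled <- t.test(wg ~ Gender, data = D, var.equal = TRUE)

welch_split$p.value
welch_formula$p.value
pooled$parameter
```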

Page 17: Introduction to Data Analysis in R

Basic Testing

Paired T-Test
Pt <- read.csv("H:\\Pairedt.csv", header=TRUE)
t.test(Pt$A, Pt$B, alternative="two.sided", mu=0, paired=TRUE, conf.level=0.95)

Output

        Paired t-test

data:  Pt$A and Pt$B
t = -0.1064, df = 9, p-value = 0.9176
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.558007  1.418007
sample estimates:
mean of the differences
                  -0.07
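A paired t-test is a one-sample t-test on the within-pair differences. The sketch below checks this on simulated pairs. The data are invented and are not the contents of Pairedt.csv.

```r
set.seed(3)
A <- rnorm(10, mean = 50, sd = 5)       # hypothetical first measurement
B <- A + rnorm(10, mean = 0, sd = 2)    # hypothetical second measurement

paired  <- t.test(A, B, paired = TRUE)
onesamp <- t.test(A - B, mu = 0)        # same test on the differences

paired$statistic
onesamp$statistic
paired$parameter    # df = number of pairs - 1
```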

Page 18: Introduction to Data Analysis in R

The lm Function

R organizes many linear models into the lm function.
Regression
ANOVA

Page 19: Introduction to Data Analysis in R

Regression

Consider the following dataset.
cherry <- read.csv("H:\\cherry.csv", header=TRUE)

plot(cherry$Height,cherry$Volume)

Page 20: Introduction to Data Analysis in R

Regression

Fit a regression model to the data.
lm(Volume ~ Height, data = cherry)

Output:

Call:
lm(formula = Volume ~ Height, data = cherry)

Coefficients:
(Intercept)       Height
    -87.124        1.543

Page 21: Introduction to Data Analysis in R

Regression

Maybe there is more…
cherry.lm <- lm(Volume ~ Height, data = cherry)
summary(cherry.lm)

Output:

Call:
lm(formula = Volume ~ Height, data = cherry)

Residuals:
    Min      1Q  Median      3Q     Max
-21.274  -9.894  -2.894  12.067  29.852

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.1236    29.2731  -2.976 0.005835 **
Height        1.5433     0.3839   4.021 0.000378 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.4 on 29 degrees of freedom
Multiple R-Squared: 0.3579,     Adjusted R-squared: 0.3358
F-statistic: 16.16 on 1 and 29 DF,  p-value: 0.0003784

Page 22: Introduction to Data Analysis in R

Regression

And more…
par(mfrow=c(2,2))

plot(cherry.lm)

Page 23: Introduction to Data Analysis in R

Regression

What all is there?
names(cherry.lm)

Output
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

Page 24: Introduction to Data Analysis in R

Regression

What all is there?
names(summary(cherry.lm))

Output
 [1] "call"          "terms"         "residuals"     "coefficients"
 [5] "aliased"       "sigma"         "df"            "r.squared"
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

And even more…
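The names listed above are components that can be pulled out for further computation. The sketch below is run on R's built-in trees data set (Girth, Height, Volume for 31 black cherry trees), used here as a stand-in for cherry.csv. That substitution is an assumption, since the seminar file is not reproduced here.

```r
# Stand-in data: the built-in trees data set (an assumption, see above)
fit <- lm(Volume ~ Height, data = trees)
s   <- summary(fit)

coef(fit)            # accessor for fit$coefficients
head(resid(fit))     # accessor for fit$residuals
head(fitted(fit))    # accessor for fit$fitted.values

s$r.squared          # multiple R-squared
s$sigma              # residual standard error
s$df                 # second element is the residual degrees of freedom
s$coefficients       # matrix: estimate, std. error, t value, p-value

# A single cell of the coefficient table, e.g. the p-value for Height
p_height <- s$coefficients["Height", "Pr(>|t|)"]
p_height
```

The accessor functions coef(), resid() and fitted() are usually preferred over reaching into the object with $, because they work across many model classes.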

Page 25: Introduction to Data Analysis in R

Regression

Plot the line

par(mfrow=c(1,1))
plot(cherry$Height, cherry$Volume)
abline(cherry.lm)

Page 26: Introduction to Data Analysis in R

Regression

Fit confidence intervals to line.
predict.lm(cherry.lm, interval="confidence")
Prediction intervals for a new data set can also be generated.
Influence measures can be generated as well. The influence.measures() function (all lower case) computes many of these.
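A sketch of intervals for new data, again on the built-in trees data set as an assumed stand-in for cherry.csv. The new heights are hypothetical. The generic predict() dispatches to predict.lm for lm objects.

```r
fit <- lm(Volume ~ Height, data = trees)     # trees: assumed stand-in data

new_trees <- data.frame(Height = c(70, 80))  # hypothetical new observations

# Interval for the mean response at each new height
conf_band <- predict(fit, newdata = new_trees,
                     interval = "confidence", level = 0.95)

# Interval for a single new tree at each height (always wider)
pred_band <- predict(fit, newdata = new_trees,
                     interval = "prediction", level = 0.95)

conf_band
pred_band

# Influence diagnostics (DFBETAS, DFFITS, Cook's distance, hat values)
infl <- influence.measures(fit)
head(infl$infmat)
```

The column names of the new data frame must match the predictor names in the model formula, otherwise predict() cannot find them.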

Page 27: Introduction to Data Analysis in R

Multiple Linear Regression

Matrix Plot
pairs(cherry)

Page 28: Introduction to Data Analysis in R

Multiple Linear Regression

To fit a multiple linear regression:
cherry.lm <- lm(Volume ~ Height + Diam, data=cherry)
summary(cherry.lm)

Call:
lm(formula = Volume ~ Height + Diam, data = cherry)

Residuals:
    Min      1Q  Median      3Q     Max
-6.4065 -2.6493 -0.2876  2.2003  8.4847

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Height        0.3393     0.1302   2.607   0.0145 *
Diam          4.7082     0.2643  17.816  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-Squared: 0.948,      Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF,  p-value: < 2.2e-16

Page 29: Introduction to Data Analysis in R

Regression

ANOVA
anova(cherry.lm)

Analysis of Variance Table

Response: Volume
          Df Sum Sq Mean Sq F value    Pr(>F)
Height     1 2901.2  2901.2  192.53 4.503e-14 ***
Diam       1 4783.0  4783.0  317.41 < 2.2e-16 ***
Residuals 28  421.9    15.1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What type of Sums of Squares are these?

Page 30: Introduction to Data Analysis in R

Regression

ANOVA
cherry.lm <- lm(Volume ~ Diam + Height, data=cherry)
anova(cherry.lm)

Analysis of Variance Table

Response: Volume
          Df Sum Sq Mean Sq  F value  Pr(>F)
Diam       1 7581.8  7581.8 503.1503 < 2e-16 ***
Height     1  102.4   102.4   6.7943 0.01449 *
Residuals 28  421.9    15.1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
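The two tables differ because anova() on an lm fit reports sequential (Type I) sums of squares: each row is the reduction in residual sum of squares from adding that term to the terms listed above it. The sketch below shows the order dependence on the built-in trees data set, used as an assumed stand-in for cherry.csv with Girth playing the role of Diam.

```r
# Same two predictors, entered in different orders (trees: assumed stand-in)
fit_hg <- lm(Volume ~ Height + Girth, data = trees)
fit_gh <- lm(Volume ~ Girth + Height, data = trees)

a_hg <- anova(fit_hg)
a_gh <- anova(fit_gh)

# Height entered first vs. Height entered after Girth
a_hg["Height", "Sum Sq"]
a_gh["Height", "Sum Sq"]

# The residual row and the total do not depend on the order
a_hg["Residuals", "Sum Sq"]
a_gh["Residuals", "Sum Sq"]

# drop1() tests each term adjusted for all the others, so it is
# order-independent and matches the "entered last" rows above
d <- drop1(fit_hg, test = "F")
d
```

Because of this, the row for the last term in an anova() table agrees with the t-test for that coefficient in summary(), while earlier rows generally do not.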

Page 31: Introduction to Data Analysis in R

Regression

Confidence Intervals
confint(cherry.lm, level = 0.97)
                   1.5 %      98.5 %
(Intercept) -77.73792775 -38.2373901
Diam          4.10395112   5.3123699
Height        0.04167615   0.6368263

Page 32: Introduction to Data Analysis in R

Model Search

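AIC
extractAIC(cherry.lm)
[1] 3.00000 86.93578
The first value is the equivalent degrees of freedom and the second value is the AIC.

Stepwise
step(cherry.lm)
This is based on AIC by default, but it can be set to use BIC with k = log(n), where n is the number of observations.
Can do forward, backward and stepwise ("both") selection. The default direction is "both", but when no scope is supplied, as in the call above, step() does backward elimination.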
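A sketch of the model search options, run on the built-in trees data set as an assumed stand-in for cherry.csv (Girth in place of Diam). The argument trace = 0 only suppresses the step-by-step printout.

```r
fit <- lm(Volume ~ Girth + Height, data = trees)   # trees: assumed stand-in
n   <- nrow(trees)

aic <- extractAIC(fit)               # c(equivalent df, AIC); penalty k = 2
bic <- extractAIC(fit, k = log(n))   # same fit criterion with the BIC penalty

# No scope given: step() can only remove terms (backward elimination)
back_aic <- step(fit, trace = 0)
back_bic <- step(fit, k = log(n), trace = 0)

# Forward selection: start from the intercept-only model and
# give the largest model allowed as the scope
null_fit <- lm(Volume ~ 1, data = trees)
fwd <- step(null_fit, scope = ~ Girth + Height,
            direction = "forward", trace = 0)

formula(back_aic)
formula(fwd)
```

Each call returns an ordinary lm object for the selected model, so summary(), anova() and predict() can be applied to it directly.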

Page 33: Introduction to Data Analysis in R

ANOVA

ANOVA
do2 <- read.csv("H:\\do2.csv", header=TRUE)
do2.aov <- aov(DO2 ~ Stream, data=do2)
summary(do2.aov)

Output
            Df  Sum Sq Mean Sq F value  Pr(>F)
Stream       2  37.056  18.528  4.5595 0.01898 *
Residuals   29 117.844   4.064
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Page 34: Introduction to Data Analysis in R

ANOVA

You can also use the lm framework
do2.lm <- lm(DO2 ~ Stream, data=do2)
anova(do2.lm)

Output
Analysis of Variance Table

Response: DO2
          Df  Sum Sq Mean Sq F value  Pr(>F)
Stream     2  37.056  18.528  4.5595 0.01898 *
Residuals 29 117.844   4.064
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
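The agreement between aov() and lm() can be checked directly. Since do2.csv is not reproduced here, the sketch uses the built-in PlantGrowth data set (weight by group) as a stand-in, which is an assumption.

```r
# PlantGrowth: built-in one-way layout with three groups (assumed stand-in)
pg.aov <- aov(weight ~ group, data = PlantGrowth)
pg.lm  <- lm(weight ~ group, data = PlantGrowth)

# Same F statistic from both routes
F_aov <- summary(pg.aov)[[1]][["F value"]][1]
F_lm  <- anova(pg.lm)[["F value"]][1]
all.equal(F_aov, F_lm)

# The lm coefficients show the default treatment coding: the intercept is
# the mean of the first group, the rest are differences from that group
coef(pg.lm)
ctrl_mean <- mean(PlantGrowth$weight[PlantGrowth$group == "ctrl"])
ctrl_mean
```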

Page 35: Introduction to Data Analysis in R

Multiple Comparisons

Tukey's HSD Procedure
TukeyHSD(do2.aov)

Output

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = DO2 ~ Stream, data = do2)

$Stream
                      diff       lwr         upr     p adj
Piney-Cedar      0.4966667 -1.634953  2.62828674 0.8342003
South Run-Cedar -2.0533333 -4.184953  0.07828674 0.0607655
South Run-Piney -2.5500000 -4.776405 -0.32359544 0.0221846

Page 36: Introduction to Data Analysis in R

Multiple Comparisons

Tukey’s HSD Procedure using the plot command.

do2.HSD <- TukeyHSD(do2.aov)
plot(do2.HSD)

Page 37: Introduction to Data Analysis in R

Multiple Comparisons

Tukey's HSD Procedure
do2.HSD <- TukeyHSD(do2.aov, conf.level = 0.97)
do2.HSD

Output

  Tukey multiple comparisons of means
    97% family-wise confidence level

Fit: aov(formula = DO2 ~ Stream, data = do2)

$Stream
                      diff       lwr        upr     p adj
Piney-Cedar      0.4966667 -1.832410  2.8257437 0.8342003
South Run-Cedar -2.0533333 -4.382410  0.2757437 0.0607655
South Run-Piney -2.5500000 -4.982642 -0.1173584 0.0221846
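As the two outputs show, changing conf.level widens the intervals but leaves the adjusted p-values alone. The sketch below checks this on the built-in PlantGrowth data set, an assumed stand-in for do2.csv, and shows how to reach the comparison table programmatically.

```r
pg.aov <- aov(weight ~ group, data = PlantGrowth)   # assumed stand-in data

# TukeyHSD() returns a list with one matrix per factor, named after the factor
hsd95 <- TukeyHSD(pg.aov)$group
hsd97 <- TukeyHSD(pg.aov, conf.level = 0.97)$group

width95 <- hsd95[, "upr"] - hsd95[, "lwr"]
width97 <- hsd97[, "upr"] - hsd97[, "lwr"]

width95
width97             # wider at the higher confidence level
hsd95[, "p adj"]    # identical at both levels

# Pairs whose 95% family-wise interval excludes zero
rownames(hsd95)[hsd95[, "lwr"] > 0 | hsd95[, "upr"] < 0]
```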

Page 38: Introduction to Data Analysis in R

Multiple Comparisons

Plot it.

Page 39: Introduction to Data Analysis in R

Kruskal Wallis Test

Kruskal-Wallis Test
kruskal.test(DO2 ~ Stream, data=do2)

Output

        Kruskal-Wallis rank sum test

data:  DO2 by Stream
Kruskal-Wallis chi-squared = 7.5097, df = 2, p-value = 0.02340

Page 40: Introduction to Data Analysis in R

Loess Regression

Regression from a non-parametric framework
cherry.lo <- loess(Volume ~ Diam, data=cherry)
cherry.lo$fitted

Output
 [1]  9.358813 10.471276 11.213760 17.518128 18.258269 18.629063 19.368892
 [8] 19.368892 19.742982 20.117965 20.503236 20.890123 20.890123 21.950202
[15] 22.930251 26.020681 26.020681 27.657015 29.364601 29.799478 30.701120
[22] 31.658354 33.187952 41.983894 43.878359 50.561902 51.968665 54.854374
[29] 55.590857 55.590857 76.749972

Page 41: Introduction to Data Analysis in R

Loess Regression

cherry.plo <- predict(cherry.lo, cherry$Diam)
plot(cherry$Diam, cherry$Volume)
lines(cherry$Diam, cherry.plo)
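The lines() call above draws the curve correctly only when the x values are already sorted. A sketch that avoids this by predicting on an evenly spaced grid, and that also varies the smoothing span, follows. It uses the built-in trees data set as an assumed stand-in for cherry.csv, with Girth in place of Diam.

```r
# Default span is 0.75; a larger span gives a smoother curve
lo_default <- loess(Volume ~ Girth, data = trees)
lo_smooth  <- loess(Volume ~ Girth, data = trees, span = 1)

# Evenly spaced, sorted x values inside the observed range
grid <- data.frame(Girth = seq(min(trees$Girth), max(trees$Girth),
                               length.out = 100))

pred_default <- predict(lo_default, newdata = grid)
pred_smooth  <- predict(lo_smooth,  newdata = grid)

plot(trees$Girth, trees$Volume)
lines(grid$Girth, pred_default)            # solid: default span
lines(grid$Girth, pred_smooth, lty = 2)    # dashed: span = 1
```

With its default settings, loess() does not extrapolate, so predictions outside the observed range of the predictor come back as NA. That is why the grid is kept inside the data.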

Page 42: Introduction to Data Analysis in R

Summary

R is a programming environment with many standard statistical procedures already included.
Easy to control the procedure you are interested in.
No support.
Allows the user to modify many existing procedures.

Page 43: Introduction to Data Analysis in R

Summary

All of the R code and files can be found at:

www.people.vcu.edu/~elboone2/CSS.htm