introduction on r
TRANSCRIPT
-
7/31/2019 Introduction on R
1/95
8/29/12
-
R is a free software environment for statistical computing and graphics.
R is an integrated suite of software facilities for data manipulation, simulation, calculation and graphical display. It handles and analyzes data very effectively, and it contains a suite of operators for calculations on arrays and matrices.
-
Note:
1. R is case sensitive.
2. All alphanumeric symbols are allowed in names, plus . and _
3. A name must start with . or a letter, and if it starts with . the second character must not be a digit.
Example: a name starting with a digit is not accepted. You can instead use .
-
Note:
4. Commands are separated either by a semicolon (;) or by a newline.
5. Comments can be put almost anywhere: starting with a hash mark (#), everything to the end of the line is a comment.
6. Do not use names of variables in a data frame as names of objects. If you do so, the object will shadow the variable with the same name in the data frame.
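As a small illustration of the notes on names, separators and comments, here is a sketch that uses the base R function make.names() as a validity check. The candidate names are made up for the example.

```r
# Case sensitivity: x and X are two different objects.
x <- 1; X <- 2      # two commands on one line, separated by a semicolon
x == X              # FALSE

# make.names() rewrites strings that are not valid names,
# so a candidate is valid when it comes back unchanged.
candidates <- c("x", ".x", "my_var", "2x", ".2x")   # hypothetical names
valid <- candidates == make.names(candidates)
names(valid) <- candidates
valid
```

The last two candidates fail: one starts with a digit, the other starts with . followed by a digit.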
-
A Simple Example: the c() Function
For example, suppose the following represents eight tosses of a fair die:
2 5 1 6 5 5 4 1
COMMAND:
> dieroll <- c(2, 5, 1, 6, 5, 5, 4, 1)
> dieroll
[1] 2 5 1 6 5 5 4 1
-
When entering commands in R, you can save yourself a lot of typing when you learn to use the arrow keys effectively. Each command you submit is stored in the History, and the up arrow will navigate backwards along this history and the down arrow forwards. The left and right arrow keys move backwards and forwards along the command line.
-
The Workspace
All variables or objects created in R are stored in what's called the workspace. To see what variables are in the workspace, you can use the function ls() to list them (this function doesn't need any argument between the parentheses).
-
The Workspace
Currently, we only have:
> ls()
[1] "dieroll"
-
The Workspace
If we define a new variable, a simple function of the variable dieroll, it will be added to the workspace:
> newdieroll <- dieroll/2
> newdieroll
[1] 1.0 2.5 0.5 3.0 2.5 2.5 2.0 0.5
> ls()
[1] "dieroll" "newdieroll"
-
The Workspace
To remove objects from the workspace (you'll want to do this occasionally when your workspace gets too cluttered), use the rm() function:
> rm(newdieroll) # this was a silly variable anyway
> ls()
[1] "dieroll"
-
Get in the habit of saving your work; it will probably help you in the future.
-
Getting Help
using the function help()
> help(log)
-
Call help for matrix.
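Besides help(), base R has a search function, apropos(), that lists object names matching a pattern. The sketch below assumes the default packages are attached, as they are in a normal R session.

```r
# help(matrix) and ?matrix both open the documentation page for matrix().
# apropos() returns the names of visible objects that match a pattern.
hits <- apropos("test")
head(hits)
"t.test" %in% hits   # TRUE when the stats package is attached (the default)
```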
-
> a <- ...
> A <- ...
> A
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
-
> b <- ...
> B <- ...
> B
     [,1] [,2] [,3] [,4]
[1,]    2    6   10   14
-
> a <- ...
> A <- ...
> A
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
-
> B <- ...
> B
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10
-
> C <- ...
> C
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
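The commands that built the matrices on these slides did not survive in the transcript. The following is one possible way to obtain the same layouts with matrix(); the exact calls are an assumption, not necessarily the ones used on the slides.

```r
# Column-wise fill (the default): values run down each column.
A <- matrix(1:8, nrow = 2, ncol = 4)

# Row-wise fill: values run across each row.
B <- matrix(1:10, ncol = 2, byrow = TRUE)

# Transposing a 5 x 2 column-filled matrix gives a 2 x 5 layout.
C <- t(matrix(1:10, ncol = 2))

A
B
C
```

The byrow argument is the only difference between the two 5 x 2 layouts.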
-
Exercises #1
1. Use the help system to find information on the R functions mean and median.
2. Get a list of all the functions in R that contain the string test.
3. Create the vector myself containing info on your age, height (in inches/cm), and phone number.
4. Create the vector myfamily containing the names of your father, mother, brothers & sisters.
1 0 0
0 1 0
0 0 1
-
Data Management
-
Data Management
> myclassmates <- c("Stephen", "Christopher")
> myclassmates
[1] "Stephen" "Christopher"
-
Sequences
Sometimes we will need to create a string of numerical values that have a regular pattern. Instead of typing the sequence out, we can define the pattern using some special operators and functions.
1. Colon operator
2. Sequence function seq()
-
Sequences
The colon operator creates a vector of numbers (between two specified numbers) that are one unit apart:
> 1:9
[1] 1 2 3 4 5 6 7 8 9
-
Sequences
> c(1.5:10,10)
 [1]  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.0
-
Sequences
The sequence function can create a string of values with any increment you wish. You can either specify the incremental value or the desired length of the sequence:
-
Sequences
> seq(1,5) #same as 1:5
[1] 1 2 3 4 5
> seq(1,5,by=.5) #increment by 0.5
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
-
Sequences
> seq(1,6,by=.5)
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
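The slides show seq() with an increment; specifying the desired length instead can be sketched as follows, using the length.out argument of seq(). The variable names are illustrative only.

```r
by_step   <- seq(1, 5, by = 0.5)         # fix the increment
by_length <- seq(1, 5, length.out = 9)   # fix the number of values instead
all.equal(by_step, by_length)            # TRUE: both describe the same sequence

# The colon operator always steps by 1 from its left-hand value,
# so 1.5:10 stops at 9.5, the last value that does not exceed 10.
1.5:10
```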
-
Sequences
The replicate function rep() can repeat a value or a sequence of values a specified number of times.
-
Sequences
How to repeat the value 10 ten times?
> rep(10,10)
 [1] 10 10 10 10 10 10 10 10 10 10
-
Sequences
How to repeat the string A, B, C, D twice?
> rep(c("A","B","C","D"),2)
[1] "A" "B" "C" "D" "A" "B" "C" "D"
-
Sequences
How to make a 4x4 matrix of zeroes?
> matrix(rep(0,16),nrow=4)
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0
[4,]    0    0    0    0
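Two variations on the rep() examples may be useful: the each argument repeats element by element, and matrix() recycles a single value, so the rep() call is optional when building a constant matrix. This is a sketch with illustrative variable names.

```r
whole <- rep(c("A", "B"), times = 2)   # repeat the whole vector
elem  <- rep(c("A", "B"), each = 2)    # repeat each element in turn
whole
elem

# matrix() recycles the single value 0 to fill all 16 cells.
zeros <- matrix(0, nrow = 4, ncol = 4)
identical(zeros, matrix(rep(0, 16), nrow = 4))
```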
-
Reading in Data: Single Vectors
Since using c() can sometimes be tiresome, data can also be entered directly by using the scan() function.
-
Reading in Data: Single Vectors
Suppose that we count the number of passengers (not including the driver) in the next 10 automobiles at an intersection:
2 4 0 1 1 2 3 1 0 4
-
Reading in Data: Single Vectors
> passengers <- scan()
-
Reading in Data: Single Vectors
How to print out the values of passengers?
> passengers
 [1] 2 4 0 1 1 2 3 1 0 4
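Interactively, scan() prompts for values until an empty line is entered. For a script, the same values can be supplied through the text argument of scan(); the sketch below assumes that non-interactive variant.

```r
# scan() splits the text on whitespace and returns a numeric vector.
passengers <- scan(text = "2 4 0 1 1 2 3 1 0 4")
passengers
length(passengers)   # 10 automobiles were observed
```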
-
How to Create Data Frames?
Individual variables are designated as columns of the data frame and have unique names. However, all of the columns in a data frame must be of the same length.
-
How to Create Data Frames?
Suppose that in the last experiment we also recorded the seatbelt use of the driver: Y = seatbelt worn, N = seatbelt not worn. Data:
Y, N, Y, Y, Y, Y, Y, Y, Y
-
How to Create Data Frames?
Note: Since these data are text based, we need to put quotes around each data value.
> seatbelt <- ...
> seatbelt
[1] "Y" "N" "Y" "Y" "Y" "Y"
-
How to Create Data Frames?
How to combine the variables passengers and seatbelt into a single data frame?
> car.dat <- data.frame(passengers, seatbelt)
-
> car.dat
  passengers seatbelt
1          2        Y
2          4        N
3          0        Y
4          1        Y
5          1        Y
6          2        Y
7          3        Y
8          1        Y
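A self-contained sketch of the same construction, restricted to the eight rows that are visible in the printed data frame (the remaining rows are cut off in the transcript and are not guessed here). The last lines illustrate the equal-length requirement with made-up columns a and b.

```r
passengers <- c(2, 4, 0, 1, 1, 2, 3, 1)                    # first 8 counts
seatbelt   <- c("Y", "N", "Y", "Y", "Y", "Y", "Y", "Y")    # first 8 responses
car.dat <- data.frame(passengers, seatbelt)
car.dat

# Columns are reached with $; here, the passenger count of the unbelted driver.
car.dat$passengers[car.dat$seatbelt == "N"]

# Columns of unequal length are rejected with an error.
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error")
```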
-
NOTE: when using a data frame, all of the columns in the data frame must be of the same length.
-
ANOTHER WAY OF ENCODING DATA
How to use the spreadsheet?
You can access the editor by using either the edit() or fix() command.
-
ANOTHER WAY OF ENCODING DATA
> new.data <- data.frame()
> new.data <- edit(new.data)
-
Encode the educators data. Use the following variables only:
job sex car salary
-
> educators <- data.frame()
> educators <- edit(educators)
-
I want to edit my data. What should I do?
> new.data <- edit(educators)
-
How to save a file?
Click the save icon.
-
How to load an object in the next session?
Click on File, then Load Workspace.
Click on the file you want to open.
Then, type
anyname
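The menu steps have command-line counterparts in base R: save() and load(). The sketch below writes to a temporary file, which is an assumption made so that the example runs anywhere; in practice you would choose your own file name.

```r
dieroll <- c(2, 5, 1, 6, 5, 5, 4, 1)
path <- tempfile(fileext = ".RData")   # hypothetical file location

save(dieroll, file = path)   # save.image(path) would save the whole workspace
rm(dieroll)                  # simulate starting a new session

loaded <- load(path)         # load() returns the names of the restored objects
loaded
dieroll                      # typing the name prints the restored object
```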
-
SUMMARIZING DATA
-
Numerical Summaries
Name          Operation
mean()        arithmetic mean
median()      sample median
fivenum()     five-number summary
summary()     generic summary function for data and model fits
min(), max()  smallest/largest values
quantile()    calculate sample quantiles (percentiles)
var(), sd()   sample variance, sample standard deviation
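To see the functions in the table at work, here is a short sketch applied to the passengers counts from the scan() example earlier in the deck.

```r
passengers <- c(2, 4, 0, 1, 1, 2, 3, 1, 0, 4)

mean(passengers)      # 1.8
median(passengers)    # 1.5
fivenum(passengers)   # minimum, lower hinge, median, upper hinge, maximum
summary(passengers)
min(passengers); max(passengers)
quantile(passengers, probs = c(0.25, 0.75))
var(passengers); sd(passengers)
```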
-
To refer to the variables of a data set by name, use
> attach(object) # add the data in object to the search path
-
Numerical Summaries
> mean(salary)
[1] 62350.79
> table(sex)
sex
0 1
8 6
-
Numerical Summaries
If there are missing values, use:
mean(x, na.rm=TRUE)
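A minimal sketch of what na.rm changes, using a made-up vector x with one missing value:

```r
x <- c(2, 4, NA, 1)      # hypothetical data with one missing value

mean(x)                  # NA: a missing value propagates by default
mean(x, na.rm = TRUE)    # mean of the three observed values
sum(is.na(x))            # number of missing values
```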
-
GRAPHS
-
Type
> sex.freq <- table(sex)
> sex.freq
sex
0 1
8 6
-
> cbind(sex.freq)
  sex.freq
0        8
1        6
-
> barplot(sex.freq)
-
> pie(sex.freq)
-
> attach(educators)
> sexfreq <- table(sex)
> barplot(sexfreq)
-
> hist(car, right=FALSE)
-
> boxplot(salary, horizontal=FALSE) # vertical boxplot (the default)
-
> boxplot(salary, horizontal=TRUE)
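The educators data are not part of the transcript, so the plotting calls can be tried with stand-ins: the sketch below rebuilds the sex codes from the counts shown earlier (8 zeros, 6 ones) and borrows the Height column of the built-in trees data for the boxplot. Titles and labels are illustrative.

```r
# Rebuild a sex vector that yields the counts shown on the slides.
sex <- rep(c(0, 1), times = c(8, 6))
sex.freq <- table(sex)

barplot(sex.freq, main = "Counts by sex code")
pie(sex.freq)

# boxplot() returns its summary statistics invisibly.
b <- boxplot(trees$Height, horizontal = TRUE, xlab = "Height (ft)")
b$stats
```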
-
Statistical Inference
-
How to access built-in datasets?
Type
> data()
Output
-
How to access built-in datasets?
Is there a dataset with filename trees?
Use the dataset trees
Type:
> data(trees)
-
How to access built-in datasets?
> trees
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
-
One sample t-test
Using the trees dataset, test the hypothesis that the mean black cherry tree height is 70 ft. versus a two-sided alternative.
> data(trees)
> t.test(trees$Height, mu = 70)
-
One sample t-test
        One Sample t-test
data: Height
t = 5.2429, df = 30, p-value = 1.173e-05
alternative hypothesis: true mean is not equal to 70
95 percent confidence interval:
 73.6628 78.3372
sample estimates:
mean of x
       76
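The printed report can also be taken apart programmatically, because t.test() returns a list. A short sketch on the same built-in data:

```r
res <- t.test(trees$Height, mu = 70)   # two-sided by default

res$statistic   # t statistic
res$parameter   # degrees of freedom
res$p.value
res$conf.int    # 95 percent confidence interval for the mean
res$estimate    # sample mean
```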
-
Two-sample t-test
The recovery time (in days) is measured for 10 patients taking a new drug and for 10 different patients taking a placebo. We wish to test the hypothesis that the mean recovery time for patients taking the drug is less than for those taking a placebo (under an assumption of normality and equal population variances). The data are:
With drug: 15, 10, 13, 7, 9, 8, 21, 9, 14, 8
-
Two-sample t-test
> drug <- c(15, 10, 13, 7, 9, 8, 21, 9, 14, 8)
> plac <- ...
> t.test(drug, plac, alternative = "less", var.equal = TRUE)
-
Two-sample t-test
        Two Sample t-test
data: drug and plac
t = -0.5331, df = 18, p-value = 0.3002
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
 -Inf 2.027436
sample estimates:
mean of x mean of y
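A runnable sketch of this test is given below. The drug values come from the slide. The placebo values are missing from the transcript, so the ones used here are an assumption: they were chosen because they reproduce the t statistic and degrees of freedom printed above, and they should not be read as the original data.

```r
drug <- c(15, 10, 13, 7, 9, 8, 21, 9, 14, 8)       # from the slide
plac <- c(15, 14, 12, 8, 14, 7, 16, 10, 15, 12)    # ASSUMED, not in the transcript

# One-sided test with pooled variance (var.equal = TRUE).
res <- t.test(drug, plac, alternative = "less", var.equal = TRUE)
res$statistic
res$parameter
res$p.value
```

With a p-value of about 0.3, there is no evidence that the drug shortens recovery time.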
-
Two-sample t-test
-
Paired t-test
An experiment was performed to determine if a new gasoline additive can increase the gas mileage of cars. In the experiment, six cars are selected and driven with and without the additive. The gas mileages (in miles per gallon, mpg) are given below.
Car 1 2 3 4 5 6
-
Paired t-test
> add <- ...
> noadd <- ...
> t.test(add, noadd, paired = TRUE, alternative = "greater")
-
        Paired t-test
data: add and noadd
t = 3.9994, df = 5, p-value = 0.005165
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.3721225 Inf
sample estimates:
mean of the differences
-
ANOVA
> aov(x ~ a) # one-way ANOVA model
> aov(x ~ a + b) # two-way ANOVA with no interaction
> aov(x ~ a + b + a:b) # two-way ANOVA with interaction
> aov(x ~ a*b) # exactly the same as the line above
-
ANOVA
The strength of three different rubber compounds was measured; four specimens of each type were tested for their tensile strength (measured in pounds per square inch):
-
ANOVA
> str <- ...
> type <- ...
> type
[1] A A A A B B B B C C C C
-
ANOVA
To calculate the sample means of the subgroups, type
> tapply(str,type,mean)
      A       B       C
3213.75 3330.00 3552.50
-
ANOVA
To calculate the variances:
> tapply(str,type,var)
       A        B        C
6172.917 6733.333 2541.667
-
ANOVA
> anova.fit <- aov(str ~ type)
-
ANOVA
To extract the ANOVA table, use the R function summary():
> summary(anova.fit)
-
ANOVA
            Df Sum Sq Mean Sq F value   Pr(>F)
type         2 237029  118515   23.02 0.000289 ***
Residuals    9  46344    5149
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
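The whole one-way analysis can be sketched end to end as follows. The twelve strength values are not in the transcript; the ones below are an assumption, chosen because they reproduce the group means, the variances and the F value shown on the slides.

```r
# ASSUMED strength values (psi), four specimens per compound; not from the slides.
str <- c(3225, 3320, 3165, 3145,    # type A
         3220, 3410, 3320, 3370,    # type B
         3545, 3600, 3580, 3485)    # type C
type <- factor(rep(c("A", "B", "C"), each = 4))

group.means <- tapply(str, type, mean)
group.means

anova.fit <- aov(str ~ type)
tab <- summary(anova.fit)[[1]]   # the ANOVA table as a data frame
tab
```

A small p-value in the type row indicates that at least one compound differs in mean strength.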
-
Multiple comparison test
> TukeyHSD(anova.fit)
Tukey multiple comparisons of
-
means
95% family-wise confidence level
Fit: aov(formula = str ~ type)
$type
      diff       lwr      upr     p adj
B-A 116.25 -25.41926 257.9193 0.1085202
-
LINEAR REGRESSION
> lm(y ~ x) # simple linear regression (SLR) model
> lm(y ~ x1 + x2) # a regression plane
> lm(y ~ x1 + x2 + x3) # linear model with three regressors
> lm(y ~ x - 1) # SLR w/ an intercept of zero
> lm(y ~ x + I(x^2)) # quadratic regression
-
LINEAR REGRESSION
Consider the cars dataset. The data give the speed (speed) of cars and the distances (dist) taken to come to a complete stop. Fit a linear regression model using speed as the independent variable and dist as the dependent variable.
-
> names(cars)
-
LINEAR REGRESSION
> fit <- lm(dist ~ speed)
> fit
-
Call:
lm(formula = dist ~ speed)
Coefficients:
(Intercept)        speed
    -17.579        3.932
-
LINEAR REGRESSION
> summary(fit)
Call:
-
lm(formula = dist ~ speed)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
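The same fit can be reproduced without attaching the data by passing data = cars to lm(). The prediction at a speed of 20 is an illustrative extra, not something shown on the slides.

```r
fit <- lm(dist ~ speed, data = cars)   # cars is a built-in dataset

coef(fit)                                          # intercept and slope
predict(fit, newdata = data.frame(speed = 20))     # fitted stopping distance
summary(fit)$r.squared                             # proportion of variance explained
```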
-
> anova(fit)
-
How to retrieve EMPLOYEEALL data from a txt file?
> emp <- ...
-
How to rename a variable?
> names(emp)
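Reading a text file and renaming a column can be sketched with read.table() and names(). The EMPLOYEEALL file is not available here, so the example writes a tiny made-up file first; its columns id and salary and the new name pay are hypothetical.

```r
# Create a small whitespace-separated text file to stand in for the real one.
path <- tempfile(fileext = ".txt")
writeLines(c("id salary", "1 50000", "2 62000"), path)   # made-up contents

emp <- read.table(path, header = TRUE)   # first line holds the column names
names(emp)

# Rename by assigning into names(): select the old name, supply the new one.
names(emp)[names(emp) == "salary"] <- "pay"
names(emp)
```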