assignment-3 using r for correlation and regression analysis. · 2017. 9. 17. · assignment-3...

18
Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous value, such as the volume of rain Classification: to predict a categorical class label, such as weather: rainy, sunnny, cloudy or snowy. Regression is to build a function of independent variables (also known as predictors) to predict a dependent variable (also called response). For example, banks assess the risk of home-loan applicants based on their age, income, expenses, occupation, number of dependents, total credit limit, etc. Linear regression models Generalized linear models (GLM) Linear regression is to predict response with a linear function of predictors as follows: y = c0 + c1x1 + c2x2 + _ _ _ + ckxk ; where x1; x2; _ _ _ ; xk are predictors, y is the response to predict, and c0; c1; _ _ _ ; ck are coefficients to learn. Linear regression in R: lm() eg.The Australian Consumer Price Index (CPI) data: quarterly CPIs from 2008 to 2010 z > year <- rep(2008:2010, each = 4) > quarter <- rep(1:4, 3) > cpi <- c(162.2, 164.6, 166.5, 166, 166.2, 167, 168.6, 169.5, 171, + 172.1, 173.3, 174) > plot(cpi, xaxt = "n", ylab = "CPI", xlab = "") > axis(1, labels = paste(year, quarter, sep = "Q"), at = 1:12, las = 3) ## correlation between CPI and year / quarter > cor(year, cpi) [1] 0.9096316 > cor(quarter, cpi) [1] 0.3738028 ## build a linear regression model with function lm() > fit <- lm(cpi ~ year + quarter) > fit Call: lm(formula = cpi ~ year + quarter) Coefficients: (Intercept) year quarter -7644.488 3.888 1.167 With the above linear model, CPI is calculated as cpi = c0 + c1*year + c2*quarter; where c0, c1 and c2 are coefficients from model fit. What will the CPI be in 2011? > cpi2011 <- fit$coefficients[[1]] + + fit$coefficients[[2]] * 2011 + + fit$coefficients[[3]] * (1:4) An easier way is to use function predict(). > attributes(fit) $names [1] "coefficients" "residuals" "effects" "rank" [5] "fitted.values" "assign" "qr" "df.residual" [9] "xlevels" "call" "terms" "model" $class [1] "lm"

Upload: others

Post on 16-Oct-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

Assignment-3

Using R for Correlation and Regression analysis.

Regression: to predict a continuous value, such as the volume of rain Classification: to predict a categorical class label, such as weather: rainy, sunnny, cloudy or snowy. Regression is to build a function of independent variables (also known as predictors) to predict a dependent variable (also called response). For example, banks assess the risk of home-loan applicants based on their age, income, expenses, occupation, number of dependents, total credit limit, etc. Linear regression models Generalized linear models (GLM) Linear regression is to predict response with a linear function of predictors as follows: y = c0 + c1x1 + c2x2 + _ _ _ + ckxk ; where x1; x2; _ _ _ ; xk are predictors, y is the response to predict, and c0; c1; _ _ _ ; ck are coefficients to learn. Linear regression in R: lm() eg.The Australian Consumer Price Index (CPI) data: quarterly CPIs from 2008 to 2010 z > year <- rep(2008:2010, each = 4) > quarter <- rep(1:4, 3) > cpi <- c(162.2, 164.6, 166.5, 166, 166.2, 167, 168.6, 169.5, 171, + 172.1, 173.3, 174) > plot(cpi, xaxt = "n", ylab = "CPI", xlab = "") > axis(1, labels = paste(year, quarter, sep = "Q"), at = 1:12, las = 3) ## correlation between CPI and year / quarter > cor(year, cpi) [1] 0.9096316 > cor(quarter, cpi) [1] 0.3738028 ## build a linear regression model with function lm() > fit <- lm(cpi ~ year + quarter) > fit Call: lm(formula = cpi ~ year + quarter) Coefficients: (Intercept) year quarter -7644.488 3.888 1.167 With the above linear model, CPI is calculated as cpi = c0 + c1*year + c2*quarter; where c0, c1 and c2 are coefficients from model fit. What will the CPI be in 2011? > cpi2011 <- fit$coefficients[[1]] + + fit$coefficients[[2]] * 2011 + + fit$coefficients[[3]] * (1:4) An easier way is to use function predict(). > attributes(fit) $names [1] "coefficients" "residuals" "effects" "rank" [5] "fitted.values" "assign" "qr" "df.residual" [9] "xlevels" "call" "terms" "model" $class [1] "lm"

Page 2: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

Function residuals(): differences between observed values and Observed values.

> residuals(fit) 1 2 3 4 5 6 -0.57916667 0.65416667 1.38750000 -0.27916667 -0.46666667 -0.83333333 7 8 9 10 11 12 -0.40000000 -0.66666667 0.44583333 0.37916667 0.41250000 -0.05416667 > summary(fit) Call: lm(formula = cpi ~ year + quarter) Residuals: Min 1Q Median 3Q Max -0.8333 -0.4948 -0.1667 0.4208 1.3875 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -7644.4875 518.6543 -14.739 1.31e-07 *** year 3.8875 0.2582 15.058 1.09e-07 *** quarter 1.1667 0.1885 6.188 0.000161 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.7302 on 9 degrees of freedom Multiple R-squared: 0.9672, Adjusted R-squared: 0.9599 F-statistic: 132.5 on 2 and 9 DF, p-value: 2.108e-07

3D Plot of the Fitted Model > library(scatterplot3d) > s3d <- scatterplot3d(year, quarter, cpi, highlight.3d = T, type = "h", + lab = c(2, 3)) > s3d$plane3d(fit) Prediction of CPIs in 2011 data2011 <- data.frame(year = 2011, quarter = 1:4) cpi2011 <- predict(fit, newdata = data2011) style <- c(rep(1, 12), rep(2, 4)) plot(c(cpi, cpi2011), xaxt = "n", ylab = "CPI", xlab = "", pch = style, col = style) axis(1, at = 1:16, las = 3, labels = c(paste(year, quarter, sep = "Q"), "2011Q1", "2011Q2", "2011Q3", "2011Q4")) Iris Example-- Decision Tree > str(iris) 'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... > set.seed(1234) > ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3)) > train.data <- iris[ind == 1, ] > test.data <- iris[ind == 2, ] Build a ctree I Control the training of decision trees: MinSplit, MinBusket,MaxSurrogate and MaxDepth I Target variable: Species I Independent variables: all other variables > myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + + Petal.Width > iris_ctree <- ctree(myFormula, data = train.data) > table(predict(iris_ctree), train.data$Species)

Page 3: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

setosa versicolor virginica setosa 40 0 0 versicolor 0 37 3 virginica 0 1 31 > print(iris_ctree) Conditional inference tree with 4 terminal nodes Response: Species Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width Number of observations: 112 1) Petal.Length <= 1.9; criterion = 1, statistic = 104.643 2)* weights = 40 1) Petal.Length > 1.9 3) Petal.Width <= 1.7; criterion = 1, statistic = 48.939 4) Petal.Length <= 4.4; criterion = 0.974, statistic = 7.397 5)* weights = 21 4) Petal.Length > 4.4 6)* weights = 19 3) Petal.Width > 1.7 7)* weights = 32 > plot(iris_ctree) > plot(iris_ctree, type = "simple") > # predict on test data > testPred <- predict(iris_ctree, newdata = test.data) > table(testPred, test.data$Species) testPred setosa versicolor virginica setosa 10 0 0 versicolor 0 12 2 virginica 0 0 14 Package randomForest very fast cannot handle data with missing values a limit of 32 to the maximum number of levels of each categorical attribute extensions: extendedForest, gradientForest Package party: cforest() not limited to the above maximum levels slow needs more memory > ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3)) > train.data <- iris[ind==1,] > test.data <- iris[ind==2,] > library(randomForest) > library(randomForest) randomForest 4.6-12 Type rfNews() to see new features/changes/bug fixes. Attaching package: ‘randomForest’ The following object is masked from ‘package:ggplot2’: margin > rf <- randomForest(Species ~ ., data=train.data, ntree=100, + proximity=T) > table(predict(rf), train.data$Species) setosa versicolor virginica

Page 4: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

setosa 33 0 0 versicolor 0 32 3 virginica 0 2 35 > print(rf) Call: randomForest(formula = Species ~ ., data = train.data, ntree = 100, proximity = T) Type of random forest: classification Number of trees: 100 No. of variables tried at each split: 2 OOB estimate of error rate: 4.76% Confusion matrix: setosa versicolor virginica class.error setosa 33 0 0 0.00000000 versicolor 0 32 2 0.05882353 virginica 0 3 35 0.07894737 > plot(rf, main = "") > importance(rf) MeanDecreaseGini Sepal.Length 6.537014 Sepal.Width 1.698005 Petal.Length 29.369949 Petal.Width 31.466699 > varImpPlot(rf) > irisPred <- predict(rf, newdata = test.data) > table(irisPred, test.data$Species) irisPred setosa versicolor virginica setosa 17 0 0 versicolor 0 16 3 virginica 0 0 9 > plot(margin(rf, test.data$Species))

Chi Square Correlation It is Test of Independence

Two random variables x and y are called independent if the probability distribution of one variable is not affected by the presence of another. Assume fij is the observed frequency count of events belonging to both i-th category of x and j-th category of y. Also assume eij to be the corresponding expected count if x and y are independent. The null hypothesis of the independence assumption is to be rejected if the p-value of the following Chi-squared test statistics is less than a given significance level α.

Example In the built-in data set survey, the Smoke column records the students smoking habit, while the Exer column records their exercise level. The allowed values in Smoke are "Heavy", "Regul" (regularly), "Occas" (occasionally) and "Never". As for Exer, they are "Freq" (frequently), "Some" and "None". We can tally the students smoking habit against the exercise level with the table function in R. The result is called the contingency table of the two variables. > library(MASS) > tbl = table(survey$Smoke, survey$Exer) > tbl Freq None Some Heavy 7 1 3 Never 87 18 84 Occas 12 3 4 Regul 9 1 7

Page 5: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

Problem

Test the hypothesis whether the students smoking habit is independent of their exercise level at .05 significance level.

Solution We apply the chisq.test function to the contingency table tbl, and found the p-value to be 0.4828. > chisq.test(tbl) Pearson's Chi-squared test data: tbl X-squared = 5.4885, df = 6, p-value = 0.4828 Warning message: In chisq.test(tbl) : Chi-squared approximation may be incorrect

Answer

As the p-value 0.4828 is greater than the .05 significance level, we do not reject the null hypothesis that the smoking habit is independent of the exercise level of the students.

Enhanced Solution The warning message found in the solution above is due to the small cell values in the contingency table. To avoid such warning, we combine the second and third columns of tbl, and save it in a new table named ctbl. Then we apply the chisq.test function against ctbl instead. > ctbl = cbind(tbl[,"Freq"], tbl[,"None"] + tbl[,"Some"]) > ctbl [,1] [,2] Heavy 7 4 Never 87 102 Occas 12 7 Regul 9 8 > chisq.test(ctbl) Pearson's Chi-squared test data: ctbl X-squared = 3.2328, df = 3, p-value = 0.3571

Multinomial Goodness of Fit

Multinomial Goodness of Fit

A population is called multinomial if its data is categorical and belongs to a collection of discrete non-overlapping classes.

The null hypothesis for goodness of fit test for multinomial distribution is that the observed

frequency fi is equal to an expected count ei in each category. It is to be rejected if the p-value of the following Chi-squared test statistics is less than a given significance level α.

Example In the built-in data set survey, the Smoke column records the survey response about the student’s smoking

habit. As there are exactly four proper response in the survey: "Heavy", "Regul" (regularly), "Occas" (occasionally) and "Never", the Smoke data is multinomial. It can be confirmed with the levels function in R. > library(MASS) > levels(survey$Smoke) [1] "Heavy" "Never" "Occas" "Regul"

As discussed in the tutorial Frequency Distribution of Qualitative Data, we can find the frequency distribution with the table function.

> smoke.freq = table(survey$Smoke)

Page 6: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

> smoke.freq Heavy Never Occas Regul 11 189 19 17

Problem Suppose the campus smoking statistics is as below. Determine whether the sample data in survey supports it at .05 significance level. Heavy Never Occas Regul 4.5% 79.5% 8.5% 7.5%

Solution We save the campus smoking statistics in a variable named smoke.prob. Then we apply the chisq.test function

and perform the Chi-Squared test. R version 3.4.1 (2017-06-30) -- "Single Candle" Copyright (C) 2017 The R Foundation for Statistical Computing Platform: i386-w64-mingw32/i386 (32-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > source('D:/Assignment_2_R-Work.R') [1] 10 > source('D:/Assignment_2_R-Work.R') [1] 10 Error in library("ggplot") : there is no package called ‘ggplot’ > > > source('D:/Assignment_2_R-Work.R') [1] 10 Error in library("ggplot2") : there is no package called ‘ggplot2’ > > > > > source('D:/Assignment_2_R-Work.R') > dim(iris) [1] 150 5 > source('D:/Assignment_2_R-Work.R') 'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... > source('D:/Assignment_2_R-Work.R') 'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... > > > > > >

Page 7: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

> > > source('D:/Assignment_2_R-Work.R') > names(iris) [1] "Sepal.Length" "Sepal.Width" "Petal.Length" [4] "Petal.Width" "Species" > str(iris) 'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... > attributes(iris) $names [1] "Sepal.Length" "Sepal.Width" "Petal.Length" [4] "Petal.Width" "Species" $row.names [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 [15] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 [29] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 [43] 43 44 45 46 47 48 49 50 51 52 53 54 55 56 [57] 57 58 59 60 61 62 63 64 65 66 67 68 69 70 [71] 71 72 73 74 75 76 77 78 79 80 81 82 83 84 [85] 85 86 87 88 89 90 91 92 93 94 95 96 97 98 [99] 99 100 101 102 103 104 105 106 107 108 109 110 111 112 [113] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 [141] 141 142 143 144 145 146 147 148 149 150 $class [1] "data.frame" > > > iris[1:3, ] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa > > > tail(iris,3) Sepal.Length Sepal.Width Petal.Length Petal.Width 148 6.5 3.0 5.2 2.0 149 6.2 3.4 5.4 2.3 150 5.9 3.0 5.1 1.8 Species 148 virginica 149 virginica 150 virginica > iris[1:10,"Sepal.Length"] [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 > summary(iris) Sepal.Length Sepal.Width Petal.Length Min. :4.300 Min. :2.000 Min. :1.000 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 Median :5.800 Median :3.000 Median :4.350 Mean :5.843 Mean :3.057 Mean :3.758 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 Max. :7.900 Max. :4.400 Max. :6.900 Petal.Width Species Min. :0.100 setosa :50 1st Qu.:0.300 versicolor:50 Median :1.300 virginica :50 Mean :1.199 3rd Qu.:1.800 Max. :2.500

Page 8: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

> library(Hmisc) Error in library(Hmisc) : there is no package called ‘Hmisc’ > describe(iris[, c(1, 5)]) # check columns 1 & 5 Error in describe(iris[, c(1, 5)]) : could not find function "describe" > range(iris$Sepal.Length) [1] 4.3 7.9 > quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65)) 10% 30% 65% 4.80 5.27 6.20 > var(iris$Sepal.Length) [1] 0.6856935 > hist(iris$Sepal.Length) > plot(density(iris$Sepal.Length)) > table(iris$Species) setosa versicolor virginica 50 50 50 > pie(table(iris$Species)) > barplot(table(iris$Species)) > > > > > > > > > > > > > > > > > source('D:/Assignment_2_R-Work.R') 'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... > > > > > source('D:/Assignment_2_R-Work.R') 'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... > > > source('D:/Assignment_2_R-Work.R') 'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... > > > > >

Page 9: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

> > > > > > > > > > > > > > > > > > cor(iris$Sepal.Length, iris$Petal.Length) [1] 0.8717538 > cov(iris$Sepal.Length, iris$Petal.Length) [1] 1.274315 > cov(iris[, 1:4]) Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707 Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394 Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094 Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063 > aggregate(Sepal.Length ~ Species, summary, data = iris) Species Sepal.Length.Min. Sepal.Length.1st Qu. Sepal.Length.Median 1 setosa 4.300 4.800 5.000 2 versicolor 4.900 5.600 5.900 3 virginica 4.900 6.225 6.500 Sepal.Length.Mean Sepal.Length.3rd Qu. Sepal.Length.Max. 1 5.006 5.200 5.800 2 5.936 6.300 7.000 3 6.588 6.900 7.900 > boxplot(Sepal.Length ~ Species, data = iris) > with(iris, plot(Sepal.Length, Sepal.Width, col = Species, + pch = as.numeric(Species))) > plot(jitter(iris$Sepal.Length), jitter(iris$Sepal.Width)) Warning message: In doTryCatch(return(expr), name, parentenv, handler) : unable to open file 'C:/Users/admin/AppData/Local/Temp/RtmpMVZCGt/3dd76b0ebe514e8788394f9fde289c69.png' for writing > 5 6 Error: unexpected numeric constant in "5 6" > pairs(iris) > library(scatterplot3d) Error in library(scatterplot3d) : there is no package called ‘scatterplot3d’ > installed.package("scatterplot3d") Error in installed.package("scatterplot3d") : could not find function "installed.package" > install.package("scatterplot3d") Error in install.package("scatterplot3d") : could not find function "install.package" > scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width) Error in scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width) : could not find function "scatterplot3d" > install.packages("scatterplot3d") trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/scatterplot3d_0.3-40.zip' Content type 'application/zip' length 309312 bytes (302 KB) downloaded 302 KB package ‘scatterplot3d’ successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\admin\AppData\Local\Temp\RtmpMVZCGt\downloaded_packages

Page 10: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width) Error in scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width) : could not find function "scatterplot3d" > library(scatterplot3d) > scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width) > library(rgl) Error in library(rgl) : there is no package called ‘rgl’ > install.packages(rgl) Error in install.packages : object 'rgl' not found > install.packages("rgl") also installing the dependencies ‘httpuv’, ‘xtable’, ‘R6’, ‘sourcetools’, ‘htmlwidgets’, ‘htmltools’, ‘shiny’ trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/httpuv_1.3.5.zip' Content type 'application/zip' length 932403 bytes (910 KB) downloaded 910 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/xtable_1.8-2.zip' Content type 'application/zip' length 710412 bytes (693 KB) downloaded 693 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/R6_2.2.2.zip' Content type 'application/zip' length 316803 bytes (309 KB) downloaded 309 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/sourcetools_0.1.6.zip' Content type 'application/zip' length 528073 bytes (515 KB) downloaded 515 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/htmlwidgets_0.9.zip' Content type 'application/zip' length 851970 bytes (832 KB) downloaded 832 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/htmltools_0.3.6.zip' Content type 'application/zip' length 618426 bytes (603 KB) downloaded 603 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/shiny_1.0.5.zip' Content type 'application/zip' length 2834087 bytes (2.7 MB) downloaded 2.7 MB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/rgl_0.98.1.zip' Content type 'application/zip' length 3436885 bytes (3.3 MB) downloaded 3.3 MB package ‘httpuv’ successfully unpacked and MD5 sums checked package ‘xtable’ successfully unpacked and MD5 sums checked package ‘R6’ successfully unpacked and MD5 sums checked package ‘sourcetools’ successfully unpacked and MD5 sums checked package ‘htmlwidgets’ successfully unpacked and MD5 sums checked package ‘htmltools’ successfully unpacked and MD5 sums checked package ‘shiny’ successfully unpacked and MD5 sums checked package ‘rgl’ successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\admin\AppData\Local\Temp\RtmpMVZCGt\downloaded_packages > library(rgl) > plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width) > dist.matrix <- as.matrix(dist(iris[, 1:4])) > heatmap(dist.matrix) > library(lattice) > levelplot(Petal.Width ~ Sepal.Length * Sepal.Width, iris, cuts = 9, + col.regions = rainbow(10)[10:1]) > filled.contour(volcano, color = terrain.colors, asp = 1, plot.axes = contour(volcano, + add = T)) > 100 [1] 100 > 120

Page 11: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

[1] 120 > 140 [1] 140 > 160 [1] 160 > 180 [1] 180 > persp(volcano, theta = 25, phi = 30, expand = 0.5, col = "lightblue") > library(MASS) > parcoord(iris[1:4], col = iris$Species) > library(lattice) > parallelplot(~iris[1:4] | Species, data = iris) > library(ggplot2) Error in library(ggplot2) : there is no package called ‘ggplot2’ > install.packages("ggplot2") also installing the dependencies ‘colorspace’, ‘RColorBrewer’, ‘dichromat’, ‘munsell’, ‘labeling’, ‘viridisLite’, ‘rlang’, ‘gtable’, ‘plyr’, ‘reshape2’, ‘scales’, ‘tibble’, ‘lazyeval’ trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/colorspace_1.3-2.zip' Content type 'application/zip' length 446797 bytes (436 KB) downloaded 436 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/RColorBrewer_1.1-2.zip' Content type 'application/zip' length 26979 bytes (26 KB) downloaded 26 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/dichromat_2.0-0.zip' Content type 'application/zip' length 147934 bytes (144 KB) downloaded 144 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/munsell_0.4.3.zip' Content type 'application/zip' length 134448 bytes (131 KB) downloaded 131 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/labeling_0.3.zip' Content type 'application/zip' length 41095 bytes (40 KB) downloaded 40 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/viridisLite_0.2.0.zip' Content type 'application/zip' length 57871 bytes (56 KB) downloaded 56 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/rlang_0.1.2.zip' Content type 'application/zip' length 465416 bytes (454 KB) downloaded 454 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/gtable_0.2.0.zip' Content type 'application/zip' length 58119 bytes (56 KB) downloaded 56 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/plyr_1.8.4.zip' Content type 'application/zip' length 1219124 bytes (1.2 MB) downloaded 1.2 MB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/reshape2_1.4.2.zip' Content type 'application/zip' length 610052 bytes (595 KB) downloaded 595 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/scales_0.5.0.zip' Content type 'application/zip' length 694675 bytes (678 KB) downloaded 678 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/tibble_1.3.4.zip' Content type 'application/zip' length 676602 bytes (660 KB) downloaded 660 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/lazyeval_0.2.0.zip' Content type 'application/zip' length 140091 bytes (136 KB)

Page 12: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

downloaded 136 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/ggplot2_2.2.1.zip' Content type 'application/zip' length 2785410 bytes (2.7 MB) downloaded 2.7 MB package ‘colorspace’ successfully unpacked and MD5 sums checked package ‘RColorBrewer’ successfully unpacked and MD5 sums checked package ‘dichromat’ successfully unpacked and MD5 sums checked package ‘munsell’ successfully unpacked and MD5 sums checked package ‘labeling’ successfully unpacked and MD5 sums checked package ‘viridisLite’ successfully unpacked and MD5 sums checked package ‘rlang’ successfully unpacked and MD5 sums checked package ‘gtable’ successfully unpacked and MD5 sums checked package ‘plyr’ successfully unpacked and MD5 sums checked package ‘reshape2’ successfully unpacked and MD5 sums checked package ‘scales’ successfully unpacked and MD5 sums checked package ‘tibble’ successfully unpacked and MD5 sums checked package ‘lazyeval’ successfully unpacked and MD5 sums checked package ‘ggplot2’ successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\admin\AppData\Local\Temp\RtmpMVZCGt\downloaded_packages > library(ggplot2) > qplot(Sepal.Length, Sepal.Width, data = iris, facets = Species ~ .) > pdf("myPlot.pdf") > x <- 1:50 > plot(x, log(x)) > graphics.off() > # Save as a postscript file > postscript("myPlot2.ps") > x <- -20:20 > plot(x, x^2) > graphics.off() > pdf("myPlot.pdf") > x <- 1:50 > plot(x, log(x)) > graphics.off() > # Save as a postscript file > postscript("myPlot2.ps") > x <- -20:20 > plot(x, x^2) > graphics.off()\ Error: unexpected input in "graphics.off()\" > year <- rep(2008:2010, each = 4) > quarter <- rep(1:4, 3) > cpi <- c(162.2, 164.6, 166.5, 166, 166.2, 167, 168.6, 169.5, 171, + 172.1, 173.3, 174) > plot(cpi, xaxt = "n", ylab = "CPI", xlab = "") > axis(1, labels = paste(year, quarter, sep = "Q"), at = 1:12, las = 3) > cov(iris$Sepal.Length, iris$Petal.Length) [1] 1.274315 > cor(year, cpi) [1] 0.9096316 > cor(quarter, cpi) [1] 0.3738028 > fit <- lm(cpi ~ year + quarter) > fir Error: object 'fir' not found > fit Call: lm(formula = cpi ~ year + quarter) Coefficients: (Intercept) year quarter -7644.488 3.888 1.167 > cpi2011 <- fit$coefficients[[1]] + + fit$coefficients[[2]] * 2011 +

Page 13: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

+ fit$coefficients[[3]] * (1:4) > attributes(fit) $names [1] "coefficients" "residuals" "effects" "rank" [5] "fitted.values" "assign" "qr" "df.residual" [9] "xlevels" "call" "terms" "model" $class [1] "lm" > residuals(fit) 1 2 3 4 5 6 -0.57916667 0.65416667 1.38750000 -0.27916667 -0.46666667 -0.83333333 7 8 9 10 11 12 -0.40000000 -0.66666667 0.44583333 0.37916667 0.41250000 -0.05416667 > summary(fit) Call: lm(formula = cpi ~ year + quarter) Residuals: Min 1Q Median 3Q Max -0.8333 -0.4948 -0.1667 0.4208 1.3875 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -7644.4875 518.6543 -14.739 1.31e-07 *** year 3.8875 0.2582 15.058 1.09e-07 *** quarter 1.1667 0.1885 6.188 0.000161 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.7302 on 9 degrees of freedom Multiple R-squared: 0.9672, Adjusted R-squared: 0.9599 F-statistic: 132.5 on 2 and 9 DF, p-value: 2.108e-07 > library(scatterplot3d) > s3d <- scatterplot3d(year, quarter, cpi, highlight.3d = T, type = "h", + lab = c(2, 3)) > s3d$plane3d(fit) > s3d$plane3d(fit) > cov(iris$Sepal.Length, iris$Petal.Length) [1] 1.274315 > s3d$plane3d(fit) > The iris Data Error: unexpected symbol in "The iris" > str(iris) 'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... > set.seed(1234) > ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3)) > train.data <- iris[ind == 1, ] > test.data <- iris[ind == 2, ] > > set.seed(1234) Error: unexpected '>' in ">" > > ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3)) Error: unexpected '>' in ">" > > train.data <- iris[ind == 1, ] Error: unexpected '>' in ">" > > test.data <- iris[ind == 2, ] Error: unexpected '>' in ">" > library(party) Error in library(party) : there is no package called ‘party’ > install.packages("party")

Page 14: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

also installing the dependencies ‘TH.data’, ‘multcomp’, ‘mvtnorm’, ‘modeltools’, ‘strucchange’, ‘coin’, ‘zoo’, ‘sandwich’ trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/TH.data_1.0-8.zip' Content type 'application/zip' length 5213265 bytes (5.0 MB) downloaded 5.0 MB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/multcomp_1.4-7.zip' Content type 'application/zip' length 633203 bytes (618 KB) downloaded 618 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/mvtnorm_1.0-6.zip' Content type 'application/zip' length 233316 bytes (227 KB) downloaded 227 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/modeltools_0.2-21.zip' Content type 'application/zip' length 138817 bytes (135 KB) downloaded 135 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/strucchange_1.5-1.zip' Content type 'application/zip' length 854520 bytes (834 KB) downloaded 834 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/coin_1.2-1.zip' Content type 'application/zip' length 1462129 bytes (1.4 MB) downloaded 1.4 MB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/zoo_1.8-0.zip' Content type 'application/zip' length 902527 bytes (881 KB) downloaded 881 KB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/sandwich_2.4-0.zip' Content type 'application/zip' length 1243109 bytes (1.2 MB) downloaded 1.2 MB trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/party_1.2-3.zip' Content type 'application/zip' length 719839 bytes (702 KB) downloaded 702 KB package ‘TH.data’ successfully unpacked and MD5 sums checked package ‘multcomp’ successfully unpacked and MD5 sums checked package ‘mvtnorm’ successfully unpacked and MD5 sums checked package ‘modeltools’ successfully unpacked and MD5 sums checked package ‘strucchange’ successfully unpacked and MD5 sums checked package ‘coin’ successfully unpacked and MD5 sums checked package ‘zoo’ successfully unpacked and MD5 sums checked package ‘sandwich’ successfully unpacked and MD5 sums checked package ‘party’ successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\admin\AppData\Local\Temp\RtmpMVZCGt\downloaded_packages > library(party) Loading required package: grid Loading required package: mvtnorm Loading required package: modeltools Loading required package: stats4 Loading required package: strucchange Loading required package: zoo Attaching package: ‘zoo’ The following objects are masked from ‘package:base’: as.Date, as.Date.numeric Loading required package: sandwich > myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + + Petal.Width > iris_ctree <- ctree(myFormula, data = train.data) > table(predict(iris_ctree), train.data$Species)

Page 15: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

setosa versicolor virginica setosa 40 0 0 versicolor 0 37 3 virginica 0 1 31 > print(iris_ctree) Conditional inference tree with 4 terminal nodes Response: Species Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width Number of observations: 112 1) Petal.Length <= 1.9; criterion = 1, statistic = 104.643 2)* weights = 40 1) Petal.Length > 1.9 3) Petal.Width <= 1.7; criterion = 1, statistic = 48.939 4) Petal.Length <= 4.4; criterion = 0.974, statistic = 7.397 5)* weights = 21 4) Petal.Length > 4.4 6)* weights = 19 3) Petal.Width > 1.7 7)* weights = 32 > print(iris_ctree) Conditional inference tree with 4 terminal nodes Response: Species Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width Number of observations: 112 1) Petal.Length <= 1.9; criterion = 1, statistic = 104.643 2)* weights = 40 1) Petal.Length > 1.9 3) Petal.Width <= 1.7; criterion = 1, statistic = 48.939 4) Petal.Length <= 4.4; criterion = 0.974, statistic = 7.397 5)* weights = 21 4) Petal.Length > 4.4 6)* weights = 19 3) Petal.Width > 1.7 7)* weights = 32 > plot(iris_ctree) > plot(iris_ctree, type = "simple") > # predict on test data > testPred <- predict(iris_ctree, newdata = test.data) > table(testPred, test.data$Species) testPred setosa versicolor virginica setosa 10 0 0 versicolor 0 12 2 virginica 0 0 14 > ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3)) > train.data <- iris[ind==1,] > test.data <- iris[ind==2,] > library(randomForest) Error in library(randomForest) : there is no package called ‘randomForest’ > install.packages("randomForest") trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/randomForest_4.6-12.zip' Content type 'application/zip' length 179550 bytes (175 KB) downloaded 175 KB package ‘randomForest’ successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\admin\AppData\Local\Temp\RtmpMVZCGt\downloaded_packages > library(randomForest) randomForest 4.6-12 Type rfNews() to see new features/changes/bug fixes.

Page 16: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

Attaching package: ‘randomForest’ The following object is masked from ‘package:ggplot2’: margin > rf <- randomForest(Species ~ ., data=train.data, ntree=100, + proximity=T) > table(predict(rf), train.data$Species) setosa versicolor virginica setosa 33 0 0 versicolor 0 32 3 virginica 0 2 35 > print(rf) Call: randomForest(formula = Species ~ ., data = train.data, ntree = 100, proximity = T) Type of random forest: classification Number of trees: 100 No. of variables tried at each split: 2 OOB estimate of error rate: 4.76% Confusion matrix: setosa versicolor virginica class.error setosa 33 0 0 0.00000000 versicolor 0 32 2 0.05882353 virginica 0 3 35 0.07894737 > plot(rf, main = "") > importance(rf) MeanDecreaseGini Sepal.Length 6.537014 Sepal.Width 1.698005 Petal.Length 29.369949 Petal.Width 31.466699 > varImpPlot(rf) > irisPred <- predict(rf, newdata = test.data) > table(irisPred, test.data$Species) irisPred setosa versicolor virginica setosa 17 0 0 versicolor 0 16 3 virginica 0 0 9 > plot(margin(rf, test.data$Species)) > library(mass) Error in library(mass) : there is no package called ‘mass’ > library(MASS) > tbl = table(survey$Smoke, survey$Exer) > tbl Freq None Some Heavy 7 1 3 Never 87 18 84 Occas 12 3 4 Regul 9 1 7 > chisq.test(tbl) Pearson's Chi-squared test data: tbl X-squared = 5.4885, df = 6, p-value = 0.4828 Warning message: In chisq.test(tbl) : Chi-squared approximation may be incorrect > ctbl = cbind(tbl[,"Freq"], tbl[,"None"] + tbl[,"Some"]) > ctbl [,1] [,2] Heavy 7 4

Page 17: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous

Never 87 102 Occas 12 7 Regul 9 8 > chisq.test(ctbl) Pearson's Chi-squared test data: ctbl X-squared = 3.2328, df = 3, p-value = 0.3571 > library(MASS) > levels(survey$Smoke) [1] "Heavy" "Never" "Occas" "Regul" > smoke.freq = table(survey$Smoke) > smoke.freq Heavy Never Occas Regul 11 189 19 17 > smoke.prob = c(.045, .795, .085, .075) > chisq.test(smoke.freq, p=smoke.prob) Chi-squared test for given probabilities data: smoke.freq X-squared = 0.10744, df = 3, p-value = 0.9909 Answer As the p-value 0.991 is greater than the .05 significance level, we do not reject the null hypothesis that the sample data in survey supports the campus-wide smoking statistics.

References:- http://www.r-tutor.com/elementary-statistics/goodness-fit/multinomial-goodness-fit

http://www.rdatamining.com/docs/regression-and-classification-with-r

************************************THE END***************************

Page 18: Assignment-3 Using R for Correlation and Regression analysis. · 2017. 9. 17. · Assignment-3 Using R for Correlation and Regression analysis. Regression: to predict a continuous