# simple linear regression project

Post on 12-Apr-2017


Vishrut Mehta Japan Shah

SIMPLE LINEAR REGRESSION PROJECT (IE 5318)

Problem statement: In this simple linear regression project, we examine the relationship between one response variable (Y) and four candidate predictor variables (X1, X2, X3, X4). We want to see how the value of each predictor affects the response, and to identify which predictor has the strongest linear relationship with it.

The variables: The variables are as follows.

Response Variable (Y): Number of followers of a person on Twitter (in millions)
Predictor Variable (X1): Number of tweets posted by that person
Predictor Variable (X2): Number of years since that person joined Twitter
Predictor Variable (X3): Number of photos and videos posted
Predictor Variable (X4): Number of people that person is following

The data collection process: We used the official Forbes site to find the 100 most followed people on Twitter (http://www.forbes.com/sites/maddieberg/2015/06/29/twitters-most-followed-celebrities-retweets-dont-always-mean-dollars/#35671f137ef3). We then used Twitter itself (www.twitter.com) to collect the data for the response variable and each predictor variable. We searched for each person's account and confirmed that it was the original account by checking for the verification symbol that Twitter displays on official accounts. We use the 40 most followed people among them.

Why modeling this data set would be meaningful? Social media strongly affects people's lives today, and it keeps evolving as new social media sites appear. It has changed how we look at things. While some people have reached success through hard work and years of experience, others have reached it through social media, which takes the same amount of hard work. According to ebizmba.com (http://www.ebizmba.com/articles/social-networking-websites), Twitter is the second most popular social media site. It is therefore worth understanding which factors drive a person's popularity. We determine which of the four predictor variables affects a person's popularity the most.

Scatter plots of the response variable vs each predictor variable:

Scatter plot of response variable Y vs X1

Here, there is only a very weak positive linear association between the number of followers and the number of tweets. There are no outliers and no curvature. X1 is not a strong predictor of the response: increasing the number of tweets by one raises the expected number of followers only very slightly.

Scatter plot of response variable Y vs X2

Here, there is a positive linear association between the number of followers and the years since they joined. There is no curvature and no outliers are present. It is a comparatively strong predictor of the response. The plot suggests that the years since they joined affect the number of followers.

[Figure: scatter plot of No_of_followers (Y) vs No_of_tweets (X1)]

[Figure: scatter plot of No_of_followers (Y) vs Years_Since_they_joined (X2)]

Scatter plot of response variable Y vs X3

Here, there is a positive linear association between the number of followers and the number of photos and videos posted. There is no curvature. We removed one outlier. With that outlier included, the association between the response and the explanatory variable was negative; after removing it, the association is positive.

Scatter plot of response variable Y vs X4

Here, there is a positive linear association between the number of followers and the number of people the person is following. There is no curvature and there is one outlier.

[Figure: scatter plot of No_of_followers (Y) vs No_of_photos and videos posted (X3)]

[Figure: scatter plot of No_of_followers (Y) vs No_of_people are following (X4)]

Selecting one predictor variable: From the scatter plots shown above, all four predictor variables show a positive trend. We chose Years since they joined (X2) as the predictor variable because it shows a stronger relationship with the response than the other predictors.
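One way to back up this visual comparison is to compute the sample correlation between the response and each predictor and take the largest in absolute value. The sketch below assumes that approach. The names `pearson_r` and `strongest_predictor` are illustrative, and the numbers are made-up toy values, not the project's data.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strongest_predictor(response, predictors):
    """Return (name, r) for the predictor with the largest |r| against the response."""
    scores = {name: pearson_r(values, response) for name, values in predictors.items()}
    best = max(scores, key=lambda name: abs(scores[name]))
    return best, scores[best]

# Hypothetical toy values for illustration only (not the project's data).
followers = [10.0, 20.0, 30.0, 40.0, 50.0]
toy_predictors = {
    "tweets": [5.0, 1.0, 4.0, 2.0, 3.0],
    "years": [4.0, 5.0, 6.0, 8.0, 9.0],
}
best_name, best_r = strongest_predictor(followers, toy_predictors)
print(best_name, round(best_r, 4))
```

A correlation only summarizes linear association, so it complements the scatter plots rather than replacing them.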

II. SIMPLE LINEAR REGRESSION MODEL

Data taken into consideration:
Y: No_of_followers
X: Years_Since_they_joined

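The model form can be stated as follows:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

Where:

$Y_i$ is the response variable.
$\beta_0$ and $\beta_1$ are parameters, where $\beta_0$ is the intercept and $\beta_1$ is the slope.
$X_i$ is the predictor variable and is a known constant.
$\varepsilon_i$ is a random error with $E(\varepsilon_i) = 0$ and variance $\sigma^2(\varepsilon_i) = \sigma^2$; $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for $i \neq j$, so that their covariance is zero.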
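For a model of this form, the least squares estimates have closed forms: the slope is Sxy / Sxx and the intercept is the mean of Y minus the slope times the mean of X. The following is a minimal sketch under the assumption that the data are plain Python lists. `fit_simple_linear` is an illustrative name and the sample values are made up.

```python
def fit_simple_linear(x, y):
    """Least squares estimates (b0, b1) for Y = b0 + b1 * X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sxx: corrected sum of squares of X; Sxy: corrected sum of cross products.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical toy values lying exactly on y = 1 + 2x (not the project's data).
b0, b1 = fit_simple_linear([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(b0, b1)
```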

Fit a regression line

Software Output

Fitted model is $\hat{Y} = -50.60528 + 12.52331\,X_2$, where Y is No_of_followers and X2 is Years_Since_they_joined. The parameter estimates are $b_0$ (intercept) = -50.60528 and $b_1$ (slope) = 12.52331. From the above output we can summarize as follows.

Sum of squares indicates the amount of variability associated with each source. The model sum of squares (SSR) is 3195.47982; this is the variability explained by the model. The error sum of squares (SSE) is 8254.40793; this is the unexplained variability. The corrected total (SSTO) is 11450; this is the total variability in the response.

Each mean square is the ratio of a sum of squares to its degrees of freedom. The model mean square (MSR) is 3195.47982 (1 degree of freedom). The mean square error (MSE) is 217.22126 (38 degrees of freedom), which is an estimate of the error variance.

Root MSE is 14.73843. This estimates the standard deviation of the number of followers (Y) at each value of years since they joined (X2).

Dependent Mean is 35.49250. This is the average number of followers (Y) over the 40 observations.

The coefficient of determination ($R^2$) is the ratio of the model sum of squares to the total sum of squares. Here $R^2$ is 0.2791, so X2 (years since they joined Twitter) accounts for 27.91% of the variability in the response variable (Y).

Taking the square root of $R^2$ gives the coefficient of correlation, r = 0.5282. Since the slope is positive, this indicates a moderate positive linear relationship between the response variable and the predictor variable.

The coefficient of variation is about 41%: Root MSE is about 41% of the dependent mean. The standard errors of the intercept and slope indicate how much these estimates would vary if the data were resampled. The F value is 14.71, calculated as the ratio of MSR to MSE, and the corresponding p-value is 0.0005.
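The quantities in this summary follow from the two sums of squares and the sample size. A short sketch of the arithmetic, using the SSR, SSE, and n reported here; `anova_summary` is an illustrative name, not part of the SAS output.

```python
import math

def anova_summary(ssr, sse, n):
    """ANOVA quantities for simple linear regression (one predictor)."""
    df_model = 1
    df_error = n - 2
    msr = ssr / df_model
    mse = sse / df_error
    ssto = ssr + sse
    r_squared = ssr / ssto
    return {
        "SSTO": ssto,
        "MSR": msr,
        "MSE": mse,
        "RootMSE": math.sqrt(mse),
        "R2": r_squared,
        "abs_r": math.sqrt(r_squared),  # magnitude only; the sign follows the slope
        "F": msr / mse,
    }

# Values taken from the report's output.
summary = anova_summary(ssr=3195.47982, sse=8254.40793, n=40)
for name, value in summary.items():
    print(f"{name}: {value:.5f}")
```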

Hypothesis: $H_0: \beta_1 = 0$ and $H_1: \beta_1 \neq 0$. Decision rule: if $p < \alpha$, then reject $H_0$. Conclusion: From the table, p = 0.0005 and $\alpha$ = 0.05. As 0.0005 < 0.05, we reject $H_0$. It is a strong conclusion. From this we can say that years since they joined (X2) explains a significant amount of the variability in Y.

III. Inferences

A. Inferences on the parameter:

(Table 1.1) From the above output, we are 95% confident that the mean number of followers (in millions) increases by between 5.913337 and 19.13325 for each additional year since joining.
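The interval for the slope can be reproduced by hand as b1 plus or minus a t quantile times the standard error of b1, where the standard error is sqrt(MSE / Sxx). The sketch below assumes Sxx = 20.375 (the value used in the report's interval calculations) and a t quantile of 2.0244 for 38 degrees of freedom; the quantile is an assumed input here, not a value printed in the report.

```python
import math

def slope_confidence_interval(b1, mse, sxx, t_crit):
    """Return (lower, upper, se_b1) for the slope of a simple linear regression."""
    se_b1 = math.sqrt(mse / sxx)
    half_width = t_crit * se_b1
    return b1 - half_width, b1 + half_width, se_b1

# b1 and MSE come from the report; sxx and t_crit are assumptions stated above.
lower, upper, se_b1 = slope_confidence_interval(
    b1=12.52331, mse=217.22126, sxx=20.375, t_crit=2.0244
)
t_stat = 12.52331 / se_b1  # squaring this recovers the F value of the ANOVA table
print(round(lower, 4), round(upper, 4), round(t_stat ** 2, 2))
```

In simple linear regression the squared t statistic for the slope equals the F statistic, which gives a quick consistency check on the output.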

B. Inferences on the True Line and Prediction: Taking a 95% confidence level and $x_h = 7.5$,

CI (for mean response) = $\hat{y}_h \pm t(1-\alpha/2;\ n-2)\, s\{\hat{y}_h\}$

Now,

$$s\{\hat{y}_h\} = \sqrt{\mathrm{MSE}\left[\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]} = \sqrt{217.22\left[\frac{1}{40} + \frac{(7.5 - 6.875)^2}{20.375}\right]} = 3.0975$$

$$\hat{y}_h = -50.6052 + (12.5233 \times 7.5) = 43.31955$$

$$\mathrm{CI} = 43.31955 \pm t(0.975;\ 38) \times 3.0975 = 43.31955 \pm (2.021 \times 3.0975)$$

CI = (37.059, 49.580)

PI (prediction interval) = $\hat{y}_h \pm t(1-\alpha/2;\ n-2)\, s\{\mathrm{pred}\}$

Now,

$$s\{\mathrm{pred}\} = \sqrt{s^2\{\hat{y}_h\} + \mathrm{MSE}} = \sqrt{3.0975^2 + 217.22} = \sqrt{9.5945 + 217.22} = 15.0604$$

$$\mathrm{PI} = 43.31955 \pm t(0.975;\ 38) \times 15.0604 = 43.31955 \pm (2.021 \times 15.0604)$$

PI = (12.88, 73.76)

The limits for the Working-Hotelling confidence band at $x_h = 7.5$ are $\hat{y}_h \pm W\, s\{\hat{y}_h\}$ with $W = \sqrt{2F(1-\alpha;\ 2,\ n-2)}$. Using the value 3.15 for $F(0.95;\ 2,\ 38)$:

$$\mathrm{CB} = 43.31955 \pm \sqrt{2 \times 3.15} \times 3.0975$$

CB = (35.54, 51.09)
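The three intervals at a given x differ only in the standard error and the multiplier they use. The sketch below collects them in one function so the arithmetic can be checked at x = 7.5. It takes the report's estimates, MSE, mean of X (6.875), Sxx (20.375), and the critical values 2.021 and 3.15 as inputs; `intervals_at` is an illustrative name.

```python
import math

def intervals_at(x_h, b0, b1, mse, n, x_bar, sxx, t_crit, f_crit):
    """Mean-response CI, prediction interval, and Working-Hotelling band at x_h."""
    y_hat = b0 + b1 * x_h
    var_mean = mse * (1.0 / n + (x_h - x_bar) ** 2 / sxx)  # s^2{y_hat_h}
    se_mean = math.sqrt(var_mean)
    se_pred = math.sqrt(var_mean + mse)                    # s{pred}
    w = math.sqrt(2.0 * f_crit)                            # Working-Hotelling multiplier
    return {
        "fit": y_hat,
        "ci": (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean),
        "pi": (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred),
        "band": (y_hat - w * se_mean, y_hat + w * se_mean),
    }

# Inputs are the values quoted in the report; the critical values are table lookups.
result = intervals_at(
    x_h=7.5, b0=-50.60528, b1=12.52331, mse=217.22126,
    n=40, x_bar=6.875, sxx=20.375, t_crit=2.021, f_crit=3.15,
)
for name in ("ci", "pi", "band"):
    low, high = result[name]
    print(f"{name}: ({low:.3f}, {high:.3f})")
```

Because s{pred} adds the full MSE to the variance of the fitted mean, the prediction interval for a single new observation is much wider than the confidence interval for the mean response.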

Confidence band limits at several xh along the range of x

Confidence bands with fitted line and data observations plot:

We are 95% confident that the confidence band contains the entire true regression line.

IV. Model Assumptions

The mean of the response variable is linearly related to the predictor variable.
The spread of the probability distribution of the response variable (Y) doesn't depend on the level of the explanatory variable X, so the variance of the response variable remains constant.
The error terms are normally distributed with mean 0.
The error terms have equal variance.
The error terms are uncorrelated.

We conduct a residual analysis to verify the model assumptions using plots. These assumptions are:

I. The linear model is reasonable.
II. The residuals have constant variance.
III. The residuals are normally distributed.
IV. The residuals are uncorrelated.
V. No outliers.

[Figure: Y vs X2]

A. Box plot of residual

(Plot 1.1) The box plot is used to check the symmetry of the residuals. It also shows whether there are any outlying points among the residuals. From the above graph, we can say that there are two outliers.

B. Residual vs years since they joined twitter

(Plot 1.2) This graph shows no funnel shape and no curvature. This means that the residuals have constant variance and that the relation between the response variable and the predictor variable is linear.


C. Normal Probability plot of residuals

(Plot 1.3) The normal probability plot places the residuals on the Y-axis and their expected values under normality on the X-axis. Here the plot is concave upward, which indicates that the residuals are right-skewed. This means the errors are not normal.

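Test for normality and constant variance

A. Test normality

H0: Normality is OK. H1: Normality is violated. Decision rule: if $\hat{\rho} < c(\alpha, n)$, then reject H0, where $\hat{\rho}$ is the correlation between the residuals and their expected values under normality. Here $c(\alpha, n) = c(0.05, 40) = 0.972$, and $\hat{\rho} = 0.94567$ is less than this critical value. Therefore, we reject H0, which is a strong conclusion. The normality assumption is violated.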
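The statistic in this test is the correlation between the ordered residuals and their expected values under normality. The sketch below assumes the common approximation in which the expected value of the k-th smallest residual is sqrt(MSE) times the standard normal quantile at (k - 0.375) / (n + 0.25). The function names are illustrative and the residuals are made-up values, since the project's residuals are not listed in the report.

```python
import math
from statistics import NormalDist

def normal_scores(n, mse):
    """Approximate expected ordered residuals under normality."""
    std_normal = NormalDist()
    root_mse = math.sqrt(mse)
    return [root_mse * std_normal.inv_cdf((k - 0.375) / (n + 0.25))
            for k in range(1, n + 1)]

def normality_correlation(residuals):
    """Correlation between ordered residuals and their expected normal scores."""
    n = len(residuals)
    mse = sum(e * e for e in residuals) / (n - 2)
    ordered = sorted(residuals)
    expected = normal_scores(n, mse)
    mean_o = sum(ordered) / n
    mean_e = sum(expected) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(ordered, expected))
    var_o = sum((o - mean_o) ** 2 for o in ordered)
    var_e = sum((e - mean_e) ** 2 for e in expected)
    return cov / math.sqrt(var_o * var_e)

# Hypothetical right-skewed residuals for illustration (not the project's data).
skewed = [math.exp(i / 3) for i in range(20)]
rho_skewed = normality_correlation(skewed)

# Decision with the values quoted in the report: reject H0 when rho < critical value.
reject_normality = 0.94567 < 0.972
print(round(rho_skewed, 3), reject_normality)
```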

B. Modified Levene test for nonconstant variance:

In the modified Levene test, we divide the data into two groups. We took n1 = 35 for group 1 and n2 = 5 for group 2. The results from the SAS output of this test are shown below.

Hypothesis: H0: the variance is constant. H1: the variance is not constant. Decision rule: if $p < \alpha$, then reject H0. Conclusion: We fail to reject H0, so the data give no evidence against constant variance. It is a weak conclusion.
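The modified Levene statistic is a two-sample t statistic computed on the absolute deviations of the residuals from their group medians. A minimal sketch under that definition follows; `modified_levene_t` is an illustrative name and the residuals are toy values, not the SAS input. The resulting statistic would be compared with a t quantile on n1 + n2 - 2 degrees of freedom.

```python
import math
from statistics import median

def modified_levene_t(group1, group2):
    """Return (t statistic, degrees of freedom) for the modified Levene test."""
    med1 = median(group1)
    med2 = median(group2)
    # Absolute deviations of the residuals from their own group median.
    d1 = [abs(e - med1) for e in group1]
    d2 = [abs(e - med2) for e in group2]
    n1, n2 = len(d1), len(d2)
    mean1 = sum(d1) / n1
    mean2 = sum(d2) / n2
    # Pooled variance of the absolute deviations.
    pooled_var = (sum((d - mean1) ** 2 for d in d1)
                  + sum((d - mean2) ** 2 for d in d2)) / (n1 + n2 - 2)
    t_stat = (mean1 - mean2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return t_stat, n1 + n2 - 2

# Hypothetical residuals: the second group is twice as spread out as the first.
t_stat, df = modified_levene_t([-2.0, -1.0, 0.0, 1.0, 2.0], [-4.0, -2.0, 0.0, 2.0, 4.0])
print(round(t_stat, 4), df)
```

With group sizes as unequal as 35 and 5, the test has little power, which is one reason failing to reject H0 is only a weak conclusion.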

Overall discussion on which model assumptions are satisfied: The plot of residuals against X2 shows no curvature and no funnel shape. This supports a linear model with constant variance. The modified Levene test also supports the constant variance assumption. The normality assumption is not satisfied, as shown by the normal probability plot and the normality test.

Lack of fit test

Hypothesis: H0: no lack of linear fit. H1: lack of linear fit. Decision rule: if $p < \alpha$, then reject H0. Here p = 0.6979 > 0.05, so we fail to reject H0, and it is a weak conclusion. There is no evidence of lack of fit for the linear model.
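The lack of fit F statistic splits SSE into pure error (variation among responses that share the same X value) and lack of fit (the remainder). The sketch below assumes this standard decomposition, with c distinct X levels giving c - 2 and n - c degrees of freedom. `lack_of_fit_f` is an illustrative name, the data are toy values, and b0 and b1 are assumed to be the least squares estimates for those data.

```python
from collections import defaultdict

def lack_of_fit_f(x, y, b0, b1):
    """Return (F statistic, df lack of fit, df pure error) for a fitted line."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    n = len(x)
    c = len(groups)  # number of distinct X levels
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Pure error: squared deviations of each response from its own level mean.
    sspe = 0.0
    for ys in groups.values():
        level_mean = sum(ys) / len(ys)
        sspe += sum((yi - level_mean) ** 2 for yi in ys)
    sslf = sse - sspe
    f_stat = (sslf / (c - 2)) / (sspe / (n - c))
    return f_stat, c - 2, n - c

# Hypothetical replicated data whose level means bend away from any straight line.
x_toy = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
y_toy = [0.0, 2.0, 5.0, 7.0, 0.0, 2.0]
# Least squares fit for these toy values: slope 0, intercept equal to the mean of y.
f_stat, df_lf, df_pe = lack_of_fit_f(x_toy, y_toy, b0=8.0 / 3.0, b1=0.0)
print(round(f_stat, 3), df_lf, df_pe)
```

The test needs at least one X level with replicate observations, otherwise the pure error sum of squares has no degrees of freedom.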

Final Discussion

In simple linear regression analysis, we look for the relationship between two variables: an independent variable X and a dependent variable Y. Out of the four predictor variables, we selected X2 (Years_Since_they_joined) for further analysis because it shows the strongest upward trend with the response. The statistical analysis gave the following equation: (No_of_followers) Y = -50.60528 + 12.52331 X2 (Years_Since_they_joined). The tests indicate that there is a linear relationship. We used residual plots (residuals vs X2, box plot, and normal probability plot) to check which assumptions are satisfied. From the graphs, we found that the residuals are not normal and that the variance is constant. From the lack of fit test, we found no evidence against the linear fit.

Ideas for future analysis

For future analysis, we could take X4 (number of people they are following) as the predictor variable because its correlation with our response variable is 0.4785. Although this is a reasonably good correlation, the normal probability plot for this variable shows right-skewness, so the normality assumption may again be violated.

For these reasons, this predictor variable may not prove to be an effective choice for describing a linear relation with our response variable.

[Figure: Normal Probability Plot, Y vs Sample Percentile]