Lecture 6: Correlation and Regression
STAT 3120: Statistical Methods I
STAT3120 – Correlation and Linear Regression
Dependent Variable | Independent (Predictor) Variable | Statistical Test | Comments
Quantitative | Categorical | t-test (one-sample, two-sample, or paired) | Determines whether the categorical variable (factor) affects the dependent variable; typically used for experimental or planned-change studies
Quantitative | Quantitative | Correlation/regression analysis | Establishes a regression model; used to explain, predict, or control the dependent variable
Categorical | Categorical | Chi-square | Tests whether the variables are statistically independent (i.e., are they related or not?)
STAT3120 - Correlation
Correlation coefficients assess the strength of the linear relationship between two quantitative variables.
• The correlation measure ranges from -1 to +1.
• A negative correlation means that X and Y are inversely related.
• A positive correlation means that X and Y are directly related.
• A zero correlation means that X and Y are not linearly related.
• A correlation of +1 indicates that X and Y are directly related and that all the points fall on the same straight line.
• A correlation of -1 indicates that X and Y are inversely related and that all the points fall on the same straight line.
Plot a scatter diagram of each predictor variable against the dependent variable:
• Look for departures from linearity.
• Look for extreme data points (outliers).
Examine partial correlations:
• Correlation cannot establish causality, but partial correlations can help isolate confounding variables.
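To make the properties above concrete, here is a minimal Python sketch (not part of the original slides, which use Excel and SAS) that computes the Pearson correlation coefficient from its definition. The function name `pearson_r` and the toy data are hypothetical and were chosen only to illustrate the +1, -1, and zero cases.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data (not taken from any dataset used in the lecture)
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2 * v + 1 for v in x])    # all points on a rising line
r_neg = pearson_r(x, [10 - 3 * v for v in x])   # all points on a falling line
# y = x^2 on a symmetric range: clearly related, but not linearly
r_zero = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_pos, r_neg, r_zero)
```

The last case shows why a scatter plot is still worth drawing: a correlation of zero rules out a linear relationship, not every relationship.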
STAT3120 - Correlation
For example, let’s take two variables and evaluate their correlation. Open the stats98 dataset in Excel.
What would you expect the correlation of the Verbal SAT scores and the Math SAT scores to be? Why?
What would you expect the correlation of the Math SAT scores and the percent taking the test to be? Why?
STAT3120 - Correlation
What would you expect the correlation of the Math SAT scores and the percent of HS students who took the test to be? Why?
STAT3120 - Correlation
Let’s pull up the 2000 Florida Vote Count in Excel.
STAT3120 - Correlation
Let’s pull up the UCDAVIS2 dataset in Excel and plot Ideal Height versus Actual Height. What would you expect the correlation value to be? Can you explain someone’s Ideal Height using their Actual Height?
STAT3120 - Regression
From the previous slide, the “regression line” has been superimposed on the scatter plot of ideal height against actual height.
The equation of this line takes the general form y = mx + b, where:
• y is the dependent variable (ideal height)
• m is the slope of the line
• x is the independent variable (actual height)
• b is the y-intercept
When we discuss regression models, we rewrite this equation as:
Y = b_0 + b_1 x_1 + ... + b_n x_n
where b_0 is the y-intercept and b_1, ..., b_n are the slopes. Each slope is the effect on Y of a one-unit change in the corresponding x, holding the other predictors fixed.
STAT3120 - Regression
From the previous slide, the model equation is presented in the form of the equation of a line: y = 0.8174x + 14.271.
From this, we would say:
1. For every 1-inch change in someone’s actual height, the predicted ideal height changes by 0.8174 inches.
2. Everyone “starts” with 14.271 inches (the intercept: the predicted ideal height at an actual height of 0, which is far outside the range of real heights).
3. If someone has an actual height of 68 inches, their predicted ideal height is 69.85 inches (0.8174 × 68 + 14.271).
An R² value of 0.7372 is interpreted as “73.72% of the variation in ideal height can be explained by a linear model with actual height as the only predictor.”
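The arithmetic behind these statements can be checked with a short Python sketch. The coefficients are the rounded values from the slide; the function name `predict_ideal_height` is hypothetical and used only for illustration.

```python
# Coefficients as reported on the slide (rounded)
INTERCEPT = 14.271   # b_0: predicted ideal height when actual height is 0
SLOPE = 0.8174       # b_1: change in predicted ideal height per inch of actual height

def predict_ideal_height(actual_height_in):
    """Predicted ideal height (inches) from the fitted line y = b_0 + b_1 * x."""
    return INTERCEPT + SLOPE * actual_height_in

pred_68 = predict_ideal_height(68)
pred_69 = predict_ideal_height(69)

print(round(pred_68, 2))            # 69.85
print(round(pred_69 - pred_68, 4))  # 0.8174, the slope
```

Moving from 68 to 69 inches changes the prediction by exactly the slope, which is what "the effect of a one-unit change of x on y" means.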
STAT3120 - Regression
Let’s do this in SAS.
After you import the data, the code to run a correlation looks like this:
proc corr data=jlp.ucdavis2;
   var idealht height;   /* the two variables to correlate */
run;
The output looks like this:
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

             IDEALHT     HEIGHT
IDEALHT      1.00000     0.85861
                          <.0001
                 231         231
HEIGHT       0.85861     1.00000
              <.0001
                 231         239
STAT3120 - Regression
The SAS Code to develop a regression model on the data looks like this:
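proc reg data=jlp.ucdavis2;
   model idealht = height / p r;      /* dependent = independent; p and r request predictions and residuals */
   output out=preds p=pred r=resid;   /* save predictions and residuals to the dataset preds */
run;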
In this code, the regression model is developed using the “model” statement. Here, the dependent variable of interest is set first. The independent variable(s) then follow after the = sign.
The p and r options after the / will produce the predictions and the residuals respectively.
The output statement will create a new (temporary) dataset called “preds” that will contain the predictions and the residuals – so that we can examine them.
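For readers who want to see what the predictions and residuals are, the following Python sketch fits a one-predictor least-squares line by hand and builds the two columns that the output statement saves. It is an illustration only, not a description of how PROC REG is implemented; the function name and the six data points are hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept: the line passes through the means
    return b0, b1

# Hypothetical heights in inches (NOT the ucdavis2 data)
height = [62, 64, 66, 68, 70, 72]
idealht = [65, 66, 68, 70, 71, 73]

b0, b1 = fit_simple_regression(height, idealht)
pred = [b0 + b1 * x for x in height]              # analogous to p=pred
resid = [y - p for y, p in zip(idealht, pred)]    # analogous to r=resid

for x, y, p, e in zip(height, idealht, pred, resid):
    print(x, y, round(p, 3), round(e, 3))
```

A residual is the observed value minus the predicted value, so with an intercept in the model the residuals sum to zero (up to floating-point error).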
STAT3120 - Regression
Here is some of the associated output:
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1       2278.18568    2278.18568    642.39   <.0001
Error             229        812.12596       3.54640
Corrected Total   230       3090.31164

Root MSE          1.88319    R-Square   0.7372
Dependent Mean   68.77818    Adj R-Sq   0.7361
Coeff Var         2.73806

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             14.27119          2.15413      6.63     <.0001
HEIGHT       1              0.81738          0.03225     25.35     <.0001
The Analysis of Variance table and the fit statistics (Root MSE, R-Square, Adj R-Sq) tell us about the performance of the model.
The Parameter Estimates table gives us the equation of the model and the impact of the predictor(s).
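The summary statistics in this output are all derived from the sums of squares and degrees of freedom. The Python sketch below recomputes them from the values printed above, using the standard textbook formulas; the variable names are hypothetical. It also checks that the R-Square matches the squared Pearson correlation (0.85861) reported earlier, which holds for a model with a single predictor.

```python
import math

# Values copied from the Analysis of Variance table
ss_model, df_model = 2278.18568, 1
ss_error, df_error = 812.12596, 229

ss_total = ss_model + ss_error        # Corrected Total sum of squares
df_total = df_model + df_error        # Corrected Total degrees of freedom

ms_model = ss_model / df_model        # Mean Square (Model)
ms_error = ss_error / df_error        # Mean Square (Error)

f_value = ms_model / ms_error         # F Value
root_mse = math.sqrt(ms_error)        # Root MSE
r_square = ss_model / ss_total        # R-Square
adj_r_square = 1 - (1 - r_square) * df_total / df_error   # Adj R-Sq

# Values copied from the Parameter Estimates table and the correlation output
t_height = 0.81738 / 0.03225          # estimate / standard error
r = 0.85861                           # Pearson correlation of IDEALHT and HEIGHT

print(round(f_value, 2), round(root_mse, 5))
print(round(r_square, 4), round(adj_r_square, 4))
print(round(t_height, 2), round(r ** 2, 4))
```

With one predictor, the squared t value for the slope equals the model F value up to the rounding of the printed inputs, so the two tests agree.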