Notes Bivariate Data Chapters 7 - 9

Upload: karen-preston

Post on 31-Dec-2015


Bivariate Data

Explores relationships between two quantitative variables.

The explanatory variable attempts to explain the observed outcomes. (In algebra this is your independent variable – “x”)

The response variable measures an outcome of a study. (In algebra this is your dependent variable – “y”)

○ When we gather data, we usually have in mind which variables are which.

○ Beware! Labeling variables as explanatory and response suggests a cause-and-effect relationship that may not exist in every data set. Use common sense!

○ A Lurking Variable is a variable that has an important effect on the relationship among the variables in a study but is not included among the variables being studied.

○ Lurking variables can suggest a relationship when there isn’t one or can hide a relationship that exists.

Displaying the Variables

○ We always graph our data, right?

○ You use a scatterplot to graph the relationship between 2 quantitative variables. Each point represents an individual.

○ Remember that not all bivariate relationships are linear! We will talk about non-linear relationships in the next unit.

Interpret a Scatterplot

○ Here is what we look for:

○ 1) direction (positive, negative) (D)

○ 2) form (linear, or not linear) (S)

○ 3) strength (correlation, r) (S)

○ 4) deviations from the pattern (outliers) (U)

SUDS!!

○ Remember, an outlier is an individual observation that falls outside the overall pattern of the graph.

○ There is no outlier test for bivariate data. It’s a judgment call.

○ Categorical variables can be added to scatterplots by changing the symbols in the plot. (See P. 199 for examples)

○ Visual inspection is often not a good judge of how strong a linear relationship is. Changing the plotting scales or the amount of white space around a cloud of points can be deceptive. So….

A measure for strength...
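The formula from this slide is missing in the transcript. The measure described on the following slide is the correlation r. A standard textbook definition is sketched below; the notation is an assumption, not taken from the notes: x̄ and ȳ are the sample means, s_x and s_y are the sample standard deviations, and n is the number of individuals.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```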

Facts about Correlation:

○ 1) positive r – positive association (positive slope); negative r – negative association (negative slope)

○ 2) r must fall between –1 and 1 inclusive.

○ 3) r values close to –1 or 1 indicate that the points lie close to a straight line.

○ 4) r values close to 0 indicate a weak linear relationship.

○ 5) r values of –1 or 1 indicate a perfect linear relationship.

○ 6) correlation only measures the strength in linear relationships (not curves).

○ 7) correlation can be strongly affected by extreme values (outliers).
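Several of the facts above can be checked numerically. The sketch below assumes the usual definition of r as the average product of standardized x and y values (with divisor n - 1). The function name `correlation` and all data values are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
r_up = correlation(x, [3, 5, 7, 9, 11])       # points exactly on a rising line
r_down = correlation(x, [10, 8, 6, 4, 2])     # points exactly on a falling line
r_curve = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # perfect curve y = x^2
r_outlier = correlation(x + [20], [3, 5, 7, 9, 11] + [0])  # one extreme point added

print(r_up, r_down, r_curve, r_outlier)
```

The rising and falling lines give r of 1 and –1 (facts 1 and 5), the perfect curve gives r of 0 even though the relationship is exact (fact 6), and adding a single extreme point to the rising line turns r negative (fact 7).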

Least-Squares Regression Line

○ The least-squares regression line (LSRL) is a mathematical model for the data.

○ This line is also known as the line of best fit or the regression line.

Formal definition…

○ The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
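The definition can be illustrated by comparing the sum of squared vertical distances for the LSRL against nearby lines. This is a minimal sketch: it assumes the standard closed-form least-squares solution for slope and intercept (not stated in the notes), and the helper names and data are hypothetical.

```python
from statistics import mean

def lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares regression line of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def sum_squared_vertical(xs, ys, intercept, slope):
    """Sum of squared vertical distances from the points to the given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, made up for illustration only.
hours = [1, 2, 3, 4, 5]
score = [55, 54, 58, 57, 61]

a, b = lsrl(hours, score)
sse_lsrl = sum_squared_vertical(hours, score, a, b)
sse_shifted = sum_squared_vertical(hours, score, a - 0.5, b)   # same slope, lower line
sse_steeper = sum_squared_vertical(hours, score, a, b + 0.5)   # same intercept, steeper line

print(a, b, sse_lsrl, sse_shifted, sse_steeper)
```

Both altered lines give a larger sum of squares than the LSRL, which is what "as small as possible" means in the definition.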

The form…
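The equation from this slide is missing in the transcript. A common way to write the LSRL is sketched below. The symbols are assumptions rather than the notes' own notation: ŷ is the predicted response, a is the y-intercept, and b is the slope.

```latex
\hat{y} = a + b x
```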

Some new formulas…
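The formulas from this slide are missing in the transcript. Introductory texts usually give the slope and intercept of the LSRL in terms of the correlation and the summary statistics, so the following is a sketch under that assumption. Here b is the slope, a is the y-intercept, r is the correlation, s_x and s_y are the sample standard deviations, and x̄ and ȳ are the sample means.

```latex
b = r \, \frac{s_y}{s_x}, \qquad a = \bar{y} - b \, \bar{x}
```

The second formula says that the LSRL always passes through the point (x̄, ȳ).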

Why do we do regression?

○ The purpose of regression is to determine a model that we can use for making predictions.

Communication is always the goal!!!

○ When we write the equation for a LSRL we do not use x & y, we use the variable names themselves…

○ For example:

○ Predicted score = 52 + 1.5(hours studied)
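The same habit carries over to code: naming the function and its argument after the variables keeps the model readable. A minimal sketch using the example equation above (the function name is hypothetical):

```python
def predicted_score(hours_studied):
    """Example LSRL from the notes: predicted score = 52 + 1.5(hours studied)."""
    return 52 + 1.5 * hours_studied

# Intercept: predicted score with 0 hours studied.
print(predicted_score(0))
# Slope: each additional hour studied adds 1.5 points to the prediction.
print(predicted_score(4) - predicted_score(3))
```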

Another measure of strength…

○ The coefficient of determination, r², is the fraction of the variation in the values of y that is explained by the linear model.

○ When we explain r², we say: ___% of the variability in ___(y) can be explained by this linear model.
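One way to see r² as a "fraction of variation explained" is to compute it as 1 minus the unexplained variation over the total variation, and compare it with the square of r. The sketch below assumes the standard least-squares formulas and uses hypothetical data.

```python
from statistics import mean

# Hypothetical data, made up for illustration only.
hours = [1, 2, 3, 4, 5]
score = [55, 54, 58, 57, 61]

x_bar, y_bar = mean(hours), mean(score)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, score))
sxx = sum((x - x_bar) ** 2 for x in hours)
syy = sum((y - y_bar) ** 2 for y in score)   # total variation in y

slope = sxy / sxx
intercept = y_bar - slope * x_bar
# Variation left unexplained by the line (sum of squared residuals).
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(hours, score))

r = sxy / (sxx * syy) ** 0.5
r_squared = 1 - sse / syy

print(f"{r_squared:.0%} of the variability in score can be explained by this linear model.")
```

For these made-up values, r² works out to 0.75 by either route (squaring r, or 1 minus unexplained over total variation).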

Deviations for single points

○ A residual is the vertical difference between an actual point and the LSRL at one specific value of x. That is,

Residual = observed y – predicted y

or

Residual = y – ŷ

○ The mean of the residuals is always zero.
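A short sketch of the residual calculation, using hypothetical data and the standard least-squares formulas for the line (an assumption, since the notes do not show them):

```python
from statistics import mean

# Hypothetical data, made up for illustration only.
hours = [1, 2, 3, 4, 5]
score = [55, 54, 58, 57, 61]

x_bar, y_bar = mean(hours), mean(score)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, score))
         / sum((x - x_bar) ** 2 for x in hours))
intercept = y_bar - slope * x_bar

# Residual = observed y - predicted y, one per data point.
residuals = [y - (intercept + slope * x) for x, y in zip(hours, score)]

print(residuals)
print(mean(residuals))
```

The positive and negative residuals cancel, so their mean is zero.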

A new plot…

○ A residual plot plots the residuals on the vertical axis against the explanatory variable on the horizontal axis.

○ Such a plot magnifies the residuals and makes patterns easier to see.

Why do I need a residual plot?

○ Remember that not all data are linear in shape! The residual plot clearly shows whether a linear model is appropriate.

○ A residual plot shows a good linear fit when the points are randomly scattered about the horizontal line at residual = 0 with no obvious patterns.
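The sketch below shows why the residuals matter even when r looks strong. It fits a line to hypothetical curved data (y = x², chosen for illustration) and prints the sign pattern of the residuals instead of drawing the plot.

```python
from statistics import mean

# Hypothetical curved data: y = x^2.
x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
sxx = sum((a - x_bar) ** 2 for a in x)
syy = sum((b - y_bar) ** 2 for b in y)

slope = sxy / sxx
intercept = y_bar - slope * x_bar
r = sxy / (sxx * syy) ** 0.5

residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
# Sign of each residual, in order of increasing x.
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(round(r, 3), residuals, signs)
```

The correlation is above 0.95, yet the residuals run positive, negative, negative, negative, positive: a clear U-shaped pattern instead of random scatter, so a linear model is not appropriate here.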

To create a residual plot on the calculator:

○ 1) You must have done a linear regression with the data you wish to use.

○ 2) From the Stat-Plot, Plot # menu, choose scatterplot and leave the x-list with the x values.

○ 3) Change the y-list to “RESID”, chosen from the list menu.

○ 4) Zoom – 9

○ In scatterplots we can have points that are outliers or influential points or both.

○ An observation can be an outlier in the x direction, the y direction, or in both directions.

○ An observation is influential if removing it (or adding it) would markedly change the position of the regression line.
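The difference between an outlier and an influential point can be sketched by adding one point to a data set and refitting. The data and the helper name `lsrl` are hypothetical, and the fit uses the standard least-squares formulas.

```python
from statistics import mean

def lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data lying exactly on y = 2x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

base = lsrl(x, y)
with_y_outlier = lsrl(x + [3], y + [16])   # unusual y, but x sits at the mean of x
with_x_outlier = lsrl(x + [20], y + [5])   # unusual x, far from the other x values

print(base, with_y_outlier, with_x_outlier)
```

In this sketch the point that is an outlier only in the y direction shifts the line upward but leaves the slope at 2, while the point that is an outlier in the x direction drags the slope almost to 0, so it is influential.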

○ Extrapolation is the use of a regression model for prediction outside the domain of values of the explanatory variable x.

○ Such predictions cannot be trusted.
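A minimal sketch of guarding against extrapolation, using the example line predicted score = 52 + 1.5(hours studied). The observed range of 0 to 10 hours is a hypothetical assumption, as is the function name.

```python
def predict_score(hours_studied, observed_hours=(0, 10)):
    """Predict with the example line and flag extrapolation.

    observed_hours is a hypothetical range of x values present in the data.
    """
    low, high = observed_hours
    prediction = 52 + 1.5 * hours_studied
    extrapolated = not (low <= hours_studied <= high)
    return prediction, extrapolated

inside = predict_score(6)     # x within the assumed data range
outside = predict_score(40)   # x far beyond the assumed data range

print(inside, outside)
```

The model still returns a number for 40 hours (112), but the flag marks it as extrapolation. If scores were capped at 100 (another assumption), that prediction would be impossible.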

Association vs. Causation

○ A strong association between two variables is NOT enough to draw conclusions about cause & effect.

○ Strong association between two variables x and y can reflect:

○ A) Causation – a change in x causes a change in y

○ B) Common response – both x and y are responding to some other unobserved factor

○ C) Confounding – the effect on y of the explanatory variable x is hopelessly mixed up with the effects on y of other variables.

○ Cause and effect can only be determined from a well-designed experiment.

○ Data with no apparent linear relationship can also be examined in two ways to see if a relationship still exists:

○ 1) Check to see if breaking the data down into subsets or groups makes a difference.

○ 2) If the data are curved in some way and not linear, a relationship still exists. We will explore that in the next chapter.
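The first check can be sketched with two hypothetical groups. Each group lies exactly on its own line, but the combined data show almost no linear relationship. The data and the function name are made up for illustration; r is computed from sums of squares, an equivalent form of the usual definition.

```python
from statistics import mean

def correlation(xs, ys):
    """Correlation r computed from sums of squares and cross-products."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical groups, each perfectly linear on its own.
group_a = ([1, 2, 3, 4], [5, 6, 7, 8])
group_b = ([5, 6, 7, 8], [4, 5, 6, 7])

all_x = group_a[0] + group_b[0]
all_y = group_a[1] + group_b[1]

r_all = correlation(all_x, all_y)   # combined data
r_a = correlation(*group_a)         # group A alone
r_b = correlation(*group_b)         # group B alone

print(round(r_all, 2), r_a, r_b)
```

Here the combined r is close to 0, while r within each group is 1, so splitting the data reveals a relationship that the pooled scatterplot hides.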