association for interval level variables

Association for Interval Level Variables Chapter 15

Upload: zora

Post on 23-Feb-2016


DESCRIPTION

Association for Interval Level Variables. Chapter 15. Introduction. When referring to interval-ratio variables, a commonly used synonym for association is correlation. We will be looking for the existence, strength, and direction of the relationship.

TRANSCRIPT

Association for Interval Level Variables

Chapter 15

Introduction
When referring to interval-ratio variables, a commonly used synonym for association is correlation.
We will be looking for the existence, strength, and direction of the relationship.
We will only look at bivariate relationships in this chapter.

Scattergrams
The first step is to construct and examine a scattergram.
Example in the book: an analysis of how dual wage-earner families cope with housework.
They want to know whether the number of children in the family is related to the amount of time the husband contributes to housekeeping chores.

Scattergram of Relationship Between the Two Variables

Regression of Husbands' Hours of Housework by the Number of Children in the Family

Construction of a Scattergram
Draw two axes of about equal length and at right angles to each other.
Put the independent (X) variable along the horizontal axis (the abscissa) and the dependent (Y) variable along the vertical axis (the ordinate).
For each person, locate the point along the abscissa that corresponds to the score of that person on the X variable.
Draw a straight line up from that point and at right angles to the axis.
Then locate the point along the ordinate that corresponds to the score of that same case on the Y variable, and draw a straight line across from it at right angles to that axis.
Place a dot where the two lines meet to represent the case, and then repeat for all cases.

Regression Line and Its Purpose
It checks for linearity of the data points on the scattergram.
It gives information about the existence, strength, and direction of the association.
It is used to predict the score of a case on one variable from the score of that case on the other variable.
It is a floating mean through all the data points.

Scattergram of Relationship Between the Two Variables

Regression of Husbands' Hours of Housework by the Number of Children in the Family

Existence of a Relationship
Two variables are associated if the distributions of Y change for the various conditions of X.
The scores along the abscissa (number of children) are the conditions, or values, of X.
The dots above each X value can be thought of as the conditional distributions of Y (the scores on Y for each value of X).
In this example, Y tends to increase as X increases.

Existence of a Relationship, cont.
The existence of a relationship is reinforced by the fact that the regression line lies at an angle to the X axis (the abscissa).
There is no linear relationship between two interval-level variables when the regression line on a scattergram is parallel to the horizontal axis.

Scattergram of Relationship Between the Two Variables

Regression of Husbands' Hours of Housework by the Number of Children in the Family

Strength of the Association
The strength of the association is judged by observing the spread of the dots around the regression line.
A perfect association between variables can be seen on a scattergram when all dots lie on the regression line.
The closer the dots are to the regression line, the stronger the association.
So, for a given X, there should not be much variation in the Y scores.

Scattergram of Relationship Between the Two Variables

Regression of Husbands' Hours of Housework by the Number of Children in the Family

Direction of the Relationship
The direction of the relationship can be judged by observing the angle of the regression line with respect to the abscissa.
The relationship is positive when the line slopes upward from left to right.
The relationship is negative when the line slopes downward from left to right.
Your book shows a positive relationship, because cases with high scores on X also tend to have high scores on Y.
For a negative relationship, cases with high scores on X would tend to have low scores on Y, and vice versa.
Your book also shows a zero relationship: no association between the variables, in that their scores are paired with each other at random.
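
The ideas of existence and direction can be checked numerically before any line is fitted. The sketch below is only an illustration: the (X, Y) pairs are made-up values, not the data from the book, and the helper names `conditional_means` and `direction` are hypothetical. It computes the mean of Y at each value of X (the conditional means) and reports whether those means rise, fall, or stay flat as X increases.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cases: X = number of children, Y = husband's weekly housework hours.
# These values are invented for illustration; they are NOT the book's data.
cases = [(1, 1), (1, 2), (2, 2), (2, 3), (3, 3), (3, 5), (4, 4), (4, 6)]


def conditional_means(pairs):
    """Return {x: mean of the Y scores observed at that value of X}."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: mean(ys) for x, ys in sorted(groups.items())}


def direction(cond_means):
    """Classify how the conditional means of Y change as X increases."""
    values = list(cond_means.values())
    steps = [later - earlier for earlier, later in zip(values, values[1:])]
    if all(step > 0 for step in steps):
        return "positive"
    if all(step < 0 for step in steps):
        return "negative"
    if all(step == 0 for step in steps):
        return "none"
    return "mixed"


means_by_x = conditional_means(cases)
trend = direction(means_by_x)
print(means_by_x, trend)
```

If the conditional means did not change across the values of X, that would correspond to a regression line parallel to the abscissa, which is the zero-relationship case.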

Linearity
The key assumption (the first step in the model) with correlation and regression is that the two variables have an essentially linear relationship.
The points or dots must form a pattern that approximates a straight line.
It is important to begin with a scattergram before doing correlations and regressions.
If the relationship is nonlinear, you may need to treat the variables as if they were ordinal rather than interval-ratio.

Regression and Prediction
The final use of the scattergram is to predict scores of cases on one variable from their scores on the other.
For example, you may want to predict the number of hours of housework a husband with a family of four children would do each week.
Use regression to predict outside the range of the data only with caution: you have no data showing what happens beyond that range, and the trend could suddenly turn downward.

The Predicted Score on Y
The symbol for this is Y′ (Y prime); in other books it is most often Y hat, but that symbol is difficult to produce on a computer or to print in books.
It is found by first locating the score on X (X = 4, for four children) and then drawing a straight line from that point on the abscissa up to the regression line.
From the regression line, another straight line parallel to the abscissa is drawn across to the Y axis (the ordinate).
Y′ is found at the point where that second line crosses the Y axis.
Or you can compute Y′ = a + bX.
Y′ is the expected Y value for a given X.

Formula for the Regression Line
This is the formula for the straight line that fits closest to the conditional means of Y:
Y = a + bX
where
Y = score on the dependent variable
a = the Y intercept, or the point where the regression line crosses the Y axis
b = the slope of the regression line, or the amount of change produced in Y by a unit change in X
X = score on the independent variable

Regression Line
The position of the least-squares regression line is defined by two elements: the Y intercept and the slope of the line.
The weaker the effect of X on Y (the weaker the association between the variables), the closer the value of the slope (b) is to zero.
If the two variables are unrelated, the least-squares regression line would be parallel to the abscissa, and b would be 0 (the line would have no slope).

Scattergram of Relationship Between the Two Variables

Regression of Husbands' Hours of Housework by the Number of Children in the Family

Equations for the Slope of the Regression Line
You need to compute b first, since it is needed in the formula for a.
Slope:

$$b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$$

This is the covariance of X and Y divided by the variance of X (the divisor N appears in both and cancels).
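
A minimal sketch of the slope calculation follows, assuming the usual least-squares form in which the summed cross-products of the deviations are divided by the summed squared deviations of X (equivalent to covariance over variance). The function name `slope` and the five data points are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from the book.

```python
from statistics import mean


def slope(xs, ys):
    """Least-squares slope b: covariation of X and Y over the variation of X."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    x_variation = sum((x - x_bar) ** 2 for x in xs)
    if x_variation == 0:
        # Every case has the same X score, so no line can be fitted.
        raise ValueError("X has no variation, so the slope is undefined")
    return covariation / x_variation


# Hypothetical data (not the book's): number of children and weekly housework hours.
children = [1, 2, 3, 4, 5]
hours = [2, 3, 3, 5, 5]

b = slope(children, hours)
print(round(b, 2))
```

Dividing both sums by N (or by N - 1) would turn them into a covariance and a variance, but the ratio, and therefore b, would be unchanged.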

Interpretation of the Value of the Slope
If you put your scattergram on graph paper, you can see that as X increases by one box, b is the number of units by which Y increases along the regression line.
So a slope of .69 indicates that, for each unit increase in X, there is an increase of .69 units in Y.
If the slope is 1.5, for every unit of change in X there is an increase of 1.5 units in Y.
We refer to units because correlation and regression allow you to compare apples and oranges: two completely different variables.

Scattergram of Relationship Between the Two Variables

Regression of Husbands' Hours of Housework by the Number of Children in the Family

Interpretation of b, cont.
So, to find out what one unit of X or one unit of Y is, you have to go back to the labels for each variable.
For the example in your book, which has a b (beta) of .69, the addition of each child (an increase of one unit in X; one unit is one child) results in an increase of .69 hours of housework being done by the husband (an increase of .69 units, or hours, in Y).

Formula for the Intercept of the Regression Line

$$a = \bar{Y} - b\bar{X}$$

Interpretation of the Intercept
The intercept for the example in the book is 1.49.
The least-squares regression line will cross the Y axis at the point where Y equals 1.49.
You need a second point to draw the regression line.
You can begin at Y = 1.49 (where X = 0) and, for the next value of X, which is 1 child, go up .69 units of Y.
Or you can use the intersection of the mean of X and the mean of Y; the regression line always goes through this point.
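
As a rough illustration of these two ways of locating the line, the sketch below uses the intercept (1.49) and slope (.69) quoted from the book's example to generate two points, and then uses the standard least-squares relation a = (mean of Y) - b(mean of X) to show that a fitted line passes through the point of means. The function names are hypothetical, and the means used in the second part (3 and 3.6, with b = 0.8) are invented values, not figures from the book.

```python
def intercept(x_bar, y_bar, b):
    """Y intercept of the least-squares line: a = Y-bar minus b times X-bar."""
    return y_bar - b * x_bar


def predict(x, a, b):
    """Predicted score Y' = a + bX."""
    return a + b * x


# Values quoted in the text for the book's example.
A_BOOK, B_BOOK = 1.49, 0.69

# Two points are enough to draw the line: start at the intercept (X = 0),
# then move one unit of X (one child) and go up b units of Y.
two_points = [(x, round(predict(x, A_BOOK, B_BOOK), 2)) for x in (0, 1)]
print(two_points)

# Hypothetical summary values (NOT from the book) to show that the line
# always passes through (mean of X, mean of Y).
x_bar, y_bar, b = 3.0, 3.6, 0.8
a = intercept(x_bar, y_bar, b)
print(round(a, 2), round(predict(x_bar, a, b), 2))
```

With the book's coefficients, the same `predict` function gives the predicted hours for any number of children within the range of the data, for example `predict(4, 1.49, 0.69)` for a family of four children.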

Interpretation of a, cont.
Most of the time, you can't interpret the value of the intercept.
Technically, it is the value that Y would take if X were zero.
But most often a zero X is not meaningful.
Or, as in the case in your book, zero is outside the range of the data.
You don't have any information about the hours of housework that husbands do when they have no children.
Technically, the intercept of 1.49 is the predicted amount of housework a husband with zero children would do, but you can't say that with certainty.

Least Squares Regression Line
Now that you know a and b, you can fill in the full least-squares regression line.
Y = a + bX
Y = (1.49) + (.69)X
This formula can be used to predict scores on Y, as was mentioned earlier.
For any value of X, it will give you the predicted value of Y (Y′).
The predictions of husbands' housework are educated guesses.
The accuracy of our predictions will increase as relationships become stronger (as the dots lie closer to the regression line).

The Correlation Coefficient (Pearson's r)
Pearson's r varies from 0 to plus or minus 1, with 0 indicating no association and +1 and −1 indicating perfect positive and perfect negative relationships.
The definitional formula for Pearson's r is in your book.
Similar to the formula for b (beta), the numerator is the covariation between X and Y (usually called the covariance).

Interpreting r and r-squared
Interpretation of r is the same as for all the other measures of association.
An r of .5 would indicate a moderate positive linear relationship between the variables.

Interpretation of the Coefficient of Determination (r-squared)
The square of Pearson's r is also called the coefficient of determination.
r measures the strength of the linear relationship between two variables, but values between 0 and 1 (or between 0 and −1) have no direct interpretation.

Interpretation, cont.
The coefficient of determination can be interpreted with the logic of PRE (proportional reduction in error).
First, Y is predicted while ignoring the information supplied by X.
Second, the independent variable is taken into account when predicting the dependent variable.
When working with variables measured at the interval-ratio level, the prediction of Y under the first condition (while ignoring X) will be the mean of the Y scores (Y bar) for every case.
We know that the mean of any distribution is closer than any other point to all the scores in the distribution, in the sense that it minimizes the sum of squared deviations.

Interpretation, cont.
Predicting the mean of Y for every case will produce many errors.
The amount of error is shown in Figure 16.6.
The formula for the error is the sum of (Y minus Y bar) squared.
This is called the total variation in Y, meaning the total amount by which all the points are off the mean of Y.
The next step is to find the extent to which knowledge of X improves our ability to predict Y (will we make predictions that come closer to the actual points than the mean of Y does?).

Interpretation, cont.
If the two variables have a linear relationship, then predicting scores on Y from the least-squares regression equation will use knowledge of X and reduce our errors of prediction.
The formula for the predicted Y score for each value of X will be Y′ = a + bX.
This is also the formula for the regression line.

Interpretation, cont.
In Figure 16.7, we measure the distance of the actual data points from the regression line.
If there is less distance (smaller errors) here than between the actual points and the mean of Y, then there is an association between the two variables.
The vertical lines from each data point to the regression line represent the amount of error in predicting Y that remains even after X has been taken into account.
We can calculate the improvement precisely with r-squared.
r-squared is the proportion (or, when multiplied by 100, the percentage) of the variation in Y that is explained by X.

Unexplained Variation
This implies that some of the variation in Y is unexplained by X.
The proportion of the total variation in Y unexplained by X can be found by subtracting the value of r-squared from 1.00.

Unexplained Variation, cont.
Unexplained variation is usually attributed to the influence of three things:
Some combination of other variables, as in the example of the husbands' housework.
Measurement error: people overestimate or underestimate how much time they spend doing housework.
Random chance: your sample may be unrepresentative, particularly if it is small.

Testing Pearson's r for Significance
When r is based on data from a random sample, you need to test r for its statistical significance.
When testing Pearson's r for significance, the null hypothesis is that there is no linear association between the variables in the population from which the sample was drawn.
We will use the t distribution for this test.

Assumptions for the Significance Test
We make some additional assumptions in Step 1.
We need to assume that both variables are normal in distribution.
We need to assume that the relationship between the two variables is roughly linear in form.
The third new assumption involves the concept of homoscedasticity.

Homoscedasticity
A homoscedastic relationship is one where the variance of the Y scores is uniform for all values of X.
If the Y scores are evenly spread above and below the regression line for the entire length of the line, the relationship is homoscedastic.
If the variance around the regression line is greater at one end or the other, the relationship is heteroscedastic.
A visual inspection of the scattergram is usually sufficient to find the extent to which the relationship conforms to the assumptions of linearity and homoscedasticity.
If the data points fall in a roughly symmetrical, cigar-shaped pattern whose shape can be approximated with a straight line, then it is appropriate to proceed with this test of significance.

Scattergram of Relationship Between the Two Variables

Regression of Husbands' Hours of Housework by the Number of Children in the Family
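
The PRE reading of r-squared and the significance test can be sketched in a few lines. This is an illustration under stated assumptions, not the book's worked example: the data are invented, the function names are hypothetical, and the test statistic uses the commonly taught form t = r × sqrt((N − 2) / (1 − r²)) with N − 2 degrees of freedom, which the transcript itself does not spell out.

```python
from math import sqrt
from statistics import mean


def sums_of_squares(xs, ys):
    """Return (SP_xy, SS_x, SS_y): summed cross-products and squared deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return sp_xy, ss_x, ss_y


def pearson_r(xs, ys):
    """Pearson's r: covariation of X and Y scaled by the variation in each."""
    sp_xy, ss_x, ss_y = sums_of_squares(xs, ys)
    return sp_xy / sqrt(ss_x * ss_y)


def proportional_reduction_in_error(xs, ys):
    """(total variation - unexplained variation) / total variation."""
    sp_xy, ss_x, ss_y = sums_of_squares(xs, ys)
    b = sp_xy / ss_x
    a = mean(ys) - b * mean(xs)
    total = ss_y  # errors made when predicting Y-bar for every case
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (total - unexplained) / total


def t_obtained(r, n):
    """t statistic for H0: no linear association (undefined when r is exactly +1 or -1)."""
    return r * sqrt((n - 2) / (1 - r ** 2))


# Hypothetical data (not the book's): number of children and weekly housework hours.
children = [1, 2, 3, 4, 5]
hours = [2, 3, 3, 5, 5]

r = pearson_r(children, hours)
pre = proportional_reduction_in_error(children, hours)
t = t_obtained(r, len(children))
print(round(r, 3), round(r ** 2, 3), round(pre, 3), round(1 - r ** 2, 3), round(t, 3))
```

The PRE value computed from the two sets of prediction errors matches r-squared, and 1 minus r-squared is the unexplained proportion. The obtained t would then be compared with the critical t for N − 2 degrees of freedom at the chosen alpha level.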