Sociology 601 Class 19: November 3, 2008
• Review of correlation and standardized coefficients
• Statistical inference for the slope (9.5)
• Violations of Model Assumptions, and their effects (9.6)
9.5 Inference for a slope.
• Problem: we have measures for the strength of association between two linear variables, but no measures for the statistical significance of that association.
• We know the slope & intercept for our sample; what can we say about the slope & intercept for the population?
• Solution: hypothesis tests for a slope and confidence intervals for a slope.
• Need a standard error for the coefficients
• Difficulties: additional assumptions, complications with estimating a standard error for a slope.
Assumptions Needed to make Population Inferences for slopes.
• The sample is selected randomly.
• X and Y are interval scale variables.
• The mean of Y is related to X by the linear equation E{Y} = α + βX.
• The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity)
• The conditional distribution of Y at each value of X is normal.
• There is no error in the measurement of X.
Common Ways to Violate These Assumptions
• The sample is selected randomly.
o Cluster sampling (e.g., census tracts / neighborhoods) causes observations in any cluster to be more similar to each other than to observations outside the cluster.
o Two or more siblings in the same family.
o Sample = population (e.g., states in the U.S.)
• X and Y are interval scale variables.
o Ordinal scale attitude measures
o Nominal scale categories (e.g., race/ethnicity, religion)
Common Ways to Violate These Assumptions (2)
• The mean of Y is related to X by the linear equation E{Y} = α + βX.
o U-shape: e.g., Kuznets inverted-U curve (inequality <- GDP/capita)
o Thresholds:
o Logarithmic (e.g., earnings <- education)
• The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity)
o earnings <- education
o hours worked <- time
o adult child occupational status <- parental occupational status
Common Ways to Violate These Assumptions (3)
• The conditional distribution of Y at each value of X is normal.
o earnings (skewed) <- education
o Y is binary, or a %
• There is no error in the measurement of X.
o almost everything
o what is the effect of measurement error in x on b?
The Null hypothesis for slopes
Null hypothesis: the variables are statistically independent.
• Ho: β = 0. The null hypothesis is that there is no linear relationship between X and Y.
• Implication for α: E{Y} = α + 0*X = α; α = μY.
(Draw figure of distribution of Y, X when Ho is true)
Test Statistic for slopes
• What is the range of b’s we would get if we take repeated samples from a population and calculate b for each of those samples?
• That is, what is the standard error of the sample slope b’s?
• Test statistic: t = b / σ̂_b
o where σ̂_b is the standard error of the sample slope b.
o df for the t statistic (with one x-variable) is n-2
o when n is large, the t statistic is asymptotically equivalent to a z-statistic
• What would make σ̂_b smaller?
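The repeated-sampling question on this slide can be explored with a small simulation. The sketch below is illustrative only: the population values (alpha = 2, beta = 0.5, sigma = 3, X drawn from a normal distribution with mean 10 and s.d. 2), the seed, and the function names are assumptions made up for this example, not values from the course data.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Least-squares slope: sum of cross-products over sum of squares of X."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def simulate_slopes(n_samples=2000, n=50, alpha=2.0, beta=0.5, sigma=3.0, seed=601):
    """Draw repeated samples from a hypothetical population; return each sample's slope."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        xs = [rng.gauss(10.0, 2.0) for _ in range(n)]
        ys = [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]
        slopes.append(ols_slope(xs, ys))
    return slopes

slopes = simulate_slopes()
mean_b = statistics.fmean(slopes)   # should sit near the population slope, 0.5
sd_b = statistics.stdev(slopes)     # empirical standard error of b

print(f"mean of sample slopes: {mean_b:.3f}")
print(f"std. dev. of sample slopes: {sd_b:.3f}")
```

The standard deviation of the simulated slopes is an empirical stand-in for the standard error of b. Rerunning with a larger n, a wider spread of X, or a smaller sigma should shrink it.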
Calculating the s.e. of b
σ̂_b = σ̂ / (sX * sqrt(n-1))
where σ̂ = sqrt(SSE/(n-2)) (= root MSE)
• the standard error of b is smaller when…
o the sample size is large
o the standard deviation of X is large (there is a wide range of X values)
o the conditional standard deviation of Y is small.
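The three conditions above follow directly from the formula, and a few lines of arithmetic make that concrete. This is a minimal sketch: the function names and the input values are hypothetical, chosen only so the square roots come out evenly.

```python
import math

def root_mse(sse, n):
    """Conditional s.d. estimate: sqrt(SSE / (n - 2))."""
    return math.sqrt(sse / (n - 2))

def se_slope(sigma_hat, s_x, n):
    """Standard error of b: sigma_hat / (s_x * sqrt(n - 1))."""
    return sigma_hat / (s_x * math.sqrt(n - 1))

# Hypothetical inputs, not course data
baseline = se_slope(sigma_hat=2.0, s_x=1.0, n=26)     # 2 / (1 * 5)  = 0.4
bigger_n = se_slope(sigma_hat=2.0, s_x=1.0, n=101)    # 2 / (1 * 10) = 0.2
wider_x = se_slope(sigma_hat=2.0, s_x=2.0, n=26)      # 2 / (2 * 5)  = 0.2
less_noise = se_slope(sigma_hat=1.0, s_x=1.0, n=26)   # 1 / (1 * 5)  = 0.2

for label, value in [("baseline", baseline), ("larger n", bigger_n),
                     ("wider X", wider_x), ("smaller sigma", less_noise)]:
    print(f"{label:>14}: {value:.3f}")
```

Doubling the spread of X or halving the conditional s.d. of Y halves the standard error, while roughly quadrupling the sample size is needed for the same gain.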
Conclusions about Population
• P-value: calculated as in any t-test, but remember df = n-2
o a z-test is appropriate when n > 30 or so
• Conclusions: evaluate the p-value based on a previously selected alpha level
Rule of thumb: |b| should be at least twice its standard error (roughly p < .05, two-sided).
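For large samples, the p-value can be approximated by treating the test statistic as a z-score. The sketch below assumes that large-n normal approximation; for small n the t distribution with n-2 df should be used instead. The function names and the values b = 0.9, se = 0.3 are hypothetical.

```python
import math

def two_sided_p_normal(test_stat):
    """Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(test_stat) / math.sqrt(2.0))

def passes_rule_of_thumb(b, se_b):
    """Rule of thumb: |b| is at least twice its standard error."""
    return abs(b) >= 2.0 * se_b

# Hypothetical slope and standard error
b, se_b = 0.9, 0.3
t_stat = b / se_b
p_value = two_sided_p_normal(t_stat)

print(f"t = {t_stat:.2f}, approximate two-sided p = {p_value:.4f}")
print("passes rule of thumb:", passes_rule_of_thumb(b, se_b))
```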
Example of Inference about a Slope
• In an analysis of poverty and crime in the 50 states plus DC, a computer output provides the following:
• E{Murder rate} = -10.14 + 1.323*{Poverty rate} (Poverty rate in %, murder rate per 100,000)
• SSE = 3904.3 SST = 5743.3
• N = 51 Sx = 4.584
• Do a hypothesis test to determine whether there is a linear relationship between crime rates and poverty rates.
Stata Example of Inference about a Slope
• In an analysis of poverty and crime in the 50 states plus DC, Stata output provides the following:
regress murder poverty
      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =   23.08
       Model |  1839.06931     1  1839.06931           Prob > F      =  0.0000
    Residual |  3904.25223    49  79.6786169           R-squared     =  0.3202
-------------+------------------------------           Adj R-squared =  0.3063
       Total |  5743.32154    50  114.866431           Root MSE      =  8.9263

------------------------------------------------------------------------------
      murder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |    1.32296   .2753711     4.80   0.000     .7695805    1.876339
       _cons |   -10.1364   4.120616    -2.46   0.017    -18.41708   -1.855707
------------------------------------------------------------------------------
• Interpret whether there is a linear relationship between crime rates and poverty rates.
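The pieces of the regress output are tied together by simple arithmetic, which can be checked by hand. The sketch below copies the numbers from the output above; the variable names are chosen here for illustration.

```python
import math

# Values copied from the Stata output for: regress murder poverty
ss_model, ss_resid, ss_total = 1839.06931, 3904.25223, 5743.32154
df_model, df_resid = 1, 49
coef, std_err = 1.32296, 0.2753711

ms_resid = ss_resid / df_resid              # mean squared error (MS Residual)
root_mse = math.sqrt(ms_resid)              # conditional s.d. estimate (Root MSE)
r_squared = ss_model / ss_total             # equals 1 - SSE/SST
f_stat = (ss_model / df_model) / ms_resid   # F(1, 49)
t_stat = coef / std_err                     # t for the poverty slope

print(f"Root MSE  = {root_mse:.4f}")
print(f"R-squared = {r_squared:.4f}")
print(f"F         = {f_stat:.2f}")
print(f"t         = {t_stat:.2f}  (t squared = {t_stat ** 2:.2f})")
```

With a single x-variable, the F statistic is the square of the slope's t statistic, so the two tests give the same p-value.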
Example of Inference about a Slope
• SSE = 3904.3 SST = 5743.3
• N = 51 Sx = 4.584
• b = 1.323
where b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)², and a = Ȳ − bX̄
Example of Inference about a Slope
• SSE = 3904.3 SST = 5743.3
• N = 51 Sx = 4.584
• b = 1.323
• seb = sqrt(SSE / (n-2)) / (sx * sqrt(n-1))
= sqrt(3904.3/49) / (4.584 * sqrt(50))
= sqrt(79.68) / (4.584 * 7.071)
= 8.926 / 32.413
= 0.275
• t = b / seb = 1.323 / 0.275 = 4.81
• p < .001
• 95% confidence interval for b = 0.771 to 1.875
Confidence interval for a slope.
• Confidence interval for a slope:
c.i. = b ± t * σ̂_b
o the standard t-score for a 95% confidence interval is t.025, with df = n-2
• An alternative to a confidence interval is to report both b and σ̂_b.
Example of Confidence Interval of a Slope
• SSE = 3904.3 SST = 5743.3
• N = 51 Sx = 4.584
• b = 1.323
• seb = 0.275
95% confidence interval for b = 1.323 ± 2.009*0.275
= 1.323 ± 0.552 = 0.771 to 1.875
(Stata, working from unrounded values, reports 0.770 to 1.876.)
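The standard error, t statistic, and confidence interval can be recomputed from the summary statistics in a few lines. This is a sketch under stated assumptions: Sx = 4.584 is taken from the earlier slide, the critical value 2.009 is the t.025 score quoted on this slide for df = 49, and the variable names are illustrative.

```python
import math

# Summary statistics from the murder/poverty example
sse, n, s_x, b = 3904.3, 51, 4.584, 1.323
t_crit = 2.009   # t.025 with df = n - 2 = 49, as quoted on the slide

sigma_hat = math.sqrt(sse / (n - 2))           # root MSE
se_b = sigma_hat / (s_x * math.sqrt(n - 1))    # standard error of the slope
t_stat = b / se_b
lower = b - t_crit * se_b
upper = b + t_crit * se_b

print(f"se(b) = {se_b:.3f}")
print(f"t     = {t_stat:.2f}")
print(f"95% CI for b: {lower:.3f} to {upper:.3f}")
```

Carrying full precision through the calculation should land within rounding error of the interval in the Stata output (.7695805 to 1.876339).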
Inference for a slope using Stata
. regress attend regul

      Source |       SS       df       MS              Number of obs =      18
-------------+------------------------------           F(  1,    16) =    9.65
       Model |  2240.05128     1  2240.05128           Prob > F      =  0.0068
    Residual |  3715.94872    16  232.246795           R-squared     =  0.3761
-------------+------------------------------           Adj R-squared =  0.3371
       Total |        5956    17  350.352941           Root MSE      =   15.24

------------------------------------------------------------------------------
      attend |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       regul |  -5.358974    1.72555    -3.11   0.007    -9.016977   -1.700972
       _cons |   36.83761   5.395698     6.83   0.000     25.39924    48.27598
------------------------------------------------------------------------------
• The significance test and confidence interval for b appear on the line with the name of the x-variable.
• Can you find SSE and SST? df for the model? r?
Things to watch out for: extrapolation.
Extrapolation beyond observed values of X is dangerous.
• The pattern may be nonlinear.
• Even if the pattern is linear, the standard errors of predicted values grow as X moves farther from the observed data.
• Be especially careful interpreting the Y-intercept: it may lie outside the observed data.
o e.g., year zero
o e.g., zero education in the U.S.
o e.g., zero parity
Things to watch out for: outliers
• Influential observations and outliers may unduly influence the fit of the model.
• The slope and standard error of the slope may be affected by influential observations.
• This is an inherent weakness of least squares regression.
• You may wish to evaluate two models; one with and one without the influential observations.
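Fitting the line with and without a suspect point shows how much one observation can move the slope. The data below are made up for illustration (five points lying exactly on y = 2x, plus one point far to the right), and fit_line is a hypothetical helper, not a course function.

```python
import statistics

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Made-up data: five points exactly on y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

# One influential observation, far from the other X values
outlier_x, outlier_y = 15, 0

a_without, b_without = fit_line(xs, ys)
a_with, b_with = fit_line(xs + [outlier_x], ys + [outlier_y])

print(f"without the point: a = {a_without:.3f}, b = {b_without:.3f}")
print(f"with the point:    a = {a_with:.3f}, b = {b_with:.3f}")
```

In this contrived case a single observation changes the slope from 2 to about -0.31, which is why reporting both models can be informative.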
Things to watch out for: truncated samples
Truncated samples cause the opposite problem from influential observations and outliers.
• Truncation on the X axis reduces the correlation coefficient for the remaining data.
• Truncation on the Y axis is a worse problem, because it violates the assumption of normally distributed errors.
• Examples: top-coded income data; health as measured by number of days spent in a hospital in a year.
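The effect of truncating on X can be seen in a short simulation. Everything here is an assumption made for illustration: X is standard normal, Y = X + standard normal noise, the cutoffs are ±0.5, and pearson_r is a helper written for this sketch.

```python
import math
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(601)
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
ys = [x + rng.gauss(0.0, 1.0) for x in xs]

r_full = pearson_r(xs, ys)

# Keep only cases with X in a narrow band (truncation on the X axis)
kept = [(x, y) for x, y in zip(xs, ys) if -0.5 <= x <= 0.5]
r_truncated = pearson_r([x for x, _ in kept], [y for _, y in kept])

print(f"r, full sample:      {r_full:.3f}  (n = {len(xs)})")
print(f"r, truncated sample: {r_truncated:.3f}  (n = {len(kept)})")
```

The underlying relationship is the same in both samples; only the range of X differs, and the correlation in the restricted sample is much weaker.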
Things to watch out for: measurement error
Error in measurement of the X variable creates a bias that makes the correlation appear weaker.
This problem can be a measurement issue or an interpretation issue.
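A simulation can illustrate how measurement error in X pulls the estimated slope toward zero. The sketch assumes classical measurement error (random noise added to X, unrelated to the true X or to the errors in Y); the population values, seed, and names are hypothetical.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(601)
n = 5000
beta = 0.5   # hypothetical population slope

x_true = [rng.gauss(0.0, 2.0) for _ in range(n)]                # var(X) = 4
ys = [1.0 + beta * x + rng.gauss(0.0, 1.0) for x in x_true]
x_observed = [x + rng.gauss(0.0, 2.0) for x in x_true]          # error variance = 4

b_true = ols_slope(x_true, ys)
b_observed = ols_slope(x_observed, ys)

# Under classical measurement error, the slope shrinks by var(X) / (var(X) + var(error))
reliability = 4.0 / (4.0 + 4.0)

print(f"slope using true X:     {b_true:.3f}")
print(f"slope using observed X: {b_observed:.3f}")
print(f"beta * reliability:     {beta * reliability:.3f}")
```

With error variance equal to the variance of X, the estimated slope is roughly half the true slope in this setup.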