Sociology 601 Class 19: November 3, 2008

• Review of correlation and standardized coefficients

• Statistical inference for the slope (9.5)

• Violations of Model Assumptions, and their effects (9.6)

9.5 Inference for a slope.

• Problem: we have measures for the strength of association between two linear variables, but no measures for the statistical significance of that association.

• We know the slope & intercept for our sample; what can we say about the slope & intercept for the population?

• Solution: hypothesis tests for a slope and confidence intervals for a slope.

• Need a standard error for the coefficients

• Difficulties: additional assumptions, complications with estimating a standard error for a slope.

Assumptions Needed to make Population Inferences for slopes.

• The sample is selected randomly.

• X and Y are interval scale variables.

• The mean of Y is related to X by the linear equation E{Y} = α + βX.

• The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity)

• The conditional distribution of Y at each value of X is normal.

• There is no error in the measurement of X.

Common Ways to Violate These Assumptions

• The sample is selected randomly.

o Cluster sampling (e.g., census tracts / neighborhoods) causes observations in any cluster to be more similar to each other than to observations outside the cluster.

o Two or more siblings in the same family.

o Sample = population (e.g., states in the U.S.)

• X and Y are interval scale variables.

o Ordinal scale attitude measures

o Nominal scale categories (e.g., race/ethnicity, religion)

Common Ways to Violate These Assumptions (2)

• The mean of Y is related to X by the linear equation E{Y} = α + βX.

o U-shape: e.g., Kuznets inverted-U curve (inequality <- GDP/capita)

o Thresholds:

o Logarithmic (e.g., earnings <- education)

• The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity)

o earnings <- education

o hours worked <- time

o adult child occupational status <- parental occupational status

Common Ways to Violate These Assumptions (3)

• The conditional distribution of Y at each value of X is normal.

o earnings (skewed) <- education

o Y is binary, or a %

• There is no error in the measurement of X.

o almost everything

o what is the effect of measurement error in x on b?

The Null hypothesis for slopes

Null hypothesis: the variables are statistically independent.

• Ho: β = 0. The null hypothesis is that there is no linear relationship between X and Y.

• Implication for α: E{Y} = α + 0*X = α; α = μ_Y (the population mean of Y).

(Draw figure of distribution of Y, X when Ho is true)

Test Statistic for slopes

• What is the range of b’s we would get if we take repeated samples from a population and calculate b for each of those samples?

• That is, what is the standard error of the sample slope b’s?

• Test statistic: t = b / σ̂_b

o where σ̂_b is the standard error of the sample slope b.

o df for the t statistic (with one x-variable) is n-2

o when n is large, the t statistic is asymptotically equivalent to a z-statistic

• What would make σ̂_b smaller?
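The repeated-sampling question above can be explored by simulation. The sketch below is only an illustration, not part of the original course materials: the population values (alpha, beta, sigma), the distribution of X, and the function names are all hypothetical choices. It draws many samples from one population, computes b for each, and reports the spread of those b's, which is what the standard error of b estimates.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Least-squares slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    xbar = statistics.fmean(xs)
    ybar = statistics.fmean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx


def simulate_slopes(n_samples=2000, n=50, alpha=2.0, beta=0.5, sigma=3.0, seed=1):
    """Draw repeated samples from a hypothetical population; return each sample's b."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        xs = [rng.gauss(10.0, 2.0) for _ in range(n)]                 # assumed X distribution
        ys = [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]   # linear mean, normal errors
        slopes.append(ols_slope(xs, ys))
    return slopes


slopes = simulate_slopes()
mean_b = statistics.fmean(slopes)
sd_b = statistics.stdev(slopes)
print(f"mean of b across samples: {mean_b:.3f}")
print(f"sd of b across samples (empirical standard error): {sd_b:.3f}")
```

Rerunning with a larger n, a wider spread of X, or a smaller sigma should shrink the spread of the simulated slopes.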

Calculating the s.e. of b

σ̂_b = σ̂ / (sX*sqrt(n-1))

where σ̂ = sqrt(SSE/(n-2)) (= root MSE)

• the standard error of b is smaller when…

o the sample size is large

o the standard deviation of X is large (there is a wide range of X values)

o the conditional standard deviation of Y is small.
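A short Python sketch can make the three conditions concrete. The function name and all input values here are hypothetical, chosen so the arithmetic is easy to follow; only the formula itself comes from the slide.

```python
import math


def slope_se(sigma_hat, s_x, n):
    """Standard error of b: sigma_hat / (s_x * sqrt(n - 1))."""
    return sigma_hat / (s_x * math.sqrt(n - 1))


# Hypothetical baseline: root MSE = 10, sd of X = 5, n = 101.
baseline = slope_se(sigma_hat=10.0, s_x=5.0, n=101)
larger_n = slope_se(sigma_hat=10.0, s_x=5.0, n=401)       # larger sample
wider_x = slope_se(sigma_hat=10.0, s_x=10.0, n=101)       # wider range of X
smaller_sigma = slope_se(sigma_hat=5.0, s_x=5.0, n=101)   # less scatter around the line

print(baseline, larger_n, wider_x, smaller_sigma)
```

Each of the three changes halves the standard error in this example; note that n has to roughly quadruple to do what doubling sX or halving σ̂ does.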

Conclusions about Population

• P-value: calculated as in any t-test, but remember df = n-2

o a z-test is appropriate when n > 30 or so

• Conclusions: evaluate p-value based on a previously selected alpha level

Rule of thumb: b should be at least 2x its standard error (in absolute value).

Example of Inference about a Slope

• In an analysis of poverty and crime in the 50 states plus DC, a computer output provides the following:

• E{Murder rate} = -10.14 + 1.323*{Poverty rate} (Poverty rate in %, murder rate per 100,000)

• SSE = 3904.3 SST = 5743.3

• N = 51 Sx = 4.584

• Do a hypothesis test to determine whether there is a linear relationship between crime rates and poverty rates.

Stata Example of Inference about a Slope

• In an analysis of poverty and crime in the 50 states plus DC, Stata output provides the following:

regress murder poverty

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =   23.08
       Model |  1839.06931     1  1839.06931           Prob > F      =  0.0000
    Residual |  3904.25223    49  79.6786169           R-squared     =  0.3202
-------------+------------------------------           Adj R-squared =  0.3063
       Total |  5743.32154    50  114.866431           Root MSE      =  8.9263

------------------------------------------------------------------------------
      murder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |    1.32296   .2753711     4.80   0.000     .7695805    1.876339
       _cons |   -10.1364   4.120616    -2.46   0.017    -18.41708   -1.855707
------------------------------------------------------------------------------

• Interpret whether there is a linear relationship between crime rates and poverty rates.

Example of Inference about a Slope

• SSE = 3904.3 SST = 5743.3

• N = 51 Sx = 4.58

• b = 1.323

b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²,   a = Ȳ − bX̄
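The formulas for b and a translate directly into a few lines of Python. This is a minimal sketch: the function name and the five data points are made up for illustration and are not the state-level data from the example.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # b = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
        (x - xbar) ** 2 for x in xs
    )
    # a = Ybar - b * Xbar
    a = ybar - b * xbar
    return a, b


# Hypothetical toy data.
a, b = least_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"a = {a:.2f}, b = {b:.2f}")
```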

Example of Inference about a Slope

• SSE = 3904.3 SST = 5743.3

• N = 51 Sx = 4.584

• b = 1.323

• seb = sqrt (SSE / (n-2) ) / (sx * sqrt(n-1))

= sqrt (3904.3/49) / ( 4.584*sqrt(50) )

= sqrt (79.68) / (4.584 * 7.071)

= 8.926 / 32.41

= 0.275

• t = b / seb = 1.323 / 0.275 = 4.81

• p < .001

• 95% confidence interval for b = 0.771 to 1.875
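The hand calculation above can be checked with a short Python sketch. The inputs (SSE, n, sX, b) are the values from the example; the function names are hypothetical. Because the standard library has no t distribution, the p-value is approximated here by integrating the t density numerically with Simpson's rule, which is an implementation choice for this sketch rather than how Stata computes it.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value: 1 minus the area between -|t| and |t| (Simpson's rule)."""
    a = abs(t_stat)
    h = 2 * a / steps  # steps must be even for Simpson's rule
    total = t_pdf(-a, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(-a + i * h, df)
    return max(0.0, 1 - total * h / 3)


sse, n, s_x, b = 3904.3, 51, 4.584, 1.323
se_b = math.sqrt(sse / (n - 2)) / (s_x * math.sqrt(n - 1))
t_stat = b / se_b
p_value = two_sided_p(t_stat, df=n - 2)

print(f"se_b = {se_b:.3f}, t = {t_stat:.2f}, p = {p_value:.6f}")
```

The standard error and t statistic should agree with the Stata output (0.275 and 4.80) up to rounding, and the p-value should fall well below .001.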

Confidence interval for a slope.

• Confidence interval for a slope: c.i. = b ± t*σ̂_b

o the standard t-score for a 95% confidence interval is t.025, with df = n-2

• An alternative to a confidence interval is to report both b and σ̂_b.

Example of Confidence Interval of a Slope

• SSE = 3904.3 SST = 5743.3

• N = 51 Sx = 4.58

• b = 1.323

• seb = 0.275

95% confidence interval for b = 1.323 +- 2.009*0.275

= 1.323 +- 0.552 = 0.771 to 1.875
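As a quick check of the interval arithmetic, here is a small Python sketch. The function name is hypothetical; the critical value 2.009 (t.025 with df = 49) is taken from the slide rather than computed.

```python
def slope_confidence_interval(b, se_b, t_crit):
    """Return (lower, upper) for b +/- t_crit * se_b."""
    margin = t_crit * se_b
    return b - margin, b + margin


lower, upper = slope_confidence_interval(b=1.323, se_b=0.275, t_crit=2.009)
print(f"95% CI for the slope: {lower:.3f} to {upper:.3f}")
```

With the rounded inputs this gives roughly 0.771 to 1.875, which matches the interval Stata reports for poverty (.770 to 1.876) up to rounding. Using z = 1.96 instead of t = 2.009 would give a slightly narrower interval.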

Inference for a slope using Stata

. regress attend regul

      Source |       SS       df       MS              Number of obs =      18
-------------+------------------------------           F(  1,    16) =    9.65
       Model |  2240.05128     1  2240.05128           Prob > F      =  0.0068
    Residual |  3715.94872    16  232.246795           R-squared     =  0.3761
-------------+------------------------------           Adj R-squared =  0.3371
       Total |        5956    17  350.352941           Root MSE      =   15.24

------------------------------------------------------------------------------
      attend |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       regul |  -5.358974    1.72555    -3.11   0.007    -9.016977   -1.700972
       _cons |   36.83761   5.395698     6.83   0.000     25.39924    48.27598
------------------------------------------------------------------------------

• The significance test and confidence interval for b appear on the line with the name of the x-variable.

• Can you find SSE and SST? df for the model? r?
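One way to work through these questions is to read SSE (Residual SS), SST (Total SS), and the model df off the output and recompute the summary statistics. The Python sketch below does that; the variable names are my own, and the sign of r is taken from the sign of the slope.

```python
import math

# Values read from the Stata output for "regress attend regul".
sse = 3715.94872    # Residual SS
sst = 5956.0        # Total SS
n = 18
model_df = 1        # one x-variable
b = -5.358974       # slope on regul

r_squared = 1 - sse / sst
r = math.copysign(math.sqrt(r_squared), b)   # r carries the sign of the slope
f_stat = ((sst - sse) / model_df) / (sse / (n - 2))

print(f"R-squared = {r_squared:.4f}, r = {r:.3f}, F = {f_stat:.2f}")
```

These should reproduce the R-squared (0.3761) and F (9.65) shown in the output. With a single predictor, F is the square of the slope's t statistic, up to rounding.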

Things to watch out for: extrapolation.

Extrapolation beyond observed values of X is dangerous.

• The pattern may be nonlinear.

• Even if the pattern is linear, the standard errors become increasingly large.

• Be especially careful interpreting the Y-intercept: it may lie outside the observed data.

o e.g., year zero

o e.g., zero education in the U.S.

o e.g., zero parity

Things to watch out for: outliers

• Influential observations and outliers may unduly influence the fit of the model.

• The slope and standard error of the slope may be affected by influential observations.

• This is an inherent weakness of least squares regression.

• You may wish to evaluate two models; one with and one without the influential observations.
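The suggestion to fit the model with and without influential observations can be sketched as follows. The data are invented for illustration: ten points lying close to a line with slope 2, plus one hypothetical influential point far out on the X axis.

```python
def ols_slope(xs, ys):
    """Least-squares slope b."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: roughly Y = 2X.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.8, 16.1, 18.0, 19.9]

# One made-up influential observation, far from the other X values.
outlier_x, outlier_y = 30, 5.0

slope_without = ols_slope(xs, ys)
slope_with = ols_slope(xs + [outlier_x], ys + [outlier_y])

print(f"slope without the influential point: {slope_without:.2f}")
print(f"slope with the influential point:    {slope_with:.2f}")
```

In this toy example a single observation pulls the slope from about 2 to nearly 0, which is why reporting both fits can be informative.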

Things to watch out for: truncated samples

Truncated samples cause problems opposite to those caused by influential observations and outliers.

• Truncation on the X axis reduces the correlation coefficient for the remaining data.

• Truncation on the Y axis is a worse problem, because it violates the assumption of normally distributed errors.

• Examples: top-coded income data; health as measured by number of days spent in a hospital in a year.
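The effect of truncation on the X axis can be illustrated with a small simulation. Everything here is a hypothetical setup (standard normal X, Y = X plus standard normal noise, and a cutoff keeping only X between -0.5 and 0.5); it is a sketch of the idea, not an analysis of real data.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
ys = [x + rng.gauss(0.0, 1.0) for x in xs]

# Keep only observations with X in a narrow band (truncation on X).
kept = [(x, y) for x, y in zip(xs, ys) if -0.5 <= x <= 0.5]

r_full = pearson_r(xs, ys)
r_truncated = pearson_r([x for x, _ in kept], [y for _, y in kept])

print(f"r in the full sample:      {r_full:.2f}")
print(f"r in the truncated sample: {r_truncated:.2f}")
```

The relationship between X and Y is the same in both samples; the correlation drops only because the remaining range of X is narrow relative to the scatter in Y.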

Things to watch out for: measurement error

Error in measurement of the X variable creates a bias that makes the correlation appear weaker.

This problem can be a measurement issue or an interpretation issue.
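A simulation can show how random error in X affects b. In this sketch, all settings are hypothetical: the true slope is 2, and the measurement error added to X has the same variance as X itself. Under classical (random, independent) measurement error, the expected slope shrinks by the factor var(X) / (var(X) + var(error)), which is 0.5 in this setup.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope b."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)
n = 5000
true_x = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in true_x]   # true slope = 2
observed_x = [x + rng.gauss(0.0, 1.0) for x in true_x]  # X measured with error

slope_true = ols_slope(true_x, ys)
slope_observed = ols_slope(observed_x, ys)

print(f"slope using true X:     {slope_true:.2f}")
print(f"slope using observed X: {slope_observed:.2f}")
```

The slope estimated from the error-laden X should come out near 1 rather than 2, that is, biased toward zero.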
