notes: linear regression tests & intervals


Post on 26-Jan-2022


Today is Tuesday February 18th

Notes: Linear Regression

Tests & Intervals

Pgs. 208-211

Homework Notification Assigned Today:

• HWK 7.1 Residuals & Tests

Due: Friday, February 21st

• Death Packet

Due: Monday, March 2nd

• FRQ Packet

Due: Monday, March 5th

Due Today: CWK 7.1 Residual Practice

Assigned: Wednesday, February 11th

Due Tomorrow: No Assignment is Due

Highly Recommended:

Attend STATs Saturday this Saturday

Name:____________________________ HWK 7.1 Residuals & Tests Period:______

Due Friday, February 21st

Here are advertised horsepower ratings and expected gas mileage for several 2001 vehicles.

Vehicle           Horsepower   Gas Mileage (mpg)
Audi A4           170          22
Buick LeSabre     205          20
Chevy Blazer      190          15
Chevy Prizm       125          31
Ford Excursion    310          10
GMC Yukon         285          13
Honda Civic       127          29
Hyundai Elantra   140          25
Lexus 300         215          21
Lincoln LS        210          23
Mazda MPV         170          18
Olds Alero        140          23
Toyota Camry      194          21
VW Beetle         115          29

Check Values: x̄ = 185.43, ȳ = 21.43, Sx = 58.19, Sy = 6.09

Calculate the following statistics:

β1

β0

r

R²

Regression Equation:

Estimate mpg for 250 hp

Calculate the Chevy Blazer Residual

Calculate the Chevy Prizm Residual
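The statistics requested above can be checked with a short script. This is a minimal sketch, assuming the horsepower and mpg values are copied correctly from the table; the function and variable names (`least_squares`, `predict`, `residual`) are illustrative and not part of the worksheet.

```python
import math

# Data copied from the vehicle table (horsepower, mpg), in table order.
horsepower = [170, 205, 190, 125, 310, 285, 127, 140, 215, 210, 170, 140, 194, 115]
mpg = [22, 20, 15, 31, 10, 13, 29, 25, 21, 23, 18, 23, 21, 29]


def least_squares(x, y):
    """Return (slope, intercept, r) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx                   # beta_1
    intercept = y_bar - slope * x_bar   # beta_0
    r = sxy / math.sqrt(sxx * syy)      # correlation coefficient
    return slope, intercept, r


slope, intercept, r = least_squares(horsepower, mpg)
r_squared = r ** 2


def predict(hp):
    """Predicted mpg for a given horsepower."""
    return intercept + slope * hp


def residual(hp, observed_mpg):
    """Residual = observed value minus predicted value."""
    return observed_mpg - predict(hp)


print(f"beta_1 = {slope:.4f}, beta_0 = {intercept:.2f}")
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
print(f"Predicted mpg at 250 hp: {predict(250):.2f}")
print(f"Chevy Blazer residual: {residual(190, 15):.2f}")
print(f"Chevy Prizm residual: {residual(125, 31):.2f}")
```

The sample means computed inside `least_squares` can be compared with the check values x̄ = 185.43 and ȳ = 21.43 to confirm the data were entered correctly.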

Explain the following in context of the situation.

1. Slope:_______________________________________________________________________

______________________________________________________________________________

2. y-Intercept:__________________________________________________________________

______________________________________________________________________________

3. Correlation Coefficient__________________________________________________________

______________________________________________________________________________

4. Coefficient of Determination_____________________________________________________

______________________________________________________________________________

New state requirements force students to take a "high stakes" math test in order to graduate from high school. Faced with the pressure-laden situation, many students become nervous, which may interfere with their ability to perform well. Concerned about "test anxiety", a researcher enlists 24 student volunteers for a study. A psychologist interviews them before the math test, assessing their anxiety levels on a scale from 1 to 10. The table shows the anxiety levels and the exam scores.

Student Number   Anxiety Level   Test Score
1                10              35
2                8               61
3                7               68
4                7               39
5                6               79
6                2               94
7                3               87
8                5               81
9                6               80
10               8               57
11               8               45
12               10              44
13               10              27
14               9               47
15               7               60
16               6               85
17               4               29
18               6               100
19               6               41
20               10              37
21               3               66
22               9               72
23               2               68
24               8               80

A. Calculate and explain the meaning of R² in the context of this problem.

B. Compute and report the equation of the least squares regression line. Identify all variables used in the equation.

C. Calculate the residual value for an "anxiety level" of 5.

D. Interpret the meaning of the slope of the regression equation in the context of this problem.

E. Create a scatterplot for the data. Create a residual plot for the data.

F. Create a hypothesis test and provide a conclusion to determine whether or not there is statistical evidence to suggest that "Test Anxiety" is associated with test scores in math.

G. Create a 95% confidence interval for the slope of the regression line.
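Parts F and G both rest on the standard error of the slope. The sketch below shows one way to carry out the arithmetic, assuming the usual conditions for regression inference hold and using df = n - 2 = 22. The critical value 2.074 is taken from a t table (two-sided 95%, df = 22) and hard-coded; all variable names are illustrative.

```python
import math

# Anxiety levels and test scores for students 1 through 24, from the table.
anxiety = [10, 8, 7, 7, 6, 2, 3, 5, 6, 8, 8, 10,
           10, 9, 7, 6, 4, 6, 6, 10, 3, 9, 2, 8]
score = [35, 61, 68, 39, 79, 94, 87, 81, 80, 57, 45, 44,
         27, 47, 60, 85, 29, 100, 41, 37, 66, 72, 68, 80]

n = len(anxiety)
x_bar = sum(anxiety) / n
y_bar = sum(score) / n
sxx = sum((x - x_bar) ** 2 for x in anxiety)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(anxiety, score))

b1 = sxy / sxx             # sample slope
b0 = y_bar - b1 * x_bar    # sample intercept

# Standard deviation of the residuals, with n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(anxiety, score))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope and the test statistic for H0: beta = 0.
se_b1 = s / math.sqrt(sxx)
t_stat = b1 / se_b1

# Assumption: critical value read from a t table (df = 22, 95% two-sided).
T_CRIT_95_DF22 = 2.074
reject_h0 = abs(t_stat) > T_CRIT_95_DF22

ci_low = b1 - T_CRIT_95_DF22 * se_b1
ci_high = b1 + T_CRIT_95_DF22 * se_b1

print(f"b1 = {b1:.3f}, SE(b1) = {se_b1:.3f}, t = {t_stat:.2f}, df = {n - 2}")
print(f"Reject H0 at the 5% level (two-sided): {reject_h0}")
print(f"95% CI for the slope: ({ci_low:.2f}, {ci_high:.2f})")
```

Comparing |t| with the table value avoids needing a t-distribution routine; a calculator's LinRegTTest would report the exact p-value instead.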

Name:_________________________ Death Packet: Regression and Chi-Square Period:_____

(Due Monday, March 2nd)

1. In a random sample of 25 high school students, each was interviewed as to GPA and weekly hours

worked at part-time jobs. What is the critical t-value in calculating a 90 percent confidence interval

estimate for the slope of the resulting least squares regression line?

(A) 1.645

(B) 1.703

(C) 1.708

(D) 1.711

(E) 1.714

2. Suppose that the scatterplot of log X and log Y shows a strong positive correlation close to 1. Which

of the following is true?

(A) The variables X and Y will also have a correlation close to 1.

(B) A scatterplot of the variables X and Y will show a strong nonlinear pattern.

(C) The residual plot of the variable X and Y will show a random pattern.

(D) The residual plot of the variables Log X and Log Y will show a strong nonrandom pattern.

(E) None of the above are true.

3. Which of the following gives a proper ordering?

[Figure: three scatterplots labeled Correlation r1, Correlation r2, and Correlation r3]

(A) r1 < r3 < r2

(B) r2 < r3 < r1

(C) r2 < r1 < r3 < |r2|

(D) r3 < r1 < |r2|

(E) r1 < r2 < r3 < |r2|

4. Consider the points (-1, 4), (2, 10), (4, 15), (7, 21), (10, n). What should n be so that the correlation between the x and y values is r = 1?

(A) 26

(B) 27

(C) 28

(D) A value different from any of the above.

(E) No value for n can make r = 1.

5. Four pairs of data are used in determining a regression line ŷ = -2 + 6x. If the four values of the independent variable are 37, 52, 18 and 23, respectively, what is the mean of the four values of the dependent variable?

(A) 32.5

(B) 193

(C) 195

(D) 778

(E) The mean cannot be determined from the given information.

6. Data on the number of cancer deaths among Americans (in 1,000s) and years (since 2001) result in the regression line: Predicted Deaths = 550 - 6.05(Years) with r = .863. What is the correct interpretation of the slope?

(A) The number of cancer deaths among Americans has been dropping by an average of 6,050 per

year since 2001.

(B) The baseline number of cancer deaths among Americans is 550,000.

(C) The regression line explains 74.5 percent of the variation in cancer deaths among Americans over the years since 2001.

(D) The regression line explains 86.3 percent of the variation in cancer deaths among Americans

over the years since 2001.

(E) Cancer will be cured in the year 2092.

7. To the right is a scatterplot with one point labeled X. Suppose you find the least squares regression line. Which of the following is a true statement?

[Figure: scatterplot with one point labeled X]

(A) X has the largest residual, in absolute value, of any point on the scatterplot.

(B) X is an influential point.

(C) There will be no pattern in the residual plot.

(D) X is an outlier.

(E) None of the above are true statements.

8. Suppose the correlation between two variables is .85. If each of the y-variables is multiplied by -1, which of the following is true about the new scatterplot?

(A) It slopes up to the right, and the correlation is -.85.

(B) It slopes down to the right, and the correlation is -.85.

(C) It slopes up to the right, and the correlation is .85.

(D) It slopes down to the right, and the correlation is .85.

(E) None of the above is true.

9. Which of the following statements about the correlation coefficient r is incorrect?

(A) It is not affected by the measurement units of the variables.

(B) It is not affected by which variable is called x and which is called y.

(C) It is not affected by extreme values.

(D) It gives information about a linear relationship, not about causation.

(E) It always gives values between -1 and 1, even if the association is nonlinear.

10. A linear regression analysis is performed on the data from two scatterplots, A and B, resulting in

identical least squares regression lines with positive slopes. Which of the following statements is

true?

(A) The sum of the squares of the residuals in A equals the sum of the squares of the residuals in B.

(B) The correlation in A equals the correlation in B.

(C) If the sum of the squares of the residuals in A is greater than the sum of the squares of the

residuals in B, then the correlation in A will be greater than the correlation in B.

(D) If the sum of the squares of the residuals in A is greater than the sum of the squares of the residuals in B, then the correlation in A will be less than the correlation in B.

(E) None of the above are true statements.

11. A study of weekly hours of television watched and SAT scores reports a correlation of r = -1.18.

From this information, we can conclude that

(A) Students who watch more TV tend to have lower SAT scores

(B) The fewer the hours in front of a TV, the higher a student's SAT scores.

(C) There is little relationship between weekly hours of television watched and SAT scores.

(D) There is strong negative association between weekly hours of television watched and SAT

scores, but it would be wrong to conclude causation.

(E) A mistake in the arithmetic has been made.

12. A scatterplot of a company's revenues versus time indicates a possible exponential relationship. A linear regression on y = log(revenue in $1,000) against x = years since 2005 gives ŷ = 0.75 + 0.63x with r = .68. Which of the following is a valid conclusion?

(A) On the average, revenue goes up 0.63 thousand dollars per year.

(B) The predicted revenue for year 2009 is approximately 1,862 thousand dollars.

(C) Forty-six percent of the variation in revenue can be explained by variation in time.

(D) Sixty-eight percent of the variation in revenue can be explained by variation in time.

(E) None of the above are valid conclusions.
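When a line is fitted to log(y), a prediction has to be transformed back to the original units. The sketch below illustrates that step for the fitted line in question 12, assuming "log" means base 10 (the question does not state the base); the function name is illustrative.

```python
def predicted_log_revenue(years_since_2005):
    """Fitted line from question 12: log(revenue in $1,000) against years since 2005."""
    return 0.75 + 0.63 * years_since_2005


# Year 2009 corresponds to x = 4.
log_rev_2009 = predicted_log_revenue(2009 - 2005)

# Assumption: base-10 logarithm, so undo it with a power of 10.
revenue_2009 = 10 ** log_rev_2009  # in thousands of dollars

print(f"Predicted log(revenue) for 2009: {log_rev_2009:.2f}")
print(f"Predicted revenue for 2009: {revenue_2009:,.0f} thousand dollars")
```

Note that r and R² here describe the fit of log(revenue) against time, not revenue itself.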

13. Which of the following are possible residual plots?

(A) I only

(B) II only

(C) III only

(D) I and II

(E) I, II, and III

[Figure: three candidate residual plots labeled I, II, and III]

14. Suppose a study finds that the correlation coefficient relating job satisfaction to salary is r = +1.

Which of the following is a proper conclusion?

(A) High salary causes high job satisfaction

(B) Low salary causes low job satisfaction

(C) There is a 100% cause-and-effect relationship between salary and job satisfaction.

(D) There is a very strong association between salary and job satisfaction.

(E) None of the above are proper conclusions.

15. Which of the following statements about the correlation r is true?

(A) When r = 0, there is no relationship between the variables.

(B) When r = .2, 20 percent of the variables are closely related.

(C) When r = 1, there is a perfect cause-and-effect relationship between the variables.

(D) A correlation close to 1 means that a linear model will give the best fit to the data.

(E) All the statements are false.

16. Consider the following three scatterplots:

Which has the greatest correlation coefficient r?

(A) I

(B) II

(C) III

(D) They have the same correlation coefficient.

(E) The question cannot be answered without additional information.

17. Suppose the correlation between two variables is r = .28. What will the new correlation be if .17 is added to all values of the x-variable, every value of the y-variable is doubled, and the two variables are interchanged?

(A) .28

(B) .45

(C) .56

(D) .90

(E) -.28

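[Figure for question 16: three scatterplots. I: horizontal axis 5 to 8, vertical axis 20 to 40. II: horizontal axis 6 to 9, vertical axis 30 to 50. III: horizontal axis 10 to 16, vertical axis 40 to 80.]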

18. The number of students taking AP Statistics at a high school during the years 2000-2007 is fitted with a least squares regression line. The graph of the residuals and some computer output are as follows.

How many students took AP Statistics in the year 2003?

(A) 47

(B) 48

(C) 52

(D) 53

(E) 58

19. Consider the scatterplot of midterm and final exam

scores for a class of 20 students.

Which of the following is incorrect?

(A) Both the correlation and slope are negative.

(B) The same number of students scored 100 on the midterm as did on the final.

(C) The same percentage of students scored below 50 on each exam.

(D) Most students who scored below 50 on the midterm also scored below 50 on the final.

(E) Some students scored higher on the final exam than they did on the midterm.

20. If the standard deviation of a set of observations is 0, you can conclude

(A) That there is no relationship between the observations.

(B) That the average value is 0.

(C) That all observations are the same value

(D) That a mistake in the arithmetic has been made.

(E) None of the above.

21. Which of the following statements about the correlation r is incorrect?

(A) The correlation and the slope of the regression line always have the same sign.

(B) A correlation of -.32 and a correlation of +.32 show the same degree of clustering around the regression line.

(C) Correlation r measures the strength and direction only of linear association.

(D) A correlation of .78 indicates a relationship that is 3 times as linear as one for which the correlation is .26.

(E) Outliers can greatly affect the value of r.

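[Figure for question 18: residual plot with years 0 through 7 on the horizontal axis and residuals from -15 to 5 on the vertical axis]

Dependent variable is: Students

Variable   Coeff     s.e.    t      p
Constant   11        6.299   1.75   0.1313
Years      13.9826   1.506   9.25   0.0001

S = 9.758   R-sq = 93.4%   R-sq(adj) = 92.4%

[Figure for question 19: scatterplot of Final Exam Score (vertical axis, ticks at 50 and 100) against Midterm Exam Score (horizontal axis, ticks at 50 and 100)]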

22. If every man married a woman who was exactly 3 years younger than he, what would be the

correlation between the ages of married men and women?

(A) Somewhat negative

(B) 0

(C) Somewhat positive

(D) Nearly 1

(E) 1

23. Which of the following statements about residuals is incorrect?

(A) The mean of the residuals is always zero.

(B) The sum of the residuals is always zero.

(C) The regression line for a residual plot is a horizontal line.

(D) The standard deviation of the residuals gives a measure of how the points in the scatterplot are spread around the regression line.

(E) A residual equals the predicted y minus the observed y.

24. Data are obtained from a random sample of adult women with regard to their ages and their monthly

expenditures on health products.

The resulting regression equation is Predicted Expenditure = 43 + 0.23(Age) with r = .27. What percentage of the variation can be explained by looking at ages?

(A) 0.23 percent

(B) 23 percent

(C) 7.29 percent

(D) 27 percent

(E) 52.0 percent

25. Consider the following 3 scatterplots:

Which of the following is a true statement about the correlations for the three scatterplots?

(A) None are 0.

(B) One is 0, one is negative, and one is positive.

(C) One is 0, and both of the others are negative

(D) Two are 0, and the other is -1.

(E) Two are 0, and the other is close to -1.

26. Data on ages (in years) and prices (in $100) for ten cars of a specific model result in the regression line: Predicted Price = 250 - 30(Age). Given that 64 percent of the variation in price is explained by variation in age, what is the value of the correlation coefficient r?

(A) -.64

(B) -.80

(C) .64

(D) .80

(E) There is insufficient information to answer this question.

27. Which of the following is a true statement about the correlation coefficient r?

(A) A correlation of .3 means that 30 percent of the points are highly correlated

(B) The square of the correlation measures the proportion of the y-variance that is predictable

from a knowledge of x.

(C) Perfect correlation, that is, when the points lie exactly on a straight line, results in r = 0.

(D) Multiplying every y-value by -1 leaves the correlation unchanged.

(E) The unit of measure for correlation is the y-unit per x-unit.

28. A study is conducted relating AP Statistics exam scores to the total number of study hours for the

AP Statistics class put in by students during the academic year, and the correlation is found to be

.6. Which of the following is a true statement?

(A) On average, a 40 percent increase in study time results in a 24 percent increase in exam score.

(B) On average, a 60 percent increase in study time results in a 100 percent increase in exam score.

(C) Sixty percent of a student's exam score can be explained by the number of study hours.

(D) Sixty percent of the variation in exam scores can be accounted for by this linear regression

model.

(E) Higher exam scores tend to be associated with higher numbers of study hours.

29. Which of the following scatterplots could have resulted in this residual plot? (The y-axis scales are not the same in the scatterplots as in the residual plot.)

[Figure: a residual plot (vertical axis: Residuals) and four candidate scatterplots labeled A, B, C, and D]

(E) None of these scatterplots could result in the given residual plot.

30. Consider the three points (4, 33), (5, 27), and (6, 15). Given any straight line, we can calculate the

sum of the squares of the three vertical distances from these points to the line. What is the

smallest possible value this sum can be?

(A) 2.45

(B) 6

(C) 8.66

(D) 36

(E) None of these values
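The smallest possible sum of squared vertical distances is, by definition, the one achieved by the least squares line. A minimal sketch of that calculation for the three points in question 30 follows; the variable names are illustrative.

```python
points = [(4, 33), (5, 27), (6, 15)]

n = len(points)
x_bar = sum(x for x, _ in points) / n
y_bar = sum(y for _, y in points) / n

sxx = sum((x - x_bar) ** 2 for x, _ in points)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual = observed y minus predicted y; the least squares line minimizes
# the sum of the squared residuals over all possible straight lines.
residuals = [y - (intercept + slope * x) for x, y in points]
sse = sum(e ** 2 for e in residuals)

print(f"line: y-hat = {intercept:.1f} + ({slope:.1f})x")
print(f"residuals: {residuals}")
print(f"sum of squared residuals: {sse:.2f}")
```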

31. A study is made relating Power (in cold cranking amps) of auto batteries as a function of Price (in

dollars). Data from a sample of 13 batteries generates the following computer output:

Dependent Variable is: Power

s = 74.29   R-sq = 68.3%   R-sq(adj) = 65.5%

Source       Sum of Squares   df   Mean Square
Regression   131088           1    131088
Residual     60704.5          11   5518.59

Variable   Coefficient   s.e. Coeff   t-ratio   P
Constant   410.997       67.79        6.06      0.000
Price      5.52979       1.135        4.87      0.000

Which of the following gives a 90 percent confidence interval for the slope of the regression line?

(A) 410.997 ± 1.771(67.79)

(B) 410.997 ± 1.796(67.79)

(C) 5.52979 ± 1.645(74.29)

(D) 5.52979 ± 1.771(1.135)

(E) 5.52979 ± 1.796(1.135)

32. Suppose the correlation is negative. Given two points from the scatterplot, which of the following is

possible?

I. The first point has a larger x-value and a smaller y-value than the second point.

II. The first point has a larger x-value and a larger y-value than the second point.

III. The first point has a smaller x-value and a larger y-value than the second point.

(A) I only

(B) II only

(C) III only

(D) I and III

(E) I, II, and III

33. A regression analysis on the relationship between average annual profits (in millions of dollars) of insurance companies and the yearly number of major natural disasters (hurricanes, tornadoes, earthquakes, etc.) gives the following computer output:

Dependent Variable is: Profits

s = 10.17 R-sq = 46.5% R-sq(adj) = 40.5%

Source df SS MS F

Regression 1 807.309 807.309 7.81

Residual 9 930.327 103.37

Variable Coef s.e. Coeff t P

Constant 420.727 5.735 73.4 0.0001

Disasters -2.70909 0.9694 -2.79 0.0209

If the analysis was rerun using number of disasters as the dependent variable instead of profits, what would be the correlation coefficient?

(A) -.318

(B) .465

(C) -.465

(D) .682

(E) -.682

34. Which is the correct scatterplot for the computer output below?

Variable Coef s.e t p

Constant 10.3001 0.5668 18.2 0.0001

Explanatory -1.73391 0.197 -8.8 0.0001

S = 0.9093

R-sq =90.6%

R-sq(adj) = 89.5%

[Figure: five candidate scatterplots labeled (A) through (E), each with Explanatory on the horizontal axis and Response on the vertical axis. In (A) the horizontal axis runs from 2 to 10 and the vertical axis from 1 to 5; in (B) through (E) the horizontal axis runs from 1 to 5 and the vertical axis from 2 to 10.]

35. A linear regression analysis is performed on two variables. Which of the following tells you that

another model probably gives a better fit?

(A) The correlation r is low.

(B) The mean of the residuals is 0.

(C) The p-value for H0: β = 0 and HA: β > 0 is low.

(D) The coefficient of determination is high.

(E) The residual plot has a pattern.

36. Which is the correct regression output for the scatterplot below?

(A) Variable Coef s.e t p

Constant 5.45333 0.5487 9.94 0.0001

Explanatory -0.451515 0.008844 -5.11 0.0009

S = 0.8033 R-sq = 76.5% R-sq(adj) = 73.6%

(B) Variable Coef s.e t p

Constant -5.45333 0.5487 9.94 0.0001

Explanatory 0.451515 0.008844 -5.11 0.0009

S = 0.8033 R-sq = 76.5% R-sq(adj) = 73.6%

(C) Variable Coef s.e t p

Constant 5.45333 0.5487 9.94 0.0001

Explanatory 0.451515 0.008844 -5.11 0.0009

S = 0.8033 R-sq = 36.5% R-sq(adj) = 33.6%

(D) Variable Coef s.e t p

Constant 5.45333 0.5487 9.94 0.0001

Explanatory -0.451515 0.008844 -5.11 0.0009

S = 0.8033 R-sq = 36.5% R-sq(adj) = 33.6%

(E) Variable Coef s.e t p

Constant 5.45333 0.5487 9.94 0.0001

Explanatory 0.451515 0.008844 -5.11 0.0009

S = 0.8033 R-sq = 76.5% R-sq(adj) = 73.6%

[Figure: scatterplot of Response (vertical axis, 1 to 5) against Explanatory (horizontal axis, 2 to 10)]

37. An automotive insurance company is interested in the association between age of SUVs and odometer reading for their clients. Data from 20 randomly selected clients generates the following computer output:

Dependent Variable is: Mileage

Source df SS MS F

Regression 1 21.5613e9 21.5613e9 87.3

Residual 18 4.44774e9 247.097e6

Variable Coeff s.e. Coeff t P

Constant 9717.95 7208 1.315 0.194

Age 15675.2 1678 9.34 0.000

s = 15720 R-sq = 82.9% R-sq(adj) = 81.9%

Which of the following gives a 96 percent confidence interval for the slope of the regression line?

(A) 9,718 ± 2.1054(7,208)

(B) 9,718 ± 2.197(7,208)

(C) 9,718 ± 2.214(7,208/√20)

(D) 15,675 ± 2.197(15,720/√20)

(E) 15,675 ± 2.214(1,678)

38. A theater owner believes that attendance actually goes up if the best seats are advertised for

higher prices. Below is the computer printout for the regression analysis of Attendance (in 1,000s)

versus Price (in dollars) of the best seats.

Dependent Variable is: Attendance

Source SS df MS F

Regression 105.226 1 105.226 6.15

Residual 102.649 6 17.1082

Variable Coef s.e. Coeff t P

Constant 18.2961 5.182 3.53 0.0124

Price 0.573437 0.2312 2.48 0.0478

s = 4.136 R-sq = 50.6% R-sq(adj) = 42.4%

What is the p-value for a t-test with H0: β = 0 and HA: β > 0?

(A) .0124

(B) .0239

(C) .0478

(D) .0956

(E) .5060

39. Below is the computer output for a regression analysis involving starting salary (in $1,000) and

college GPA

Variable Coef s.e. Coeff t P

Constant -0.73391 5.744 -0.128 0.9015

GPA 11.8204 1.848 5.4 0.0002

s = 3.772 R-sq = 83.6% R-sq(adj) = 81.6%

Source df SS MS F

Regression 1 581.798 581.798 40.9

Residual 8 113.802 14.2252

What is the equation for the least squares regression line?

(A) Predicted GPA = -0.734 + 11.82(salary)

(B) Predicted Salary = -0.734 + 11.82(GPA)

(C) Predicted GPA = 0.9015 + 0.0002(salary)

(D) Predicted Salary = 5.744 + 1.848(GPA)

(E) Predicted Salary = 1.848 + 11.82(GPA)

40. Before Challenger went off at 31°F, each of the 23 earlier launches experienced from zero to three O-ring failures. There was some speculation that the number of O-ring failures was related to the temperature at lift-off. A computer printout, performed too late, is shown below.

Dependent Variable: Failures

s = 0.6673 R-sq = 31.5% R-sq(adj) = 28.2%

Variable Coef s.e. Coeff t P

Constant 4.79365 1.409 3.4 0.0027

Temperature -0.0626587 0.02016 -3.11 0.0052

Source df SS MS F

Regression 1 4.30166 4.30166 9.66

Residual 21 9.35052 0.445263

Is there evidence of a relationship between failure and temperature?

(A) There is no evidence of a relationship between number of O-ring failures and temperature at

lift-off.

(B) There is evidence of a relationship at the .10 level, but not at the .05 level.

(C) There is evidence of a relationship at the .05 level, but not at the .01 level.

(D) There is evidence of a relationship at the .01 level but not at the .001 level.

(E) There is evidence of a relationship at the .001 level.

41. A linear regression analysis relating secretaries' salaries to years of experience yields: ŷ = 19.78 + 2.405x, where x is years of experience and y is salary (in $1,000). Which of the following is the most proper conclusion?

(A) A starting secretary will earn $19,780, while one with 70 years of experience should earn $188,130.

(B) Starting secretaries average $19,780 with bonuses of $2,405 every year.

(C) There is a cause-and-effect relationship between secretaries' salaries and experience, with each extra year of experience corresponding to an extra $2,405 in salary.

(D) Starting salaries for secretaries average $19,780, and each year of experience is associated with an average extra $2,405.

(E) There is a high correlation between secretaries' salaries and years of experience.

42. Data from an SRS of property owners is cross-classified by gender and support for a new educational

initiative giving the following table:

Male Female

For 35 45

Against 55 50

Is there evidence of a relationship between gender and support for the initiative among property

holders?

(A) There is strong evidence of a relationship between gender and support for the initiative.

(B) There is weak evidence of a relationship between gender and support for the initiative.

(C) There is no evidence of a relationship between gender and support for the initiative.

(D) Further information is needed to be able to perform a chi-square test of independence.

(E) The test is inconclusive.

43. Three professors are interviewed as to a sampling of their grades, and the following table gives the resulting counts.

Prof. A Prof. B Prof. C

Grades A, B 3 8 12

Grades C 15 9 8

Grades D, F 2 3 4

A statistics student runs a chi-square test of homogeneity. What is the most proper conclusion?

(A) There is no evidence of a relationship between these professors and grades.

(B) There is evidence at the 10 percent level, but not at the 5 percent level, that the professors give different grade distributions.

(C) There is evidence at the 5 percent level, but not at the 1 percent level, that the professors give

different grade distributions.

(D) There is evidence at the 1 percent level that the professors give different grade distributions.

(E) A chi-square test of homogeneity is not appropriate.

44. A highway superintendent states that four bridges into a city are used in the ratio 2:3:3:4 during the morning rush hour. A highway study of an SRS of 6,000 cars indicates that 920, 1,570, 1,480, and 2,030 cars use the four bridges, respectively. Can the superintendent's claim be rejected at the 1 or 5 percent level of significance?

(A) There is sufficient evidence to reject the claim at either of these two levels.

(B) There is sufficient evidence to reject the claim at the 1 percent level, but not at the 5 percent

level.

(C) There is sufficient evidence to reject the claim at the 5 percent level, but not at the 1 percent

level.

(D) There is not sufficient evidence to reject the claim at either of these two levels.

(E) There is not sufficient information to answer this question.

45. A study of accidents at a large factory reported the following numbers by shift:

Shift Morning Afternoon Night

Accidents 35 77 53

Is there sufficient evidence to say that the numbers of accidents on the three shifts are not the

same?

(A) There is sufficient evidence at the .001 significance level that the number of accidents on each

shift are not the same.

(B) There is sufficient evidence at the .01 level, but not at the .001 level, that the number of

accidents on each shift are not the same.

(C) There is sufficient evidence at the .05 level, but not at the .01 level, that the number of

accidents on each shift are not the same.

(D) There is sufficient evidence at the .10 level, but not at the .05 level, that the number of

accidents on each shift are not the same.

(E) There is not sufficient evidence to say that the number of accidents on each shift are not the

same.
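For question 45, the null hypothesis of equal accident counts gives each shift the same expected count. The sketch below computes the goodness-of-fit statistic with only the standard library. It relies on the fact that for df = 2 specifically, the chi-square upper-tail probability is exactly exp(-x/2); that shortcut does not apply to other degrees of freedom. Names are illustrative.

```python
import math

observed = {"Morning": 35, "Afternoon": 77, "Night": 53}

total = sum(observed.values())
expected = total / len(observed)   # equal counts under H0

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((count - expected) ** 2 / expected
                 for count in observed.values())
df = len(observed) - 1

# Valid only because df = 2: P(chi-square > x) = exp(-x / 2).
p_value = math.exp(-chi_square / 2)

print(f"expected count per shift: {expected:.1f}")
print(f"chi-square = {chi_square:.3f}, df = {df}, p-value = {p_value:.5f}")
```

The p-value can then be compared against the significance levels listed in the answer choices.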

46. To compare prices at a grocery store in the suburbs with one in the city, a housewife picks ten basic

items and checks the prices of these items at each store. Which test should she use to determine if

the prices are different at the two stores?

(A) χ² test for goodness-of-fit

(B) χ² test for independence

(C) Two sample Z-test

(D) Two sample t-test

(E) Matched pairs t-test

47. Last year the audience percentages captured by major news programs were as follows: (ABC, NBC, CBS) 36 percent, CNN 42 percent, FOX 22 percent. In a random sample of 50 viewers this year, 23 watched (ABC, NBC, CBS), 17 watched CNN, and 10 watched FOX. If a goodness-of-fit test were performed, what would be the p-value?

(A) $P\left(\chi^2 > \frac{(23-18)^2}{23} + \frac{(17-21)^2}{17} + \frac{(10-11)^2}{10}\right)$ with df = 2

(B) $P\left(\chi^2 > \frac{(23-18)^2}{23} + \frac{(17-21)^2}{17} + \frac{(10-11)^2}{10}\right)$ with df = 3

(C) $P\left(\chi^2 > \frac{(23-18)^2}{18} + \frac{(17-21)^2}{21} + \frac{(10-11)^2}{11}\right)$ with df = 2

(D) $P\left(\chi^2 > \frac{(23-18)^2}{18} + \frac{(17-21)^2}{21} + \frac{(10-11)^2}{11}\right)$ with df = 3

(E) $P\left(\chi^2 > \frac{(23-18)^2}{16.7} + \frac{(17-21)^2}{16.7} + \frac{(10-11)^2}{16.7}\right)$ with df = 49

48. A geneticist claims that four species of fruit flies should appear in the ratio 1:3:3:9. Suppose that a sample of 480 flies contained 25, 92, 68, and 295 flies of each species, respectively. Does a chi-square test show sufficient evidence to reject the geneticist's claim?

(A) The test proves the geneticist's claim.

(B) The test proves the geneticist's claim is false.

(C) The test does not give sufficient evidence to reject the geneticist's claim.

(D) The test gives sufficient evidence to reject the geneticist's claim.

(E) The test is inconclusive.

49. The table below shows the number of students referred for disciplinary reasons to the principal's office, broken down by the day of the week. A counselor would like to know if such referrals are related to the day of the week. What is the value of chi-square for the appropriate test?

Monday Tuesday Wednesday Thursday Friday

12 5 9 4 15

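(A) $\frac{(12-9)^2}{12} + \frac{(5-9)^2}{5} + \frac{(9-9)^2}{9} + \frac{(4-9)^2}{4} + \frac{(15-9)^2}{15}$

(B) $\frac{(12-9)^2}{9} + \frac{(5-9)^2}{9} + \frac{(9-9)^2}{9} + \frac{(4-9)^2}{9} + \frac{(15-9)^2}{9}$

(C) $\frac{(12-9)^2}{45} + \frac{(5-9)^2}{45} + \frac{(9-9)^2}{45} + \frac{(4-9)^2}{45} + \frac{(15-9)^2}{45}$

(D) $\frac{(12-5)^2}{12} + \frac{(5-5)^2}{5} + \frac{(9-5)^2}{9} + \frac{(4-5)^2}{4} + \frac{(15-5)^2}{15}$

(E) $\frac{(12-5)^2}{5} + \frac{(5-5)^2}{5} + \frac{(9-5)^2}{5} + \frac{(4-5)^2}{5} + \frac{(15-5)^2}{5}$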

50. Which of the following is the proper use of a chi-square test of independence?

(A) To test whether the distribution of counts on a categorical variable matches a claimed

distribution.

(B) To test whether the distribution of counts on a numerical variable matches a claimed distribution.

(C) To test whether the distribution of two different groups on the same categorical variable matches.

(D) To test whether two categorical variables on the same subjects are related.

(E) To test whether two numerical variables on the same subject are related.

51. A small appliance manufacturer sets up three locations to provide service for its customers. Logs are

kept noting whether or not calls about the problems are solved successfully. Data from a sample of

500 calls are summarized in the following table:

Location 1 Location 2 Location 3

Problem solved 124 98 103

Problem not solved 55 63 57

Assuming there is no association between the location and whether or not a problem is resolved

successfully, that is H0: independence, what is the expected number of successful calls (problem

solved) from location 2?

(A) $\frac{(161)(98)}{325}$

(B) $\frac{(325)(98)}{161}$

(C) $\frac{(161)(325)}{98}$

(D) $\frac{(161)(325)}{486}$

(E) $\frac{(161)(325)}{500}$
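Under H0: independence, each expected count is (row total)(column total) / (grand total). The sketch below applies that rule to the service-call table from question 51; the list layout and variable names are illustrative.

```python
# Rows: problem solved, problem not solved. Columns: locations 1, 2, 3.
observed = [
    [124, 98, 103],
    [55, 63, 57],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

print(f"row totals: {row_totals}")
print(f"column totals: {col_totals}")
print(f"expected solved calls at location 2: {expected[0][1]:.2f}")
```

The same `expected` table is what a chi-square test of independence would compare against the observed counts.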

52. In the following table, what value of n results in a table showing perfect independence?

40 60

50 n

(A) 30

(B) 50

(C) 70

(D) 75

(E) 100

53. Two commercial flights per day are made from a small county airport. The airport manager tabulates

the number of on-time departures for a sample of 200 days.

Number of on-time departures 0 1 2

Observed number of days 12 75 113

What is the 𝝌𝟐 statistic for a goodness-of-fit test that the distribution is binomial with probability

equal to 0.8 that a flight leaves on time?

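(A) $\frac{(12-8)^2}{8} + \frac{(75-64)^2}{64} + \frac{(113-128)^2}{128}$

(B) $\frac{(12-8)^2}{12} + \frac{(75-64)^2}{75} + \frac{(113-128)^2}{113}$

(C) $\frac{(12-10)^2}{10} + \frac{(75-30)^2}{30} + \frac{(113-160)^2}{160}$

(D) $\frac{(12-10)^2}{10} + \frac{(75-30)^2}{75} + \frac{(113-160)^2}{113}$

(E) $\frac{(12-66)^2}{12} + \frac{(75-67)^2}{75} + \frac{(113-67)^2}{113}$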
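For a goodness-of-fit test against a binomial model such as the one in Question 53, the expected counts come from the binomial probabilities multiplied by the number of days. The sketch below assumes two independent flights per day, each on time with probability 0.8; the function names are illustrative only.

```python
from math import comb


def binomial_expected(n_trials, p, n_obs):
    """Expected counts for 0..n_trials successes over n_obs observations."""
    return [n_obs * comb(n_trials, k) * p**k * (1 - p)**(n_trials - k)
            for k in range(n_trials + 1)]


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed_days = [12, 75, 113]                    # days with 0, 1, 2 on-time departures
expected_days = binomial_expected(2, 0.8, 200)   # 200 sampled days
chi2 = chi_square_statistic(observed_days, expected_days)
print(expected_days, chi2)
```

Note that each squared difference is divided by the expected count, not the observed count.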

54. Which of the following is not true with regard to contingency tables for chi-square tests for

independence?

(A) Categorical rather than quantitative variables are being considered.

(B) Observed frequencies should be whole numbers.

(C) Expected frequencies should be whole numbers.

(D) Expected frequencies in each cell should be at least 5, and to achieve this, one sometimes

combines categories for one or the other or both of the variables.

(E) The expected frequency for any cell can be found by multiplying the row total by the column

total and dividing by the sample size.

55. Is preference between vanilla and chocolate ice cream independent of age? 210 interviews yielded

the following numbers:

Age 10-19 20-29 30-39 40-49

Prefer vanilla 22 31 21 40

Prefer chocolate 31 28 24 13

What is a reasonable conclusion?

(A) There is a very strong evidence of a relationship between age and ice cream preference.

(B) There is weak evidence of a relationship between age and ice cream preference.

(C) There is no evidence of a relationship between age and ice cream preference.

(D) Further information is needed to be able to perform a chi-square test of independence.

(E) The test is inconclusive.
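One way to check a conclusion like the one asked for in Question 55 is to compute the chi-square statistic and its P-value directly. The sketch below uses only the standard library; `chi_square_sf_df3` is a hypothetical helper that relies on the closed-form upper-tail probability for exactly 3 degrees of freedom, which is what a 2 x 4 table gives.

```python
from math import erfc, exp, pi, sqrt


def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp_count = row_totals[i] * col_totals[j] / n
            chi2 += (obs - exp_count) ** 2 / exp_count
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


def chi_square_sf_df3(x):
    """Upper-tail probability P(X > x) for a chi-square variable with df = 3 only."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


# Rows: prefer vanilla, prefer chocolate; columns: the four age groups.
ice_cream = [[22, 31, 21, 40],
             [31, 28, 24, 13]]

chi2, df = chi_square_independence(ice_cream)
p_value = chi_square_sf_df3(chi2)  # valid here because df == 3
print(chi2, df, p_value)
```

A small P-value here indicates evidence against independence of age and preference; the strength of that evidence is what the answer choices distinguish.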

56. A survey of 200 high school seniors is conducted to see if there is a relationship between whether or

not a student is taking AP Statistics and whether the student plans to attend a public or private

college. The data is summarized in the following table:

Public Private

Taking AP Stat 33 77

Not taking AP Stat 47 43

What is the test statistic for a hypothesis test with H0: independence?

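(A) $\frac{(33-44)^2}{44} + \frac{(77-66)^2}{66} + \frac{(47-36)^2}{36} + \frac{(43-54)^2}{54}$

(B) $\frac{(33-44)^2}{33} + \frac{(77-66)^2}{77} + \frac{(47-36)^2}{47} + \frac{(43-54)^2}{43}$

(C) $\frac{(33-55)^2}{55} + \frac{(77-55)^2}{55} + \frac{(47-45)^2}{45} + \frac{(43-45)^2}{45}$

(D) $\frac{(33-40)^2}{40} + \frac{(77-60)^2}{60} + \frac{(47-40)^2}{40} + \frac{(43-60)^2}{60}$

(E) $\frac{(33-50)^2}{33} + \frac{(77-50)^2}{77} + \frac{(47-50)^2}{47} + \frac{(43-50)^2}{43}$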

57. It is hypothesized that scores on a certain intelligence test are normally distributed with a mean of

100 and a standard deviation of 10. A psychologist runs a goodness-of-fit hypothesis test on an SRS of

100 scores resulting in the table below. What is the $\chi^2$ statistic for this test?

Score: Below 90 90-100 100-110 Above 110

Number of people: 10 40 35 15

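(A) $\frac{(10-16)^2}{10} + \frac{(40-34)^2}{40} + \frac{(35-34)^2}{35} + \frac{(15-16)^2}{15}$

(B) $\frac{(10-16)^2}{16} + \frac{(40-34)^2}{34} + \frac{(35-34)^2}{34} + \frac{(15-16)^2}{16}$

(C) $\frac{(10-25)^2}{25} + \frac{(40-25)^2}{25} + \frac{(35-25)^2}{25} + \frac{(15-25)^2}{25}$

(D) $\frac{(10-25)^2}{10} + \frac{(40-25)^2}{40} + \frac{(35-25)^2}{35} + \frac{(15-25)^2}{15}$

(E) $\frac{(10-25)^2}{16} + \frac{(40-25)^2}{34} + \frac{(35-25)^2}{34} + \frac{(15-25)^2}{16}$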
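In Question 57 the expected counts come from the hypothesized normal model rather than from a table. A minimal sketch, assuming the class boundaries 90, 100, and 110 and rounding expected counts to whole numbers as the answer choices do:

```python
from math import erf, sqrt


def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution with the given mean and sd."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))


n = 100                              # SRS of 100 scores
observed = [10, 40, 35, 15]          # below 90, 90-100, 100-110, above 110
cutoffs = [90, 100, 110]

# Cumulative probabilities at the class boundaries, padded with 0 and 1.
cdf_values = [0.0] + [normal_cdf(c, 100, 10) for c in cutoffs] + [1.0]
expected = [n * (hi - lo) for lo, hi in zip(cdf_values, cdf_values[1:])]
rounded_expected = [round(e) for e in expected]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, rounded_expected))
print(rounded_expected, chi2)
```

The unrounded expected counts are about 15.87 and 34.13, which is the 68-95-99.7 rule applied to one standard deviation on either side of the mean.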

FRQ Review Linear Regression & Chi-Square

Due: Thursday, March 5th

1. 2001 Number 6 The statistics department at a large university is trying to determine if it is

possible to predict whether an applicant will successfully complete the Ph.D. program or will leave

before completing the program. The department is considering whether GPA (grade point average) in

undergraduate statistics and mathematics courses (a measure of performance) and mean number of

credit hours per semester (a measure of workload) would be helpful measures. To gather data, a

random sample of 20 entering students from the past 5 years is taken. The data are given below.

Successfully Completed Ph.D. Program

Student A B C D E F G H I J K L M

GPA 3.8 3.5 4.0 3.9 2.9 3.5 3.5 4.0 3.9 3.0 3.4 3.7 3.6

Credit hours 12.7 13.1 12.5 13.0 15.0 14.7 14.5 12.0 13.1 15.3 14.6 12.5 14.0

Did Not Complete Ph.D. Program

Student N O P Q R S T

GPA 3.6 2.9 3.1 3.5 3.9 3.6 3.3

Credit hours 11.1 14.5 14.0 10.9 11.5 12.1 12.0

The regression output at the top of the next page resulted from fitting a line to the data in each

group. The residual plot (not shown) indicated no unusual patterns, and the assumptions necessary

for inference were judged to be reasonable.

Successfully Completed Ph.D. Program

Predictor     Coef       StDev     T        P
Constant      23.514     1.684     13.95    0.000
GPA           -2.7555    0.4668    -5.90    0.000

S = 0.5658    R-Sq = 76.0%

Did Not Complete Ph.D. Program

Predictor     Coef       StDev     T        P
Constant      24.200     3.474     6.97     0.001
GPA           -3.485     1.013     -3.44    0.018

S = 0.8408    R-Sq = 70.3%

(a) Use an appropriate graphical display to compare the GPAs for the two groups. Write a few

sentences commenting on your display.

(b) For the students who successfully completed the Ph.D. program, is there a significant

relationship between GPA and mean number of credit hours per semester?

Give a statistical justification to support your response.

(c) If a new applicant has a GPA of 3.5 and a mean number of credit hours per semester of 14.0,

do you think this applicant will successfully complete the Ph.D. program? Give a statistical

justification to support your response.
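The T column in the regression output for Question 1 is the slope estimate divided by its standard error, and the matching t distribution has n - 2 degrees of freedom. The short sketch below reproduces those values from the printed output; the function name is an assumption made for illustration.

```python
def slope_t_statistic(coef, se, null_value=0.0):
    """t statistic for testing H0: slope = null_value."""
    return (coef - null_value) / se


# Students who completed the program: 13 students (A through M).
t_completed = slope_t_statistic(-2.7555, 0.4668)
df_completed = 13 - 2

# Students who did not complete the program: 7 students (N through T).
t_not_completed = slope_t_statistic(-3.485, 1.013)
df_not_completed = 7 - 2

print(round(t_completed, 2), df_completed)
print(round(t_not_completed, 2), df_not_completed)
```

Both results agree with the T column of the output, so the reported P-values can be used directly in the justification for part (b).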

2. 2003B Question 1 A simple random sample of 9 students was selected from a large university. Each

of these students reported the number of hours he or she had allocated to studying and the number

of hours allocated to work each week. A least squares linear regression was performed and part of

the resulting computer output is shown below.

Predictor Coef StDev T P

Constant 8.107 2.731 2.97 0.021

Work 0.4919 0.1950 2.52 0.040

S = 4.349 R-Sq = 47.6% R-Sq (adj) = 40.1%

The scatterplot below displays the data that were collected from the 9 students.

(a) After point P, labeled on the graph on the previous page, was removed from the data, a second

linear regression was performed and the computer output is shown below.

Predictor Coef StDev T P

Constant 11.123 3.986 2.79 0.032

Work 0.1500 0.3834 0.39 0.709

S = 4.327 R-Sq = 2.5% R-Sq (adj) = 0.0%

Does point P exercise a large influence on the regression line? Explain.

(b) The researcher who conducted the study discovered that the number of hours spent studying

reported by the student represented by P was recorded incorrectly. The corrected data point for

this student is represented by the letter Q in the scatterplot below

Explain how the least squares regression line for the corrected data (in this part) would differ from the

least squares regression line for the original data.

3. 2006 Number 2 A manufacturer of dish detergent believes the height of soapsuds in the dishpan

depends on the amount of detergent used. A study of the suds' heights for a new dish detergent was

conducted. Seven pans of water were prepared. All pans were of the same size and type and

contained the same amount of water. The temperature of the water was the same for each pan. An

amount of dish detergent was assigned at random to each pan, and that amount of detergent was

added to the pan. Then the water in the dishpan was agitated for a set amount of time, and the

height of the resulting suds was measured.

A plot of the data and the computer output from fitting a least squares regression line to the data

are shown below.

(a) Write the equation of the fitted regression line. Define any variables used in this equation.

(b) Note that s = 1.99821 in the computer output. Interpret this value in the context of this study.

(c) Identify and interpret the standard error of the slope.

4. 2002B Question 1 Animal-waste lagoons and spray fields near aquatic environments may significantly

degrade water quality and endanger health. The National Atmosphere Deposition Program has

monitored the atmospheric ammonia at swine farms since 1978. The data on the swine population

size (in thousands) and atmospheric ammonia (in parts per million) for one decade are given below.

Year 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997

Swine Population 0.38 0.50 0.60 0.75 0.95 1.20 1.40 1.65 1.80 1.85

Atmospheric Ammonia 0.13 0.21 0.29 0.22 0.19 0.26 0.36 0.37 0.33 0.38

a) Construct a scatterplot for these data.

b) The value for the correlation coefficient for these data is 0.85. Interpret this value.

c) Based on the scatterplot in part (a) and the value of the correlation coefficient in part (b), does

it appear that the amount of atmospheric ammonia is linearly related to the swine population size?

Explain.

d) What percent of the variability in atmospheric ammonia can be explained by swine population

size?
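The correlation quoted in part (b) of Question 4 and the percent of variability asked for in part (d) can both be checked from the table. A minimal sketch using the ten data pairs above; the function name `correlation` is illustrative.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


swine = [0.38, 0.50, 0.60, 0.75, 0.95, 1.20, 1.40, 1.65, 1.80, 1.85]
ammonia = [0.13, 0.21, 0.29, 0.22, 0.19, 0.26, 0.36, 0.37, 0.33, 0.38]

r = correlation(swine, ammonia)
r_squared = r ** 2   # proportion of variability in ammonia explained by the line
print(round(r, 2), round(r_squared, 2))
```

Squaring r gives the coefficient of determination, which is the quantity part (d) asks about.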

5. 2011 Number 5 Windmills generate electricity by transferring energy from wind to a turbine. A

study was conducted to examine the relationship between wind velocity in miles per hour (mph) and

electricity production in amperes for one particular windmill. For the windmill, measurements were

taken on twenty-five randomly selected days, and the computer output for the regression analysis

for predicting electricity production based on wind velocity is given below. The regression model

assumptions were checked and determined to be reasonable over the interval of wind speeds

represented in the data, which were from 10 miles per hour to 40 miles per hour.

(a) Use the computer output above to determine the equation of the least squares regression line.

Identify all variables used in the equation.

(b) How much more electricity would the windmill be expected to produce on a day when the wind

velocity is 25 mph than on a day when the wind velocity is 15 mph? Show how you arrived at your

answer.

(c) What proportion of the variation in electricity production is explained by its linear relationship with

wind velocity?

(d) Is there statistically convincing evidence that electricity production by the windmill is related to

wind velocity? Explain.

[Grid for the scatterplot in Question 4, part (a): horizontal axis Swine Population, 0.0 to 1.8; vertical axis Atmospheric Ammonia, .05 to .40]

(Computer output for Question 5)

Predictor        Coef     SE Coef   T        P
Constant         0.137    0.126     1.09     0.289
Wind Velocity    0.240    0.019     12.63    0.000

S = 0.237    R-Sq = 0.873    R-Sq (adj) = 0.868
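For part (b) of Question 5, the difference between two predictions depends only on the slope, because the intercept cancels. A small sketch using the coefficients from the wind-velocity output; the constant and function names are assumptions for illustration.

```python
INTERCEPT = 0.137   # "Constant" row of the output
SLOPE = 0.240       # "Wind Velocity" row of the output


def predicted_amperes(wind_mph):
    """Predicted electricity production (amperes) at a given wind velocity (mph)."""
    return INTERCEPT + SLOPE * wind_mph


difference = predicted_amperes(25) - predicted_amperes(15)
print(difference)   # equals SLOPE * (25 - 15) up to floating-point rounding
```

Both 15 mph and 25 mph lie inside the 10 to 40 mph range over which the model assumptions were checked, so neither prediction is an extrapolation.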


6. 2005B Number 5 John believes that as he increases his walking speed, his pulse rate will increase.

He wants to model this relationship. John records his pulse rate, in beats per minute (bpm), while

walking at each of seven different speeds, in miles per hour (mph). A scatterplot and regression

output are shown below.

(a) Using the regression output, write the equation of the fitted regression line.

(b) Do your estimates of the slope and intercept parameters have meaningful interpretations in the

context of this question? If so, provide interpretations in this context. If not, explain why not.

(c) John wants to provide a 98 percent confidence interval for the slope parameter in his final

report. Compute the margin of error that John should use. Assume that conditions for

inference are satisfied.

7. 2009 Question 1 A simple random sample of 100 high school seniors was selected from a large

school district. The gender of each student was recorded, and each student was asked the following

questions.

a. Have you ever had a part-time job?

b. If you answered yes to the previous question, was your part-time job in the summer only?

The responses are summarized in the table below.

                                                        Gender
Job Experience                                     Male   Female   Total
Never had a part-time job                           21      31       52
Had a part-time job during summer only              15      13       28
Had a part-time job but not only during summer      12       8       20
Total                                               48      52      100

(a) On the grid below, construct a graphical display that represents the association between gender

and job experience for the students in the sample.

(b) Write a few sentences summarizing what the display in part (a) reveals about the association

between gender and job experience for the students in the sample.

(c) Which test of significance should be used to test if there is an association between gender and

job experience for the population of high school seniors in the district?

State the null and alternative hypothesis for the test, but do not perform the test.

8. 2003 Question 5 A random sample of 200 students was selected from a large college in the United

States. Each selected student was asked to give his or her opinion about the following statement.

"The most important quality of a person who aspires to be the President

of the United States is a knowledge of foreign affairs."

Each response was recorded in one of five categories. The gender of each selected student was

noted. The data are summarized in the table below:

Is there sufficient evidence to indicate that the response is dependent on gender? Provide

statistical evidence to support your conclusion.

                            Response Category
          Strongly   Somewhat   Neither Agree   Somewhat   Strongly
          Disagree   Disagree   nor Disagree    Agree      Agree
Male         10         15           15            25         25
Female       20         25           25            25         15

9. 1999 Question 2 The Colorado Rocky Mountain Rescue Service wishes to study the behavior of lost

hikers. If more were known about the direction in which lost hikers tend to walk, then more

effective search strategies could be devised. Two hundred hikers selected at random from those

applying for hiking permits are asked whether they would head uphill, downhill, or remain in the

same place if they became lost while hiking. Each hiker in the sample was also classified according

to whether he or she was an experienced or novice hiker. The resulting data are summarized in the

following table.

Do these data provide convincing evidence of an association between the level of hiking expertise and

the direction the hiker would head if lost? Give appropriate statistical evidence to support your

conclusion.

10. 2008 Question 5 A study was conducted to determine where moose are found in a region containing

a large burned area. A map of the study area was partitioned into the following four habitat types.

(1) Inside the burned area, not near the edge of the burned area,

(2) Inside the burned area, near the edge,

(3) Outside the burned area, near the edge, and

(4) Outside the burned area, not near the edge.

The figure below shows these four habitat types.

The proportion of total acreage in each of the habitat types was determined for the study area.

Using an aerial survey, moose locations were observed and classified into one of the four habitat

types. The results are given in the table below.

Habitat Type   Proportion of Total Acreage   Number of Moose Observed
1                        0.340                          25
2                        0.101                          22
3                        0.104                          30
4                        0.455                          40
Total                    1.000                         117

(a) The researchers who are conducting the study expect the number of moose observed in a habitat

type to be proportional to the amount of acreage of that type of habitat. Are the data

consistent with this expectation? Conduct an appropriate statistical test to support your

conclusion. Assume the conditions for inference are met.

(b) Relative to the proportion of total acreage, which habitat types did the moose seem to prefer?

Explain.
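For the moose question, the expected count in each habitat type is the total number of moose multiplied by that habitat's share of the acreage. The sketch below computes those expected counts and the chi-square statistic; comparing observed with expected counts also shows which habitat types hold more moose than acreage alone would predict. Variable names are illustrative.

```python
proportions = [0.340, 0.101, 0.104, 0.455]   # share of total acreage, habitat types 1-4
observed = [25, 22, 30, 40]                  # moose observed, habitat types 1-4

n = sum(observed)                            # 117 moose in total
expected = [n * p for p in proportions]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Habitat types where more moose were seen than the acreage share predicts.
over_represented = [i + 1 for i, (o, e) in enumerate(zip(observed, expected)) if o > e]

print([round(e, 2) for e in expected], round(chi2, 2), df, over_represented)
```

With 3 degrees of freedom, the 0.05 critical value from a chi-square table is 7.815, so the statistic can be compared against it directly.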

(Table for Question 9)

                             Direction
               Uphill   Downhill   Remain in Same Place
Novice           20        50               50
Experienced      10        30               40

11. 2004 Question 5 A rural county hospital offers several health services. The hospital

administrators conducted a poll to determine whether the residents' satisfaction with the available

services depends on their gender. A random sample of 1,000 adult county residents was selected.

The gender of each respondent was recorded and each was asked whether he or she was satisfied

with the services offered by the hospital. The resulting data are shown in the table below.

(a) Using a significance level of 0.05, conduct an appropriate test to determine if, for adult residents of

this county, there is an association between gender and whether or not they were satisfied with

services offered by the hospital.

(b) Is $\frac{800}{1{,}000}$ a reasonable estimate for the proportion of all adult county residents who are satisfied

with the services offered by the hospital? Explain why or why not.

12. 2003B Question 5 Contestants on a game show spin a wheel like the one shown in the figure above.

Each of the four outcomes on this wheel is equally likely and outcomes are independent from one

spin to the next.

▪ The contestant spins the wheel.

▪ If the result is a skunk, no money is won and the contestant's turn is finished.

▪ If the result is a number, the corresponding amount in dollars is won. The contestant can then stop with those winnings or can choose to spin again, and his or her turn continues.

▪ If the contestant spins again and the result is a skunk, all of the money earned on that turn is lost and the turn ends.

▪ The contestant may continue adding to his or her winnings until he or she chooses to stop or until a spin results in a skunk.

(a) What is the probability that the result will be a number on all of the first three spins of the

wheel?

(b) Suppose a contestant has earned $800 on his or her first three spins and chooses to spin the

wheel again. What is the expected value of his or her total winnings for the four spins?

(c) A contestant who lost at this game alleges that the wheel is not fair. In order to check on the

fairness of the wheel, the data in the table below were collected for 100 spins of this wheel.

Result Skunk $100 $200 $500

Frequency 33 21 20 26

Based on these data, can you conclude that the four outcomes on this wheel are not equally likely?

Give appropriate statistical evidence to support your answer.
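Parts (a) and (b) of Question 12 are probability calculations that can be checked with exact fractions. The wheel figure is not reproduced here, so the sketch assumes the four equally likely outcomes are the ones listed in the frequency table of part (c): skunk, $100, $200, and $500.

```python
from fractions import Fraction

# Assumed wheel outcomes, taken from the part (c) table; 0 stands for the skunk.
outcomes = [0, 100, 200, 500]
p_each = Fraction(1, 4)          # each outcome is equally likely

# (a) A number on each of the first three independent spins.
p_number = 1 - p_each
p_three_numbers = p_number ** 3

# (b) Holding $800, the contestant spins again: a skunk wipes out the turn,
# any number is added to the $800 already earned.
current = 800
expected_total = sum(p_each * (0 if v == 0 else current + v) for v in outcomes)

print(p_three_numbers, expected_total)
```

Using `Fraction` keeps the probability exact instead of producing a rounded decimal.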

(Table for Question 11)

                 Male   Female   Total
Satisfied         384     416      800
Not Satisfied      80     120      200
Total             464     536    1,000