notes: linear regression tests & intervals
Today is Tuesday February 18th
Notes: Linear Regression
Tests & Intervals
Pgs. 208-211
Homework Notification Assigned Today:
• HWK 7.1 Residuals & Tests
Due: Friday, February 21st
• Death Packet
Due: Monday, March 2nd
• FRQ Packet
Due: Thursday, March 5th
Due Today: CWK 7.1 Residual Practice
Assigned: Wednesday, February 11th
Due Tomorrow: No Assignment is Due
Highly Recommended:
Attend STATs Saturday this Saturday
Name:____________________________ HWK 7.1 Residuals & Tests Period:______
Due Friday, February 21st
Here are advertised horsepower ratings and expected gas mileage for several 2001 vehicles.

Vehicle           Horsepower   Gas Mileage (mpg)
Audi A4           170          22
Buick LeSabre     205          20
Chevy Blazer      190          15
Chevy Prizm       125          31
Ford Excursion    310          10
GMC Yukon         285          13
Honda Civic       127          29
Hyundai Elantra   140          25
Lexus 300         215          21
Lincoln LS        210          23
Mazda MPV         170          18
Olds Alero        140          23
Toyota Camry      194          21
VW Beetle         115          29

Check Values: x̄ = 185.43, ȳ = 21.43, Sx = 58.19, Sy = 6.09

Calculate the following statistics:
β1
β0
r
R²
Regression Equation:
Estimate mpg for 250 hp
Calculate the Chevy Blazer Residual
Calculate the Chevy Prizm Residual
Explain the following in context of the situation.
1. Slope:_______________________________________________________________________
______________________________________________________________________________
2. y-Intercept:_________________________________________________________________
______________________________________________________________________________
3. Correlation Coefficient__________________________________________________________
______________________________________________________________________________
4. Coefficient of Determination_____________________________________________________
______________________________________________________________________________
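The statistics requested for the horsepower data can be cross-checked with a short script. The following is a minimal sketch in plain Python, assuming the table values are transcribed correctly; the helper names (`mean`, `sample_sd`, `predict`, `residual`) are illustrative and not part of the assignment. It uses the relationships b1 = r(Sy/Sx) and b0 = ȳ − b1·x̄, and takes a residual to be observed minus predicted.

```python
import math

# (vehicle, horsepower, mpg) transcribed from the table above
cars = [
    ("Audi A4", 170, 22), ("Buick LeSabre", 205, 20), ("Chevy Blazer", 190, 15),
    ("Chevy Prizm", 125, 31), ("Ford Excursion", 310, 10), ("GMC Yukon", 285, 13),
    ("Honda Civic", 127, 29), ("Hyundai Elantra", 140, 25), ("Lexus 300", 215, 21),
    ("Lincoln LS", 210, 23), ("Mazda MPV", 170, 18), ("Olds Alero", 140, 23),
    ("Toyota Camry", 194, 21), ("VW Beetle", 115, 29),
]
hp = [c[1] for c in cars]
mpg = [c[2] for c in cars]
n = len(cars)

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation (divides by n - 1)
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

x_bar, y_bar = mean(hp), mean(mpg)
s_x, s_y = sample_sd(hp), sample_sd(mpg)

# Correlation coefficient from the sum of cross-products
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(hp, mpg)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x        # slope
b0 = y_bar - b1 * x_bar   # intercept

def predict(horsepower):
    return b0 + b1 * horsepower

def residual(vehicle):
    # residual = observed mpg - predicted mpg
    _, h, m = next(c for c in cars if c[0] == vehicle)
    return m - predict(h)

print(f"x_bar={x_bar:.2f}  y_bar={y_bar:.2f}  s_x={s_x:.2f}  s_y={s_y:.2f}")
print(f"b1={b1:.4f}  b0={b0:.2f}  r={r:.3f}  R^2={r * r:.3f}")
print(f"predicted mpg at 250 hp: {predict(250):.2f}")
print(f"Chevy Blazer residual: {residual('Chevy Blazer'):.2f}")
print(f"Chevy Prizm residual: {residual('Chevy Prizm'):.2f}")
```

The first printed line should agree with the check values given on the worksheet, which is a quick way to catch a data-entry mistake before interpreting the slope.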
New state requirements force students to take a "high stakes" math test
in order to graduate from high school. Faced with the pressure-laden
situation, many students become nervous, which may interfere with their
ability to perform well. Concerned about "test anxiety", a researcher
enlists 24 student volunteers for a study. A psychologist interviews
them before the math test, assessing their anxiety levels on a scale from
1 to 10. The table shows the anxiety levels and the exam scores.

Student Number   Anxiety Level   Test Score
1                10              35
2                8               61
3                7               68
4                7               39
5                6               79
6                2               94
7                3               87
8                5               81
9                6               80
10               8               57
11               8               45
12               10              44
13               10              27
14               9               47
15               7               60
16               6               85
17               4               29
18               6               100
19               6               41
20               10              37
21               3               66
22               9               72
23               2               68
24               8               80

A. Calculate and explain the meaning of R² in the context of this problem.
B. Compute and report the equation of the least squares regression line.
Identify all variables used in the equation.
C. Calculate the residual value for an "anxiety level" of 5.
D. Interpret the meaning of the slope of the regression equation in the context of this problem.
E. Create a scatter plot for the data. Create a residual plot for the data.
F. Create a hypothesis test and provide a conclusion to determine whether or not there is statistical
evidence to suggest that "test anxiety" is associated with test scores in math.
G. Create a 95% confidence interval for the slope of the regression line.
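The mechanics behind parts B, F, and G can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a model answer: it assumes the usual conditions for regression inference have been checked, tests H0: β = 0 against HA: β ≠ 0 with df = n − 2 = 22, and hard-codes the critical value t* = 2.074 from a t-table (an assumption to verify against your own table or calculator). All variable names are made up for this example.

```python
import math

# Anxiety level and test score for students 1 through 24, from the table
anxiety = [10, 8, 7, 7, 6, 2, 3, 5, 6, 8, 8, 10,
           10, 9, 7, 6, 4, 6, 6, 10, 3, 9, 2, 8]
score = [35, 61, 68, 39, 79, 94, 87, 81, 80, 57, 45, 44,
         27, 47, 60, 85, 29, 100, 41, 37, 66, 72, 68, 80]

n = len(anxiety)
x_bar = sum(anxiety) / n
y_bar = sum(score) / n

sxx = sum((x - x_bar) ** 2 for x in anxiety)
syy = sum((y - y_bar) ** 2 for y in score)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(anxiety, score))

b1 = sxy / sxx              # slope of the least squares line
b0 = y_bar - b1 * x_bar     # intercept
r_squared = sxy ** 2 / (sxx * syy)

# Standard error of the slope: s / sqrt(Sxx), with s = sqrt(SSE / (n - 2))
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(anxiety, score))
s = math.sqrt(sse / (n - 2))
se_b1 = s / math.sqrt(sxx)

t_stat = b1 / se_b1         # test statistic for H0: beta = 0
T_STAR_95 = 2.074           # assumed t-table value for df = 22, 95% confidence
ci_low = b1 - T_STAR_95 * se_b1
ci_high = b1 + T_STAR_95 * se_b1

print(f"predicted score = {b0:.2f} + ({b1:.3f})(anxiety)")
print(f"R^2 = {r_squared:.3f}, s = {s:.2f}, SE(b1) = {se_b1:.3f}")
print(f"t = {t_stat:.2f} with df = {n - 2}")
print(f"95% CI for slope: ({ci_low:.2f}, {ci_high:.2f})")
```

Comparing |t| with t*, or checking whether 0 falls inside the interval, gives the same two-sided decision at the 5 percent level.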
Name:_________________________ Death Packet: Regression and Chi-Square Period:_____
(Due Monday, March 2nd)
1. In a random sample of 25 high school students, each was interviewed as to GPA and weekly hours
worked at part-time jobs. What is the critical t-value in calculating a 90 percent confidence interval
estimate for the slope of the resulting least squares regression line?
(A) 1.645
(B) 1.703
(C) 1.708
(D) 1.711
(E) 1.714
2. Suppose that the scatterplot of log X and log Y shows a strong positive correlation close to 1. Which
of the following is true?
(A) The variables X and Y will also have a correlation close to 1.
(B) A scatterplot of the variables X and Y will show a strong nonlinear pattern.
(C) The residual plot of the variables X and Y will show a random pattern.
(D) The residual plot of the variables Log X and Log Y will show a strong nonrandom pattern.
(E) None of the above are true.
3. Which of the following gives a proper ordering?
(A) r1 < r3 < r2
(B) r2 < r3 < r1
(C) r2 < r1 < r3 < |r2|
(D) r3 < r1 < |r2|
(E) r1 < r2 < r3 < |r2|
4. Consider the points (-1, 4), (2, 10), (4, 15), (7, 21), (10, n). What should n be so that the correlation
between the x and y values is r = 1?
(A) 26
(B) 27
(C) 28
(D) A value different from any of the above.
(E) No value for n can make r = 1.
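A correlation of r = 1 requires every point to lie on one straight line with positive slope, so a natural first step for Question 4 is to check whether the four fully specified points are already collinear. The sketch below does that with exact fractions; the variable names are illustrative only.

```python
from fractions import Fraction

# The four points whose coordinates are fully given in Question 4
points = [(-1, 4), (2, 10), (4, 15), (7, 21)]

# Slope between each pair of consecutive points, kept exact with Fraction
slopes = [
    Fraction(y2 - y1, x2 - x1)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
]

# r = 1 is only possible if all of these slopes agree
collinear = len(set(slopes)) == 1

print("consecutive slopes:", [str(s) for s in slopes])
print("first four points collinear:", collinear)
```

If the consecutive slopes differ, no choice of the fifth point can put all five points on a single line.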
[Figure for Question 3: three scatterplots labeled Correlation r1, Correlation r2, and Correlation r3]
5. Four pairs of data are used in determining a regression line ŷ = -2 + 6x. If the four values of the
independent variable are 37, 52, 18 and 23, respectively, what is the mean of the four values of the
dependent variable?
(A) 32.5
(B) 193
(C) 195
(D) 778
(E) The mean cannot be determined from the given information.
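Question 5 rests on the fact that a least squares regression line always passes through the point (x̄, ȳ). A minimal sketch of that arithmetic, with illustrative variable names:

```python
# x-values from Question 5; the fitted line is y-hat = -2 + 6x
x_values = [37, 52, 18, 23]

x_bar = sum(x_values) / len(x_values)

# The least squares line passes through (x_bar, y_bar),
# so y_bar is the line's value at x_bar
y_bar = -2 + 6 * x_bar

print(x_bar, y_bar)  # 32.5 193.0
```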
6. Data on the number of cancer deaths among Americans (in 1,000s) and years (since 2001) result in the
regression line: predicted Deaths = 550 − 6.05(Years) with r = .863. What is the correct interpretation of the
slope?
(A) The number of cancer deaths among Americans has been dropping by an average of 6,050 per
year since 2001.
(B) The baseline number of cancer deaths among Americans is 550,000.
(C) The regression line explains 74.5 percent of the variation in cancer deaths among Americans
over the year since 2001.
(D) The regression line explains 86.3 percent of the variation in cancer deaths among Americans
over the years since 2001.
(E) Cancer will be cured in the year 2092.
7. To the right is a scatterplot with one point labeled X. Suppose you find the least squares regression
line. Which of the following is a true statement?
[Figure: scatterplot with one point labeled X]
(A) X has the largest residual, in absolute value, of any point on the scatterplot.
(B) X is an influential point.
(C) There will be no pattern in the residual plot.
(D) X is an outlier.
(E) None of the above are true statements.
8. Suppose the correlation between two variables is .85. If each of the y-variables is multiplied by -1,
which of the following is true about the new scatterplot?
(A) It slopes up to the right, and the correlation is -.85.
(B) It slopes down to the right, and the correlation is -.85.
(C) It slopes up to the right, and the correlation is .85.
(D) It slopes down to the right, and the correlation is .85.
(E) None of the above is true.
9. Which of the following statements about the correlation coefficient r is incorrect?
(A) It is not affected by the measurement units of the variables.
(B) It is not affected by which variable is called x and which is called y.
(C) It is not affected by extreme values.
(D) It gives information about a linear relationship, not about causation.
(E) It always gives values between -1 and 1, even if the association is nonlinear.
10. A linear regression analysis is performed on the data from two scatterplots, A and B, resulting in
identical least squares regression lines with positive slopes. Which of the following statements is
true?
(A) The sum of the squares of the residuals in A equals the sum of the squares of the residuals in B.
(B) The correlation in A equals the correlation in B.
(C) If the sum of the squares of the residuals in A is greater than the sum of the squares of the
residuals in B, then the correlation in A will be greater than the correlation in B.
(D) If the sum of the squares of the residuals in A is greater than the sum of the squares of the
residuals in B, then the correlation in A will be less than the correlation in B.
(E) None of the above are true statements.
11. A study of weekly hours of television watched and SAT scores reports a correlation of r = -1.18.
From this information, we can conclude that
(A) Students who watch more TV tend to have lower SAT scores
(B) The fewer the hours in front of a TV, the higher a student's SAT scores.
(C) There is little relationship between weekly hours of television watched and SAT scores.
(D) There is strong negative association between weekly hours of television watched and SAT
scores, but it would be wrong to conclude causation.
(E) A mistake in the arithmetic has been made.
12. A scatterplot of a company's revenues versus time indicates a possible exponential relationship. A
linear regression on y = log(revenue in $1,000) against x = years since 2005 gives ŷ = 0.75 + 0.63x
with r = .68. Which of the following is a valid conclusion?
(A) On the average, revenue goes up 0.63 thousand dollars per year.
(B) The predicted revenue for year 2009 is approximately 1,862 thousand dollars.
(C) Forty-six percent of the variation in revenue can be explained by variation in time.
(D) Sixty-eight percent of the variation in revenue can be explained by variation in time.
(E) None of the above are valid conclusions.
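When a model is fitted to log(revenue), a prediction has to be transformed back to the original scale before it can be read in dollars. The sketch below illustrates that step for Question 12, assuming "log" means the base-10 logarithm (the problem does not say so explicitly); the function name is illustrative.

```python
def predicted_log_revenue(years_since_2005):
    # Fitted line from Question 12: y-hat = 0.75 + 0.63x, with y = log(revenue in $1,000)
    return 0.75 + 0.63 * years_since_2005

x_2009 = 2009 - 2005
log_revenue_2009 = predicted_log_revenue(x_2009)

# Undo the (assumed base-10) log to return to thousands of dollars
revenue_2009 = 10 ** log_revenue_2009

# r^2 describes variation in log(revenue), not in revenue itself
r_squared_log_scale = 0.68 ** 2

print(f"predicted log(revenue) for 2009: {log_revenue_2009:.2f}")
print(f"predicted revenue for 2009: about {revenue_2009:.0f} thousand dollars")
print(f"r^2 on the log scale: {r_squared_log_scale:.4f}")
```

The slope of 0.63 is a change in log(revenue) per year, so it should not be read as a change of 0.63 thousand dollars per year.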
13. Which of the following are possible residual plots?
(A) I only
(B) II only
(C) III only
(D) I and II
(E) I, II, and III
[Figure for Question 13: three residual plots labeled I, II, and III]
14. Suppose a study finds that the correlation coefficient relating job satisfaction to salary is r = +1.
Which of the following is a proper conclusion?
(A) High salary causes high job satisfaction
(B) Low salary causes low job satisfaction
(C) There is a 100% cause-and-effect relationship between salary and job satisfaction.
(D) There is a very strong association between salary and job satisfaction.
(E) None of the above are proper conclusions.
15. Which of the following statements about the correlation r is true?
(A) When r = 0, there is no relationship between the variables.
(B) When r = .2, 20 percent of the variables are closely related.
(C) When r = 1, there is a perfect cause-and-effect relationship between the variables.
(D) A correlation close to 1 means that a linear model will give the best fit to the data.
(E) All the statements are false.
16. Consider the following three scatterplots:
Which has the greatest correlation coefficient r?
(A) I
(B) II
(C) III
(D) They have the same correlation coefficient.
(E) The question cannot be answered without additional information.
17. Suppose the correlation between two variables is r = .28. What will the new correlation be if .17 is
added to all values of the x-variable, every value of the y-variable is doubled, and the two variables
are interchanged?
(A) .28
(B) .45
(C) .56
(D) .90
(E) -.28
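[Figure for Question 16: three scatterplots labeled I, II, and III. Axis ticks: I, horizontal 5 to 8 and vertical 20 to 40; II, horizontal 6 to 9 and vertical 30 to 50; III, horizontal 10 to 16 and vertical 40 to 80.]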
18. The number of students taking AP Statistics at a high school during the years 2000-2007 is fitted
with a least squares regression line. The graph of the residuals and some computer output are as
follows.
How many students took AP Statistics in the year 2003?
(A) 47
(B) 48
(C) 52
(D) 53
(E) 58
19. Consider the scatterplot of midterm and final exam
scores for a class of 20 students.
Which of the following is incorrect?
(A) Both the correlation and slope are negative.
(B) The same number of students scored 100 on the
midterm as did on the final.
(C) The same percentage of students scored below 50 on each exam.
(D) Most students who scored below 50 on the midterm also scored below 50 on the final.
(E) Some students scored higher on the final exam than they did on the midterm.
20. If the standard deviation of a set of observations is 0, you can conclude
(A) That there is no relationship between the observations.
(B) That the average value is 0.
(C) That all observations are the same value
(D) That a mistake in the arithmetic has been made.
(E) None of the above.
21. Which of the following statements about the correlation r is incorrect?
(A) The correlation and the slope of the regression line always have the same sign.
(B) A correlation of -.32 and a correlation of +.32 show the same degree of clustering around the
regression line.
(C) Correlation r measures the strength and direction only of linear association.
(D) A correlation of .78 indicates a relationship that is 3 times as linear as one for which the
correlation is .26.
(E) Outliers can greatly affect the value of r.
[Figure for Question 18: residual plot with horizontal axis from 0 to 7 and vertical axis from -15 to 5]
Dependent variable is: Students
Variable   Coeff     s.e.    t      p
Constant   11        6.299   1.75   0.1313
Years      13.9826   1.506   9.25   0.0001
S = 9.758   R-sq = 93.4%   R-sq(adj) = 92.4%
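[Figure for Question 19: scatterplot of Final Exam Score (vertical axis, ticks at 50 and 100) versus Midterm Exam Score (horizontal axis, ticks at 50 and 100)]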
22. If every man married a woman who was exactly 3 years younger than he, what would be the
correlation between the ages of married men and women?
(A) Somewhat negative
(B) 0
(C) Somewhat positive
(D) Nearly 1
(E) 1
23. Which of the following statements about residuals is incorrect?
(A) The mean of the residuals is always zero.
(B) The sum of the residuals is always zero.
(C) The regression line for a residual plot is a horizontal line.
(D) The standard deviation of the residuals gives a measure of how the points in the scatterplot are
spread around the regression line.
(E) A residual equals the predicted y minus the observed y.
24. Data are obtained from a random sample of adult women with regard to their ages and their monthly
expenditures on health products.
The resulting regression equation is predicted Expenditure = 43 + 0.23(Age) with r = .27. What percentage
of the variation can be explained by looking at ages?
(A) 0.23 percent
(B) 23 percent
(C) 7.29 percent
(D) 27 percent
(E) 52.0 percent
25. Consider the following 3 scatterplots:
Which of the following is a true statement about the correlations for the three scatterplots?
(A) None are 0.
(B) One is 0, one is negative, and one is positive.
(C) One is 0, and both of the others are negative
(D) Two are 0, and the other is -1.
(E) Two are 0, and the other is close to -1.
26. Data on ages (in years) and prices (in $100) for ten cars of a specific model result in the regression
line: predicted Price = 250 − 30(Age). Given that 64 percent of the variation in price is explained by variation in
age, what is the value of the correlation coefficient r?
(A) -.64
(B) -.80
(C) .64
(D) .80
(E) There is insufficient information to answer this question.
27. Which of the following is a true statement about the correlation coefficient r?
(A) A correlation of .3 means that 30 percent of the points are highly correlated
(B) The square of the correlation measures the proportion of the y-variance that is predictable
from a knowledge of x.
(C) Perfect correlation, that is, when the points lie exactly on a straight line, results in r = 0.
(D) Multiplying every y-value by -1 leaves the correlation unchanged.
(E) The unit of measure for correlation is the y-unit per x-unit.
28. A study is conducted relating AP Statistics exam scores to the total number of study hours for the
AP Statistics class put in by students during the academic year, and the correlation is found to be
.6. Which of the following is a true statement?
(A) On average, a 40 percent increase in study time results in a 24 percent increase in exam score.
(B) On average, a 60 percent increase in study time results in a 100 percent increase in exam score.
(C) Sixty percent of a student's exam score can be explained by the number of study hours.
(D) Sixty percent of the variation in exam scores can be accounted for by this linear regression
model.
(E) Higher exam scores tend to be associated with higher numbers of study hours.
29. Which of the following scatterplots could
have resulted in this residual plot?
(The y-axis scales are not the same in the
scatterplots as in the residual plot.)
[Figure: residual plot (vertical axis: Residuals) and four candidate scatterplots labeled (A), (B), (C), and (D)]
(E) None of these scatterplots could result in the given residual plot.
30. Consider the three points (4, 33), (5, 27), and (6, 15). Given any straight line, we can calculate the
sum of the squares of the three vertical distances from these points to the line. What is the
smallest possible value this sum can be?
(A) 2.45
(B) 6
(C) 8.66
(D) 36
(E) None of these values
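By definition, the least squares line is the line that minimizes the sum of squared vertical distances, so the smallest possible sum in Question 30 is the sum of squared residuals from that line. A small sketch of the computation, with illustrative variable names:

```python
points = [(4, 33), (5, 27), (6, 15)]
n = len(points)

x_bar = sum(x for x, _ in points) / n
y_bar = sum(y for _, y in points) / n

sxx = sum((x - x_bar) ** 2 for x, _ in points)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Sum of squared vertical distances from the points to the least squares line
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in points)

print(slope, intercept, sse)  # -9.0 70.0 6.0
```

Any other straight line gives a sum at least this large.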
31. A study is made relating Power (in cold cranking amps) of auto batteries as a function of Price (in
dollars). Data from a sample of 13 batteries generates the following computer output:
Dependent Variable is: Power
s = 74.29   R-sq = 68.3%   R-sq(adj) = 65.5%
Source       Sum of Squares   df   Mean Square
Regression   131088           1    131088
Residual     60704.5          11   5518.59
Variable   Coefficient   s.e. Coeff   t-ratio   P
Constant   410.997       67.79        6.06      0.000
Price      5.52979       1.135        4.87      0.000
Which of the following gives a 90 percent confidence interval for the slope of the regression line?
(A) 410.997 ± 1.771(67.79)
(B) 410.997 ± 1.796(67.79)
(C) 5.52979 ± 1.645(74.29)
(D) 5.52979 ± 1.771(1.135)
(E) 5.52979 ± 1.796(1.135)
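A confidence interval for a regression slope has the form b1 ± t*·SE(b1), with df = n − 2. The sketch below applies that form to the printout in Question 31; the critical value 1.796 is taken from a t-table for df = 11 and 90 percent confidence, and should be treated as an assumption to confirm against your own table.

```python
# Values read from the Question 31 printout (row for the explanatory variable, Price)
slope = 5.52979
se_slope = 1.135
n = 13

df = n - 2            # regression inference uses n - 2 degrees of freedom
T_STAR_90 = 1.796     # assumed t-table value for df = 11, 90% confidence

margin = T_STAR_90 * se_slope
ci_low, ci_high = slope - margin, slope + margin

print(f"df = {df}")
print(f"90% CI for slope: ({ci_low:.3f}, {ci_high:.3f})")
```

Note that the interval uses the standard error of the slope from the coefficient table, not s (the standard deviation of the residuals) and not the row for the constant.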
32. Suppose the correlation is negative. Given two points from the scatterplot, which of the following is
possible?
I. The first point has a larger x-value and a smaller y-value than the second point.
II. The first point has a larger x-value and a larger y-value than the second point.
III. The first point has a smaller x-value and a larger y-value than the second point.
(A) I only
(B) II only
(C) III only
(D) I and III
(E) I, II, and III
33. A regression analysis on the relationship between average annual profits (in millions of dollars) of
insurance companies and the yearly number of major natural disasters (hurricanes, tornadoes,
earthquakes, etc.) gives the following computer output:
Dependent Variable is: Profits
s = 10.17   R-sq = 46.5%   R-sq(adj) = 40.5%
Source       df   SS        MS        F
Regression   1    807.309   807.309   7.81
Residual     9    930.327   103.37
Variable    Coef       s.e. Coeff   t       P
Constant    420.727    5.735        73.4    0.0001
Disasters   -2.70909   0.9694       -2.79   0.0209
If the analysis was rerun using number of disasters as the dependent variable instead of profits, what
would be the correlation coefficient?
(A) -.318
(B) .465
(C) -.465
(D) .682
(E) -.682
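The correlation can be recovered from a regression printout because r² equals R-sq and r carries the same sign as the slope. Swapping which variable is called dependent does not change r. A minimal sketch using the Question 33 output, with illustrative variable names:

```python
import math

# Values read from the Question 33 printout
r_squared = 0.465      # R-sq = 46.5%
slope = -2.70909       # coefficient for Disasters

# r has the magnitude sqrt(R-sq) and the sign of the slope
r = math.copysign(math.sqrt(r_squared), slope)

print(round(r, 3))  # -0.682
```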
34. Which is the correct scatterplot for the computer output below?
Variable      Coef       s.e.     t      p
Constant      10.3001    0.5668   18.2   0.0001
Explanatory   -1.73391   0.197    -8.8   0.0001
S = 0.9093   R-sq = 90.6%   R-sq(adj) = 89.5%
[Figure: five candidate scatterplots labeled (A) through (E), each showing Response (vertical axis) against Explanatory (horizontal axis). In (A) the horizontal ticks run from 2 to 10 and the vertical ticks from 1 to 5; in (B) through (E) the horizontal ticks run from 1 to 5 and the vertical ticks from 2 to 10.]
35. A linear regression analysis is performed on two variables. Which of the following tells you that
another model probably gives a better fit?
(A) The correlation r is low.
(B) The mean of the residuals is 0.
(C) The p-value for H0: β = 0 and HA: β > 0 is low.
(D) The coefficient of determination is high.
(E) The residual plot has a pattern.
36. Which is the correct regression output for the scatterplot below?
(A) Variable Coef s.e t p
Constant 5.45333 0.5487 9.94 0.0001
Explanatory -0.451515 0.008844 -5.11 0.0009
S = 0.8033 R-sq = 76.5% R-sq(adj) = 73.6%
(B) Variable Coef s.e t p
Constant -5.45333 0.5487 9.94 0.0001
Explanatory 0.451515 0.008844 -5.11 0.0009
S = 0.8033 R-sq = 76.5% R-sq(adj) = 73.6%
(C) Variable Coef s.e t p
Constant 5.45333 0.5487 9.94 0.0001
Explanatory 0.451515 0.008844 -5.11 0.0009
S = 0.8033 R-sq = 36.5% R-sq(adj) = 33.6%
(D) Variable Coef s.e t p
Constant 5.45333 0.5487 9.94 0.0001
Explanatory -0.451515 0.008844 -5.11 0.0009
S = 0.8033 R-sq = 36.5% R-sq(adj) = 33.6%
(E) Variable Coef s.e t p
Constant 5.45333 0.5487 9.94 0.0001
Explanatory 0.451515 0.008844 -5.11 0.0009
S = 0.8033 R-sq = 76.5% R-sq(adj) = 73.6%
[Figure for Question 36: scatterplot of Response (vertical axis, ticks 1 to 5) versus Explanatory (horizontal axis, ticks 2 to 10)]
37. An automotive insurance company is interested in the association between age of SUVs and odometer
reading for their clients. Data from 20 randomly selected clients generates the following computer
output:
Dependent Variable is: Mileage
Source df SS MS F
Regression 1 21.5613e9 21.5613e9 87.3
Residual 18 4.44774e9 247.097e6
Variable Coeff s.e. Coeff t P
Constant 9717.95 7208 1.315 0.194
Age 15675.2 1678 9.34 0.000
s = 15720 R-sq = 82.9% R-sq(adj) = 81.9%
Which of the following gives a 96 percent confidence interval for the slope of the regression line?
(A) 9,718 ± 2.1054(7,208)
(B) 9,718 ± 2.197(7,208)
(C) 9,718 ± 2.214(7,208/√20)
(D) 15,675 ± 2.197(15,720/√20)
(E) 15,675 ± 2.214(1,678)
38. A theater owner believes that attendance actually goes up if the best seats are advertised for
higher prices. Below is the computer printout for the regression analysis of Attendance (in 1,000s)
versus Price (in dollars) of the best seats.
Dependent Variable is: Attendance
Source SS df MS F
Regression 105.226 1 105.226 6.15
Residual 102.649 6 17.1082
Variable Coef s.e. Coeff t P
Constant 18.2961 5.182 3.53 0.0124
Price 0.573437 0.2312 2.48 0.0478
s = 4.136 R-sq = 50.6% R-sq(adj) = 42.4%
What is the p-value for a t-test with H0: β = 0 and HA: β > 0?
(A) .0124
(B) .0239
(C) .0478
(D) .0956
(E) .5060
39. Below is the computer output for a regression analysis involving starting salary (in $1,000) and
college GPA
Variable Coef s.e. Coeff t P
Constant -0.73391 5.744 -0.128 0.9015
GPA 11.8204 1.848 5.4 0.0002
s = 3.772 R-sq = 83.6% R-sq(adj) = 81.6%
Source df SS MS F
Regression 1 581.798 581.798 40.9
Residual 8 113.802 14.2252
What is the equation for the least squares regression line?
(A) predicted GPA = -0.734 + 11.82(salary)
(B) predicted Salary = -0.734 + 11.82(GPA)
(C) predicted GPA = 0.9015 + 0.0002(salary)
(D) predicted Salary = 5.744 + 1.848(GPA)
(E) predicted Salary = 1.848 + 11.82(GPA)
40. Before Challenger went off at 31°F, each of the 23 earlier launches experienced from zero to three
O-ring failures. There was some speculation that the number of O-ring failures was related to the
temperature at lift-off. A computer printout, performed too late, is shown below.
Dependent Variable: Failures
s = 0.6673   R-sq = 31.5%   R-sq(adj) = 28.2%
Variable      Coef         s.e. Coeff   t       P
Constant      4.79365      1.409        3.4     0.0027
Temperature   -0.0626587   0.02016      -3.11   0.0052
Source       df   SS        MS         F
Regression   1    4.30166   4.30166    9.66
Residual     21   9.35052   0.445263
Is there evidence of a relationship between failure and temperature?
(A) There is no evidence of a relationship between number of O-ring failures and temperature at
lift-off.
(B) There is evidence of a relationship at the .10 level, but not at the .05 level.
(C) There is evidence of a relationship at the .05 level, but not at the .01 level.
(D) There is evidence of a relationship at the .01 level but not at the .001 level.
(E) There is evidence of a relationship at the .001 level.
41. A linear regression analysis relating secretaries' salaries to years of experience yields:
ŷ = 19.78 + 2.405x, where x is years of experience and y is salary (in $1,000). Which of the
following is the most proper conclusion?
(A) A starting secretary will earn $19,780, while one with 70 years of experience should
earn $188,130.
(B) Starting secretaries average $19,780 with bonuses of $2,405 every year.
(C) There is a cause-and-effect relationship between secretaries' salaries and experience with each
extra year of experience corresponding to an extra $2,405 in salary.
(D) Starting salaries for secretaries average $19,780 and each year of experience is associated
with an average extra $2,405.
(E) There is a high correlation between secretaries' salaries and years of experience.
42. Data from an SRS of property owners is cross-classified by gender and support for a new educational
initiative giving the following table:
          Male   Female
For       35     45
Against   55     50
Is there evidence of a relationship between gender and support for the initiative among property
holders?
(A) There is strong evidence of a relationship between gender and support for the initiative.
(B) There is weak evidence of a relationship between gender and support for the initiative.
(C) There is no evidence of a relationship between gender and support for the initiative.
(D) Further information is needed to be able to perform a chi-square test of independence.
(E) The test is inconclusive.
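The chi-square statistic for a two-way table compares each observed count with its expected count, (row total)(column total)/n. The sketch below works through the Question 42 table in plain Python. The 5 percent critical value 3.841 for df = 1 is taken from a chi-square table and is an assumption to confirm against your own; variable names are illustrative.

```python
# Rows: For, Against; columns: Male, Female (Question 42)
observed = [[35, 45],
            [55, 50]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_05_DF1 = 3.841   # assumed chi-square table value, df = 1, alpha = .05

print(f"chi-square = {chi2:.3f} with df = {df}")
print("exceeds the 5% critical value:", chi2 > CRITICAL_05_DF1)
```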
43. Three professors are interviewed as to a sampling of their grades, and the following table gives the
resulting counts.
              Prof. A   Prof. B   Prof. C
Grades A, B   3         8         12
Grades C      15        9         8
Grades D, F   2         3         4
A statistics student runs a chi-square test of homogeneity. What is the most proper conclusion?
(A) There is no evidence of a relationship between these professors and grades.
(B) There is evidence at the 10 percent level, but not at the 5 percent level, that the professors give
different grade distributions.
(C) There is evidence at the 5 percent level, but not at the 1 percent level, that the professors give
different grade distributions.
(D) There is evidence at the 1 percent level that the professors give different grade distributions.
(E) A chi-square test of homogeneity is not appropriate.
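Before running a chi-square test on a table like the one in Question 43, it is worth checking the expected counts, since the usual guideline asks for every expected cell count to be at least 5. A short sketch of that check, with illustrative names:

```python
# Rows: Grades A-B, Grades C, Grades D-F; columns: Prof. A, Prof. B, Prof. C
observed = [[3, 8, 12],
            [15, 9, 8],
            [2, 3, 4]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

smallest = min(min(row) for row in expected)
all_at_least_5 = smallest >= 5

for row in expected:
    print([round(e, 2) for e in row])
print(f"smallest expected count: {smallest:.4f}")
print("all expected counts at least 5:", all_at_least_5)
```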
44. A highway superintendent states that four bridges into a city are used in the ratio 2:3:3:4 during
the morning rush hour. A highway study of an SRS of 6,000 cars indicates that 920, 1,570, 1,480,
and 2,030 cars use the four bridges, respectively. Can the superintendent's claim be rejected at the
1 or 5 percent level of significance?
(A) There is sufficient evidence to reject the claim at either of these two levels.
(B) There is sufficient evidence to reject the claim at the 1 percent level, but not at the 5 percent
level.
(C) There is sufficient evidence to reject the claim at the 5 percent level, but not at the 1 percent
level.
(D) There is not sufficient evidence to reject the claim at either of these two levels.
(E) There is not sufficient information to answer this question.
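For a goodness-of-fit test against a claimed ratio, the expected counts come from splitting the sample size in that ratio. The sketch below does this for the counts in Question 44. The critical values 7.815 (5 percent) and 11.345 (1 percent) for df = 3 are taken from a chi-square table and should be treated as assumptions to verify; variable names are illustrative.

```python
observed = [920, 1570, 1480, 2030]
ratio = [2, 3, 3, 4]          # claimed usage ratio

n = sum(observed)
expected = [n * part / sum(ratio) for part in ratio]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

CRITICAL_05_DF3 = 7.815       # assumed table value, df = 3, alpha = .05
CRITICAL_01_DF3 = 11.345      # assumed table value, df = 3, alpha = .01

print("expected counts:", expected)
print(f"chi-square = {chi2:.3f} with df = {df}")
print("reject at 5%:", chi2 > CRITICAL_05_DF3)
print("reject at 1%:", chi2 > CRITICAL_01_DF3)
```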
45. A study of accidents at a large factory reported the following numbers by shift:
Shift Morning Afternoon Night
Accidents 35 77 53
Is there sufficient evidence to say that the numbers of accidents on the three shifts are not the
same?
(A) There is sufficient evidence at the .001 significance level that the number of accidents on each
shift are not the same.
(B) There is sufficient evidence at the .01 level, but not at the .001 level, that the number of
accidents on each shift are not the same.
(C) There is sufficient evidence at the .05 level, but not at the .01 level, that the number of
accidents on each shift are not the same.
(D) There is sufficient evidence at the .10 level, but not at the .05 level, that the number of
accidents on each shift are not the same.
(E) There is not sufficient evidence to say that the number of accidents on each shift are not the
same.
46. To compare prices at a grocery store in the suburbs with one in the city, a housewife picks ten basic
items and checks the prices of these items at each store. Which test should she use to determine if
the prices are different at the two stores?
(A) χ² test for goodness-of-fit
(B) χ² test for independence
(C) Two sample Z-test
(D) Two sample t-test
(E) Matched pairs t-test
47. Last year the audience percentages captured by major news programs were as follows: (ABC, NBC,
CBS): 36 percent CNN 42 percent FOX 22 percent. In a random sample of 50 viewers this year, 23
watched (ABC, NBC, CBS), 17 watched CNN, and 10 watched FOX. If a goodness-of-fit test were
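performed what would be the p-value?
[The numerals inside the answer choices did not survive extraction; only the structure of each expression and its degrees of freedom are recoverable.]
(A) P(χ² > (__ − __)²/__ + (__ − __)²/__ + (__ − __)²/__) with df = 2
(B) P(χ² > (__ − __)²/__ + (__ − __)²/__ + (__ − __)²/__) with df = 3
(C) P(χ² > (__ − __)²/__ + (__ − __)²/__ + (__ − __)²/__) with df = 2
(D) P(χ² > (__ − __)²/__ + (__ − __)²/__ + (__ − __)²/__) with df = 3
(E) P(χ² > (__ − __)²/__._ + (__ − __)²/__._ + (__ − __)²/__._) with df = 49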
48. A geneticist claims that four species of fruit flies should appear in the ratio 1:3:3:9. Suppose that
a sample of 480 flies contained 25, 92, 68, and 295 flies of each species, respectively. Does a
chi-square test show sufficient evidence to reject the geneticist's claim?
(A) The test proves the geneticist's claim.
(B) The test proves the geneticist's claim is false.
(C) The test does not give sufficient evidence to reject the geneticist's claim.
(D) The test gives sufficient evidence to reject the geneticist's claim.
(E) The test is inconclusive.
49. The table below shows the number of students referred for disciplinary reasons to the principal's
office, broken down by the day of the week. A counselor would like to know if such referrals are
related to the day of the week. What is the value of chi-square for the appropriate test?
Monday   Tuesday   Wednesday   Thursday   Friday
12       5         9           4          15
(A) (12 − 9)²/12 + (5 − 9)²/5 + (9 − 9)²/9 + (4 − 9)²/4 + (15 − 9)²/15
(B) (12 − 9)²/9 + (5 − 9)²/9 + (9 − 9)²/9 + (4 − 9)²/9 + (15 − 9)²/9
(C) (12 − 9)²/45 + (5 − 9)²/45 + (9 − 9)²/45 + (4 − 9)²/45 + (15 − 9)²/45
(D) (12 − 5)²/12 + (5 − 5)²/5 + (9 − 5)²/9 + (4 − 5)²/4 + (15 − 5)²/15
(E) (12 − 5)²/5 + (5 − 5)²/5 + (9 − 5)²/5 + (4 − 5)²/5 + (15 − 5)²/5
50. Which of the following is the proper use of a chi-square test of independence?
(A) To test whether the distribution of counts on a categorical variable matches a claimed
distribution.
(B) To test whether the distribution of counts on a numerical variable matches a claimed
distribution
(C) To test whether the distribution of two different groups on the same categorical variable
matches
(D) To test whether two categorical variables on the same subjects are related.
(E) To test whether two numerical variables on the same subject are related.
51. A small appliance manufacturer sets up three locations to provide service for its customers. Logs are
kept noting whether or not calls about the problems are solved successfully. Data from a sample of
500 calls are summarized in the following table:
                     Location 1   Location 2   Location 3
Problem solved       124          98           103
Problem not solved   55           63           57
Assuming there is no association between the location and whether or not a problem is resolved
successfully, that is H0: independence, what is the expected number of successful calls (problem
solved) from location 2?
(A) (161)(98)/325
(B) (325)(98)/161
(C) (161)(325)/98
(D) (161)(325)/486
(E) (161)(325)/500
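Under independence, the expected count for a cell is (row total)(column total)/(table total). The sketch below computes the totals for the Question 51 table and the expected count for the "problem solved" cell at location 2; the variable names are illustrative.

```python
# Columns: Location 1, Location 2, Location 3 (Question 51)
solved = [124, 98, 103]
not_solved = [55, 63, 57]

row_total_solved = sum(solved)                 # all successful calls
col_total_location_2 = solved[1] + not_solved[1]
n = sum(solved) + sum(not_solved)

# Expected count = row total * column total / table total
expected_solved_location_2 = row_total_solved * col_total_location_2 / n

print(row_total_solved, col_total_location_2, n)
print(f"expected solved calls at location 2: {expected_solved_location_2:.2f}")
```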
52. In the following table, what value of n results in a table showing perfect independence?
40 60
50 n
(A) 30
(B) 50
(C) 70
(D) 75
(E) 100
53. Two commercial flights per day are made from a small county airport. The airport manager tabulates
the number of on-time departures for a sample of 200 days.
Number of on-time departures   0    1    2
Observed number of days        12   75   113
What is the χ² statistic for a goodness-of-fit test that the distribution is binomial with probability
equal to 0.8 that a flight leaves on time?
(A) (12 − 8)²/8 + (75 − 64)²/64 + (113 − 128)²/128
(B) (12 − 8)²/12 + (75 − 64)²/75 + (113 − 128)²/113
(C) (12 − 10)²/10 + (75 − 30)²/30 + (113 − 160)²/160
(D) (12 − 10)²/10 + (75 − 30)²/75 + (113 − 160)²/113
(E) (12 − 66)²/12 + (75 − 67)²/75 + (113 − 67)²/113
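When the claimed distribution is binomial, the expected counts come from multiplying the number of days by each binomial probability. The following sketch builds those expected counts for Question 53 with n = 2 flights and p = 0.8, then forms the chi-square sum; variable names are illustrative.

```python
from math import comb

p = 0.8            # claimed probability that a flight leaves on time
flights = 2        # flights per day
days = 200
observed = [12, 75, 113]   # days with 0, 1, 2 on-time departures

# Binomial probabilities for 0, 1, and 2 on-time departures
probs = [comb(flights, k) * p ** k * (1 - p) ** (flights - k)
         for k in range(flights + 1)]
expected = [days * q for q in probs]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print("expected days:", [round(e, 2) for e in expected])
print(f"chi-square = {chi2:.3f}")
```

Each term divides by the expected count, not the observed count.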
54. Which of the following is not true with regard to contingency tables for chi-square tests for
independence?
(A) Categorical rather than quantitative variables are being considered.
(B) Observed frequencies should be whole numbers.
(C) Expected frequencies should be whole numbers.
(D) Expected frequencies in each cell should be at least 5, and to achieve this, one sometimes
combines categories for one or the other or both of the variables.
(E) The expected frequency for any cell can be found by multiplying the row total by the column
total and dividing by the sample size.
55. Is preference between vanilla and chocolate ice cream independent of age? 210 interviews yielded
the following numbers:
Age 10-19 20-29 30-39 40-49
Prefer vanilla 22 31 21 40
Prefer chocolate 31 28 24 13
What is a reasonable conclusion?
(A) There is very strong evidence of a relationship between age and ice cream preference.
(B) There is weak evidence of a relationship between age and ice cream preference.
(C) There is no evidence of a relationship between age and ice cream preference.
(D) Further information is needed to be able to perform a chi-square test of independence.
(E) The test is inconclusive.
56. A survey of 200 high school seniors is conducted to see if there is a relationship between whether or
not a student is taking AP Statistics and whether the student plans to attend a public or private
college. The data are summarized in the following table:
                     Public   Private
Taking AP Stat       33       77
Not taking AP Stat   47       43
What is the test statistic for a hypothesis test with H0: independence?
(A) (33 − 44)²/44 + (77 − 66)²/66 + (47 − 36)²/36 + (43 − 54)²/54
(B) (33 − 44)²/33 + (77 − 66)²/77 + (47 − 36)²/47 + (43 − 54)²/43
(C) (33 − 55)²/55 + (77 − 55)²/55 + (47 − 45)²/45 + (43 − 45)²/45
(D) (33 − 40)²/40 + (77 − 60)²/60 + (47 − 40)²/40 + (43 − 60)²/60
(E) (33 − 50)²/33 + (77 − 50)²/77 + (47 − 50)²/47 + (43 − 50)²/43
57. It is hypothesized that scores on a certain intelligence test are normally distributed with a mean of
100 and a standard deviation of 10. A psychologist runs a goodness-of-fit hypothesis test on an SRS of
100 scores resulting in the table below. What is the $\chi^2$ statistic for this test?
Score:              Below 90   90-100   100-110   Above 110
Number of people:      10        40        35         15
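(A) $\frac{(10-16)^2}{10} + \frac{(40-34)^2}{40} + \frac{(35-34)^2}{35} + \frac{(15-16)^2}{15}$
(B) $\frac{(10-16)^2}{16} + \frac{(40-34)^2}{34} + \frac{(35-34)^2}{34} + \frac{(15-16)^2}{16}$
(C) $\frac{(10-25)^2}{25} + \frac{(40-25)^2}{25} + \frac{(35-25)^2}{25} + \frac{(15-25)^2}{25}$
(D) $\frac{(10-25)^2}{10} + \frac{(40-25)^2}{40} + \frac{(35-25)^2}{35} + \frac{(15-25)^2}{15}$
(E) $\frac{(10-25)^2}{16} + \frac{(40-25)^2}{34} + \frac{(35-25)^2}{34} + \frac{(15-25)^2}{16}$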
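In question 57 the expected counts come from the hypothesized normal model rather than from row and column totals. A minimal sketch of that step, using `NormalDist` from Python's standard `statistics` module; the variable names are illustrative only:

```python
from statistics import NormalDist

# Hypothesized model from question 57: mean 100, standard deviation 10
model = NormalDist(mu=100, sigma=10)
n = 100
observed = [10, 40, 35, 15]          # below 90, 90-100, 100-110, above 110
cut_points = [90, 100, 110]

# Probability of each interval from differences of the normal CDF
cdf = [0.0] + [model.cdf(c) for c in cut_points] + [1.0]
probabilities = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
expected = [n * p for p in probabilities]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print([round(e, 2) for e in expected])
print(round(chi_square, 3))
```

The answer choices use expected counts rounded to whole numbers, so the statistic printed here (from unrounded counts) will differ slightly from a hand calculation with the rounded values.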
FRQ Review Linear Regression & Chi-Square
Due: Thursday, March 5th
1. 2001 Number 6 The statistics department at a large university is trying to determine if it is
possible to predict whether an applicant will successfully complete the Ph.D. program or will leave
before completing the program. The department is considering whether GPA (grade point average) in
undergraduate statistics and mathematics courses (a measure of performance) and mean number of
credit hours per semester (a measure of workload) would be helpful measures. To gather data, a
random sample of 20 entering students from the past 5 years is taken. The data are given below.
Successfully Completed Ph.D. Program
Student A B C D E F G H I J K L M
GPA 3.8 3.5 4.0 3.9 2.9 3.5 3.5 4.0 3.9 3.0 3.4 3.7 3.6
Credit hours 12.7 13.1 12.5 13.0 15.0 14.7 14.5 12.0 13.1 15.3 14.6 12.5 14.0
Did Not Complete Ph.D. Program
Student N O P Q R S T
GPA 3.6 2.9 3.1 3.5 3.9 3.6 3.3
Credit hours 11.1 14.5 14.0 10.9 11.5 12.1 12.0
The regression output below resulted from fitting a line to the data in each
group. The residual plot (not shown) indicated no unusual patterns, and the assumptions necessary
for inference were judged to be reasonable.
Successfully Completed Ph.D. Program
Predictor    Coef      StDev     T       P
Constant     23.514    1.684     13.95   0.000
GPA          -2.7555   0.4668    -5.90   0.000
S = 0.5658   R-Sq = 76.0%
Did Not Complete Ph.D. Program
Predictor    Coef      StDev     T       P
Constant     24.200    3.474     6.97    0.001
GPA          -3.485    1.013     -3.44   0.018
S = 0.8408   R-Sq = 70.3%
(a) Use an appropriate graphical display to compare the GPAs for the two groups. Write a few
sentences commenting on your display.
(b) For the students who successfully completed the Ph.D. program, is there a significant
relationship between GPA and mean number of credit hours per semester?
Give a statistical justification to support your response.
(c) If a new applicant has a GPA of 3.5 and a mean number of credit hours per semester of 14.0,
do you think this applicant will successfully complete the Ph.D. program? Give a statistical
justification to support your response.
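Parts (b) and (c) of question 1 both work directly from the printed output. The sketch below shows one possible line of reasoning, not the official scoring approach: the slope t statistic for part (b), and for part (c) how far the applicant's 14.0 credit hours sits from each group's fitted line, measured in units of that group's S. The dictionary layout and function names are made up for illustration; the group sizes 13 and 7 are counted from the data tables.

```python
# Coefficients and standard errors copied from the regression output in question 1
completed = {"intercept": 23.514, "slope": -2.7555, "se_slope": 0.4668, "s": 0.5658, "n": 13}
not_completed = {"intercept": 24.200, "slope": -3.485, "se_slope": 1.013, "s": 0.8408, "n": 7}


def slope_t(fit):
    """t statistic for H0: slope = 0, with n - 2 degrees of freedom."""
    return fit["slope"] / fit["se_slope"], fit["n"] - 2


def residual(fit, gpa, credit_hours):
    """Observed credit hours minus the credit hours predicted from GPA."""
    predicted = fit["intercept"] + fit["slope"] * gpa
    return credit_hours - predicted


t, df = slope_t(completed)
print(f"completed group: t = {t:.2f}, df = {df}")

# Part (c): applicant with GPA 3.5 and 14.0 credit hours, compared with each line
for name, fit in [("completed", completed), ("did not complete", not_completed)]:
    r = residual(fit, 3.5, 14.0)
    print(f"{name}: residual = {r:.2f}, in units of s = {r / fit['s']:.2f}")
```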
2. 2003B Question 1 A simple random sample of 9 students was selected from a large university. Each
of these students reported the number of hours he or she had allocated to studying and the number
of hours allocated to work each week. A least squares linear regression was performed and part of
the resulting computer output is shown below.
Predictor Coef StDev T P
Constant 8.107 2.731 2.97 0.021
Work 0.4919 0.1950 2.52 0.040
S = 4.349 R-Sq = 47.6% R-Sq (adj) = 40.1%
The scatterplot below displays the data that were collected from the 9 students.
(a) After point P, labeled on the graph on the previous page, was removed from the data, a second
linear regression was performed and the computer output is shown below.
Predictor Coef StDev T P
Constant 11.123 3.986 2.79 0.032
Work 0.1500 0.3834 0.39 0.709
S = 4.327 R-Sq = 2.5% R-Sq (adj) = 0.0%
Does point P exercise a large influence on the regression line? Explain.
(b) The researcher who conducted the study discovered that the number of hours spent studying
reported by the student represented by P was recorded incorrectly. The corrected data point for
this student is represented by the letter Q in the scatterplot below.
Explain how the least squares regression line for the corrected data (in this part) would differ from the
least squares regression line for the original data.
3. 2006 Number 2 A manufacturer of dish detergent believes the height of soapsuds in the dishpan
depends on the amount of detergent used. A study of the suds' heights for a new dish detergent was
conducted. Seven pans of water were prepared. All pans were of the same size and type and
contained the same amount of water. The temperature of the water was the same for each pan. An
amount of dish detergent was assigned at random to each pan, and that amount of detergent was
added to the pan. Then the water in the dishpan was agitated for a set amount of time, and the
height of the resulting suds was measured.
A plot of the data and the computer output from fitting a least squares regression line to the data
are shown below.
(a) Write the equation of the fitted regression line. Define any variables used in this equation.
(b) Note that s = 1.99821 in the computer output. Interpret this value in the context of this study.
(c) Identify and interpret the standard error of the slope.
4. 2002B Question 1 Animal-waste lagoons and spray fields near aquatic environments may significantly
degrade water quality and endanger health. The National Atmospheric Deposition Program has
monitored the atmospheric ammonia at swine farms since 1978. The data on the swine population
size (in thousands) and atmospheric ammonia (in parts per million) for one decade are given below.
Year                  1988  1989  1990  1991  1992  1993  1994  1995  1996  1997
Swine Population      0.38  0.50  0.60  0.75  0.95  1.20  1.40  1.65  1.80  1.85
Atmospheric Ammonia   0.13  0.21  0.29  0.22  0.19  0.26  0.36  0.37  0.33  0.38
a) Construct a scatterplot for these data.
b) The value for the correlation coefficient for these data is 0.85. Interpret this value.
c) Based on the scatterplot in part (a) and the value of the correlation coefficient in part (b), does
it appear that the amount of atmospheric ammonia is linearly related to the swine population size?
Explain.
d) What percent of the variability in atmospheric ammonia can be explained by swine population
size?
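The correlation quoted in part b) of question 4 can be reproduced from the table, and part d) only requires squaring it. A small plain-Python sketch; the function name `correlation` is illustrative:

```python
from math import sqrt

swine = [0.38, 0.50, 0.60, 0.75, 0.95, 1.20, 1.40, 1.65, 1.80, 1.85]      # thousands
ammonia = [0.13, 0.21, 0.29, 0.22, 0.19, 0.26, 0.36, 0.37, 0.33, 0.38]    # ppm


def correlation(x, y):
    """Pearson correlation coefficient computed from sums of squared deviations."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = correlation(swine, ammonia)
print(f"r = {r:.2f}, r^2 = {r * r:.2f}")
```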
5. 2011 Number 5 Windmills generate electricity by transferring energy from wind to a turbine. A
study was conducted to examine the relationship between wind velocity in miles per hour (mph) and
electricity production in amperes for one particular windmill. For the windmill, measurements were
taken on twenty-five randomly selected days, and the computer output for the regression analysis
for predicting electricity production based on wind velocity is given below. The regression model
assumptions were checked and determined to be reasonable over the interval of wind speeds
represented in the data, which were from 10 miles per hour to 40 miles per hour.
(a) Use the computer output above to determine the equation of the least squares regression line.
Identify all variables used in the equation.
(b) How much more electricity would the windmill be expected to produce on a day when the wind
velocity is 25 mph than on a day when the wind velocity is 15 mph? Show how you arrived at your
answer.
(c) What proportion of the variation in electricity production is explained by its linear relationship with
wind velocity?
(d) Is there statistically convincing evidence that electricity production by the windmill is related to
wind velocity? Explain.
[Grid for the question 4(a) scatterplot: horizontal axis Swine Population (0.0 to 1.8), vertical axis Atmospheric Ammonia (.05 to .40)]
Computer output for question 5:
Predictor       Coef    SE Coef   T       P
Constant        0.137   0.126     1.09    0.289
Wind Velocity   0.240   0.019     12.63   0.000
S = 0.237   R-Sq = 0.873   R-Sq (adj) = 0.868
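Parts (b) and (d) of question 5 use only the slope row of the windmill output. As a quick sketch (variable and function names are illustrative):

```python
# Values from the windmill regression output in question 5
intercept = 0.137        # amperes
slope = 0.240            # amperes per mph
se_slope = 0.019
n_days = 25


def predicted_amperes(wind_mph):
    """Predicted electricity production; only meant for 10 to 40 mph."""
    return intercept + slope * wind_mph


# Part (b): the intercept cancels, so the difference is slope * (25 - 15)
difference = predicted_amperes(25) - predicted_amperes(15)

# Part (d): t statistic for the slope, with n - 2 degrees of freedom
t_stat = slope / se_slope
df = n_days - 2
print(f"difference = {difference:.2f} amperes, t = {t_stat:.2f}, df = {df}")
```

The computed t agrees with the T column of the output, which is a useful check that the right row was read.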
6. 2005B Number 5 John believes that as he increases his walking speed, his pulse rate will increase.
He wants to model this relationship. John records his pulse rate, in beats per minute (bpm), while
walking at each of seven different speeds, in miles per hour (mph). A scatterplot and regression
output are shown below.
(a) Using the regression output, write the equation of the fitted regression line.
(b) Do your estimates of the slope and intercept parameters have meaningful interpretations in the
context of this question? If so, provide interpretations in this context. If not, explain why not.
(c) John wants to provide a 98 percent confidence interval for the slope parameter in his final
report. Compute the margin of error that John should use. Assume that conditions for
inference are satisfied.
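The margin of error in part (c) has the usual form for a slope interval. As a sketch: seven speeds give n - 2 = 5 degrees of freedom, and a standard t table gives a critical value of about 3.365 for 98 percent confidence. The symbol SE_{b_1} stands for the standard error of the slope, which has to be read from John's regression output (not reproduced in these notes).

```latex
\text{margin of error} = t^{*}_{n-2} \cdot SE_{b_1},
\qquad n - 2 = 7 - 2 = 5,
\qquad t^{*}_{5} \approx 3.365 \ \text{(98\% confidence)}
```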
7. 2009 Question 1 A simple random sample of 100 high school seniors was selected from a large
school district. The gender of each student was recorded, and each student was asked the following
questions.
a. Have you ever had a part-time job?
b. If you answered yes to the previous question, was your part-time job in the summer only?
The responses are summarized in the table below.
Gender
Job Experience Male Female Total
Never had a part-time job 21 31 52
Had a part-time job during summer only 15 13 28
Had a part-time job but not only during summer 12 8 20
Total 48 52 100
(a) On the grid below, construct a graphical display that represents the association between gender
and job experience for the students in the sample.
(b) Write a few sentences summarizing what the display in part (a) reveals about the association
between gender and job experience for the students in the sample.
(c) Which test of significance should be used to test if there is an association between gender and
job experience for the population of high school seniors in the district?
State the null and alternative hypothesis for the test, but do not perform the test.
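For parts (a) and (b) of question 7, a display such as a segmented bar graph is built from the conditional distribution of job experience within each gender. A short sketch of that calculation; the shortened category labels are my own:

```python
# Counts from the question 7 table
categories = ["Never", "Summer only", "Not only summer"]
counts = {
    "Male": [21, 15, 12],
    "Female": [31, 13, 8],
}

# Conditional distribution of job experience within each gender
conditional = {
    gender: [count / sum(values) for count in values]
    for gender, values in counts.items()
}

for gender, proportions in conditional.items():
    summary = ", ".join(f"{c}: {p:.3f}" for c, p in zip(categories, proportions))
    print(f"{gender} -> {summary}")
```

Comparing proportions rather than raw counts matters here because the two groups have different sizes (48 males, 52 females).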
8. 2003 Question 5 A random sample of 200 students was selected from a large college in the United
States. Each selected student was asked to give his or her opinion about the following statement.
"The most important quality of a person who aspires to be the President
of the United States is a knowledge of foreign affairs."
Each response was recorded in one of five categories. The gender of each selected student was
noted. The data are summarized in the table below:
Is there sufficient evidence to indicate that the response is dependent on gender? Provide
statistical evidence to support your conclusion.
                        Response Category
         Strongly   Somewhat   Neither Agree   Somewhat   Strongly
         Disagree   Disagree   nor Disagree    Agree      Agree
Male        10         15           15            25         25
Female      20         25           25            25         15
9. 1999 Question 2 The Colorado Rocky Mountain Rescue Service wishes to study the behavior of lost
hikers. If more were known about the direction in which lost hikers tend to walk, then more
effective search strategies could be devised. Two hundred hikers selected at random from those
applying for hiking permits are asked whether they would head uphill, downhill, or remain in the
same place if they became lost while hiking. Each hiker in the sample was also classified according
to whether he or she was an experienced or novice hiker. The resulting data are summarized in the
following table.
Do these data provide convincing evidence of an association between the level of hiking expertise and
the direction the hiker would head if lost? Give appropriate statistical evidence to support your
conclusion.
10. 2008 Question 5 A study was conducted to determine where moose are found in a region containing
a large burned area. A map of the study area was partitioned into the following four habitat types.
(1) Inside the burned area, not near the edge of the burned area,
(2) Inside the burned area, near the edge,
(3) Outside the burned area, near the edge, and
(4) Outside the burned area, not near the edge.
The figure below shows these four habitat types.
The proportion of total acreage in each of the habitat types was determined for the study area.
Using an aerial survey, moose locations were observed and classified into one of the four habitat
types. The results are given in the table below.
Habitat Type Proportion of Total Acreage Number of Moose Observed
1 0.340 25
2 0.101 22
3 0.104 30
4 0.455 40
Total 1.000 117
(a) The researchers who are conducting the study expect the number of moose observed in a habitat
type to be proportional to the amount of acreage of that type of habitat. Are the data
consistent with this expectation? Conduct an appropriate statistical test to support your
conclusion. Assume the conditions for inference are met.
(b) Relative to the proportion of total acreage, which habitat types did the moose seem to prefer?
Explain.
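Question 10 is a goodness-of-fit setting: the expected counts are the total number of moose multiplied by each acreage proportion. A minimal sketch of the computation, which also flags which habitat types have more moose than expected, as part (b) asks:

```python
proportions = [0.340, 0.101, 0.104, 0.455]   # share of total acreage, habitat types 1-4
observed = [25, 22, 30, 40]                  # moose observed, habitat types 1-4
n = sum(observed)

# Under H0, expected count = total moose * acreage proportion
expected = [n * p for p in proportions]
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)
df = len(observed) - 1

for habitat, (o, e, c) in enumerate(zip(observed, expected, contributions), start=1):
    direction = "more" if o > e else "fewer"
    print(f"type {habitat}: observed {o}, expected {e:.2f} ({direction} than expected), "
          f"contribution {c:.2f}")
print(f"chi-square = {chi_square:.2f}, df = {df}")
```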
Table for question 9:
              Direction
              Uphill   Downhill   Remain in Same Place
Novice          20        50              50
Experienced     10        30              40
11. 2004 Question 5 A rural county hospital offers several health services. The hospital
administrators conducted a poll to determine whether the residents' satisfaction with the available
services depends on their gender. A random sample of 1,000 adult county residents was selected.
The gender of each respondent was recorded and each was asked whether he or she was satisfied
with the services offered by the hospital. The resulting data are shown in the table below.
(a) Using a significance level of 0.05, conduct an appropriate test to determine if, for adult residents of
this county, there is an association between gender and whether or not they were satisfied with
services offered by the hospital.
(b) Is 800/1,000 a reasonable estimate for the proportion of all adult county residents who are satisfied
with the services offered by the hospital? Explain why or why not.
12. 2003B Question 5 Contestants on a game show spin a wheel like the one shown in the figure above.
Each of the four outcomes on this wheel is equally likely and outcomes are independent from one
spin to the next.
▪ The contestant spins the wheel.
▪ If the result is a skunk, no money is won and the contestant's turn is finished.
▪ If the result is a number, the corresponding amount in dollars is won. The contestant can then
stop with those winnings or can choose to spin again, and his or her turn continues.
▪ If the contestant spins again and the result is a skunk, all of the money earned on that turn is
lost and the turn ends.
▪ The contestant may continue adding to his or her winnings until he or she chooses to stop or until
a spin results in a skunk.
(a) What is the probability that the result will be a number on all of the first three spins of the
wheel?
(b) Suppose a contestant has earned $800 on his or her first three spins and chooses to spin the
wheel again. What is the expected value of his or her total winnings for the four spins?
(c) A contestant who lost at this game alleges that the wheel is not fair. In order to check on the
fairness of the wheel, the data in the table below were collected for 100 spins of this wheel.
Result Skunk $100 $200 $500
Frequency 33 21 20 26
Based on these data, can you conclude that the four outcomes on this wheel are not equally likely?
Give appropriate statistical evidence to support your answer.
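All three parts of question 12 reduce to short calculations once the wheel is taken to have four equally likely outcomes: skunk, $100, $200, and $500 (the dollar amounts are read from the table in part (c), since the figure is not reproduced here). A sketch using exact fractions:

```python
from fractions import Fraction

outcomes = [0, 100, 200, 500]      # 0 stands for the skunk
p = Fraction(1, 4)                 # each outcome equally likely

# Part (a): three independent spins, each a number with probability 3/4
p_three_numbers = Fraction(3, 4) ** 3

# Part (b): a skunk wipes out the $800; otherwise the amount is added to it
expected_total = sum(p * (0 if x == 0 else 800 + x) for x in outcomes)

# Part (c): goodness-of-fit statistic against 25 expected spins per outcome
observed = [33, 21, 20, 26]
expected = [sum(observed) * p for _ in observed]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(p_three_numbers, float(expected_total), float(chi_square))
```

The statistic in part (c) has 3 degrees of freedom, so it would be compared with the table critical value 7.815 at the 0.05 level.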
Table for question 11:
                Male   Female   Total
Satisfied        384     416      800
Not Satisfied     80     120      200
Total            464     536    1,000