291 practice midterms and solutions

Upload: juliet-cap

Post on 13-Dec-2015

DESCRIPTION

Practice for the course COMM 291

PRACTICE QUESTIONS FOR

THE MIDTERM EXAM

Part A.

Midterm Exam 2013

Midterm Exam 2012

Midterm Exam 2011

Midterm Exam 2010

Questions, Answers and Explanations

Part B.

Past Years’ Midterm Exams

Questions, Answers and Explanations

Part A. Midterm Exam 2013

Notes: This exam has 9 questions. The duration is 2 hours. Books, notes, and

calculators are allowed, but not computers, cellphones or on-line connectivity.

MT2013: Question 1 “A sampling of sampling questions”

a) The Human Resources Department of a large university maintains records on its

faculty members. The table displays some of these data.

Payroll   Birth                 Years of     Teaching Rating   Salary
Number    date       Faculty    Employment   (1-5 scale)       Classification
1520      02/20/56   Law        25           4.1               IV
3210      05/13/62   Science    17           3.9               III
0342      10/12/69   Business   12           4.5               II
2988      02/27/78   Arts       4            4.3               I

Place an X in the space beside each variable that is best described as Quantitative.

__ Payroll Number __ Birth date __ Faculty

__Years of Employment __ Teaching Rating __ Salary Classification

b) Which of the following is (are) based on cross-sectional data?

__A. Company quarterly profits

__B. Percentage of Canadian adults who work full-time

__C. Historical closing stock prices

__D. Yearly student enrolments

__E. Annual costs

c) Which of the following is (are) time series data?

__A. Number of employees in 2012

__B. This month’s demand for an automotive part

__C. This quarter’s sales of automobiles

__D. Weekly receipts at a clothing boutique

__E. Percentage of employees who are female

d) The administration of a large university wants to study the types of wellness programs

that would interest its employees. They plan to survey a random sample of employees.

Under consideration are several sampling plans. Beside each plan, write the number of

the matching sampling strategy from the following list:

1 = Simple Random Sampling

2 = Stratified Random Sampling

3 = Cluster Sampling

4 = Systematic Sampling

__ (i) There are five categories of employees (administration, faculty, professional staff,

clerical and maintenance). Randomly select ten individuals from each category.

__ (ii) Each employee has an ID number. Randomly select 50 numbers.

__ (iii) Randomly select a school within the university (e.g., Business School) and survey

all of the individuals (administration, faculty, professional staff, clerical and

maintenance) who work in that school.

__ (iv) The HR Department has an alphabetized list of newly hired employees (hired

within the last five years). After starting the process by randomly selecting an

employee from the list, every fifth name is chosen to be included in the sample.

e) A manufacturer of toys claims that less than 3% of his toys are defective. When 100

toys were drawn from one production run of 5,000 toys, 5% were found to be defective.

For each term on the left, select the matching answer from the list to the right, and write

the number in the blank.

___ Population 1 The 3% value

___ Sample 2 The 5% value

___ Sampling Frame 3 The 100 toys

___ Parameter 4 The 5,000 toys

___ Statistic 5 All toys produced

MT2013: Question 2 “Could this label be called a ‘phone tag?’”

A magazine that publishes product reviews conducted a survey of teenagers' preferences

for cell phones. Three brands of cell phone designed specifically with teens in mind were

the focus of the study. The table summarizes responses by brand and gender.

Cell Phone Male Female Total

Call Me Maybe 55 87 142

Phone Fun XS 99 150 249

Black Keys II 196 113 309

Total 350 350 700

a) Which of the following charts would be appropriate for displaying the marginal

distribution of cell phone brand?

__A. Histogram __B. Boxplot __C. Bar Chart

__D. Line Graph __E. Stem and Leaf Display

b) What percent of teenagers preferred Call Me Maybe?

__A. 50% __B. 41% __C. 25% __D. 16% __E. 20%

c) What percent of female teenagers preferred the Phone Fun XS?

__A. 43% __B. 60% __C. 21% __D. 50% __E. 16%

d) What percent of teenagers who preferred the Black Keys II were males?

__A. 63% __B. 32% __C. 16% __D. 50% __E. 41%

e) Which of the following statements is true?

__A. It appears that cell phone brand preference and gender are not related.

__B. It appears that cell phone brand preference and gender are not independent.

__C. It appears that cell phone brand preference and gender are independent.

__D. A scatterplot will be more informative here than a table.

__E. None of the above

MT2013: Question 3 “Spring into these summary questions”

a) You have a set of 30 numbers. The standard deviation from these numbers is reported

as zero. You can be certain that:

__A. Half of the numbers are above the mean

__B. All of the numbers in the set are zero

__C. All of the numbers in the set are equal

__D. The numbers are evenly spaced below and above the mean

b) Here is the five number summary of the hourly wages ($) for sales managers.

Min Q1 Median Q3 Max

20.94 37.64 44.77 49.24 67.11

(i) The shape of this distribution is best described as:

__A. Symmetric

__B. Skewed to the right

__C. Skewed to the left

__D. Not enough information to tell

(ii) The IQR for these data is: ______________

(iii) Compute the lower and upper inner fences: Space for calculations:

Lower inner fence: ___________

Upper inner fence: ___________

(iv) Are there any outliers, as defined by the “inner fences” criterion?

__A. Yes, only on the left side of the distribution

__B. Yes, only on the right side of the distribution

__C. Yes, on both sides of the distribution

__D. No

(v) Suppose there had been an error and that the lowest hourly wage for sales managers

was $18.50 instead of $20.94. Indicate how this change would affect the

following summary statistics (increase, decrease, or stay about the same):

a. Mean Decrease Stay the Same Increase

b. Median Decrease Stay the Same Increase

c. Range Decrease Stay the Same Increase

d. IQR Decrease Stay the Same Increase

c) In a perfectly symmetrical distribution, which of the following statements is false?

__A. The distance from Q1 to Q2 is equal to the distance from Q2 to Q3

__B. The distance from the smallest observation to Q1 is the same as the distance

from Q3 to the largest observation

__C. The distance from the smallest observation to Q2 is the same as the distance

from Q2 to the largest observation

__D. The distance from Q1 to Q3 is half of the distance from the smallest to the

largest observation

d) Here is a stem plot of scores (out of 200) in a graduate finance course.

12 | 6 8

13 | 1 3 4 5 7 8

14 | 3 4 7

15 | 2 6

16 |

17 | 3

18 | 9

(i) How many students were in the course? _____

(ii) What was the maximum score? _____

(iii) What is the median score? _____

e) An office supply chain has stores in Toronto and Vancouver. One of these stores is to

be closed within the coming year, and to help make the decision, management reviews

sales data. Below are boxplots for monthly unit sales for both locations.

Which of the following statements is not correct?

__A. Monthly sales are higher in Toronto compared to Vancouver.

__B. The IQR for sales in Toronto is larger than that for Vancouver.

__C. Monthly sales are less variable in Vancouver compared to Toronto.

__D. Both distributions are fairly symmetric.

__E. Monthly sales are more variable in Vancouver compared to Toronto.

MT2013: Question 4 “Time for relationship-building”

a) A consumer research group investigating the relationship between the price of meat

(per kg) and the fat content (grams) gathered data that produced the following scatterplot.

(i) Which best describes the association between the price of meat and fat content?

__A. Negative, moderately strong

__B. Negative, weak

__C. Positive, strong

__D. Positive, weak

__E. No apparent association

(ii) If the point in the lower left hand corner ($2.00 per kilogram, 6 grams of fat) is

removed, the correlation would most likely

__A. remain the same

__B. become stronger negative

__C. become weaker negative

__D. become positive

__E. become zero

b) For each of the following pairs of variables, would you expect a large negative

correlation, a large positive correlation, or a small correlation? Circle your choices.

1. The age of a used car and its price Large Neg. Large Pos. Small

2. The height and weight of a person Large Neg. Large Pos. Small

3. The height and the IQ of a person Large Neg. Large Pos. Small

c) For each of the following statements, about the correlation coefficient, r, decide

whether it is True or False. Circle your choices as appropriate.

1. r equals the proportion of times two variables lie True False

on a straight line

2. r will be +1.0 only if all the data lie exactly on a True False

horizontal straight line

3. r measures the fraction of outliers that appear in True False

a scatterplot

4. If the correlation between X and Y is r, the True False

correlation between Y and X is –r

5. r is a unitless number and must always lie True False

between –1.0 and +1.0 inclusive.

MT2013: Question 5 “If mistrust is the opposite of trust, would

mistress be the opposite of stress?”

A labour efficiency consultant collected some data on several employees of a

manufacturing operation: their stress levels (X, on a scale from 0 to 10) and the

productivity levels (Y, in parts made per hour). She only recorded some of the relevant

computations, as follows:

x̄ = 5.4          s_x = 3.3

ȳ = 57.5         s_y = 11.1

b1 = –3.19       s_e = 4.3

a) Write the estimated regression equation here: _____________________________ (Use two decimals only for each value)

b) Write the correlation coefficient here: ________ (Round to two decimals)

Space for work:

c) Complete this sentence: For each additional unit on the stress scale, the productivity

level _________________________________________ parts per hour.

d) What percentage of the variation in productivity levels can be explained by

the stress level variable? Give your answer here, to the nearest whole percent: _________

e) Estimate the productivity of an individual whose stress level is 8: __________ (Round to nearest whole number)

f) Suppose the employee in part e) has an actual productivity level of 60 parts per hour.

Compute the residual and use the fact that the standard deviation of the residuals is 4.3 to

decide whether this data point would be considered an outlier. Explain why in one

sentence only.

Residual = ________ Outlier? Yes No

Explanation:

g) Estimate the productivity of an individual whose stress level is unknown. __________

h) Give an interval range in which the productivity level of 95% of employees would be

expected to fall. Report to the nearest whole numbers. ____________ to ____________

MT2013: Question 6 “Can you answer the call of the ‘bell’?”

a) Which statistic(s) would you expect to have a normal distribution?

I. Height of women

II. Shoe sizes of men

III. Age (years) of first-year university students

__A. I & II only

__B. II & III only

__C. I & III only

__D. All three

__E. None of the three

b) The length of time taken by a statistics professor to solve The Globe & Mail cryptic

crossword has a normal distribution. It is known that the probability of needing more than

20 minutes is 0.5, while the probability of needing more than 30 minutes is 0.1587.

(i) Find the mean and the standard deviation of the professor’s solving time.

Mean = ____________ SD = ______________

(ii) What is the probability that the solving time is between 15 and 25 minutes?

__A. 0.38 __B. 0.17 __C. 0.68 __D. 0.06 __E. 0.12 __F. 0.50

c) A soft drink machine dispenses a cup, syrup and carbonated water, hopefully in that

order! The amount of syrup injected is normally distributed with mean 15 ml and

variance 10 ml². The amount of water injected is normally distributed with mean 80 ml

and variance 15 ml². The two amounts are independent of one another.

(i) Find the mean and standard deviation of the total amount of syrup and water

dispensed.

Mean = ____________ SD = ______________

(ii) If 25 drinks are dispensed in a day, what are the mean and standard deviation of the

total amount of liquid (syrup and water) that are required?

Mean = ____________ SD = ______________

d) Suppose the time it takes for a purchasing agent to complete an online ordering

process is normally distributed with a mean of 8 minutes and a standard deviation of 2

minutes. Suppose a random sample of 25 ordering processes is selected.

(i) The standard deviation of the sampling distribution of mean times is

__A. 0.4 minutes

__B. 2 minutes

__C. 0.08 minutes

__D. 1.6 minutes

__E. 0.12 minutes

(ii) What is the probability that the sample mean will be less than 7.5 minutes?

__A. 0.3944

__B. 0.1056

__C. 0.2114

__D. 0.4013

__E. 0.8944

e) The mean height of male UBC students is 70 inches, with SD 3 inches. The mean

height of female UBC students is 65 inches, with SD 4 inches. You measure the heights

of random samples of 100 males and 100 females. Which result is the most unlikely? To

decide, compute the z-score for each result and write the values in the spaces provided.

__A. One randomly chosen male having a height of 79 inches or more

__B. One randomly chosen female having a height of 74 inches or more

__C. All females in your sample having an average height of 68 inches or more

__D. All males in your sample having an average height of 73 inches or more

z-score for A = _______ z-score for B = _______

z-score for C = _______ z-score for D = _______

Space for work:

MT2013: Question 7 “Work with confidence!”

a) EU (European Union) countries report that 46% of their labour force is female. Is the

percentage of females in the Canadian labour force the same? Statscan plans to check a

random sample selected from more than 10,000 employment records on file to estimate

the percentage of females in the Canadian labour force.

(i) Statscan wants to estimate the percentage of females in the Canadian labour force to

within ±5% with 90% confidence. How many employment records should be sampled?

__A. 121

__B. 269

__C. 451

__D. 382

__E. 1000

(ii) Suppose that Statscan wants to be 90% confident of estimating the percentage of

females in the labour force to within ±2% of the true percentage. Which of the following

would they have to do?

__A. Decrease the sample size

__B. Select the same number of employment records

__C. Increase the sample size

__D. Decrease the precision

__E. Increase the sampling error

(iii) They actually select a random sample of 525 employment records, and find that 229

of the people are females. The 90% confidence interval is closest to:

__A. 40.1% to 47.2%

__B. 27.5% to 59.7%

__C. 17.8% to 69.4%

__D. 42.4% to 56.8%

__E. 12.4% to 71.0%

b) For each of the following statements about a 95% confidence interval (CI) for the

mean, decide whether it is True or False. Circle your answers at the right.

1. Results from 95% of all samples will lie in this interval. True False

2. CIs are more informative than point estimates because they True False

show how much the population parameters can vary.

3. The interval is wider than a 90% CI would be. True False

4. 95% of data values will fall in the range of a 95% CI True False

for the mean.

5. We are 95% confident that the confidence interval True False

includes the sample mean.

6. If we took many additional samples and computed a 95% CI True False

for each, then approximately 95% of those intervals

would contain the population mean.

MT2013: Question 8 “Hypothetically speaking”

Suppose that a report indicates that 28% of Canadians have experienced difficulty in

making mortgage payments. Further suppose that a news organization randomly sampled

400 Canadians from 10 cities and found that 136 reported such difficulty. Does this

indicate that the problem is more severe among these cities?

a) The correct null and alternative hypotheses are

__A. H0 : p = 0.28 and Ha : p > 0.28

__B. H0 : p = 0.28 and Ha : p < 0.28

__C. H0 : p = 0.28 and Ha : p ≠ 0.28

__D. H0 : p ≠ 0.28 and Ha : p = 0.28

__E. H0 : p > 0.28 and Ha : p = 0.28

b) The correct value of the test statistic is: Space for work:

__A. –1.28

__B. –2.67

__C. 2.67

__D. 1.96

__E. –1.28

c) The P-value corresponding to this test statistic is:

__A. 0.025

__B. 0.2119

__C. 0.0177

__D. 0.0522

__E. 0.0038

d) At α = .05, we can conclude that the percentage of Canadians in these cities

experiencing difficulty making mortgage payments ...

__A. is significantly higher than 28%

__B. is significantly lower than 28%

__C. is not significantly different from 28%

__D. is equal to 28%

__E. is none of the above; no conclusion can be drawn with the given information.

e) Using the P-value in part c), which one of the following statements is true?

__A. A 90% confidence interval for p would contain 28%

__B. A 95% confidence interval for p would contain 28%

__C. A 95% confidence interval for p would not contain 28%

__D. None of the above

Part f) is unrelated to parts a) through e):

f) An opinion poll in a city of 200,000 was based on a simple random sample of 2000

people. Another poll is to be taken in the same way in a second city of population

400,000. In order for this poll to have the same margin of error as the poll in the first city,

the sample size in the second city should be:

__A. 1000

__B. 2000

__C. 4000

__D. 8000

MT2013: Question 9 “No Surprise: A Statistics Test with a test statistic!”

Insurance companies track life expectancy information to assist in determining the cost of

life insurance policies. Last year the average life expectancy of all policyholders was 77

years. ABI Insurance wants to determine if their clients now have a longer life

expectancy, on average, so they randomly sample some of their recently paid policies.

The insurance company will only change their premium structure if there is evidence that

people who buy their policies are living longer than before. The sample has a mean of

78.6 years and a standard deviation of 4.48 years.

86 75 83 84 81 77 78 79 79 81

76 85 70 76 79 81 73 74 72 83

a) The appropriate null and alternative hypotheses are:

H0: _________________ Ha: _________________

b) Give the formula for the appropriate test statistic and compute its value.

Formula: ________________ Computed value: ______________

Space for work:

c) The corresponding P-value is:

__A. Greater than 0.20

__B. Between 0.10 and 0.20

__C. Between 0.05 and 0.10

__D. Between 0.025 and 0.05

__E. Between 0.01 and 0.025

__F. Less than 0.01

d) State your conclusion using α = .05. Write one statistically and grammatically correct

sentence that tells ABI Insurance whether there is evidence to increase their premiums.

e) Suppose ABI randomly samples 100 recently paid policies. This sample yields a mean

of 77.7 years and a standard deviation of 3.6 years. Compute a 95% confidence interval.

Report it in the format [xx.x , xx.x] with one decimal place. [_________ , _________]

MT2013 – END OF QUESTIONS; ANSWERS AND EXPLANATIONS FOLLOW

MT2013: ANSWERS AND EXPLANATIONS

MT2013: Answer 1

a) Years of Employment, Teaching Rating b) B. c) D. d) 2,1,3,4.

e) 5,3,4,1,2. Population = All toys produced; Sample = 100 toys; Sampling Frame =

5,000 toys; Parameter = 3%; Statistic = 5%

Details and Comments:

a) Years of Employment has units (yrs); Teaching Rating does not have units but the

rating is an average of ordinal data over a number of courses, and can range from 1 to 5

with fractional values possible.

b) “Percentage of Canadian adults who work full-time” is measured at one time point,

hence cross-sectional. The other variables are measured repeatedly over time, hence

longitudinal or time-series.

c) Only “Weekly receipts at a clothing boutique” is measured at more than one time

point. The other variables are measured once each.

d) (i) The five categories are strata; random samples are taken within each one.

(ii) Each employee has the same chance of being selected for the sample.

(iii) One school is a reasonable representative of the entire university, hence a cluster.

(iv) Choosing “every fifth name” makes it systematic.

e) The sampling frame is the production run, namely, that part of the population from

which the sample can be drawn.

MT2013: Answer 2

a) C. b) E. c) A. d) A. e) B.

Details and Comments:

a) Categorical data are displayed with a bar chart. Histograms, stem-and-leaf displays,

boxplots (and usually line graphs) are for quantitative data.

b) 20% (142/700)

c) 43% (150/350)

d) 63% (196/309)

e) The column percentages for males are different from those for females, which suggests

that cell phone brand preference and gender are related (i.e. not independent.)
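
The percentages in parts b) to d) can be rechecked from the table. The following is a minimal Python sketch; the dictionary layout and the helper name pct are illustrative assumptions, not part of the exam.

```python
# Counts from the Question 2 table: brand -> (male, female).
counts = {
    "Call Me Maybe": (55, 87),
    "Phone Fun XS": (99, 150),
    "Black Keys II": (196, 113),
}

total_male = sum(male for male, _ in counts.values())
total_female = sum(female for _, female in counts.values())
grand_total = total_male + total_female


def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# b) Marginal percentage: all teenagers who preferred Call Me Maybe.
pct_b = pct(sum(counts["Call Me Maybe"]), grand_total)

# c) Conditional on gender: females who preferred Phone Fun XS.
pct_c = pct(counts["Phone Fun XS"][1], total_female)

# d) Conditional on brand: Black Keys II fans who were male.
pct_d = pct(counts["Black Keys II"][0], sum(counts["Black Keys II"]))

print(pct_b, pct_c, pct_d)
```

The only difference between the three calculations is the denominator: the grand total for a marginal percentage, a column total when conditioning on gender, and a row total when conditioning on brand.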

MT2013: Answer 3

a) C.

b) (i) C. (ii) 11.6 (iii) Lower inner fence = 20.24 Upper inner fence = 66.64

(iv) B. (v) Decrease, Stay the same, Increase, Stay the same

c) D. d) 15, 189, 138 e) E.

Details and Comments:

a) Look at the formula for standard deviation. If all numbers are equal, then they are also

all equal to the mean, so all the deviations are zero. This is the only way the standard

deviation can be zero.

b) (i) The median is closer to Q3 than to Q1 so the distribution is skewed to the left.

(ii) IQR = Q3 – Q1 = 49.24 – 37.64

(iii) Lower inner fence = 37.64 – 1.5×11.6; Upper inner fence = 49.24 + 1.5×11.6

(iv) Yes, only on the right side of the distribution since the maximum exceeds 66.64.

(v) Decreasing the lowest data value decreases the sum, and hence the mean. But it

doesn’t really affect which is the middle value or the quartiles. The range increases.

c) Quartiles divide the area of the distribution into four equal sections.

d) (i) Count up the number of data values. Don’t forget to attach the leaf to the stem for

the maximum and median.

e) Monthly sales are more variable in Vancouver compared to Toronto since the box is

taller.
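
The fence calculations in part b) and the stem plot readings in part d) can be verified with a short Python sketch. The variable names are illustrative assumptions; the numbers come from the question.

```python
from statistics import median

# b) Five-number summary of hourly wages ($).
minimum, q1, q3, maximum = 20.94, 37.64, 49.24, 67.11

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A side has outliers if its extreme value lies beyond the fence.
outlier_on_left = minimum < lower_fence
outlier_on_right = maximum > upper_fence

# d) Stem plot: stem -> leaves. A score is 10 * stem + leaf.
stems = {
    12: [6, 8],
    13: [1, 3, 4, 5, 7, 8],
    14: [3, 4, 7],
    15: [2, 6],
    16: [],
    17: [3],
    18: [9],
}
scores = [10 * stem + leaf for stem, leaves in stems.items() for leaf in leaves]

print(round(iqr, 2), round(lower_fence, 2), round(upper_fence, 2))
print(outlier_on_left, outlier_on_right)
print(len(scores), max(scores), median(scores))
```

Only the minimum and maximum need to be compared with the fences here, because the five-number summary does not reveal the other individual values.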

MT2013: Answer 4

a) (i) A. Negative, moderately strong (ii) B. Become stronger negative

b) Large Neg.; Large Pos.; Small c) False, False, False, False, True

Details and Comments:

a) (i) Top left to bottom right is negative association.

(ii) Removing the lower left point reduces the scatter.

b) 1. The older the car, the lower the price. 2. The taller the person, the heavier the

person. 3. Height has no connection with IQ.

c) 1. “Creative” but completely wrong.

2. The points must lie exactly on a straight line with a positive slope.

3. “Creative” but also completely wrong.

4. Corr of X and Y = Corr of Y and X. The roles are interchangeable.

5. Two of the properties of r.

MT2013: Answer 5

a) ŷ = 74.73 – 3.19x b) –0.95 c) “decreases by 3.19” d) 90% e) 49

f) Residual = 11; Yes, it is an outlier since the residual is more than 2.5 s_e’s away from 0.

g) 57.5 h) 35 to 80

Details and Comments:

a) b0 = ȳ – b1·x̄ = 57.5 – (–3.19)(5.4) = 74.73

b) Rearrange the slope formula b1 = r(s_y/s_x) to get r = b1(s_x/s_y) = (–3.19)(3.3/11.1) = –0.95

c) Interpretation of slope.

d) r² = (–0.95)² = 0.90 or 90%

e) ŷ(8) = 74.73 – 3.19(8) = 49.21 (Round to 49)

f) Residual = y – ŷ = 60 – 49 = 11; remember the 68-95-99.7 Rule for identifying

outliers/unusual observations.

g) Since x is unknown, just use the mean of y.

h) Use the 68-95-99.7 Rule, i.e. 57.5 ± 2(11.1) = 35.3, 79.7
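
The chain of calculations in this answer can be reproduced from the six summary values alone. This is a hedged Python sketch; the variable names are assumptions chosen to mirror the usual textbook symbols (x̄, s_x, ȳ, s_y, b1, s_e).

```python
# Summary values recorded by the consultant.
x_bar, s_x = 5.4, 3.3      # stress level: mean and SD
y_bar, s_y = 57.5, 11.1    # productivity: mean and SD
b1, s_e = -3.19, 4.3       # slope and SD of the residuals

# a) The least-squares line passes through (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

# b) Slope formula b1 = r * s_y / s_x, solved for r.
r = b1 * s_x / s_y

# d) Share of variation in y explained by x.
r_squared = r ** 2

# e) Predicted productivity at stress level 8.
y_hat_8 = b0 + b1 * 8

# f) Residual for an actual value of 60, measured in residual SDs.
residual = 60 - round(y_hat_8)
residual_in_sds = residual / s_e

# h) Range expected to hold about 95% of employees (mean plus or minus 2 SDs).
low, high = y_bar - 2 * s_y, y_bar + 2 * s_y

print(round(b0, 2), round(r, 2), round(100 * r_squared))
print(round(y_hat_8), residual, round(residual_in_sds, 2))
print(round(low), round(high))
```

The sketch carries unrounded intermediate values, whereas the answer key rounds at each step; both routes give the same reported answers here.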

MT2013: Answer 6

a) A. I and II only

b) (i) Mean = 20; SD = 10 (ii) A.

c) (i) Mean = 95 ; SD = 5 (ii) Mean = 2375; SD = 25

d) (i) A. 0.4 minutes (ii) B. 0.1056

e) D. z-scores: 3, 2.25, 7.5, 10;

D has the highest z-score and therefore is the most unlikely.

Details and Comments:

a) First-year students’ ages will vary only slightly since most are within a year or two in

age. There might be some older students, i.e. those returning to school etc., but it is

highly unlikely to have students who are much younger than 18 or 19!

b) (i) Computations: Pr(Z > z) = 0.5 => z = 0, so X = μ + zσ => 20 = μ + 0 => μ = 20

Pr(Z > z) = 0.1587 => z = 1, so X = μ + zσ => 30 = 20 + 1σ => σ = 10

(ii) Computations: Pr(15 < X < 25) = Pr([15-20]/10 < Z < [25-20]/10)

= Pr(-0.5 < Z < 0.5) = 1 – 2(0.3085) = 0.383

c) (i) Computations: E(X+Y) = E(X) + E(Y) =15 + 80 = 95;

Var(X+Y) = Var(X) + Var(Y) (since indep.) = 10 + 15 = 25, so SD =√25 = 5

(ii) Computations: E(T) = 25(95) = 2375; Var(T) = 25(25) = 625; SD = √625 = 25

d) (i) σ/√n = 2/√25 = 0.4

(ii) Pr(x̄ < 7.5) = Pr(Z < [7.5-8]/0.4) = Pr(Z < -1.25) = 0.1056

e) Computations:

z-score for A = [79-70]/3 = 3 z-score for B = [74-65]/4 = 2.25

z-score for C = [68-65]/[4/√100] = 7.5 z-score for D = [73-70]/[3/√100] = 10

MT2013: Answer 7

a) (i) B. 269 (ii) C. Increase the sample size (iii) A. [ 40.1% , 47.2% ]

b) 1. False; 2. False; 3. True; 4. False; 5. False; 6. True

Details and Comments:

a) (i) n = (1.645²)(0.46)(0.54)/(0.05²) = 269

(ii) Look at the formula for the CI. The sample size is in the denominator of the margin of

error, so increasing the sample size decreases the margin of error.

(iii) p̂ = 229/525 = 0.4362;

90% CI: 0.4362 ± 1.645√(0.4362×0.5638/525) = 0.4362 ± 0.0356 or [0.4006, 0.4718]

b) 1. The interval changes from sample to sample

2. Population parameters don’t vary; sample statistics vary

3. Higher confidence requires wider intervals

4. CIs are not about individual data values; they are about estimates

5. All CIs for mean include the sample mean; only 95% include the population mean

6. Definition of a CI
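
The sample-size and confidence-interval formulas in part a) can be sketched in Python as follows. The helper name required_n is a hypothetical label introduced here, and the critical value comes from the standard library rather than a table, so it is 1.6449 instead of the rounded 1.645.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Two-sided 90% confidence leaves 5% in each tail.
z_90 = NormalDist().inv_cdf(0.95)


def required_n(p_guess, margin, z):
    """Smallest n whose margin of error for a proportion is at most `margin`."""
    return ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)


# a)(i) Planning value p = 0.46 (the EU figure), margin of 5 points.
n_5_points = required_n(0.46, 0.05, z_90)

# a)(ii) Tightening the margin to 2 points requires a larger sample.
n_2_points = required_n(0.46, 0.02, z_90)

# a)(iii) 90% CI from 229 females among 525 records.
p_hat = 229 / 525
margin_of_error = z_90 * sqrt(p_hat * (1 - p_hat) / 525)
ci_low, ci_high = p_hat - margin_of_error, p_hat + margin_of_error

print(n_5_points, n_2_points)
print(round(ci_low, 3), round(ci_high, 3))
```

Because the margin appears squared in the denominator, cutting it from 5 points to 2 points multiplies the required sample size by about 6.25.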

MT2013: Answer 8

a) A. H0: p = 0.28 and Ha: p > 0.28

b) C. 2.67 c) E. 0.0038 d) A. e) C. f) B. 2000

Details and Comments:

a) One-sided alternative since the question asks whether the problem is “more severe.”

b) p̂ = 136/400 = 0.34;

z = (p̂ – p0)/√(p0(1 – p0)/n) = (0.34 – 0.28)/√(0.28×0.72/400) = 2.67.

c) The P-value is the area to the right of 2.67 on a standard normal curve.

d) Since the P-value is less than 0.05, the null hypothesis is rejected; the true population

proportion is significantly higher than 28%.

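e) Rejecting the null hypothesis against a two-tailed alternative at level α corresponds to the

(two-sided) 100(1 – α)% confidence interval excluding the null value. Here the two-sided P-value

is 2(0.0038) = 0.0076 < 0.05, so a 95% confidence interval for p would not contain 28%.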

f) Sampling variability only depends on sample size, as long as the population is large.
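
The one-proportion z-test in parts b) to e) can be reproduced with the following Python sketch. Variable names are illustrative assumptions. Note that the test statistic uses the null value 0.28 in the standard error, while the confidence interval uses the sample proportion.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

n, successes, p0 = 400, 136, 0.28
p_hat = successes / n

# b) Test statistic: standard error computed under the null hypothesis.
se_null = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se_null

# c) One-sided P-value: area to the right of z.
p_value = 1 - std_normal.cdf(z)

# d) Decision at the 5% significance level.
reject_h0 = p_value < 0.05

# e) 95% CI, with the standard error based on the sample proportion.
z_95 = std_normal.inv_cdf(0.975)
margin = z_95 * sqrt(p_hat * (1 - p_hat) / n)
ci_low, ci_high = p_hat - margin, p_hat + margin
ci_contains_p0 = ci_low <= p0 <= ci_high

print(round(z, 2), round(p_value, 4), reject_h0)
print(round(ci_low, 3), round(ci_high, 3), ci_contains_p0)
```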

MT2013: Answer 9

a) H0 : µ = 77 and Ha : µ > 77

b) Formula and computed value:

t = (x̄ – μ0)/(s/√n) = (78.6 – 77)/(4.48/√20) = 1.597

c) C. Between 0.05 and 0.10

d) There is not sufficient evidence that the mean length of life of people who buy their

policies is higher, so do not increase premiums.

e) [77.0 , 78.4]

Details and Comments:

a) One-sided alternative since the question asks whether policy-buyers are “living longer”

than before.

c) Use the t-table with 19 degrees of freedom

d) Since the P-value is greater than 0.05, do not reject the null hypothesis.

e) 77.7 ± 1.984×3.6/√100 = 77.7 ± 0.7
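
The t statistic and the interval in part e) can be recomputed from the raw data with Python's standard library. This is a sketch under stated assumptions: the standard library has no t distribution, so the critical value 1.984 is copied from the answer key, and the two df = 19 cut-offs (1.328 and 1.729) are standard t-table values added here for the P-value bracket.

```python
from math import sqrt
from statistics import mean, stdev

# The 20 sampled ages at death from Question 9.
ages = [86, 75, 83, 84, 81, 77, 78, 79, 79, 81,
        76, 85, 70, 76, 79, 81, 73, 74, 72, 83]

n = len(ages)
x_bar = mean(ages)
s = stdev(ages)          # sample SD (divides by n - 1)
mu_0 = 77

# b) One-sample t statistic with n - 1 = 19 degrees of freedom.
t_stat = (x_bar - mu_0) / (s / sqrt(n))

# c) Bracket the one-sided P-value with t-table cut-offs for df = 19
# (assumed table values: upper-tail area 0.10 -> 1.328, 0.05 -> 1.729).
p_between_005_and_010 = 1.328 < t_stat < 1.729

# e) 95% CI for the second sample (n = 100, mean 77.7, SD 3.6),
# using the critical value 1.984 quoted in the answer key.
margin = 1.984 * 3.6 / sqrt(100)
ci_low, ci_high = 77.7 - margin, 77.7 + margin

print(n, round(x_bar, 1), round(s, 2), round(t_stat, 2))
print(p_between_005_and_010, round(ci_low, 1), round(ci_high, 1))
```

The unrounded standard deviation gives t of about 1.598, slightly different from the 1.597 obtained with the rounded value 4.48; the conclusion is unchanged.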

END OF ANSWERS AND EXPLANATIONS TO MIDTERM 2013

Midterm Exam 2012

Notes: This exam has 9 questions. The duration is 2 hours. Books, notes, and

calculators are allowed, but not computers, cellphones or on-line connectivity.

MT2012: Question 1 “A sole practitioner”

ASW, a regional shoe chain, has recently launched an online store. Sales via the Internet

have been sluggish compared to their brick and mortar stores, and management suspects

that its regular customers have concerns regarding the security of online transactions. To

determine if this is the case, they plan to survey a sample of their regular customers.

a) Suppose that ASW’s regular customers belong to a rewards program and have a

customer rewards ID number. ASW decides to randomly select 100 numbers. This

sampling plan is called:

__A. Simple Random Sampling

__B. Stratified Sampling

__C. Cluster Sampling

__D. Systematic Sampling

__E. Convenience Sampling

b) Suppose that ASW has an alphabetized list of regular customers who belong to their

rewards program. After randomly selecting a customer on the list, every 25th customer

from that point on is chosen to be in the sample. This sampling plan is called:

__A. Simple Random Sampling

__B. Stratified Sampling

__C. Cluster Sampling

__D. Systematic Sampling

__E. Convenience Sampling

c) “All regular ASW customers” is known as the ________ of the study.

__A. Parameter

__B. Statistic

__C. Target Population

__D. Sampling Frame

__E. Sample

d) Which of the following is the parameter of interest in the ASW study?

__A. All regular ASW customers

__B. % of regular ASW customers who have concerns about online security

__C. ASW customers who belong to the rewards program

__D. % of ASW customers who belong to the rewards program but don’t shop online

__E. None of the above

e) One member of the management team at ASW suggests that their survey could be

done online. Customers logging on to the online store would be asked to complete the

survey and offered a coupon as incentive to participate. Which statement is true?

__A. This is a voluntary response sample

__B. This would result in an unbiased random sample

__C. This would result in a biased sample

__D. Both A and B

__E. Both A and C

MT2012: Question 2 “Planning ahea

d”

A brokerage firm gathered information on how their clients were investing for retirement.

Here is a small sample of the data they collected.

Respondent                    Household   Self-directed   Book value of
Number       Age   Gender     Income      RRSP?           portfolio
1001         45    Male       $155,000    Yes             $750,000
1002         53    Female     $160,000    No              $500,000
1003         58    Female     $210,000    No              $1,000,000

a) Place an X in the space beside each variable that is best described as Quantitative.

__ Respondent Number

__ Age

__ Gender

__ Household Income

__ Self-directed RRSP

__ Book value of portfolio

Based on age, clients were categorized according to where the largest percentage of their

retirement portfolio was invested and shown in the table below.

Age 50 or Younger Over Age 50 Total

Mutual Funds 30 34 64

Stocks 37 45 82

Bonds 19 23 42

Total 86 102 188

b) The percentage of clients who are over age 50 and invest in mutual funds is:

__A. 53.1% __B. 33.3% __C. 18.1% __D. 34.0% __E. 54.3%

c) Of the clients over age 50, the percentage who invest in mutual funds is:

__A. 53.1% __B. 33.3% __C. 18.1% __D. 34.0% __E. 54.3%

d) Of the clients who invest in mutual funds, the percentage over age 50 is:

__A. 53.1% __B. 33.3% __C. 18.1% __D. 34.0% __E. 54.3%

e) The percentage of clients over age 50 is:

__A. 53.1% __B. 33.3% __C. 18.1% __D. 34.0% __E. 54.3%

f) Consider the following side-by-side bar chart for the data below:

Does the chart indicate that mode of

investment is independent of age?

Yes No

Explain in one short sentence only.

MT2012: Question 3 “Mmm – Marketing Manager Money”

Here is a histogram and the five-number summary of salaries (in $) for a sample of 48

marketing managers.

Min Q1 Median Q3 Max

46360 69693 77020 91750 129420

a) The shape of this distribution is:

__A. Symmetric

__B. Bimodal

__C. Skewed to the right

__D. Skewed to the left

__E. Normal

b) Which of the following is true?

__A. Mode < Median < Mean

__B. Median < Mode < Mean

__C. Mean < Median < Mode

__D. Mean < Mode < Median

__E. All three are equal

c) Which of the following is closest to the standard deviation?

__A. $ 3,676

__B. $ 13,843

__C. $ 20,765

__D. $ 83,060

__E. Can’t tell without the data

d) The IQR for these data is:

__A. $83,060

__B. $22,057

__C. $69,693

__D. $77,020

__E. $14,566

e) Compute the lower and upper inner fences:

Space for calculations:

Lower inner fence: ___________

Upper inner fence: ___________

f) Are there any outliers, as defined by the “inner fences” criterion?

__A. Yes, only on the left side of the distribution

__B. Yes, only on the right side of the distribution

__C. Yes, on both sides of the distribution

__D. No

g) Suppose the marketing manager who was earning $129,420 got a raise and is now

earning $140,000. Which of the following statements is true?

__A. The mean would increase

__B. The median would increase

__C. The range would stay the same

__D. The IQR would increase

__E. The IQR would decrease

The next two parts are not related to parts (a) through (g) above.

The boxplots below show monthly sales revenue figures ($ thousands) for a discount

office supply company with locations in three different regions of Canada (Atlantic,

Central and West).

h) Which of the following statements is true?

__A. Central has the lowest sales revenues

__B. Central has the lowest median sales revenue

__C. West has the lowest mean sales revenue

__D. West has the lowest median sales revenue

__E. Atlantic has the lowest mean sales.

i) Which of the following statements is false?

__A. West has the most variable sales revenues.

__B. West has the largest IQR.

__C. Central has the smallest IQR.

__D. Atlantic has the most variable sales revenues.

__E. Central has the least variable sales revenues.


MT2012: Question 4 “OMG: A great place to work”

To determine whether the cash bonus paid by a company is related to annual pay, data

were gathered for 10 account executives at Outstanding Management Group (OMG) who

received cash bonuses in 2007. The data and summary statistics are shown below.

ANNUAL PAY CASH BONUS

$ 70,609 $ 11,225

$ 58,487 $ 6,238

$ 104,561 $ 14,194

$ 43,922 $ 4,188

$ 82,613 $ 11,863

$ 116,250 $ 13,671

$ 76,751 $ 7,758

$ 68,513 $ 20,760

$ 137,000 $ 55,000

$ 94,469 $ 34,368

Mean $ 85,318 $ 17,927

Standard Deviation $ 28,077 $ 15,618

Correlation 0.735

a) What percentage of variability in cash bonuses can be explained by pay?

b) What would the correlation be if the dollars were converted to euros at the current

conversion rate (1 Canadian dollar = 0.76 euros)?

c) Estimate the linear regression model that relates the response variable (cash bonus) to

the predictor variable (annual pay).

Slope of the regression line: ________________ (Report to three decimal places)

Intercept of the regression line: ________________ (Report to nearest whole number)

Equation of the linear model: ___________________________

Space for work:

d) From the equation, in part c), estimate the cash bonus for an executive at OMG earning

$82,613 a year, and compute the residual for this estimate.

Estimated cash bonus: ___________ Residual: ____________


e) Would you be confident in using your regression equation to estimate the cash bonus

for an executive at OMG earning $200,000 a year?

Yes No Reason:

f) Below is a plot showing residuals versus fitted values for the estimated regression

equation relating cash bonus to pay for the account executives at OMG.

Circle the conditions for linear regression which are violated, if any.

None are violated

Linearity

Normality

Constant Variance (Equal spread)

Independence

Parts (g) through (i) are unrelated to parts (a) through (f):

g) In commenting on the increase in home foreclosures (i.e. banks repossessing homes), a

news reporter stated “there appears to be a strong correlation between home foreclosures

and job loss of the head of household.” Comment on this statement; use one sentence

only.

h) A research study investigated the relationship between number of hours individuals

spend on the Internet and age. Which is the predictor variable? Circle your choice.

Hours on Internet Age

i) The correlation associated with the following scatterplot is:

__A. 1.00

__B. -1.00

__C. 0.50

__D. -0.50

__E. 0.00


MT2012: Question 5 “Greater attitude, greater latitude”

The Survey of Study Habits and Attitudes (SSHA) is a psychological test that measures

academic motivation and study habits. Females score higher, on average, than males. The

distribution of SSHA scores among the female students at a university has mean 120 and

standard deviation 28; the distribution among male students has mean 105 and standard

deviation 35. Scores are normally distributed. Assume also that scores are independent.

a) What percentage of female students have SSHA scores greater than 162? Report your

percentage to one decimal place only.

b) What SSHA score is exceeded by only 10% of female students? Round your answer

to the nearest whole number.

c) Compute the lower and upper quartiles for the distribution of scores of female students.

Round your answers to the nearest whole numbers.

d) Suppose you select a single female student and a single male student at random and

give them the SSHA test. What are the mean and the standard deviation of the difference

(female minus male) between their scores? Report to one decimal place.

Mean = __________ Standard Deviation = ____________

e) Using your answers from part d), compute the probability that the chosen female has a

higher score than the chosen male.


f) Suppose Angelina (a female) scores 78 on the SSHA, while Brad (a male) scores 70 on

the SSHA. Use an appropriate calculation to determine who did worse compared to the

average for their gender. Circle the name of the person who did worse.

Angelina Brad

Explanation:

MT2012: Question 6 “A convenient truth”

Part I. A convenience store owner suspects that only 10% of the customers buy

magazines and thinks that he might be able to sell something more profitable. In order to

decide whether he should stop selling them, he tracks the number of customers who buy

magazines on a given day.

a) On that day he had 300 customers. Assuming it was a typical day and that his estimate

is correct, what are the mean and standard deviation of the number of customers who buy

magazines each day? Report your answers to one decimal place.

Mean = ___________ Standard Deviation = ______________

b) What is the probability that 25 to 35 customers (inclusive) bought magazines that day?

c) How many magazine sales would you consider to be very strong evidence that his 10%

estimate was too low? That is, what number of sales would be extremely high?

Hints: Use The Empirical (68-95-99.7) Rule. Remember to give a whole number answer.

Part II. Past records indicate that the magazines he sells on any day have an average

revenue of $150 with a standard deviation of $30. Suppose he takes a random sample of

36 past days’ sales receipts and records the dollar value of magazine sales.

a) Describe the sampling distribution for the sample mean by naming the model and

telling its mean and standard deviation.

b) Suppose the resulting sample mean is $130. Do you think that this sample result is

unusually small? Explain.


MT2012: Question 7 “Talk about confidence!”

One division of a telecommunications equipment company reports that 12% of non-

electrical components are reworked. Management wants to determine if this percentage is

the same as the percentage rework for electrical components manufactured by the

company. The Quality Control Department plans to check a random sample of the over

10,000 electrical components manufactured across all divisions.

a) The Quality Control Department wants to estimate the true percentage of rework for

electrical components to within ±4%, with 99% confidence. How many components

should they sample?

__A. 651

__B. 1000

__C. 344

__D. 438

__E. 579

b) They actually select a random sample of 450 electrical components and find that 46 of

those had to be reworked. The 99% confidence interval is closest to:

__A. [ 0.0654 , 0.1390 ]

__B. [ 0.0432 , 0.1608 ]

__C. [ 0.0763 , 0.1277 ]

__D. [ 0.0541 , 0.1499 ]

__E. Cannot be determined with the given information.

c) The 95% confidence interval based on these data is 0.0742 to 0.1302. Which one of

the following is the correct interpretation?

__A. The percentage of electrical components that are reworked is

between 7.4% and 13.0%.

__B. We are 95% confident that between 7.4% and 13.0% of electrical

components are reworked.

__C. The margin of error for the true percentage of electrical components

that are reworked is between 7.4% and 13.0%.

__D. All samples of size 450 will yield a percentage of reworked electrical

components that falls within 7.4% and 13.0%.

__E. There is a 95% chance that 7.4% to 13.0% of the electrical components

have to be reworked.

d) Based on the 95% confidence interval, should the Quality Control Department

conclude that the percentage of rework for the electrical components is lower than the

rate of 12% for non-electrical components?

__A. Yes, because the lower limit of the confidence interval is 7.4%.

__B. Yes, because 12% is contained within the 95% confidence interval.

__C. No, because 12% is contained within the 95% confidence interval.

__D. No, because the upper limit of the confidence interval is 13.0%.

__E. We cannot say since the sample size is not large enough.

e) All else being equal, increasing the level of confidence desired will...:

__A. ...tighten the confidence interval

__B. ...decrease the margin of error

__C. ...increase precision

__D. ...increase the margin of error

__E. ...increase the margin of error and tighten the confidence interval


MT2012: Question 8 “A dip in chips”

A company manufacturing computer chips finds that 8% of all chips manufactured are

defective. Management is concerned that high employee turnover is partially responsible

for the high defect rate. In an effort to decrease the percentage of defective chips,

management decides to provide additional training to those employees hired within the

last year. After training was implemented, a sample of 450 chips revealed only 27 with

defects. Was the additional training effective in lowering the defect rate?

a) The appropriate null and alternative hypotheses are:

H0: ______________ Ha: ______________

b) Give the formula for the appropriate test statistic and compute its value.

Test Statistic Formula: __________________

Computed value: ______________

Show your work:

c) Assume that the value of the test statistic is –1.4. Don’t use your computed value from

part b). The P-value associated with the given test statistic is closest to:

__A. 0.0404

__B. 0.05

__C. 0.0808

__D. 0.1616

__E. 0.9192

d) From the P-value in part c), and using a 1% significance level (i.e. α = .01), which of

the following is true?

__A. Conclude that additional training significantly lowered the defect rate.

__B. Conclude that additional training did not significantly lower the defect rate.

__C. Conclude that additional training significantly increased the defect rate.

__D. Conclude that additional training did not affect the defect rate.

__E. No conclusion can be made with the given information.


MT2012: Question 9 “The non-profit motive”

A large software development firm recently relocated its facilities. Top management has

encouraged their professional employees to engage in local service activities. They

believe that the firm's professionals volunteer an average of more than 15 hours per

month. If this is not the case, they will institute an incentive program to increase it. A

random sample of 24 professionals reported the following number of hours:

12 13 14 14 15 15 15 16 16 16 16 16

17 17 17 18 18 18 19 19 19 20 20 22

The sample has a mean of 16.75 hours and a standard deviation of 2.40 hours.

a) The correct null and alternative hypotheses are:

__A. H0 : x̄ = 15 and Ha : x̄ > 15

__B. H0 : µ = 15 and Ha : µ > 15

__C. H0 : µ = 15 and Ha : µ < 15

__D. H0 : µ ≠ 15 and Ha : µ = 15

__E. H0 : µ = 15 and Ha : µ ≠ 15

b) The correct value of the test statistic is closest to:

__A. 3.572

__B. -3.572

__C. 1.327

__D. -1.327

__E. 0.729

c) Which of the following conclusions is correct?

__A. We reject the alternative hypothesis at the 5% significance level.

__B. We fail to reject the null hypothesis at the 5% significance level.

__C. An incentive program is needed since the evidence indicates professional

employees volunteer an average of no more than 15 hours per month.

__D. We reject the null hypothesis; the firm shouldn't need to institute an

incentive program since the evidence indicates that professional

employees volunteer an average of more than 15 hours per month.

__E. No conclusion can be reached about the hypothesis with the information

that is given.

d) It is appropriate to test the mean because:

__ A. The data are a simple random sample from the population of interest

__ B. The distribution of the sample data appears to be approximately normal

__ C. Volunteer hours is likely to be independent across employees

__ D. All of the above

e) A 95% confidence interval for the true mean number of hours of volunteer time is

closest to:

__A. 16.75 ± 1.016

__B. 16.75 ± 0.840

__C. 16.75 ± 4.966

__D. 16.75 ± 4.114

__E. 2.40 ± 7.074

MT2012 – END OF QUESTIONS; ANSWERS AND EXPLANATIONS FOLLOW


MT2012: ANSWERS AND EXPLANATIONS

MT2012: Answer 1

a) A. b) D. c) C. d) B. e) E.

Details and Comments:

a) Each regular customer has the same chance of being selected for the sample.

b) Choosing “every 25th customer” makes it systematic.

c) The target population is the “universe” for which you want to be able to generalize.

d) A parameter is a numerical characteristic such as a mean or a proportion/percentage.

e) Since people can decide whether to answer or not, it is a voluntary response, and hence

subject to bias. People who decide to participate may not be like people who decide not to

participate.

MT2012: Answer 2

a) Age, Household Income, Book value of portfolio

b) C. 18.1% c) B. 33.3% d) A. 53.1% e) E. 54.3%

f) Yes: The age distribution (ratio of younger to older) is about the same for each mode

(i.e. type) of investment.

Details and Comments:

a) Age (yrs), Household Income ($), and Book Value ($) all have units and are measured

on a continuum, so they are quantitative.

b) 34/188 = 0.181

c) 34/102 = 0.333

d) 34/64 = 0.531

e) 102/188 = 0.543

f) Look for differences across the clusters of bars.
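The four percentages in parts b) to e) differ only in which total goes in the denominator. The following is a minimal Python sketch of that idea, using the counts from the Question 2 table; the variable names are illustrative and not part of the exam.

```python
# Counts from the Question 2 table.
# Each tuple is (age 50 or younger, over age 50).
counts = {
    "Mutual Funds": (30, 34),
    "Stocks": (37, 45),
    "Bonds": (19, 23),
}

grand_total = sum(young + old for young, old in counts.values())  # 188
over_50_total = sum(old for _, old in counts.values())            # 102
mutual_total = sum(counts["Mutual Funds"])                        # 64
mutual_over_50 = counts["Mutual Funds"][1]                        # 34

joint = mutual_over_50 / grand_total            # b) over 50 AND mutual funds
given_over_50 = mutual_over_50 / over_50_total  # c) mutual funds, among over 50
given_mutual = mutual_over_50 / mutual_total    # d) over 50, among mutual funds
marginal_over_50 = over_50_total / grand_total  # e) over 50, all clients

# f) Share of each investment type's clients who are over 50.
share_over_50 = {
    kind: old / (young + old) for kind, (young, old) in counts.items()
}

for kind, share in share_over_50.items():
    print(f"{kind}: {share:.1%} over age 50")
```

For part f), the three shares printed at the end are all close to the overall 54.3%, which is what "independent of age" looks like in a table of counts.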

MT2012: Answer 3

a) C. Skewed to the right b) A. Mode < Median < Mean

c) B. $ 13,843 d) B. $22,057

e) Lower inner fence = $36,607.50; Upper inner fence = $124,835.50

f) B. g) A. h) B. i) D.

Details and Comments:

a) Long right-hand tail: more of the area is piled up to the left.

b) The mode is the peak and it is clearly to the left of the median value of 77020. The

median is less than the mean for a right-skewed distribution.

c) Use the rule of thumb: s ≈ Range/6 = (129,420 – 46,360)/6 = 83,060/6 ≈ $13,843

d) IQR = Q3 – Q1 = 91750 – 69693 = 22,057

e) Lower inner fence = 69,693 – 1.5×22,057 = $36,607.50

Upper inner fence = 91,750 + 1.5×22,057= $124,835.50

f) The maximum is larger than the upper fence but the minimum is not smaller than the

lower fence.

g) The sum is increased so the mean is increased.

h) The median is the line in the interior of the box.

i) Variability is shown by the length of the box.
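The arithmetic in parts c) to f) can be checked with a short Python sketch. It uses only the five-number summary given in the question; the variable names are illustrative.

```python
# Five-number summary from Question 3 (salaries in $).
minimum, q1, median, q3, maximum = 46_360, 69_693, 77_020, 91_750, 129_420

iqr = q3 - q1                    # d) interquartile range
lower_fence = q1 - 1.5 * iqr     # e) lower inner fence
upper_fence = q3 + 1.5 * iqr     # e) upper inner fence

# f) An observation beyond a fence is an outlier by this criterion.
# Checking the minimum and maximum is enough to tell which sides have outliers.
outliers_on_left = minimum < lower_fence
outliers_on_right = maximum > upper_fence

# c) Rule-of-thumb estimate of the standard deviation.
sd_estimate = (maximum - minimum) / 6

print(iqr, lower_fence, upper_fence)
print(outliers_on_left, outliers_on_right, round(sd_estimate))
```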


MT2012: Answer 4

a) r² = 0.735² = 0.5402 or 54%

b) Unchanged at 0.735

c) b1 = r(sy/sx) = 0.735(15,618/28,077) = 0.409;

b0 = ȳ – b1x̄ = 17,927 – (0.409)(85,318) = -16,968; ŷ = -16,968 + 0.409x

d) ŷ(82,613) = -16,968 + 0.409(82,613) = $16,821

Residual = 11,863 – 16,821 = -$4,958

e) No; a prediction at $200,000 requires extrapolation beyond the range of data.

f) Constant Variance (V-shape indicates violation of this assumption)

g) The two variables are categorical, not quantitative, so correlation is not appropriate.

h) Age

i) E. 0.00

Details and Comments:

a) This is the definition of r-squared.

b) The correlation coefficient has no units; it doesn’t change if the measurement units

change.

c) Straightforward application of least squares regression line formulas.

d) Substitute the x-value into the regression equation to get the predicted y. The residual

is the observed y minus the predicted y.

h) Age “precedes” and therefore predicts Hours on Internet.

i) The best-fitting straight line is horizontal.
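Parts a), c) and d) need only the summary statistics, not the ten data points. Here is a minimal Python sketch under that assumption; it rounds the slope and intercept the way the question asks, and the names `slope`, `intercept` and `predict` are illustrative.

```python
# Summary statistics from Question 4.
r = 0.735
mean_pay, sd_pay = 85_318, 28_077       # predictor x: annual pay
mean_bonus, sd_bonus = 17_927, 15_618   # response y: cash bonus

r_squared = r ** 2                                   # a) share of variability explained

slope = round(r * sd_bonus / sd_pay, 3)              # c) b1 = r * sy / sx
intercept = round(mean_bonus - slope * mean_pay)     # c) b0 = ybar - b1 * xbar


def predict(pay):
    """Predicted cash bonus from the fitted line."""
    return intercept + slope * pay


fitted = predict(82_613)        # d) predicted bonus
residual = 11_863 - fitted      # d) observed minus predicted

print(f"r^2 = {r_squared:.4f}")
print(f"bonus = {intercept} + {slope} * pay")
print(f"fitted = {fitted:.0f}, residual = {residual:.0f}")
```

A negative residual means the line over-predicts this executive's bonus.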

MT2012: Answer 5

a) 6.7% b) 156 c) Q1 = 101; Q3 = 139

d) Mean = 15; SD = 44.8 e) 0.6293 or 0.63 or 63%

f) Angelina; Z-score for Angelina = -1.5; Z-score for Brad = -1.0;

Details and Comments:

a) Standardize the X-value; 162 is 1.5 SDs above the average. Find the area to the right of

1.5 on the Z-curve.

Pr(X > 162) = Pr(Z > [162 – 120]/28) = Pr(Z > 1.5) = 0.0668.

b) Find the value of Z that has an area of 10% to the right; then “unstandardize.”

z = 1.28; X = 120 + 1.28(28) = 155.8, which rounds to 156.

c) Find z-values that have an area of 25% to the right and to the left; then

“unstandardize.” Since the Z is symmetric, the z-value on the left is the negative of the z-

value on the right.

Q1: z = –0.675; X = 120 + (–0.675)(28) = 101

Q3: z = 0.675; X = 120 + (0.675)(28) = 139

d) Mean = 120 – 105 = 15; SD = √(28² + 35²) = 44.8

e) Pr(F–M > 0) = Pr(Z > [0–15]/44.8) = Pr(Z > -0.33) = 0.6293 or 0.63 or 63%

f) Z-score for Angelina = (78–120)/28 = -1.5; Z-score for Brad = (70–105)/35 = -1.0;

Angelina did worse relative to the reference populations since her Z-score is more negative.
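The table look-ups above can be cross-checked in software. This sketch assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`; it is offered only as a check on the hand calculations, which remain what the exam expects.

```python
from math import sqrt
from statistics import NormalDist

female = NormalDist(mu=120, sigma=28)
male = NormalDist(mu=105, sigma=35)

p_above_162 = 1 - female.cdf(162)     # a) area to the right of 162
top_10_cutoff = female.inv_cdf(0.90)  # b) score exceeded by only 10%
q1 = female.inv_cdf(0.25)             # c) lower quartile
q3 = female.inv_cdf(0.75)             # c) upper quartile

# d) Difference of two independent scores: means subtract, variances add.
diff = NormalDist(mu=120 - 105, sigma=sqrt(28**2 + 35**2))
p_female_higher = 1 - diff.cdf(0)     # e) Pr(F - M > 0)

# f) Standardized scores relative to each person's own group.
z_angelina = (78 - 120) / 28
z_brad = (70 - 105) / 35

print(f"{p_above_162:.4f} {top_10_cutoff:.1f} {q1:.0f} {q3:.0f}")
print(f"{diff.stdev:.1f} {p_female_higher:.3f} {z_angelina} {z_brad}")
```

Software does not round z to two decimals before the look-up, so the probability in part e) should come out near 0.63 rather than matching the table value 0.6293 digit for digit.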


MT2012: Answer 6

Part I.

a) Mean = np = 300×0.10 = 30.0; SD = √(np(1 – p)) = √(300×0.10×0.90) = 5.2

b) Pr(25 ≤ X ≤ 35) = Pr([25–30]/5.2 < Z < [35–30]/5.2) = Pr(-0.96 < Z < 0.96)

= 1 – 2(0.1685) = 0.663.

c) From the Empirical Rule, 3 SDs above the mean is extremely unusual;

μ +3σ = 30 + 3(5.2) = 45.6. Sales of 46 or more would be extremely unusual.

Part II.

a) Normal: Mean = 150 and SD = 30/√36 = 5

b) Pr(x̄ < 130) = Pr(Z < [130 – 150]/5) = Pr(Z < -4) < 0.0001

There is an extremely small probability of getting a sample mean this small.

Details and Comments:

Part I. a) Use the mean and standard deviation of a count.

b) Use the normal sampling distribution of a count. (Note: Continuity correction was not

needed, but if you used it correctly you would get an answer of 0.711.)

Part II. a) Use the mean and standard deviation of a mean. (Note: The CLT applies here, but it is

not necessary to say this in the answer.)

b) Use the normal sampling distribution of a mean.
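Both parts of this answer use a normal model: one for a count, one for a sample mean. The sketch below reproduces them with Python's standard-library `statistics.NormalDist` (Python 3.8 or later is assumed); the variable names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Part I: number of magazine buyers among n customers.
n, p = 300, 0.10
mean_count = n * p                     # a) np
sd_count = sqrt(n * p * (1 - p))       # a) sqrt(np(1 - p))

count = NormalDist(mean_count, sd_count)
p_25_to_35 = count.cdf(35) - count.cdf(25)           # b) no continuity correction
p_25_to_35_cc = count.cdf(35.5) - count.cdf(24.5)    # b) with continuity correction

unusual_threshold = mean_count + 3 * sd_count        # c) three SDs above the mean
first_unusual_count = ceil(unusual_threshold)        # c) whole number of sales

# Part II: mean revenue over a sample of 36 days.
xbar = NormalDist(mu=150, sigma=30 / sqrt(36))       # a) SD of the mean is sigma/sqrt(n)
p_mean_below_130 = xbar.cdf(130)                     # b) Pr(sample mean < 130)

print(f"{mean_count:.1f} {sd_count:.1f}")
print(f"{p_25_to_35:.3f} {p_25_to_35_cc:.3f} {first_unusual_count}")
print(f"{p_mean_below_130:.6f}")
```

Small differences from the table-based answers (0.663 and 0.711) are expected, because the tables round z to two decimal places.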

MT2012: Answer 7

a) D. b) A. c) B. d) C. e) D.

Details and Comments:

a) n = (2.576²)(0.12)(0.88)/(0.04²) = 438

b) p̂ = 46/450 = 0.1022

99% CI: 0.1022 ± 2.576×√(0.1022×0.8978/450) = 0.1022 ± 0.0368

c) Notice the wording and the use of the term “95% confident”.

d) Values inside a confidence interval are likely values of the parameter. Evidence of a

change or a difference depends on the target value being outside the CI.

e) Examine the CI formula; a higher confidence level requires a larger multiplier/critical

value so the margin of error will be larger.
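The sample-size and confidence-interval formulas in parts a) and b) can be sketched in Python as follows. The critical value 2.576 is the table value used above; the variable names are illustrative.

```python
from math import ceil, sqrt

z_99 = 2.576   # critical value for 99% confidence (from the table)

# a) Sample size for a margin of error of 0.04, planning with p = 0.12.
planning_p = 0.12
margin = 0.04
n_required = ceil(z_99**2 * planning_p * (1 - planning_p) / margin**2)

# b) 99% confidence interval from the actual sample.
n = 450
p_hat = 46 / n
standard_error = sqrt(p_hat * (1 - p_hat) / n)   # uses p_hat, since p is unknown
ci_low = p_hat - z_99 * standard_error
ci_high = p_hat + z_99 * standard_error

print(n_required)
print(f"[{ci_low:.4f}, {ci_high:.4f}]")
```

Rounding the sample size up rather than to the nearest integer guarantees the margin of error is no larger than requested.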


MT2012: Answer 8

a) H0 : p = 0.08 and Ha : p < 0.08

b) Formula and computed value: p̂ = 27/450 = 0.06

z = (p̂ – p0)/√(p0(1 – p0)/n) = (0.06 – 0.08)/√(0.08×0.92/450) = -1.56

c) C. d) B.

Details and Comments:

a) One-tailed alternative since the question asks whether the training was effective in

“lowering” the defect rate.

b) Remember that the test statistic uses p0 in the denominator, not p̂ as in the

confidence interval.

c) Find the area to the left of -1.4 on the standard normal curve.

d) Since the P-value (0.0808) is not less than α = 0.01, the evidence is not statistically significant.
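The one-proportion z-test in this answer can be sketched in Python with the standard-library `statistics.NormalDist` (Python 3.8 or later is assumed). The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.08            # defect rate under the null hypothesis
n, defects = 450, 27

p_hat = defects / n
# b) The standard deviation in the denominator uses p0, not p_hat,
# because the test assumes the null hypothesis is true.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# c) Lower-tail P-value for the test statistic given in the question (-1.4).
p_value = NormalDist().cdf(-1.4)

# d) Decision at the 1% significance level.
alpha = 0.01
reject_null = p_value < alpha

print(f"z = {z:.2f}, P-value = {p_value:.4f}, reject H0: {reject_null}")
```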

MT2012: Answer 9

a) B. b) A. c) D. d) D. e) A.

Details and Comments:

a) H0 : µ = 15 and Ha : µ > 15.

Use one-tailed alternative since the question is about “increasing” the volunteer time.

b) t = (x̄ – µ0)/(s/√n) = (16.75 – 15)/(2.40/√24) = 3.572

c) The P-value is much smaller than 0.05 so reject the null hypothesis. The volunteer time

is greater than 15 hours. So no incentive program is needed to get past 15 hours.

d) These are the assumptions/conditions for a one-sample t-test.

e) 16.75 ± 2.069×2.40/√24 = 16.75 ± 1.014, which is closest to 16.75 ± 1.016 (A)
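The summary statistics, test statistic and interval can be recomputed from the 24 raw values. This Python sketch uses only the standard library; the critical value 2.069 (t with 23 degrees of freedom) is taken from the answer above rather than computed, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

hours = [12, 13, 14, 14, 15, 15, 15, 16, 16, 16, 16, 16,
         17, 17, 17, 18, 18, 18, 19, 19, 19, 20, 20, 22]

n = len(hours)
x_bar = mean(hours)      # sample mean
s = stdev(hours)         # sample standard deviation (n - 1 in the denominator)
standard_error = s / sqrt(n)

mu_0 = 15
t_stat = (x_bar - mu_0) / standard_error     # b) one-sample t statistic

t_crit = 2.069           # table value, 95% confidence, 23 degrees of freedom
margin = t_crit * standard_error             # e) margin of error

print(f"n = {n}, mean = {x_bar}, s = {s:.2f}")
print(f"t = {t_stat:.3f}, 95% CI = {x_bar} ± {margin:.3f}")
```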

END OF ANSWERS AND EXPLANATIONS TO MIDTERM 2012


Midterm Exam 2011

Notes: This exam has 9 questions. The duration is 2 hours. Books, notes, and

calculators are allowed, but not computers, cellphones or on-line connectivity.

MT2011: Question 1 “First things first”

a) At the beginning of the term we asked all Commerce 291 students to complete our on-

line survey. This survey was most likely designed to be:

__A. a random sample of all C291 students

__B. a census of all C291 students

__C. a random sample of business students

__D. a random sample of 2nd-year UBC students

__E. all of the above

b) The survey asked a wide range of questions. For each variable, circle the description

which best describes the type of data the variable represents.

Ethnic background Categorical Quantitative Identifier

Height Categorical Quantitative Identifier

C290 grade Categorical Quantitative Identifier

# hrs online per day Categorical Quantitative Identifier

c) From the survey results, we can estimate that, on average, students spent 15.2 hours

per week studying. This number seems high given that for a course load of 4 courses the

students spend 12 hours per week in the classroom and nearly half of the students

reported doing paid work. What is the most likely explanation?

__A. the data are very skewed and the median is a better numerical summary

__B. the data are bimodal, the two groups are those that work and those that don’t

__C. women study more than men

__D. none of the above

d) Unfortunately, not every C291-registered student responded to the survey. If it were

true that students who didn’t respond also spend less time studying, then our estimate of

study time from the survey is:

__A. a good estimate of average study time of C291 students

__B. biased above the true average study time of C291 students

__C. biased below the true average study time of C291 students

__D. not a good estimate for study time of C291 students but

we can’t say whether it is too high or too low.

e) From the survey we find that the Commerce 290 Grade (call this variable, X) has a

symmetric, bell-shaped distribution. Also, 95% of the grades fall in the range 53 to 93.

Use that information to compute the mean and standard deviation of X. Report to at most

one decimal place.

Mean of X = _____ SD of X = _____


MT2011: Question 2 “Stock answers are sufficient here”

a) The following data are the price-to-earnings ratios (P/E ratio) for a random sample of

25 stocks traded on the NYSE. The data values have been sorted from smallest to largest.

Data: 4 8 11 11 12 13 13 14 14 15 16 17 17 17 19

21 22 22 24 24 26 28 33 35 39

The mean of these values is 19.0 and the standard deviation is 8.5.

i) Find the following:

Median = ______

Q1 = ______

Q3 = ______

IQR = ______

Inner fences = ________________

Outliers: = __________________ (If there are no outliers, write “None”)

(Note: Outliers are defined using the “inner fences” criterion)

ii) Is the distribution symmetric or skewed? (Note: You do not have to draw a graph to

answer this.) Circle your choice. Then give your reason.

Symmetric Skewed

Reason:

iii) Sketch a boxplot of these data. Use the version based on the five-number summary;

do not use the modified version using fences.

b) Determine whether each statement is true or false. Circle your choice. No explanation

is required.

1. If the mean and SD are equal for a measurement variable that only takes

positive values, the distribution is symmetric.    True    False

2. If the mean and median are equal, the distribution must be normal.    True    False

3. If the mean and median are equal, the mode must also equal the mean

and median.    True    False

4. The SD and IQR are always equal for a symmetric distribution.    True    False

5. The SD of a set of data values can never be zero.    True    False


MT2011: Question 3 “To-fu or not to-fu, that is the question”

Read the following survey design plan and then answer the questions after it.

Get Healthy, a producer of health foods, conducts a survey of the Lower Mainland to

determine how receptive high school students would be to its TOFU BURGH product and

what market potential (sales) it could expect. It plans the survey as follows:

i. From the list of all schools in the area, two groups are defined, public and private high

schools, called PUBS and PRIS

ii. From the PUBS, four schools are chosen randomly.

iii. From the PRIS, one school is chosen randomly.

iv. In the PUBS schools selected, on one day, researchers give every fifteenth student to

exit the school a TOFU BURGH and a stamped, self-addressed postcard (like the one here).

v. In the PRIS school, researchers set up a stand outside the school and give a free

TOFU BURGH and the postcard to any student who comes to the stand.

a) The overall survey sampling design planned by the company can best be described as:

__A. convenience sampling

__B. multi-stage sampling

__C. stratified sampling

__D. simple random sampling

__E. cluster sampling

b) In the PUBS selected, the sampling design uses:

__ A. systematic sampling

__ B. voluntary response strategy

__ C. unacceptable bribery of students

__ D. anecdotal responses

c) In the PRIS selected, the sampling design uses:

__ A. systematic sampling

__ B. voluntary response strategy

__ C. unacceptable bribery of students

__ D. anecdotal responses

d) One parameter of interest is likely to be:

__ A. the total number of students who replied to the survey

__ B. the number of high school students in the Lower Mainland

__ C. the number of students who replied they would buy at least one

TOFU BURGH in a typical week

__ D. the proportion of students who replied they would buy at least one

TOFU BURGH in a typical week

e) Which of the two samples is likely to have non-response bias?

__ A. PUBS schools only

__ B. PRIS school only

__ C. Both PUBS and PRIS schools

__ D. Neither will have non-response bias


MT2011: Question 4 “Unassociated questions about association – how ironic”

Note: This question has three unrelated parts.

a) A business school conducted a survey of companies in its state. They mailed a

questionnaire to small, medium-sized, and large companies. The rate of non-response is

important in deciding how reliable survey results are. Here are the data on responses to

this survey.

Small Medium Large

Response 375 160 40

No Response 225 240 160

Total 600 400 200

(i) What was the overall percent of non-response?

(ii) How is non-response related to the size of the business? Use percents to make your

statement precise.

b) Investment reports now often include correlations. Following a table of correlations

among mutual funds, a report adds, “Two funds can have perfect correlation, yet different

levels of risk. For example, Fund A and Fund B may be perfectly correlated, yet Fund A

moves 20% whenever Fund B moves 10%.” Explain to someone who knows no statistics

how this can happen.

c) A study shows that there is a positive correlation between the size of a hospital

(measured by its number of beds, x) and the median number of days, y, that patients

remain in the hospital. Does this mean that you can shorten a hospital stay by choosing a

small hospital? Explain your answer choice.

Yes No

Reason:


MT2011: Question 5 “Bart vs. Lisa does not refer to Simpson’s Paradox”

a) At a well-known business school the grade point averages (GPA) of its 1000

undergraduates are normally distributed with mean 2.84 and standard deviation 0.40.

(i) What percentage of the undergraduates have GPAs below 2.00 (i.e. “on probation”)?

Answer: ________

(ii) What GPA will be exceeded by only 20% of the student body?

Answer: ________

(iii) Compute the lower and upper quartiles, and the interquartile range for this

distribution.

Q1 = _______ Q3 = _______ IQR = ______

b) Bart scores 725 on the mathematics section of the Scholastic Aptitude Test (SAT). In a

reference population, SAT scores are normally distributed with mean 500 and standard

deviation 100. Lisa scores 33 on the American College Test (ACT) mathematics test;

ACT scores are normally distributed with mean 18 and standard deviation 6.

(i) What are the z-scores for each student?

Bart: _______ Lisa: _______

(ii) Circle either the name Bart or Lisa (above) based on who did better relative to the

reference populations.


MT2011: Question 6 “Strength in numbers; numbers on strength”

a) To test the strength of building materials such as steel girders, engineers place

increasing loads on the girders until they break. The pressure exerted by the load that

eventually breaks the material is called the ‘strength’ of the girder. Generally speaking, the

longer the girder, the less the strength. Your company makes steel girders. The engineer

in charge of testing tells you that he has tested 10 girders to breaking point and has

obtained data linking the length of each girder (in metres) to its strength (in kg per square

centimetre). But his computer crashed just after he ran a regression analysis on the data

and all he can remember is the lengths of the girders and a few strengths. He did manage

to record the means and standard deviations of all the lengths and strengths and the r² of

the regression, which was 0.719.

(X) Length (m) (Y) Strength (kg/cm²)

1 90

1 101

2 Lost

2 Lost

3 91

3 77

4 Lost

4 Lost

5 76

5 Lost

Mean 3.00 82.60

SD 1.49 10.72

Note: The means and standard deviations are calculated for the ENTIRE data set,

including those that are missing.

(i) What is the correlation between length and strength? Report to three decimal places.

(ii) Work out a regression equation that predicts strength from length.

Equation: ___________________________

(iii) You notice that the purchaser of your girders requires the 5 m girders to support an

average load of 75 kg per square centimetre. Do you feel confident your girders will do

that? Give a numerical rationale.


b) What is the correlation coefficient for the following three points in the X-Y plane?

(STOP AND THINK BEFORE YOU START!)

X 1 3 5

Y 4 3 2

Answer: __________

c) An American study found that the correlation between two-year-old children’s heights

(measured in inches) and their weights (measured in pounds) was 0.46. What would the

correlation coefficient be if you converted their heights to centimetres and weights to

kilograms? (One inch = 2.54 cm and 1 pound = 0.454 kg.)

Answer: __________

d) An economist studied salaries of 321 bank employees with five or fewer years of

employment in a national bank. He found that the relationship between years of service

and salary was linear and that the regression equation predicting salary (in thousands of

dollars) was: Salary = 21.5 + 3.1 * Years.

He concludes that employees with 10 years of service should make an average salary of

$52,500. Is his conclusion correct? If not, say why.

e) In part d) the economist has used the regression equation to make a prediction. Which

of these numbers best measures the precision of this prediction?

__A. The slope of the line (b1)

__B. The standard deviation of y (sy)

__C. The standard deviation of x (sx)

__D. The square of the correlation coefficient (r²)

__E. The ratio of the two standard deviations (sy / sx)

f) An investigator measuring various characteristics of a large group of athletes found that

the correlation coefficient between the weight of the athlete and the weight that the

athlete could lift was r = 0.60. Determine whether each statement is true or false. Circle

your choice.

(i) If an athlete gains 5 kg, he/she will be able to lift an additional 3 kg.    True    False

(ii) The more an athlete can lift, on the average the more that athlete weighs.    True    False

(iii) 36 per cent of the athlete’s lifting ability can be attributed to his or her

weight alone.    True    False

(iv) 60 per cent of the athlete’s lifting ability can be attributed to his or her

weight alone.    True    False


MT2011: Question 7 “Pack up all your troubles, and call it a day”

An important part of the customer service responsibilities of a telephone company relates

to the speed with which troubles in residential service can be repaired. Suppose that past

data indicate that there is a probability of 0.70 that service troubles can be repaired on the

same day they are reported.

a) Suppose the company receives 100 trouble calls on a particular day. What is the

approximate chance that 80% or more will receive same-day repairs?

b) Suppose it is also known that the repair time for a trouble call has a mean of 480

minutes and a standard deviation of 250 minutes. A random sample of 400 trouble calls

was taken and the repair times recorded. Compute the probability that the mean of the

400 repair times is less than 500 minutes.


MT2011: Question 8 “Statistical analysis of a logo transformation”

An established clothing retailer, CHAP, is interested in customer response to a proposed

new logo. A survey randomly samples 100 customers; 55 of them say they would prefer

the new logo to the previous one. However, CHAP will only change its logo if it is

convinced that the newly designed logo is preferred by the majority (i.e. more than half)

of its customers. Based on this information answer the following questions.

a) The sample estimate p̂, the proportion of customers who prefer the newly designed

logo over the previous one is:

__ A. 0.55

__ B. 55

__ C. 100

__ D. Not able to be determined from the information given

b) The standard error of this estimate is closest to:

__ A. 0.0025

__ B. 0.050

__ C. 0.071

__ D. 0.50

c) The 95% confidence interval for the true proportion of the customers who prefer the

new logo over the previous one is closest to:

__ A. 0.55 ± 0.098

__ B. 0.55 ± 0.98

__ C. 0.55 ± 0.0049

__ D. 55 ± 9.8

d) How large a sample n would you need to estimate p, the proportion of people who

prefer the newly designed logo over the previous one, with margin of error 0.05 with 99%

confidence? Use 0.5 as the guessed value for p.

__ A. 384

__ B. 664

__ C. 26

__ D. 271

e) If a hypothesis test were conducted on these data, the test statistic would be 1.00. If the

alternative hypothesis were one-sided, what would the P-value be?

__ A. 0.0794

__ B. 0.1587

__ C. 0.3174

__ D. 0.8413

f) Which of the following is a correct conclusion from the hypothesis test in part e)?

__ A. Customers definitely prefer the new logo

__ B. Customers definitely do not prefer the new logo

__ C. There is not enough evidence to say customers prefer the new logo

__ D. There is not enough evidence to say customers do not prefer the new logo


MT2011: Question 9 “The business of bus-ness”

You are the new Operations Manager of the local public transportation company and are

especially interested in the reliability of bus service. You plan, on a monthly basis, to take

a random sample of major bus stops and observe whether the buses depart on time or late

and how late they are. (Buses never leave early since, if they arrive early, they wait until

their departure will be exactly on time.)

a) The first month, you gather a random sample of 121 bus departures from a variety of

times of day, days of the week, routes and locations. The sample has an average lateness

of departure of 6.4 minutes with a standard deviation of 1.8 minutes. Which of the

following is closest to a 95% confidence interval for the average lateness of departures

for the entire bus system this month?

__ A. 6.4 ± 0.029

__ B. 6.4 ± 0.271

__ C. 6.4 ± 0.324

__ D. 6.4 ± 3.564

b) Which of the following would decrease the width of the confidence interval?

__ A. Reduce the confidence level

__ B. Increase the sample size

__ C. Reduce the sample standard deviation

__ D. All of the above

Five years ago, the system-wide mean lateness of departure was known to be 6.8 minutes.

Using a 5% level of significance and the sample results of part a), carry out a hypothesis

test to decide whether the system is improving; that is, whether the mean lateness has

decreased from five years ago.

c) The appropriate null and alternative hypotheses are:

H0: ____________ Ha: ____________

d) Give the formula for the appropriate test statistic and compute its value.

Formula: __________________

Computed value: ______________

(Show your work to the right ==>)

e) Give a range in which the P-value is located.


f) From the P-value associated with this test statistic, which of the following is correct?

__ A. Do not reject H0 at the 10% significance level

__ B. Reject H0 at the 10% significance but not at the 5% significance level

__ C. Reject H0 at the 5% significance level but not at the 1% significance level

__ D. Reject H0 at the 1% significance level

g) Using the 5% significance level, state your conclusion in one clearly worded sentence

that the bus company management can understand.

h) The distribution of lateness of departure is strongly skewed to the right. However, it is

still appropriate to test the mean because:

__ A. The data are a simple random sample from the population of interest

__ B. The sample size is large enough for the Central Limit Theorem to apply

__ C. Since the sample is random, bus departures are independent of one another

__ D. All of the above

BONUS: In what century did the “equals” sign first appear in print?

__ A. 1300s

__ B. 1400s

__ C. 1500s

__ D. 1600s

__ E. 1700s

__ F. 1800s

__ G. 1900s

MT2011 – END OF QUESTIONS; ANSWERS AND EXPLANATIONS FOLLOW


MT2011: ANSWERS AND EXPLANATIONS

MT2011: Answer 1

a) B

b) Ethnicity: Categorical; Height: Quantitative; C290 grade: Quantitative;

# hrs online: Quantitative

c) A; d) B; e) Mean(X) = 73, SD(X) = 10

Details and Comments:

a) The goal was to survey the entire population of C291 students; that is the definition of

a census.

b) Height, C290 grade, and # hrs online are each measured with units (cm, %, and hrs,

respectively) so all three are quantitative variables.

c) The distribution is likely to have a small number of students with a high number of

study hours; this skewness has the effect of inflating the mean.

d) Bias comes from, among other sources, missing values which are missing for a reason

related to the variable of interest.

e) The interval 73 ± 20 = Mean ± 2SD (by The Empirical Rule) = 73 ± 2(10).

The midpoint of the interval (73) is the mean; the SD is 10.

MT2011: Answer 2

a) i) Median = 17, Q1 = 13, Q3 = 24, IQR = 11;

Inner fences = (-3.5, 40.5). [Accept also (0,40.5).] There are no outliers.

ii) The distribution is skewed since the mean is quite different from the median.

iii)

_________

|-----------|__|_______|-----------------|

___________________________________

0 10 20 30 40

b) All five statements are False.

Details and Comments:

a) i) With 25 data points, the median is the 13th value. Q1 is between the 6th and 7th
values (which are equal here) and Q3 is between the 19th and 20th values (which are
also equal here). IQR = Q3 – Q1. Since the lower inner fence (Q1 – 1.5×IQR) is negative,

it is also acceptable to report it as 0 because P/E ratios cannot be negative.

ii) Actually, the distribution is skewed to the right, but that distinction was not needed in

the answer.

iii) The sketch must show the skewness, namely that the median is closer to the left side

of the box and the left whisker is shorter than the right whisker.

b)

1. The Empirical Rule wouldn’t be able to “work” so the distribution is NOT symmetric.

2. A distribution can be symmetric without being normal; e.g. pyramid shape, or uniform.

3. A symmetric distribution can have two peaks; the mean and median are in the middle

but the modes are at either end (e.g. U-shaped)

4. There is no reason for this to be true.

5. SD = 0 if all data values are the same.


MT2011: Answer 3

a) B or C; b) A; c) B; d) D; e) C;

Details and Comments:

a) Both multi-stage sampling and stratified sampling are acceptable answers.

Technically, multi-stage sampling is the preferred answer, since for PUBS, four schools

are chosen randomly but the actual students are selected systematically.

b) Since every fifteenth student is selected, the selection is systematic, not random.

c) Since students are free to come, or not, to the stand, this is voluntary response.

d) Counts are not parameters because they are not adjusted for sample size; however,

proportions are parameters.

e) Cards are handed out either to every fifteenth student or to volunteers; however, in

each group not everyone who receives a card will mail the card in; that’s non-response.

MT2011: Answer 4

a) (i) 52% (625/1200 = 0.52)

(ii) Non-response rates are: Small: 37.5%, Medium: 60%, Large: 80%.

The larger the company the higher the expected rate of non-response.

b) Correlation is not the same as slope. So a perfect correlation does not mean that the

slope is 1, hence a 1 unit increase in x does not mean a 1 unit increase in y.

c) No: Larger hospitals are more likely to take more serious cases requiring longer length

of stay.

Details and Comments:

a) (i) Sum across the columns to get the row totals of 575 Respondents and 625 Non-

respondents. Then divide by the overall total of 1200.

(ii) Column percentages are needed here, not row percentages.

b) Remember the formula for slope: b1 = r(sy/sx). Even if r = 1, the slope is still the ratio of

the SDs, which need not be equal.

c) Look for lurking variables to explain unusual or nonsensical correlations.

MT2011: Answer 5

a) (i) Pr (X < 2.00) = Pr (Z < [2.00–2.84]/0.40) = Pr (Z < –2.10) = 0.0179 or 1.79%.

(ii) Z = 0.84; X = 2.84 + 0.84(0.40) = 3.18 (or 3.176)

(iii) Q1 = 2.57; Q3 = 3.11; IQR = 0.54

Q1 for Z = –0.675; X = 2.84 + (–0.675)(0.40) = 2.57

Q3 for Z = 0.675; X = 2.84 + (0.675)(0.40) = 3.11

IQR = 3.11 – 2.57 = 0.54

b) Bart: 2.25; Lisa = 2.50, Circle Lisa

Z-score for Bart = (725–500)/100 = 2.25; Z-score for Lisa = (33–18)/6 = 2.50;

Lisa did better relative to the reference populations since her positive Z-score is higher.

Details and Comments:

a) Remember to make sketches of the required areas so that you get the correct parts of

the normal curve. In (i), standardize X to Z and find the corresponding area; in (ii) and

(iii), begin with the area, find Z and “unstandardize” to get X.
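The normal-curve steps in part a) can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative, and it assumes part (ii) asks for the value with 80% of the area to its left, which is what Z = 0.84 corresponds to.

```python
from statistics import NormalDist

# Normal model used in the answer above: mean 2.84, SD 0.40.
model = NormalDist(mu=2.84, sigma=0.40)

# (i) Standardize and find the area to the left of 2.00.
p_below_2 = model.cdf(2.00)

# (ii) Start from the area (assumed to be 0.80) and "unstandardize" to get X.
x_80 = model.inv_cdf(0.80)

# (iii) Quartiles and interquartile range.
q1 = model.inv_cdf(0.25)
q3 = model.inv_cdf(0.75)
iqr = q3 - q1

print(round(p_below_2, 4), round(x_80, 2), round(q1, 2), round(q3, 2), round(iqr, 2))
```

Software carries more decimal places than a printed z-table, so the last digit may occasionally differ from a hand calculation.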


MT2011: Answer 6

a) (i) r = –√(r²) = –0.848 (Note that the correlation is negative!)

(ii) b1 = r(sy/sx) = –0.848(10.72/1.49) = –6.10;

b0 = ȳ – b1·x̄ = 82.60 – (–6.10)(3.00) = 100.9; ŷ = 100.9 – 6.10x

(iii) ŷ(5) = 100.9 – 6.10(5) = 70.4. Since this is less than the required 75 kg/cm², you

should not be confident that the 5 m girders will be sufficient.

b) Perfect negative correlation: r = –1 (Plot the data points; they fall on a straight line.)

c) r = 0.46, unchanged (Correlation is invariant to the measurement scales.)

d) No – prediction at 10 years requires extrapolation beyond the range of the data (that is,

the analysis was done using bank employees with 5 or fewer years of employment).

e) D. The square of the correlation coefficient.

f) False, True, True, False

Details and Comments:

a) The minus sign is vital; the correlation is negative since the longer the girder, the lower

the strength. If you forget the minus sign your calculations of the slope, intercept and

regression equation will be incorrect and you would end up concluding that 5 m girders

would support an average load of 75 kg/cm². In that case your building might fall down.

Bad statistical analysis can kill!

b) Remember to make a plot before doing the calculations.

f) (i) is false because a gain of 5 kg will mean an additional lift of 3 kg only on average.

A gain of 5 kg might give additional lift greater than 3 kg for some people and less than 3

kg for others; (ii) refers to what happens on average; (iii) uses the definition of r²; (iv)

mistakenly uses r instead of r².
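The slope, intercept and prediction in part a), and the unit-invariance claim in part c), can be reproduced with a few lines of arithmetic. This is a sketch only: the summary statistics are the ones quoted in the answer above, while the height and weight lists are made-up values used purely for illustration.

```python
from math import sqrt

# Part a): summary statistics quoted in the answer (x = girder length, y = strength).
r = -0.848
s_x, s_y = 1.49, 10.72
x_bar, y_bar = 3.00, 82.60

b1 = r * s_y / s_x          # slope, about -6.10
b0 = y_bar - b1 * x_bar     # intercept, about 100.9
y_hat_5 = b0 + b1 * 5       # predicted strength for a 5 m girder, about 70.4


def correlation(xs, ys):
    """Pearson correlation computed directly from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Part c): hypothetical heights (inches) and weights (pounds), not exam data.
heights_in = [60, 64, 68, 70, 72]
weights_lb = [115, 130, 150, 148, 180]
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.454 for w in weights_lb]

r_original = correlation(heights_in, weights_lb)
r_converted = correlation(heights_cm, weights_kg)

print(round(b1, 2), round(b0, 1), round(y_hat_5, 1))
print(abs(r_original - r_converted) < 1e-9)  # rescaling leaves r unchanged
```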

MT2011: Answer 7

a) Pr (p̂ > 0.80) = Pr (Z > [0.80–0.70]/√[(0.70)(0.30)/100])

= Pr (Z > 2.18) = 0.0145 or 1.45%

b) Pr (x̄ < 500) = Pr (Z < [500–480]/[250/√400])

= Pr (Z < 1.60) = 0.945 or 94.5%

Details and Comments:

a) Use the sampling distribution of p̂.

b) Use the sampling distribution of x̄ (i.e. remember the √n in the denominator).

Both of these situations depend on the Central Limit Theorem and both random samples

are large enough (100 and 400, respectively). Remember to make a sketch to get the

correct area!
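Both probabilities can be checked with the normal approximation. The sketch below uses Python's standard-library `statistics.NormalDist` and assumes, as the answer does, that the Central Limit Theorem applies; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()  # standard normal

# a) Sample proportion: p = 0.70, n = 100, chance that p-hat exceeds 0.80.
p, n = 0.70, 100
se_prop = sqrt(p * (1 - p) / n)
z_a = (0.80 - p) / se_prop
prob_a = 1 - z.cdf(z_a)

# b) Sample mean: mu = 480, sigma = 250, n = 400, chance that x-bar is below 500.
mu, sigma, m = 480, 250, 400
se_mean = sigma / sqrt(m)
z_b = (500 - mu) / se_mean
prob_b = z.cdf(z_b)

print(round(z_a, 2), round(prob_a, 4))
print(round(z_b, 2), round(prob_b, 3))
```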


MT2011: Answer 8

a) A; b) B; c) A; d) B; e) B; f) C

Details and Comments:

a) Reason: p̂ = 55/100 = 0.55

b) Reason: SE(p̂) = √[(0.55)(0.45)/100] = 0.050

c) Reason: 0.55 ± 1.96(0.050)

d) Reason: n = (2.576²)(0.5)(0.5)/(0.05²) = 664

e) Reason: Area to the right of 1.00 on the z-curve.

f) Reason: The P-value is not less than 0.05 (and not even less than 0.10).
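The reasons for parts a) to e) chain together into one short calculation. This is a minimal sketch in Python (standard library only); the variable names are illustrative and the critical values come from `NormalDist.inv_cdf` rather than a printed table.

```python
from math import ceil, sqrt
from statistics import NormalDist

z = NormalDist()  # standard normal

x, n = 55, 100
p_hat = x / n                                   # a) sample proportion
se = sqrt(p_hat * (1 - p_hat) / n)              # b) standard error, about 0.050
margin_95 = z.inv_cdf(0.975) * se               # c) 95% margin of error, about 0.098

z_99 = z.inv_cdf(0.995)                         # about 2.576
n_needed = ceil(z_99**2 * 0.5 * 0.5 / 0.05**2)  # d) round up to a whole person

p_value = 1 - z.cdf(1.00)                       # e) area to the right of 1.00

print(p_hat, round(se, 3), round(margin_95, 3), n_needed, round(p_value, 4))
```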

MT2011: Answer 9

a) C b) D c) H0: μ = 6.8; Ha: μ < 6.8

d) t = (x̄ – μ0)/(s/√n) = (6.4 – 6.8)/(1.8/√121) = -2.44

e) 0.005 < P-value < 0.01

f) D. Reject H0 at the 1% significance level

g) There is strong evidence to say that the system is improving (or that mean lateness has

decreased).

h) B or D (either is acceptable)

Details and Comments:

a) Reason: t* = 1.980 (df = 120): CI = 6.4 ± 1.980(1.8/√121) = 6.4 ± 0.324

b) Examine the effect of each of these by referring to the formula for the CI.

c) This is a one-tailed alternative since the question asks whether mean lateness has

decreased from five years ago.

d) Remember the minus sign on the test statistic.

e), f) & g) Reject H0 since the P-value is less than 0.01. Remember to state your

conclusion in a sentence that answers the original question.

h) B is the most important of the three, but A and C are also needed for the test to work.
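The test statistic in part d) and the margin of error in part a) share the same standard error. A minimal sketch in Python follows; the critical value 1.980 is taken from the answer above rather than computed, because the standard library has no t-distribution.

```python
from math import sqrt

n, x_bar, s = 121, 6.4, 1.8   # sample size, sample mean, sample SD
mu_0 = 6.8                    # mean lateness five years ago (null value)

se = s / sqrt(n)              # standard error of the sample mean
t_stat = (x_bar - mu_0) / se  # one-sample t statistic, about -2.44

t_star = 1.980                # t critical value quoted in the answer (df = 120)
margin = t_star * se          # 95% margin of error, about 0.324

print(round(t_stat, 2), round(margin, 3))
```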

BONUS: C. The “equals” sign first appeared in print in 1557.

MT2011 – END OF ANSWERS AND EXPLANATIONS


Midterm Exam 2010

Notes: This exam has 9 questions. The duration is 2 hours. Books, notes, and

calculators are allowed, but not computers, cellphones or on-line connectivity.

MT2010: Question 1 "Mittens, means, and medians"

a) The Hudson's Bay Company was the official retailer of Olympics merchandise,

including the very popular red mittens. Their database included information on each sale

made to customers who paid by credit card (Visa only). Some of the variables they

collected are listed below. Decide whether each variable would, for analysis, be most

usefully considered as categorical, quantitative or neither.

Total amount of the sale ($) Categorical Quantitative Neither

Country of origin on credit card Categorical Quantitative Neither

Gender of the customer Categorical Quantitative Neither

Visa credit card number Categorical Quantitative Neither

b) Credit card customers were divided into two groups: Canadian residents and visitors to

Canada. The average amount spent by all Canadian residents was $200. The average

amount spent by all visitors to Canada was $300. What must be true about the average

amount spent by all customers?

__ A. It must be $250

__ B. It must be larger than the median expenditure

__ C. It could be any number between $200 and $300

__ D. It must be larger than $250

c) A sample of 500 cash sales had a mean of $20 and a standard deviation of $40. The

histogram of the data would most likely be:

__ A. skewed to the left (i.e. long left-hand tail)

__ B. approximately symmetric

__ C. skewed to the right (i.e. long right-hand tail)

__ D. bimodal

d) Which of the following is likely to have a mean that is smaller than the median?

__ A. The salaries of all National Hockey League players

__ B. The grades of students (out of 100) on a very easy exam on which most

score very high or perfectly, but a few do very poorly

__ C. The prices of homes in Vancouver

__ D. The grades of students (out of 100) on a very difficult exam on which most

score poorly, but a few do very well


e) Here is the frequency distribution of the ages of a sample of 100 employees of the

Hudson's Bay Company.

Age (years) Frequency

15-19 2

20-24 10

25-29 19

30-34 27

35-39 16

40-44 10

45-49 6

50-54 5

55-59 3

60-64 2

Total 100

(i) What percentage of the employees is 50 or older? _______

(ii) The median age of the employees is:

__ A. About 40

__ B. Between 30 and 34

__ C. Between 40 and 49

__ D. None of the above

(iii) The mean age of the employees is:

__ A. About 34 because about half are younger than 34 and half are older than 34

__ B. Above the median because the distribution is approximately symmetric

__ C. Above the median because the distribution is skewed to the right

__ D. None of the above

f) Based on the following figure, decide whether each of the statements below the figure

is more likely to be True or False. (Note: House income means "total household income"

and is referred to simply as "income" in the statements.)

Mercedes buyers have the highest variability in income. True False

For each car type, the incomes are reasonably symmetric. True False

There is a positive correlation between income and brand. True False

[Figure: car buyer house income (vertical axis, 0 to 350,000) by car brand: BMW, Cadillac, Lexus, Lincoln, Mercedes]


MT2010: Question 2 "Catching some zzzs"

a) Consider a standard normal random variable, Z, (i.e. with mean 0 and standard

deviation 1). Find the median, lower and upper quartiles and interquartile range (IQR) of

Z.

Median of Z: ______

Lower quartile (Q1): _______

Upper quartile (Q3): _______

Interquartile Range: _______

b) What percentage of values of Z lie outside 1.5×IQR on each side of the median? That

is, find the total percentage below "Median – 1.5×IQR" or above "Median + 1.5×IQR".

c) Draw a boxplot that would represent data obtained from a large sample of values of Z.

d) This part is unrelated to parts a), b) and c).

Scores on the Wechsler Adult Intelligence Scale (WAIS), a standard IQ test, are

approximately normally distributed for all age groups; however, the means and standard

deviations of scores differ across different age groups. For the 20 to 34 age group, the

mean is 110 and the standard deviation is 25, while for the 60 to 64 age group, the mean

is 90 and the standard deviation is 25. Sarah is 29 and her mother Ann is 62. Sarah scores

135 on the WAIS while Ann scores 120. Which of the two has the higher score relative to

her age group? Explain your choice with appropriate calculations.

____ Ann ____ Sarah


MT2010: Question 3 "Contender for gender offender"

A university offers only two degree programs, one in Engineering and one in English.

Admission to the programs is competitive, and a women's group suspects discrimination

against women in the admissions process. They obtain the following data from the

university, a two-way classification of all applicants by gender and admissions decision.

Male Female

Admitted 35 20

Not Admitted 45 40

a) Is there evidence of an association between the applicants' gender and success in

obtaining admission? Why or why not?

b) The university replies that there is no discrimination. In its defence, it produces a

three-way table that classifies applicants by gender, admission decision AND program to

which they applied.

                 Engineering          English
                 Male   Female        Male   Female
Admitted          30      10            5      10
Not Admitted      30      10           15      30

Is there an association between admission rates and gender in either program? Explain

why or why not.

c) Are the answers in parts a) and b) contradictory? If so, how can you explain the

contradiction?

d) After disregarding gender, are admission rates different in the two programs? Support

your conclusion with an appropriate two-way table (i.e. admission decision by program).


MT2010: Question 4 "Beauty is in the eye of the frolder"

On a recent trip to Mars, scientists discovered a colony of small creatures that they named

frolders. Due to the speed and agility of the frolders, the scientists could only capture five

specimens to bring back to Earth to study. One scientist suspects the weight of the frolder

may be related to the number of eyes it has. The following table shows the weight and

number of eyes for each of the five specimens:

Specimen ID A101 A102 A103 A104 A105

Weight (kg) 2 8 4 15 6

Number of Eyes 2 11 5 17 5

a) Plot these data below.

b) Briefly describe the association (must be brief for full marks!)

c) Which of the following values is the correct correlation coefficient for this data?

Note: You can reason this out without doing the calculation.

__ A. r = 0.5

__ B. r = 0.975

__ C. r = 0

__ D. r = -0.954

__ E. r = -0.5

d) Looking at the scatterplot, is the correlation coefficient an appropriate measure? Why

or why not?

e) A journalist reporting on this study claims that being heavier causes a frolder to grow

more eyes. What is wrong with this statement?

f) Do you think these five frolders represent a random sample? Why or why not?

[Plotting grid for part a), titled "Frolder Study": Weight (kg) on the horizontal axis (0 to 20), # Eyes on the vertical axis (0 to 20)]


MT2010: Question 5 "Wires, dam wires, and electricians"

Electrical wires can corrode over time. And wires used near hydroelectric dams can

corrode more quickly because of the extra moisture in the air. Corrosion rates (measured

in hundredths of mils) are generally known for various types of wire, but electricians

would like to be able to predict the corrosion rates near dams. Corrosion rates for 30

types of wire were measured in normal use and at dams to assess the relationship. A

linear regression model can be constructed with wires in normal use as the x variable and

the same wires used at dams as the y variable. The following scatterplot shows the data:

a) In this study, the response variable is:

__ A. Corrosion rate for a dam wire

__ B. Corrosion rate for a wire in normal use

__ C. Either rate; it does not matter which is considered the response

__ D. Neither; the instrument used to measure corrosion is the response variable

b) Is linear regression appropriate here? Choose the single best statement.

__ A. Yes, the scatterplot is straight enough

__ B. No, there is not enough scatter

__ C. No, there is too much scatter

__ D. Yes, there are no outliers

c) Summary statistics are presented below. Use them to calculate the regression line.

Show the formulas and your work. Report your final answers to three decimal places.

x̄ = 304.6667 sx = 196.4466 r = 0.8691

ȳ = 554.0000 sy = 286.6104

[Scatterplot: Dam Wire (vertical axis, 0 to 1200) against Wire (normal use) (horizontal axis, 0 to 800)]


d) A new type of wire has a corrosion rate measure of 555. What does the model predict

for the corrosion measure of this type of wire used at a dam?

e) One of the data points is (220, 245). What is the value of the residual for this point?

f) What fraction of the variation in y is accounted for by the model?

g) Can the regression line be used to reliably estimate the dam wire corrosion rate for a

wire which has a rate of 2500 mil under normal use? Give a reason.

___ Yes ___ No

Reason:

h) Fill in each blank with the letter of the ending that fits best.

(i) If the x and y variables are switched, __________.

(ii) If the units are changed for both x and y variables, __________.

(iii) If the units are changed for just the x variable, __________.

(iv) If a constant is added to the y variable, __________.

Endings:

A. ...the slope will change but the averages and standard deviations will not change.

B. ...sx will change but will not change.

C. ...the data will be normally distributed.

D. ...only the correlation will change.

E. ...the correlation, slope, and standard deviations will remain the same.

F. ...the correlation and slope will both change.

G. ...the slope will change, and sx and sy will also change.


MT2010: Question 6 "Putting the pedal to the medal"

Retain all precision throughout your calculations but write down only two decimal places

for your final answer.

For parts a) and b), assume that the weights of the gold medals, silver medals, and

ribbons are all independent (especially since we have not learned how to deal with such

questions otherwise!).

a) Each medal made for the recent Olympics is unique. Ours were the first Olympic

Games for which the medals have not been identical! Complete gold medals (that is, the

medal plus the ribbon) weigh 48 grams on average with a standard deviation of 6 grams.

The ribbons that are attached to the medals weigh 8 grams on average with a standard

deviation of 2 grams. Find the mean, variance and standard deviation of the weights of

the gold medals without their ribbons.

b) Complete silver medals (i.e. medal plus ribbon) weigh 38 grams on average with a

standard deviation of 5 grams. Find the mean, variance and standard deviation of a pair of

complete medals (gold and silver) combined.

c) You were instructed to assume that the weights of the gold medals, silver medals, and

lengths of ribbon are all independent. Is this a reasonable assumption? Explain why or

why not in one brief sentence at most.

d) In some winter Olympic events, such as the snowboard parallel giant slalom, the

winner is the rider with the best combined time over two runs. In some summer Olympic

events, such as the javelin throw, the winner is the athlete with the best single distance out of

four tries. Generally speaking, does the sum of two random times or the maximum of four

random distances have greater variability?

__ A. Sum of two random times

__ B. Maximum of four random distances

__ C. Cannot say because time and distance are unrelated variables

Why? Explain in one sentence maximum.


MT2010: Question 7 "The food of the gods!"

Chocolate bars produced by a certain machine are labeled 240 grams to comply with

advertising rules and regulations. However, the distribution of the actual weight of these

chocolate bars is claimed to be normal with a mean of 243 grams and a standard

deviation of 3 grams.

a) Approximately what percentage of all chocolate bars produced by this machine would

be expected to be between 240 and 246 grams?

b) A quality control manager initially plans to take a random sample of size n from the

production line. If he were to double his sample size to 2n, the standard deviation of the

sampling distribution of the sample mean would be multiplied by:

__ A. 1/2

__ B. 1/√2

__ C. √2

__ D. 2

c) The quality control manager plans to take a random sample of size n from the

production line. How big should n be so that the sampling distribution of x̄ has standard

deviation 0.3 grams?

__ A. 10

__ B. 100

__ C. 1000

__ D. Cannot be determined unless we know that the population is normal.

d) If the quality control manager takes a random sample of nine chocolate bars from the

production line, what is the probability that the sample mean weight of the nine sample

chocolate bars will be less than 240 grams?

__ A. 0

__ B. 0.0013

__ C. 0.1587

__ D. 0.9987

Show your work:


MT2010: Question 8 "Shooters for the shooters?"

A radio talk show host with a large audience is interested in the proportion p of adults in

his listening area who think the drinking age should be lowered to 18. To find out, he

poses the following question to his listeners: “Do you think that the drinking age should

be reduced to 18, in light of the fact that 18-year-olds are eligible for military service?”

He asks listeners to phone in and vote “yes” if they agree the drinking age should be

lowered and “no” if not. Of the 100 people who phoned in, 70 answered “yes”.

a) The sample estimate, p̂, of the proportion of adults who think the drinking age should

be reduced is:

__ A. 70

__ B. 0.70

__ C. 0.69

__ D. Not able to be determined from the information given

b) The standard error of this estimate is closest to:

__ A. 0.089

__ B. 0.046

__ C. 0.0021

__ D. 0.0045

c) The margin of error for a 90% confidence interval is closest to:

__ A. 0.046

__ B. 0.075

__ C. 0.090

__ D. 0.690

d) How large a sample n would you need to estimate p with margin of error 0.01 with

95% confidence? Use 0.6 as the guessed value for p.

__ A. 6768

__ B. 9220

__ C. 9502

__ D. 9596

e) Which of the following assumptions for inference about a proportion using a

confidence interval are violated in this case?

__ A. The data are a simple random sample from the population of interest

__ B. The success/failure condition

__ C. A third choice of no opinion needed to be included

__ D. There appear to be no violations


MT2010: Question 9 "Going postal"

A simple random sample of 100 Canada Post employees found that the average time

these employees had worked for the postal service was 7.0 years with standard deviation

of 2.0 years. Do these data provide evidence that the mean length of time that the

population of Canada Post employees have worked for the postal service had changed

from the value of 7.5 of 20 years ago?

a) Give the appropriate null and alternative hypotheses.

b) Give the formula for the appropriate test statistic and compute its value.

c) Give a range in which the P-value is located.

d) From the P-value associated with this test statistic, which of the following is correct?

__ A. Do not reject H0 at the 10% significance level

__ B. Reject H0 at the 10% significance but not at the 5% significance level

__ C. Reject H0 at the 5% significance level but not at the 1% significance level

__ D. Reject H0 at the 1% significance level

e) Using the 5% significance level, state your conclusion in one clearly worded sentence

that Canada Post management can understand.

f) The 95% CI for the mean time the population of postal employees have spent with the

postal service is closest to:

__ A. 7.0 ± 0.2

__ B. 7.0 ± 0.4

__ C. 7.0 ± 2.0

__ D. 7.0 ± 4.0

Bonus Question: Just for Fun and Bragging Rights

Over the 17 days of the Winter Olympics you saw the Olympic rings logo countless

times. In the official logo, not the single-colour Vancouver 2010 version, each of the five

rings is a different colour. How well do you remember the order of the colours in the

rings? Write the colours in the blanks as indicated.

________ ________ ________

Ring 1 Ring 2 Ring 3

________ ________

Ring 4 Ring 5

END OF MIDTERM 2010 – ANSWERS AND EXPLANATIONS FOLLOW


MIDTERM EXAM 2010: ANSWERS AND EXPLANATIONS

MT2010: Answer 1

a) Quantitative; Categorical; Categorical; Neither

b) C c) C d) B e) (i) 10% (ii) B (iii) C

f) True; True; False

Details and Comments:

a) Although the text considers an identifier variable, such as a Visa credit card number, as

a type of categorical variable, it is useless in that form; it is best thought of as Neither.

You aren't likely to do any analysis on the Visa card number!

b) The average must lie between the minimum and maximum, but depending on

skewness it could be smaller or larger than the midpoint or median.

c) The minimum value is 0 but the maximum can be very large, hence right-skewed.

d) All except B are likely to have a long right-hand tail, where the mean exceeds the

median.

e) (i) (5+3+2)/100 = 10%

(ii) 31% of values (2+10+19) are less than 30; including the 30-34 interval increases the

cumulative percentage to 58% (2+10+19+27).

f) Incomes are not exactly symmetric, but for all practical purposes and especially for

data analysis, they certainly are reasonably symmetric.
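The frequency-table reasoning in part e) can be checked by accumulating the counts. The sketch below uses the frequencies as printed in the question's table; the grouped mean is only an approximation that assumes every age sits at the midpoint of its printed class bounds.

```python
# Frequencies as printed in the question: (low age, high age) -> count.
frequencies = {
    (15, 19): 2, (20, 24): 10, (25, 29): 19, (30, 34): 27, (35, 39): 16,
    (40, 44): 10, (45, 49): 6, (50, 54): 5, (55, 59): 3, (60, 64): 2,
}
total = sum(frequencies.values())

# (i) Percentage aged 50 or older.
pct_50_plus = 100 * sum(f for (low, _), f in frequencies.items() if low >= 50) / total

# (ii) The median class is the first one where the running total reaches half the data.
cumulative = 0
median_class = None
for bounds, f in frequencies.items():
    cumulative += f
    if cumulative >= total / 2:
        median_class = bounds
        break

# (iii) Approximate mean, assuming each age equals its class midpoint.
grouped_mean = sum((low + high) / 2 * f for (low, high), f in frequencies.items()) / total

print(pct_50_plus, median_class, round(grouped_mean, 1))
```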

MT2010: Answer 2

a) Median = 0; Q1 = -0.675; Q3 = 0.675; IQR = 1.35

b) Prob. = 2×Pr (Z > 1.5×1.35) = 2×Pr (Z > 2.025) = 2×0.0215 = 0.0430 or about 4%

c) The boxplot is symmetric around 0, with the ends of the box at Q1 and Q3 at -0.675

and 0.675 (from part a). Since Z has no limits, the whiskers can't extend to the minimum

and maximum. Instead, use inner fences; the whiskers should extend to -2.7 and 2.7.

d) Ann has a higher rank.

Ann's z-score = (120–90)/25 = 1.2; Sarah's z-score = (135–110)/25 = 1

Details and Comments:

a) Z is symmetric so the median equals the mean.

It is acceptable to report answers to two decimal places:

For Q1: -0.68 or -0.67; for Q3: 0.68 or 0.67; for IQR: 1.36 or 1.34

b) If you used IQR of 1.36, the probability is 0.0414.

If you used IQR of 1.34, the probability is 0.0444.

c) Since the distribution is unbounded, any reasonable choice of whiskers is acceptable.
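The quartiles, the tail probability, the inner fences and the two z-scores can all be verified with Python's standard-library `statistics.NormalDist`. This is a sketch with illustrative variable names; software gives Q3 as about 0.674, while the answer uses the table value 0.675.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal

# a) Quartiles and IQR of Z.
q1 = z.inv_cdf(0.25)
q3 = z.inv_cdf(0.75)
iqr = q3 - q1

# b) Total area more than 1.5 x IQR away from the median (which is 0).
cutoff = 1.5 * iqr
p_outside = 2 * (1 - z.cdf(cutoff))

# c) Inner fences used for the whiskers: Q1 - 1.5 x IQR and Q3 + 1.5 x IQR.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# d) z-scores relative to each age group.
z_ann = (120 - 90) / 25
z_sarah = (135 - 110) / 25

print(round(q3, 3), round(iqr, 2), round(p_outside, 3), round(upper_fence, 1))
print(z_ann, z_sarah)
```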


MT2010: Answer 3

a) Yes: Percent of males admitted = 35/80 = 0.4375 or 43.75%

Percent of Females admitted = 20/60 = 0.33 or 33%

b) No: Half of engineers of either sex are admitted. One-quarter of English students of

either sex are admitted.

c) The English program is harder to get into, and that is where more females applied. This

is an illustration of Simpson's Paradox.

d)

                 Engineering   English   Row Total
Admitted              40          15         55
Not Admitted          40          45         85
Column Total          80          60        140

Admitted to Engineering: 40/80 = 0.50 or 50%

Admitted to English: 15/60 = 0.25 or 25%

Details and Comments:

When a two-way table is provided, it is useful to add the row totals and the column totals.

They are needed to compute conditional probabilities. Simpson's Paradox is one of the

most revealing illustrations of the need to dig deeper into the relationship between

categorical variables. What might appear to be the result for a two-way table may well be

reversed when a third variable is incorporated.
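
To see the reversal numerically, here is a small sketch. The four cell counts are not given in the answer key; they are reconstructed here (as an assumption) from the stated rates: 35/80 males and 20/60 females admitted overall, with 50% admitted to Engineering and 25% to English for either sex.

```python
# (applied, admitted) by sex and program.
# ASSUMPTION: counts reconstructed from the rates quoted in the answer.
applicants = {
    ("Male", "Engineering"): (60, 30),
    ("Male", "English"): (20, 5),
    ("Female", "Engineering"): (20, 10),
    ("Female", "English"): (40, 10),
}

def rate(keys):
    """Admission rate pooled over the given (sex, program) cells."""
    applied = sum(applicants[k][0] for k in keys)
    admitted = sum(applicants[k][1] for k in keys)
    return admitted / applied

# Two-way view: sex only (the third variable, program, is ignored)
by_sex = {s: rate([k for k in applicants if k[0] == s])
          for s in ("Male", "Female")}

# Two-way view: program only
by_program = {p: rate([k for k in applicants if k[1] == p])
              for p in ("Engineering", "English")}

# Three-way view: rate within each sex-by-program cell
within = {k: adm / app for k, (app, adm) in applicants.items()}

print(by_sex)
print(by_program)
print(within)
```

The pooled rates differ by sex, yet within each program the rates are identical, because most females applied to the program with the lower admission rate.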

MT2010: Answer 4

a)

b) Strong positive linear association

c) B

d) Yes; there is a clear linear relationship

e) Correlation does not imply causation.

f) No. They were the slower ones, or the easier ones to catch.

Details and Comments:

c) Since the correlation is strong and positive, only 0.975 is a sensible choice for r.

d) Correlation coefficients require linear relationships.

[Scatterplot for part a): # Eyes (vertical axis, 0 to 20) against Weight (kg) (horizontal axis, 0 to 20)]

MT2010: Answer 5

a) A b) A

c) Slope: b1 = r(sy/sx) = 0.8691(286.6104/196.4466) = 1.268

Intercept: b0 = ȳ – b1x̄ = 554.0000 – 1.268(304.6667) = 167.683 (or 167.684)

Regression line: ŷ = 167.683 + 1.268x (or ŷ = 167.684 + 1.268x)

d) ŷ = 167.683 + 1.268(555) = 871.423 (or 871.424)

e) ŷ = 167.683 + 1.268(220) = 446.643 (or 446.644)

Residual = e = 245 – 446.643 = -201.643 (or -201.644)

f) r² = 0.8691² = 0.755

g) No; this is extrapolation far beyond the range of data.

h) (i) A (ii) G (iii) B (iv) E

Details and Comments:

a) Response variable is on the vertical axis.

c) Beware of round-off error. Carry all available decimal places in the intermediate

calculations, but report fewer as instructed.

d) Simple substitution

e) Use the definition of residual: observed minus predicted.

f) This is the definition of r-squared.

g) Although it is mathematically correct to substitute 2500 into the regression equation,

extrapolation far beyond the range of data is a major misuse of regression.

h) Examine the formulas for slope, intercept and correlations and test out the effect of the

suggested changes. For (ii), correlation does not depend on units, but slope and SDs do

change if both variables change. For (iv), the scatterplot is simply moved straight up, so

SDs, slope, and correlation are not affected.
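
The calculations in parts c) to f) can be reproduced from the summary statistics alone. This is a sketch only: the variable names are illustrative, and the slope is rounded to three decimal places before it is reused, which is an assumption made here to match the first set of figures in the answer.

```python
# Summary statistics quoted in the answer to part c)
r = 0.8691
s_x, s_y = 196.4466, 286.6104
x_bar, y_bar = 304.6667, 554.0

# Slope and intercept of the least squares line
b1 = round(r * s_y / s_x, 3)   # rounded, as in the answer key
b0 = y_bar - b1 * x_bar

def predict(x):
    """Predicted response from the fitted line."""
    return b0 + b1 * x

y_hat_555 = predict(555)        # part d)
y_hat_220 = predict(220)        # part e)
residual = 245 - y_hat_220      # observed minus predicted
r_squared = r ** 2              # part f)

print(b1, round(b0, 3))
print(round(y_hat_555, 3), round(y_hat_220, 3), round(residual, 3))
print(round(r_squared, 3))
```

Carrying the unrounded slope through instead changes the last decimal place of the intercept, which is why the answer lists two acceptable values.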

MT2010: Answer 6

a) Mean (X–Y) = Mean (X) – Mean (Y) = 48 – 8 = 40

Var (X–Y) = Var (X) + Var (Y) = 36 + 4 = 40

SD (X–Y) = √40 = 6.32

b) Mean (X+Y) = Mean (X) + Mean (Y) = 48+38 = 86

Var (X+Y) = Var (X) + Var (Y) = 36 + 25 = 61

SD (X+Y) = √61 = 7.81

c) Yes: Heavier ribbons are not expected to be found only on heavier medals.

d) A

Details and Comments:

a) and b) The variance of a sum or difference of two independent variables is always the

sum of the individual variances. Remember that calculations are not done with standard

deviations; combine variances first and then take the square root.

d) The sum of two random variables generally has greater variability than a single

random variable. However, if the question had asked about the “mean” of two measures

rather than the sum, then the mean would have lesser variability than a single measure.
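
The rule "combine variances first, then take the square root" can be sketched as follows. The helper name combine is hypothetical, and the rule as coded assumes the two variables are independent.

```python
from math import sqrt

def combine(mean_x, var_x, mean_y, var_y, sign=1):
    """Mean and SD of X + Y (sign=1) or X - Y (sign=-1) for independent X and Y."""
    mean = mean_x + sign * mean_y
    sd = sqrt(var_x + var_y)   # variances add for sums AND differences
    return mean, sd

# Part a): X - Y with Mean(Y) = 8, Var(Y) = 4
diff_mean, diff_sd = combine(48, 36, 8, 4, sign=-1)

# Part b): X + Y with Mean(Y) = 38, Var(Y) = 25
sum_mean, sum_sd = combine(48, 36, 38, 25, sign=1)

print(diff_mean, round(diff_sd, 2))
print(sum_mean, round(sum_sd, 2))
```

Adding the standard deviations directly (6 + 2 or 6 + 5) would overstate the variability in both cases.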

MT2010: Answer 7

a) 68% b) B c) B d) B

Details for d): Pr (x̄ < 240) = Pr (z < [240–243] / [3/√9]) = Pr (z < -3) = 0.0013

Details and Comments:

a) Use the 68/95/99.7 (or Empirical) Rule: 243 ± 1s = 243 ± 3 = (240 , 246)

b) Remember that "standard deviation of the sampling distribution" is another name for "standard error" and the standard error of the sample mean is σ/√n. If the sample size becomes 2n, then σ/√(2n) = (1/√2)(σ/√n), so the new SE is (1/√2) times the old SE.

c) σ/√n = 3/√n = 0.3, so √n = 3/0.3 = 10, and n = 100

d) Since the question is about the sample mean, the standardization uses σ/√n = 3/√9.
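
Parts b) to d) all rest on the standard error σ/√n, and can be sketched as below. The sample size n = 9 used for part d) is an assumption: it is the value implied by the z-score of -3 in the answer, not one stated in this answer key.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 243.0, 3.0

def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the sample mean."""
    return sigma / sqrt(n)

# Part b): doubling the sample size multiplies the SE by 1/sqrt(2), for any n
ratio = standard_error(sigma, 2 * 50) / standard_error(sigma, 50)

# Part c): sample size needed for an SE of 0.3
n_needed = round((sigma / 0.3) ** 2)

# Part d): Pr(sample mean < 240), ASSUMING n = 9
n_assumed = 9
z_score = (240 - mu) / standard_error(sigma, n_assumed)
prob = NormalDist().cdf(z_score)

print(round(ratio, 4), n_needed, z_score, round(prob, 4))
```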

MT2010: Answer 8

a) B b) B c) B d) B e) A

Details and Comments:

a) Reason: p̂ =70/100 = 0.70

b) Reason: SE(p̂) = √(p̂q̂/n) = √((0.70)(0.30)/100) = 0.046

c) Reason: 1.645(0.046) = 0.075

d) Reason: n = (1.96²)(0.6)(0.4)/(0.01²) = 9220

e) The data are a convenience sample since people choose whether or not to phone in!
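
The arithmetic behind parts a) to d) can be sketched as follows. The function name sample_size is hypothetical; the multipliers 1.645 and 1.96 are the ones used in the answer.

```python
from math import sqrt, ceil

# Parts a) and b): sample proportion and its standard error
n = 100
p_hat = 70 / n
se = sqrt(p_hat * (1 - p_hat) / n)

# Part c): margin of error with the multiplier 1.645
margin = 1.645 * se

def sample_size(z_star, p, margin_of_error):
    """Smallest n giving the requested margin of error for a proportion."""
    return ceil(z_star ** 2 * p * (1 - p) / margin_of_error ** 2)

# Part d): n for a margin of 0.01 with p = 0.6 and multiplier 1.96
n_required = sample_size(1.96, 0.6, 0.01)

print(round(se, 3), round(margin, 3), n_required)
```

The sample size is always rounded up, since rounding down would give a margin slightly larger than requested.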

MT2010: Answer 9

a) H0: μ = 7.5; Ha: μ ≠ 7.5

b) t = (ȳ – μ0)/(s/√n) = (7.0 – 7.5)/(2.0/√100) = -2.5

c) 0.01 < P-value < 0.02 d) C

e) There is evidence to say that the mean length of time employees have worked for

Canada Post has changed from 20 years ago.

f) B. 7.0 ± 1.984 (2.0/√100) = 7.0 ± 0.4

Details and Comments:

a) This is a t-test of a single mean. The alternative hypothesis is two-tailed since the

question asks whether the mean length of time had "changed" from the value of 7.5. This

is not a directional hypothesis.

b) & c) Note that although the value of the test statistic is negative, you look up the

positive value in the t-table to find the P-value.

d) Since the P-value is less than 0.05, reject the null hypothesis at the 5% level; but since

the P-value is greater than 0.01, do not reject the null hypothesis at the 1% level.

e) The conclusion must be in terms of the original question, not just “Reject H0”.

f) The multiplier is based on a t with n–1 = 99 df. Use the value for 100 df in the table.
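
The test statistic in part b) and the interval in part f) use the same standard error, as this sketch shows. The multiplier 1.984 is taken from the t-table (100 df) as in the answer, since Python's standard library has no t distribution; the variable names are illustrative.

```python
from math import sqrt

# Sample mean, hypothesized mean, sample SD and sample size from the answer
y_bar, mu_0, s, n = 7.0, 7.5, 2.0, 100

se = s / sqrt(n)                 # standard error of the sample mean
t_stat = (y_bar - mu_0) / se     # part b)

t_star = 1.984                   # table value for 100 df
margin = t_star * se             # part f)
interval = (y_bar - margin, y_bar + margin)

print(round(t_stat, 2), round(margin, 2))
print(tuple(round(v, 2) for v in interval))
```

The hypothesized value 7.5 lies outside the interval, which agrees with rejecting the null hypothesis at the 5% level in part d).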

Bonus Question: Olympic Rings colours

BLUE BLACK RED

YELLOW GREEN

END OF ANSWERS AND EXPLANATIONS TO MIDTERM 2010

Part B. Past Years’ Midterm Exams

A collection of questions from midterm exams of past years, with

answers and explanations

This section presents questions from midterm exams in recent years. Since course

syllabi, textbooks, order of topics, and even notation, have changed, not every question

from past exams is relevant today. So the exam questions have been reorganized by broad

topic area as follows:

Section A: Descriptive Statistics

Section B: Scatterplots, Association, Correlation, Least Squares Regression

Section C: Normal Curve, Sampling Distributions, Combining Random Variables

Section D: Introduction to Inference, Confidence Intervals, Hypothesis Tests

Section E: Miscellaneous

Questions in each topic area are arranged from the most recent year and go back in time.

Following the questions in each topic area is a set of answers and explanations/comments

about the answers. The comments give details of calculations and common errors made

by students.

Since the teaching of any course is dynamic and always undergoing change, there may

still be some terminology or notation or even a few parts of questions which are

unfamiliar to you. If you are unclear whether a particular question or topic is relevant to

the current year, please ask your instructor.

SECTION A: DESCRIPTIVE STATISTICS

Question A1 (MT2009–Q1) “Not yet an Olympic Sport”

NOTE: Parts g), h) and j) require material from later sections.

The following boxplots show the distributions of ages of the UBC male and female

underwater basket-weaving teams:

a) Which team has more members? (Circle the correct response)

Male Female Can’t tell Same size

b) Given the information provided, which of the following is most likely the mean age of

the female team? (Circle the correct response)

21 22 23 30

c) For each of the three measures below, fill in the numerical value in the blank provided

and then decide if each is a measure of shape, centre, spread, or none of these (circle one

choice for each measure):

Value: Is a measure of:

Interquartile range (for males): _________ Shape Centre Spread None

50th percentile (for females): _________ Shape Centre Spread None

Oldest male member: _________ Shape Centre Spread None

d) The distribution of male ages is: (circle the correct response)

Symmetric Skewed to the left Skewed to the right

e) The distribution of female ages is: (circle the correct response)

Symmetric Skewed to the left Skewed to the right

f) The mean male age is 22.5 years. One of the members of the male team is 22 years old

and has a z-score of -0.25. What is the standard deviation of male ages?

g) If we assume that male ages are normally distributed, what proportion of males on the

team are 22 years of age or younger?

h) Which of the following is the best justification for the assumption of normality made

in part g)? (Check the best response)

__ A. The Law of Large Numbers

__ B. The Central Limit Theorem

__ C. Least squares regression

__ D. None of the above

i) Team members are required to take a course in the history of underwater basket-

weaving. The professor records the values of several variables for each student. These

variables are listed below. For each one, decide whether it has been recorded as

quantitative or categorical.

Score on the final exam (out of 200 points) Quantitative Categorical

Final grade for the course (A, B, C, D, or F) Quantitative Categorical

The number of lectures the student missed Quantitative Categorical

Brand name of favorite swimsuit Quantitative Categorical

j) Universities across North America require underwater basket-weaving students to take

a quantitative skills test. Percentage scores on this test have a mean of 30% and a

standard deviation of 10%. Give a range within which you would expect to find the

middle 95% of all North American underwater basket-weaving student test scores.

Question A2 (MT2008–Q1) “There are two kinds of data -- good and bad!”

a) Here is a small part of the data set in which CyberStat Corporation records information

about its employees.

Employee # Surname Age Gender Salary Job Type

11234 Smith 39 Female $62,100 Management

23467 Jones 27 Male $47,350 Technical

98543 Chan 22 Female $25,250 Clerical

76548 Wong 48 Male $77,600 Management

Circle the names of the variables below which are recorded as quantitative scale variables

in the data set above.

Employee # Surname Age Gender Salary Job Type

b) Three small Statistics classes all took the same test. Histograms of the scores for each

class are shown below.

(i) Which class had the highest mean score? 1 2 3

(ii) Which class had the highest median score? 1 2 3

(iii) For which class are the mean & median most different? 1 2 3

(iv) Which class had the smallest standard deviation? 1 2 3

c) For each of these variables, decide whether its distribution is more likely symmetric or

skewed right (i.e. long right-hand tail) or skewed left (i.e. long left-hand tail). Circle one

choice for each variable.

Individual incomes in the United States Symmetric Skewed right Skewed left

Age of male heart attack victims Symmetric Skewed right Skewed left

Lifetimes of electric light bulbs Symmetric Skewed right Skewed left

IQ scores of the Canadian population Symmetric Skewed right Skewed left

Question A3 (MT2008–Q2) “A Nash-ional Game”

The data set to the right contains all the point differentials or margins in all NBA

games played by the Phoenix Suns up to February 13 of the 2007/08 season.

Negative numbers indicate losses, positive numbers indicate wins. The data have

been arranged in ascending order for you (biggest loss to biggest win).

a) Compute the various numerical summaries and put them into the table below

part b) under “original data.” Some have been computed for you.

NOTE: Part b) is not part of the current curriculum. You can ignore it. But

think of it as a challenge question. It is easy to figure out. Instructions

are given in the Answers/Comments.

b) Suppose the data undergo a transformation such that X* = 2X – 3, where X =

the original variable and X* is the “new,” transformed variable. Find all of the

numerical summaries for X* and put them into the table below under

“transformed data”.

            Original Data (X)   Transformed Data (X*)
Mean              5.6
Median
Range
Q1
Q3
IQR
Std dev          11.7

c) Are there any outliers? Use the “inner fences” definition of outliers and the

original data (not the transformed data) to identify any outliers.

Lower inner fence = ___________________

Upper inner fence = ___________________

Observation numbers of outliers = _______

Obs#  Margin    Obs#  Margin    Obs#  Margin    Obs#  Margin
  1    -22       14     -3       27      7       40     11
  2    -21       15     -2       28      7       41     14
  3    -15       16     -2       29      8       42     15
  4    -10       17      1       30      8       43     15
  5     -9       18      2       31      8       44     16
  6     -7       19      2       32      9       45     19
  7     -7       20      4       33     10       46     19
  8     -7       21      4       34     10       47     20
  9     -6       22      4       35     10       48     20
 10     -5       23      5       36     11       49     22
 11     -4       24      5       37     11       50     24
 12     -3       25      5       38     11       51     31
 13     -3       26      6       39     11       52     33

Question A4 (MT2007–Q1) “Data, data, data! I can’t make bricks without clay!” –

Sherlock Holmes

a) A sample of shoppers at a mall was asked the following questions. Decide whether the

type of data are more likely to be quantitative or categorical. (Circle your choice)

What is your age (in years)? Categorical Quantitative

How much did you spend (in $)? Categorical Quantitative

What is your marital status? Categorical Quantitative

Rate the availability of parking. Categorical Quantitative

(Excellent, Good, Fair, Poor)

b) Here is a table of sources of electricity in Canada and the US and the percentage of

electricity generated by each. Construct a bar graph to compare Canada and the US.

Do NOT use separate sets of axes for each graph.

Source Canada US

Hydropower 65 6

Coal 16 51

Nuclear 16 21

Natural Gas 1 16

Other 2 6

c) A news article reports that, “Of the 411 players on National Basketball Association

rosters in February 1998, only 139 made more than the league _______ salary of $2.36

million.” Which word should go in the blank, mean or median? That is, is $2.36 million

the mean or median salary for NBA players? Explain why, in one sentence only.

d) A study was made of the age of entering first-year university students. Which of the

following is most likely to be the standard deviation? Explain why, in one sentence only.

__A. 1 month

__B. 1 year

__C. 5 years

e) The following histogram displays the December 2000 percentage unemployment rates

in the 50 U.S. states and Puerto Rico. The labels on the horizontal axis should be

interpreted as follows: the bar labelled “1” represents rates of 1.0% to 1.9%, the bar

labelled “2” represents rates of 2.0% to 2.9%, etc.

(i) What percentage of the rates (out of a total of 51 observations) is 5.0% or greater?

(ii) Estimate the median unemployment rate.

f) You have decided to sell your home. The market is booming now with the 2010

Olympic Games preparations, and therefore most sellers of houses with similar

characteristics have received extremely good deals in the past few months. You ask the

realtor for a summary of net prices of homes sold in your neighborhood. The realtor

hands you the following two density curves, one of them of the prices of homes sold in

the past few months in your neighborhood, and the other of the prices of homes sold

during a deep economic recession.

Curve A Curve B

(i) Under the given assumptions, which of the two curves better represents the

distribution of prices of homes sold in the past few months? Circle your answer choice.

Curve A Curve B

(ii) A potential buyer offers to give you the mean, the median or the mode of the prices of

all the homes sold in the past few months in your neighborhood. Assuming that the

density curve is the one you chose in (i) directly above, which numerical measure would

you prefer? Circle your answer choice.

If you chose Curve A: Mean Median Mode

OR:

If you chose Curve B: Mean Median Mode

(iii) You are told that the mean price of 50 houses sold is $700,000. However, you notice

that there was a mistake in the calculation, and that one of the buyers paid $500,000

instead of the $800,000 that was used when making this calculation. What is the actual

mean price of the 50 houses sold?

[Histogram for part e): Number of States (vertical axis, 0 to 24) by Unemployment Rate (horizontal axis, bars labelled 1 to 8)]

Question A5 (MT2007–Q2) “Teach Your Children” – Book of Deuteronomy

In 2003, the salaries (in $) of secondary school classroom teachers in the United States

gave the following descriptive statistics:

Minimum = 31,200

Q1 = 37,400

Median = 40,000

Q3 = 48,400

Maximum = 57,200

a) Find the range and interquartile range.

Range = _______ Interquartile range = _______

b) Are there any outliers, as defined by the 1.5 × IQR rule (a.k.a. inner fences)?

Explain.

No Yes

c) Predict the direction of skewness for this distribution. Explain.

d) If the distribution, although somewhat skewed, is approximately bell-shaped, which

one of the following would be the most realistic value for the standard deviation?

__A. 100

__B. 1,000

__C. 5,000

__D. 25,000

e) NOTE: Part e) is not part of the current curriculum; it is similar to Question A3 b) above. Detailed instructions are given in the Answers / Comments to Question A3 b).

It is very worthwhile to learn this!

Suppose each teacher were to receive a 10% increase in salary for 2004. For each of the

four statistics in the left-hand column, write the number of the phrase from the list in the

right-hand column that states how each statistic will change (Note: Some phrases may be

used more than once; others may not be used at all.)

Write the phrase number where indicated by the arrows.

↓▼↓

____ Median 1. Will be multiplied by

____ IQR 2. Will be multiplied by 0.10

____ Range 3. Will be multiplied by 1.10

____ Standard Deviation 4. Will remain unchanged

Question A6 (MT2006–Q1) “Call centre (and spread and shape)”

Here are the numbers of calls answered by 20 workers in a call centre on a particular day: 13 13 14 16 17 17 18 18 19 19

19 19 20 20 21 21 22 24 25 25

The mean number of calls is 19 and the standard deviation is 3.49.

a) What is the variable in this dataset?

__A. Days

__B. Call centre workers

__C. Length of calls

__D. Numbers of calls per day

b) Describe this distribution with a five-number summary: (Min, Q1, Median, Q3, Max).

Min = ___ Q1 = ___ Median = ___ Q3 = ___ Max = ___

c) Compute the “inner fences” using the 1.5 × IQR criterion.

Identify any outliers (using inner fences) _________________ (If none, write None)

Consider the following histograms.

d) Which of these two histograms describes the dataset given at the start of the question?

__A. Histogram (a)

__B. Histogram (b)

__C. Both of them

__D. Neither one of them

e) Assuming that a workday in the call centre is 8 hours long and the workers are on the

phone 60% of the time, what is the mean length of a call?

__A. About 15 minutes

__B. About 19 minutes

__C. About 25 minutes

__D. Cannot tell, since the dataset does not contain data about individual calls

(a) [Histogram: Count (vertical axis, 0 to 10) by Number of calls; bins 11.5-14.5, 14.5-17.5, 17.5-20.5, 20.5-23.5, 23.5-26.5]

(b) [Histogram: Count (vertical axis, 0 to 7) by Number of calls; bins 11.5-13.5, 13.5-15.5, 15.5-17.5, 17.5-19.5, 19.5-21.5, 21.5-23.5, 23.5-25.5]

SECTION A: ANSWERS AND EXPLANATIONS

Answer to Question A1 (MT2009–Q1)

a) Can’t tell

b) 23

c) IQR = 3, Spread; 50th percentile = 22, Centre; Oldest male = 27, None

d) Symmetric

e) Skewed to the right

f) Z = -0.25 = (22–22.5)/σ, so σ = (22–22.5)/(-0.25) = 2

g) Pr(Z < -0.25) = 0.4013

h) D. None of the above

i) Quantitative, Categorical, Quantitative, Categorical

j) Empirical (68-95-99.7) Rule: 30 ± 20 = (10 , 50) (Also accept 30 ± 19.6)

Note: Parts g), h) and j) are about “Sampling Distributions and the Normal Model”.

Check your notes or the textbook.

Details and Comments:

a) Boxplots do not show sample sizes; they only show: min, Q1, median, Q3, and max.

b) Since the age distribution for females is strongly skewed to the right, the mean is

greater than the median. The median (from the graph) is 22, so the mean must be a little

larger, hence 23. Note that 30 is close to the maximum and far above Q3 so it is not a

realistic estimate of the mean.

c) IQR (Males) = 24 – 21 = 3; 50th percentile (Females) = median = 22; Oldest Male = max = 27

f) Use the formula for standardizing X to Z; however, here both the values of X and Z are

given and it is the value of σ which is unknown.

h) The Central Limit Theorem cannot be used as the reason here since the sample is

unlikely to be large.
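
Parts f), g) and j) can be checked with a few lines of code. This is a sketch with illustrative variable names; part g) relies on the normality assumption stated in the question.

```python
from statistics import NormalDist

# Part f): solve z = (x - mean) / sigma for sigma
mean_age, x, z = 22.5, 22, -0.25
sigma = (x - mean_age) / z

# Part g): proportion aged 22 or younger under a normal model
proportion = NormalDist(mean_age, sigma).cdf(x)

# Part j): middle 95% by the 68-95-99.7 Rule (mean plus or minus 2 SDs)
test_mean, test_sd = 30, 10
middle_95 = (test_mean - 2 * test_sd, test_mean + 2 * test_sd)

print(sigma, round(proportion, 4), middle_95)
```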

Answer to Question A2 (MT2008–Q1)

a) The quantitative variables are Age and Salary.

b) Answers: 3, 3, 3, 1.

c) Answers:

Individual incomes in the United States Skewed right (long right-hand tail)

Age of male heart attack victims Skewed left (long left-hand tail)

Lifetimes of electric light bulbs Skewed right (long right-hand tail)

IQ scores of the Canadian population Symmetric (equal tails)

Details and Comments:

a) Gender and Job Type are categorical; Employee # and Surname are simply strings and

used as identifier variables. Taking the mean of the Employee # would not make sense.

b) Class 3 has much more area to the right than Class 1 or Class 2 so the mean and

median are also shifted to the right. And since the histogram for Class 3 shows the

greatest skewness, it has the greatest difference between mean and median. Class 1 is less

spread out (the tails are both smaller than in the other two classes) so it has the smallest

standard deviation.

c) Incomes are skewed right because fewer people have very large incomes, more people

have incomes at the lower end or middle.

Age of heart attack victims is skewed left because heart attacks are much more likely in

older people.

Lifetimes of bulbs are skewed right because most bulbs last the amount of time they are

engineered to last but some will last much longer; that is, quality is designed in. Only a

few will fail early. Lifetimes in general are skewed right.

Answer to Question A3 (MT2008–Q2) a) and b)

            Original Data (X)   Transformed Data (X*)
Mean              5.6                  8.2
Median            6.5                 10
Range            55                  110
Q1               -3                   -9
Q3               11                   19
IQR              14                   28
Std dev          11.7                 23.4

c) Lower inner fence = -3 – 1.5(14) = -24

Upper inner fence = 11 + 1.5(14) = 32

Observation numbers of outliers = 52

Details and Comments:

Note that the question asked for the observation number(s), not the margin!

For part b): Suppose the data are transformed (linearly) as follows X* = a + bX; that is,

multiply the original observations by “b” and then add “a”. That shifts all the values of X

up or down by the amount “a” and changes the size of the unit of measurement by “b”.

Mean(X*) = a + b×Mean(X);

Median (X*) = a + b×Median(X);

Range(X*) = b×Range(X); [the effect of “a” is cancelled]

Q1(X*) = a + b×Q1(X);

Q3(X*) = a + b×Q3(X);

IQR(X*) = b×IQR(X); [the effect of “a” is cancelled]

SD(X*) = b×SD(X); [the effect of “a” is cancelled]
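
The summaries, fences and transformation rules for this question can be verified directly from the 52 margins. This is a sketch only: the helper five_number is hypothetical, and it assumes quartiles are taken as the medians of the lower and upper halves of the sorted data.

```python
from statistics import mean, median, stdev

# Point margins, in ascending order, so list position + 1 is the Obs#
margins = [-22, -21, -15, -10, -9, -7, -7, -7, -6, -5, -4, -3, -3, -3, -2, -2,
           1, 2, 2, 4, 4, 4, 5, 5, 5, 6, 7, 7, 8, 8, 8, 9, 10, 10, 10,
           11, 11, 11, 11, 11, 14, 15, 15, 16, 19, 19, 20, 20, 22, 24, 31, 33]

def five_number(data):
    """Min, Q1, median, Q3, max (quartiles = medians of the two halves)."""
    data = sorted(data)
    half = len(data) // 2
    return (data[0], median(data[:half]), median(data),
            median(data[-half:]), data[-1])

low, q1, med, q3, high = five_number(margins)
iqr = q3 - q1

# Part c): inner fences and the observation numbers outside them
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_obs = [i + 1 for i, m in enumerate(margins)
               if m < lower_fence or m > upper_fence]

# Part b): X* = a + bX with a = -3 and b = 2
a, b = -3, 2
transformed = [a + b * x for x in margins]
t_low, t_q1, t_med, t_q3, t_high = five_number(transformed)

print(round(mean(margins), 1), med, high - low, q1, q3, iqr)
print(lower_fence, upper_fence, outlier_obs)
print(t_med, t_high - t_low, t_q1, t_q3, t_q3 - t_q1)
```

The table in the answer applies the rules to the rounded values 5.6 and 11.7, so its transformed mean and standard deviation may differ in the last digit from values computed on the raw data.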

Answer to Question A4 (MT2007–Q1)

a) What is your age (in years)? Quantitative

How much did you spend (in $)? Quantitative

What is your marital status? Categorical

Rate the availability of parking Categorical

b)

OR

c) “Of the 411 players on National Basketball Association rosters in February 1998, only

139 made more than the league MEAN salary of $2.36 million.” If it were the median,

then half of the 411 players (i.e. 205 or 206) would exceed the value.

d) 1 year is the typical difference in age between entering first-year university students.

e) (i) 5/51 = 0.098, so 9.8%. It is also acceptable to round to 10%.

(ii) The median is in the 3.0-3.9 interval, so the median is best estimated as the midpoint

of that interval at 3.5%.

Comment: It is also acceptable to give the range 3.0-3.9. It is not acceptable to estimate

the median as 3.0%.

f) (i) Curve B

(ii) If you chose Curve A: Mean

If you chose Curve B: Mode

Note: The two choices offered in part (ii) are to give you a chance to get the correct

answer to part (ii) even if you made the wrong choice in part (i).

(iii) [(50×700,000) – 300,000]/50 = $694,000

Comment: Use the formula for mean and adjust accordingly.
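
The adjustment in part f) (iii) amounts to rebuilding the total, swapping the wrong value for the right one, and dividing again. A minimal sketch, with illustrative variable names:

```python
n = 50
reported_mean = 700_000
wrong_value, correct_value = 800_000, 500_000

# Total implied by the reported mean, then corrected for the one bad entry
reported_total = n * reported_mean
corrected_total = reported_total - wrong_value + correct_value
corrected_mean = corrected_total / n

print(corrected_mean)
```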

Answer to Question A5 (MT2007–Q2)

a) Range = $26,000 (i.e. 57,200 – 31,200)

Interquartile range = $11,000 (i.e. 48,400 – 37,400)

b) No: Q1–1.5×IQR = 20,900; Q3+1.5×IQR = 64,900

There are no values outside this range, so there are no outliers.

c) The distribution is skewed to the right (i.e. long right-hand tail) since the distance

between the median and Q3, and between Q3 and the maximum, are greater than the

distance between the median and Q1, and between Q1 and the minimum.

d) C. 5,000

Use the rule of thumb that SD is approx. equal to Range/6 = 26,000/6 = 4333

e) 3, 3, 3, 3

Details and Comments:

All four quantities are multiplied by 1.10 (that’s what a 10% increase means). Since each

data value is multiplied by 1.10, so are the minimum and maximum (and therefore the

range), and the median. And when all values are scaled up by a constant, the standard

deviation is also scaled up by the same constant. See Answer to Question A3 (MT2008–Q2) for

details of linear transformations.
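
Parts a), b), d) and e) can be worked through from the five-number summary alone. This sketch uses illustrative names; the 10% raise is applied with integer arithmetic (multiply by 11, divide by 10) purely to avoid floating-point noise.

```python
# Five-number summary of 2003 salaries (in $)
minimum, q1, med, q3, maximum = 31_200, 37_400, 40_000, 48_400, 57_200

# Part a)
data_range = maximum - minimum
iqr = q3 - q1

# Part b): inner fences; only the extremes need checking
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
has_outliers = minimum < lower or maximum > upper

# Part d): rule of thumb, SD is roughly Range / 6
sd_guess = data_range / 6

# Part e): a 10% raise multiplies every salary by 1.10
raised = [v * 11 // 10 for v in (minimum, q1, med, q3, maximum)]
new_range = raised[4] - raised[0]
new_iqr = raised[3] - raised[1]

print(data_range, iqr, lower, upper, has_outliers, round(sd_guess))
print(raised[2], new_range, new_iqr)
```

The new median, range and IQR are each 1.10 times the old values, in line with answer 3 for every statistic in part e).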

Answer to Question A6 (MT2006–Q1)

Here are the numbers of calls answered by 20 workers in a call centre on a particular day: 13 13 14 16 17 17 18 18 19 19

19 19 20 20 21 21 22 24 25 25

The mean number of calls is 19 and the standard deviation is 3.49.

a) D. Numbers of calls per day

b) Min = 13; Q1 = 17; Median = 19; Q3 = 21; Max = 25

c) IQR = 4: Inner fences: 11 and 27. There are no outliers.

d) C. Both of them

e) A. About 15 minutes

Details and Comments:

b) Q1 is between the 5th and 6th observations; Q3 is between the 15th and 16th observations. Fortunately, here the 5th and 6th observations are the same, and the 15th and 16th observations are the same, so there is no ambiguity about Q1 or Q3.

c) IQR = 21–17 = 4

Lower inner fence =17–(1.5)(4) = 11; Upper inner fence = 21+(1.5)(4) = 27

No data values are less than 11 or more than 27.

d) Notice the effect that changing the bins (i.e. class intervals) has on the look of the

histogram.

e) Mean call length = (0.60 × 8 × 60 minutes) / 19 calls = 288/19 ≈ 15 minutes
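
All five parts can be checked from the 20 data values. The sketch below uses hypothetical helper names, takes quartiles as medians of the two halves, and counts each value in the bin whose lower edge it meets or exceeds.

```python
from statistics import mean, median

calls = [13, 13, 14, 16, 17, 17, 18, 18, 19, 19,
         19, 19, 20, 20, 21, 21, 22, 24, 25, 25]

# Part b): five-number summary
half = len(calls) // 2
q1, med, q3 = median(calls[:half]), median(calls), median(calls[half:])

# Part c): inner fences and outliers
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outliers = [c for c in calls if c < fences[0] or c > fences[1]]

def bin_counts(data, start, width, bins):
    """Counts per bin for equal-width bins beginning at `start`."""
    return [sum(start + i * width <= x < start + (i + 1) * width for x in data)
            for i in range(bins)]

# Part d): the same data under the two sets of class intervals
counts_a = bin_counts(calls, 11.5, 3, 5)   # histogram (a)
counts_b = bin_counts(calls, 11.5, 2, 7)   # histogram (b)

# Part e): minutes on the phone per day divided by mean calls per day
mean_call_minutes = 0.60 * 8 * 60 / mean(calls)

print(min(calls), q1, med, q3, max(calls), fences, outliers)
print(counts_a, counts_b, round(mean_call_minutes, 1))
```

Comparing counts_a with counts_b shows how the same 20 values produce differently shaped histograms when the bin width changes.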

SECTION B: SCATTERPLOTS, ASSOCIATION, CORRELATION,

LEAST SQUARES REGRESSION

Question B1 (MT2009–Q2) “The Need for Speed”

Highway planners investigated the relationship between traffic density (number of cars

per mile) and the average speed of the traffic on a large city thoroughfare. The data were

collected at the same location at 10 different times over a span of three months. They

found a mean traffic density of 70 cars per mile (cpm) with standard deviation of 27 cpm;

this variable is called Cars. Overall the cars’ average speed was 26.5 mph with standard

deviation of 10 mph; this variable is called Speed. These researchers found the least

squares regression line for predicting Speed from Cars to be: .

a) Compute the value of the correlation coefficient between Speed and Cars.

b) What percent of the variation in average speed is explained by traffic density? (Round

your answer to the nearest whole percent.)

c) Predict the average speed of traffic on the thoroughfare when traffic density is 50 cpm.

d) Using the prediction you made in part c), what is the value of the residual for a traffic

density of 50 cpm when the observed speed was 32.5 mph?

e) What prediction would you make for the average speed when the traffic density is

unknown?

f) The data set initially included the point Cars = 125 cpm, Speed = 55 mph. This point

was considered an outlier and was not included in the analysis. Will the slope of the least

squares regression line increase, decrease or stay the same if we redo the analysis and

include this point?

Increase Decrease Stay the same

g) A Canadian member of the research team measured the speed of the cars in kilometres

per hour (1 km ≈ 0.62 miles) and the traffic density in cars per kilometre. What is the

value of his calculated correlation between speed and traffic density?

h) Does this study demonstrate that traffic density is a causal factor in explaining the

average speed of the traffic on the thoroughfare?

i) Suppose another researcher got confused about which was the response variable and

which was the explanatory variable and computed a linear regression model to predict

Cars from Speed. What would the slope of this line be? Report a maximum of two

decimal places.

Question B2 (MT2008–Q4) “Drawing a wine line”

There is some evidence that drinking moderate amounts of wine helps prevent heart

attacks. Researchers collected data on yearly wine consumption (litres of alcohol from

drinking wine, per person) and yearly deaths from heart disease (deaths per 100,000

people) in 19 developed nations, including Canada.

a) The scatterplot below shows the relationship between heart disease death rates and

wine consumption for the 19 developed nations.

Describe the shape, direction and strength of this relationship by circling the best choice.

Shape: Linear Curved No Pattern

Direction: Positive Negative Neither

Strength: Very strong Fairly strong Quite weak

b) The correlation between heart disease death rates and national wine consumption is

r = -0.843. What does a negative correlation say about wine consumption and heart

disease deaths? Answer this question by circling the appropriate italicized words below:

High wine consumption goes with (more / fewer) heart disease deaths,

while low consumption goes with (more / fewer) deaths.

c) Do you think these data give good evidence that drinking wine causes a reduction in

heart disease deaths? Explain why, in one sentence only.

d) About what percent of the variation among countries in heart disease rates is explained

by the straight-line relationship with wine consumption? Report the percentage to the

nearest whole number.

e) The least squares regression line for predicting heart disease death rate from wine

consumption is:

Use this equation to predict the heart disease rate in another country where adults average

4 litres of alcohol from drinking wine each year.

f) What is the predicted heart disease rate for a country that drinks enough wine to supply

150 litres of alcohol per person? Can this result be true? Explain why using the least-

squares regression line for this prediction is not justified.

g) Which of the three figures below corresponds to the plot of least-squares residuals

versus national wine consumption? Hint: Look at the vertical axis.

Circle your choice: Graph (a) Graph (b) Graph (c)

Question B3 (MT2008–Q5) “Causal relationships differ from casual relationships!”

a) Here are some general statements about regression. For each one, decide whether it is

True (T) or False (F) and circle the letter of your answer.

T F The least-squares residual is defined as the difference between an

observed value of the explanatory variable and the value predicted

by the least-squares regression line.

T F The least-squares residuals add up to 0.

T F If the least-squares regression line fits the data poorly, the residual

plot will exhibit a systematic pattern.

T F Removing an influential observation will markedly change the

least-squares regression line.

b) Someone says, “There is a strong positive correlation between the number of

firefighters at a fire and the amount of damage the fire does. So sending lots of

firefighters just causes more damage.” Explain why this reasoning is wrong, in one

sentence only.

c) A study shows that there is a positive correlation between the size of a hospital

(measured by its number of beds, X) and the median number of days, Y, that patients

remain in the hospital. Does this mean that you can shorten a hospital stay by choosing a

small hospital? Explain why, in one sentence only!

Circle one: Yes No

d) Which set of two variables is most likely to have a cause and effect relationship?

__ A. The height of a person and their corresponding weight

__ B. The weight of a box and the postage required to ship the box to Toronto

__ C. The make of a car and the fuel efficiency (miles per gallon) of the car

__ D. The age of a teacher and their corresponding yearly income

Question B4 (MT2007–Q5) “Good relationships are everything”

a) Rank the following in order, from highest to lowest, by the correlations you would

estimate there to be between:

A Purchase price of a home and the yearly income of the purchasing family

B Purchase price of a home and the shoe size of the purchaser

C Asking price and corresponding selling price of homes in Greater Vancouver

D Purchase price of a home and age of the buyer

Write the letters in order from highest to lowest.

(Highest) __ __ __ __ (Lowest)

b) A survey of 3000 medical records showed that smokers are more inclined to get

depressed than non-smokers. Does this necessarily imply that smoking causes depression?

Explain, in one sentence only!

c) For each of the eight sections of last year’s C291 course, the average midterm exam

mark and the average final exam mark were calculated. The correlation for the eight pairs

of averages was +0.97. Does this mean that the relationship between a student’s midterm

and final exam marks for all students in the eight classes is almost exactly a

straight line? Explain, in one sentence only!

__Yes __No

d) A study is made of people who stutter. Each subject is asked to read two passages of

equal length, and the number of times they stutter while reading each passage is recorded.

The researchers discover that the subjects who stuttered many times on the first passage

tended to stutter fewer times on the second passage. They conclude that the subjects who

stuttered many times on the first passage must have been nervous the first time and more

relaxed the second time, so that they tended to stutter less. Do you agree? Explain, in one

sentence only.

__Yes __No

e) Studies show that in the period from 1850 to 1900 in the United States, the average

marriage lasted only 12 years. Does this show that the divorce rate was high in that

period? Explain, in one sentence only!

__Yes __No


Question B5 (MT2007–Q6) “Bed-size kings?”

An expert consultant in hospital resource planning states that the number of open beds

that a hospital can use effectively should be estimated by the number of FTEs (full-time

equivalent employees) on staff. The consultant collected data on the number of open beds

and number of FTEs for 12 hospitals, and computed the means and SDs as follows:

Number of open beds: Mean = 50 SD = 20

Number of FTEs: Mean = 140 SD = 40

She computed the least squares regression equation and found that for a hospital with 100

FTEs, the estimated number of open beds was 32.

a) Use this information to compute the value of the correlation coefficient.

b) What is the least squares regression equation she found?

c) From the available data, what would you predict the number of open beds to be for a

hospital with an unknown number of FTEs?

d) What fraction of the variation in number of open beds is explained by the number of

FTEs?

e) Another expert consultant, this one in hospital administration, claims that the

regression was done the wrong way around, and that the number of FTEs required in a

hospital should be estimated from the number of open beds in the hospital. What would

the value of the correlation coefficient be if the analysis were done this way?


[Figure for Question B6: scatterplot of Medals: Individuals (vertical axis, 0 to 100) against Medals: Team (horizontal axis, 0 to 15)]

Question B6 (MT2006–Q2) “Putting the pedal to the medal”

Torino 2006: In this question we will look at some aspects of medals citizens of the

mythical country of Statland have won over the past 30 years in Winter Olympic Games.

Statlander Olympic Medals can be divided into team medals and individual medals. Here

are the numbers of medals that Statlanders have won in the past eight Winter Olympic

Games:

Here is a scatterplot of the number of team medals Statland has won since Innsbruck

versus the corresponding individual medals.

a) What is the correlation, r, between these

two variables?

__A. 0.98

__B. –0.98

__C. 0.58

__D. –0.58

b) How would r change if Statland had won exactly 20 more individual medals in each of

these eight Winter Olympics games?

__A. r would be 20/8=2.5 times larger

__B. r would be 20/8=2.5 times smaller

__C. r would be the same

__D. r would increase, but I am not sure how much larger

c) Which word best describes the type of relationship between team and individual

medals?

__A. Causation

__B. Correlation

__C. Confounding

__D. Some other word starting with “C”

Site (Year)               Team medals   Individual medals
Innsbruck (1976)          0             3
Lake Placid (1980)        0             2
Sarajevo (1984)           0             4
Calgary (1988)            2             6
Albertville (1992)        5             37
Lillehammer (1994)        5             40
Nagano (1998)             8             49
Salt Lake City (2002)     13            86


[Figure: scatterplot of total Medals (vertical axis, 0 to 120) against Year (horizontal axis, 1976 to 2004)]

The number of medals Statland has won at the Winter Olympics has grown dramatically over the

past 30 years. Here is the total number of medals Statland has won in each of the past eight

Winter Olympics:

Least squares regression line: Medals = -7089 + 3.58 Year

d) Find the correlation coefficient.

Write your answer here: r = __________ (Report 2 decimal places only.)

Show your work:

e) How many medals do you expect Statland to win in 2010?

__A. 106 or 107

__B. Cannot say due to the outlier in the data

__C. Cannot say, because this involves extrapolating beyond the range of data

__D. Cannot say without examining the residual plot first

Site              Year      Medals
Innsbruck         1976      3
Lake Placid       1980      2
Sarajevo          1984      4
Calgary           1988      8
Albertville       1992      42
Lillehammer       1994      45
Nagano            1998      57
Salt Lake City    2002      99

Mean              1989.25   32.5
SD                8.9       34.8


Question B7 (MT2006–Q5) “Wood flakes are not as tasty as corn flakes”

Wood scientists are interested in replacing solid-wood building material by less

expensive products made from wood flakes. They collected data to examine the

relationship between L, the length (in metres) and T, the tensile strength (in kg per cm²)

of beams made of wood flakes.

a) Which is the response variable in this study?

The researchers computed the means to be \(\bar{L} = 3\) and \(\bar{T} = 20\), and the standard deviations

to be \(s_L = 1\) and \(s_T = 4\). The correlation between T and L is \(r_{TL} = -0.8\).

b) What is \(r_{LT}\) (i.e. the correlation between L and T)?

c) Write the regression equation (using the same response variable as your answer in a)).

Show your calculations:

d) Based on your answer in part c), what is the predicted strength (in kg/cm²) of a 2.5

metre beam?

e) In order to present the data at a wood industry conference in the US, the researchers

converted their data to units of feet and pound/inch² (1 metre = 3.28 feet, and

1 kg/cm² = 0.0704 pound/inch²). Let L* and T* be the converted length and strength variables. Then:

Mean of L* = _______

Mean of T* = _______

SD of L* = _______

SD of T* = _______

Correlation between L* and T* = _______

Show your calculations:

f) Which of the following observations (that are part of the dataset used for the regression

above) is the most influential observation?

__A. L = 2, T = 21.2

__B. L = 3, T = 20.0

__C. L = 4, T = 18.8

__D. L = 5, T = 13.6

g) The negative correlation between T and L for the beams produced by the new

technology means that:

__A. wood flake beams are stronger than solid wood beams

__B. wood flake beams are weaker than solid wood beams

__C. wood flake beams are shorter than solid wood beams

__D. shorter wood flake beams are weaker but longer ones are stronger

__E. shorter wood flake beams are stronger but longer ones are weaker


SECTION B: ANSWERS AND EXPLANATIONS

Answer to Question B1 (MT2009–Q2)

a) \(r = b(s_x/s_y) = -0.35(27/10) = -0.945\)

b) \(r^2 = (-0.945)^2 = 0.89\), i.e. 89%

c) \(\hat{y} = 51 - 0.35(50) = 33.5\)

d) Residual = 32.5–33.5 = -1.0

e) Since x is unknown, use the mean of y = 26.5

f) Increase

g) Unchanged: -0.945

h) No; correlation does not prove causation, only association.

i) Switch the roles of x and y:

new slope \(= r(s_x/s_y) = (-0.945)(27/10) = -2.55\)

Details and Comments:

This question requires familiarity and facility with the formulas for the least squares

estimates.

a) Take the usual least squares formula for b, and solve for r. Notice what happens to the

ratio of the standard deviations. That is, since \(b = r(s_y/s_x)\), then \(r = b(s_x/s_y)\).

b) This is the definition of \(r^2\).

f) This point is far above the regression line, so including it will pull up the slope of the

line.
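The algebra in parts a), b), c) and i) can be checked numerically. The following is only a minimal sketch: the function names are hypothetical, and treating 27 and 10 as the individual standard deviations of x and y is an assumption, since only their ratio 27/10 appears in the answer above.

```python
def correlation_from_slope(b, s_x, s_y):
    """Invert b = r * (s_y / s_x) to recover the correlation r."""
    return b * s_x / s_y


def slope_from_correlation(r, s_response, s_explanatory):
    """Least squares slope: r times (SD of response / SD of explanatory)."""
    return r * s_response / s_explanatory


b, a = -0.35, 51.0       # slope and intercept used in parts a) and c)
s_x, s_y = 27.0, 10.0    # assumed SDs; only the ratio 27/10 is given above

r = correlation_from_slope(b, s_x, s_y)                # part a)
r_squared = r ** 2                                     # part b)
y_hat_50 = a + b * 50                                  # part c)
swapped_slope = slope_from_correlation(r, s_x, s_y)    # part i): x becomes the response

print(round(r, 3), round(r_squared, 2), y_hat_50, round(swapped_slope, 2))
```

Note that swapping the roles of x and y changes the slope but leaves r itself untouched, which is the point of parts g) and i).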

Answer to Question B2 (MT2008–Q4)

a) Shape = Linear; Direction = Negative; Strength = Fairly strong

b) High wine consumption goes with fewer heart disease deaths,

while low consumption goes with more deaths.

c) No; correlation does not prove causation, only association.

d) \(r^2 = (-0.843)^2 = 0.711\), i.e. 71%

e) Substitute x=4 into the least squares equation:

\(\hat{y} = 260 - 23(4) = 168\) (or 168 per 100,000 people)

f) Substitute x=150 into the least squares equation:

\(\hat{y} = 260 - 23(150) = -3190\)

The predicted disease rate cannot be negative! The reason for the negative number is

extrapolation far beyond the range of data values.

g) Graph (b)

Details and Comments:

b) Negative correlation: as X increases, Y decreases and as X decreases, Y increases.

c) This is especially true in an observational study. Many other factors can be responsible

for heart disease deaths.

d) This is simply the definition of \(r^2\).

g) Residuals are centered around 0, some are positive and some are negative. In order for

Graph (a) to be the residual plot then all the points on the scatterplot would have to be

above the line. For Graph (c) all the points would be below the line. Neither would, of

course, be the “best-fitting” line.


Answer to Question B3 (MT2008–Q5)

a) F, T, T, T

b) No. The number of firefighters and amount of damage are common responses to the

seriousness of the fire; the more serious the fire, the greater the number of firefighters

sent and the greater the amount of damage.

c) No. Larger hospitals tend to take more serious cases, so the length of stay is longer, on

average.

d) B

Details and Comments:

a) In the first statement change the word “explanatory” to “response” to make it a true

statement; a residual is the difference between y and \(\hat{y}\).

d) Postal rates are set according to the weight of the package; as the weight increases it

“causes” the postage to increase. However, increasing the height does not necessarily

cause the weight to increase; and increasing the age of a teacher does not necessarily

cause an increase in salary. The make of the car doesn’t cause the fuel efficiency; that’s

due to engine design, etc.

Answer to Question B4 (MT2007–Q5)

a) (Highest) C A D B (Lowest)

b) No. Depression could just as easily cause people to smoke!

c) No. This is correlation based on averages (called ecological correlation).

d) No. This is an example of the regression effect.

e) No. Life expectancy was much lower 150 years ago so marriages were shorter because

people died earlier!

Details and Comments:

a) In the 2007 real estate market, sellers generally got the price they asked for, or an

amount very close. Purchase price and yearly income are expected to have a strong

correlation, but purchase price and age are not. Shoe size is irrelevant with respect to

purchase price (except possibly for rare cases like professional basketball players).

b) Correlation is not causation; there are confounding factors involved.

c) Using averages suppresses the scatter in a scatterplot and gives false impressions of

correlation.

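d) No nervousness is needed to explain this: when the correlation between the two passages is imperfect, subjects who are extreme on the first passage tend to be closer to average on the second (regression to the mean).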

e) Interpreting statistical results requires thinking about the historical and societal

context.


Answer to Question B5 (MT2007–Q6)

a) There are multiple ways to solve this:

The least squares line \(\hat{y} = a + bx\) passes through the point of means, so \(50 = a + b(140)\).

The prediction at 100 FTEs gives \(32 = a + b(100)\).

Solve for b by taking the first equation minus the second equation: \(b = 18/40 = 0.45\).

Since \(b = r(s_y/s_x)\), we have \(0.45 = r(20/40)\). Hence \(r = 0.9\).

Another way: 100 FTEs is 1 SD below average in the X-variable; the prediction there, 32, is r SDs

below average in the Y-variable. So \(32 = 50 - r(20)\). Hence \(r = 0.9\).

b) The least squares regression equation she found is \(\hat{y} = -13 + 0.45x\).

\(b = 0.90(20/40) = 0.45\); \(a = 50 - 0.45(140) = -13\)

c) If X is unknown, predict Y to be the mean of y, which is 50.

d) \(r^2 = 0.81\) (This is simply the definition of \(r^2\).)

e) 0.9, unchanged.

Details and Comments:

Practise using the basic least squares formulas. The most common mistake is to confuse

the standard deviations of X and Y in the formula for b.

For part e), correlation between X and Y is the same as correlation between Y and X.

However, the regression equation would be different; the roles of X and Y make a

difference in regression but not in correlation.
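The steps in parts a), b) and d) can be reproduced in a few lines. This is a minimal sketch that uses only the summary statistics stated in the question; the variable names are illustrative, not part of the original exam.

```python
# Summary statistics from the question (x = FTEs, y = open beds)
mean_x, sd_x = 140.0, 40.0
mean_y, sd_y = 50.0, 20.0

# One known point on the fitted line: 100 FTEs -> 32 predicted open beds
x0, y0_hat = 100.0, 32.0

# The least squares line passes through (mean_x, mean_y), so two points fix the slope
slope = (mean_y - y0_hat) / (mean_x - x0)
intercept = mean_y - slope * mean_x

# Invert slope = r * (sd_y / sd_x) to recover the correlation
r = slope * sd_x / sd_y
r_squared = r ** 2

print(slope, intercept, r, r_squared)
```

Swapping the roles of X and Y would give a different slope, r(sd_x/sd_y), but the same r, as part e) states.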

Answer to Question B6 (MT2006–Q2)

a) A. 0.98

b) C. r would be the same

c) B. Correlation

d) r = 0.92

e) C. Cannot say, because this involves extrapolating beyond the range of data

Details and Comments:

a) The scatterplot shows positive correlation so B and D are incorrect. As well, the

clustering around a straight line is very strong so 0.98 is more strongly suggested than

0.58 which is moderate to weak correlation.

b) Adding a constant would move the entire scatterplot but would not change the relative

position of the data points so the correlation would be unchanged.

d) Invert the least squares equation for the slope to solve for the correlation.

r = 3.58(8.9/34.8) = 0.92

e) This is also known as a restricted range problem.
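Parts a), b) and d) can be checked directly from the medal counts in the question. The sketch below is an illustration only: `pearson_r` is a hypothetical helper written out by hand rather than taken from any library.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


team = [0, 0, 0, 2, 5, 5, 8, 13]
individual = [3, 2, 4, 6, 37, 40, 49, 86]

# Part a): strong positive correlation, consistent with option A
r_original = pearson_r(team, individual)

# Part b): adding 20 to every individual count shifts the plot but not r
r_shifted = pearson_r(team, [m + 20 for m in individual])

# Part d): invert slope = r * (SD of Medals / SD of Year)
r_from_slope = 3.58 * 8.9 / 34.8

print(round(r_original, 2), round(r_shifted, 2), round(r_from_slope, 2))
```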


Answer to Question B7 (MT2006–Q5)

a) Response variable: T Tensile strength

b) r is unchanged – still -0.8

c) \(\hat{T} = 29.6 - 3.2L\); slope \(b = -0.8(4/1) = -3.2\); intercept \(a = 20 - (-3.2)(3) = 29.6\)

d) 21.6

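e) Mean length = 3(3.28) = 9.84 feet; mean strength = 20(0.0704) = 1.41 pound/inch²; SD of length = 1(3.28) = 3.28 feet; SD of strength = 4(0.0704) = 0.28 pound/inch²; correlation = -0.8 (unchanged by the change of units)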

f) D. L = 5, T = 13.6

g) E. Shorter wood flake beams are stronger but longer ones are weaker

Details and Comments:

b) The roles of X and Y are interchangeable in correlation (but not in regression). Look at

the formula for r and note what happens if you switch X and Y.

c) Use the least squares formulas for slope and intercept.

d) Substitute 2.5 into equation in c): 29.6 – 3.2(2.5) = 21.6
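Parts c), d) and e) follow mechanically from the summary statistics. The sketch below is a minimal illustration using the values and conversion factors exactly as stated in the question; the names `predict_strength`, `FEET_PER_METRE` and `PSI_PER_KG_CM2` are hypothetical.

```python
# Summary statistics from the question (L = length in metres, T = strength in kg/cm^2)
mean_L, sd_L = 3.0, 1.0
mean_T, sd_T = 20.0, 4.0
r = -0.8

# Part c): least squares slope and intercept with T as the response
slope = r * sd_T / sd_L
intercept = mean_T - slope * mean_L


def predict_strength(length_m):
    """Predicted tensile strength (kg/cm^2) for a beam of the given length."""
    return intercept + slope * length_m


# Part e): a change of units rescales means and SDs but leaves r unchanged
FEET_PER_METRE = 3.28
PSI_PER_KG_CM2 = 0.0704   # conversion factor as given in the question

mean_L_ft = mean_L * FEET_PER_METRE
sd_L_ft = sd_L * FEET_PER_METRE
mean_T_psi = mean_T * PSI_PER_KG_CM2
sd_T_psi = sd_T * PSI_PER_KG_CM2
r_converted = r

print(slope, intercept, predict_strength(2.5))
print(mean_L_ft, mean_T_psi, sd_L_ft, sd_T_psi, r_converted)
```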


SECTION C: NORMAL CURVE, SAMPLING DISTRIBUTIONS,

COMBINING RANDOM VARIABLES

Question C1 (MT2009–Q3) “The Patron Saint of Coffee: St. Arbucks!”

One of your COMM 291 instructors explains how he/she organizes the school day: “I

often work at home until I leave for campus, intending to arrive just in time for class. My

driving time varies, of course. The mean time is 41 minutes and the standard deviation is

2 minutes. Without fail, I stop at Starbucks for tea and my mean time there is 6 minutes,

with a standard deviation of 1.5 minutes. I assume that my driving time and my

Starbucks time are normally distributed and independent of one another.”

a) Is the distribution of my total time getting to campus (i.e., driving and Starbucks time

combined) normal, non-normal, or unknown? Give the mean, variance and standard

deviation of the distribution of total time.

Space for calculations

Distribution ___________

Mean ___________

Variance ___________

SD ___________

b) If I were willing to be late for class 3% of the time in the long run, how far ahead of

class time should I leave home?

c) Well, I admit that I am late very occasionally, but I have found over time that there is a

0.2 probability of any student being late to class. Assume that students are independent

from one another with respect to being late. (Actually, this probability was made up for

the purposes of this question; you all are much better than this!) In a class of 100

students, what is the approximate probability of at least 75% of students being on time?

(Hint: Be careful in setting up the start of the question.)


Question C2 (MT2009–Q4) “The Need for Less Speed”

An automobile insurer has found that repair claims have a mean of $920 and a standard

deviation of $870.

a) What is the probability that one randomly chosen claim is larger than $1000?

__ A. 0.9100

__ B. 0.4641

__ C. 0.0900

__ D. 0.5359

b) Compute the mean and standard deviation of the average, \(\bar{x}\), of the next 100 random

claims.

__ A. Mean = $920; Standard Deviation = $87

__ B. Mean = $920; Standard Deviation = $8.70

__ C. Mean = $92; Standard Deviation = $87

__ D. Mean = $92; Standard Deviation = $870

__ E. None of these

c) What is the probability that the average, \(\bar{x}\), of the 100 claims is larger than $1000?

__ A. 0.9200

__ B. 0.8212

__ C. 0.0800

__ D. 0.1788

In order to get full marks for part c), show your work:

d) The Central Limit Theorem justifies some of the calculations above. What does the

Central Limit Theorem say? Complete the following sentence by selecting the most

appropriate phrase from the choices.

“When a sample of size n is to be drawn from any population with mean mu and standard

deviation sigma, then when n is sufficiently large…”

__ A. the standard deviation of the sample mean is \(\sigma/\sqrt{n}\)

__ B. the distribution of the population is exactly normal

__ C. the distribution of the population is approximately normal

__ D. the distribution of the sample mean is exactly normal

__ E. the distribution of the sample mean is approximately normal


Question C3 (MT2008–Q3) “It’s all in/on your head!”

a) According to GQ magazine, the first and third quartiles of men’s haircut prices in

London are ₤21 and ₤29, respectively. What are the mean, variance and standard

deviation of haircut prices? Assume haircut prices are normally distributed.

b) Instead of using the numbers you found in part a), assume that the mean haircut price

is ₤26.50 and the standard deviation is ₤5. Suppose a visiting tourist from Paris is so

desperate for a haircut that he walks into the first barber shop he sees and sits right down

in the chair. What is the probability that his haircut will be ₤25 or less?

c) Two twin sisters are registered in different MBA programs in the United States. The

sister registered at Harvard got 87% on her final comprehensive exam. The average mark

on that exam was 73% and the standard deviation was 7%. The sister registered at

Stanford got 84% on her final comprehensive exam. Its mean was also 73% but its

standard deviation was 5%. Assume exam marks are normally distributed. Which twin’s

result ranked higher within her own class? What are their respective percentiles?

Percentile of Harvard sister = ______

Percentile of Stanford sister = ______

Which twin ranked higher within her own class?

__ Harvard sister __ Stanford sister


Question C4 (MT2008–Q7) “What exactly is a widget?”

NOTE: This question focuses on means and variances of combinations of random

variables. See textbook and course notes.

The dictionary defines “widget” as 1: a usually small device, contrivance, or mechanical

part, or 2: an unnamed article considered for purposes of a hypothetical example as the

typical product of a company. We are using the second definition here.

The manufacturing of a widget requires the following 3 steps:

1. Cut a small cube of wood. The amount of time, W, needed to cut the cube is

normally distributed with mean 15 seconds and standard deviation 5 seconds.

2. Drill a hole through the center of each face of the cube through to the other

side. A total of 3 holes are required and the amount of time, X, to drill one

hole is normally distributed with mean 3 seconds and standard deviation 1

second.

3. Round each corner. There are a total of 8 corners and the amount of time, Y, to

round a corner is normally distributed with mean 5 seconds and standard

deviation 2 seconds.

Assume that the variables W, X, and Y are independent.

a) Find the mean and standard deviation for the total time, T, required for producing a

widget. Hint: Figure out all the variables that go into the sum, T.

Mean of T: ______ Standard deviation of T: ______

Show your work:

b) Assume that the mean of T is 70 and the standard deviation of T is 8. (Those aren’t

actually the correct answers for part a), but will allow you to do part b) independently.)

Let S be the time required to produce 9 widgets. What is the probability that S will exceed

660 seconds? Hint: Define S in terms of T from part a).

Answer: _____

Show your work:


Question C5 (MT2008–Q8) “Is RV a random variable or a recreational vehicle?”

NOTE: Parts a) and b) are unrelated.

a) The time Canadians spend watching TV and movies per day has mean 190 minutes and

standard deviation 52 minutes. The time Americans spend watching TV and movies per

day has mean 170 minutes and standard deviation 45 minutes. Two simple random

samples, of 100 Canadians and 100 Americans, are selected. Let \(\bar{x}_C\) denote the mean TV-watching

time for the Canadian group, and \(\bar{x}_A\) for the American group. Note that both \(\bar{x}_C\)

and \(\bar{x}_A\) are random variables. Assume all 200 TV-watching times in the two samples are

independent.

(i) Find the means and standard deviations of \(\bar{x}_C\) and \(\bar{x}_A\).

Mean of \(\bar{x}_C\): ________ Mean of \(\bar{x}_A\): ________

SD of \(\bar{x}_C\): ________ SD of \(\bar{x}_A\): ________

(ii) What is the approximate shape of the distribution of \(\bar{x}_C - \bar{x}_A\), the difference between the

mean TV-watching times in the two samples? Give a reason why.

(iii) What is the probability that \(\bar{x}_C - \bar{x}_A\) is greater than 30 minutes? Report your answer with

no more than two decimal digits. Hint: You will need the mean and standard deviation of

\(\bar{x}_C - \bar{x}_A\) to solve this. You can easily compute the mean; we’ll give you the standard

deviation, which is 6.88. You can trust us!

b) The Grocery Manufacturers of Canada reported that 72% of consumers read the

ingredients listed on a product’s label. Assume the population proportion is p = 0.72 and

a sample of 250 consumers is selected from the population. What is the approximate

probability that the percentage of consumers in the sample who read the ingredients on a

label will exceed 76%?


Question C6 (MT2007–Q4) “Normal is as normal does” (apologies to Forrest

Gump)

a) Suppose midterm exam scores in a math course are normally distributed with a

mean of 60 and a standard deviation of 20. What is the interquartile range? (Hint: Start by

finding Q1 and Q3 for a standard normal distribution.)

b) Now suppose that the final exam scores are normally distributed, also with a mean of

60 and a standard deviation of 20. The instructor wishes to give 20% A’s, 30% B’s and

the rest C or lower.

(i) What final score should be the cutoff between A and B?

(ii) What final score should be the cutoff between B and C? (This one’s easy!)

c) A company has two manufacturing plants, one that uses low-tech machines and

another that uses high-tech machines. From recent history, the number of defects per

week observed at each plant is normally distributed with the following parameters.

Low-tech: Mean = 15 SD = 3

High-tech: Mean = 10 SD = 1

Last week, the low-tech plant produced ten defects, while the high-tech plant produced

eight defects. Which plant performed better relative to past performance? Explain why.

___ Low-tech

___ High-tech

d) Refer to part c). The two plants work independently. Compute the mean and standard

deviation of the total number of defects from the two plants.

Mean = ___________

SD = ____________


Question C7 (MT2007–Q7) “Bean-counting and counting beans”

a) An accounting professor claims that about one-quarter of undergraduate business

students will major in accounting. Assuming the professor is correct, what is the

probability that in a random sample of 1200 undergraduate business students, at least

28% will major in accounting?

b) Refer to part a). A survey of a random sample of 1200 undergraduate business students

indicates that there are 336 students who plan to major in accounting. What does this tell

you about the professor’s claim?

c) The restaurant in a large commercial building provides coffee for the building’s

occupants. The restaurateur has determined that the mean number of cups of coffee

consumed in a day by each occupant is 2.0 with a standard deviation of 0.6. A new tenant

of the building intends to have a total of 125 new employees. What is the probability that

the new employees will consume more than 240 cups per day? (Hint: You can answer

this using either of two different but related methods.)

Question C8 (MT2006–Q6) “Making a grand entrance”

Scores on the ACT college entrance examination in the U.S. vary normally with μ = 18

and standard deviation σ = 6. The range of reported scores is 1 to 36.

a) What range contains the middle 95% of all individual scores?

b) If the ACT scores of 25 randomly selected students are averaged, what range contains

the middle 95% of the averages computed from many repetitions of the sampling

process?

c) If the sample size increased from 25 to 50, by what factor would the standard deviation

of the sampling distribution of \(\bar{x}\) change?

__A. ½

__B. \(1/\sqrt{2}\)

__C. \(\sqrt{2}\)

__D. 2


Question C9 (MT2006–Q3) “Pooling your resources”

Beijing 2008: In this question we focus on swimming events at the Summer Olympics.

a) To qualify for the men’s 100m freestyle event, a swimmer needs to have a time of 48

seconds or less. An American, a Dutch and an Australian hope to qualify. Their past

times (in seconds) in this event are normally distributed: American ~N(50,3.0), Dutch

~N(49.5,2.5), and the Australian ~N(49,2.5). (Remember: The first number is the mean,

the second is the standard deviation.) What is the probability that the Dutch swimmer will

qualify? (Report only 2 decimal places!)

Probability = ______

b) Which swimmer has the lowest probability of qualifying? Compute that probability.

Swimmer is ________

Probability = _____

c) If, in the next Olympic competition, each swimmer achieves his 90th percentile

performance, who will win the event? (Remember: Lower times are better, so be careful!)

Winner is _________

d) Men’s 4x100m freestyle relay: The Dutch relay team has four swimmers, each of

whose past times are normally distributed as follows: N(49.5,2.5), N(50,3.0), N(51.5,5),

and N(53,2.5). What are the mean and standard deviation of the total time needed by the

four swimmers to complete the 400m swim? Assume racers’ times are independent.

Mean = _____

SD = _____

e) Refer to part d): What is the chance that the Dutch team will complete the swim in

under 200 seconds (the time needed to qualify)? (Report only 2 decimal places!)


SECTION C: ANSWERS AND EXPLANATIONS

Answer to Question C1 (MT2009–Q3)

a) Distribution = Normal

Mean = 41+6 = 47; Variance = \(2^2 + 1.5^2 = 6.25\); SD = 2.5

b) Pr (Z < z) = 0.97; z = 1.88 = (x–47)/2.5; x = 51.70. (Leave 52 minutes before class)

c) Pr (\(\hat{p}\) < 0.25) = Pr (Z < [0.25–0.20]/\(\sqrt{0.2(0.8)/100}\)) = Pr (Z < 1.25) = 0.894, where \(\hat{p}\) is the sample proportion of late students

Details and Comments:

a) Since each individual component is normally distributed, the sum is normally

distributed. The mean of the sum is sum of the means. The variance of sum is also

additive since the components are independent. Note that you can’t add the SDs.

b) The probability (area under the z curve) is given; find z and then “unstandardize” to

find x.

c) The information given is about being late. The question asks for at least 75% being on

time. This translates into less than 25% who are late. Once you see that the question asks

for Pr (\(\hat{p}\) < 0.25), using the sampling distribution of \(\hat{p}\), standardize and use the standard

normal curve.
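The three parts can be verified with Python's standard library. This is a minimal sketch, assuming the normal model and the normal approximation for the sample proportion used in the answer; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Part a): sum of two independent normals (means add, variances add)
total_time = NormalDist(mu=41 + 6, sigma=math.sqrt(2 ** 2 + 1.5 ** 2))

# Part b): leave early enough that only 3% of trips take longer
leave_minutes = total_time.inv_cdf(0.97)

# Part c): proportion late in a class of 100, approximated by a normal
p_late, n = 0.20, 100
se_p = math.sqrt(p_late * (1 - p_late) / n)
prob_at_least_75_on_time = NormalDist(p_late, se_p).cdf(0.25)

print(total_time.mean, total_time.stdev)
print(round(leave_minutes, 2), round(prob_at_least_75_on_time, 3))
```

Using exact quantiles instead of a two-decimal z-table can change the last digit slightly, so small differences from the table-based answers are expected.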


Answer to Question C2 (MT2009–Q4)

a) B. Pr(X > 1000) = Pr (Z > [1000–920]/870) = Pr (Z > 0.09) = 0.4641

b) A. SD(\(\bar{x}\)) = 870/√100 = 87

c) D. Pr(\(\bar{x}\) > 1000) = Pr (Z > [1000–920]/87) = Pr (Z > 0.92) = 0.1788

d) E. (Look up the explanation of the Central Limit Theorem)

Details and Comments:

This question is all about the sampling distribution of a mean. In part a), the probability is

concerned with a single observation. In parts b) and c) the probability is concerned with

the mean of n=100 observations. That’s why the denominator in a) is 870 and in c) is 87

(=870/√100).
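The contrast between a single claim and the mean of 100 claims can be sketched as follows. This is an illustration only; it follows the answer key in treating a single claim as normally distributed for part a), which is an assumption of the key rather than something the question states.

```python
import math
from statistics import NormalDist

mu, sigma, n = 920, 870, 100

# Part a): one claim, standardized with sigma itself
p_single = 1 - NormalDist(mu, sigma).cdf(1000)

# Parts b) and c): the mean of n claims, standardized with sigma / sqrt(n)
se_mean = sigma / math.sqrt(n)
p_mean = 1 - NormalDist(mu, se_mean).cdf(1000)

print(se_mean, round(p_single, 4), round(p_mean, 4))
```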

Answer to Question C3 (MT2008–Q3)

a) Mean: μ = (29+21)/2 = 25

Q3 = z(0.75) = 0.675 = (29–μ)/σ = (29–25)/σ ; hence σ = 4/0.675 = 5.9

If you used z(0.75) = 0.67, σ = 5.97; if you used z(0.75) = 0.68, σ = 5.88

All three values for σ are acceptable.

Variance = \(\sigma^2\) = 34.81 (or 35.64 or 34.57 corresponding to the other values of σ above).

b) Pr (X < 25) = Pr (Z < [25–26.50]/5) = Pr (Z < -0.3) = 0.3821

c) First find the z-scores and then convert them to percentiles.

Percentile of Harvard sister: z = (87–73)/7 = 2.00, i.e. the 97.7th percentile

Percentile of Stanford sister: z = (84–73)/5 = 2.20, i.e. the 98.6th percentile

Which twin ranked higher within her own class? Stanford sister

Details and Comments:

a) Since the distribution is assumed to be normal it is symmetric, so the mean is halfway

between the first and third quartiles. Therefore, add the two quartiles and divide by 2.

If you didn’t notice that μ could be found this way, another way to solve this is to use Q1:

z(0.25) = -0.675 = (21–μ)/σ, together with Q3: z(0.75) = 0.675 = (29–μ)/σ, and then solve two equations in two unknowns.

b) Remember to draw a sketch of the standard normal and mark -0.3 on the sketch. You

can look up -0.3 directly in the tables or look up +0.3 and subtract the result from 1.

c) The area to the left of 2.00 on the standard normal is .977 which is the same as saying

2.00 is the 97.7th percentile.
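All three parts reduce to standardizing and reading the normal curve, which can be sketched with the standard library. The code is a minimal illustration under the normality assumptions stated in the question; the variable names are hypothetical.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, SD 1

# Part a): recover mu and sigma from the quartiles of a normal distribution
q1, q3 = 21, 29
z75 = standard.inv_cdf(0.75)      # about 0.675
mu = (q1 + q3) / 2                # symmetry puts the mean halfway between
sigma = (q3 - mu) / z75

# Part b): probability that a haircut costs 25 or less
p_25_or_less = NormalDist(26.50, 5).cdf(25)

# Part c): percentile of each twin within her own class
harvard_pct = NormalDist(73, 7).cdf(87)
stanford_pct = NormalDist(73, 5).cdf(84)

print(mu, round(sigma, 2), round(p_25_or_less, 4))
print(round(harvard_pct, 3), round(stanford_pct, 3))
```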


Answer to Question C4 (MT2008–Q7)

a) Mean of T: 64; Standard deviation of T: 7.75

T = W+(X1+X2+X3)+(Y1+Y2+ … +Y8) [this is NOT the same as W + 3X + 8Y].

Mean(T) = Mean(W )+Mean(X1)+Mean(X2)+Mean(X3)+Mean(Y1)+ … +Mean(Y8)

= Mean(W )+3Mean(X)+8Mean(Y) = 64

Var(T) = Var(W )+Var(X1)+Var(X2)+Var(X3)+Var(Y1)+ … +Var(Y8) [by independence]

= Var(W )+3Var(X)+8Var(Y) = \(5^2 + 3(1^2) + 8(2^2) = 60\)

SD(T) = √60 = 7.75

b) Answer: 0.1056 or 10.56% or 11%.

S = T1+ … + T9

Mean(S) = 9Mean(T) = 9(70) = 630

Var(S) = 9Var(T) = \(9(8^2) = 576\)

SD(S) = √576 = 24

Pr (S > 660) = Pr (Z > [660–630]/24) = Pr (Z > 1.25) = 1 – 0.8944 = 0.1056 or 10.56% or

11%.

Details and Comments:

a) These calculations use two principles of combining random variables. The mean of a

sum of random variables is the sum of the means. The variance of a sum of random

variables is the sum of the variances, ONLY if the random variables are independent

(which they are in the case of a random sample). Note that you can never add standard

deviations. First compute the variance of the sum and then take the square root.

b) Remember to draw a sketch of the standard normal. From the Empirical Rule you

know that the area to the right of 1 is 16% and the area to the right of 2 is 2.5%, so the

area to the right of 1.25 must be between 2.5% and 16%; 10% seems reasonable.
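The distinction between a sum of independent copies and a multiple of one variable is easy to check numerically. The sketch below is an illustration only; `var_if_scaled` is a hypothetical name for the variance you would get by wrongly treating T as W + 3X + 8Y.

```python
import math
from statistics import NormalDist

# Part a): T = W + (X1 + X2 + X3) + (Y1 + ... + Y8), all independent
mean_T = 15 + 3 * 3 + 8 * 5
var_T = 5 ** 2 + 3 * 1 ** 2 + 8 * 2 ** 2
sd_T = math.sqrt(var_T)

# For contrast: the variance of W + 3X + 8Y (one X and one Y, scaled)
var_if_scaled = 5 ** 2 + (3 * 1) ** 2 + (8 * 2) ** 2

# Part b): S = T1 + ... + T9 with the stated mean 70 and SD 8
mean_S = 9 * 70
sd_S = math.sqrt(9 * 8 ** 2)
p_exceeds_660 = 1 - NormalDist(mean_S, sd_S).cdf(660)

print(mean_T, var_T, round(sd_T, 2), var_if_scaled)
print(mean_S, sd_S, round(p_exceeds_660, 4))
```

The scaled version has a much larger variance because scaling a variable by k multiplies its variance by k squared, whereas adding k independent copies multiplies it only by k.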

Answer to Question C5 (MT2008–Q8)

a) (i) Mean of \(\bar{x}_C\) = 190; Mean of \(\bar{x}_A\) = 170

SD of \(\bar{x}_C\) = 52/√100 = 5.2; SD of \(\bar{x}_A\) = 45/√100 = 4.5

(ii) Approximately normal, due to the Central Limit Theorem, since each n is larger than 30.

(iii) Mean(\(\bar{x}_C - \bar{x}_A\)) = 190–170 = 20; SD(\(\bar{x}_C - \bar{x}_A\)) = 6.88 (given)

Pr (\(\bar{x}_C - \bar{x}_A\) > 30) = Pr (Z > [30–20]/6.88) = Pr (Z > 1.45) = 1 – 0.9265 = 0.0735 or 0.07

b) Pr (\(\hat{p}\) > 0.76) = Pr (Z > [0.76–0.72]/\(\sqrt{0.72(0.28)/250}\)) = Pr (Z > 1.41) = 1 – 0.9207

= 0.0793 or 0.08.

Details and Comments:

a) First calculate the mean of \(\bar{x}_C - \bar{x}_A\) (the SD is given), then standardize and find the

appropriate area under the z curve.

b) Use the sampling distribution of \(\hat{p}\), standardize and use the standard normal curve.
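The given standard deviation of 6.88 and both probabilities can be reproduced as follows. This is a minimal sketch assuming the normal approximations used in the answer; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Part a): standard errors of the two sample means (n = 100 each)
se_canada = 52 / math.sqrt(100)
se_usa = 45 / math.sqrt(100)

# Independent samples: the variances of the two means add
se_diff = math.sqrt(se_canada ** 2 + se_usa ** 2)
p_diff_over_30 = 1 - NormalDist(190 - 170, se_diff).cdf(30)

# Part b): sampling distribution of a proportion with p = 0.72, n = 250
se_p = math.sqrt(0.72 * 0.28 / 250)
p_over_76 = 1 - NormalDist(0.72, se_p).cdf(0.76)

print(round(se_diff, 2), round(p_diff_over_30, 2), round(p_over_76, 2))
```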


Answer to Question C6 (MT2007–Q4)

a) For Z: Q1 = -0.675, Q3 = 0.675 (from Z-table)

“Unstandardize” to X: Q1 = 60 + (-0.675)(20) = 46.5, Q3 = 60 + (0.675)(20) = 73.5

IQR = 73.5 – 46.5 = 27

b) (i) Find Z with area of 0.20 to the right; Z = 0.84; X = 60 + (0.84)20 = 76.8.

Hence the cutoff is 77.

(ii) The cutoff is 60 (50% are C or lower, so the median – equal to the mean in a normal –

is the cutoff)

c) High-tech

Z-score for low-tech = (10–15)/3 = -1.67

Z-score for high-tech = (8–10)/1 = -2.00. More unusual, so better relative performance.

d) Mean = 15+10 = 25; SD = \(\sqrt{3^2 + 1^2}\) = √10 = 3.16

Details and Comments:

a) Hints are given for a reason.

The z-table only gives two decimal places for a z-value, so it is also acceptable to base

your computations on -0.68 and 0.68: IQR = 27.2, or on -0.67 and 0.67: IQR = 26.8.

However, it is an easy and sensible interpolation to get to -0.675 and 0.675; that is the

preferred solution.

b) (ii) is based on the definition of the median; that’s why the question said it was easy.

c) Be careful here; high-tech has the z-score further below zero and hence has smaller probability

associated with it, but less probable is better here since you are computing the probability

of defects! It is analogous to shorter times in a race or lower golf scores representing

better outcomes. Context is everything.

d) Means are additive; standard deviations are not. However, variances are additive if the

random variables are independent.
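Parts a), b)(i) and d) can be reproduced with Python's standard library. This is a minimal sketch under the values stated in the answer (mean 60, SD 20 for the scores; SDs of 3 and 1 for the two processes); the variable names are illustrative. The software uses the exact quartile z-value (about 0.6745) rather than the interpolated 0.675, so results agree with the answer to rounding.

```python
from math import sqrt
from statistics import NormalDist

scores = NormalDist(mu=60, sigma=20)

# Part a): quartiles of the normal model and the IQR.
q1 = scores.inv_cdf(0.25)
q3 = scores.inv_cdf(0.75)
iqr = q3 - q1

# Part b)(i): score with 20% of the area to its right (80th percentile).
cutoff_top20 = scores.inv_cdf(0.80)

# Part d): for independent variables the variances add, not the SDs.
sd_total = sqrt(3**2 + 1**2)

print(round(q1, 1), round(q3, 1), round(iqr, 1))
print(round(cutoff_top20, 1), round(sd_total, 2))
```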

Answer to Question C7 (MT2007–Q7)

a) Pr(p̂ > 0.28) = Pr(Z > [0.28–0.25]/√[(0.25)(0.75)/1200]) = Pr(Z > 2.4) = 0.0082

b) p̂ = 336/1200 = 0.28. We know from part a) that the claim is very unlikely since the

probability of exceeding 0.28 is so small.

c) The amount of coffee consumed follows a normal distribution; x̄ = 240/125 = 1.92

Pr(x̄ > 1.92) = Pr(Z > [1.92–2.0]/[0.6/√125]) = Pr(Z > -1.49) = 0.93

Alternative method (but not necessary for you to do):

Pr (Total > 240) = Pr (Z > [240–125(2.0)]/[(0.6)√125]) = Pr (Z > -1.49) = 0.93

Details and Comments:

a) Use the sampling distribution of p̂. This question asks for the probability of the

proportion exceeding 0.28.

c) This question can be answered two ways, either by working out the probability that the

mean exceeds 1.92 (i.e. 240/125), or that the total exceeds 240. To use the second method

you need to work out the mean and standard deviation of a Total, using combinations of

random variables. Using the mean will be much more familiar to you and therefore much

easier! The main difference is in the standard deviation formula; the “square root of n”

term changes positions.
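The equivalence of the two methods in part c) is easy to confirm numerically. The sketch below uses the values from the answer (n = 125, mean 2.0, SD 0.6, total of 240); the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

n, mu, sigma = 125, 2.0, 0.6
total = 240

# Method 1: work with the sample mean; its SD is sigma / sqrt(n).
z_mean = (total / n - mu) / (sigma / sqrt(n))

# Method 2: work with the total; its SD is sigma * sqrt(n).
z_total = (total - n * mu) / (sigma * sqrt(n))

# Either z-score gives the same upper-tail probability.
prob = 1 - NormalDist().cdf(z_mean)

print(round(z_mean, 2), round(z_total, 2), round(prob, 2))
```

Both z-scores are the same number because multiplying the numerator and denominator of the first expression by n gives the second.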

Answer to Question C8 (MT2006–Q6)

a) 18 ± 2(6) = 18 ± 12 = (6,30)

OR: 18 ± 1.96(6) = 18 ± 11.76 = (6.24, 29.76)

b) 18 ± 2(6/√25) = 18 ± 2.4 = (15.6, 20.4)

OR: 18 ± 1.96(6/√25) = 18 ± 2.35 = (15.65, 20.35)

c) B. 1/

Details and Comments:

a) This part uses The Empirical Rule (68-95-99.7 Rule) since we are interested in

individual scores. That is, the plus or minus number is about 2(σ).

b) This part uses standard error (from the sampling distribution of x̄) since we are

interested in the mean of 25 scores. That is, the plus or minus number is about 2(σ/√n).

c) Notice the effect of the square root of n.
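To see the effect of the square root of n, the two "plus or minus" numbers can be computed side by side. This is a small sketch using the values from the answer (mean 18, σ = 6, n = 25); the names are illustrative.

```python
from math import sqrt

mu, sigma, n = 18, 6, 25

# Individual scores: about 2 standard deviations either side of the mean.
half_individual = 2 * sigma

# Mean of n scores: about 2 standard errors either side of the mean.
half_mean = 2 * sigma / sqrt(n)

# The interval for the mean is narrower by a factor of 1/sqrt(n).
ratio = half_mean / half_individual

print((mu - half_individual, mu + half_individual))
print((mu - half_mean, mu + half_mean))
print(ratio)
```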

Answer to Question C9 (MT2006–Q3)

a) Pr (X < 48) = Pr (Z < [48-49.5]/2.5) = Pr (Z < -0.6) = 0.27 or 27%

b) American: Pr (Z < [48-50]/3) = Pr (Z < -0.67) = 0.25

Australian: Pr (Z < [48-49]/2.5) = Pr (Z < -0.40) = 0.34

Swimmer is American; Probability = 0.25

c) Australian: -1.28(2.5) + 49 = 45.8

American: -1.28(3) + 50 = 46.2

Winner is Australian

d) Mean = 49.5 + 50 + 51.5 + 53 = 204

Variance = 2.5² + 3² + 5² + 2.5² = 46.5; SD = 6.82

e) Pr (T < 200) = Pr (Z < [200-204]/6.82) = Pr (Z < -0.59) = 0.28 or 28%

Details and Comments:

b) After standardizing, draw a sketch to get the correct part of the normal curve.

c) From the given probability (through the percentile), find the corresponding value on

the standard normal curve and then “unstandardize.” Remember that faster is better here.

d) The mean of a sum is the sum of the means. The variance of a sum is the sum of the

variances IF the random variables are independent. Standard deviations are never

additive.

e) Use the mean and standard deviation from part d) to standardize.
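Parts d) and e) can be checked with the following sketch, which assumes (as the answer does) that the four times are independent and normally distributed. The lists simply hold the four means and four SDs quoted in part d); the names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

means = [49.5, 50, 51.5, 53]
sds = [2.5, 3, 5, 2.5]

# The mean of a sum is the sum of the means.
total_mean = sum(means)

# Variances add for independent variables; SDs do not.
total_var = sum(s**2 for s in sds)
total_sd = sqrt(total_var)

# Part e): probability that the total time is under 200.
prob_under_200 = NormalDist(total_mean, total_sd).cdf(200)

print(total_mean, total_var, round(total_sd, 2), round(prob_under_200, 2))
```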

SECTION D: INTRODUCTION TO INFERENCE, CONFIDENCE

INTERVALS, HYPOTHESIS TESTS

Question D1 (MT2009–Q6) “Frozen biscuits”

Thunderbird Co. uses a high technology manufacturing process to produce ice hockey

pucks with an average weight of 170 g. Sometimes the process gets out of adjustment and

produces pucks with average weights different from 170 g. When the average weight

exceeds 170 g., sales will be negatively affected since these pucks cannot be used in a

game due to possible serious harm. As well, there are equipment standards for

professional competition and when the average weight falls below 170 g., Thunderbird’s

pucks might be rejected.

Thunderbird’s quality control program involves taking periodic samples of 50 pucks to

monitor the manufacturing process. For each sample, a hypothesis test is conducted to

determine whether the process has fallen out of adjustment. The quality control team

selected α = 0.05 as the level of significance for the test.

a) Suppose that a sample of 50 pucks is selected and that the sample mean, x̄, is 172.6 g

and sample standard deviation, s, is 12 g.

Carry out the hypothesis test by following these steps.

State the hypotheses

Explain whether Ha should be one-sided or two-sided

Calculate the test statistic

Find bounds on the P-value (Assume the test statistic = 1.5 rather than the answer you

got immediately above.)

State your conclusion (Base it on the test statistic = 1.5 and the P-value you found

immediately above. Reminder: Don’t just say Accept H0 or Reject H0.)

b) Construct a 95% confidence interval for the mean weight of the population of all pucks

made by Thunderbird.

Question D2 (MT2008–Q9) “A testbank bank test”

The Bank of ABC focused on a stable workforce that has very little turnover. The bank

has always promoted the idea that its employees stay with them for a very long time, and

has used the following line in its recruitment brochures: The average tenure of our

employees is 20 years. However, its new HR manager thinks that the average tenure may

be less than 20 years which would mean that some new measures to improve workforce

stability would be required. A random sample of 100 employees is taken and the average

tenure is computed to be 19 years, with a standard deviation of 4 years. Does the HR

manager have enough evidence to support his claim at the 5% significance level?

a) State the hypotheses.

b) Compute the value of the test statistic.

c) Find an approximate P-value for this test statistic.

d) State a conclusion about the manager’s claim, in one complete sentence. Don’t just say

Accept H0 or Reject H0.

e) Assuming the same sample mean of 19 years and standard deviation of 4 years, what is

the smallest sample size that would still reject the null hypothesis at the 5% significance

level? Hint. Find the value of the test statistic that would give a P-value as close as

possible to 0.05.

Question D3 (MT2008–Q10) “No lack of confidence here!”

a) A study of the career paths of hotel general managers sent questionnaires to a random

sample of 160 hotels belonging to major U.S. hotel chains and received 114 responses.

The mean time these 114 general managers had spent with their current company was

11.8 years, with a standard deviation of 3.2 years. Construct a 99% confidence interval

for the mean number of years general managers of major-chain hotels have spent with

their current company.

b) Answer each of the following with: Yes, No, or Can’t Tell. Circle your choices.

Does the sample mean lie in Yes No Can’t Tell

the 95% confidence interval?

Does the population mean lie in Yes No Can’t Tell

the 95% confidence interval?

If a 90% confidence level were used, Yes No Can’t Tell

would the interval from the same data

produce an interval wider than the 95%

confidence interval?

With a smaller sample size, all other Yes No Can’t Tell

things being the same, would the 95%

confidence interval be wider than with

a larger sample size?

c) A radio talk show invites listeners to enter a dispute about a proposed pay increase for

city council members. “What yearly pay do you think council members should get? Call

us with your number.” In all, 958 people call. The station calculates the 95% confidence

interval for the mean pay, μ, that all citizens would propose for council members to be

$9669 to $9811. Is this result trustworthy? Explain your answer.

Question D4 (MT2007–Q8) “I owe, I owe, it’s off to work I go” (go ahead & sing it)

The National Association of Independent Colleges and Universities took a random

sample of 64 college graduates and found that their average debt upon graduation was

$12,000, with a standard deviation of debt upon graduation of $1800.

a) Construct a 95% confidence interval for the mean debt of all college graduates.

b) True or False: The confidence interval you obtained in part a) means that

approximately 95% of sample averages obtained from repeated random samples of 64

college graduates will fall in that interval. (No explanation is necessary.)

___True ___False

c) Calculate the sample size required to have a 99% confidence interval with the same

margin of error as that found in part a).

d) A college president says that the sample of 64 graduates in part a) resulted in an

overestimate and that the actual mean debt is $11,500. Test whether the actual mean debt

exceeds $11,500 by forming the appropriate hypotheses, obtaining a P-value, and

interpreting it.

e) Decide whether each statement is more likely to be True or False:

(i) The larger the sample size, the more likely you will get True False

statistical significance using a t-test (assuming the sample

mean does not change).

(ii) The P-value does not depend on the level of significance. True False

(iii) As the P-value gets smaller, the evidence against the null True False

hypothesis gets stronger.

(iv) In hypothesis testing, if the P-value is 1%, there is a 1% True False

chance that the null hypothesis being tested is true.

(v) The smaller the P-value, the less likely the null hypothesis True False

is true.

Question D5 (MT2006–Q7) “Buyer, and renter, beware!”

The square footage of the several thousand apartments in a new development is

advertised to be 1250 square feet, on average. A tenant group thinks that the apartments

are smaller than advertised. They hire an engineer to measure a random sample of

apartments to test their suspicion about the true average area (in sq. ft.) of the apartments.

a) What are the appropriate null and alternative hypotheses?

__A. H0: μ = 1250 vs. Ha: μ < 1250

__B. H0: μ = 1250 vs. Ha: μ ≠ 1250

__C. H0: μ = 1250 vs. Ha: μ > 1250

__D. H0: x̄ = 1250 vs. Ha: x̄ < 1250

b) The engineer’s sample of 36 apartments had a mean of 1208 square feet and standard

deviation of 120 square feet. What is the value of the test statistic?

__A. -0.35 Show your work:

__B. -2.1

__C. -12.6

__D. 2.1

__E. None of these

c) The P-value for the test statistic you computed in part b) is closest to:

__A. >0.20

__B. 0.05

__C. 0.025

__D. <0.001

__E. None of these

d) State your conclusion in one complete sentence that the tenant group can understand.

SECTION D: ANSWERS AND EXPLANATIONS

Answer to Question D1 (MT2009–Q6)

a) Hypotheses: H0: μ = 170; Ha: μ ≠ 170

Two-sided; differences from 170 in either direction are both problems!

Test statistic: t = (x̄ – μ0)/(s/√n) = (172.6 – 170)/(12/√50) = 1.53

Degrees of freedom = 49

Two-tail P-value (from Table T) is between 0.10 and 0.20.

There is not enough evidence to conclude a difference from the target of 170 g.

b) x̄ ± t*(s/√n) = 172.6 ± 2.01 (12/√50) = 172.6 ± 3.4 or (169.2, 176.0)

Details and Comments:

a) It is a problem if pucks are too light or too heavy so this is a two-tailed (two-sided)

alternative.

b) Look up the multiplier in Table T, using 49 degrees of freedom (or the closest row

available) and 95% confidence level.
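The test statistic in part a) follows the usual one-sample pattern: (sample mean minus hypothesized mean) divided by the standard error. A minimal sketch, where the helper name one_sample_t is hypothetical and the P-value still has to be read from Table T:

```python
from math import sqrt

def one_sample_t(xbar, mu0, s, n):
    """Return the one-sample t statistic and its degrees of freedom."""
    se = s / sqrt(n)              # standard error of the sample mean
    return (xbar - mu0) / se, n - 1

# Values from Question D1: n = 50 pucks, mean 172.6 g, s = 12 g, target 170 g.
t_stat, df = one_sample_t(172.6, 170, 12, 50)

print(round(t_stat, 2), df)
```

The same helper applies unchanged to the other one-sample tests in this section by swapping in their sample values.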

Answer to Question D2 (MT2008–Q9)

a) H0: μ = 20; Ha: μ < 20

b) t = (x̄ – μ0)/(s/√n) = (19 – 20)/(4/√100) = -2.5

c) P-value = Pr(t99 < -2.5): From Table T, this is between 0.005 and 0.01.

d) There is enough evidence to support the manager’s claim that the average tenure is less

than 20 years.

e) From Table T, the critical value of t100 corresponding to a one-tail probability of 0.05 is

1.660. We are working with the left-hand tail, so t = -1.660 = (19 – 20)/(4/√n).

Hence √n = (-1.660)(4)/(-1) = 6.64; n = 44.1. Round up: n = 45.

Details and Comments:

This is a case of the one-sample t-test for a mean. In this scenario the alternative

hypothesis is one-tailed (i.e. one-sided). Since the P-value is much less than the threshold

of 0.05 (i.e. 5%), the null hypothesis is rejected. Remember that the smaller the P-value

the stronger the evidence that something “real” is happening.

Note that in part e) it is not correct to leave the answer as 44.1; you can’t have fractional

sample sizes!
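The algebra in part e) can be sketched in a few lines. The critical value 1.660 is copied from the answer (it is the Table T entry the answer chose, not something computed here), and the variable names are illustrative.

```python
from math import ceil

t_crit = 1.660        # one-tail 0.05 critical value quoted in the answer
s = 4                 # sample standard deviation (years)
shortfall = 20 - 19   # hypothesized mean minus sample mean

# Solve t_crit = shortfall / (s / sqrt(n)) for sqrt(n), then square.
sqrt_n = t_crit * s / shortfall
n_exact = sqrt_n ** 2

# Sample sizes are whole numbers, so always round up.
n_min = ceil(n_exact)

print(round(sqrt_n, 2), round(n_exact, 1), n_min)
```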

Answer to Question D3 (MT2008–Q10)

a) x̄ ± t*(s/√n) = 11.8 ± 2.62 (3.2/√114) = 11.8 ± 0.79 or (11.01, 12.59)

Since the sample size is 114, there are 113 df, so accept multipliers of 2.617 or 2.626,

corresponding to 120 and 100 df, respectively, or anything in between. I rounded to two

decimal places.

b) (1) YES (2) CAN’T TELL (3) NO (4) YES

c) Confidence intervals are only trustworthy if they are based on random samples; this is

NOT a random sample.

Details and Comments:

a) This is a straightforward use of the confidence interval for a single mean. Don’t obsess

about the choice of degrees of freedom; it only makes a difference in the second decimal

place, but the mean is only reported to one decimal place. In fact, since we are working

with years of employment, one decimal place is sufficient.

b) (1) The formula has the sample mean plus or minus a margin of error, so by definition

the sample mean must be the midpoint of the confidence interval.

(2) That’s the whole point of constructing a confidence interval. You don’t know where

the population mean is, but you are 95% sure that it is included in the interval.

(3) Narrower intervals mean less confidence.

(4) Increasing the sample size decreases the variability and hence the margin of error.
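The interval in part a), and the sample-size effect in part b)(4), can be illustrated with a small helper. The function name mean_ci is hypothetical, the multiplier 2.62 is the Table T value used in the answer, and the n = 30 comparison is an invented example that reuses the same multiplier purely to isolate the effect of n.

```python
from math import sqrt

def mean_ci(xbar, s, n, multiplier):
    """Confidence interval for a mean: xbar +/- multiplier * s / sqrt(n)."""
    margin = multiplier * s / sqrt(n)
    return xbar - margin, xbar + margin

# Part a): 114 responses, mean 11.8 years, SD 3.2 years, 99% multiplier 2.62.
low, high = mean_ci(11.8, 3.2, 114, 2.62)

# Part b)(4): a smaller sample gives a wider interval, all else equal.
low_small, high_small = mean_ci(11.8, 3.2, 30, 2.62)

print(round(low, 2), round(high, 2))
print(round(high - low, 2), round(high_small - low_small, 2))
```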

Answer to Question D4 (MT2007–Q8)

a) 95% CI: 12,000 ± 2.000(1800/√64) = 12,000 ± 450 = (11,550 ; 12,450)

Note: Since Table T does not have probabilities corresponding to 63 degrees of freedom,

use the slightly more conservative choice of 60 df; hence the multiplier is 2.00. It would

also be acceptable to use a multiplier of 1.992, corresponding to 75 df in the next row in

the table. The CI doesn’t change much: 12,000 ± 448 = (11,552 ; 12,448).

b) False. The CI changes from sample to sample, but every CI is centred around its own

sample average.

d) H0: μ = 11,500; Ha: μ > 11,500

Test statistic: t = (x̄ – μ0)/(s/√n) = (12,000 – 11,500)/(1800/√64) = 2.22

P-value = Pr (t63 > 2.22): from Table T, this probability is between 0.01 and 0.025.

Conclusion: Reject the null hypothesis. There is evidence that mean debt is greater than

$11,500. The college president’s claim is rejected.

e) True, True, True, False, False

Details and Comments:

e) (i) Examine the formula for the t-test statistic; with n in the “denominator of the

denominator,” increasing it will increase the value of the test statistic.

(ii) The level of significance (i.e. alpha) is chosen before the P-value is calculated and

does not enter into the calculation of P-value.

(iii) This is precisely the interpretation of P-value.

(iv) and (v) The P-value assumes the null hypothesis is true so it can’t be a statement

about the chance that the null hypothesis is true. It is a statement about the data and the

consistency of the data with the null hypothesis.

Answer to Question D5 (MT2006–Q7)

a) A. H0: μ = 1250 vs. Ha: μ < 1250

b) B. -2.1

c) C. 0.025

d) There is sufficient evidence to conclude that the apartments are smaller than

advertised. (OR: The difference between the sample mean and advertised mean cannot be

explained by chance alone; the difference is real.)

Details and Comments:

This is a one-sample t-test of a mean.

a) D is completely wrong since hypotheses are about parameters, not estimates.

The remaining decision is whether the alternative is one-tailed or two-tailed, and if one-

tailed, which way.

b) t = (x̄ – μ0)/(s/√n) = (1208 – 1250)/(120/√36) = -2.1

c) Use Table T to find the area (i.e. probability) to the left of -2.1, with 35 degrees of

freedom. Remember to look up the one-tail probability.

d) Note that it is not sufficient simply to say, “Reject H0.”

SECTION E: MISCELLANEOUS

Question E1 (MT2009–Q5) “Drinking and de-riving a sample”

A study of the number of years that employees work for food-and-drink businesses in the

Lower Mainland was based on a sample from the telephone directory’s Yellow Pages

listings of food-and-drink businesses in the Lower Mainland. The sample was drawn as

follows. The study investigator first drew a simple random sample of four municipalities

in the Lower Mainland. Then within each selected municipality, he randomly sampled 50

businesses. For various reasons, the study got no response from 40% of the 200

businesses chosen. Interviews were completed with 120 businesses that responded. Each

of the 120 businesses was asked for the typical number of years that an employee stayed

with the business.

a) The population of interest to the investigator is:

__ A. all food-and-drink businesses in the Lower Mainland that are listed under

the telephone directory’s Yellow Pages

__ B. all food-and-drink businesses in the Lower Mainland

__ C. the 200 businesses that were chosen by the investigator

__ D. the 120 businesses that responded

b) What is the relevant statistic here?

__ A. The mean years of employment of all food-and-drink businesses in the

Lower Mainland listed under the telephone directory’s Yellow Pages

__ B. The mean years of employment of all food-and-drink businesses in the

Lower Mainland

__ C. The mean years of employment of the 200 businesses that were chosen by

the investigator

__ D. The mean years of employment of the 120 businesses that responded

c) The sampling scheme that the investigator used in choosing the 200 businesses is:

__ A. Simple random sampling

__ B. Stratified random sampling

__ C. Multistage sampling (also known as Cluster sampling)

__ D. Systematic sampling

d) The main source of bias in this study is due to the fact that:

__ A. only four municipalities were sampled

__ B. only 50 businesses in each municipality were sampled

__ C. only 120 of the 200 businesses sampled actually responded

__ D. not all food-and-drink businesses are listed in the Yellow Pages

e) This study is an example of

__ A. an experiment

__ B. a double-blind study

__ C. a census

__ D. a survey

Question E2 (MT2009–Q8) “So many choices, so little time”

For each of the following multiple choice questions, choose the single correct answer.

Place an X in the space beside the letter of your choice or circle the letter.

a) A study of elementary school children, ages 6 to 11, finds a high positive correlation

between shoe size X and score Y on a test of reading comprehension. The observed

correlation is most likely due to

__ A. the effect of a lurking variable, such as age

__ B. a mistake since the correlation must be negative

__ C. cause and effect (larger shoe size causes higher reading comprehension)

__ D. “reverse” cause and effect (higher reading comprehension causes large

shoe size)

b) Which of the following is true of the least-squares regression line?

__ A. the slope is the change in the response variable that would be predicted by

a unit change in the explanatory variable

__ B. It always passes through the point (x̄, ȳ), the means of the explanatory

and response variables, respectively

__ C. It will only pass through all the data points if r = ± 1

__ D. All of the above

c) The slope of a regression line and the correlation are similar in the sense that:

__ A. they both have the same sign

__ B. they both do not depend on the units of measurement of the data

__ C. they both fall between -1 and 1 inclusive

__ D. neither of them can be affected by outliers

__ E. both can be used for prediction

d) When the level of confidence and variance remain the same, a confidence interval for a

population mean based on a sample of n=100 will be _____________ as the confidence

interval for a mean based on a sample of n =400.

__ A. twice as wide

__ B. half as wide

__ C. the same size

__ D. four times as wide

__ E. one fourth as wide

e) A random sample of 1500 observations gave a 95% confidence interval for the mean of

810 ± 62. What would the margin of error be, to the nearest whole number, if we wanted

just 90% confidence?

__ A. ± 65

__ B. ± 59

__ C. ± 74

__ D. ± 52

__ E. cannot be determined

f) In testing hypotheses, which of the following would be strong evidence against the null

hypothesis?

__ A. Using a small level of significance

__ B. Using a large level of significance

__ C. Obtaining data with a small P-value

__ D. Obtaining data with a large P-value

g) In a statistical test of hypotheses, we say the data are statistically significant at level α if

__A. α = 0.05

__B. α is small

__C. the P-value is less than α

__D. the P-value is larger than α

h) An engineer designs an improved light bulb. The previous design had an average

lifetime of 1200 hours. The new bulb has a lifetime of 1201 hours, using a sample of

2000 bulbs. Although the difference is quite small, it is statistically significant. The

explanation for the statistically significant difference is

__ A. that new designs typically have more variability than standard designs

__ B. that the sample size is very large

__ C. that the mean of 1200 is large

__ D. all of the above

i) In a test of statistical hypotheses, the P-value is:

__ A. the probability that the null hypothesis is true

__ B. the probability that the alternative hypothesis is true

__ C. the largest level of significance at which the null hypothesis can be rejected

__ D. the smallest level of significance at which the null hypothesis can be

rejected

Question E3 (MT2006–Q4) “Shooting and sleighing”

Olympics again: Shooting has been an Olympic sport since 1896. The “running target”

event has shooters firing at a moving target as it moves across a two-metre opening, from

a distance of 10 meters.

a) In Olympic competition, each shooter has 60 shots. The average shooter has a success

rate of p = 0.70. The sampling distribution of the sample proportion is approximately

normal with:

Mean = ______ SD = ______

b) The probability that a shooter hits the target 48 or more times is closest to:

__A. 0.95

__B. 0.26

__C. 0.74

__D. 0.05

Bobsleigh racing was developed in the 19th century by the Swiss in search of the ultimate

thrill. Race times are normally distributed with mean 53 seconds and standard deviation 3

seconds. In bobsleigh events, racers complete four runs.

c) A sample of four runs at a particular event gave a mean of 51 seconds, with the same

standard deviation of 3 seconds (as expected). Compute a 90% confidence interval for the

true mean run time. (Report 2 decimal places)

d) The observed margin of error for another sample of four runs was 8.76. What level of

confidence was chosen to compute that confidence interval?

__A. 90%

__B. 95%

__C. 99%

__D. None of the above

Question E4 (MT2006–Q8)

a) The Environmental Protection Agency records data on the fuel economy of many

different makes of cars. Some of the variables collected are listed below. Identify each

variable as categorical or measurement (i.e. quantitative). (Circle your choice)

Manufacturer (GM, Ford, Toyota, etc.) Categorical Measurement

Gas mileage (miles per gallon) Categorical Measurement

Weight (in pounds) Categorical Measurement

Size (small, medium, full-size, truck, etc.) Categorical Measurement

b) A study of the caloric content of hot dogs was undertaken. As part of the study, the

number of calories in 20 brands of beef hot dogs were recorded and the five-number

summary computed as follows: Min = 110, Max = 190, Median = 152.5, Quartiles = 140,

180. The researchers did not provide the standard deviation. However, previous work has

shown that calorie count is approximately normally distributed. Which of the following is

the most reasonable estimate of the standard deviation?

__A. 10

__B. 20

__C. 40

__D. 80

c) A television station is interested in predicting whether or not voters in its listening area

are watching their coverage of the Winter Olympics. It asks its viewers to phone in and

report whether or not they have watched at least one hour of Olympic coverage in the

first week of the Games. Of the 1242 viewers who phoned in, 512 (41.22%) said “Yes.”

The number 41.22% is a:

__A. statistic

__B. parameter

__C. sample

__D. population

d) Refer to part c), immediately above. Choose the best statement from the following.

__A. The results are valid because the sample size is very large

__B. The results are valid because people who are undecided do not phone in

__C. The results are not valid because the response is voluntary

__D. The results are not valid because the question is poorly worded

e) Which of the following statements is a correctly worded statement about correlation?

(I) There is a high correlation between the gender of Canadian workers and their income.

(II) There is a high correlation (r = 1.09) between students’ ratings of faculty teaching

and ratings made by other faculty members.

(III) The correlation between age and income was found to be r = 0.53 years.

__A. I only

__B. II only

__C. III only

__D. I and III only

__E. None of them

Two Bonus Questions for Algebra Fans

Bonus Question 1: Suppose you have only two data values x₁ and x₂. Which of the

following formulas gives the variance of x₁ and x₂?

__A.

__B.

__C.

__D. None of these

__E. Variance cannot be computed with only two data values

Bonus Question 2: Suppose a random variable X takes only two possible values: μ–σ and μ+σ, each with

probability 0.5. What are the mean and standard deviation of X?

__A. Mean = μ, SD = σ

__B. Mean = 0, SD = 1

__C. Mean = μ, SD = 2σ

__D. Mean = 0, SD = σ

__E. You can’t compute mean and SD without actual data.

SECTION E: ANSWERS AND EXPLANATIONS

Answer to Question E1 (MT2009–Q5)

a) B. All food-and-drink businesses in the Lower Mainland

b) D. The mean years of employment of the 120 businesses that responded

c) C. Multistage cluster sampling

d) C. Only 120 of the 200 businesses sampled actually responded

e) D. A survey

Details and Comments:

a) Although the population is all businesses, the sampling frame is only those listed in the

Yellow Pages. Sometimes it is not possible to access the entire population of interest.

b) A statistic is computed from the actual sample, hence the 120 responding businesses.

d) A and B are sources of variability, not bias. D is also a source of bias, but not as large

as the non-response bias.

Answer to Question E2 (MT2009–Q8)

a) A b) D c) A d) A e) D f) C g) C h) B i) D

Details and Comments:

c) Since b1 = r(sy/sx), the signs on b1 and r will be the same (since SDs are positive).

d) The square root is important here; changing the sample size by a factor of 4 will

change the margin of error by a factor of 2.

e) Work backward from the formulas to solve for SE and then work forward to get the

new margin of error. SE = 62/1.96; so 1.645(SE) = (1.645/1.96)(62) = 52

f) and g) and i) Look at the definition and interpretation of P-value.

h) Statistical significance and practical significance are different. It is possible to detect a

very small difference using a very large sample size, but the difference may not be

meaningful in practice.
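The "work backward, then forward" step in part e) and the square-root effect in part d) can be sketched as follows. The sketch uses normal (z) multipliers, as the answer does, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
z95 = std_normal.inv_cdf(0.975)   # about 1.96
z90 = std_normal.inv_cdf(0.95)    # about 1.645

# Part e): recover the standard error from the 95% margin, then rescale.
se = 62 / z95
margin_90 = z90 * se

# Part d): with the same SD, the width ratio for n = 100 versus n = 400.
width_ratio = sqrt(400) / sqrt(100)

print(round(margin_90), width_ratio)
```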

Answer to Question E3 (MT2006–Q4)

a) Mean = 0.7; SD = 0.059 (= √[(0.7)(0.3)/60])

b) D. 0.05

c) Use a multiplier from the t-distribution with 3 df.:

51 ± 2.353(3/√4) = 51 ± 3.53 or (47.47 ; 54.53)

d) C. 99%

Details and Comments:

b) Convert the sample count to a sample proportion: p̂ = 48/60 = 0.80

Then find Pr(p̂ > 0.80) = Pr(Z > [0.80–0.70]/0.059) = Pr(Z > 1.69) = 0.0455

d) Margin of error = 8.76 = t×SE; hence t = 8.76/SE = 8.76/1.5 = 5.84.

From Table T, Pr (t3 > 5.84) corresponds to a two-tail probability of 0.01 and a

confidence level of 99%.
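The calculations for parts a), b) and d) can be checked with a short sketch. It uses the normal approximation for the sample proportion, as the answer does; the variable names are illustrative, and the final step of matching the multiplier to a confidence level still relies on Table T.

```python
from math import sqrt
from statistics import NormalDist

# Parts a) and b): sampling distribution of the sample proportion.
p, n = 0.70, 60
sd_phat = sqrt(p * (1 - p) / n)
z = (48 / n - p) / sd_phat
prob_48_or_more = 1 - NormalDist().cdf(z)

# Part d): back out the t multiplier from the margin of error.
se_runs = 3 / sqrt(4)          # standard error for four runs
t_multiplier = 8.76 / se_runs

print(round(sd_phat, 3), round(z, 2), round(prob_48_or_more, 3))
print(round(t_multiplier, 2))
```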

Answer to Question E4 (MT2006–Q8)

a) Categorical: Manufacturer, Size

Quantitative: Gas mileage, Weight

b) B. 20

c) A. statistic

d) C. The results are not valid because the response is voluntary

e) E. None of them

Details and Comments:

b) Use the rule of thumb that, for small samples (n ≈ 20), SD is approximately the range

divided by 4.

c) A statistic is a quantity computed from the sample data.

e) Statement (I) uses a categorical variable; Statement (II) has r > 1; Statement (III)

reports units for r. None of these are possible for r.

Answer to Bonus Question 1:

A.

Use the definition of variance and do the algebra yourself!

Answer to Bonus Question 2: A. Mean = μ, SD = σ

Calculation/algebra involved:

Mean of X: [(μ–σ) + (μ+σ)]/2 = 2μ/2 = μ

Variance of X: {[(μ–σ) – μ]² + [(μ+σ) – μ]²}/2 = {σ² + σ²}/2 = σ²

SD of X = σ
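For readers who prefer numbers to algebra, the same result can be confirmed for one particular case. The values μ = 10 and σ = 3 below are arbitrary illustrative choices, not part of the question.

```python
from math import sqrt

mu, sigma = 10.0, 3.0   # arbitrary example values

# X takes the values mu - sigma and mu + sigma, each with probability 0.5.
values = [mu - sigma, mu + sigma]
probs = [0.5, 0.5]

# Mean and variance of a discrete random variable.
mean_x = sum(v * p for v, p in zip(values, probs))
var_x = sum(p * (v - mean_x) ** 2 for v, p in zip(values, probs))
sd_x = sqrt(var_x)

print(mean_x, var_x, sd_x)
```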

END OF PRACTICE QUESTIONS FOR THE MIDTERM EXAM